Attribute
|
Value
|
Name
|
Filter Rows
|
Description
|
Produce a new dataset by applying a condition to an input dataset
|
Function
|
Reduce a dataset's vertical dimension by removing unrequired rows
|
Aim
|
Allow a data engineer (user) to select and remove the subset of rows that are not required for the analysis for which the dataset is being prepared
|
Context
|
This operation is used when a subset of the rows in the original (unrefined) dataset is deemed irrelevant to the analysis for which the dataset will serve as input.
|
Rationale
|
Removing the irrelevant rows makes the dataset smaller, which reduces the processing time of the subsequent intermediate operations in the wrangling process and of the analysis for which the dataset is being prepared.
|
Mechanism
|
Reduce the dataset's dimension by removing a subset of its rows. This can be done with the facilities provided by GUI-based tools or with programming language functions.
|
Formalism
|
σ(R, pred) = {(a1,...,an) | (a1,...,an) ∈ R ∧ pred((a1,...,an))}, where R is a relation with n columns and pred is a function returning a Boolean (Raman, V. and Hellerstein, J., 2001).
|
Relational Algebra (RA)
|
Similar to the RA operation Select (σ)
|
Type
|
Atomic
|
Class
|
Unary
|
Transformation_category
|
1:1
|
Inputs
|
Inputs | Number of input datasets |
Input dataset, filter condition | 1 |
|
Outputs
|
Outputs | Number of output datasets |
filtered dataset | 1 |
|
Used in stage(s)
|
Cleaning, Structuring
|
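
The formalism σ(R, pred) above can be illustrated with a minimal Python sketch. It assumes the relation R is held as a list of tuples and that pred is an ordinary function returning a Boolean. The name `filter_rows`, the sample dataset and the year-based condition are hypothetical and are not taken from any particular wrangling tool.

```python
from typing import Callable, Iterable, List, Tuple

# One row (a1, ..., an) of the relation R, represented as a plain tuple.
Row = Tuple


def filter_rows(relation: Iterable[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Return the rows of `relation` for which `pred` is true, i.e. σ(R, pred)."""
    return [row for row in relation if pred(row)]


# Hypothetical input dataset with columns (city, year, sales).
dataset = [
    ("Leeds", 2019, 120),
    ("York", 2020, 95),
    ("Leeds", 2020, 130),
]

# Hypothetical condition: keep only the rows recorded in 2020.
filtered = filter_rows(dataset, lambda row: row[1] == 2020)
print(filtered)
```

The sketch takes one input dataset and one condition and returns one filtered dataset, matching the Inputs and Outputs rows above. The input list is left unchanged, so the operation yields a new dataset rather than modifying the original.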