Attribute
|
Value
|
Name
|
Split Table
|
Description
|
Split a single input dataset into multiple datasets
|
Function
|
Produce two subsets of a dataset based on a predicate and its negation
|
Aim
|
Allow a data engineer (user) to split a dataset into two different datasets using a specified condition
|
Context
|
This operation is used when a dataset must be split into two subsets of the rows present in the original (unrefined) dataset, where those subsets are required for the analysis for which the dataset is to serve as input.
|
Rationale
|
Splitting a dataset into multiple subsets of rows allows different subsequent operations or end-goal analyses to be applied to each subset. It also makes each dataset smaller, which reduces the processing time of subsequent intermediate wrangling operations and of the analysis the dataset is being prepared for.
|
Mechanism
|
Perform different wrangling operations on each of the subsets of a dataset created by dividing the original dataset based on a certain condition using parallel <b>Filter Rows</b>.
|
Formalism
|
split(R, pred) = { Ra, Rb | Ra = σ(R, pred) ∧ Rb = σ(R, ¬pred) }, where R, Ra and Rb are relations with n columns and pred is a function returning a Boolean.
|
Relational Algebra (RA)
|
Similar to RA operation
Select(σ)
|
Type
|
Composite
|
Class
|
Router
|
Transformation_category
|
1:M
|
Inputs
|
Inputs | Number of input datasets |
Input dataset, condition to split | 1 |
|
Outputs
|
Outputs | Number of output datasets |
datasets | M |
|
Used in stage(s)
|
Structuring
|
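The formalism above can be illustrated with a minimal Python sketch. It assumes a dataset is represented as a list of dictionaries (one per row) and that pred is an ordinary two-valued function; the names `split_table`, `orders`, `large`, and `small` are hypothetical and chosen only for this example.

```python
from typing import Any, Callable, Dict, List, Tuple

# Assumption: a row is a dict mapping column names to values.
Row = Dict[str, Any]


def split_table(
    rows: List[Row], pred: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Return (Ra, Rb): the rows satisfying pred and the rows satisfying its negation."""
    ra = [row for row in rows if pred(row)]      # corresponds to σ(R, pred)
    rb = [row for row in rows if not pred(row)]  # corresponds to σ(R, ¬pred)
    return ra, rb


# Hypothetical input dataset used only for illustration.
orders = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": 40},
    {"id": 3, "amount": 975},
]

# Split on a single condition; each subset keeps all n columns of the input.
large, small = split_table(orders, lambda row: row["amount"] >= 100)
```

Because pred returns a strict Boolean here, every input row lands in exactly one of the two outputs. A predicate that can evaluate to an unknown value (for example, a comparison against a missing value in SQL-style three-valued logic) would need explicit handling to preserve that property.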