DW Handbook

This page shows the name of the construct, its description, and its other attributes.

Attribute

Value

Name

Split Table

Description

Split a single input dataset into multiple datasets

Function

Produce two subsets of a dataset based on a predicate and its negation

Aim

Allow a data engineer (user) to split a dataset into two different datasets using a specified condition

Context

This operation is used when a dataset must be split into two subsets of the rows in the original (unrefined) dataset, each of which is required for the analysis the dataset will serve as input to.

Rationale

Splitting a dataset into multiple subsets of rows allows different subsequent operations or end-goal analyses to be applied to each subset. Each resulting dataset is also smaller, which reduces the processing time of the subsequent intermediary operations in the wrangling process and of the analysis the dataset is being prepared for.

Mechanism

Divide the original dataset into subsets based on a condition by applying Filter Rows operations in parallel, then perform different wrangling operations on each subset.

Formalism

split(R, pred) = { R_a, R_b | R_a = σ(R, pred) ∧ R_b = σ(R, ¬pred) }, where R, R_a and R_b are relations with n columns and pred is a function returning a Boolean.
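The formalism above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the handbook's implementation: a relation is modelled as a list of dictionaries, and the names `filter_rows`, `split_table`, and the sample `orders` data are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]  # assumption: one row is a column-name -> value mapping


def filter_rows(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Hypothetical stand-in for Filter Rows: σ(R, pred)."""
    return [row for row in rows if pred(row)]


def split_table(
    rows: List[Row], pred: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Return (R_a, R_b) = (σ(R, pred), σ(R, ¬pred))."""
    r_a = filter_rows(rows, pred)
    r_b = filter_rows(rows, lambda row: not pred(row))
    return r_a, r_b


# Made-up sample data, for illustration only.
orders = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": 40},
    {"id": 3, "amount": 975},
]

# Split on a hypothetical condition: amount of at least 100.
large, small = split_table(orders, lambda row: row["amount"] >= 100)
```

Because `pred` here always returns a strict Boolean, every input row lands in exactly one of the two outputs, and both outputs keep the same columns as the input. In systems with three-valued logic (for example SQL with NULLs), rows for which the predicate is unknown would appear in neither subset unless handled explicitly.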

Relational Algebra (RA)

Similar to the RA operation Select (σ)

Type

Composite

Class

Router

Transformation_category

1:M

Inputs

Inputs	Number of input datasets
Input dataset, condition to split	1

Outputs

Outputs	Number of output datasets
datasets	M

Used in stage(s)

Structuring2
