Use the Join transformation to combine records from two different inputs. The join component can be used to add information from one data source to another data source or to filter data that exists in both data sources or exists in only one of them. It is important to select join keys (see below) that are unique in at least one of the data inputs to avoid data duplication which may result undesired consequences (e.g. billions of records in output or cluster errors due to disk or memory problems).
Join inputs
Select 2 inputs and drag and drop the component menu button onto the join component:
Within the component you can define which input is the left input and which is the right input.
.
Join properties
Join type
-
Inner - returns only those records that have a matching value in the join keys from both the left and right inputs. Note that null values are not considered matching values.
-
Left - returns the same records as an inner join, as well as records from the left input that have no matches in the join key in the right input. Such records will have null values in the right input fields.
-
Right - returns the same records as an inner join, as well as records from the right input that have no matches in the join key in the left input. Such records will have null values in the left input fields.
-
Full - returns all records that would be returned by a inner (matches on both inputs), records from the left input that have no matches in the right input and records from the right input that have no matches in the left input.
Join optimization
-
Default - uses Hash join - both inputs are read, tagged by input and are sorted and put into buckets according to the join keys. Then for each key, the records are cross joined by input tags.
-
Replicated - use when one input is small enough to fit into main memory, thereby improving efficiency. The large relation should be the left input and the small one should be the right input. If the small relation doesn't fit into main memory, the process fails and an error is generated. Replicated join only works with inner or left joins.
-
Skewed - use if the underlying key values are very skewed, so that processing isn't evenly distributed. This will affect performance and may cause the process that deals with most of the data to go out of memory. When Skewed join is used, a histogram is computed on the join key using the left input and this data is used to allocate more reducers for a given key.
Join keys
Inputs are joined on equal key values. Select the key fields that should have equal values in both inputs. If multiple key fields are selected, the join component's output the records that meet all the join conditions (logical "AND").
If the join's inputs contain fields with the same names, they will appear in the join output prefixed by the input's name (e.g. adwords1::customer_id
). Use a select component to fix the field aliases or remove fields that contain the same values.