Sort Merge Join (SMJ)

The two implementations described above work well for tables up to a certain size, but when both tables are very large, either of them puts a lot of pressure on memory. This is because a hash join loads one side of the data completely into memory as a hash table and then uses the hash codes of the join keys to find and connect records with equal key values.
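To make that memory pressure concrete, here is a minimal, standalone sketch of the build-and-probe idea behind a hash join. It is conceptual only, not Spark's actual implementation, and the types and sample data are made up for illustration:

object HashJoinSketch {
  // Build side: loaded entirely into memory as a hash table keyed by the join key.
  // Probe side: streamed and looked up against that table.
  def hashJoin[K, A, B](build: Seq[(K, A)], probe: Seq[(K, B)]): Seq[(K, A, B)] = {
    val table: Map[K, Seq[A]] =
      build.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    for {
      (k, b) <- probe
      a      <- table.getOrElse(k, Seq.empty)
    } yield (k, a, b)
  }

  def main(args: Array[String]): Unit = {
    val users  = Seq((1, "alice"), (2, "bob"))                       // build side (smaller table)
    val orders = Seq((1, "order-A"), (2, "order-B"), (1, "order-C")) // probe side
    hashJoin(users, orders).foreach(println)
    // (1,alice,order-A), (2,bob,order-B), (1,alice,order-C)
  }
}

The key point is that the whole build side has to fit in memory at once, which breaks down when both tables are very large.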

When both tables are very large, Spark SQL uses a different algorithm to join them: Sort Merge Join. This approach does not load one side entirely into memory before starting a hash join; instead, it sorts the data on the join keys before joining.

You can see that the two tables are first re-shuffled by their join keys, which ensures that records with equal join-key values end up in corresponding partitions. The data within each partition is then sorted, and finally the records in corresponding partitions are merged together, as shown below:

[Figure: the shuffle, sort, and merge phases of Sort Merge Join]
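The merge step itself can be illustrated with a small, self-contained sketch that joins one pair of corresponding partitions that are already sorted by the join key. Again, this is a conceptual sketch rather than Spark's implementation, and the data and types are invented for the example:

object SortMergeSketch {
  // Equi-join of two partitions that are already sorted by the integer join key.
  def mergeJoin[A, B](left: Vector[(Int, A)], right: Vector[(Int, B)]): Vector[(Int, A, B)] = {
    val out = Vector.newBuilder[(Int, A, B)]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val lk = left(i)._1
      val rk = right(j)._1
      if (lk < rk) i += 1          // advance the side with the smaller key
      else if (lk > rk) j += 1
      else {
        // Keys match: emit the cross product of the two runs with this key.
        val jStart = j
        while (i < left.length && left(i)._1 == lk) {
          var jj = jStart
          while (jj < right.length && right(jj)._1 == lk) {
            out += ((lk, left(i)._2, right(jj)._2))
            jj += 1
          }
          i += 1
        }
        j = jStart
        while (j < right.length && right(j)._1 == lk) j += 1
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val left  = Vector((1, "a1"), (1, "a2"), (3, "a3"))
    val right = Vector((1, "b1"), (2, "b2"), (3, "b3"))
    mergeJoin(left, right).foreach(println)
    // (1,a1,b1), (1,a2,b1), (3,a3,b3)
  }
}

Because both inputs are consumed in key order, neither side ever has to be held in memory in its entirety.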



Sort merge: if the matching join keys are sortable, then this join is possible ...

The property spark.sql.join.preferSortMergeJoin controls whether this algorithm is preferred. As you can see, no matter how large the partitions are, Sort Merge Join never needs to load one side of the data entirely into memory; it consumes the sorted streams as it goes, which greatly improves the stability of SQL joins over large amounts of data.
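As a quick way to see this join in the physical plan, the following sketch (with made-up table names and sizes) disables automatic broadcast so that neither side qualifies for a broadcast hash join, and then prints the plan, which should contain a SortMergeJoin node:

import org.apache.spark.sql.SparkSession

object SortMergeJoinExplain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("smj-example")
      .master("local[*]")
      // Rule out broadcast hash join so the planner falls back to Sort Merge Join
      // (spark.sql.join.preferSortMergeJoin defaults to true).
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    // Two tables that are only "large" for the sake of the example.
    val orders = spark.range(0, 1000000)
      .selectExpr("id % 100000 AS user_id", "concat('order-', cast(id AS string)) AS order_name")
    val users = spark.range(0, 100000)
      .selectExpr("id AS user_id", "concat('user-', cast(id AS string)) AS name")

    // The physical plan should show SortMergeJoin, with Exchange (the shuffle)
    // and Sort operators feeding each side of the join.
    orders.join(users, "user_id").explain()

    spark.stop()
  }
}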
