SPARK Transformation

A Spark transformation is a function that produces a new RDD from one or more existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Each time a transformation is applied, a new RDD is created; the input RDDs cannot be changed, because RDDs are immutable.
  • All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file).
  • The transformations are only computed when an action requires a result to be returned to the driver program. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
  • By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
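Spark's lazy evaluation can be mimicked locally with Python generators, which likewise build up a pipeline of pending work that only runs when a result is demanded. This is a plain-Python analogy, not the Spark API:

```python
# Plain-Python analogy for lazy transformations (not real Spark):
# generator expressions, like RDD transformations, do no work until
# a terminal operation (an "action") consumes them.

data = range(1, 6)                       # base "dataset": 1..5

mapped = (x * x for x in data)           # like rdd.map(...) - nothing computed yet
filtered = (x for x in mapped if x > 5)  # like rdd.filter(...) - still lazy

# Only the "action" below forces evaluation, analogous to reduce()/collect():
result = sum(filtered)                   # 9 + 16 + 25 = 50
print(result)
```

Just as in Spark, only the final reduced value is materialized; the intermediate "mapped" sequence is never stored as a whole.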
There are two types of transformations:
  • Narrow Transformation
  • Wide Transformation
Narrow Transformation: In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Narrow transformations transform data without any shuffle: they work on a per-partition basis, so each element of the output RDD can be computed without involving elements from other partitions. As a consequence, the new RDD has the same number of partitions as its parent RDD, which is why narrow transformations are easy to recompute in the case of failure. map() and filter() are typical narrow transformations.

[Figure: RDD narrow transformation]
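Because a narrow transformation reads only one parent partition at a time, it can be sketched locally by applying a function to each partition independently. This is a plain-Python illustration of the idea, not Spark code:

```python
# Illustration of a narrow transformation: each output partition is
# computed from exactly one parent partition, so no data moves between
# partitions and the partition count is preserved.

parent_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def narrow_transform(partitions, fn):
    # Partition i of the child depends only on partition i of the
    # parent (no shuffle), like map() or filter() on an RDD.
    return [[fn(x) for x in part] for part in partitions]

child = narrow_transform(parent_partitions, lambda x: x * 10)
print(child)   # [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
```

Note that the child has exactly as many partitions as the parent, which is what makes lost partitions cheap to recompute.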

Wide Transformation: In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so wide transformations involve a shuffle of data between partitions. groupByKey(), reduceByKey(), join(), distinct(), and intersection() are some examples of wide transformations. For these transformations, the result is computed using data from multiple partitions and thus requires a shuffle, similar to the shuffle-and-sort phase of MapReduce. Let's understand the concept with the help of the following example:

[Figure: RDD wide transformation]
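The shuffle behind a wide transformation like groupByKey() can be sketched locally: every key's output bucket may receive records from any parent partition. Again, this is a plain-Python analogy of the semantics, not the distributed implementation:

```python
from collections import defaultdict

# Illustration of a wide transformation: grouping all values for a key
# requires records from every parent partition (a shuffle).

parent_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

def group_by_key(partitions):
    # Each key's bucket can receive records from any partition, which
    # is why Spark must move data across the cluster at this point.
    buckets = defaultdict(list)
    for part in partitions:
        for key, value in part:
            buckets[key].append(value)
    return dict(buckets)

grouped = group_by_key(parent_partitions)
print(grouped)   # {'a': [1, 3, 6], 'b': [2, 5], 'c': [4]}
```

The values for key "a" came from three different parent partitions; no single partition could have produced that group on its own, which is exactly what distinguishes a wide transformation from a narrow one.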

Spark RDD transformations:
  • cache(): Caches the RDD.
  • filter(): Returns a new RDD after applying the filter function to the source dataset.
  • flatMap(): Returns a flattened map: if the dataset contains arrays, each element of an array becomes a row. In other words, it returns 0 or more output items for each element in the dataset.
  • map(): Applies the transformation function to the dataset and returns the same number of elements in the distributed dataset.
  • mapPartitions(): Similar to map(), but executes the transformation function once per partition, which can give better performance than map().
  • mapPartitionsWithIndex(): Similar to mapPartitions(), but also passes the function an integer value representing the index of the partition.
  • randomSplit(): Splits the RDD by the weights specified in the argument, e.g. rdd.randomSplit(Array(0.7, 0.3)) in Scala.
  • union(): Combines elements from the source dataset and the argument and returns the combined dataset. This is similar to the union operation on mathematical sets, except that duplicates are kept.
  • sample(): Returns a sampled subset of the dataset.
  • intersection(): Returns a dataset containing the elements present in both the source dataset and the argument.
  • distinct(): Returns the dataset with all duplicate elements eliminated.
  • repartition(): Returns a dataset with the number of partitions specified in the argument. This operation reshuffles the RDD randomly and can either decrease or increase the number of partitions.
  • coalesce(): Similar to repartition(), but performs better when decreasing the number of partitions, because it moves data onto fewer of the existing nodes instead of reshuffling across all of them.
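The element-level behavior of several of the methods above can be checked with plain-Python equivalents. This is only a local analogy of what each transformation does to its elements; real RDD methods run distributed and lazily:

```python
# Plain-Python equivalents of some RDD transformation semantics.

data = [1, 2, 2, 3]

# map(): exactly one output element per input element
mapped = [x + 1 for x in data]              # [2, 3, 3, 4]

# flatMap(): 0..n output elements per input element, flattened into one list
flat = [y for x in data for y in range(x)]  # [0, 0, 1, 0, 1, 0, 1, 2]

# filter(): keeps only elements matching the predicate
filtered = [x for x in data if x > 1]       # [2, 2, 3]

# distinct(): eliminates duplicates
distinct = sorted(set(data))                # [1, 2, 3]

# union(): concatenates datasets, keeping duplicates
union = data + [3, 4]                       # [1, 2, 2, 3, 3, 4]
```

The map()/flatMap() contrast is the one most worth remembering: map() preserves the element count, while flatMap() may shrink or grow it.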
