RDD Basics

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.

An RDD can be created in three ways:
  • Parallelizing an existing collection in the driver program.
  • Referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system).
  • Transforming an already existing RDD.
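
A quick sketch of all three (the HDFS path here is only illustrative):

val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))   // 1. parallelize a collection
val fromFile = sc.textFile("hdfs:///data/wc-data.txt")     // 2. reference an external dataset
val fromExisting = fromCollection.map(_ * 2)               // 3. transform an existing RDD
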
Basic RDD commands
collect(): Returns all the elements of the dataset as an array to the driver program; you can then loop over that array to print the elements of the RDD.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
val arr = rdd.collect()          // brings the whole dataset to the driver
for (x <- arr) println(x)        // loop over the array and print each element

count(): Returns the number of elements in the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.count                        // Long = 5

take(n): Returns the first n elements of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.take(3)                      // Array(1, 2, 3)

first: Returns the first element of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.first                        // Int = 1

foreach: An action; it does not return any value. It executes the given function on each element of the RDD, which makes it useful for side effects such as writing to a database or publishing to a web service.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.foreach { x => println(x) }  // runs on the executors; output order is not deterministic
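
The example above prints each element. For the database or web-service use case, the usual pattern is foreachPartition, so setup cost (such as opening a connection) is paid once per partition rather than once per element. A sketch with a hypothetical sink:

val nums = sc.parallelize(List(1, 2, 3, 4, 5))
nums.foreachPartition { part =>
  def sendToService(x: Int): Unit = println(s"POST $x")    // hypothetical stand-in for a real client
  part.foreach(sendToService)                              // one "connection" reused per partition
}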

To Find the Number of Partitions 
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.partitions.length            // number of partitions
rdd.getNumPartitions             // same information via a convenience method

As a best practice, create RDDs so that:
number of partitions = number of cores in the cluster
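
A small sketch of applying this rule (sc.defaultParallelism is the parallelism Spark derives from the cluster):

sc.defaultParallelism                                        // e.g. the total cores available
val sized = sc.parallelize(1 to 100, sc.defaultParallelism)  // one partition per core
sized.getNumPartitions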

Consider a file wc-data.txt of 1280 MB with a default block size of 128 MB. Spark will create 10 blocks and 10 default partitions (1 per block). For better performance, we can raise the number of partitions per block: requesting 20 partitions spreads 2 partitions over each of the 10 blocks. Performance improves, but make sure each worker node has at least 2 cores.
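
A sketch of that request (the HDFS path is assumed; note the second argument to textFile is a lower bound on the partition count):

val fileRdd = sc.textFile("hdfs:///data/wc-data.txt", 20)  // ask for at least 20 partitions
fileRdd.getNumPartitions                                   // 20 here (2 per block)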
 
You can also set the partition count explicitly when parallelizing a collection; glom gathers each partition's elements into an array so you can inspect the layout:

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)   // request 3 partitions
rdd1.glom.collect                                    // typically Array(Array(1), Array(2, 3), Array(4, 5))
