RDD Basics

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.

An RDD can be created in three ways:
  • Parallelizing an existing collection in the driver program.
  • Referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system).
  • Transforming an already existing RDD.
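
A quick sketch of all three (the HDFS path here is only illustrative):

val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))   // 1. parallelize a collection
val fromFile = sc.textFile("hdfs:///data/wc-data.txt")     // 2. reference an external dataset
val fromExisting = fromCollection.map(_ * 2)               // 3. transform an existing RDD
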
Basic RDD commands
collect(): Returns all the elements of the dataset as an array to the driver program; you can then loop over that array to print the elements of the RDD.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
val arr = rdd.collect()          // brings the whole dataset to the driver
for (x <- arr) println(x)        // loop over the array and print each element

count(): Returns the number of elements in the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.count                        // Long = 5

take(n): Returns the first n elements of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.take(3)                      // Array(1, 2, 3)

first: Returns the first element of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.first                        // Int = 1

foreach: An action; it does not return any value. It executes the given function on each element of the RDD, which makes it useful for side effects such as writing to a database or publishing to a web service.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.foreach { x => println(x) }  // runs on the executors; output order is not deterministic
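
The example above prints each element. For the database or web-service use case, the usual pattern is foreachPartition, so setup cost (such as opening a connection) is paid once per partition rather than once per element. A sketch with a hypothetical sink:

val nums = sc.parallelize(List(1, 2, 3, 4, 5))
nums.foreachPartition { part =>
  def sendToService(x: Int): Unit = println(s"POST $x")    // hypothetical stand-in for a real client
  part.foreach(sendToService)                              // one "connection" reused per partition
}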

To Find the Number of Partitions 
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.partitions.length            // number of partitions
rdd.getNumPartitions             // same information via a convenience method

As a best practice, create RDDs so that:
number of partitions = number of cores in the cluster
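
A small sketch of applying this rule (sc.defaultParallelism is the parallelism Spark derives from the cluster):

sc.defaultParallelism                                        // e.g. the total cores available
val sized = sc.parallelize(1 to 100, sc.defaultParallelism)  // one partition per core
sized.getNumPartitions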

Consider a file wc-data.txt of 1280 MB with a default block size of 128 MB. Spark will create 10 blocks and 10 default partitions (1 per block). For better performance, we can raise the number of partitions per block: requesting 20 partitions spreads 2 partitions over each of the 10 blocks. Performance improves, but make sure each worker node has at least 2 cores.
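
A sketch of that request (the HDFS path is assumed; note the second argument to textFile is a lower bound on the partition count):

val fileRdd = sc.textFile("hdfs:///data/wc-data.txt", 20)  // ask for at least 20 partitions
fileRdd.getNumPartitions                                   // 20 here (2 per block)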
 
You can also set the partition count explicitly when parallelizing a collection; glom gathers each partition's elements into an array so you can inspect the layout:

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)   // request 3 partitions
rdd1.glom.collect                                    // typically Array(Array(1), Array(2, 3), Array(4, 5))
