RDDs are created by starting with a file in the Hadoop file system (or
any other Hadoop-supported file system), or an existing Scala collection
in the driver program, and transforming it. Users may also ask Spark to
persist an RDD in memory, allowing it to be reused efficiently across
parallel operations.
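Persisting is a single call on the RDD. A minimal sketch in the Spark shell; the file path data.txt and the storage level are illustrative:
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("data.txt")         // "data.txt" is an illustrative path
lines.persist(StorageLevel.MEMORY_ONLY)     // keep computed partitions in memory; lines.cache is equivalent
lines.count                                 // the first action computes and caches the RDD
lines.count                                 // subsequent actions reuse the cached partitions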
An RDD can be created in three ways (sketched in the code after this list):
- Parallelizing an existing collection in the driver program.
- Referencing a dataset in an external storage system (e.g. HDFS, HBase, or a shared file system).
- Creating a new RDD from an already existing RDD.
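A minimal sketch of all three in the Spark shell; the file path data.txt is illustrative:
// 1. Parallelize an existing collection in the driver
val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))
// 2. Reference a dataset in external storage such as HDFS
val fromFile = sc.textFile("data.txt")
// 3. Create a new RDD by transforming an existing one
val fromExisting = fromCollection.map(_ * 2)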
RDD basic commands
collect(): Returns all the elements of the dataset as an array to the driver program; you can then loop over this array to print the elements of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.collect    // Array(1, 2, 3, 4, 5)
count(): Counts the number of elements in the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.count    // 5
take(n): Returns the first n elements of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.take(3)    // Array(1, 2, 3)
first: Returns the first element of the RDD.
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.first    // 1
foreach: An action that returns no value. It executes the supplied function on each element of the RDD, and the function runs on the executors rather than the driver, which makes it suited to side effects such as writing to a database or publishing to a web service (see the sketch after the example below).
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.foreach { x => println(x) }    // prints on the executors; output order is not guaranteed
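When the side effect is a database write, a common pattern is foreachPartition, which opens one connection per partition instead of one per element. A minimal sketch; openConnection and its methods are hypothetical placeholders for a real database client, not Spark APIs:
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.foreachPartition { iter =>
  // val conn = openConnection()   // hypothetical: open one connection per partition
  iter.foreach { x =>
    // conn.save(x)                // hypothetical: write each element over that connection
    println(x)                     // stand-in side effect so the sketch runs as-is
  }
  // conn.close()
}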
To find the number of partitions:
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.partitions.length    // depends on the default parallelism (typically the number of cores)
getNumPartitions returns the same information:
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd.getNumPartitions
In the best case, an RDD should be created so that:
number of cores in the cluster = number of partitions
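sc.defaultParallelism reflects the total cores available to the application, so one way to follow this rule of thumb is:
val rdd = sc.parallelize(1 to 100, sc.defaultParallelism)    // one partition per available core
rdd.getNumPartitions                                         // equals sc.defaultParallelism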
Consider a file wc-data.txt of 1280 MB with the default block size of 128 MB. That gives 10 blocks and 10 default partitions (1 per block). For better performance, we can increase the number of partitions per block: the code below requests 20 partitions over the 10 blocks (2 partitions per block). Performance improves, but make sure each node is running with at least 2 cores.
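A minimal sketch, assuming wc-data.txt exists; the second argument to textFile is the minimum number of partitions:
// 1280 MB / 128 MB block size = 10 blocks; request 20 partitions (2 per block)
val wcData = sc.textFile("wc-data.txt", 20)
wcData.getNumPartitions    // at least 20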
To see how elements are distributed across partitions, glom groups the elements of each partition into an array:
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)
rdd1.glom.collect    // e.g. Array(Array(1), Array(2, 3), Array(4, 5))