RDD can be created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5.
RDD From Array
val rdd = sc.parallelize(Array("sun","mon","tue","wed","thu","fri"),4)
rdd.collect
rdd = sc.parallelize(("sun","mon","tue","wed","thu","fri"),4)
rdd.collect()
RDD From Array
val rdd = sc.parallelize(Array("sun","mon","tue","wed","thu","fri"),4)
rdd.collect
rdd = sc.parallelize(("sun","mon","tue","wed","thu","fri"),4)
rdd.collect()
RDD From List
val rdd =sc.parallelize(List(1,2,3,4,5))
rdd.collect
rdd =sc.parallelize([1,2,3,4,5])
rdd.collect()
val rdd = sc.parallelize(List((2,"teradata"), (1, "vertica"),(2, "oracle"), (1, "hive"), (2, "hbase"), (1, "cassandra"), (2, "mongoBB")))
rdd.collect
rdd = sc.parallelize([(2,"teradata"), (1, "vertica"),(2, "oracle"), (1, "hive"), (2, "hbase"), (1, "cassandra"), (2, "mongoBB")])
print(rdd.collect())
print(rdd.collect())
RDD From Nested List
val Student = List(("IT",("Pradeep",50000.0)),("CS",("CP",50200.0)),("ECE",("Deepak",62200.0)),("Mech",("Narpat",65000.0)))
val StudentRDD = sc.makeRDD(Student)
StudentRDD.collect
val StudentRDD = sc.makeRDD(Student)
StudentRDD.collect
Student = List(("IT",("Pradeep",50000.0)),("CS",("CP",50200.0)),("ECE",("Deepak",62200.0)),("Mech",("Narpat",65000.0)))
StudentRDD = sc.makeRDD(Student)
StudentRDD.collect
RDD Using RangeStudentRDD = sc.makeRDD(Student)
StudentRDD.collect
val rdd = sc.parallelize(1 to 10, 3)
rdd.collect
rdd.collect
RDD From Sequence
rdd.collect
rdd = sc.parallelize(Seq(("Seq1", 1),("Seq2", 2),("Seq3", 3)))
rdd.collect
rdd.collect
Empty RDD
var rdd=sc.parallelize(Seq.empty[String])
rdd.collect
rdd.collect
How to access the elements of the RDD.
val datasets = sc.parallelize(List(("A","HYD",10),("B","BLR",30),("A","HYD",40),("B","BLR",50),("C","DEL",60)))
datasets.map(x => (x._1,x._2)).collect
Accessing the first and second elements of the RDD.
How to access the elements the particular elements of an RDD
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)val kvRdd= rdd.keyBy(_.length)
val filterRdd = kvRdd.filter(p => p._1 == 4)
filterRdd.collect
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd= rdd.keyBy(_.length)
val filterRdd = kvRdd.filter(p => p._1 == 4)
kvRdd.lookup(4)
val kvRdd= rdd.keyBy(_.length)
val filterRdd = kvRdd.filter(p => p._1 == 4)
kvRdd.lookup(4)
No comments:
Post a Comment