Problem: Find the SUM of the marks for each student.
Below is the data for the students, Data.csv, which is pipe-delimited. Save this file to an HDFS location.
Name|Marks
A|10,20,30,40
B|40,50,60,70
C|15,25,34,23
D|20,45,31,78
Read this data
val df=spark.read.option("header","true").option("delimiter","|").csv("/tmp/ABC/SPARK/Data/Data.csv")
Output
scala> df.show()
+----+-----------+
|Name|      Marks|
+----+-----------+
|   A|10,20,30,40|
|   B|40,50,60,70|
|   C|15,25,34,23|
|   D|20,45,31,78|
+----+-----------+
Check the schema and the data type of each column
scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Marks: string (nullable = true)
NOTE: explode() cannot be applied to a string column. If you try, the following error occurs:
scala> df.withColumn("Marks",explode($"Marks")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`Marks`)' due to data type mismatch: input to function explode should be array or map type, not StringType;;
'Project [Name#12, explode(Marks#13) AS Marks#23
Cast the Marks column to an array type
scala> val df_new=df.withColumn("Marks",split(col("Marks"),",").cast("array<long>"))
df_new: org.apache.spark.sql.DataFrame = [Name: string, Marks: array<bigint>]
scala> df_new.show()
+----+----------------+
|Name|           Marks|
+----+----------------+
|   A|[10, 20, 30, 40]|
|   B|[40, 50, 60, 70]|
|   C|[15, 25, 34, 23]|
|   D|[20, 45, 31, 78]|
+----+----------------+
scala> df_new.printSchema
root
 |-- Name: string (nullable = true)
 |-- Marks: array (nullable = true)
 |    |-- element: long (containsNull = true)
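To see what split() plus the cast does to a single Marks value, here is the same transformation sketched with plain Scala string methods (the Spark version applies this column-wise across the whole DataFrame):

```scala
// One Marks value: split on comma, then convert each token to Long,
// mirroring split(col("Marks"),",").cast("array<long>") for one row.
object SplitSketch extends App {
  val marks: Array[Long] = "10,20,30,40".split(",").map(_.toLong)
  println(marks.mkString("[", ", ", "]")) // [10, 20, 30, 40]
}
```

Note that in Spark, any token that cannot be parsed as a long would become null in the array rather than throwing, because cast is lenient.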
Now use the explode() function, which produces one output row per array element
scala> val df_ex=df_new.withColumn("Marks",explode($"Marks"))
scala> df_ex.show()
+----+-----+
|Name|Marks|
+----+-----+
|   A|   10|
|   A|   20|
|   A|   30|
|   A|   40|
|   B|   40|
|   B|   50|
|   B|   60|
|   B|   70|
|   C|   15|
|   C|   25|
|   C|   34|
|   C|   23|
|   D|   20|
|   D|   45|
|   D|   31|
|   D|   78|
+----+-----+
Find the SUM of the marks
scala> df_ex.groupBy($"Name").agg(sum($"Marks").alias("Marks")).orderBy($"Name").show()
+----+-----+
|Name|Marks|
+----+-----+
|   A|  100|
|   B|  220|
|   C|   97|
|   D|  174|
+----+-----+
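As a sanity check on the totals above, the same per-student aggregation can be reproduced with plain Scala collections, without a Spark session (the data is hard-coded from Data.csv):

```scala
// Recompute the expected groupBy/sum result from the raw data,
// mirroring explode + groupBy("Name") + sum("Marks").
object SumSketch extends App {
  val data = Map(
    "A" -> Seq(10, 20, 30, 40),
    "B" -> Seq(40, 50, 60, 70),
    "C" -> Seq(15, 25, 34, 23),
    "D" -> Seq(20, 45, 31, 78)
  )
  val sums = data.map { case (name, marks) => name -> marks.sum }
  println(sums.toSeq.sortBy(_._1)) // A -> 100, B -> 220, C -> 97, D -> 174
}
```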
collect_list() and collect_set() functions
collect_list() gathers the values of a column into an array per group or window partition (duplicates kept); collect_set() does the same but removes duplicates. The snippets below assume an empDF DataFrame with empno, ename, sal and deptno columns (not defined earlier in this post) and require import org.apache.spark.sql.expressions.Window. Note that with orderBy in the window spec, the default frame is a running frame, so the first call builds a growing list per row; without orderBy, each row gets the full partition's list.
empDF.withColumn("Dept_Sal", collect_list("sal").over(Window.partitionBy("deptno").orderBy("sal"))).show(false)
empDF.withColumn("Dept_Sal", collect_list("sal").over(Window.partitionBy("deptno"))).show
empDF.withColumn("arr", array("empno","ename")).show
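The difference between collect_list() and collect_set() can be illustrated with plain Scala collections, using hypothetical salary values grouped by department (the actual empDF data is not shown in this post):

```scala
// Per-group salaries, with a duplicate in department 10.
// collect_list keeps duplicates; collect_set drops them.
object CollectSketch extends App {
  val salsByDept = Map(10 -> Seq(800, 1300, 1300), 20 -> Seq(1100, 3000))
  val asList = salsByDept                                   // like collect_list
  val asSet  = salsByDept.map { case (d, s) => d -> s.distinct } // like collect_set
  println(asList(10)) // List(800, 1300, 1300)
  println(asSet(10))  // List(800, 1300)
}
```

One caveat that does not show up in this sketch: collect_set() in Spark makes no ordering guarantee for the resulting array.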