Problem: Find the SUM of the marks for each student.
Below is the data for the students, Data.csv, which is pipe-delimited. Save this file to an HDFS location.
Name|Marks
A|10,20,30,40
B|40,50,60,70
C|15,25,34,23
D|20,45,31,78
Read this data
val df=spark.read.option("header","true").option("delimiter","|").csv("/tmp/ABC/SPARK/Data/Data.csv")
Output
scala> df.show()
+----+-----------+
|Name|      Marks|
+----+-----------+
|   A|10,20,30,40|
|   B|40,50,60,70|
|   C|15,25,34,23|
|   D|20,45,31,78|
+----+-----------+
Check the schema and the data type of each column
scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Marks: string (nullable = true)
NOTE: explode() cannot be applied to a string column. If you try, the following error occurs:
scala> df.withColumn("Marks",explode($"Marks")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`Marks`)' due to data type mismatch: input to function explode should be array or map type, not StringType;;
'Project [Name#12, explode(Marks#13) AS Marks#23
Cast the Marks column to an array type
scala> val df_new=df.withColumn("Marks",split(col("Marks"),",").cast("array<long>"))
df_new: org.apache.spark.sql.DataFrame = [Name: string, Marks: array<bigint>]
scala> df_new.show()
+----+----------------+
|Name|           Marks|
+----+----------------+
|   A|[10, 20, 30, 40]|
|   B|[40, 50, 60, 70]|
|   C|[15, 25, 34, 23]|
|   D|[20, 45, 31, 78]|
+----+----------------+
scala> df_new.printSchema
root
 |-- Name: string (nullable = true)
 |-- Marks: array (nullable = true)
 |    |-- element: long (containsNull = true)
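To see what split() plus the cast does to a single Marks value, here is the same transformation sketched with plain Scala string methods (the Spark version applies this column-wise across the whole DataFrame):

```scala
// One Marks value: split on comma, then convert each token to Long,
// mirroring split(col("Marks"),",").cast("array<long>") for one row.
object SplitSketch extends App {
  val marks: Array[Long] = "10,20,30,40".split(",").map(_.toLong)
  println(marks.mkString("[", ", ", "]")) // [10, 20, 30, 40]
}
```

Note that in Spark, any token that cannot be parsed as a long would become null in the array rather than throwing, because cast is lenient.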
Now use the explode() function, which produces one output row per array element
scala> val df_ex=df_new.withColumn("Marks",explode($"Marks"))
scala> df_ex.show()
+----+-----+
|Name|Marks|
+----+-----+
|   A|   10|
|   A|   20|
|   A|   30|
|   A|   40|
|   B|   40|
|   B|   50|
|   B|   60|
|   B|   70|
|   C|   15|
|   C|   25|
|   C|   34|
|   C|   23|
|   D|   20|
|   D|   45|
|   D|   31|
|   D|   78|
+----+-----+
Find the SUM of the marks
scala> df_ex.groupBy($"Name").agg(sum($"Marks").alias("Marks")).orderBy($"Name").show()
+----+-----+
|Name|Marks|
+----+-----+
|   A|  100|
|   B|  220|
|   C|   97|
|   D|  174|
+----+-----+
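As a sanity check on the totals above, the same per-student aggregation can be reproduced with plain Scala collections, without a Spark session (the data is hard-coded from Data.csv):

```scala
// Recompute the expected groupBy/sum result from the raw data,
// mirroring explode + groupBy("Name") + sum("Marks").
object SumSketch extends App {
  val data = Map(
    "A" -> Seq(10, 20, 30, 40),
    "B" -> Seq(40, 50, 60, 70),
    "C" -> Seq(15, 25, 34, 23),
    "D" -> Seq(20, 45, 31, 78)
  )
  val sums = data.map { case (name, marks) => name -> marks.sum }
  println(sums.toSeq.sortBy(_._1)) // A -> 100, B -> 220, C -> 97, D -> 174
}
```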
collect_list() and collect_set() functions
collect_list() gathers the values of a column into an array per group or window partition (duplicates kept); collect_set() does the same but removes duplicates. The snippets below assume an empDF DataFrame with empno, ename, sal and deptno columns (not defined earlier in this post) and require import org.apache.spark.sql.expressions.Window. Note that with orderBy in the window spec, the default frame is a running frame, so the first call builds a growing list per row; without orderBy, each row gets the full partition's list.
empDF.withColumn("Dept_Sal", collect_list("sal").over(Window.partitionBy("deptno").orderBy("sal"))).show(false)
empDF.withColumn("Dept_Sal", collect_list("sal").over(Window.partitionBy("deptno"))).show
empDF.withColumn("arr", array("empno","ename")).show
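The difference between collect_list() and collect_set() can be illustrated with plain Scala collections, using hypothetical salary values grouped by department (the actual empDF data is not shown in this post):

```scala
// Per-group salaries, with a duplicate in department 10.
// collect_list keeps duplicates; collect_set drops them.
object CollectSketch extends App {
  val salsByDept = Map(10 -> Seq(800, 1300, 1300), 20 -> Seq(1100, 3000))
  val asList = salsByDept                                   // like collect_list
  val asSet  = salsByDept.map { case (d, s) => d -> s.distinct } // like collect_set
  println(asList(10)) // List(800, 1300, 1300)
  println(asSet(10))  // List(800, 1300)
}
```

One caveat that does not show up in this sketch: collect_set() in Spark makes no ordering guarantee for the resulting array.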