In Spark, you can perform aggregate operations on a DataFrame, similar to SQL aggregate functions such as MAX, MIN, and SUM. You can also aggregate over specific columns, which is the equivalent of the GROUP BY clause in SQL. Let's walk through some examples. First, create a sample employee DataFrame:
val empDF = spark.createDataFrame(Seq(
(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
)).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
Create a Temp View from the DataFrame
empDF.createOrReplaceTempView("emp")
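With the DataFrame registered as the temp view emp, the same aggregations can also be written in plain SQL through spark.sql. A minimal sketch, assuming the SparkSession is available as spark (as it is in spark-shell); the alias names max_sal and cnt are illustrative:

```scala
// Query the "emp" temp view with standard SQL; spark.sql returns a DataFrame.
spark.sql(
  "SELECT deptno, MAX(sal) AS max_sal, COUNT(*) AS cnt FROM emp GROUP BY deptno"
).show()
```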
First method: to calculate min, max, avg, and count, we can select the aggregate functions directly:
import org.apache.spark.sql.functions._
empDF.select(min($"sal"), max($"sal"), avg($"sal"), count($"deptno")).show()
The second method is agg. To calculate count, min, max, and sum, we can use the syntax below:
import org.apache.spark.sql.functions._
empDF.agg(count($"deptno"), min($"sal"), max($"sal"), sum("sal")).show()
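By default, the aggregate columns come back with generated names such as max(sal). A quick sketch of renaming them with .as(...); the alias names used here are my own:

```scala
import org.apache.spark.sql.functions._

// Rename the aggregate output columns so the printed headers are readable.
empDF.agg(
  count($"deptno").as("cnt"),
  min($"sal").as("min_sal"),
  max($"sal").as("max_sal"),
  sum($"sal").as("total_sal")
).show()
```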
Find the number of records in a DataFrame
The count function returns the number of records in the DataFrame.
empDF.count
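Note that count is an action: it triggers a Spark job and returns a Long to the driver, unlike the lazy transformations above. A small sketch using the sample data defined earlier:

```scala
// count() is eager and returns a Long; the sample empDF above has 11 rows.
val n: Long = empDF.count()
println(n)  // prints 11 for the sample data
```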
Find the sum of Salary by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").sum("sal").show
Alternatively, the same result with agg:
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").agg(sum($"sal")).show
Find the maximum Salary by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").max("sal").show
Find the maximum Salary and record count by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").agg(Map("deptno" -> "count", "sal" -> "max")).show
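The Map-based form of agg does not let you alias the output columns. A sketch of the same aggregation with column-based aggregate functions, which support aliases and ordering (the alias names are my own):

```scala
import org.apache.spark.sql.functions._

// Same aggregation as the Map-based form, with named columns and sorted output.
empDF.groupBy($"deptno")
  .agg(count($"deptno").as("cnt"), max($"sal").as("max_sal"))
  .orderBy($"deptno")
  .show()
```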