In Spark, you can perform aggregate operations on a DataFrame, similar to SQL aggregate functions such as MAX, MIN, and SUM. You can also aggregate over specific columns, which is the equivalent of the GROUP BY clause in SQL. Let's walk through some examples. First, create a sample employee DataFrame:
val empDF = spark.createDataFrame(Seq(
(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
)).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
Create a Temp View from the DataFrame
empDF.createOrReplaceTempView("emp")
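With the DataFrame registered as the temp view emp, the same aggregations can also be written in plain SQL through spark.sql. A minimal sketch, assuming the SparkSession is available as spark (as it is in spark-shell); the alias names max_sal and cnt are illustrative:

```scala
// Query the "emp" temp view with standard SQL; spark.sql returns a DataFrame.
spark.sql(
  "SELECT deptno, MAX(sal) AS max_sal, COUNT(*) AS cnt FROM emp GROUP BY deptno"
).show()
```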
First method: to calculate min, max, avg, and count, we can select the aggregate functions directly:
import org.apache.spark.sql.functions._
empDF.select(min($"sal"), max($"sal"), avg($"sal"), count($"deptno")).show()
The second method is agg. To calculate count, min, max, and sum, we can use the syntax below:
import org.apache.spark.sql.functions._
empDF.agg(count($"deptno"), min($"sal"), max($"sal"), sum("sal")).show()
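By default, the aggregate columns come back with generated names such as max(sal). A quick sketch of renaming them with .as(...); the alias names used here are my own:

```scala
import org.apache.spark.sql.functions._

// Rename the aggregate output columns so the printed headers are readable.
empDF.agg(
  count($"deptno").as("cnt"),
  min($"sal").as("min_sal"),
  max($"sal").as("max_sal"),
  sum($"sal").as("total_sal")
).show()
```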
Find the number of records in a DataFrame
The count function returns the number of records in the DataFrame.
empDF.count
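Note that count is an action: it triggers a Spark job and returns a Long to the driver, unlike the lazy transformations above. A small sketch using the sample data defined earlier:

```scala
// count() is eager and returns a Long; the sample empDF above has 11 rows.
val n: Long = empDF.count()
println(n)  // prints 11 for the sample data
```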
Find the sum of Salary by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").sum("sal").show
Alternatively, the same result with agg:
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").agg(sum($"sal")).show
Find the maximum Salary by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").max("sal").show
Find the maximum Salary and record count by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").agg(Map("deptno" -> "count", "sal" -> "max")).show
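The Map-based form of agg does not let you alias the output columns. A sketch of the same aggregation with column-based aggregate functions, which support aliases and ordering (the alias names are my own):

```scala
import org.apache.spark.sql.functions._

// Same aggregation as the Map-based form, with named columns and sorted output.
empDF.groupBy($"deptno")
  .agg(count($"deptno").as("cnt"), max($"sal").as("max_sal"))
  .orderBy($"deptno")
  .show()
```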