A NULL value can be represented in multiple ways, and many people confuse it with a BLANK or empty string; however, there is a difference. NULL means unknown, whereas BLANK is an empty but known value. Let's look at the operations available on a Spark DataFrame that help with handling NULL values.
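To make the NULL-versus-BLANK distinction concrete, here is a minimal sketch (the column names and sample values are illustrative, and `spark.implicits._` assumes a spark-shell or an existing SparkSession named `spark`):

```scala
import spark.implicits._

// "name" is blank (known, empty) for id 2 and unknown (NULL) for id 3
val df = Seq((1, "abc"), (2, ""), (3, null)).toDF("id", "name")

// isNull matches only the unknown value, not the empty string
df.filter($"name".isNull).show()   // returns the id 3 row only

// an equality check against "" matches only the blank value
df.filter($"name" === "").show()   // returns the id 2 row only
```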
Create DataFrame
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = new StructType()
.add("empno", IntegerType)
.add("ename", StringType)
.add("job", StringType)
.add("mgr", IntegerType)
.add("hiredate",StringType)
.add("sal", IntegerType)
.add("comm", IntegerType)
.add("deptno",IntegerType)
val data = Seq(
Row(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
Row(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
Row(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, null),
Row(7566, "JONES", "MANAGER", null, "2-Apr-81", 2975, 0, 20),
Row(null,null,null,null,null,null,null,null),
Row(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, null),
Row(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
Row(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
Row(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
Row(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
Row(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 20),
Row(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20))
val empDF = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
empDF.show
Find the records which Department No is NULL
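A straightforward way to find these records is `isNull` on the column:

```scala
import org.apache.spark.sql.functions.col

// rows whose deptno is NULL (WARD, MARTIN, and the all-NULL row)
empDF.filter(col("deptno").isNull).show()
```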
Drop rows in which any column is NULL.
empDF.na.drop("any").show()
Drop rows in which all columns are NULL.
empDF.na.drop("all").show()
Drop rows that do not have at least "n" non-NULL columns.
empDF.na.drop(5).show()
empDF.na.drop(6).show()
Drop rows only when all of the specified columns are NULL. The default is
"any", so "all" must be mentioned explicitly in the drop method along with
the column list.
empDF.na.drop("all",Array("comm","deptno")).show()
Drop rows that have a NULL in any of the specified columns.
empDF.na.drop(Array("comm")).show()
empDF.na.drop(Array("comm","deptno")).show()
Fill all the “numeric” columns with a default value if NULL.
empDF.na.fill(-1).show()
Replace the NULL values in a specific column with a default value. If the default value is not of the column's datatype, it is ignored.
empDF.na.fill(-1,Array("deptno")).show()
Fill values for multiple columns, with a default value for each specific column.
empDF.na.fill(Map("deptno" -> -1, "hiredate" -> "2999-12-31")).show()
Fill all the “string” columns with a default value if NULL.
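For the string case, passing a `String` to `na.fill` touches only string-typed columns; a minimal sketch (the replacement value "Unknown" is illustrative):

```scala
// fills NULLs in string columns only; numeric columns are untouched
empDF.na.fill("Unknown").show()
```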

Spark DataFrame: filter out both NULLs and empty strings
val myDF = Seq((1, "abc"),(2,null),(3,null),(4, ""),(5,"def")).toDF("col1", "col2")
myDF.show
col1 col2
1    abc
2    null
3    null
4
5    def
myDF.filter(($"col2" =!= "") && ($"col2".isNotNull)).show
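The predicate above treats only the empty string as blank. If whitespace-only values ("spaces") should also be excluded, one sketch is to trim before comparing (assumption: values containing only whitespace count as blank):

```scala
import org.apache.spark.sql.functions.trim

// NULLs fail isNotNull; trim catches values like "   " as well as ""
myDF.filter($"col2".isNotNull && trim($"col2") =!= "").show()
```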