A NULL value can be represented in multiple ways, and many people confuse it with a BLANK or empty string; however, there is a difference. NULL means unknown, whereas BLANK is an empty but known value. Let's look at the operations available on a Spark DataFrame that help with handling NULL values.
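To make the NULL-versus-BLANK distinction concrete, here is a minimal sketch (the column names and sample values are illustrative, and `spark.implicits._` assumes a spark-shell or an existing SparkSession named `spark`):

```scala
import spark.implicits._

// "name" is blank (known, empty) for id 2 and unknown (NULL) for id 3
val df = Seq((1, "abc"), (2, ""), (3, null)).toDF("id", "name")

// isNull matches only the unknown value, not the empty string
df.filter($"name".isNull).show()   // returns the id 3 row only

// an equality check against "" matches only the blank value
df.filter($"name" === "").show()   // returns the id 2 row only
```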
Create DataFrame
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = new StructType()
.add("empno", IntegerType)
.add("ename", StringType)
.add("job", StringType)
.add("mgr", IntegerType)
.add("hiredate",StringType)
.add("sal", IntegerType)
.add("comm", IntegerType)
.add("deptno",IntegerType)
val data = Seq(
Row(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
Row(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
Row(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, null),
Row(7566, "JONES", "MANAGER", null, "2-Apr-81", 2975, 0, 20),
Row(null,null,null,null,null,null,null,null),
Row(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, null),
Row(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
Row(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
Row(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
Row(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
Row(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 20),
Row(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20))
val empDF = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
empDF.show
Find the records which Department No is NULL
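A straightforward way to find these records is `isNull` on the column:

```scala
import org.apache.spark.sql.functions.col

// rows whose deptno is NULL (WARD, MARTIN, and the all-NULL row)
empDF.filter(col("deptno").isNull).show()
```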
Drop rows in which any column is NULL.
empDF.na.drop("any").show()
Drop rows in which all columns are NULL.
empDF.na.drop("all").show()
Drop rows that do not have at least "n" non-NULL columns.
empDF.na.drop(5).show()
empDF.na.drop(6).show()
Drop rows only when all of the specified columns are NULL. The default is
"any", so "all" must be mentioned explicitly in the drop method along with
the column list.
empDF.na.drop("all",Array("comm","deptno")).show()
Drop rows that have a NULL in any of the specified columns.
empDF.na.drop(Array("comm")).show()
empDF.na.drop(Array("comm","deptno")).show()
Fill all the “numeric” columns with a default value if NULL.
empDF.na.fill(-1).show()
Replace the NULL values in a specific column with a default value. If the default value is not of the column's datatype, it is ignored.
empDF.na.fill(-1,Array("deptno")).show()
Fill values for multiple columns, with a default value for each specific column.
empDF.na.fill(Map("deptno" -> -1, "hiredate" -> "2999-12-31")).show()
Fill all the “string” columns with a default value if NULL.
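For the string case, passing a `String` to `na.fill` touches only string-typed columns; a minimal sketch (the replacement value "Unknown" is illustrative):

```scala
// fills NULLs in string columns only; numeric columns are untouched
empDF.na.fill("Unknown").show()
```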

Spark DataFrame: filter out both NULLs and empty strings
val myDF = Seq((1, "abc"),(2,null),(3,null),(4, ""),(5,"def")).toDF("col1", "col2")
myDF.show
col1 col2
1    abc
2    null
3    null
4
5    def
myDF.filter(($"col2" =!= "") && ($"col2".isNotNull)).show
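The predicate above treats only the empty string as blank. If whitespace-only values ("spaces") should also be excluded, one sketch is to trim before comparing (assumption: values containing only whitespace count as blank):

```scala
import org.apache.spark.sql.functions.trim

// NULLs fail isNotNull; trim catches values like "   " as well as ""
myDF.filter($"col2".isNotNull && trim($"col2") =!= "").show()
```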