Spark filter() function is used to filter the rows from DataFrame or Dataset based on the given condition or SQL expression, alternatively, you can also use where() operator instead of the filter if you are coming from SQL background. Both these functions are exactly the same.
val empDF = spark.createDataFrame(Seq(
(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
)).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
empDF.show
filter with Column Condition
Use Column with the condition to filter the rows from DataFrame, using this you can express complex condition by referring column names using col(name), $"colname" dfObject("colname") , this approach is mostly used while working with DataFrames. Use “===” for comparison.
empDF.where('deptno === 30).show
empDF.filter("job == 'CLERK'").show
Filter with multiple conditions
To filter() rows on DataFrame based on multiple conditions, you case use either Column with a condition or SQL expression. Below is just a simple example, you can extend this with AND(&&), OR(||), and NOT(!) conditional expressions as needed.
empDF.filter(($"job" === "SALESMAN") && ($"deptno" === 30)).show(false)
empDF.filter(($"job" === "SALESMAN") || ($"deptno" === 30)).show(false)
empDF.filter($"sal" > 1500).show
(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
)).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
empDF.show
filter with Column Condition
Use Column with the condition to filter the rows from DataFrame, using this you can express complex condition by referring column names using col(name), $"colname" dfObject("colname") , this approach is mostly used while working with DataFrames. Use “===” for comparison.
There are various alternate syntaxes that give you the same result and same performance.
- df.where("column_Name = value")
- df.where($"column_Name" === value)
- df.where('column_Name === true)
empDF.filter("deptno = 30").show
empDF.where("deptno = 30").show
empDF.filter($"deptno" === 30).show
empDF.where($"deptno" === 30).show
empDF.where($"deptno" === 30).show
empDF.filter('deptno === 30).show
empDF.where('deptno === 30).show
empDF.filter($"job" === "CLERK").show
you can also use "where" in place "filter"
empDF.where(($"job" === "SALESMAN")).show
DataFrame filter() with SQL Expression
If you are coming from SQL background, you can use that knowledge in Spark to filter DataFrame rows with SQL expressions.
empDF.filter("deptno == 30").show
empDF.where(($"job" === "SALESMAN")).show
DataFrame filter() with SQL Expression
If you are coming from SQL background, you can use that knowledge in Spark to filter DataFrame rows with SQL expressions.
empDF.filter("deptno == 30").show
empDF.filter("job == 'CLERK'").show
Filter with multiple conditions
To filter() rows on DataFrame based on multiple conditions, you case use either Column with a condition or SQL expression. Below is just a simple example, you can extend this with AND(&&), OR(||), and NOT(!) conditional expressions as needed.
empDF.filter(($"job" === "SALESMAN") && ($"deptno" === 30)).show(false)
empDF.filter(($"job" === "SALESMAN") || ($"deptno" === 30)).show(false)
empDF.filter($"sal" > 1500).show
empDF.filter($"sal" <1500).show
No comments:
Post a Comment