DataFrame from JSON File

Spark SQL provides spark.read.json("path") to read single-line and multiline (multiple lines per record) JSON files into a Spark DataFrame, and dataframe.write.json("path") to save or write a DataFrame to a JSON file. In this tutorial, you will learn how to read a single file, multiple files, and all files from a directory into a DataFrame, and how to write a DataFrame back to a JSON file, using Scala examples.

JSON data in a single line
[{"empno":"7369","ename":"SMITH","designation":"CLERK","manager":"7902","hire_date":"12/17/1980","sal":"800","deptno":"20" },{"empno":"7499","ename":"ALLEN","designation":"SALESMAN","manager":"7698","hire_date":"2/20/1981","sal":"1600","deptno":"30"},{"empno":"7521","ename":"WARD","designation":"SALESMAN","manager":"7698","hire_date":"2/22/1981","sal":"1250","deptno":"30"  },{"empno":"7566","ename":"TURNER","designation":"MANAGER","manager":"7839","hire_date":"4/2/1981","sal":"2975","deptno":"20"},{"empno":"7654","ename":"MARTIN","designation":"SALESMAN","manager":"7698","hire_date":"9/28/1981","sal":"1250","deptno":"30"},{"empno":"7698","ename":"MILLER","designation":"MANAGER","manager":"7839","hire_date":"5/1/1981","sal":"2850","deptno":"30"},{"empno":"7782","ename":"CLARK","designation":"MANAGER","manager":"7839","hire_date":"6/9/1981","sal":"2450","deptno":"10"},{"empno":"7788","ename":"SCOTT","designation":"ANALYST","manager":"7566","hire_date":"12/9/1982","sal":"3000","deptno":"20"},{"empno":"7839","ename":"KING","designation":"PRESIDENT","manager":"NULL","hire_date":"11/17/1981","sal":"5000","deptno":"10"}]

Read a Single-Line JSON File
val df = spark.read.json("/FileStore/tables/employees_singleLine.json")
df.show
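When Spark infers the schema, it scans the file once before reading it. For larger files you can supply an explicit schema instead. The sketch below is a minimal, self-contained illustration of this: it writes a small single-line JSON sample (an abbreviated version of the employee records above, with a hypothetical temp-file path) and reads it back with a user-supplied schema.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val spark = SparkSession.builder()
  .appName("json-schema-example")
  .master("local[*]")
  .getOrCreate()

// Write a tiny single-line JSON sample to a temp file so the sketch is self-contained.
val tmp = Files.createTempFile("employees", ".json")
Files.write(tmp, """{"empno":"7369","ename":"SMITH","sal":"800"}""".getBytes)

// Supplying an explicit schema skips Spark's schema-inference pass over the file.
val empSchema = StructType(Seq(
  StructField("empno", StringType, true),
  StructField("ename", StringType, true),
  StructField("sal", StringType, true)
))

val df = spark.read.schema(empSchema).json(tmp.toString)
df.show()
```

With an explicit schema the column order also follows the schema, rather than the alphabetical order Spark uses for inferred JSON schemas.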

 

JSON data in multiple lines
[
  {
    "empno":"7369",
    "ename":"SMITH",
    "designation":"CLERK",
    "manager":"7902",
    "hire_date":"12/17/1980",
    "sal":"800",
    "deptno":"20"
  },
  {
    "empno":"7499",
    "ename":"ALLEN",
    "designation":"SALESMAN",
    "manager":"7698",
    "hire_date":"2/20/1981",
    "sal":"1600",
    "deptno":"30"
  },
  {
    "empno":"7521",
    "ename":"WARD",
    "designation":"SALESMAN",
    "manager":"7698",
    "hire_date":"2/22/1981",
    "sal":"1250",
    "deptno":"30"
  },
  {
    "empno":"7566",
    "ename":"TURNER",
    "designation":"MANAGER",
    "manager":"7839",
    "hire_date":"4/2/1981",
    "sal":"2975",
    "deptno":"20"
  },
  {
    "empno":"7654",
    "ename":"MARTIN",
    "designation":"SALESMAN",
    "manager":"7698",
    "hire_date":"9/28/1981",
    "sal":"1250",
    "deptno":"30"
  },
  {
    "empno":"7698",
    "ename":"MILLER",
    "designation":"MANAGER",
    "manager":"7839",
    "hire_date":"5/1/1981",
    "sal":"2850",
    "deptno":"30"
  },
  {
    "empno":"7782",
    "ename":"CLARK",
    "designation":"MANAGER",
    "manager":"7839",
    "hire_date":"6/9/1981",
    "sal":"2450",
    "deptno":"10"
  },
  {
    "empno":"7788",
    "ename":"SCOTT",
    "designation":"ANALYST",
    "manager":"7566",
    "hire_date":"12/9/1982",
    "sal":"3000",
    "deptno":"20"
  },
  {
    "empno":"7839",
    "ename":"KING",
    "designation":"PRESIDENT",
    "manager":"NULL",
    "hire_date":"11/17/1981",
    "sal":"5000",
    "deptno":"10"
  }
]
 

Read a Multi-Line JSON File
val df = spark.read.option("multiline","true").json("/FileStore/tables/employees_multiLine.json")
df.show


 
Read a Multi-Line JSON File with Nested Data
 
{
    "user": "gT35Hhhre9m",
    "dates": ["2016-01-29", "2016-01-28"],
    "status": "OK",
    "reason": "some reason",
    "content": [{
        "foo": 123,
        "bar": "val1"
    }, {
        "foo": 456,
        "bar": "val2"
    }, {
        "foo": 789,
        "bar": "val3"
    }, {
        "foo": 124,
        "bar": "val4"
    }, {
        "foo": 126,
        "bar": "val5"
    }]
}
 
val df = spark.read.option("multiline","true").json("/FileStore/tables/data.json")
df.show 
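Because content is an array of structs, df.show displays it as a single nested column. A common next step is to flatten it with explode. The sketch below is self-contained: it writes an abbreviated stand-in for data.json to a hypothetical temp file, then turns each element of content into its own row.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder()
  .appName("nested-json-example")
  .master("local[*]")
  .getOrCreate()

// Abbreviated stand-in for the data.json file shown above.
val tmp = Files.createTempFile("data", ".json")
Files.write(tmp,
  """{"user":"gT35Hhhre9m","status":"OK","content":[{"foo":123,"bar":"val1"},{"foo":456,"bar":"val2"}]}""".getBytes)

val df = spark.read.option("multiline", "true").json(tmp.toString)

// explode produces one row per element of the content array;
// the struct fields can then be selected as ordinary columns.
val flat = df.select(col("user"), explode(col("content")).as("c"))
  .select(col("user"), col("c.foo"), col("c.bar"))
flat.show()
```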
 
Write DataFrame to a JSON File

df2.write.json("/tmp/spark_output/zipcodes.json")
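By default the write fails if the output directory already exists, and Spark produces one part file per partition. The sketch below shows a common variation: mode("overwrite") to replace existing output and coalesce(1) to get a single part file. It is self-contained, using a small hypothetical DataFrame and a temp directory in place of the paths above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-write-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the df2 used above.
val df2 = Seq(("7369", "SMITH"), ("7499", "ALLEN")).toDF("empno", "ename")

val outDir = Files.createTempDirectory("spark_out").toString + "/employees"

// coalesce(1) yields a single part file; mode("overwrite") replaces existing output.
df2.coalesce(1).write.mode("overwrite").json(outDir)

// Read the output back to confirm the round trip.
val back = spark.read.json(outDir)
back.show()
```

Note that coalesce(1) funnels all data through one task, so it is only appropriate for small outputs.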
 
 
Processing the JSON files

JSON (JavaScript Object Notation) is a text-based data interchange format that originated from JavaScript; like XML, it represents structured data as plain text. The following example uses the SQL context method jsonFile to load the HDFS-based JSON data file named device.json.

The resulting data is created as a data frame:
val dframe = sqlContext.jsonFile("hdfs:///data/spark/device.json")

First, the Apache Spark and Spark SQL classes are imported:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object sql1 {
  def main(args: Array[String]) {
    val appName = "sql example 1"
    val conf = new SparkConf()
    conf.setAppName(appName)
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Load the raw CSV data from HDFS.
    val rawRdd = sc.textFile("hdfs:///data/spark/sql/adult.test.data_1x")

    // Build a schema of string columns from a space-separated field list.
    val schemaString = "age workclass fnlwgt education " +
      "educational-num marital-status occupation relationship " +
      "race gender capital-gain capital-loss hours-per-week " +
      "native-country income"
    val schema = StructType(schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, true)))

    // Split each CSV line into columns and wrap it in a Row.
    val rowRDD = rawRdd.map(_.split(","))
      .map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7),
        p(8), p(9), p(10), p(11), p(12), p(13), p(14)))

    // Create the DataFrame, convert it to JSON, and save it to HDFS.
    val adultDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    val jsonData = adultDataFrame.toJSON
    jsonData.saveAsTextFile("hdfs:///data/spark/sql/adult.json")
  } // end main
} // end sql1
The resulting data can then be seen on HDFS; the Hadoop file system ls command shows that the data resides in the target directory as a _SUCCESS file and two part files.
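Note that SQLContext and jsonFile date from Spark 1.x; since Spark 2.x, SparkSession and spark.read.json are the preferred entry points. The sketch below reworks the same CSV-to-JSON flow with SparkSession, using a one-line in-memory stand-in for the adult dataset (an abbreviated, hypothetical three-column version) so it runs without HDFS.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// SparkSession replaces both SparkContext and SQLContext in Spark 2.x+.
val spark = SparkSession.builder()
  .appName("sql example 1 (SparkSession)")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Abbreviated schema and an in-memory stand-in for the adult dataset.
val fields = Seq("age", "workclass", "income")
val schema = StructType(fields.map(f => StructField(f, StringType, true)))

val rowRDD = sc.parallelize(Seq("39,State-gov,<=50K"))
  .map(_.split(","))
  .map(p => Row(p(0), p(1), p(2)))

// Same flow as the listing above: rows + schema -> DataFrame -> JSON.
val adultDF = spark.createDataFrame(rowRDD, schema)
adultDF.toJSON.collect().foreach(println)
```

In a real job the final line would be adultDF.write.json(...) rather than collecting to the driver.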

 

