Spark SQL provides spark.read.json("path") to read single-line and multiline (multiple lines per record) JSON files into a Spark DataFrame, and dataframe.write.json("path") to save or write a DataFrame out to a JSON file. In this tutorial, you will learn how to read a single file, multiple files, and all files from a directory into a DataFrame, and how to write a DataFrame back to a JSON file, using Scala examples.
JSON data in a single line
[{"empno":"7369","ename":"SMITH","designation":"CLERK","manager":"7902","hire_date":"12/17/1980","sal":"800","deptno":"20" },{"empno":"7499","ename":"ALLEN","designation":"SALESMAN","manager":"7698","hire_date":"2/20/1981","sal":"1600","deptno":"30"},{"empno":"7521","ename":"WARD","designation":"SALESMAN","manager":"7698","hire_date":"2/22/1981","sal":"1250","deptno":"30" },{"empno":"7566","ename":"TURNER","designation":"MANAGER","manager":"7839","hire_date":"4/2/1981","sal":"2975","deptno":"20"},{"empno":"7654","ename":"MARTIN","designation":"SALESMAN","manager":"7698","hire_date":"9/28/1981","sal":"1250","deptno":"30"},{"empno":"7698","ename":"MILLER","designation":"MANAGER","manager":"7839","hire_date":"5/1/1981","sal":"2850","deptno":"30"},{"empno":"7782","ename":"CLARK","designation":"MANAGER","manager":"7839","hire_date":"6/9/1981","sal":"2450","deptno":"10"},{"empno":"7788","ename":"SCOTT","designation":"ANALYST","manager":"7566","hire_date":"12/9/1982","sal":"3000","deptno":"20"},{"empno":"7839","ename":"KING","designation":"PRESIDENT","manager":"NULL","hire_date":"11/17/1981","sal":"5000","deptno":"10"}]
Read a single-line JSON file
val df = spark.read.json("/FileStore/tables/employees_singleLine.json")
df.show
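The intro also promises reading multiple files and whole directories. A minimal sketch, assuming a second file exists alongside the first (the second path is illustrative), since read.json accepts multiple paths:

// Read several specific JSON files in one call
val dfFiles = spark.read.json(
  "/FileStore/tables/employees_singleLine.json",
  "/FileStore/tables/employees_singleLine2.json")

// Read every JSON file under a directory
val dfDir = spark.read.json("/FileStore/tables/")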
JSON data in multiple lines
[
{
"empno":"7369",
"ename":"SMITH",
"designation":"CLERK",
"manager":"7902",
"hire_date":"12/17/1980",
"sal":"800",
"deptno":"20"
},
{
"empno":"7499",
"ename":"ALLEN",
"designation":"SALESMAN",
"manager":"7698",
"hire_date":"2/20/1981",
"sal":"1600",
"deptno":"30"
},
{
"empno":"7521",
"ename":"WARD",
"designation":"SALESMAN",
"manager":"7698",
"hire_date":"2/22/1981",
"sal":"1250",
"deptno":"30"
},
{
"empno":"7566",
"ename":"TURNER",
"designation":"MANAGER",
"manager":"7839",
"hire_date":"4/2/1981",
"sal":"2975",
"deptno":"20"
},
{
"empno":"7654",
"ename":"MARTIN",
"designation":"SALESMAN",
"manager":"7698",
"hire_date":"9/28/1981",
"sal":"1250",
"deptno":"30"
},
{
"empno":"7698",
"ename":"MILLER",
"designation":"MANAGER",
"manager":"7839",
"hire_date":"5/1/1981",
"sal":"2850",
"deptno":"30"
},
{
"empno":"7782",
"ename":"CLARK",
"designation":"MANAGER",
"manager":"7839",
"hire_date":"6/9/1981",
"sal":"2450",
"deptno":"10"
},
{
"empno":"7788",
"ename":"SCOTT",
"designation":"ANALYST",
"manager":"7566",
"hire_date":"12/9/1982",
"sal":"3000",
"deptno":"20"
},
{
"empno":"7839",
"ename":"KING",
"designation":"PRESIDENT",
"manager":"NULL",
"hire_date":"11/17/1981",
"sal":"5000",
"deptno":"10"
}
]
Read a multiline JSON file
val df = spark.read.option("multiline","true").json("/FileStore/tables/employees_multiLine.json")
df.show
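Schema inference costs Spark an extra pass over the data, and since every value in this sample is quoted, all columns come back as strings anyway. A sketch of supplying the schema explicitly instead (field names taken from the file above):

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// One StringType field per attribute in the employee records
val empSchema = StructType(Seq(
  StructField("empno", StringType, true),
  StructField("ename", StringType, true),
  StructField("designation", StringType, true),
  StructField("manager", StringType, true),
  StructField("hire_date", StringType, true),
  StructField("sal", StringType, true),
  StructField("deptno", StringType, true)))

val dfTyped = spark.read.schema(empSchema)
  .option("multiline", "true")
  .json("/FileStore/tables/employees_multiLine.json")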
Read a nested multiline JSON file
{
"user": "gT35Hhhre9m",
"dates": ["2016-01-29", "2016-01-28"],
"status": "OK",
"reason": "some reason",
"content": [{
"foo": 123,
"bar": "val1"
}, {
"foo": 456,
"bar": "val2"
}, {
"foo": 789,
"bar": "val3"
}, {
"foo": 124,
"bar": "val4"
}, {
"foo": 126,
"bar": "val5"
}]
}
"user": "gT35Hhhre9m",
"dates": ["2016-01-29", "2016-01-28"],
"status": "OK",
"reason": "some reason",
"content": [{
"foo": 123,
"bar": "val1"
}, {
"foo": 456,
"bar": "val2"
}, {
"foo": 789,
"bar": "val3"
}, {
"foo": 124,
"bar": "val4"
}, {
"foo": 126,
"bar": "val5"
}]
}
val df = spark.read.option("multiline","true").json("/FileStore/tables/data.json")
df.show
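Here content is an array of structs, so each row carries all five foo/bar pairs. A sketch of flattening it into one row per pair with explode (column names come from the sample file above):

import org.apache.spark.sql.functions.{col, explode}

// Replace the array column with one struct per row, then pull out its fields
val flat = df.withColumn("content", explode(col("content")))
  .select(col("user"), col("status"), col("content.foo"), col("content.bar"))
flat.show()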
Write a DataFrame to a JSON file
df.write.json("/tmp/spark_output/zipcodes.json")
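By default, write.json fails if the target directory already exists. A save mode and a compression codec can be set through the standard DataFrameWriter options; a sketch, reusing the path above:

df.write
  .mode("overwrite")
  .option("compression", "gzip")
  .json("/tmp/spark_output/zipcodes.json")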
Processing the JSON files
JSON (JavaScript Object Notation) is a text-based data interchange format derived from JavaScript and, like XML, is human-readable. The following example uses the SQLContext method jsonFile to load the HDFS-based JSON data file named device.json; note that jsonFile belongs to the older SQLContext API and was later deprecated in favor of read.json. The resulting data is created as a DataFrame:
val dframe = sqlContext.jsonFile("hdfs:///data/spark/device.json")
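On Spark 2.0 and later, the same load is done through the session's reader instead; a sketch, assuming a SparkSession named spark is in scope:

val dframe = spark.read.json("hdfs:///data/spark/device.json")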
First, the Apache Spark and Spark SQL classes are imported:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object sql1 {

  def main(args: Array[String]) {

    // Set up the Spark context and the (pre-2.0) SQL context
    val appName = "sql example 1"
    val conf = new SparkConf()
    conf.setAppName(appName)
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Load the raw comma-separated data from HDFS
    val rawRdd = sc.textFile("hdfs:///data/spark/sql/adult.test.data_1x")

    // Build a schema of string columns from a list of field names
    val schemaString = "age workclass fnlwgt education " +
      "educational-num marital-status occupation relationship " +
      "race gender capital-gain capital-loss hours-per-week " +
      "native-country income"
    val schema = StructType(schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, true)))

    // Split each line into a Row of 15 string fields
    val rowRDD = rawRdd.map(_.split(","))
      .map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7),
        p(8), p(9), p(10), p(11), p(12), p(13), p(14)))

    // Create the DataFrame, convert it to JSON, and save it to HDFS
    val adultDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    val jsonData = adultDataFrame.toJSON
    jsonData.saveAsTextFile("hdfs:///data/spark/sql/adult.json")

  } // end main

} // end sql1
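For comparison, a minimal sketch of the same pipeline on modern Spark (2.x and later), where SparkSession replaces SQLContext and the built-in CSV reader does the splitting; the paths and field names are reused from the example above:

import org.apache.spark.sql.SparkSession

object sql1Modern {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sql example 1").getOrCreate()

    // Same field names as schemaString above
    val names = Seq("age", "workclass", "fnlwgt", "education",
      "educational-num", "marital-status", "occupation", "relationship",
      "race", "gender", "capital-gain", "capital-loss", "hours-per-week",
      "native-country", "income")

    // The CSV reader replaces the manual split/Row mapping;
    // columns are read as strings when no schema is supplied
    val adultDF = spark.read.csv("hdfs:///data/spark/sql/adult.test.data_1x")
      .toDF(names: _*)

    // Write the DataFrame directly as JSON
    adultDF.write.json("hdfs:///data/spark/sql/adult.json")
  }
}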
The resulting data can then be inspected on HDFS: listing the target directory with the Hadoop file system ls command shows that it contains a success marker file and two part files.