Hive File Formats

SEQUENCE-FILE
We know that Hadoop performs better with a small number of large files than with a large number of small files. A file is considered small if it is smaller than the typical HDFS block size. Every small file carries its own metadata, and this metadata becomes an overhead on the NameNode. Sequence files were introduced in Hadoop to solve this problem: they act as a container for storing many small files.

Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to use for a given record. Sequence files are a splittable binary format, and their main use is to combine two or more smaller files into a single sequence file. In Hive we can create a sequence file table by specifying STORED AS SEQUENCEFILE at the end of a CREATE TABLE statement.

There are three types of sequence files:
  • Uncompressed key/value records.
  • Record-compressed key/value records – only the values are compressed.
  • Block-compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed; the size of a ‘block’ is configurable (see the settings sketch below).
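As a rough sketch, block compression for sequence file output can be enabled through session settings before inserting data (these are the usual Hive/Hadoop property names, but they may vary with your version):

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;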
Hive has its own SEQUENCEFILE reader and SEQUENCEFILE writer libraries for reading and writing sequence files. In Hive we can create a sequence file table as follows:

create table table_name (schema of the table) row format delimited fields terminated by ',' stored as SEQUENCEFILE;

Hive uses the following SEQUENCEFILE input and output format classes:
org.apache.hadoop.mapred.SequenceFileInputFormat
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
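STORED AS SEQUENCEFILE is effectively shorthand for naming these classes explicitly. A minimal sketch of the equivalent long form (the table name and schema are placeholders):

create table table_name (schema of the table)
stored as
  inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';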

Creating SEQUENCEFILE
create table olympic_sequencefile (
  athlete STRING,
  age INT,
  country STRING,
  year STRING,
  closing STRING,
  sport STRING,
  gold INT,
  silver INT,
  bronze INT,
  total INT)
row format delimited fields terminated by '\t'
stored as sequencefile;

Here we are creating a table named olympic_sequencefile with the schema specified above; the data inside the input file is delimited by tabs. At the end, the file format is specified as SEQUENCEFILE. You can check the schema of your created table using the DESCRIBE command:
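describe olympic_sequencefile;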

Loading data into this table is somewhat different from loading into a table created with the TEXTFILE format. Because SEQUENCEFILE is a binary format, you need to insert the data from another table:

INSERT OVERWRITE TABLE olympic_sequencefile SELECT * FROM olympic;
Hive serializes the data into the sequence file binary format (compressing it if output compression is enabled) as it writes it into the table. Loading a raw text file directly, as you would with the TEXTFILE format, is not possible, because the file on disk must already be in the sequence file format.
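For completeness, a minimal sketch of where the olympic source table above might come from: a TEXTFILE staging table into which raw data can be loaded directly (the input path is an assumption):

create table olympic (
  athlete STRING,
  age INT,
  country STRING,
  year STRING,
  closing STRING,
  sport STRING,
  gold INT,
  silver INT,
  bronze INT,
  total INT)
row format delimited fields terminated by '\t'
stored as textfile;

-- the path below is a placeholder for your actual data file
load data local inpath '/path/to/olympic_data.tsv' into table olympic;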

TEXT-FILE
TEXTFILE is a widely used input/output format in Hadoop and is Hive's default file format: data is stored as plain, human-readable text, one record per line, and can be loaded into a table directly with LOAD DATA.
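Because TEXTFILE is the default (controlled by the hive.default.fileformat setting), the STORED AS clause is optional for it; the following two statements are, by default, equivalent (a sketch with placeholder columns):

create table t1 (id INT, name STRING);
create table t2 (id INT, name STRING) stored as textfile;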

RC-FILE
RCFILE stands for Record Columnar File. It is another binary file format, and it offers a high compression rate because values from the same column are stored together.
RCFILE is used when we want to perform operations on multiple rows at a time.
RCFILEs are flat files consisting of binary key/value pairs and share many similarities with SEQUENCEFILE. RCFILE stores the columns of a table in a columnar manner: it first partitions rows horizontally into row splits, and then partitions each row split vertically, column by column. The metadata of a row split is stored as the key part of a record, and all the data of that row split as the value part. This means that RCFILE favours column-oriented storage over row-oriented storage.
This column-oriented storage is very useful when performing analytics: it is easy to perform analytics when we have a column-oriented storage type.
Facebook uses RCFILE as the default file format for storing data in its data warehouse, since it performs different types of analytics using Hive.

In Hive we can create an RCFILE table as follows:
create table table_name (schema of the table) row format delimited fields terminated by ',' stored as RCFILE;
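As with SEQUENCEFILE, data enters an RCFILE table through an insert from another table rather than a direct load. A minimal sketch reusing the olympic staging table from above:

create table olympic_rcfile (
  athlete STRING, age INT, country STRING, year STRING, closing STRING,
  sport STRING, gold INT, silver INT, bronze INT, total INT)
stored as RCFILE;

INSERT OVERWRITE TABLE olympic_rcfile SELECT * FROM olympic;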


ORC-FILE
ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB). As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats.
An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.

In Hive we can create an ORC table as follows:
create table table_name (schema of the table) row format delimited fields terminated by ',' stored as ORC;
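ORC behaviour can also be tuned through table properties. A hedged sketch that sets the compression codec (orc.compress is a standard ORC table property, but check the values supported by your Hive version):

create table olympic_orc (
  athlete STRING, age INT, country STRING, year STRING, closing STRING,
  sport STRING, gold INT, silver INT, bronze INT, total INT)
stored as ORC tblproperties ("orc.compress"="SNAPPY");

INSERT OVERWRITE TABLE olympic_orc SELECT * FROM olympic;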
