Where does the data of a Hive table get stored?
By default, Hive table data is stored in the HDFS directory /user/hive/warehouse. One can change this by setting the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
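A minimal hive-site.xml sketch, assuming a custom warehouse path such as /data/hive/warehouse (the path is only an illustration):
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/data/hive/warehouse</value>
  <description>Base location for Hive managed tables</description>
</property>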
How to concatenate columns in Hive?
CONCAT(SUBSTR(ELGBLTY_CLNDR_MNTH_END_DT, 1, 4), SUBSTR(ELGBLTY_CLNDR_MNTH_END_DT, 6, 2))
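For example, assuming ELGBLTY_CLNDR_MNTH_END_DT holds values such as '2016-05-31', the expression picks out the year and month parts and returns '201605' (the table name below is only an illustration):
SELECT CONCAT(SUBSTR(ELGBLTY_CLNDR_MNTH_END_DT, 1, 4), SUBSTR(ELGBLTY_CLNDR_MNTH_END_DT, 6, 2)) AS yr_month
FROM eligibility;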
Incremental load in Hive
The merge query below combines the base table (students) with the new data (incremental) and keeps, for each ID, only the row with the latest Time_Stamp:
select t1.*
from (select * from students
      union all
      select * from incremental) t1
join (select ID, max(Time_Stamp) as last_update_date
      from (select * from students
            union all
            select * from incremental) t2
      group by ID) t3
  on t1.ID = t3.ID
 and t1.Time_Stamp = t3.last_update_date;
What is a metastore in Hive?
The metastore in Hive stores metadata using an RDBMS. It keeps the metadata for Hive tables (such as their schema and location) and partitions in a relational database, and provides client access to this information through the metastore service API.
Hive metastore consists of two fundamental units:
- A service that provides metastore access to other Apache Hive services.
- Disk storage for the Hive metadata, which is separate from HDFS storage.
Why does Hive not store metadata information in HDFS?
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The reason for choosing an RDBMS is to achieve low latency, as HDFS read/write operations are time-consuming.
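A minimal sketch of the metastore RDBMS settings in hive-site.xml, assuming a MySQL backend (host, database, user and password are placeholders):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>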
What is SerDe in Hive and what is the use of it?
SerDe (Serializer/Deserializer) is a library in Hive used to serialize and deserialize data. It allows Hive to read data from a table and write it to an HDFS location in any custom format.
Hive has the following built-in SerDes:
- JsonSerDe
- CSV
- Parquet and ORC
- Thrift, etc.
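A short sketch of using a built-in SerDe, assuming a CSV file with id and name columns (table and column names are illustrative):
CREATE TABLE csv_demo (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;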
When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute.
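For example (sampletable and salary are illustrative names):
SELECT * FROM sampletable ORDER BY salary DESC;  -- single reducer, totally ordered output
SELECT * FROM sampletable SORT BY salary DESC;   -- multiple reducers, output sorted within each reducer only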
What is a partition in Hive?
Hive organizes tables into partitions for grouping similar type of data together based on a column or partition key. Each Table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.
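A minimal sketch of a partitioned table, assuming transactions are partitioned by country (all names are illustrative):
CREATE TABLE transaction_details (txn_id STRING, amount DOUBLE)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Each loaded partition then appears as a sub-directory such as country=US under the table directory.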
Why do we perform partitioning in Hive?
Partitioning provides granularity in a Hive table and therefore, reduces the query latency by scanning only relevant partitioned data instead of the whole data set.
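Continuing the illustrative transaction_details table above, a query that filters on the partition key scans only the matching sub-directory instead of the whole table:
SELECT txn_id, amount
FROM transaction_details
WHERE country = 'US';  -- only the country=US partition is read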
What is dynamic partitioning and when is it used?
In dynamic partitioning, the values of the partition columns are known only at runtime, i.e. they are determined while the data is being loaded into the Hive table.
One may use dynamic partitioning in the following two cases (see the example after this list):
- Loading data from an existing non-partitioned table, to improve sampling and therefore decrease query latency.
- When one does not know all the partition values beforehand, so finding these values manually from a huge data set would be a tedious task.
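A minimal sketch, assuming the illustrative transaction_details table above is loaded from a non-partitioned staging table named transaction_staging:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE transaction_details PARTITION (country)
SELECT txn_id, amount, country
FROM transaction_staging;  -- the country value of each row decides its partition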
What are the complex data types in Hive?
- Map – The Map contains a key-value pair where you can search for a value using the key.
- Struct – A Struct is a collection of elements of different data types. For example, an address can contain fields of different types, such as a pin code stored as an integer.
- Array – An Array will have a collection of homogeneous elements. For example, if you take your skillset, you can have N number of skills.
- Uniontype – It represents a column which can have a value that can belong to any of the data types of your choice.
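A short sketch of a table that uses these complex types (all names are illustrative):
CREATE TABLE employee_profile (
  name    STRING,
  skills  ARRAY<STRING>,
  phone   MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, pincode:INT>,
  misc    UNIONTYPE<INT, STRING>
);
Individual elements are then accessed as skills[0], phone['home'] and address.city.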
-- Rank rows by salary within each department and keep those ranked 4th or lower:
select *
from (
  select s.*, rank() over (partition by Depart order by salary desc) as rnk
  from sampletable s
) a
where a.rnk >= 4;
How to pass parameter in hive script file?
-- sample.hql
use octdb;
select * from ${hiveconf:table} where year = '${hiveconf:year}';
[cloudera@quickstart ~]$ hive -hiveconf table=employee -hiveconf year=2016 -f sample.hql
-- Settings to let Hive read input data recursively from sub-directories:
set hive.input.dir.recursive=true;
set hive.supports.subdirectories=true;
set mapreduce.input.fileinputformat.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
-- Use Spark instead of MapReduce as the execution engine:
set hive.execution.engine=spark;
SELECT * FROM ${hivevar:schema}.${hivevar:tablename} WHERE total_emp > ${hivevar:no_of_employees};
hiveparam.txt:
set hivevar:schema=bdp;
set hivevar:tablename=infostore;
set hivevar:no_of_employees=5000;
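The parameter file can then be supplied as an initialisation file when running the query, e.g. (the query file name select_query.hql is only an illustration):
hive -i hiveparam.txt -f select_query.hql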