HBASE

HBase is an open-source, non-relational, distributed database modelled after Google’s BigTable and written in Java. It is developed by the Apache Software Foundation and is part of the Apache Hadoop project. HBase runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database is not an RDBMS supporting SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database.

HBase Architecture: HBase Data Model
As we know, HBase is a column-oriented NoSQL database. Although it looks similar to a relational database in that it contains rows and columns, it is not a relational database. Relational databases are row-oriented, while HBase is column-oriented. So, let us first understand the difference between column-oriented and row-oriented databases:
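The difference can be sketched in a few lines of plain Python (this is an illustration of the two storage layouts, not an HBase API; the sample records are hypothetical):

```python
# Hypothetical sample records, chosen only for illustration.
records = [
    {"empno": "7369", "ename": "SMITH", "sal": "800"},
    {"empno": "7499", "ename": "ALLEN", "sal": "1600"},
]

# Row-oriented (RDBMS-like): each row's values are stored together.
row_store = [tuple(r.values()) for r in records]

# Column-oriented (HBase-like): each column's values are stored together,
# so reading one column does not touch the others.
col_store = {col: [r[col] for r in records] for col in records[0]}

print(col_store["sal"])  # reads only the 'sal' column: ['800', '1600']
```

A query that aggregates a single column touches only that column's data in the column-oriented layout, while the row-oriented layout must read every row in full.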


HBase Architecture: Region
A region contains all the rows between the start key and the end key assigned to that region. HBase tables can be divided into a number of regions in such a way that all the columns of a column family are stored in one region. Each region stores its rows in sorted order.

Many regions are assigned to a Region Server, which is responsible for handling and managing read and write operations on that set of regions.

So, concluding in a simpler way:

  • A table can be divided into a number of regions. A region is a sorted range of rows storing data between a start key and an end key.
  • A region has a default size of 256 MB in older HBase releases (newer releases default to a larger size), and this can be configured according to need.
  • A group of regions is served to clients by a Region Server.
  • A Region Server can serve approximately 1,000 regions.
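The routing of a row key to its region can be sketched in plain Python (this is an illustration of the start-key/end-key idea, not HBase internals; the split points are hypothetical):

```python
import bisect

# Each region owns the key range [its start key, the next region's start key).
# Region 0: [-inf, "7500"), region 1: ["7500", "7800"), region 2: ["7800", +inf)
region_start_keys = ["", "7500", "7800"]

def region_for(row_key):
    # bisect_right finds the last region whose start key <= row_key
    return bisect.bisect_right(region_start_keys, row_key) - 1

print(region_for("7369"))  # falls in region 0
print(region_for("7654"))  # falls in region 1
print(region_for("7934"))  # falls in region 2
```

Because rows within a region are sorted, locating a row is a matter of finding the right region and then the right position inside it.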

Now, starting from the top of the hierarchy, I would first like to explain the HMaster server, which acts similarly to the NameNode in HDFS. Then, moving down the hierarchy, I will take you through ZooKeeper and the Region Server.

Basics of HBase:
The following keywords are required to gain an understanding of the subjects that form the core foundation of HBase:
  • Rowkey
  • Column Family
  • Column
  • Timestamp 
Introduction to HBase
  • Tables: Data is stored in tables in HBase, but here the tables are in a column-oriented format.
  • Row Key: Row keys are used to search records, which makes searches fast.
  • Column Families: Various columns are combined into a column family. The columns of a family are stored together, which makes searching faster because data belonging to the same column family can be accessed together in a single seek.
  • Column Qualifiers: Each column’s name is known as its column qualifier.
  • Cell: Data is stored in cells. Each cell is uniquely identified by its row key and column qualifier.
  • Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
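Putting these keywords together, a value in HBase is addressed by row key, column family:qualifier, and timestamp. A minimal sketch in plain Python dicts (not the HBase client API; the row key, family, and values are hypothetical):

```python
# table -> row key -> "family:qualifier" -> timestamp -> value
table = {
    "7369": {                        # row key
        "cf:ename": {                # column family 'cf', qualifier 'ename'
            1000: "SMITH",           # older version
            2000: "SMITH JR",        # newer version
        },
    },
}

def get_latest(table, row_key, column):
    versions = table[row_key][column]
    return versions[max(versions)]   # reads return the newest timestamp by default

print(get_latest(table, "7369", "cf:ename"))  # SMITH JR
```

This is why timestamps make version lookups easy: older values are kept under their own timestamps rather than overwritten in place.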
To Start the HBase Shell

hbase shell

HBase Table Management Commands 

To create a table, use the create command:

create '<table name>','<column family>'

list :- Lists the tables in HBase.

describe :- Prints the schema of a table. It gives information about the table name with its column families, associated filters, versions and some more details.
 

Import CSV data into HBase

Sample data
empno    ename    designation    manager    hire_date    sal    deptno
7369    SMITH    CLERK    7902    12/17/1980    800    20
7499    ALLEN    SALESMAN    7698    2/20/1981    1600    30
7521    WARD    SALESMAN    7698    2/22/1981    1250    30
7566    TURNER    MANAGER    7839    4/2/1981    2975    20
7654    MARTIN    SALESMAN    7698    9/28/1981    1250    30
7698    MILLER    MANAGER    7839    5/1/1981    2850    30
7782    CLARK    MANAGER    7839    6/9/1981    2450    10
7788    SCOTT    ANALYST    7566    12/9/1982    3000    20
7839    KING    PRESIDENT    NULL    11/17/1981    5000    10
7844    TURNER    SALESMAN    7698    9/8/1981    1500    30
7876    ADAMS    CLERK    7788    1/12/1983    1100    20
7900    JAMES    CLERK    7698    12/3/1981    950    30
7902    FORD    ANALYST    7566    12/3/1981    3000    20
7934    MILLER    CLERK    7782    1/23/1982    1300    10

hadoop fs -put /Desktop/data/hbase/*.csv  /user/hbase/data

Create HBase Table
create 'emp_data',{NAME => 'cf'}
Load data into HBase


cd /usr/hdp/2.4.0.0-169/hbase

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno' emp_data /user/hbase/data/emp_data.csv

Once we submit the job, a MapReduce job will get started. Let’s understand each argument in more details:

-Dimporttsv.separator – Specify the delimiter of the source file

-Dimporttsv.columns – Specifies the column names. Here, if you observe, we have not mentioned empno: the row key will hold the empno value. The row key column must be identified using the all-caps HBASE_ROW_KEY string; otherwise, the job won’t start.
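The effect of the columns mapping can be sketched in plain Python (an illustration of how one CSV line becomes a row key plus cells, not the ImportTsv tool itself; the sample line is taken from the data above):

```python
import csv
import io

# The same mapping passed via -Dimporttsv.columns
columns = "HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno".split(",")
line = "7369,SMITH,CLERK,7902,12/17/1980,800,20"

values = next(csv.reader(io.StringIO(line)))
row_key = values[columns.index("HBASE_ROW_KEY")]          # empno becomes the row key
cells = {c: v for c, v in zip(columns, values)
         if c != "HBASE_ROW_KEY"}                          # the rest become cf:* cells

print(row_key)             # 7369
print(cells["cf:ename"])   # SMITH
```

Each field is paired positionally with its column name, which is why the order in -Dimporttsv.columns must match the order of fields in the file.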

Read data from HBase

scan 'emp_data'
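One point worth knowing before reading the scan output: HBase returns rows sorted lexicographically by row key bytes, not numerically. A quick plain-Python illustration (the "10000" key is a hypothetical extra, added only to show the effect):

```python
# Row keys stored as strings sort byte-by-byte, so "10000" sorts before "7369".
row_keys = ["7934", "7369", "7566", "10000"]
print(sorted(row_keys))  # ['10000', '7369', '7566', '7934']
```

This is why numeric row keys are often zero-padded to a fixed width when scan order matters.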
