HBase is an open-source, non-relational, distributed database modelled after Google's Bigtable and written in Java. It is developed by the Apache Software Foundation and is part of the Apache Hadoop project. HBase runs on top of HDFS (the Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database is not an RDBMS with SQL as its primary access language; there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed one.
HBase Architecture: HBase Data Model
As we know, HBase is a column-oriented NoSQL database. Although it looks similar to a relational database, which contains rows and columns, it is not a relational database: relational databases are row-oriented while HBase is column-oriented. So, let us first understand the difference between column-oriented and row-oriented databases:
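As a minimal sketch (using a few rows from the sample data later in this post), a row-oriented database stores each record contiguously on disk:

(7369, SMITH, 800), (7499, ALLEN, 1600), (7521, WARD, 1250), ...

A column-oriented store such as HBase instead groups values column by column:

empno: 7369, 7499, 7521, ...
ename: SMITH, ALLEN, WARD, ...
sal: 800, 1600, 1250, ...

Reading a single column across many rows is therefore cheap in a column-oriented store, whereas a row-oriented store is better at reading whole records at once.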
HBase Architecture: Region
A region contains all the rows between the start key and the end key assigned to that region. HBase tables are divided into a number of regions in such a way that all the columns of a column family are stored in one region. Each region holds its rows in sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing, and executing read and write operations on that set of regions.
So, concluding in a simpler way:
- A table can be divided into a number of regions. A Region is a sorted range of rows storing data between a start key and an end key (see the pre-split sketch after this list).
- A Region has a default maximum size of 256 MB in older HBase releases (newer releases default to a larger value, e.g. 10 GB), which can be configured according to need via hbase.hregion.max.filesize.
- A group of regions is served to clients by a Region Server.
- A Region Server can serve approximately 1,000 regions to clients.
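As a small illustration (the table name and split keys below are made up), the HBase shell lets you pre-split a table into regions at creation time:

create 'test_table', 'cf', SPLITS => ['1000', '2000', '3000']

This creates four regions: rows with keys below '1000', keys in ['1000', '2000'), keys in ['2000', '3000'), and keys from '3000' upward.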
Now, starting from the top of the hierarchy, I would first like to explain the HMaster Server, which plays a role similar to that of the NameNode in HDFS. Then, moving down the hierarchy, I will take you through ZooKeeper and the Region Server.
The following keywords are required to gain an understanding of the concepts that form the core foundation of HBase:
- Rowkey
- Column Family
- Column
- Timestamp
- Tables: In HBase, data is stored in tables, but the tables are laid out in a column-oriented format.
- Row Key: Row keys uniquely identify records and are used to look them up, which makes searches fast.
- Column Families: Related columns are grouped into a column family. The columns of a family are stored together, which makes reads faster because data belonging to the same column family can be accessed in a single seek.
- Column Qualifiers: Each column’s name is known as its column qualifier.
- Cell: Data is stored in cells. Each cell is uniquely identified by the combination of row key, column family, and column qualifier.
- Timestamp: A timestamp is a combination of date and time. Whenever data is written, it is stored with its timestamp, which makes it easy to retrieve a particular version of the data; the shell sketch after this list illustrates this.
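To make these terms concrete, here is a small shell sketch (the table name and values are made up) that writes two versions of one cell and reads them back by row key and column family:qualifier:

create 'demo', {NAME => 'cf', VERSIONS => 3}
put 'demo', 'row1', 'cf:city', 'Pune'
put 'demo', 'row1', 'cf:city', 'Mumbai'
get 'demo', 'row1', {COLUMN => 'cf:city', VERSIONS => 3}

The get returns both versions of cf:city, each tagged with the timestamp at which it was written.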
To start the interactive HBase shell, run:
hbase shell
HBase Table Management Commands
To create a table, use the create command:
create '<table name>', '<column family>'
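For example, to create a hypothetical table named 'employee' with two column families:

create 'employee', 'personal', 'professional'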
list :- Lists the tables in HBase.
describe :- Prints the schema of a table: the table name, its column families, associated filters, versions, and other details. Both are sketched below.
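A quick sketch of both commands, using the hypothetical 'employee' table from above:

list
describe 'employee'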
Import CSV data into HBase
Sample data (emp_data.csv)
empno ename designation manager hire_date sal deptno
7369 SMITH CLERK 7902 12/17/1980 800 20
7499 ALLEN SALESMAN 7698 2/20/1981 1600 30
7521 WARD SALESMAN 7698 2/22/1981 1250 30
7566 TURNER MANAGER 7839 4/2/1981 2975 20
7654 MARTIN SALESMAN 7698 9/28/1981 1250 30
7698 MILLER MANAGER 7839 5/1/1981 2850 30
7782 CLARK MANAGER 7839 6/9/1981 2450 10
7788 SCOTT ANALYST 7566 12/9/1982 3000 20
7839 KING PRESIDENT NULL 11/17/1981 5000 10
7844 TURNER SALESMAN 7698 9/8/1981 1500 30
7876 ADAMS CLERK 7788 1/12/1983 1100 20
7900 JAMES CLERK 7698 12/3/1981 950 30
7902 FORD ANALYST 7566 12/3/1981 3000 20
7934 MILLER CLERK 7782 1/23/1982 1300 10
Copy the CSV file from the local filesystem into HDFS:
hadoop fs -put /Desktop/data/hbase/*.csv /user/hbase/data
Create HBase Table
create 'emp_data', {NAME => 'cf'}
Load data into HBase
Change to the HBase installation directory (the path below is specific to HDP 2.4.0.0-169):
cd /usr/hdp/2.4.0.0-169/hbase
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno' emp_data /user/hbase/data/emp_data.csv
Once we submit the command, a MapReduce job starts. Let's look at each argument in more detail:
-Dimporttsv.separator – Specifies the delimiter of the source file.
-Dimporttsv.columns – Specifies how the source columns map to HBase columns. Notice that empno is not listed: its value becomes the row key. The row key column must be identified with the all-caps HBASE_ROW_KEY string; otherwise the job will not start.
Read data from HBase
To read back the loaded data, scan the table from the HBase shell:
scan 'emp_data'
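To fetch a single record by its row key (the empno), use get; the exact output formatting varies by HBase version:

get 'emp_data', '7369'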