NoSQL and HBase
Introduction to NoSQL databases and HBase in detail
Challenges with Traditional RDBMS
Not optimised for horizontal scaling (scaling out)
•Data sizes have grown tremendously, into the range of petabytes.
Schema-less data
•The majority of data arrives in a semi-structured or unstructured format.
High velocity of data ingestion
•An RDBMS struggles with high-velocity ingestion because it is designed for steady data retention rather than rapid growth.
Cost
•High licensing cost for data analysis
Features of NoSQL Databases
Generic data model
•Heterogeneous containers, including sets, maps, and arrays
Dynamic type discovery and conversion
•NoSQL analytics systems identify and convert types at runtime, so custom business logic can decide how variations in the data are treated during analysis.
Non-relational and De-normalised
•Data is stored denormalised in a single table rather than spread across multiple tables that must be joined.
Commodity hardware
•Adding more inexpensive servers allows NoSQL databases to scale out and handle more data.
Highly distributed
•Distributed databases can store and process a set of information on more than one device.
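The generic data model and dynamic type discovery described above can be illustrated with a minimal Python sketch. The records, field names, and the `normalise_price` / `normalise_tags` helpers are all hypothetical; they only show how a reader might inspect types at runtime and convert them, not how any particular NoSQL system does it.

```python
from typing import Any, List, Optional

# Hypothetical records: the same logical field arrives in different shapes.
records = [
    {"id": 1, "price": "19.99", "tags": ["a", "b"]},
    {"id": 2, "price": 5, "tags": "c"},
    {"id": 3, "tags": {"d", "e"}},  # no price at all
]

def normalise_price(value: Any) -> Optional[float]:
    """Inspect the runtime type and convert to float where possible."""
    if value is None:
        return None
    if isinstance(value, (int, float, str)):
        return float(value)
    raise TypeError(f"unsupported price type: {type(value).__name__}")

def normalise_tags(value: Any) -> List[str]:
    """Accept a single string or any container of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return sorted(value)

normalised = [
    {
        "id": r["id"],
        "price": normalise_price(r.get("price")),
        "tags": normalise_tags(r.get("tags")),
    }
    for r in records
]
```

The conversion rules live in application code rather than in a fixed schema, which is the point the feature list makes.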
CAP Theorem
•Consistency: the data in the database remains consistent after an operation executes. For example, after an update, all clients see the same data.
•Availability: the system is always on (service is guaranteed to be available), with no downtime.
•Partition Tolerance: the system continues to function even if communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
•Duplicate copies of the same data are maintained on multiple machines.
•This increases availability but decreases consistency.
•If data on one machine changes, the update must propagate to the other machines; until it does, the system is inconsistent, but it will become eventually consistent.
•If duplicate copies of the same data are not maintained, consistency is better but availability decreases.
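The trade-off between replication and consistency can be sketched with a toy model. `ReplicatedStore` and `Replica` below are hypothetical names, and the model assumes a write is acknowledged after reaching only the first copy, with propagation happening later.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class ReplicatedStore:
    """Toy store that keeps duplicate copies and propagates updates lazily."""

    def __init__(self, replica_count=2):
        self.replicas = [Replica(f"r{i}") for i in range(replica_count)]
        self.pending = []  # updates accepted but not yet applied everywhere

    def write(self, key, value):
        # Assumption: the write is acknowledged after reaching one replica.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Apply pending updates, in order, to the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()

store = ReplicatedStore()
store.write("user:1", "alice")
stale = store.read("user:1", 1)  # replica 1 has not seen the write yet
store.propagate()
fresh = store.read("user:1", 1)  # replicas now agree
```

Between `write` and `propagate` the two copies disagree, which is the inconsistent window; after `propagate` the store has become consistent again.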
What is HBase?
•Apache HBase is the Hadoop database: a distributed, column-oriented, scalable big data store.
•Use Apache HBase when you need random, realtime read/write access to your Big Data.
•This project’s goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
•Apache HBase is an open-source, distributed, versioned, non-relational database modelled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al.
•Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Introduction to HBase Architecture
•HBase is composed of three types of servers in a master-slave architecture.
•Region servers serve data for reads and writes.
•The HBase Master process handles region assignment and DDL (create, delete tables) operations.
•ZooKeeper maintains a live cluster state.
•The Hadoop DataNode stores the data that the Region Server is managing.
•All HBase data is stored in HDFS files.
•The NameNode maintains metadata information for all the physical data blocks that comprise the files.
Regions
•HBase Tables are divided horizontally by row key range into “Regions.”
•A region contains all rows in the table between the region’s start key and end key.
•Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes.
•A region server can serve about 1,000 regions.
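A region as a row key range can be modelled in a few lines. This is a sketch only: the `Region` class, the server names, and the key boundaries are hypothetical, and the model treats the start key as inclusive and the end key as exclusive.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Region:
    start_key: str          # inclusive; "" stands for the start of the table
    end_key: Optional[str]  # exclusive; None stands for the end of the table
    server: str             # hypothetical region server name

    def contains(self, row_key: str) -> bool:
        after_start = row_key >= self.start_key
        before_end = self.end_key is None or row_key < self.end_key
        return after_start and before_end

# A hypothetical table divided horizontally into three regions.
regions = [
    Region("", "g", "rs1"),
    Region("g", "p", "rs2"),
    Region("p", None, "rs1"),
]

def find_region(row_key: str) -> Region:
    for region in regions:
        if region.contains(row_key):
            return region
    raise KeyError(row_key)
```

For example, `find_region("apple").server` is `"rs1"` and `find_region("g").server` is `"rs2"`, because every row key falls into exactly one range.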
HBase HMaster
•Region assignment and DDL (create, delete tables) operations are handled by the HBase Master.
A master is responsible for:
•Coordinating the region servers
•Assigning regions on startup
•Re-assigning regions for recovery or load balancing
•Monitoring all RegionServer instances in the cluster (listens for notifications from ZooKeeper)
Admin functions
•Interface for creating, deleting, updating tables
ZooKeeper: The Coordinator
•HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster.
•ZooKeeper tracks which servers are alive and available, and provides server failure notification.
•ZooKeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.
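The recommendation of three or five machines follows from majority-based consensus. A short arithmetic sketch, with hypothetical helper names:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)

summary = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Here `summary` maps 3 to `(2, 1)`, 4 to `(3, 1)`, and 5 to `(3, 2)`. A fourth machine tolerates no more failures than three do, which is why odd ensemble sizes are the usual choice.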
HBase First Read
•There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster.
•ZooKeeper stores the location of the META table.
•The client gets the Region Server that hosts the META table from ZooKeeper.
•The client queries the server hosting the META table to find the Region Server responsible for the row key it wants to access. The client caches this information along with the META table location.
•It then gets the row from the corresponding Region Server.
•For future reads, the client uses the cache to retrieve the META location and previously read row keys.
•Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
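The read path above, including the cache and the re-query after a region moves, can be sketched as a toy simulation. `ToyCluster` and `ToyClient` are hypothetical stand-ins for ZooKeeper, the META table, and the region servers; this is not the HBase client API, and META is simplified to a direct row-key-to-server map.

```python
class ToyCluster:
    """Stands in for ZooKeeper, the META table, and the region servers."""

    def __init__(self):
        self.meta_location = "rs-meta"      # what ZooKeeper would return
        self.meta = {"row1": "rs1"}         # simplified: row key -> server
        self.servers = {"rs1": {"row1": "v1"}, "rs2": {}}
        self.meta_lookups = 0

    def lookup_meta(self, row_key):
        self.meta_lookups += 1
        return self.meta[row_key]

    def move(self, row_key, src, dst):
        # Simulate a region being reassigned to another server.
        self.servers[dst][row_key] = self.servers[src].pop(row_key)
        self.meta[row_key] = dst

class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_location = None
        self.region_cache = {}

    def get(self, row_key):
        if self.meta_location is None:
            # First read: ask "ZooKeeper" where META lives and cache it.
            self.meta_location = self.cluster.meta_location
        server = self.region_cache.get(row_key)
        if server is None:
            server = self.cluster.lookup_meta(row_key)
            self.region_cache[row_key] = server
        rows = self.cluster.servers[server]
        if row_key not in rows:
            # Cache miss because the region moved: re-query and update.
            server = self.cluster.lookup_meta(row_key)
            self.region_cache[row_key] = server
            rows = self.cluster.servers[server]
        return rows[row_key]

cluster = ToyCluster()
client = ToyClient(cluster)
first = client.get("row1")
second = client.get("row1")               # served from the cache
lookups_after_second = cluster.meta_lookups
cluster.move("row1", "rs1", "rs2")
third = client.get("row1")                # stale cache triggers a re-query
```

Only the first read and the read after the move touch META; the second read is answered from the client cache.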
HBase Meta Table
•This META table is an HBase table that keeps a list of all regions in the system.
•The .META. table is structured like a B-tree.
•The .META. table structure is as follows:
Key: region start key, region id
Values: RegionServer
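Because the keys are sorted by region start key, locating the region for a row key amounts to finding the last entry whose start key is less than or equal to that row key. A minimal sketch, assuming hypothetical region ids and server names:

```python
import bisect

# Hypothetical META rows: (region start key, region id) -> region server
meta_rows = {
    ("", 101): "rs1",
    ("g", 102): "rs2",
    ("p", 103): "rs3",
}

sorted_keys = sorted(meta_rows)               # ordered by start key
start_keys = [key[0] for key in sorted_keys]

def locate(row_key: str) -> str:
    """Return the server of the last region whose start key <= row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return meta_rows[sorted_keys[index]]
```

With these rows, `locate("apple")` returns `"rs1"`, `locate("g")` returns `"rs2"`, and `locate("zebra")` returns `"rs3"`. The binary search is what makes the sorted layout behave like a tree lookup.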
Region server components
•A Region Server runs on an HDFS DataNode and has the following components:
WAL
•The Write-Ahead Log (WAL) is a file on the distributed file system. It stores new data that hasn’t yet been persisted to permanent storage and is used for recovery in the case of failure.
BlockCache
•It is the read cache. It stores frequently read data in memory. The least recently used data is evicted when the cache is full.
MemStore
•It is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.
HFiles
•They store the rows as sorted KeyValues on disk.
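How the WAL, MemStore, and HFiles cooperate on a write can be sketched with a toy model. `ToyRegionStore` is hypothetical, the flush threshold counts entries rather than bytes, and clearing the log on flush is a simplifying assumption. The BlockCache is left out to keep the sketch short.

```python
class ToyRegionStore:
    """Toy write path for one column family of one region."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log; stands in for a file on HDFS
        self.memstore = {}   # in-memory write cache
        self.hfiles = []     # each "HFile" is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log first, for recovery
        self.memstore[key] = value     # 2. then update the write cache
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort before writing, so the file holds sorted KeyValues.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # assumption: flushed edits no longer need replay

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def recover(self):
        """Rebuild the MemStore by replaying the log after a crash."""
        self.memstore = {}
        for key, value in self.wal:
            self.memstore[key] = value

store = ToyRegionStore(flush_threshold=3)
for key, value in [("b", "2"), ("a", "1"), ("c", "3")]:
    store.put(key, value)    # the third put triggers a flush
store.put("d", "4")          # lives only in the WAL and the MemStore
store.memstore = {}          # simulate a crash that loses memory
store.recover()              # replaying the WAL restores "d"
```

After this run the single HFile holds `a`, `b`, `c` in sorted order, and `store.get("d")` still returns `"4"` because the unflushed edit was replayed from the log.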
Region Split
•Initially there is one region per table.
•When a region grows too large, it splits into two child regions.
•Both child regions, each representing one half of the original region, are opened in parallel on the same Region Server, and the split is then reported to the HMaster.
•For load balancing, the HMaster may schedule the new regions to be moved to other servers.
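A region split can be sketched as cutting a sorted key range at a split point. The function below is a hypothetical illustration that assumes the split point is simply the middle row key; the way a real cluster chooses its split point may differ.

```python
def split_region(row_keys, start_key, end_key):
    """Split a region's rows at the middle key into two child regions."""
    keys = sorted(row_keys)
    split_key = keys[len(keys) // 2]  # assumption: middle row key
    left = {
        "start": start_key,
        "end": split_key,
        "rows": [k for k in keys if k < split_key],
    }
    right = {
        "start": split_key,
        "end": end_key,
        "rows": [k for k in keys if k >= split_key],
    }
    return left, right

parent_rows = ["a", "c", "e", "g", "i", "k"]
left, right = split_region(parent_rows, "", None)
```

The first child's end key equals the second child's start key, so together the two children cover exactly the parent's key range with no overlap.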
HBase provides the following benefits:
•Strong consistency model: when a write returns, all readers see the same value.
•Scales automatically: regions split when data grows too large, and HDFS is used to spread and replicate data.
•Built-in recovery: uses the Write-Ahead Log (similar to journaling on a file system).
•Integrated with Hadoop: MapReduce on HBase is straightforward.
Accessing HBase
Of the many ways to interact with HBase, the two most popular are:
•HBase interactive shell mode
•Java API
Shell Commands
create
list
•The “list” command displays all the tables that are present or created in HBase.
describe