Sqoop
Sqoop is a data import/export tool in the Hadoop ecosystem.
Introduction
•Allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems.
•Using Sqoop, we can provision data from external systems onto HDFS and populate tables in Hive and HBase.
•Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
•Sqoop uses a connector based architecture which supports plugins that provide connectivity to new external systems.
•The following command is used to import all data from a table called ORDERS from a MySQL database:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password ****
•In this command the various options specified are as follows:
import
•This is the sub-command that instructs Sqoop to initiate an import.
--connect
•The connection string contains the type of RDBMS server, the RDBMS server IP address, port, database name, and other optional arguments.
--username, --password
•These are connection parameters that are used to connect with the database. They are no different from the connection parameters you would use when connecting to the database via a JDBC connection.
--table
•This parameter specifies the table which will be imported.
Architecture
•The import is done in two steps:
•In the first step, Sqoop introspects the database to gather the metadata necessary for the data being imported.
•The second step is a map-only Hadoop job that Sqoop submits to the cluster.
•It is this job that does the actual data transfer, using the metadata captured in the previous step.
Split By
•It is used to specify the column of the table used to generate splits for imports.
•This means that it specifies which column will be used to create the split while importing the data into your cluster.
•It can be used to enhance the import performance by achieving greater parallelism.
•Sqoop creates splits based on the values in the column of the table that the user specifies with --split-by in the import command.
•If --split-by is not specified, the primary key of the input table is used to create the splits.
Reason to use:
•Sometimes the primary key doesn't have an even distribution of values between its minimum and maximum (and the primary key is what is used to create the splits when --split-by is not given).
•In such a situation you can specify another column that has a more even distribution of data to create splits for efficient imports.
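•For example, a minimal sketch of an import that splits on a hypothetical CUSTOMER_ID column instead of the primary key (the column name and credentials are illustrative):
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test -P --split-by CUSTOMER_ID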
Handling Passwords
•Typing your password into the command-line interface is insecure.
•It can be easily retrieved by listing the operating system's running processes.
•You have two options besides specifying the password on the command line with the --password parameter.
•The first option is to use the -P parameter, which instructs Sqoop to read the password from standard input.
•Alternatively, you can save your password in a file and specify the path to this file with the --password-file parameter.
•Store the password file in HDFS with read permission granted only to the owner.
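•A minimal sketch of the file-based approach, assuming a hypothetical HDFS file /user/alice/.mysql.password that already contains the password:
hdfs dfs -chmod 400 /user/alice/.mysql.password
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password-file /user/alice/.mysql.password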
Hive Import
•If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive.
•Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
•If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced.
•After your data is imported into HDFS (or this step is omitted), Sqoop generates a Hive script containing a CREATE TABLE operation that defines your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.
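•A minimal sketch of a Hive import (connection details are as in the earlier example; --hive-overwrite is only needed if the Hive table already exists):
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test -P --hive-import --hive-overwrite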
Incremental Import
•Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously imported set of rows.
•The following arguments control incremental imports:
--check-column
•Specifies the column to be examined when determining which rows to import.
--incremental
•Specifies how Sqoop determines which rows are new. Legal values for the mode include append and lastmodified.
--last-value
•Specifies the maximum value of the check column from the previous import.
•Sqoop supports two types of incremental imports: append and lastmodified.
•You can use the --incremental argument to specify the type of incremental import to perform.
•You should specify append mode when importing a table where new rows are continually being added with increasing row id values.
•You specify the column containing the row's id with --check-column.
•Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
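•A minimal sketch of an append-mode import, assuming a hypothetical auto-incrementing ORDER_ID column and 10000 as the last value imported previously:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test -P --incremental append --check-column ORDER_ID --last-value 10000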
•An alternate table update strategy supported by Sqoop is called lastmodified mode.
•You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp.
•Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
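•A minimal sketch of a lastmodified-mode import, assuming a hypothetical LAST_UPDATE_TS timestamp column (the timestamp value is illustrative):
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test -P --incremental lastmodified --check-column LAST_UPDATE_TS --last-value "2015-01-01 00:00:00"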
Sqoop Job
•Incremental import is a great feature that you're using a lot.
•Remembering the last imported value is a hassle.
•You can take advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse.
•You can create a simple incremental import job.
•Start it with the --exec parameter:
sqoop job --exec <job_name>
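•A job like the one above is defined once with the --create parameter; a minimal sketch, where the job name and the import arguments after the standalone -- are illustrative (the password file is the hypothetical one from the earlier example):
sqoop job --create order_import -- import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password-file /user/alice/.mysql.password --incremental append --check-column ORDER_ID --last-value 0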
•You can list all retained jobs using the --list parameter:
sqoop job --list
•To remove old job definitions that are no longer needed, use the --delete parameter, for example:
sqoop job --delete <job_name>
•You can also view the content of the saved job definitions using the --show parameter, for example:
sqoop job --show <job_name>
Sqoop Export
•The export tool exports a set of files from HDFS back to an RDBMS. The target table must exist in the database.
•The input files are read and parsed into a set of records according to the user-specified delimiters.
•By default, Sqoop will use four tasks in parallel for the export process.
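•A minimal sketch of an export, assuming the ORDERS data sits under the hypothetical HDFS directory /user/alice/orders:
sqoop export --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test -P --export-dir /user/alice/orders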
Stage Tables
•Since Sqoop breaks down the export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database.
•This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others.
•You can overcome this problem by specifying a staging table via the --staging-table option, which acts as an auxiliary table used to stage the exported data.
•The staged data is finally moved to the destination table in a single transaction.
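•A minimal sketch using a staging table, assuming a hypothetical auxiliary table ORDERS_STAGE that has the same structure as ORDERS (--clear-staging-table empties it before the export starts):
sqoop export --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test -P --export-dir /user/alice/orders --staging-table ORDERS_STAGE --clear-staging-table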