I have been working day in and day out with Apache Sqoop lately, and also, it’s been a while since I’ve imparted some knowledge to the world. 😜
So, here’s a tutorial on Sqoop Import and Export. Expect this to be a very basic but short and sweet tutorial as even I’m a beginner at Apache Sqoop.
Let’s start with WHY Sqoop?
Sqoop is used whenever there is a need to transfer enormous amounts of data across Hadoop & Relational/Structured databases like Oracle or MySQL. You can rely on Sqoop to do this job for you efficiently.
Sqoop import, as the name implies, is used to transfer data from relational databases into the Hadoop file system (HDFS), and Sqoop export does the opposite, i.e., it moves data from Hadoop out to external databases (RDBMS).
Sqoop has many tools, two of which are import and export. Sqoop uses MapReduce to import and export the data, which gives you parallel operation as well as fault tolerance.
Let’s go for a basic Sqoop import job; there are two basic variants of it, and here is the first:
## specify the tool being used, followed by its arguments
sqoop import \
-Dmapred.job.queue.name="<queue name>" \
-Dmapreduce.map.memory.mb=1024 \
--connect <JDBC URL> \
--username "<username>" \
--password "<password>" \
--table <rdbms table name> \
--hcatalog-database <hive database name> \
--hcatalog-table <hive table name> \
--hive-partition-key partition_date \
--hive-partition-value <date> \
--hcatalog-storage-stanza "<specify storage format>" \
--num-mappers <number of mappers>
Here, import is the tool that imports a database table into HDFS. -Dmapred.job.queue.name is used to specify which queue you want your Sqoop job to run in; if you don’t specify one, it’ll go to the default queue.
-Dmapreduce.map.memory.mb is used to specify the memory (in MB) given to each mapper; you can increase or decrease mapper memory using this argument.
--connect is the tool-specific argument to connect to the database; it takes a JDBC connect string.
--username is the argument to set the authentication user name.
--password is the argument to set the authentication password.
Note: If you’d rather not pass the password on the command line, you can use the --password-file argument and pass the path to a password file stored in HDFS.
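For instance, the connection part of an import might look like this, assuming the HCatalog table already exists; the host, database, user, HDFS path, and table names below are all hypothetical:
sqoop import \
--connect "jdbc:mysql://dbhost.example.com:3306/sales" \
--username "etl_user" \
--password-file /user/etl_user/.sqoop-password \
--table CUSTOMERS \
--hcatalog-database sales_db \
--hcatalog-table customers \
--num-mappers 1
The password file should contain only the password itself; it’s a good idea to keep its HDFS permissions restricted to your own user.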
--table is the argument to specify which table to read.
--hcatalog-database: here, HCatalog is a table storage and management service for Hadoop. This argument specifies the database name of the table we are importing data into.
--hcatalog-table is the argument to specify the table name we are importing data into.
--hive-partition-key is used to specify the partitioning column of the table.
--hive-partition-value is used to specify which partition we are going to import the data into.
--hcatalog-storage-stanza is used to specify the storage format of the table, for example "stored as orcfile".
--num-mappers argument is used to specify the number of mappers the job will be using.
Mappers are nothing but the parallel processes Sqoop will use to import the data.
Note: Make sure your HCatalog table has all the columns present in the source table, or else your Sqoop job will throw an error. Also, the partition column must be of ‘string’ type.
If you haven’t created a landing table for your import job, you can also add the ‘--create-hcatalog-table’ argument so that Sqoop creates the table (named by --hcatalog-database and --hcatalog-table) on the go; the analogous flag for plain Hive imports is ‘--create-hive-table’.
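As a rough sketch of that first-run scenario (all names, the date, and the connection details are hypothetical), an import that creates the HCatalog table and loads a specific partition might look like:
sqoop import \
--connect "jdbc:mysql://dbhost.example.com:3306/sales" \
--username "etl_user" \
--password-file /user/etl_user/.sqoop-password \
--table ORDERS \
--hcatalog-database sales_db \
--hcatalog-table orders \
--create-hcatalog-table \
--hive-partition-key partition_date \
--hive-partition-value "2021-01-01" \
--hcatalog-storage-stanza "stored as orcfile" \
--num-mappers 4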
Wait! What if I do not want the whole table but only some columns from the source table?
Here’s how to do it. The only addition to the existing arguments is
--columns, the tool-specific argument that specifies which columns from the source table you want to import into your table.
## specify the tool being used, followed by its arguments
sqoop import \
-Dmapred.job.queue.name="<queue name>" \
-Dmapreduce.map.memory.mb=1024 \
--connect <JDBC URL> \
--username "<username>" \
--password "<password>" \
--table <source table name> \
--hcatalog-database <hive database name> \
--hcatalog-table <hive table name> \
--columns "col1,col2,col3..." \
--hive-partition-key partition_date \
--hive-partition-value <date> \
--hcatalog-storage-stanza "<specify storage format>" \
--num-mappers 1
Most of the arguments remain the same for Sqoop export, with a few additions:
## specify the tool being used, followed by its arguments
sqoop export \
-Dmapred.job.queue.name="<queue name>" \
-Dmapreduce.map.memory.mb=1024 \
-Dsqoop.export.records.per.statement=<records per statement> \
-Dsqoop.export.statements.per.transaction=<number of statements per transaction> \
--connect <JDBC URL> \
--username "<username>" \
--password "<password>" \
--table <destination schema.destination table> \
--hcatalog-database <database name> \
--hcatalog-table <hive table name> \
--hcatalog-partition-keys partition_date \
--hcatalog-partition-values <partition from which to export> \
--num-mappers 1 \
--update-key <primary key> \
--update-mode allowinsert
Here, -Dsqoop.export.records.per.statement sets the number of rows Sqoop batches into a single INSERT statement sent to the RDBMS; if you assign 100 to it, you are asking Sqoop to export 100 rows per statement.
-Dsqoop.export.statements.per.transaction is used to define how many such statements are executed per transaction before a commit.
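For example (illustrative numbers only), with -Dsqoop.export.records.per.statement=100 and -Dsqoop.export.statements.per.transaction=100, each transaction would insert up to 100 × 100 = 10,000 rows before committing.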
--update-key is used to update records in the destination table based on the column you pass; here, it’s the primary key.
--update-mode allowinsert allows new records to be inserted into the destination table if they do not exist there yet.
Note: If you do not set --update-mode to allowinsert along with --update-key, no new records will be exported to the destination table; only existing rows will be updated based on the primary key (the default mode is updateonly).
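As a minimal sketch of such an upsert-style export (the connection details, table names, and the customer_id key column are all hypothetical, and not every connector supports allowinsert), the relevant pieces might look like:
sqoop export \
--connect "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1" \
--username "etl_user" \
--password-file /user/etl_user/.sqoop-password \
--table SALES.CUSTOMERS \
--hcatalog-database sales_db \
--hcatalog-table customers \
--update-key customer_id \
--update-mode allowinsert \
--num-mappers 1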
That’s all for the tutorial. Happy Reading!