How to Install Sqoop on Amazon Elastic Map Reduce (EMR)

It is possible to install Sqoop on Amazon EMR. You can use Sqoop to import and export data from a relational database such as MySQL. Here’s how I did it with MySQL. If you are using a different database, you’ll probably need a different JDBC connector for that database.

I’m using Amazon’s Hadoop version 0.20.205, which, I think, was the default. You can see all supported versions of Amazon’s Hadoop here:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/usingemr_config_supportedversions.html

I downloaded sqoop-1.4.1-incubating__hadoop-0.20.tar.gz from here: http://www.apache.org/dyn/closer.cgi/sqoop/

I downloaded mysql-connector-java-5.1.19.tar.gz from here: http://www.mysql.com/downloads/connector/j/

Once I downloaded these two tar.gz files, I uploaded them to an S3 bucket. I also put this script below in the S3 bucket. Make sure to replace <BUCKET_NAME> with your own bucket name.

#!/bin/bash
#Install Sqoop - s3://<BUCKET_NAME>/install_sqoop.sh
cd
hadoop fs -copyToLocal s3://<BUCKET_NAME>/sqoop-1.4.1-incubating__hadoop-0.20.tar.gz sqoop-1.4.1-incubating__hadoop-0.20.tar.gz
tar -xzf sqoop-1.4.1-incubating__hadoop-0.20.tar.gz
hadoop fs -copyToLocal s3://<BUCKET_NAME>/mysql-connector-java-5.1.19.tar.gz mysql-connector-java-5.1.19.tar.gz
tar -xzf mysql-connector-java-5.1.19.tar.gz
cp mysql-connector-java-5.1.19/mysql-connector-java-5.1.19-bin.jar sqoop-1.4.1-incubating__hadoop-0.20/lib/

After I started a job flow, I added this script as a step to the job flow. You can do this via the API, or the CLI like this:

./elastic-mapreduce -j <JOBFLOW_ID> --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --arg s3://<BUCKET_NAME>/install_sqoop.sh

Once the step completes, you can run sqoop imports and exports. Here’s an example of a sqoop export:

./sqoop-1.4.1-incubating__hadoop-0.20/bin/sqoop export --connect jdbc:mysql://<MYSQL_HOST>/<DATABASE_NAME> --table <TABLE_NAME> --export-dir <HDFS_PATH> --fields-terminated-by , --input-null-non-string '\\N' --username <USERNAME> --password <PASSWORD>

Hope that helped. Let me know if you have any questions.

14 Responses to “How to Install Sqoop on Amazon Elastic Map Reduce (EMR)”

  1. Bhavesh Shah Says:

    Hello,
    I have followed followed your link and it qorks for me. Sqoop gets installed on Amazon EMR. But there is one problem, when I run the sqoop-import query I am getting errors as:

    12/04/27 09:29:16 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/45d0bca50cb78c50c20acf18fcd64f90/QualityMeasureComputation.jar
    Exception in thread “main” java.lang.NoSuchMethodError: com.cloudera.sqoop.manager.ImportJobContext.setConnManager(Lcom/cloudera/sqoop/manager/ConnManager;)V
    at com.microsoft.sqoop.SqlServer.MSSQLServerManager.importTable(MSSQLServerManager.java:142)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)

  2. sumit Says:

    We have tried the all the steps to install SQOOP for MS SQL db connectivity but giving error while importing data from MS SQL 2008 R2 db to HDFS.
    Error:
    Exception in thread “main” java.lang.NoSuchMethodError: com.cloudera.sqoop.manager.ImportJobContext.setConnManager(Lcom/cloudera/sqoop/manager/ConnManager;)V
    at com.microsoft.sqoop.SqlServer.MSSQLServerManager.importTable(MSSQLServerManager.java:142)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)

    we are using sqoop-1.4.1

    It worked for only one time, now it is not. Please let us how to resolve above error..

  3. Kyle Mulka Says:

    Bhavesh, Sumit, my instructions will probably only work for MySQL. You will probably need to install a JDBC driver for MS SQL to use MS SQL.

  4. BigDataBlog » Some Links and more on Sqoop installation Says:

    […] to some other great blogs – if any of them work better than my instructions let me know http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/ https://ccp.cloudera.com/display/CDH4B2/Sqoop+Installation […]

  5. Brendan Benson Says:

    This is a huge help. It sucks that this has to be such a hacky process through AWS. Thanks for the example.

    One typo: in the script runner command, it should be –args, not –arg.

  6. meble Says:

    helpfull information’s about how to install sqoop thx!:)

  7. ryan Says:

    Got any examples of how to take data from Hive table and make a CSV file on HDFS?

    Also, have you tried the ‘–num-mappers’ param to sqoop (http://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html#_syntax_3)? Wondering if/how much that speeds things up. I’m going to be trying an export from hive into a mysql table and doing an update on a multi-col key (via –update-key field1,field2,field3) where my original source is a dynamodb table. Using EMR to aggregate metrics and push the aggregation numbers into a mysql table.

  8. Siyin Says:

    Hi,

    I followed your steps exactly and got the following error running sqoop:

    Exception in thread “main” java.lang.NoClassDefFoundError: com/cloudera/sqoop/Sqoop
    Caused by: java.lang.ClassNotFoundException: com.cloudera.sqoop.Sqoop
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    Could not find the main class: com.cloudera.sqoop.Sqoop. Program will exit.

    Any ideas on why it’s looking for Cloudera’s Sqoop library and how do I get past this? Thanks.

  9. ranjith Says:

    The last sqoop import and export statement
    ./sqoop-1.4.1-incubating__hadoop-0.20/bin/sqoop export –connect jdbc:mysql:/// –table –export-dir –fields-terminated-by , –input-null-non-string ‘\\N’ –username –password

    Is this a part of the emr job flow ? Can we add this as a step to the job flow that which runs the export of data from hive to mysql on the emr instance? Or do I need to run this from terminal?

  10. Rupinder Says:

    Thanks. This was helpful.
    Is it possible to configure the script to install Sqoop as a bootstrap action in EMR Console ?

  11. Rupinder Says:

    I just checked and successfully installed sqoop as a bootstrap action in EMR.

  12. Justin Says:

    This worked for me. One obstacles I had was that I wrote my .sh file on Windows so the “End Of Line” format was incorrect.

    You can use Notepad++ to fix this, go to Edit > EOL Conversion > UNIX Format

    Thanks Kyle.

  13. Sqoop via an ssh tunnel on Amazon EMR « Getting there is 99.5% the fun Says:

    […] turns out other people have solved this problem, but here’s our solution for posterity. Note that it’s […]

  14. SN Says:

    I tried these steps to install SQOOP but I get the following error message. Can someone please tell me why?

    Exception in thread “main” java.lang.RuntimeException: java.io.IOException: Cannot run program “/mnt/var/lib/hadoop/steps/s-2EZU086RJ5ABY/./install-sqoop-SN.sh” (in directory “.”): error=2, No such file or directory
    at com.amazon.elasticmapreduce.scriptrunner.ProcessRunner.exec(ProcessRunner.java:143)
    at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.main(ScriptRunner.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    Caused by: java.io.IOException: Cannot run program “/mnt/var/lib/hadoop/steps/s-2EZU086RJ5ABY/./install-sqoop-SN.sh” (in directory “.”): error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at com.amazon.elasticmapreduce.scriptrunner.ProcessRunner.exec(ProcessRunner.java:96)
    … 6 more
    Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.(UNIXProcess.java:186)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
    … 7 more

Leave a Reply