Archive for April, 2012

DynamoDB, Explained

Monday, April 30th, 2012

What is DynamoDB?

DynamoDB is a NoSQL database service offered by Amazon Web Services. It is designed to scale seamlessly with both the amount of data and the read/write request volume. You tell it how many writes per second and how many reads per second you want to be able to handle, and it takes care of partitioning your data across the required number of machines.

It is a key-value store, meaning that the primary way of putting and getting data is by the primary index. There are no secondary indexes (yet?). The primary index is the main key, which can be either a single hash key, or a hash key plus a range key. The hash key is what DynamoDB uses to partition your data across machines. Because of this, you should make sure that the read/write request volume is evenly distributed across different hash keys. If one hash key gets a lot of writes, all of those writes go to the same partition and use up that partition's write throughput, even if you have spare writes per second available in other partitions.
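To make the hot-key problem concrete, here is a minimal sketch in Python of hash-based partitioning. The partition count and hash function are made up for illustration; they are not DynamoDB's actual internals.

```python
# Sketch: how a hash key determines which partition an item lands on.
# NUM_PARTITIONS and the md5-based mapping are illustrative assumptions.
from collections import Counter
from hashlib import md5

NUM_PARTITIONS = 4

def partition_for(hash_key):
    """Map a hash key to one of NUM_PARTITIONS partitions."""
    digest = md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Evenly distributed hash keys spread writes across all partitions...
even_load = Counter(partition_for("user-%d" % i) for i in range(1000))

# ...but a single hot hash key sends every write to one partition,
# no matter how much throughput the other partitions have to spare.
hot_load = Counter(partition_for("popular-user") for _ in range(1000))
assert len(hot_load) == 1
```

With 1,000 distinct keys, every partition takes a share of the load; with one hot key, one partition takes all 1,000 writes while the others sit idle.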

In addition to getting items out of DynamoDB by key, there are two other ways to retrieve them: DynamoDB implements scan and query operations. A scan is like a full table scan: every item in the table is examined. You can filter based on attributes in the item, but performance still depends on the total number of items in the table, not the number of items returned. A query retrieves a subset of items from the table based on key: you specify a single hash key and a condition on the range key. Because items with the same hash key are stored together, sorted by range key, all the items a query returns sit next to each other in the table, so query performance depends on how many items are returned, not how many are in the table.
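Here is a rough sketch in Python of why the two differ in cost, modeling the table as an in-memory list sorted by (hash key, range key). The data and helper names are illustrative, not a DynamoDB API.

```python
# Sketch: scan vs. query over a table keyed by (hash_key, range_key).
# The in-memory list stands in for DynamoDB's storage; integer range
# keys and the sample items are illustrative assumptions.
from bisect import bisect_left

# Items stored sorted by (hash_key, range_key), as DynamoDB keeps them.
table = sorted([
    ("user-1", 100, "login"),
    ("user-1", 105, "click"),
    ("user-1", 120, "logout"),
    ("user-2", 101, "login"),
    ("user-2", 130, "click"),
])

def scan(predicate):
    """Scan: examines every item, so cost grows with table size."""
    return [item for item in table if predicate(item)]

def query(hash_key, range_lo, range_hi):
    """Query: jump straight to the contiguous run of matching keys,
    so cost grows with the number of items returned."""
    lo = bisect_left(table, (hash_key, range_lo))
    hi = bisect_left(table, (hash_key, range_hi + 1))  # past the last match
    return table[lo:hi]

# All of user-1's items with range key in [100, 110] — one contiguous slice.
events = query("user-1", 100, 110)
```

The query only touches the slice it returns; the scan walks the whole table even when the filter matches almost nothing.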

Hopefully that helps! Leave a comment if you have questions.

How to Install Sqoop on Amazon Elastic Map Reduce (EMR)

Monday, April 23rd, 2012

It is possible to install Sqoop on Amazon EMR. You can use Sqoop to import and export data from a relational database such as MySQL. Here’s how I did it with MySQL. If you are using a different database, you’ll probably need a different JDBC connector for that database.

I’m using Amazon’s Hadoop version 0.20.205, which, I think, was the default. You can see all supported versions of Amazon’s Hadoop here:

I downloaded sqoop-1.4.1-incubating__hadoop-0.20.tar.gz from here:

I downloaded mysql-connector-java-5.1.19.tar.gz from here:

Once I had downloaded these two tar.gz files, I uploaded them to an S3 bucket, along with the script below. Make sure to replace <BUCKET_NAME> with your own bucket name.

#Install Sqoop - s3://<BUCKET_NAME>/
# Pull the Sqoop tarball down from S3 and unpack it
hadoop fs -copyToLocal s3://<BUCKET_NAME>/sqoop-1.4.1-incubating__hadoop-0.20.tar.gz sqoop-1.4.1-incubating__hadoop-0.20.tar.gz
tar -xzf sqoop-1.4.1-incubating__hadoop-0.20.tar.gz
# Pull the MySQL JDBC connector down from S3 and unpack it
hadoop fs -copyToLocal s3://<BUCKET_NAME>/mysql-connector-java-5.1.19.tar.gz mysql-connector-java-5.1.19.tar.gz
tar -xzf mysql-connector-java-5.1.19.tar.gz
# Drop the connector jar into Sqoop's lib directory so Sqoop can talk to MySQL
cp mysql-connector-java-5.1.19/mysql-connector-java-5.1.19-bin.jar sqoop-1.4.1-incubating__hadoop-0.20/lib/

After I started a job flow, I added this script as a step to the job flow. You can do this via the API, or the CLI like this:

./elastic-mapreduce -j <JOBFLOW_ID> --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --arg s3://<BUCKET_NAME>/

Once the step completes, you can run Sqoop imports and exports. Here's an example of a Sqoop export:

./sqoop-1.4.1-incubating__hadoop-0.20/bin/sqoop export --connect jdbc:mysql://<MYSQL_HOST>/<DATABASE_NAME> --table <TABLE_NAME> --export-dir <HDFS_PATH> --fields-terminated-by , --input-null-non-string '\\N' --username <USERNAME> --password <PASSWORD>

Hope that helped. Let me know if you have any questions.

Script to delete all tables in Hive

Wednesday, April 18th, 2012

hive -e 'show tables' | xargs -I '{}' hive -e 'drop table {}'