DynamoDB, Explained

April 30th, 2012

What is DynamoDB?

DynamoDB is a NoSQL database service offered by Amazon Web Services. It is designed to seamlessly scale in terms of the amount of data and the read/write request volume. You tell it how many writes per second and how many reads per second you want to be able to handle, and it takes care of partitioning your data across the required amount of hardware.

It is a key-value store, meaning that the primary way of putting and getting data is by the primary index. There are no secondary indexes (yet?). The primary index is the main key, which can be either a single hash key or a hash key and a range key. The hash key is what DynamoDB uses to partition your data across machines. Because of this, you should make sure that the read/write request volume is evenly distributed across different hash keys. If one hash key gets a lot of writes, all of those writes go to the same partition and use up that partition’s write throughput, even if you have more writes per second available in other partitions.
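To make the hot-key problem concrete, here is a rough Python sketch. It is not how DynamoDB works internally: the partition count, the MD5-based `partition_for` function, and the key names are all assumptions made up for illustration. It only shows that evenly spread hash keys fan out across partitions, while a single hot hash key sends every write to one partition.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; DynamoDB does not expose this number


def partition_for(hash_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a hash key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def writes_per_partition(hash_keys):
    """Count how many writes land on each partition for a stream of hash keys."""
    return Counter(partition_for(key) for key in hash_keys)


# Evenly spread keys: writes fan out across the partitions.
spread = writes_per_partition(f"user-{i}" for i in range(1000))

# One hot key: every write lands on the same partition.
hot = writes_per_partition("user-42" for _ in range(1000))

print("spread:", dict(spread))
print("hot:   ", dict(hot))
```

In the hot-key case the other partitions sit idle, which is why provisioned throughput on those partitions does not help.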

In addition to getting items out of DynamoDB by using their key, there are two other ways you can get items. DynamoDB implements scan and query functions. The scan is like a full table scan. Every item in the datastore is looked at. You can filter based on attributes in the item, but the performance will still be based on the total number of items in the table, not the number of items returned. Query retrieves a subset of items from the table based on key. You specify a single hash key, and a condition for the range key such that all the range keys returned in the query are next to each other in the table. Query performance is based on how many items are returned, not how many are in the table.
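The difference between scan and query is easier to see with a toy model. The sketch below keeps a made-up table in memory, sorted by (hash key, range key), and counts how many items each operation has to look at. The table contents and the `scan` and `query` functions are my own illustration, not the DynamoDB API.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table: 100 hash keys x 50 range keys, sorted by (hash, range).
TABLE = sorted(
    ((f"user-{u}", ts, {"clicks": (u * ts) % 7})
     for u in range(100)
     for ts in range(50)),
    key=lambda item: (item[0], item[1]),
)
KEYS = [(hash_key, range_key) for hash_key, range_key, _ in TABLE]


def scan(predicate):
    """Look at every item; the work grows with the size of the table."""
    examined = 0
    results = []
    for hash_key, range_key, attrs in TABLE:
        examined += 1
        if predicate(attrs):
            results.append((hash_key, range_key, attrs))
    return results, examined


def query(hash_key, range_low, range_high):
    """Return one hash key's items with range key in [range_low, range_high].

    The matching items are adjacent in the sorted table, so only they are read.
    """
    start = bisect_left(KEYS, (hash_key, range_low))
    end = bisect_right(KEYS, (hash_key, range_high))
    results = TABLE[start:end]
    return results, len(results)


scan_results, scan_examined = scan(lambda attrs: attrs["clicks"] == 0)
query_results, query_examined = query("user-7", 10, 19)

print("scan examined ", scan_examined, "items")
print("query examined", query_examined, "items")
```

The scan touches all 5,000 items no matter how few match the filter, while the query only touches the 10 adjacent items it returns.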

Hopefully that helps! Leave a comment if you have questions.

How to Install Sqoop on Amazon Elastic MapReduce (EMR)

April 23rd, 2012

It is possible to install Sqoop on Amazon EMR. You can use Sqoop to import and export data from a relational database such as MySQL. Here’s how I did it with MySQL. If you are using a different database, you’ll probably need a different JDBC connector for that database.

I’m using Amazon’s Hadoop version 0.20.205, which, I think, was the default. You can see all supported versions of Amazon’s Hadoop here:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/usingemr_config_supportedversions.html

I downloaded sqoop-1.4.1-incubating__hadoop-0.20.tar.gz from here: http://www.apache.org/dyn/closer.cgi/sqoop/

I downloaded mysql-connector-java-5.1.19.tar.gz from here: http://www.mysql.com/downloads/connector/j/

Once I downloaded these two tar.gz files, I uploaded them to an S3 bucket. I also put this script below in the S3 bucket. Make sure to replace <BUCKET_NAME> with your own bucket name.

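#!/bin/bash
# Install Sqoop - s3://<BUCKET_NAME>/install_sqoop.sh
# Stop at the first failed command so a bad download fails the step.
set -e
# Work from the home directory.
cd
# Fetch and unpack Sqoop from S3.
hadoop fs -copyToLocal s3://<BUCKET_NAME>/sqoop-1.4.1-incubating__hadoop-0.20.tar.gz sqoop-1.4.1-incubating__hadoop-0.20.tar.gz
tar -xzf sqoop-1.4.1-incubating__hadoop-0.20.tar.gz
# Fetch and unpack the MySQL JDBC connector from S3.
hadoop fs -copyToLocal s3://<BUCKET_NAME>/mysql-connector-java-5.1.19.tar.gz mysql-connector-java-5.1.19.tar.gz
tar -xzf mysql-connector-java-5.1.19.tar.gz
# Copy the connector jar into Sqoop's lib directory so Sqoop can load it.
cp mysql-connector-java-5.1.19/mysql-connector-java-5.1.19-bin.jar sqoop-1.4.1-incubating__hadoop-0.20/lib/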

After I started a job flow, I added this script as a step to the job flow. You can do this via the API, or the CLI like this:

./elastic-mapreduce -j <JOBFLOW_ID> --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --arg s3://<BUCKET_NAME>/install_sqoop.sh

Once the step completes, you can run sqoop imports and exports. Here’s an example of a sqoop export:

./sqoop-1.4.1-incubating__hadoop-0.20/bin/sqoop export --connect jdbc:mysql://<MYSQL_HOST>/<DATABASE_NAME> --table <TABLE_NAME> --export-dir <HDFS_PATH> --fields-terminated-by , --input-null-non-string '\\N' --username <USERNAME> --password <PASSWORD>

Hope that helped. Let me know if you have any questions.

Script to delete all tables in Hive

April 18th, 2012

hive -e 'show tables' | xargs -I '{}' hive -e 'drop table {}'

What #newtwitter means for Twilk

September 27th, 2010

Welp… #newtwitter is coming out. Part of the “enhancement” of the new Twitter layout is that the space in the middle is larger. They basically took the small right column and doubled its size. They also put a bar across the top of the page and kept the main content in about the same spot. So there is less room for the Twitter background, both on the sides and on the top. This means that the effectiveness of the Twitter background is reduced. It also means that fewer people show up on Twilk backgrounds. I’ve seen a small drop in traffic to Twilk.com, and a few Twilk Pro cancellations, because of this new layout.

What does this mean for Twilk? It means I have to innovate. It means that Twilk is probably going to branch out from doing just Twitter backgrounds. We’ll probably have to have a separate web page to display Twitter followers’ profile photos (and more). If you have any ideas, or want to be updated when we launch these changes, shoot us an email.

How to Scale a Web Application

April 28th, 2010

In my mind there are two scaling patterns that are used to scale a typical web application. One handles the computation requirements, the other handles the storage requirements. Another way to think about this is stateful vs stateless scaling.

If you don’t need to handle any state (storage beyond each web request) in your web application, you can use the stateless scaling approach, which is pretty simple. You get what is called a load balancer and put a bunch of servers behind it. A good load balancer can handle hundreds if not thousands of servers, so you should be good for quite a lot of traffic before you’d need a different strategy such as DNS round robin or multi-homed IPs. Of course, the load balancer here is a single point of failure, so if you are worried about downtime when the load balancer fails, you should look into some other high availability solutions. You can keep adding (and removing) servers behind the load balancer as traffic goes up and down. A good way to do this is with Amazon EC2’s auto-scaling feature.
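As a rough picture of the stateless approach, here is a minimal Python sketch of a round-robin pool. It is only a toy: the `RoundRobinPool` class and the server names are made up, and a real load balancer also does health checks, connection handling, and so on. The point is just that because no server holds state, any server can take any request, and servers can come and go freely.

```python
class RoundRobinPool:
    """Toy stand-in for a load balancer's server pool (illustrative only)."""

    def __init__(self, servers=()):
        self._servers = list(servers)
        self._next = 0  # running count of requests handed out

    def add(self, server):
        """Put another server behind the balancer."""
        self._servers.append(server)

    def remove(self, server):
        """Take a server out of rotation."""
        self._servers.remove(server)

    def pick(self):
        """Choose the server for the next request, cycling through the pool."""
        if not self._servers:
            raise RuntimeError("no servers in the pool")
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server


pool = RoundRobinPool(["web1", "web2"])
first_four = [pool.pick() for _ in range(4)]

pool.add("web3")  # traffic went up, so add a server
next_three = [pool.pick() for _ in range(3)]

print(first_four)
print(sorted(next_three))
```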

If you need to store state in your web application (which is usually the case), you need a different strategy for scaling out the storage. A good strategy here is what is known as partitioning or sharding. The idea is to split up the data onto different servers in some way. What you need is some form of a distributed hash table. The data is typically split based on the primary key, in other words, the identifier that is most often used to access the data. Once you get a large enough set of data, you’ll need a way to split up the data such that when you add or remove a server, you don’t have to shuffle all the data around. For this, I would suggest using a concept known as consistent hashing. If you are just storing files, I’d recommend going with Amazon’s S3, which does this sharding for you, basically infinitely. If you need faster access to a bunch of smaller pieces of data, look at MySQL or try one of the many NoSQL database systems out there, some of which have built-in sharding.
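Here is a minimal Python sketch of consistent hashing, to show why adding a server doesn’t shuffle everything. It is a toy under my own assumptions: the `HashRing` class, the MD5-based placement, the 100 points per server, and the `db1`..`db4` names are all made up for illustration and are not taken from any particular database.

```python
import hashlib
from bisect import bisect_right


def _point(value: str) -> int:
    """Place a string on the ring using a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; server names and replica count are made up."""

    def __init__(self, servers=(), replicas=100):
        self.replicas = replicas
        self._ring = []    # sorted (point, server) pairs
        self._points = []  # just the points, kept for bisect
        for server in servers:
            self.add(server)

    def _rebuild(self):
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    def add(self, server):
        # Each server gets several points so load spreads more evenly.
        for i in range(self.replicas):
            self._ring.append((_point(f"{server}#{i}"), server))
        self._rebuild()

    def remove(self, server):
        self._ring = [(p, s) for p, s in self._ring if s != server]
        self._rebuild()

    def server_for(self, key):
        if not self._ring:
            raise ValueError("no servers in the ring")
        # First point clockwise from the key's position, wrapping around.
        index = bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[index][1]


keys = [f"item-{i}" for i in range(1000)]
ring = HashRing(["db1", "db2", "db3"])
before = {key: ring.server_for(key) for key in keys}

ring.add("db4")
after = {key: ring.server_for(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved, all to db4")
```

Only the keys that now belong to the new server move; everything else stays where it was. With a plain `hash(key) % number_of_servers` scheme, going from three servers to four would instead reassign most of the keys.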

Why I Switched From SVN to Git

January 7th, 2010

A HUGE benefit of git (or any other distributed source code control system) is that the entire repository is stored in each developer’s environment. This means that you automatically have as many backups of the source code as you have developers. If you use a hosted service such as github, this means that even if github loses ALL of your data, you still have all your source code (and revision history) on your own machines.

In a software startup, your source code is like your crown jewels. Losing your source code can be disastrous.

This is primarily why I switched from svn to git, and why you should too.

Umich CAEN Wireless on Snow Leopard

November 11th, 2009

I was just told how to get on to the University of Michigan CAEN Wireless with the VPN client built into Mac OS X Snow Leopard, so I thought I would share.

If there’s anyone actually subscribed to my blog that doesn’t care about this… sorry, I just use it as a dumping ground for information.

Since some of this info is protected, I’ll just refer you to the protected URL that has the info:
https://www.itcom.itd.umich.edu/vpn/software/UM-on-campus-wireless.pcf

Go to System Preferences, Network.
Click the plus sign in the lower left to add a new connection.
Interface: VPN
VPN Type: Cisco IPSec (if this doesn’t show up, try downloading the Cisco VPN client from here)
Server address: <host in the pcf file>
Account name: <your uniqname>
Password: <your regular umich password>
Click “Authentication Settings…”
Shared secret: <grouppwd in the pcf file>
Group Name: <groupname in the pcf file>

My Thoughts on Startup Weekend Redmond

September 2nd, 2009

So, Startup Weekend Redmond happened last weekend. It was hosted by BizSpark on Microsoft’s campus, and heavily branded that way. 14 out of the 15 startups were built using Microsoft technologies [edit] likely because of the $5,000 prize from BizSpark[/edit]. Guess who won the popular vote! The only team that DIDN’T build using Microsoft. They built an iPhone app, a Palm Pre app, and I believe a web app using something other than ASP or Azure (correct me if I’m wrong). Apparently that team was not eligible for the prize money from BizSpark because of that, and the prize was given instead to the #2 team. More info can be found on the TechFlash report about Startup Weekend Redmond.

Microsoft/BizSpark got a lot of bad press as a result. Clint Nelson, one of the guys behind the national Startup Weekend organization, posted a blog entry called Sticking Up for the Big Guy. You might want to read that, since what follows is basically my response to that article.

Startup Weekend is a great concept. It’s a great community building event where people in the same city interested in the same thing (namely building a startup) get together for a weekend and work together. You get to meet new people, and get to know people you’ve already met better. But the fact of the matter is that most teams formed at Startup Weekend don’t continue working together on the startup after the weekend is over. So, saying that “we launched 15 startups that otherwise would not exist” is kind of misleading. It’s not about the startups that are launched that weekend. It’s about the connections made between the people. Hopefully those people will continue the conversation and partner to form their own startups later.

It’s great that Microsoft wants to support the startup community via BizSpark, but I feel that Microsoft is being disingenuous by only giving an award to a startup that uses Microsoft’s technology at Startup Weekend.

If they want to have their own BizSpark Weekend or whatever, that’s fine. They can run it themselves. They have enough money, they have enough people, they have a big enough marketing budget. Microsoft doesn’t need Startup Weekend to run a similar event of their own that is restricted to building on the Microsoft stack.

“Bizspark is absolutely being crucified for giving us the community exactly what we asked for.” Really? You asked them to disqualify anyone not using Microsoft technology?

In the future, please keep prize money out of Startup Weekend. kthxbye

Twilk, a Twitter Background Generator

July 4th, 2009

Twilk is a web service I’ve been working on for a couple of months now. I launched it a few weeks ago at a Twitter conference. It automatically creates a Twitter background made up of the profile photos of the people you follow on Twitter. If you are on Twitter, you have to check it out! I’d love to hear your feedback on the service so that I can continue to iterate and make it better. So, leave a comment, or use the feedback form after using the service. If you’d like an example, check out my Twitter page. The background has a bunch of my friends’ photos on it.

List of Sites Affected by Fisher Plaza Data Center Fire

July 3rd, 2009

I’m keeping a list of the sites that were/are seemingly affected by the Fisher Plaza data center fire last night, sorted by Alexa Traffic Rank. Comment if you have more information. Sites marked with a * appear to be back up. Follow me (@mulka) on Twitter to get notified when I update this list.

http://bing.com/travel 57 *
http://allrecipes.com 871 *
http://bigfishgames.com 1,822 *
http://geocaching.com 4,233 *
http://authorize.net 5,345 *
http://komonews.com 13,306 *
http://dotster.com 27,895 *
http://waymarking.com 38,446 *
http://kcls.org 41,085 *
http://marshillchurch.org 63,317 *
http://ideascale.com 85,951 *
http://adhost.com 180,491 *
http://onlinemetals.com 180,846 *
http://tomsofmaine.com 247,800 *
http://pacsci.org 285,570 *
http://pccnaturalmarkets.com 300,451 *
http://avayausers.com 413,083 *
http://bartelldrugs.com 471,734 *
http://pspinc.com 556,777 *
http://ovaleye.com 3,392,698 *
http://newsdata.com 3,456,473 *
http://tradetech.net 7,113,570 *

Even more in the comments including:

http://www.portentinteractive.com
http://www.momagenda.com
http://www.princesslodges.com
http://www.goiam.org
http://www.ringor.com
http://www.nettica.com/
http://www.motherjones.com
http://www.questionpro.com
http://micropoll.com
http://www.square1books.com