*** This article was originally written by Craig Carl from AWS. Thank you, Craig! We modified it slightly to address issues with spaces in S3 object names and with downloading a large number of files from S3 ***
As you know, the trick to improving S3 performance is parallelism. Any single S3 operation is limited to between ~12 and 17MB/sec, but there is no limit to the number of operations you can run simultaneously. I'll demonstrate how trivial it is to parallelize S3 puts and gets using the GNU Parallel [1,2,3] command.
GNU Parallel is a powerful xargs-like command that reads multiple inputs from stdin and then executes a command for each input. Parallel commands generally take the form of -
cat list | parallel do_something
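For example, this trivial (and purely illustrative) pipeline runs ten echo jobs, one per input line -
seq 1 10 | parallel echo "processing input {}"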
Customers who use the S3 API directly likely have the skill set to build parallelism into their applications already, so these examples use s3cmd [4], a popular CLI for S3.
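If you don't already have s3cmd installed and configured, something along these lines gets you started (the package name and repository vary by distribution - on Amazon Linux it typically comes from EPEL - and 's3cmd --configure' will prompt for your AWS keys) -
sudo yum install -y s3cmd
s3cmd --configure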
I'm getting objects from the Google Books Ngrams [5] public dataset (2.4TB). This dataset has 55 objects of various sizes. All of the instances are Amazon Linux c1.xlarge instances in us-east-1a. I'm downloading the objects to an NFS volume mounted at /mnt/volume.
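For reference, the NFS mount on each instance looks something like this (the server and export path below are placeholders for your own NFS server) -
sudo mkdir -p /mnt/volume
sudo mount -t nfs <your_nfs_server>:/export/volume /mnt/volume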
Using a standard single-threaded get it takes ~48 hours to download the entire bucket. CloudWatch metrics during the download show we never exceed ~16MB/sec -
s3cmd get --recursive s3://datasets.elasticmapreduce/ngrams/books/array/
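If you'd rather watch throughput from the instance itself instead of CloudWatch, a tool like sar (from the sysstat package, not part of the original setup) reports per-interface network rates once per second -
sar -n DEV 1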
The best download speed we could hope to achieve on a single instance is limited by the network interface at ~80MB/sec; parallel makes this easy to reach -
s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/ | awk -F "s3://" '{print "s3://"$2; print "/mnt/volume/"$2}' | parallel -j50 -N2 --progress /usr/bin/s3cmd --no-progress get {1} {2}
Breaking this command apart:
1. 's3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/' lists every object in the bucket.
2. awk -F "s3://" '{print "s3://"$2; print "/mnt/volume/"$2}' prints the S3 object path and the local destination path, one per line (see the sample lines after this list).
3. 'parallel -j50 -N2 --progress' runs parallel with 50 concurrent jobs; '-N2' tells parallel to read two arguments from stdin per job and assign them to {1} and {2}.
4. '/usr/bin/s3cmd --no-progress get {1} {2}' is the command parallel will run; '{1}' is substituted with the S3 object path and '{2}' with the local destination path.
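To make the substitution concrete, the awk stage emits two lines per object - the source and the destination - which parallel consumes in pairs. For an object named like the one below (the exact key is illustrative), the stdin fed to parallel looks like this -
s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/data
/mnt/volume/datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/data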
With this command network utilization sits at ~80MB/sec and the download completes in ~9 hours. This approach works as long as there are multiple objects in the bucket; to reach ~80MB/sec there should be at least 6 objects.
The parallel -j option controls the number of simultaneous jobs parallel will start. By default parallel starts one job per core; -j0 starts one job for every object, up to 255. With smaller objects it may be more efficient to limit the number of jobs to fewer than ~50 to reduce random disk I/O and the overhead of setting up and tearing down S3 connections.
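For example, if the bucket were full of small objects you could cap the job count; the only change from the earlier command is the -j value (20 here is arbitrary) -
s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/ | awk -F "s3://" '{print "s3://"$2; print "/mnt/volume/"$2}' | parallel -j20 -N2 --progress /usr/bin/s3cmd --no-progress get {1} {2}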
Parallel gets really powerful when you spread the download processes across multiple instances. In this example I use parallel to distribute the S3 gets across every instance.
s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/ | awk -F "s3://" '{print "s3://"$2; print "/mnt/volume/"$2}' | parallel -j50 --sshloginfile ssh_hosts -N2 --progress /usr/bin/s3cmd --no-progress get {1} {2}
This command is the same as above with the addition of '--sshloginfile ssh_hosts', which parallel uses to distribute the jobs across every instance. The entire bucket downloads in ~100 minutes!
I set up passwordless ssh across the instances and mounted the NFS volume on each node as /mnt/volume/.
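A minimal sketch of that per-node preparation, reusing the placeholder NFS server from earlier and the same pem file for every host (adjust for however you manage keys and mounts) -
ssh -i your_pem_file.pem ubuntu@<your_host_ip1> 'sudo mkdir -p /mnt/volume && sudo mount -t nfs <your_nfs_server>:/export/volume /mnt/volume'
Repeat for each host that will appear in ssh_hosts.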
You can set up your ssh_hosts file like this:
ssh -i your_pem_file.pem ubuntu@<your_host_ip1>
ssh -i your_pem_file.pem ubuntu@<your_host_ip2>
ssh -i your_pem_file.pem ubuntu@<your_host_ip3>
...
See parallel's manual for more examples of --sshloginfile.
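A quick sanity check before kicking off the distributed download: parallel's --nonall option runs a command once on every login in the file without reading stdin, so this should print each host's name -
parallel --nonall --sshloginfile ssh_hosts hostname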
To the best of my knowledge these tools can't be extended to Windows.
[1] https://savannah.gnu.org/projects/parallel/
[2] http://www.youtube.com/watch?v=OpaiGYxkSuQ
[3] https://en.wikipedia.org/wiki/GNU_parallel
[4] http://s3tools.org/s3cmd
[5] http://aws.amazon.com/datasets/8172056142375670