
Big data recipes

With cloud computing, non-RDBMS data storage and visualization tools.


Set up your own big data sets on an Amazon Web Services (AWS) server

Problem

You want to upload your big data to Amazon Web Services (AWS).

Solution

There are several ways to set up your big data on Amazon Web Services (AWS). You can use a secure FTP connection to upload data to a server's (EC2 instance) elastic block storage (EBS). You can upload data to the simple storage service (S3) and then download it to the server's EBS, using either a standard web tool (e.g. wget) or a native S3 tool (e.g. s3cmd) for the download. Or you can use the AWS Import/Export service, which consists of physically sending a device with your big data sets to Amazon so they can upload them to your server.

How it works

You've assembled all your big data on your workstation or local server; now you need to upload it to a high-powered server to expedite its processing or present it to end users. There are several ways to upload data to an AWS server, depending on the volume you plan to upload.

Note Set up a server on AWS

If you haven't set up a server on AWS, you can find step-by-step instructions in the prior entry: Set up a server and storage on Amazon Web Services (AWS).

Secure FTP connection to the server's EBS

By default, access to a server or EC2 instance is enabled through Secure Shell (SSH). Using SSH gives you access to a server's console, where you can run command line tasks. There is, however, a variant of SSH called SSH File Transfer Protocol (SFTP) for transferring files over an SSH connection. Unlike the classical File Transfer Protocol (FTP) used to move files between networks, SFTP not only cryptographically protects the data transfer, it also doesn't need an FTP server on the receiving end to accept files; it just requires SSH. And since SSH is already enabled on an EC2 instance, transferring big data sets through SFTP requires no additional setup steps.

On your workstation or local server, all you need is an application that supports SFTP to do the upload. Nowadays, practically all applications that support FTP uploads also support SFTP. In case you're unfamiliar with them, I'll mention a few: FileZilla (available for all operating systems), WinSCP (for Windows) and gFTP (for Unix/Linux).

The more elaborate part of doing SFTP connections to an EC2 instance is that authentication is key based. In the prior entry 'Set up a server and storage on AWS', in the section Connect to the server, I described how you don't use regular passwords to access the server or EC2 instance, but rather a key file generated when you set up the server -- this last process is also covered in the same entry under Create a key pair. This means you'll need to configure your FTP/SFTP application to work with key-based authentication instead of the default out-of-the-box password authentication used in practically all FTP/SFTP applications -- a few extra minutes consulting the FTP/SFTP application's documentation should be enough to configure this.
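If you prefer the command line over a graphical SFTP client, the OpenSSH sftp utility accepts the same key file you use for SSH, so there's nothing extra to configure. The following is a minimal sketch run from your workstation, where my-key-pair.pem, the server address and the file and directory names are placeholders you'd substitute with your own values:

# The key file must be readable only by you, otherwise SSH refuses to use it
chmod 400 my-key-pair.pem
# Open an SFTP session to the EC2 instance using key-based authentication
sftp -i my-key-pair.pem ec2-user@<ec2-public-dns>
# At the sftp> prompt, upload a local big data file to a directory on the server
sftp> put bigdataset.csv /data/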

Finally, before even trying to upload a big data set via SFTP to the server, ensure the target directory is in a partition with sufficient free space, otherwise the transfer will fail once space runs out (i.e. minutes or hours after you start the transfer). The process for creating large partitions using EBS is described in the prior entry 'Set up a server and storage on AWS' in the section Add and manage storage on the server.
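A quick way to confirm there's enough room is to check the mounted volumes from the server's console before starting the transfer; in this sketch /data stands in for whatever mount point backs your EBS volume:

# Show size, used and available space for the volume that will receive the upload
[ec2-user@ip-[ip-address] ~]$ df -h /data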

Upload to S3, followed by downloading to the server's EBS

S3 or Simple Storage Service is the AWS solution for storing files from 1 byte up to 5 terabytes. The simplicity of S3 lies in the fact that you can upload and download files from any device with web access, without setting up FTP, SSH or a web server. And since both S3 and EBS are part of the AWS stack, it's fast to transfer data between the two services. Therefore, once a file is uploaded to S3 it can easily be copied to EBS, where the server or EC2 instance has access to it.

Note Why involve S3 in the transfer to EBS? Isn't it easier to use SFTP?

It's easier to use SFTP to place most files in EBS. However, for large files (e.g. 1 terabyte), Internet connectivity between end points can suffer, making the transfer unreliable and error prone. In addition, there's another option to set up big data sets called AWS Import/Export that involves using S3 -- this option is described last.

S3 concepts and uploads

S3 operates on the concepts of buckets and objects. In layman's terms, a bucket is a unique location across the S3 service in which to store objects, and an object is any kind of file. Bucket naming is open so long as the name is not taken by another S3 account user; this uniqueness is required because access to a bucket's contents is done through URLs (e.g. https://s3.amazonaws.com/<bucket_name>/). Objects in turn can be made publicly accessible through a URL like https://s3.amazonaws.com/<bucket_name>/<object_name>.

To create an S3 bucket, inside the 'AWS Management Console' click on the S3 tab at the top, followed by the 'Create Bucket' button in the top-left corner -- this is illustrated in Figure 1. A pop-up window will emerge asking for a bucket name and the region in which to deploy the bucket; introduce a name and select a region accordingly, as illustrated in Figure 2. Once the bucket is created, and with the bucket selected from the S3 bucket list, click on the 'Upload' button in the top-left to initiate a file upload, as illustrated in Figure 3. Another pop-up window will emerge with functionality to select files from your workstation, as illustrated in Figure 4.

Figure 1.- Create S3 bucket
Figure 2.- S3 bucket name & location
Figure 3.- S3 upload
Figure 4.- S3 uploader
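If you'd rather script these console steps, the s3cmd utility covered later in this entry can also create buckets and upload files from the command line. A minimal sketch, assuming s3cmd is already installed and configured as described further below, with placeholder bucket and file names:

# Create a bucket -- the name must be unique across all of S3
s3cmd mb s3://<bucket_name>
# Upload a local file into the bucket
s3cmd put <object_name> s3://<bucket_name>/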

Once a file is inside a bucket, place your mouse over the file's row under the 'Name' column and right-click; from the contextual menu that emerges, select 'Properties', as illustrated in Figure 5. Once selected, at the bottom of the page you'll see a separate pane with more details about the file. Notice the 'Details' tab has a 'Link' value with a URL; this corresponds to the URL from which you can download the file -- this is also illustrated in Figure 5.

By default, files uploaded to S3 are not publicly accessible, even though you'll see a URL reference. To grant access to a file you'll need to modify its access properties (a.k.a. ACL, Access Control List). One option, illustrated in Figure 6, is to select the file, right-click to bring up the contextual menu and then select the 'Make Public' value. The second option, illustrated in Figure 7, is to use the 'Permissions' tab in the bottom pane for a file's properties and introduce 'Everyone' as the grantee with the 'Open/Download' value. If you wish to revoke public access for a file, you unselect the 'Open/Download' value in the file's properties as illustrated in Figure 8, irrespective of which of the two prior methods you used to grant public access.

Figure 5.- S3 file properties
Figure 6.- S3 make public
Figure 7.- S3 permissions
Figure 8.- S3 revoke permissions
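The grant and revoke operations can also be scripted instead of clicked through; the s3cmd utility introduced in the next sections has a setacl command for this. A minimal sketch, assuming s3cmd is installed and configured, with placeholder bucket and object names:

# Grant public read access to an object (equivalent to 'Make Public')
s3cmd setacl --acl-public s3://<bucket_name>/<object_name>
# Revoke public access, leaving the object private again
s3cmd setacl --acl-private s3://<bucket_name>/<object_name>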

Once you have your big data files on S3, you can proceed to download them to the EBS volume from where an EC2 instance is able to access the data. There are several approaches you can take for this process; I'll describe two of them next.

S3 download to EBS using wget

The 'quick and dirty' way to get S3 files into EBS is to treat S3 files as if they were any other kind of file or web resource on the Internet. By doing so, you can use regular download utilities from within the EC2 instance and save files to an attached EBS volume.

One such Unix/Linux utility is wget, which lets you download any type of web resource. After ensuring the S3 file has public access permission, consult the file's public URL on the 'Link' line of the 'Details' tab of the file's 'Properties' -- this is shown in Figure 5. Next, access your server or EC2 instance and invoke the command illustrated in Listing 1:

Listing 1 - Transfer from S3 to EBS using wget
[ec2-user@ip-[ip-address] ~]$ wget https://s3.amazonaws.com/<bucket_name>/<object_name>

--<Date>--  https://s3.amazonaws.com/<bucket_name>/<object_name>
Resolving s3.amazonaws.com... 
Connecting to s3.amazonaws.com||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218947000 (214000K) [application/zip]
Saving to: "<object_name>"

100%[======================================>] 218,947,000     --.-K/s     

<Date> (Speed MB/s) - "<object_name>" saved [218947000/218947000]

It's as simple as that: now you have your big data set from S3 on an EBS volume, ready to be processed by your EC2 instance. The transfer speed should be fast considering the S3 bucket and EBS volume are in the same data center. Next, I'll describe a second option for transferring S3 files onto an EBS volume.

S3 download to EBS using s3cmd

The previous method using wget assumed you didn't know the files were stored on S3. But now let's assume you know this and want to use a tool more attuned to S3 to download files into an EBS volume. Why does this even matter? Well, it could be that you want to download many files at once -- even though wget can handle this -- or simply do an S3 operation that a regular download tool like wget can't do, such as downloading S3 files to EBS and then deleting them from S3 in one operation.

For scenarios like this I'll introduce you to another command line tool called s3cmd. The difference between a tool like wget and s3cmd is that s3cmd is specifically built to interact with S3. It knows how to list S3 files, it can upload files to S3 and it can even delete files from S3 buckets, all from the command line, just as if you did these steps from the 'AWS Management Console'.

The s3cmd utility is available at http://s3tools.org/s3cmd. There are several installation packages depending on your target OS. Listing 2 illustrates the process of installing s3cmd from source, which is applicable to all operating systems.

Listing 2 - Install s3cmd from source
[ec2-user@ip-[ip-address] ~]$ wget http://downloads.sourceforge.net/project/s3tools/s3cmd/1.1.0-beta3/s3cmd-1.1.0-beta3.tar.gz
tar -xzvf s3cmd-1.1.0-beta3.tar.gz
cd s3cmd-1.1.0-beta3
sudo python setup.py install
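
Once the installation finishes, a quick way to confirm s3cmd is on the server's path is to ask it for its version:

[ec2-user@ip-[ip-address] ~]$ s3cmd --version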

Once you install s3cmd, you need to configure it to access your S3 service. This configuration requires that you have your AWS account's Access Key ID and Secret Access Key on hand. These security credentials are not the same as the key pair you created to access your EC2 instance. To consult the Access Key ID and Secret Access Key for your account, go to the 'Main Account' page at https://aws-portal.amazon.com/gp/aws/manageYourAccount and then click on the 'Security Credentials' section, as illustrated in Figure 9.

Figure 9.- AWS account Access Key ID and Secret Access Key

Once you have your AWS Access Key ID and Secret Access Key, you can proceed to configure s3cmd. Listing 3 illustrates the step-by-step process.

Listing 3 - Configure s3cmd
[ec2-user@ip-[ip-address] ~]$ s3cmd --configure

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3
Access Key: XXXXXXXXXXXXXXXXXXXX
Secret Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password: XXXXXXXXX
Path to GPG program [/usr/bin/gpg]: 

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP and can't be used if you're behind a proxy
Use HTTPS protocol [No]: 

On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name: 

New settings:
  Access Key: XXXXXXXXXXXXXXXXXXXX
  Secret Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Encryption password: XXXXXXXXX
  Path to GPG program: /usr/bin/gpg
  Use HTTPS protocol: False
  HTTP Proxy server name: 
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] Y
Please wait, attempting to list all buckets...
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Success. Encryption and decryption worked fine :-)

Save settings? [y/N] y
Configuration saved to '/home/ec2-user/.s3cfg'

With s3cmd configured and verified to connect to your S3 service, you can now bring S3 files into an EBS volume attached to your EC2 instance. In Listing 4 you can find two examples of s3cmd in action: one to list the buckets in your S3 service and the other to copy an S3 bucket's objects into a directory of the EC2 instance.

Listing 4 - List and fetch files from S3 to EBS with s3cmd

[ec2-user@ip-[ip-address] ~]$ s3cmd ls
<Date> s3://<bucket_name>
[ec2-user@ip-[ip-address] ~]$ s3cmd sync s3://<bucket_name> <OS_folder>
s3://<bucket_name>/<object_name> -> ./<object_name>  [1 of 1]
 218947000 of 218947000   100% in    0s  (Speed kB/s)  done
Done. Downloaded 218947000 bytes in (Time), (Speed kB/s)
[ec2-user@ip-[ip-address] ~]$ ls
<object_name>
[ec2-user@ip-[ip-address] ~]$

The s3cmd sync command is the most interesting because it synchronizes the contents of an S3 bucket with a folder on your server. At first it copies all the contents of an S3 bucket to the indicated OS folder, and on subsequent executions it synchronizes only the differences, making it very efficient. In fact, if you swap the values and place the folder first and the S3 bucket second (e.g. s3cmd sync <OS_folder> s3://<bucket_name>), it copies the OS folder's contents into the indicated S3 bucket, so you're effectively copying files from an EBS volume to S3.

s3cmd is a very versatile utility and there are many more options than the ones just described, including skipping files during the syncing process, uploading and deleting files from S3, and recursive copying, among other things. For the full set of options I advise you to consult the s3cmd documentation, as covering them all would go beyond the scope of the main topic.
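To give a flavor of those extra options, here is a hedged sketch of a few of the operations just mentioned; the exclusion pattern, bucket, object and folder names are placeholders:

# Sync a bucket to a local folder, skipping temporary files
[ec2-user@ip-[ip-address] ~]$ s3cmd sync --exclude '*.tmp' s3://<bucket_name> <OS_folder>
# Delete an object from a bucket once it's been processed
[ec2-user@ip-[ip-address] ~]$ s3cmd del s3://<bucket_name>/<object_name>
# Recursively download the entire contents of a bucket
[ec2-user@ip-[ip-address] ~]$ s3cmd get --recursive s3://<bucket_name> <OS_folder>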

Note S3 tools abound due to S3 APIs

In addition to the s3cmd utility, many tools have emerged to interact with S3. This is because AWS provides APIs that allow anyone to build functionality against its services. At the building block level -- if you want to create your own application to interact with S3 -- are the AWS SDKs: AWS SDK for Java, AWS SDK for .NET, AWS SDK for PHP and AWS SDK for Ruby. And among popular S3 utilities are S3 Browser - a Windows client for S3, S3Fox - a Firefox browser plug-in for S3, and Bucket Explorer - a client for S3 available for all operating systems.

AWS Import/Export service

Finally, we come to the last option available to set up your own big data sets on an AWS server: the AWS Import/Export service. The AWS Import/Export service consists of physically sending a device with your big data sets to Amazon so they can upload them to your server. Given they have direct access to their data centers, the transfer speed is as if it were done on a local area network: extremely fast.

This method is convenient if you have extremely large files (e.g. 10 TB) or a large number of files (e.g. thousands) to upload. It's also worth mentioning that with AWS Import/Export you can ask Amazon to send you big data sets produced by your applications on physical media, instead of taking the time to download them yourself -- this last option is the 'Export' part of the service, whereas the 'Import' part consists of you sending them big data sets.

However, as convenient as the AWS Import/Export service is, it does come with drawbacks. For starters, it isn't free: there are individual costs for processing a device and hourly data loading costs. The other drawback is that in certain regions the Import/Export service is only available for importing data into S3 -- see the sidebar for more details. This means that to access the data from your server you'll need to additionally transfer it yourself to an EBS volume attached to your EC2 instance -- as described in a previous section with a tool like wget or s3cmd.

You can consult the AWS Import/Export calculator to get a quote on the cost of processing your specific data volume. And if you require more information, I advise you to look over the main AWS Import/Export page at http://aws.amazon.com/importexport/.

Note AWS Import/Export direct access to EBS for certain data centers

The AWS Import/Export service initially supported only S3 storage. However, this changed recently, and you can now request the AWS Import/Export service directly against EBS. Direct EBS support, though, is still not available at all AWS data centers, whereas Import/Export service for S3 storage is available at all AWS data centers.

 