Scaling PHP Applications (2014)

Case Study: Horizontally scaling user uploads

Similar to cookie handling, another issue that people commonly run into when scaling their application servers is dealing with user uploads. When we talk about user uploads, I have the following requirements in mind:

1. Uploads need to be available immediately

In most cases, users expect their uploaded files to be available immediately. Showing a broken image or an unavailable file is unacceptable in most circumstances (video being the obvious exception, due to the processing time involved). If we move the uploaded file around the system, it needs to be entirely invisible and seamless to the user.

2. Uploads get pushed to long-term storage in the background

Storing the uploaded data on your application servers for more than a few minutes is dangerous and irresponsible. We’ve designed our application servers to scale horizontally and to fail without losing data, so using them for long-term storage violates these principles. Hard drives crash, and running RAID-10 on your application servers isn’t cost effective.

There are plenty of easy long-term storage solutions that can be used depending on your particular needs. We store petabytes of data in Amazon S3, but other options include Rackspace Cloud Files and Edgecast Storage.

If you have enough data to make an IaaS solution too expensive (or just can’t/don’t want to use one), there are “roll-your-own” solutions available like MogileFS and GridFS that are worth exploring.

Remember, a major principle in scaling is to decouple your systems! Application servers should serve your application, not store data.

P.S. Don’t even THINK about storing the uploaded files in your database. Use the right tool for the right job.

3. No single points of failure or centralization (kids, say no to NFS!)

There should be no single points of failure in your file upload system. Failure will happen and it needs to be handled gracefully. Uploads should be processed immediately and not by a centralized system. I avoid NFS like the plague because it’s typically very slow and creates a devastating single point of failure—when it’s not available, we lose the ability to handle ALL uploads. If uploads are core to your business, you know how much of a problem that can be. Facebook has a great (although slightly outdated) article about how they moved off of NFS to their own system called Haystack. It’s great reading material, if only for inspiration.

Don’t upload the file directly to S3 (or any long-term storage)

Most people handle file uploads by uploading the user’s file directly to Amazon S3 in the context of the web request. That is to say, they upload the file to long-term storage in the same web request that the file was originally uploaded in. In embarrassingly simple pseudo-code, it looks something like this:

<?php
if ($_FILES['upload']) {
    upload_to_amazon_s3($_FILES['upload']);
}

Although doing it this way is simple and works for smaller services, there are two major problems.

1. What happens if Amazon S3 is down or running slow? Well, unless there’s a retry system in place, we lose the user’s uploaded file.

2. It significantly increases the duration of the HTTP request, which ties up one of your PHP-FPM processes for a longer period of time. One less PHP-FPM process means one less concurrent connection that your application server can handle. The user now has to wait not only for the file to upload from their computer to your server, but also from your server to Amazon S3 (although this can be mitigated by cutting the user loose early with fastcgi_finish_request(), sketched below, or by using a separate PHP-FPM pool for uploads).
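A minimal sketch of the fastcgi_finish_request() mitigation, assuming the same hypothetical upload_to_amazon_s3() helper as the pseudo-code above. The response is flushed to the browser immediately, but note that the PHP-FPM worker is still occupied while the S3 transfer runs:

<?php
// Hedged sketch: flush the response to the user before pushing to S3.
// fastcgi_finish_request() only exists under PHP-FPM; upload_to_amazon_s3()
// is the hypothetical helper from the pseudo-code above.
if (!empty($_FILES['upload'])) {
    echo "Upload received!";
    fastcgi_finish_request();               // browser gets the response here
    upload_to_amazon_s3($_FILES['upload']); // this worker stays busy during the transfer
}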

Handling user uploads the right way

Since handling user uploads is part of the core business of Twitpic, I’ve thought long and hard about how to handle the problem in the best way. Our solution scales horizontally, is robust enough to handle failures while avoiding any single points of failure, and is able to deal with retries. Here’s what it looks like:

1. The file is uploaded by the user and immediately saved to disk

Duh, right. :) The file is uploaded to our application cluster and received by PHP. You should know that nginx buffers file uploads to disk until all of the data has been received from the client before passing it to PHP. This means that PHP’s upload progress features will not work, because PHP isn’t receiving the uploaded data in real time.
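For reference, these are the nginx directives that govern that buffering behavior; the values below are illustrative assumptions, not our production settings:

# Hypothetical values -- tune for your own upload sizes
client_max_body_size     25m;                  # reject uploads larger than this (413)
client_body_buffer_size  128k;                 # bodies larger than this get buffered to disk
client_body_temp_path    /var/lib/nginx/body;  # where the buffered upload lives until PHP gets it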

We create a database record for the file and immediately store the file on disk in a data directory on the application server, using the id column of the database record as the filename. The database record also has a location column, where we store the hostname of the server that the file was originally uploaded to.
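As a point of reference, a hypothetical minimal schema for that table might look like the following; only the id and location columns are described above, and the column types are assumptions:

-- Hypothetical minimal images table: id doubles as the filename on disk,
-- location holds the origin hostname (e.g. "app01") or "s3".
CREATE TABLE images (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    location VARCHAR(64)     NOT NULL,
    PRIMARY KEY (id)
);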

Finally, after the file is stored to disk, we create a new job in a Resque queue to eventually push this upload to long-term storage.

The data directory that we store the file in is web accessible. This means that if a file is uploaded to app01 and gets a database id of 123456, it can be accessed from the web at http://app01/data/123456.

This method lets us display the file immediately, without waiting for it to reach Amazon S3, by linking directly to the application server that handled the upload. Note that this requires a database lookup to determine the file’s location before linking to it. If your PHP code running on app04 wants to show image 123456, it needs to look up 123456 in the database, check the location of the file, and then use that location to generate the link http://app01/data/123456.

In the example below, we’re going to pretend we’re dealing with image uploads, and we’ll add a url() method to our image model to handle the URL logic.

<?php
if ($_FILES['upload']) {
    $image = new Image;
    $image->location = php_uname('n'); // Returns the hostname
    $image->save();
    move_uploaded_file($_FILES['upload']['tmp_name'], "/u/apps/data/{$image->id}");
    Resque::enqueue("uploads", "Upload_Job", array("id" => $image->id));
    echo "<img src='{$image->url()}'>";
}

And the url() method would look something like this:

<?php
class Image extends Model {
    public function url() {
        if ($this->location == "s3") {
            return "http://s3.amazonaws.com/my_bucket/{$this->id}";
        } else {
            return "http://{$this->location}.mysite.com/data/{$this->id}";
        }
    }
}

2. Push the file to long-term storage

In step one we handled the user’s upload, stored it temporarily on a single application server, and made it immediately available for access on the web. That’s great, but we also need to quickly push it to long-term storage, because if app01 crashes or becomes unavailable, we’d lose access to that user’s content. No bueno!

We queued up a job in Resque to push the file to long-term storage when it was originally uploaded, so let’s talk about how we should implement the job worker (Resque is covered in-depth in Chapter 7).

The first thing the job needs to do is send the file to Amazon S3. But where are we going to get the file from? We can’t access the data directory on the filesystem directly, because the Resque job won’t be running on the application servers. Of course, you could tie specific jobs to specific servers and run Resque workers on the application servers, but that’s bad design overall; jobs should be able to run anywhere. Wait! All we have to do is use the url() method discussed in the example above, and we can make an HTTP request to download the image, allowing the Resque job to work from wherever it happens to run.

Our Resque job would look something like this:

<?php
class Upload_Job {
    public function perform() {
        $image_id = $this->args['id'];
        $image = Image::find($image_id);

        // Download the file over HTTP from whichever server is holding it
        $data = file_get_contents($image->url());

        // Short-circuit: don't attempt the S3 upload if the download failed
        if ($data === false || upload_file_to_s3($data, $image->id) === false) {
            throw new Exception("Couldn't upload to S3, retry");
        }

        $image->location = "s3";
        $image->save();
    }
}

We’re even handling failure! When a Resque job throws an exception, it is queued up again and retried. If there is a hiccup and your long-term storage or application server becomes momentarily unavailable or times out, the job is tried again and no data is lost.

In our real-life usage of this setup, the Resque jobs run very fast (almost immediately), so images are moved to long-term storage about one second after upload. This means that if an application server crashes, we would lose (at most) about one second’s worth of user-uploaded content from that server. Not too bad.

3. Cleanup

The last thing to think about is cleanup—removing the uploaded data from your application servers after it has made its way to long-term storage. You obviously don’t want this old, stale data to grow unchecked, because eventually it’ll fill up the disks on your app servers.

You could do something complex like using the nginx HttpDavModule to create a private/internal WebDAV endpoint for your data directory, allowing the Resque job to delete the file from the application server over HTTP after successfully storing it in long-term storage.
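If you went that route, the job side is just an HTTP DELETE back to the origin server. A rough sketch, assuming a hypothetical internal hostname pattern and that the WebDAV vhost allows DELETE:

<?php
// Hypothetical sketch: ask the origin app server to remove its local copy
// once the file is safely on S3. The .internal.mysite.com hostname and the
// 204/200 success codes are assumptions about how the vhost would be set up.
function delete_origin_copy($image) {
    $ch = curl_init("http://{$image->location}.internal.mysite.com/data/{$image->id}");
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "DELETE");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $status == 204 || $status == 200;
}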

I prefer the simpler solution, though :) – a cron job that scans the data directory on each application server and deletes uploads older than X days. You could write a script for this, but it’s doable with a UNIX one-liner that removes files in the /data directory older than 7 days:

0 4 * * * find /data -type f -ctime +7 -exec rm -f {} \;

How I used to handle this the wrong way

Pushing uploads to long-term storage is really a great fit for a queue. The original system we used at Twitpic (amazingly, only three years ago) was really horrible and unreliable (I’m almost too embarrassed to write it here). We skipped the queue and used the database directly. That is to say, we ran a cronjob that executed a query which looked something like SELECT * FROM images WHERE location != "s3" LIMIT 100, and then looped over the result, uploading each file to Amazon S3.

This was a horrible design for a couple of reasons.

1. It didn’t scale at all. Running the query multiple times returns the same results (unless you add an ORDER BY RAND(), which is terrible for performance), so scaling to multiple servers would mean lots of duplicate images uploaded to Amazon S3, wasting system resources and expensive AWS bandwidth.

2. We didn’t have our location column indexed (an index there really isn’t needed for our other queries), so running SELECT * FROM images WHERE location != "s3" causes a full-table scan. A full-table scan means the database has to look at ALL of your data. If you have 10 million rows, it needs to scan them all. Every. Single. Row. Ugh!