Scaling PHP Applications (2014)

Case Study: Async APIs w/ Kestrel saved my butt

In terms of scaling websites, background jobs are a pretty new technique. Three or four years ago, social APIs weren't as common, and running parts of the web request in the background was pretty much unheard of at all but the biggest websites.

Twitpic depends on the Twitter API for many site functions. When you upload a new picture, you have the option of posting it to Twitter or Facebook. When you comment on a photo, we push that comment to your Twitter stream too.

In the beginning, it was coded the naive way. Why? We simply didn't know better. We were grassroots, bootstrapped programmers working at a scale that was beyond our imagination. You can imagine our code looked something like this:

<?php

class Image_Controller extends Controller {

  // When someone uploads a new image...
  public function create() {

    $image = new Image();
    $image->process_image_upload();
    $image->save();

    // Generate the tweet text to send to Twitter,
    // i.e., "Check out this cool pic. http://twitpic.com/abc123"
    $tweet = $image->tweet();

    // Post a tweet to Twitter through the API.
    Twitter::post_tweet($image->user, $tweet);

  }
}

Programmatically, this makes sense. But it's the wrong way to tackle interacting with third-party APIs at any scale, EVEN if you're still small. Why? Because even the best-case scenario sucks. Since PHP blocks on I/O, API requests are going to take 1-2 seconds to process, no matter what. Every time a user uploaded an image, we'd tack on 1 or 2 seconds to post the tweet to Twitter.
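To make that cost concrete, here's a rough sketch (not from the Twitpic codebase, and the endpoint is a placeholder) of timing a blocking HTTP call with curl. Any remote API call made inside the web request behaves the same way:

<?php
// Hypothetical illustration: timing a blocking API call made during a web request.
$start = microtime(true);

$ch = curl_init('https://api.example.com/statuses/update.json'); // stand-in URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // without a timeout, a slow API blocks even longer
curl_exec($ch);                        // PHP sits here until the API responds or times out
curl_close($ch);

// Whatever this prints gets added directly to the user's page load,
// because it all happens inside the web request.
printf("API call took %.2f seconds\n", microtime(true) - $start);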

You can argue it's not a big deal because they're uploading a file- they're not going to notice an extra few seconds. But what about commenting? We want to post a tweet when they make a comment- a normally fast operation made slow because of the API interaction.

And it's not a dig at Twitter. ALL HTTP APIs are slow. It's the nature of the beast: when you go over the network to hit an API across the country, it's going to add latency.

The worst-case scenario

Forget user experience- doing it this way leads to some serious problems. Your website becomes dependent on someone else. Remember when Twitter used to crash on a weekly basis? If you were working with the Twitter API like we were, Twitter going down would crash your website too! Fail whales for everyone.

Remember how I said that each pesky API call takes 1-2 seconds because PHP blocks and waits for the response? Well, what happens when an API request takes a very long time (5+ seconds) to respond? The PHP process sits there waiting for the response, unable to serve any new HTTP requests. And when Twitter goes down, these "blocked" PHP processes just sit there, waiting for a response or to hit their timeout.

On a busy website, this can QUICKLY eat up all of your available PHP processes. When you're posting 100+ images per second to Twitter, and Twitter goes down, you're effectively blocking 100+ PHP processes every second as they wait for the Twitter API to respond or time out- making them unable to serve any other incoming requests.

It causes a domino effect of sorts and will eventually take down the entire site when you run out of available (non-blocked) PHP processes.

When Twitter went down hard, we went down hard. It was nasty.

The road to Kestrel

After being fed up with having our uptime dependent on another company's uptime, I decided enough was enough and started researching options.

My first inclination was to use a database (don't do this). I set up a table, inserted tweets into it, and posted them with a process running in the background. As discussed in Chapter 8, this causes a race condition when you run more than one background worker unless you use MySQL locking. MySQL locking works, but it creates a global lock on your queue and is a poor solution for something that should be parallel and concurrent. Not to beat a dead horse, but the database is not a queue.
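To show where the race comes from, here's a hypothetical sketch of that kind of worker (the pending_tweets table and the way Twitter::post_tweet is called are stand-ins, not the actual Twitpic code). Run two copies of it and both can grab the same row before either one deletes it:

<?php
// Hypothetical database-as-a-queue worker.
$db = new PDO('mysql:host=localhost;dbname=twitpic_example', 'user', 'password');

while (true) {
    // Worker A and Worker B can both read the same row here...
    $row = $db->query('SELECT id, user_id, tweet FROM pending_tweets ORDER BY id LIMIT 1')
              ->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        sleep(1);   // queue is empty, poll again in a second
        continue;
    }

    Twitter::post_tweet($row['user_id'], $row['tweet']);   // ...so both post the tweet...

    // ...and only afterwards do they both run the DELETE.
    $db->prepare('DELETE FROM pending_tweets WHERE id = ?')->execute([$row['id']]);

    // Avoiding the duplicate takes SELECT ... FOR UPDATE or table locks,
    // which serializes the workers- exactly the "global lock" problem above.
}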

So, my first foray into background posting caused duplicate tweets and all-around failure. I told you I had no idea what I was doing.

I started researching other solutions. I found some articles written about Starling by Twitter- it was their first queuing daemon, written in Ruby. Unfortunately, Ruby (or PHP) isn't the best fit for a concurrent, highly-available queueing system, so there was a ton of bad press surrounding it. That led me to find their open-source queue, Kestrel, which replaced Starling.

Kestrel is written in Scala and runs on the JVM. Furthermore, it uses the memcached protocol, so no extra client is necessary- just point your Memcached client at it! At the time, Resque wasn't around, so Kestrel was the best (and least complicated) solution. Eventually, we moved all of this over to Resque, though. The concepts still apply regardless of the queue server you use.
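In practice that looks something like the minimal sketch below, using the stock PHP Memcache extension. It assumes Kestrel is running on its default port, and the queue name "tweets" is just an example:

<?php
// Talking to Kestrel with the regular PHP Memcache extension.
// The "key" is the queue name: set() enqueues an item, get() dequeues the next one.
$kestrel = new Memcache();
$kestrel->connect('127.0.0.1', 22133);   // Kestrel's default memcached port

// Enqueue a job payload.
$kestrel->set('tweets', json_encode(['user_id' => 123, 'tweet' => 'Check out this pic!']));

// Dequeue the oldest item (returns false when the queue is empty).
$job = $kestrel->get('tweets');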

Anyways, I set up Kestrel and started pushing our Twitter API calls into it. Instead of doing them during the web request, I moved them into the background to be processed "eventually".
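Roughly, the controller changed to something like the sketch below (the queue name, the JSON payload, and the way the worker calls Twitter::post_tweet are adapted for the example- this isn't the actual Twitpic code). The web request just enqueues a job; a separate long-running worker pops jobs off the queue and talks to Twitter:

<?php
// The upload action now enqueues the tweet instead of posting it inline.
class Image_Controller extends Controller {

  public function create() {

    $image = new Image();
    $image->process_image_upload();
    $image->save();

    // Push the job into Kestrel and return immediately-
    // the user never waits on the Twitter API.
    $kestrel = new Memcache();
    $kestrel->connect('127.0.0.1', 22133);
    $kestrel->set('tweets', json_encode([
      'user_id' => $image->user->id,
      'tweet'   => $image->tweet(),
    ]));

  }
}

<?php
// worker.php- a long-running background process (run a few copies of it).
// It pops jobs off the queue and posts them to Twitter "eventually".
$kestrel = new Memcache();
$kestrel->connect('127.0.0.1', 22133);

while (true) {
  $job = $kestrel->get('tweets');

  if ($job === false) {   // queue is empty
    usleep(100000);       // back off for 100ms
    continue;
  }

  $data = json_decode($job, true);
  Twitter::post_tweet($data['user_id'], $data['tweet']);
}

The win is that the only work left inside the web request is a single set() to a local queue server, which takes milliseconds instead of seconds.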

Win

Using Kestrel was a huge win and solved many of our early downtime problems. Instead of crashing when Twitter went down, our site would run FAST- we'd just see jobs pile up inside of Kestrel's queues. And once Twitter came back online, all of those jobs would get processed! No more downtime, no more data loss.

Already it was a huge win. But it also improved the user experience. Remember how I said posting to Twitter, best case, takes 1 or 2 seconds? That meant 1-2 seconds to post new images or comments. With background workers in place, new comments went through INSTANTLY and images were uploaded 25% faster.