Heroku: Up and Running (2013)
Chapter 7. When It Goes Wrong
Exceptions happen, even on exceptionally well-written applications. But when the unthinkable happens, how can you find your problems and fix them? Let’s dig into debugging problems on deploy and during runtime.
Heroku maintains a wealth of documentation via articles about all sorts of topics related to the platform and the various tools that surround it. If you’re having problems, this should be your first port of call as you should be able to find detailed information about the moving parts surrounding the issue you are experiencing.
The Heroku DevCenter can be found here.
Remember, you’re not on your own. Once you’ve gone through all of the debugging you can by yourself, don’t forget that Heroku provides support for applications, in addition to the community-based forum support on Stack Overflow.
Even with support from outside sources, you will still need to troubleshoot your issue. Heroku can help you with its platform, but what about when the issue is inside of your app? The following debugging sections can help you solve your issue, or at least give you greater insight into what is happening with your app on the platform, which will help the support process go much smoother.
Let’s take a look at some common deployment problems.
I can’t deploy, why?
As you use the Heroku platform, you will likely find yourself in this position at some time. Not to worry, though—deployment is the best place your application could fail. Ideally, it means that you just avoided deploying an app with a problem into production, which is great. Though sometimes it can be a bit difficult to understand where to get started debugging your deployment problems, here is a quick list of good places to begin:
§ Check the status site.
§ Reproduce the problem in another app.
§ Copy config, add-ons, and labs features.
§ Check the .slugignore file.
§ Fork and use a custom buildpack with added logging statements.
Heroku Status Site
It sounds like an RTFM type of comment, but checking http://status.heroku.com can save you a bit of a headache if the issue isn’t on your end. Heroku splits up its application architecture into two categories. The production category is everything that your “production quality” application needs to stay alive. This includes the routers, dyno capacity, and production tier databases. Noncritical components are grouped into “Development,” which includes the API, command-line access, shared/development databases, and anything else that might go down that shouldn’t affect your running app. By isolating failures to individual services you can help better understand the impact to your app. If development is down, you might not be able to push, but your old version of the app should still be able to run.
What happens if development is down on the status site? You can subscribe to notifications and wait for it to come back online. If everything is green and there aren’t any other problems, you might want to try reproducing the issue on another app.
Reproducing the Problem
If the issue isn’t clear by checking your deploy’s output, sometimes you can isolate the failure by trying to reproduce it on another application. Hopefully you’ve got a staging server set up, and you can try deploying to that app. If everything works fine on your staging server but not on your production app, you should try to duplicate the production environment as closely as possible with your staging server. This includes making sure config, addons, and any labs features are on par across both servers:
$ heroku config
$ heroku addons:list
$ heroku labs:list
What do you do if there is a difference? You modify your staging server to have parity with your production server and then deploy again. If the last deploy worked, you might get an error saying that nothing changed because you haven’t added a new commit in Git. A useful trick for forcing a deploy is to add a blank commit in your project:
$ git commit --allow-empty -m "debugging"
[master ddf2020] debugging
You should then be able to push and deploy. Once you’ve isolated the mechanism that causes your failure, you’ll be better equipped to fix it. Reproducing the issue on staging or a fresh app is crucial if you believe you’re getting your error as a result of the Heroku platform. You can try isolating the functionality that broke between deploys and adding it to a dummy example app.
Sometimes, after all of that, you still might not know where the failure is coming from. Then what can you do?
Check for a .slugignore File
A .slugignore file works much like a .gitignore file, except it tells Heroku what files to ignore when building your application. This can be pretty useful for a small subset of applications but it can also make some pretty hard-to-reproduce issues. Verify that you don’t have one in your project; you probably don’t, but it can’t hurt to check.
So if you still don’t know exactly what is causing your deploy issue, you can pull out the big guns and use a custom debugging fork of the buildpack.
Fork and Use a Custom Buildpack
Buildpacks are what Heroku uses to provide the step-by-step instructions that set up the environment for your app so it can run on a dyno. (Buildpacks are discussed further in Chapter 8.) The buildpacks are open source, so you can fork a buildpack, add debugging statements, and then deploy using that buildpack. If you haven’t read Chapter 8, you should do so before attempting this debugging option.
Sometimes very difficult deployment issues become very obvious if you can isolate the problem to buildpacks. On one occasion an encoding issue in client code caused an exception while running the buildpack, causing it to not give any output at all. Once the developer deployed with a custom buildpack and just a few debugging lines, the trouble area was isolated and the problem became apparent.
How do you output from a buildpack? In the compile step of a buildpack, anything sent to the standard output will display in the build process. Depending on your language, you can add puts, sprintf, System.out.println, or print(), and the output will be available while building. But what to print? First, we’ll figure out where our target problem area is with tracers.
Tracer bullets were first used by the British in 1915. The rounds have a pyrotechnic charge that burns brightly so you can see where the projectile goes, allowing the shooter to correct his aim. In programming, debug tracers are similar: they help us decode what is going on in our app. If you are trying to see if a particular piece of code gets executed, you can add a simple tracer to it:
If you have many such tracers, it can be useful to name them:
print "==== Detecting App"
Once you’ve isolated the problem area with tracers, you can dig in by outputting variable values or other debugging information.
OK, so we can use tracers output to discover which part of the build process is failing, but how do we actually add them?
To add your debug statements, first you fork the buildpack you are using. You can find them at https://github.com/heroku; just search for “buildpack.” Once you’ve forked to your personal account such as github.com/schneems, you’ll need to point the BUILDPACK_URL of your staging app to your repo, then add your tracers, and deploy your staging app. If you don’t see any output, check that your changes were pushed correctly to the repo specified in the BUILDPACK_URL.
Once you’re done debugging with your custom buildpack, you can simply remove the BUILDPACK_URL from your staging app’s configuration vars and you’re good to go.
Deploy Taking Too Long?
Heroku is an opinionated platform, and one of the opinions is that deploys shouldn’t take too long. This prevents deployments from hanging, and gives you incentive to decrease your deployment time. If you accidentally deploy a bug, you don’t want to go through an hour-long deploy process to get it into production. As such, the deployments have a maximum time that they will run; if your application takes longer to deploy than that time, it will be canceled. Although the value of the timeout may change in the future, it will likely stay somewhere around the 10-minute mark. Several common things that occur during deployments can ramp up the compilation time exponentially.
The final option is to move asset compilation to your local machine or an intermediate machine and not do the compile on every deploy. This option works well for assets that change infrequently, but it should be considered a last resort.
Besides asset compilation, downloading dependencies and running setup tasks can take a long time on very large apps. In those situations, you should consider if all of that really belongs in one app, or rather in several smaller ones.
Once you’ve got any deploy problems sorted out, it’s all smooth sailing until you hit an error in running code.
Runtime Error Detection and Debugging
Unlike deploy errors, issues that occur during runtime can go unnoticed for some time. It is important that as an application owner you take steps to increase the viability of your application. In this section we will talk about some proactive measures you can take to watch for errors, as well as the steps you need to take to debug the errors. There are two common steps to debugging runtime exceptions: detection, where we find the exception, and debugging, where we fix the exception.
If errors are happening in your application, you need to know. This goes beyond exception notifications, and total application visibility should be the goal. But why? Isn’t it just good enough to detect when errors happen and fix them? As Wayne Gretzky once said, “A good hockey player plays where the puck is. A great hockey player plays where the puck is going to be.” We’re building software, not playing hockey, but the statement still applies. Instead of chasing exceptions, make it a goal to know everything about the application, then when an exception does happen it is simply an unexpected state that needs to be explored. There are a number of simple steps you can take to increase visibility for you and your team.
Deployments are the single most important action that any developer can take on a production app. Deployments are how bugs get introduced and how they get fixed. In recent years, it has become popular to adopt a “continuous deployment” strategy, where a team deploys not just once a year, or a quarter, or month but rather any time they need to. During active development it is not uncommon for a team to deploy many times in one day. That just makes it all the more critical that your team knows when someone is deploying and what code is live on the server. Small teams can get away with turning around and shouting to the office “I’m deploying,” while larger teams can sometimes set up elaborate mechanisms that automatically lock out deploys and roll back if an exception threshold is passed. To help with notifying team members and building deploy-related applications, Heroku offers a Deploy Hooks add-on:
$ heroku addons:add deployhooks:email
The deploy hook add-on is free and can be used to send emails, ping external services such as Basecamp or Campfire, post in IRC, or even send a POST HTTP message to a URL of your choice, if you need more flexibility. What you do with the information is up to you, but it’s always a good idea to manually test out an application after deployment, which can help catch gaps that your automated tests missed. You are testing your app, right?
One of the core tenets of continuous delivery (of which continuous deployment is a part) is testing your application with an automated test suite. Automated tests are code that makes assertions about your application code. There are many different types of automated tests with different purposes, from integration tests that use your app like a real user to unit tests to help with writing individual classes and methods. The different types of tests all help to protect against regression (when old features are accidentally lost) and application exceptions (when errors or exceptional states are introduced to your application). Because tests are run before your code is deployed, it can tell you if an application is safe to deploy. In addition to running tests locally, it can be a good idea to set up a continuous integration server (CI server), or a production-like environment where tests can be run. Developers can set up their own CI server or use a hosted service such as Travis CI or Tddium:
$ heroku addons:add tddium:starter
In addition to ensuring that your tests are being run, CI servers act to raise test status visibility to you and your team. Many application owners have CI set to alert everyone on a team if a build failed, and can help keep track of when a failing test was introduced, and by which developer. Testing may not catch every production issue, but it’s simple to set up and can help with refactoring, so the time expended in setting up a good test suite will surely be paid back. What about those issues that come not from regression or exceptions, but from a gradual slowdown or sudden breakage of a data store or third-party service due to using too much capacity?
Even when your code is healthy, your app might not be. Most applications on Heroku rely on services such as Heroku Postgres and other data stores. These data stores are provided in a tiered system, so that as an application grows and needs additional resources they can be provisioned. It is important to keep a health check of all the subsystems that your application is using. You should check documentation on individual services; for example, there are queries you can run using Postgres to determine if you need to upgrade. It is a good idea to run performance metrics once or more a week to determine if your application’s data stores are running out of capacity. Depending on how your application is architected, you may need to check these statistics more often.
When you’ve got all of that in place, you can still get errors in your deployed application. Where should you look first to find them?
When you encounter errors in production, you’re not going to have a debug page; you’ll need to find the exception in your logs and start debugging from there. If you’re used to a VPS, you might be reaching for that SSH command right now, but you can’t SSH into your live box, as each dyno is sandboxed. Instead, you can use the Heroku logs command:
$ heroku logs --tail
This will provide you with a stream of all the logs from each dyno that is running. That means if you are running 1 dyno or 100, all of the output will be visible through this command. The importance of using logs to debug your application cannot be overstated. Many log messages will direct you to the exact file and line number where the exception occurs, some will even tell you exactly what went wrong. Errors in the code that you deploy are inevitable, and so is checking your logs.
Heroku will store 1,500 lines of logs on a rolling basis; if you need more than that, you should consider using a logging add-on. Heroku has several logging add-ons that will allow you to archive and search your logs. At the time of this writing, Logentries, Loggly, and Papertrail are available to store and search your logs:
$ heroku addons:add papertrail:choklad
Most of the add-ons have a free-to-try tier, so you can see which one is right for your app. They all have different interfaces and limits, and some have special features such as alerts. If you wanted to, you could even build your own log archiver using a log drain for your app, as log drains are a feature of Heroku’s Logplex architecture. You can attach an unbounded number of log drains to a Heroku application without impacting the performance of your Heroku app.
Sometimes it’s not the errors you’re hunting for in the logs; sometimes you just want the error right in front of you. For those times, you might consider coding up a special admin error page.
Admin error pages
If your application has a logged-in user state and there is a restricted admin flag on users, it can be helpful to use dynamic error pages.
When logged in as an admin, if you come across an error, the error page can show you the backtrace and exception, while non-admin users just get the normal error page. Here is an example from the site http://www.hourschool.com:
While useful for debugging reproducible errors, you could get the same info from the logs; this method just makes it a bit easier. There are many times when you won’t be able to reproduce the error, or when errors will happen without you knowing. To help combat this problem, let’s take a look at some exception notification mechanisms.
Exceptions still happen while you’re asleep, and while a logging add-on might allow you to find the details, how do you know a user is getting exceptions in the first place? Exception notification services either integrate with your application code or parse logfiles to store and record when exceptions happen. Because many of them group related exceptions, it can be very useful for determining the issues that affect the most users. They can also be configured to send emails or other types of notifications, hence the name:
$ heroku addons:add newrelic:standard
Some of these services, like New Relic (see Figure 7-1), do much more than record exceptions. They can be used to record and report your application performance and a number of other metrics. As was mentioned previously, having insight into the performance of your app is crucial.
Figure 7-1. New Relic’s web interface
Some of the add-ons have fewer features on purpose; they aim to have laser focus and try to minimize costs to the user by only providing what’s needed. Shop around for an exception notification system that meets your needs.
Uptime services that ping your app to check if it is still serving valid requests are used by many applications. These services will ping multiple public pages of your website at configurable intervals and alert you if they get any response other than a 200 (Success in HTTP). One popular service is Pingdom; it can be configured to send emails and text messages when an outage is detected. When you receive an outage report, you should always check http://status.heroku.com to confirm the outage is not part of a systemwide exception. If it is, you’ll need to hunt down the exceptions using your logs or your notification services.
So far, all of the techniques we have looked at to increase application visibility are technical in nature, but as the saying goes “the squeaky wheel gets the grease.” Keep an ear open to social media channels such as Twitter, and maintain either an error-posting forum such as Get Satisfaction, or at least a support email address. If your application has enough users, and you break something they want, they will let you know about the breakage. This should not replace any of the previously mentioned efforts, but it’s always a good idea to be receptive to your users and their needs.
Code Reviews and a Branching Workflow
Because knowing more about your application is always beneficial, it can be a good idea to incorporate quick code reviews into your workflow. One popular option is by using Gitflow or GitHubFlow. In short, every developer works on a branch, then, instead of pushing to master, he submits a pull request where another developer must review the code. This doesn’t guarantee better code, or fewer bugs, but it does guarantee that more developers are familiar with more of the new code going into production. What if someone breaks the production site on a Friday, but no one realizes it until Monday when that coder has gone on vacation? If the code was peer-reviewed, then at least one other developer is guaranteed to be somewhat familiar with the changes. Again, it’s not always about preventing every possible runtime error, but rather knowing your application inside and out, so when disaster strikes, you are prepared to act swiftly and efficiently.
Runtime Error Debugging
Your system went down, and since you’ve got so much visibility into your application you were able to pinpoint the error and get a backtrace—now what? You need to understand what caused the issue in the first place so you can fix your production system. The first goal should always be to reproduce the problem, or at least the conditions in which the problems occur: if that fails, you will at least know what doesn’t cause the error. In a perfect world, your error message would just tell you exactly what is wrong but what happens when you get a cryptic or misleading error message? It’s time to put your debugging hat on—this is the fun part.
Data State Debugging
When you’re getting an error in production but not in development, more often than not it’s due to different states in your data store (such as Postgres). The most common of these is failure to migrate your database schema before pushing your code that uses new columns and tables. This problem was so common that Heroku’s original stack, Aspen, automigrated your database for you. Unfortunately, large application owners need more control over their database migrations, and this behavior is not valid 100% of the time. Because of this, you need to make sure you run any migrations after you push them to your Heroku app. If you are using Ruby on Rails, your command might look something like this:
$ heroku run rake db:migrate
If you desire the automigration capacity, some developers build deploy scripts that can autorun migrations after every deploy. These scripts can be extended to do a number of other things, like running tests before deploying or checking CI status. Even if your database schema is correct, you can still experience unexpected scenarios due to the data inside of your database. To help solve this problem, it might be useful to clone a copy of your production data to a local machine. For more information on how to do this, refer back to Chapter 5.
$ heroku run bash
Running `bash` attached to terminal... up, run.5026
~ $ ls public/assets -l
-rw------- 1 u31991 31991 3925 2013-10-29 16:41 manifest-
From here you can use cat, ls, and find to debug low-level file generation issues.
Application State Debugging
If the error isn’t associated with your data store, but rather with your code or your environment setup, then your first goal is to reproduce the issue. Here you have two options: you can reproduce the issue locally on your development machine or on a staging server that approximates your production server. If you cannot reproduce the issue locally, you will need a staging server, which is lucky because you can just spin up another Heroku app and try deploying to that server. Use additional information around the exception in addition to the backtrace and exception messages to help you reproduce the error. Check that you’re using the same parameters in the problem request. Was the user logged in or out when she got the error? The closer you can get to the exact conditions under which the exception occurred the better chance you’ll have of reproducing it.
Once you’ve figured out where the error is and how to reproduce it, you’ll need to fix the underlying issue. Hopefully you can write a test case that covers the problem, or at least reproduce it in an interactive console. Once you’ve gotten this far, it’s just a matter of patching your code, deploying a fix, and then celebrating. In these scenarios, it can be helpful to have a retrospective plan in place, such as the 5 Whys, to see what caused the exception and if measures can be put in place to prevent a similar exception from happening in the future. However, it’s important to remember that no system or code is ever 100% bug free—even systems on the space shuttle malfunction from time to time. The important thing is that when things do break, you understand how, and can hopefully reduce the impact of the exceptions.
You’ve got your site deployed and working like a charm. Remember that application visibility is a never-ending process. If you and your team stay diligent, you’ll be able to deliver a world-class service that your users are happy to recommend to others. If you’re not quite there yet, don’t worry—there is always room for improvement. Read back over the suggestions in this chapter and see what your team could benefit from implementing.