Test-Driven Infrastructure with Chef (2011)
Chapter 1. The Philosophy of Test-Driven Infrastructure
When the first edition of this book was published in late summer 2011, the idea of testing infrastructure code met with broad skepticism, and only a handful of pioneers were putting it into practice.
Less than a year later at the inaugural #ChefConf, the Chef user conference, two of the plenary sessions and a four-hour hack session were devoted to testing. Later that year at the Chef Developer Summit, where people meet to discuss the state and direction of the Chef open source project, code testing and lifecycle practices and techniques emerged as top themes that featured in many heavily attended sessions—including one with nearly 100 core community members.
Infrastructure testing is a hugely topical subject now, with many excellent contributors furthering the state of the art. The tools and approaches that make up the infrastructure testing ecosystem have evolved significantly. It’s an area with a high rate of change and few established best practices, and it is easy to be overwhelmed by the amount to learn and bewildered by the range of tools available. This book is intended to be the companion for those new to the whole idea of infrastructure as code, as well as those who have been working within that paradigm and are now looking to fully embrace the need to prioritize testing.
This update is much expanded and provides a thorough introduction to the philosophy and basics of test-driven development and behavior-driven development in general, as well as the application of these techniques to the writing of infrastructure code using Chef. It includes an up-to-date introduction to the Chef framework and discusses the most widely used and popular tooling in use with Chef, before providing a recommended toolkit and workflow to guide adoption of test-driven infrastructure in practice.
Underpinning Philosophy
There are two fundamental philosophical points upon which this book is predicated:
1. Infrastructure can and should be treated as code.
2. Infrastructure developers should adhere to the same principles of professionalism as other software developers.
While there are a number of implications that follow from these assumptions, the primary one with which this book is concerned is that all infrastructure code must be thoroughly tested, and that the most effective way to develop infrastructure code is test-first, allowing the writing of the tests to drive and inform the development of the infrastructure code. However, before we get ahead of ourselves, let us consider our two axiomatic statements.
Infrastructure as Code
“When deploying and administering large infrastructures, it is still common to think in terms of individual machines rather than view an entire infrastructure as a combined whole. This standard practice creates many problems, including labor-intensive administration, high cost of ownership, and limited generally available knowledge or code usable for administering large infrastructures.”
— Steve Traugott and Joel Huddleston
“In today’s computer industry, we still typically install and maintain computers the way the automotive industry built cars in the early 1900s. An individual craftsman manually manipulates a machine into being, and manually maintains it afterwards.
The automotive industry discovered first mass production, then mass customization using standard tooling. The systems administration industry has a long way to go, but is getting there.”
— Steve Traugott and Joel Huddleston
These two statements came from the prophetic www.infrastructures.org at the very start of the last decade. More than 10 years later, a whole world of exciting developments has taken place: developments that have sparked a revolution, and given birth to a radical new approach to the process of designing, building, and maintaining the underlying IT systems that make web operations possible. At the heart of that revolution is a mentality and toolset that treats infrastructure as code.
We believe in this approach to the designing, building, and running of Internet infrastructures. Consequently, we’ll spend a little time exploring its origin, rationale, and principles before outlining the risks of the approach—risks that this book sets out to mitigate.
The Origins of Infrastructure as Code
Infrastructure as code is an interesting phenomenon, particularly for anyone wanting to understand the evolution of ideas. It emerged over the last six or seven years in response to the juxtaposition of two pieces of disruptive technology—utility computing and second-generation web frameworks.
The ready availability of effectively infinite compute power at the touch of a button, combined with the emergence of a new generation of hugely productive web frameworks, brought into existence a new world of scaling problems that had previously only been witnessed by the largest systems integrators. The key year was 2006, which saw the launch of Amazon Web Services’ Elastic Compute Cloud (EC2), just a few months after the release of version 1.0 of Ruby on Rails the previous Christmas. This convergence meant that anyone with an idea for a dynamic website—an idea that delivered functionality or simply amusement to a rapidly growing Internet community—could go from a scribble on the back of a beermat to a household name within weeks.
Suddenly, very small developer-led companies found themselves facing issues that were previously tackled almost exclusively by large organizations with huge budgets, big teams, enterprise-class configuration management tools, and lots of time. The people responsible for these websites that had become huge almost overnight now had to answer questions such as how to scale databases, how to add many identical machines of a given type, and how to monitor and back up critical systems. Radically small teams needed to be able to manage infrastructures at scale and to compete in the same space as big enterprises, but with none of the big enterprise systems.
It was out of this environment that a new breed of configuration management tools emerged. Building on existing open source tools like CFEngine, Puppet was created in part to tackle these new problems.
Given the significance of 2006 in terms of the disruptive technologies we describe, it’s no coincidence that in early 2006 Luke Kanies published an article on “Next-Generation Configuration Management” in ;login: (the USENIX magazine), describing his Ruby-based system management tool, Puppet. Puppet provided a high-level domain-specific language (DSL) with primitive programmability, but the development of Chef (a tool influenced by Puppet, and released in January 2009) brought the power of a third-generation programming language to system administration. Such tools equipped tiny teams and developers with the kind of automation and control that until then had only been available to the big players through expensive in-house or proprietary software. Furthermore, because they were built on open source tools and released early to developer communities, these tools were able to evolve rapidly according to demand, and they swiftly became more powerful and less cumbersome than their commercial counterparts.
Thus a new paradigm was introduced—infrastructure as code. In it, we model our infrastructure with code, and then design, implement, and deploy our web application infrastructure with software best practices. We work with this code using the same tools as we would with any other modern software project. The code that models, builds, and manages the infrastructure is committed into source code management alongside the application code. We can then start to think about our infrastructure as redeployable from a code base, in which we are using the same kinds of software development methodologies that have developed over the last 20 years as the business of writing and delivering software has matured.
This approach brings with it a series of benefits that help the small, developer-led company solve some of the scalability and management problems that accompany rapid and overwhelming commercial success:
Repeatability
Because we’re building systems in a high-level programming language and committing our code, we start to become more confident that our systems are ordered and repeatable. With the same input, the same code should produce the same output. This means we can now be confident (and ensure on a regular basis) that what we believe will recreate our environment really will do that.
Automation
Because we deploy our applications with mature tools written in modern programming languages, the very act of abstracting out our infrastructure brings us the benefits of automation.
Agility
The discipline of source code management and version control means we have the ability to roll forward or backward to a known state. Because we can redeploy entire systems, we are able to drastically reconfigure or change topology with ease, responding to defects and business-driven changes. In the event of a problem, we can go to the commit logs and identify what changed and who changed it. This is made all the easier because our infrastructure code is just text, and as such can be examined and compared using standard file comparison tools, such as diff.
Scalability
Repeatability and automation make it possible to grow our server fleet easily, especially when combined with the kind of rapid hardware provisioning that the cloud provides. Modular code design and reuse manage complexity as our applications grow in features, type, and quantity.
Reassurance
All of these benefits bring reassurance in their own way. In particular, because the architecture and design of our infrastructure are modeled in code, and not merely implemented in it, we may reasonably use the source code as documentation and see at a glance how the systems work. This knowledge repository mitigates the risk of only a single sysadmin or architect fully understanding how the system hangs together. Such a situation is dangerous: that person is able to hold the organization to ransom, and should they leave or become ill, the company is endangered.
Disaster recovery
If a catastrophe wipes out the production systems, and our entire infrastructure has been broken down into modular components and described as code, recovery is as simple as provisioning new compute power, restoring from backup, and redeploying the infrastructure and application code. What may have been a business-ending event in the old paradigm of custom-built, partially automated infrastructure becomes a manageable outage with procedures we can test in advance.
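To make the repeatability benefit concrete, here is a minimal sketch in plain Ruby. It is an illustration only and does not use Chef. The attribute names and the render_config helper are assumptions invented for this example. The point is that when a configuration file is generated from data by code, the same input yields byte-for-byte the same output, which we can verify with a checksum or a diff.

```ruby
require "digest"

# Hypothetical attributes describing a web server; the names are illustrative.
ATTRIBUTES = {
  "worker_processes" => 4,
  "port" => 8080,
  "server_name" => "example.test"
}.freeze

# Render a configuration file deterministically. Sorting the keys means the
# output does not depend on the order in which the attributes were supplied.
def render_config(attributes)
  attributes.sort.map { |key, value| "#{key} = #{value}\n" }.join
end

first = render_config(ATTRIBUTES)
second = render_config(ATTRIBUTES.to_a.reverse.to_h) # same data, different order

# Same input, same output: the two renderings have identical checksums.
puts Digest::SHA256.hexdigest(first) == Digest::SHA256.hexdigest(second)
```

Because the rendered file is just text, two versions of it can also be compared with a standard tool such as diff.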
Infrastructure as code is a powerful concept and approach that promises to help repair the split-brain phenomenon witnessed so frequently in organizations where developers and system administrators view each other as enemies, to the detriment of the common good. Through co-design of the infrastructure code that runs an application, we give operational responsibilities to developers. By focusing on design and the software lifecycle, we liberate system administrators to think at higher levels of abstraction. These new aspects of our professions help us succeed in building robust, scaled architectures. We open up a new way of working—a new way of cooperating—that is fundamental to the emerging DevOps movement.
The Principles of Infrastructure as Code
Having explored the origins and rationale for managing infrastructure as code, we now turn to the core principles we should put into practice to make it happen.
Adam Jacob, co-founder of Opscode and creator of Chef, says that there are two high-level steps:
1. Break the infrastructure down into independent, reusable, network-accessible services.
2. Integrate these services in such a way as to produce the functionality our infrastructure requires.
Adam further identifies 10 principles that characterize these reusable primitive components. His essay (Chapter 5 of Web Operations, ed. John Allspaw & Jesse Robbins, O’Reilly) is essential reading, but I will summarize his principles here:
Modularity
Our services should be small and simple—think at the level of the simplest freestanding, useful component.
Cooperation
Our design should discourage overlap of services and should encourage other people and services to use our service in a way that fosters continuous improvement of our design and implementation.
Composability
Our services should be like building blocks—we should be able to build complete, complex systems by integrating them.
Extensibility
Our services should be easy to modify, enhance, and improve in response to new demands.
Flexibility
We should build our services using tools that provide unlimited power to ensure we have the (theoretical) ability to solve even the most complicated problems.
Repeatability
With the same inputs, our services should produce the same results in the same way every time.
Declaration
We should specify our services in terms of what we want to do, not how we want to do it.
Abstraction
We should not worry about the details of the implementation, and think at the level of the component and its function.
Idempotence
Our services should be configured only when required; action should be taken only once, so that repeating an operation leaves the system in the same state as performing it a single time.
Convergence
Our services should take responsibility for their own state being in line with policy; over time, the overall system will tend to correctness.
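The principles of declaration, idempotence, and convergence are easier to grasp with a small example. What follows is a toy model in plain Ruby, not Chef's actual API: the PackageResource class, its converge method, and the set standing in for a machine's state are all hypothetical names chosen for illustration. Each resource declares a desired state and acts only when reality differs from it.

```ruby
require "set"

# A toy model of a declarative, idempotent resource. This is not Chef's API:
# PackageResource and converge are hypothetical names used for illustration.
class PackageResource
  def initialize(name, installed)
    @name = name
    @installed = installed # a Set standing in for the real state of a machine
  end

  # The desired state is "this package is installed". Act only if the
  # current state differs, and report whether anything had to change.
  def converge
    return false if @installed.include?(@name)

    @installed.add(@name)
    true
  end
end

installed = Set.new(["openssl"])
policy = %w[openssl nginx ntp].map { |name| PackageResource.new(name, installed) }

first_run = policy.count(&:converge)  # nginx and ntp are missing: two changes
second_run = policy.count(&:converge) # already in the desired state: no changes

installed.delete("ntp")               # simulate drift, such as a manual removal
third_run = policy.count(&:converge)  # only the drifted package is repaired

puts "changes per run: #{[first_run, second_run, third_run].inspect}"
```

The second run does nothing, which is idempotence. The third run repairs only what has drifted away from policy, which is convergence in miniature.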
In practice, these principles should apply to every stage of the infrastructure development process—from low-level operations such as provisioning (cloud-based providers with a published API are a good example), backups, and DNS, up through high-level functions such as the process of writing the code that abstracts and implements the services we require.
This book concentrates on the task of writing infrastructure code that meets these principles in a predictable and reliable fashion. The key enabler in this context is a powerful, declarative configuration management system that enables engineers (I like the term infrastructure developer) to write executable code that both describes the shape, behavior, and characteristics of the infrastructure that they are designing, and when actually executed, results in that infrastructure coming to life.
The Risks of Infrastructure as Code
Although the potential benefits of infrastructure as code are hard to overstate, it must be pointed out that this approach is not without its dangers. Production infrastructures that handle high-traffic websites are hugely complicated. Consider, for example, the mix of technologies involved in a large content management system installation. We might easily have multiple caching strategies, a full-text indexer, a sharded database, and a load-balanced set of web servers. That is a significant number of moving parts for the infrastructure developer to manage and understand.
It should come as no surprise that the attempt to codify complex infrastructures is a challenging task. As I visit clients embracing the approaches outlined in this chapter, I see similar problems emerging as they start to put these ideas into practice:
• Sprawling masses of infrastructure code
• Duplication, contradiction, and a lack of clear understanding of what it all does
• Fear of change; a sense that we dare not meddle with the manifests or recipes because we’re not entirely certain how the system will behave
• Bespoke software that started off well-engineered and thoroughly tested, but is now littered with TODOs, FIXMEs, and quick hacks
• Despite the lofty goal of capturing the expertise required to understand an infrastructure in the code itself, a sense that the organization would be in trouble if one or two key people leave
• War stories of times when a seemingly trivial change in one corner of the system had catastrophic side effects elsewhere
These issues have their roots in the failure to acknowledge and respond to a simple but powerful side effect of treating our infrastructure as code: if our environments are effectively software projects, then they should be subject to the same meticulousness as our application code. It is incumbent upon us to make sure we apply the lessons learned by the software development world in the last 10 years as it has striven to produce high-quality, maintainable, and reliable software. It’s also incumbent upon us to think critically about some of the practices and principles that have been effective in that world and to begin introducing our own practices that embrace the same interests and objectives. Unfortunately, many who embrace infrastructure as code have had insufficient exposure to or experience with these ideas.
There are six areas where we need to focus our attention to ensure that our infrastructure code is developed with the same degree of thoroughness and professionalism as our application code:
Design
Our infrastructure code should seek to be simple and iterative, and we should avoid feature creep.
Collective ownership
All members of the team should be involved in the design and writing of infrastructure code and, wherever possible, code should be written in pairs.
Code review
The team should be set up to pair frequently and to see regular notifications when changes are made.
Code standards
Infrastructure code should follow the same community standards as the Ruby world; where standards and patterns have grown up around the configuration management framework, they should be adhered to as well.
Refactoring
This should happen at the point of need as part of the iterative and collaborative process of developing infrastructure code; however, it’s difficult to do this without a safety net in the form of thorough test coverage of one’s code.
Testing
Systems should be in place to ensure that one’s code produces the environment needed and that any changes have not caused side effects that alter other aspects of the infrastructure.
I would argue that good practice in all six of these areas is a natural by-product of bringing development best practices to infrastructure code—in particular by embracing the idea of test-first programming. Good leadership can lead to rapid progress in the first five areas with very little investment in new technology. However, it is indisputable that the final area—that of testing infrastructure automation—is a difficult endeavor. As such, it is the subject of this book: a manifesto for bravely rethinking how we develop infrastructure code.
Professionalism
The discipline of software development is a young one. It was not until the early 1990s that the Institute of Electrical and Electronics Engineers and the Association for Computing Machinery began to recognize software engineering as a profession. The last 15 years alone have seen significant advances in tooling, methodology, and philosophy. The discipline of infrastructure development is younger still. It is imperative that those embarking upon or moving into a career involving infrastructure development absorb the hard lessons learned by the rest of the software industry over the previous few decades, avoid repeating its mistakes, and hold themselves accountable to the same level of professionalism.
Robert C. Martin, in Clean Code: A Handbook of Agile Software Craftsmanship (Prentice Hall), draws upon the Hippocratic oath as a metaphor for the standards of professionalism demanded within the software development industry: Primum non nocere (first, do no harm). This is the foundational ethical principle that all medical students learn. The essence is that the cost of action must be considered. It may be wiser to take no action, or not to take a specified action, in the interests of not harming the patient. The analogy holds for the software developer. Before intervening to add a feature or to fix a bug, be confident that you aren’t making things worse. Martin suggests that the kinds of harm a software developer can inflict can be classified as functional and structural.
By functional harm, we mean the introduction of bugs into the system. A software professional should strive to release bug-free software. This is a difficult goal for developer and medical practitioner alike. Software systems, like human beings, are highly complicated, yet as professionals we must make it our mantra to “do no harm.” We won’t ever be able to eradicate mistakes, but we can accept responsibility for them, and we can ensure we learn from them and put mechanisms in place to avoid repeating them.
By structural harm we mean introducing inflexibility into our systems, making software harder to change. To put the concept positively, it must be possible to make changes without the cost of change being exorbitantly high.
I like this analogy. I think it can also be taken a little further. Of all medical professionals, the one I would most want to be certain was observing the Hippocratic oath would be a brain surgeon. The cost of error is almost infinitely higher when operating upon the brain than when, for example, operating on a minor organ, or performing orthopedic surgery. I think this applies to the subject of this book, too.
As infrastructure developers, the software we have written builds and runs the entire infrastructure on which our production systems, the applications, and ultimately the business, operate. The cost of a bug, or of introducing structural inflexibility to the underpinning infrastructure on which our business runs, is potentially even greater than that of a bug in the application code itself. An error in the infrastructure could lead to the entire system becoming compromised or could result in an outage rendering all dependent systems unavailable.
How, then, can we take responsibility for, and excel in, our oath-keeping? How can we introduce no bugs and maintain system flexibility? The answer lies in testing.
The only way we can be confident that our code works is to test it. Thoroughly. Test it under various conditions. Test the happy path, the sad path, and the bad path. The happy path represents the default scenario, in which there are no exceptional or error conditions. The sad path shows that things fail when they should. The bad path shows how the system behaves when fed absolute rubbish. In the case of infrastructure code, we also want to verify that changes made for one platform don’t cause unexpected side effects on other platforms. The more we test, the more confident we are.
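As a small illustration of the three paths, consider the following sketch in plain Ruby. The validate_port helper and the raises? checker are hypothetical, written only for this example; a real project would more likely use a testing framework. The helper accepts a TCP port number, of the kind we might read from a node's attributes, and the checks exercise the happy, sad, and bad paths in turn.

```ruby
# Hypothetical helper: validate a TCP port supplied as configuration data.
def validate_port(value)
  port = Integer(value) # raises ArgumentError or TypeError on non-numeric input
  raise ArgumentError, "port out of range: #{port}" unless (1..65_535).cover?(port)

  port
end

# Minimal checker: true if the block raises one of the given error classes.
def raises?(*errors)
  yield
  false
rescue *errors
  true
end

checks = {
  # Happy path: a sensible value is accepted and returned as an integer.
  "happy path: valid port accepted" =>
    validate_port("8080") == 8080,
  # Sad path: a well-formed but invalid value fails when it should.
  "sad path: out-of-range port rejected" =>
    raises?(ArgumentError) { validate_port("70000") },
  # Bad path: absolute rubbish is rejected rather than silently accepted.
  "bad path: non-numeric input rejected" =>
    raises?(ArgumentError, TypeError) { validate_port("absolute rubbish") },
  "bad path: missing value rejected" =>
    raises?(ArgumentError, TypeError) { validate_port(nil) }
}

checks.each do |description, passed|
  puts "#{passed ? 'PASS' : 'FAIL'} #{description}"
end
```

Each check maps to one of the paths described above. The same pattern carries over to infrastructure code, where the input is typically attribute data and the outcome is the state of a machine.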
When it comes to protecting and guaranteeing the flexibility of our code, there’s one easy way to be confident of code flexibility. Flex it. We want our code to be easy to change. To be confident that it is easy to change, we need to make easy changes. If those easy changes prove to be difficult, we need to change the way the code works. We must be committed to regular refactoring and regular small improvements across the team. This might seem to be at odds with the principle of doing no harm. Surely the more changes we make, the more risk we are taking on. Paradoxically, this isn’t actually the case. It is far, far riskier to leave the code to stagnate with little or no attention.
As infrastructure developers, if we’re afraid to make changes to our code, that’s a big red flag. The biggest reason people are afraid to make changes is that they aren’t confident that the code won’t break. That’s because they don’t have a test harness to protect them and catch the breaks. I like to think of refactoring as a little like walking along a curbstone. When you have six inches to fall, you won’t have any fear at all. If you had to walk along a beam, four inches in width, stretching between two thirty-story buildings, I bet you’d be scared. You might be so scared that you wouldn’t even set out. The same is true of refactoring. When you have a fully tested code base, you make changes with confidence and zeal. When you have no tests at all, you avoid making changes, or undertake them with fear and dread.
The trouble is, testing takes time. Lots of testing takes lots of time. In the world of infrastructure code, testing takes even more time because the feedback loops are sometimes significantly longer than in traditional test scenarios. This makes it imperative that we automate our testing. Testing, especially for complicated, disparate systems, is also difficult. Writing good tests for code is hard to do. That makes it imperative for us to write code that is easy to test. The best way to do that is to write the tests first. We’ll discuss this in more depth later, but the essential and applicable takeaway is that consistent, automated, high-quality testing of infrastructure code is mandatory for the DevOps professional.
At this stage it’s important to acknowledge and address an obvious objection. As infrastructure developers we are asked to make a call with respect to a risk/time ratio. If thorough testing delays a release by three weeks, but delivers 100% test coverage, is this the right approach, given our maxim “do no harm”?
As is the case in many such trade-offs, there is an asymptotic curve describing a diminishing return after a certain amount of time and test coverage. It is a big step in the right direction to be making the decision consciously. Consider what part of the “brain” we are about to cut into, and what functions it performs for the body corporeal or corporate, as it were, and where we should draw our line will become clear.
I’ll summarize by making a bold philosophical statement that underpins the rest of this book:
Testing our infrastructure code, thoroughly and repeatably, is non-negotiable, and is an essential component of the infrastructure developer’s work.
This book sets out to provide encouragement for those learning to test their infrastructure code, and guidance for those already on the path. It is a call to arms for infrastructure developers (DevOps professionals, if you like) to maximize the quality, reliability, repeatability, and production-readiness of their work.