Why Rust? - Programming Rust (2016)


Chapter 1. Why Rust?

Systems programming languages have come a long way in the 50 years since we started using high-level languages to write operating systems, but two thorny problems in particular have proven difficult to crack:

§ It’s difficult to write secure code. It’s common for security exploits to leverage bugs in the way C and C++ programs handle memory, and it has been so at least since the Morris virus, the first Internet virus to be carefully analyzed, took advantage of a buffer overflow bug to propagate itself from one machine to the next in 1988.

§ It’s very difficult to write multithreaded code, which is the only way to exploit the abilities of modern machines. Each new generation of hardware brings us, instead of faster processors, more of them; now even midrange mobile devices have multiple cores. Taking advantage of this entails writing multithreaded code, but even experienced programmers approach that task with caution: concurrency introduces broad new classes of bugs, and can make ordinary bugs much harder to reproduce.

These are the problems Rust was made to address.

Rust is a new systems programming language designed by Mozilla. Like C and C++, Rust gives the developer fine control over the use of memory, and maintains a close relationship between the primitive operations of the language and those of the machines it runs on, helping developers anticipate their code’s costs. Rust shares the ambitions Bjarne Stroustrup articulates for C++ in his paper “Abstraction and the C++ machine model”:

In general, C++ implementations obey the zero-overhead principle: What you don’t use, you don’t pay for. And further: What you do use, you couldn’t hand code any better.

To these Rust adds its own goals of memory safety and data-race-free concurrency.

The key to meeting all these promises is Rust’s novel system of ownership, moves, and borrows, checked at compile time and carefully designed to complement Rust’s flexible static type system. The ownership system establishes a clear lifetime for each value, making garbage collection unnecessary in the core language, and enabling sound but flexible interfaces for managing other sorts of resources like sockets and file handles.
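To make the distinction between a borrow and a move concrete, here is a minimal sketch. The helper names `total_len` and `consume` are hypothetical, introduced only for illustration; they are not part of any library.

```rust
// Hypothetical helper: borrows the strings, so the caller keeps ownership.
fn total_len(words: &[String]) -> usize {
    words.iter().map(|w| w.len()).sum()
}

// Hypothetical helper: takes ownership of the vector. The vector and its
// heap storage are freed when this function returns, with no garbage
// collector involved.
fn consume(words: Vec<String>) -> usize {
    words.len()
}

fn main() {
    let words = vec![String::from("own"), String::from("borrow")];

    // Shared borrow: `words` remains usable after the call.
    let n = total_len(&words);
    println!("total length: {}", n);

    // Move: ownership of the vector passes to `consume`.
    let count = consume(words);
    println!("count: {}", count);

    // Uncommenting the next line is a compile-time error,
    // because `words` was moved in the call above:
    // println!("{:?}", words);
}
```

The point of the sketch is that the lifetime of the vector is settled entirely at compile time: the compiler knows where the value is dropped, and rejects any use after that point.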

These same ownership rules also form the foundation of Rust’s trustworthy concurrency model. Most languages leave the relationship between a mutex and the data it’s meant to protect to the comments; Rust can actually check at compile time that your code locks the mutex while it accesses the data. Most languages admonish you to be sure not to use a data structure yourself after you’ve sent it via a channel to another thread; Rust checks that you don’t. Rust is able to prevent data races at compile time.
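Both claims can be sketched with the standard library's `Mutex` and `mpsc` channel types. The function names `count_with_mutex` and `sum_via_channel` are hypothetical, chosen for this illustration; treat the code as one possible arrangement rather than a prescribed pattern.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: `threads` workers each increment a shared counter
// `per_thread` times.
fn count_with_mutex(threads: usize, per_thread: u64) -> u64 {
    // The counter lives inside the mutex, so the only way to reach it
    // is through `lock()`. Forgetting to lock is not expressible.
    let counter = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..per_thread {
                *counter.lock().unwrap() += 1;
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

// Hypothetical helper: hands a vector to another thread over a channel
// and returns the sum that thread computes.
fn sum_via_channel(data: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let received: Vec<u64> = rx.recv().unwrap();
        received.iter().sum::<u64>()
    });
    tx.send(data).unwrap();
    // `data` was moved into the channel; using it here would not compile:
    // println!("{:?}", data);
    worker.join().unwrap()
}

fn main() {
    println!("counter: {}", count_with_mutex(4, 1000));
    println!("sum: {}", sum_via_channel(vec![1, 2, 3]));
}
```

In both helpers the safety property is enforced by types and ownership rather than by convention: the mutex owns its data, and sending a value through the channel gives up ownership of it.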

Mozilla and Samsung have been collaborating on an experimental new web browser engine named Servo, written in Rust. Servo’s needs and Rust’s goals are well matched: as programs whose primary use is handling untrusted data, browsers must be secure; and as the Web is the primary interactive medium of the modern Net, browsers must perform well. Servo takes advantage of Rust’s sound concurrency support to exploit as much parallelism as its developers can find, without compromising its stability. As of this writing, Servo is roughly 100,000 lines of code, and Rust has adapted over time to meet the demands of development at this scale.

Type Safety

But what do we mean by “type safety”? Safety sounds good, but what exactly are we being kept safe from?

Here’s the definition of “undefined behavior” from the 1999 standard for the C programming language, known as “C99”:

3.4.3

undefined behavior

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

Consider the following C program:

int main(int argc, char **argv) {
    unsigned long a[1];
    a[3] = 0x7ffff7b36cebUL;
    return 0;
}

According to C99, because this program accesses an element off the end of the array a, its behavior is undefined, meaning that it can do anything whatsoever. This morning, running the program on Jim’s laptop produces the output:

undef: Error: .netrc file is readable by others.

undef: Remove password or make file unreadable by others.

Then it crashes. This computer doesn’t even have a .netrc file.

The machine code the C compiler generated for this main function happens to place the array a on the stack three words before the return address, so storing 0x7ffff7b36cebUL in a[3] changes poor main’s return address to point into the midst of code in the C standard library that consults one’s .netrc file for a password. When main returns, execution resumes not in main’s caller, but at the machine code for these lines from the library:

warnx(_("Error: .netrc file is readable by others."));
warnx(_("Remove password or make file unreadable by others."));
goto bad;

In allowing an array reference to affect the behavior of a subsequent return statement, the C compiler is fully standards-compliant. An “undefined” operation doesn’t just produce an unspecified result: it is allowed to cause the program to do anything at all.

The C99 standard grants the compiler this carte blanche to allow it to generate faster code. Rather than making the compiler responsible for detecting and handling odd behavior like running off the end of an array, the standard makes the C programmer responsible for ensuring those conditions never arise in the first place.

Empirically speaking, we’re not very good at that. The 1988 Morris virus had various ways to break into new machines, one of which entailed tricking a server into executing an elaboration on the technique shown above; the “undefined behavior” produced in that case was to download and run a copy of the virus. (Undefined behavior is often sufficiently predictable in practice to build effective security exploits from.) The same class of exploit remains in widespread use today. While a student at the University of Utah, researcher Peng Li modified C and C++ compilers to make the programs they translated report when they executed certain forms of undefined behavior. He found that nearly all programs do, including those from well-respected projects that hold their code to high standards.

In light of that example, let’s define some terms. If a program has been written so that no possible execution can exhibit undefined behavior, we say that program is well defined. If a language’s type system ensures that every program is well defined, we say that language is type safe.

C and C++ are not type safe: the program shown above has no type errors, yet exhibits undefined behavior. By contrast, Python is type safe. Python is willing to spend processor time to detect and handle out-of-range array indices in a friendlier fashion than C:

>>> a = [0]
>>> a[3] = 0x7ffff7b36ceb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range
>>>

Python raised an exception, which is not undefined behavior: the language specifies that the assignment to a[3] should raise an IndexError exception, as we saw. As a type-safe language, Python assigns a meaning to every operation, even if that meaning is just to raise an exception. Java, JavaScript, Ruby, and Haskell are also type safe: every program those languages will accept at all is well defined.

NOTE

Note that being type safe is mostly independent of whether a language checks types at compile time or at runtime: C checks at compile time, and is not type safe; Python checks at runtime, and is type safe. Any practical type-safe language must do at least some checks (array bounds checks, for example) at runtime.

It is ironic that the dominant systems programming languages, C and C++, are not type safe, while most other popular languages are. Given that C and C++ are meant to be used to implement the foundations of a system, entrusted with implementing security boundaries and placed in contact with untrusted data, type safety would seem like an especially valuable quality for them to have.

This is the decades-old tension Rust aims to resolve: it is both type safe and a systems programming language. Rust is designed for implementing those fundamental system layers that require performance and fine-grained control over resources, yet still guarantees the basic level of predictability that type safety provides. We’ll look at how Rust manages this unification in more detail in later parts of this book.
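As a rough Rust counterpart to the C and Python examples above, the following sketch attempts the same out-of-range store. The helper `try_store` is hypothetical and exists only for this illustration; it uses the standard slice method `get_mut`, which returns `None` for an index that is out of range.

```rust
// Hypothetical helper: store `value` at `index`, reporting failure
// instead of panicking when the index is out of range.
fn try_store(a: &mut [u64], index: usize, value: u64) -> bool {
    match a.get_mut(index) {
        Some(slot) => {
            *slot = value;
            true
        }
        None => false,
    }
}

fn main() {
    let mut a = [0u64; 1];

    // Index 3 is off the end of a one-element array, so nothing is written.
    let stored = try_store(&mut a, 3, 0x7ffff7b36ceb);
    println!("stored: {}, a = {:?}", stored, a);

    // Plain indexing, `a[index] = value`, is checked too: an out-of-range
    // index causes a panic, which is an orderly and well-defined failure,
    // never a stray write into neighboring memory.
}
```

Either way, the operation has a defined meaning: the store is refused or the program panics, and the return address of `main` is never at risk.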

Type safety might seem like a modest promise, but it starts to look like a surprisingly good deal when we consider its consequences for multithreaded programming. Concurrency is notoriously difficult to use correctly in C and C++; developers usually turn to concurrency only when single-threaded code has proven unable to achieve the performance they need. But Rust’s particular form of type safety guarantees that concurrent code is free of data races, catching any misuse of mutexes or other synchronization primitives at compile time, and permitting a much less adversarial stance towards exploiting parallelism.
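A small sketch of that less adversarial stance, assuming a toolchain recent enough to provide `std::thread::scope` (Rust 1.63 or later, which postdates this book). The function name `parallel_sum` is hypothetical.

```rust
use std::thread;

// Hypothetical helper: sums `data` using up to `parts` threads, each of
// which borrows a disjoint, read-only chunk of the slice.
fn parallel_sum(data: &[u64], parts: usize) -> u64 {
    if data.is_empty() {
        return 0;
    }
    let parts = parts.max(1);
    // Ceiling division, so every element lands in some chunk.
    let chunk_len = (data.len() + parts - 1) / parts;

    thread::scope(|s| {
        // Spawn every worker first, then join them all.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("sum: {}", parallel_sum(&data, 4));
}
```

The threads only read shared data, which the compiler permits. Had the closures tried to add into one shared accumulator without a mutex or an atomic, the program would have been rejected at compile time rather than racing at runtime.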

NOTE

Rust does provide for unsafe code, functions or lexical blocks that the programmer has marked with the unsafe keyword, within which some of Rust’s type rules are relaxed. In an unsafe block, you can use unrestricted pointers, treat blocks of raw memory as if they contained any type you like, call any C function you want, use inline assembly language, and so on.

Whereas in ordinary Rust code the compiler guarantees your program is well defined, in unsafe blocks it becomes the programmer’s responsibility to avoid undefined behavior, as in C and C++. As long as the programmer succeeds at this, unsafe blocks don’t affect the safety of the rest of the program. Rust’s standard library uses unsafe blocks to implement features that are themselves safe to use, but which the compiler isn’t able to recognize as such on its own.

The great majority of programs do not require unsafe code, and Rust programmers generally avoid it, since it must be reviewed with special care. Except where explicitly noted otherwise, you may assume that this book is discussing the safe portion of the language.
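To illustrate the idea of a safe interface built on an unsafe block, here is a deliberately small sketch. The function `first_and_last` is hypothetical, and ordinary safe indexing would serve just as well here; the example only shows where the programmer's responsibility begins and ends. It relies on the standard slice method `get_unchecked`, which skips the bounds check and may therefore only be called in unsafe code.

```rust
// Hypothetical helper: returns the first and last elements of a slice,
// or `None` if the slice is empty. Callers need no unsafe code.
fn first_and_last(items: &[i32]) -> Option<(i32, i32)> {
    if items.is_empty() {
        return None;
    }
    let last = items.len() - 1;
    // SAFETY: the slice is non-empty, so indices 0 and `last` are both
    // in bounds. Upholding this is the programmer's job, not the compiler's.
    unsafe { Some((*items.get_unchecked(0), *items.get_unchecked(last))) }
}

fn main() {
    println!("{:?}", first_and_last(&[10, 20, 30]));
    println!("{:?}", first_and_last(&[]));
}
```

Because the emptiness check happens inside the function, no caller can trigger undefined behavior through it, which is the sense in which the unsafe block does not affect the safety of the rest of the program.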