Arguments - Thinking with Data (2014)


Chapter 3. Arguments

Data consists of observations about the world—records in a database, notes in a logbook, images on a hard drive. There is nothing magical about them. These observations may prove useful or useless, accurate or inaccurate, helpful or unhelpful. At the outset, they are only observations. Observations alone are not enough to act on. When we connect observations to how the world works, we have the opportunity to make knowledge. Arguments are what make knowledge out of observations.

There are many kinds of knowledge. Sometimes we have an accurate, unimpeachable mental model of how something works. Other times we have an understanding that is just good enough. And other times still, the knowledge is not in a person at all, but in an algorithm quietly puzzling out how the world fits together. What concerns us in working with data is how to get as good a connection as possible between the observations we collect and the processes that shape our world.

Knowing how arguments work gives us special powers. If we understand how to make convincing arguments, we can put tools and techniques into their proper place as parts of a whole. Without a good understanding of arguments, we make them anyway (we cannot help ourselves, working with data), but they are more likely to be small and disconnected.

By being aware of how arguments hang together, we can better:

- Get across complicated ideas

- Build a project in stages

- Get inspiration from well-known patterns of argument

- Substitute techniques for one another

- Make our results more coherent

- Present our findings

- Convince ourselves and others that our tools do what we expect

Thinking explicitly about arguments is a powerful technique, with a long history in philosophy, law, the humanities, and academic debate. It is a more fleshed-out example of using a structure to help us think with data. Thinking about the argument we are making can come into play at any point in working with a problem—from gathering ideas at the very beginning, to ensuring that we are making sense before releasing something into the wild.

Audience and Prior Beliefs

Only in mathematics is it possible to demonstrate something beyond all doubt. When held to that standard, we find ourselves quickly overwhelmed.

Our ideal in crafting an argument is a skeptical but friendly audience, suitable to the context. A skeptical audience questions our observations and is not swayed by emotional appeals, but is not so skeptical as to be dismissive. The ideal audience is curious and humble, but not stupid. It is an idealized version of ourselves at our best, intelligent and knowledgeable but not intimately familiar with the problem at hand.

With the skeptical ideal in mind, it becomes easier to make a general argument, but it is also easier to make an argument to a specific audience. After making an argument for an ideal audience, it is easy to remove some parts and emphasize others to meet the needs of one or more particular audiences. Simplifying or expanding on certain things for an audience is fine, but lying is not. Something that good data work inherits from the scientific method is that it is bad form to cheat by preying on gullibility or ignorance. It is bad form, and in the long run it will cause the ruin of a business (or maybe a civilization).

An argument moves from statements that the audience already believes to statements they do not yet believe. At the beginning, they already agree with some statements about the world. After they hear the argument, there are new statements they will agree to that they would not have agreed to before. This is the key insight as to how an argument works—moving from prior belief to new belief to establish knowledge in a defensible way.

No audience, neither our ideal nor a real one, is 100% skeptical, a blank slate that doesn’t already believe something. Many things are already background knowledge, taken for granted. Consider an argument that a rocket is safe to launch. There are certain statements that any reasonable audience will take for granted. The laws of physics aren’t going to change mid-flight. Neither will multiplication tables. Whether those laws and our understanding of metallurgy and aerodynamics will result in a safe launch requires an argument. The background knowledge of the equations of motion and laws of chemistry does not.

Most of the prior beliefs and facts that go into an argument are never explicitly spelled out. To explicitly spell out every facet of an argument is silly. Russell and Whitehead famously took 379 pages to set up the preliminaries for the proof that 1+1=2[2]. If the audience lacks some knowledge, either because of ignorance or reasonable doubt, it is important to be able to provide that (while not going overboard).

Prior belief, knowledge, and facts extend to more than just scientific laws. Audiences have dense webs of understanding that pre-date any argument presented to them. This collection of understanding defines what is reasonable in any given situation.

For example, one not-so-obvious but common prior belief is that the data in an argument comes from the same source that the arguer says it does. This is typically taken for granted. On the other hand, there may have been corruptions in what was collected or stored, or an intruder may have tampered with the data. If these would be reasonable possibilities to our skeptical audience (say the analysis involves an experimental sensor, or the audience is full of spooks), then any argument will need to address the question of validity before continuing. There needs to be something that the audience is tentatively willing to agree to, or else there is no way forward.

The algebra or mathematical theory behind specific techniques constitutes another kind of common prior knowledge. Most arguments can safely avoid discussing these. Some real audiences may need their hand held more than others, or will be at our throats for execution details. But for the most part, the details of techniques are safely thought of as background knowledge.

Another source of prior or background knowledge is commonly known facts. Chicago is a city in the United States of America, which is a nation-state in the Northern Hemisphere on Earth, a planet. When it is necessary to compare Chicago to the whole US, their explicit relationship is rarely brought up. Of course, what is commonly understood varies by the actual audience. If the audience is in India, the location of Chicago in America may be an open issue. At that point, the audience will believe the atlas, and we’ll be back to something that is commonly accepted.

“Wisdom” is also taken for granted in many arguments. When I worked at an online dating site, it was commonly taken for granted in internal discussions that men are more likely to send messages to women than the other way around. It was something that we had verified before, and had no reason to think had changed drastically. In the course of making an argument, it was already commonly understood by the audience and didn’t need to be spelled out or verified, though it could have been.

Not all wisdom can be verified. In the online dating example, we assumed that most of the people who had filled out their profiles actually corresponded to genuine human beings. We assumed that, for the most part, their gender identity offline matched their identity online. It may be painful to take some ideas for granted, but it is a necessity. We should aspire to act reasonably given the data at hand, not require omniscience to move forward. People rarely require omniscience in practice, but it might surprise you how many people seem to think it is a prerequisite for an explanation of how arguments work.

BUILDING AN ARGUMENT

Visual schematics (lines, boxes, and other elements that correspond to parts of an argument) can be a useful way to set up an argument, but I have found that sentences and fragments of sentences are actually far more flexible and more likely to make their way into actual use.

Prior facts and wisdom will typically enter into an argument without being acknowledged, so there is not much to show here at this point. As we go through the parts of an argument, examples of how each idea is produced in written language will appear here, in these sidebars.

To make the ideas more transparent, I have also marked the concepts introduced in each section with tags like (Claim). In a real argument, it’s rare to explicitly call out claims and evidence and so on, but it is nevertheless instructive to try while developing an argument. One more thing to note: the following example is made up. It is plausible, but unresearched. Anybody found citing it as truth down the line will be publicly shamed in a future edition of this book.

Claims

Arguments are built around claims. Before hearing an argument, there are some statements the audience would not endorse. After all the analyzing, mapping, modeling, graphing, and final presentation of the results, we think they should agree to these statements. These are the claims. Put another way, a claim is a statement that could be reasonably doubted but that we believe we can make a case for. All arguments contain one or more claims. There is often, but not necessarily, one main claim, supported by one or more subordinate claims.

Suppose that we needed to improve the safety of a neighborhood that has been beset by muggings. We analyze the times and places where muggings happen, looking for patterns. Our main claim is that police officers should patrol more at these places and times. Our subordinate claims are that there is a problem with muggings; that the lack of police at certain places and times exacerbates the problem; and that the added cost of such deployments is reasonable given the danger. The first and last points may or may not require much of an argument, depending on the audience, whereas the second will require some kind of analysis for sure.

Note that the claim is in terms that the decision makers actually care about. In this case, they care about whether the lack of police in certain places and times exacerbates muggings, not what model we built and what techniques we used to assess that model’s fit. Being able to say that our model has good generalization error will end up being an important part of making a tight argument, but it functions as support, not as a big idea.

In the details justifying our claim, we could swap out another technique for assessing the quality of our model to show that we had a reasonable grasp of the patterns of muggings. Different techniques[3] make different assumptions, but some might be chosen purely for practical reasons. The techniques are important, but putting them into an argument frame makes it obvious which parts are essential and which are accidental.

Let us turn our attention back to the task of predicting how public transit affects real estate prices over time. Here is where thinking about the full problem really shines. There is no single statistical tool that is sufficient to create confidence in either a causal relationship or the knowledge that a pattern observed should continue to hold in the future. There are collections of things we can do, held together by strong arguments (such as sensitivity analysis or a paired design) that can do the job and make a case for a causal relationship. None of them are purely statistical techniques; they all require a case be made for why they are appropriate beyond how well they fit the data.

CLAIMS

(Claim) A 5% reduction in average travel time to the rest of the city results in a 10% increase in apartment prices, with the reverse true as well, an effect which persists. (Subclaim) We know this is true because we have looked back at historical apartment prices and transit access and seen that the effect persists. (Subclaim) More importantly, when there have been big shocks in transit access, like the opening of a new train stop or the closing of a bus route, there have been effects of this magnitude on apartment prices visible within a year.

Evidence, Justification, and Rebuttals

A key part of any argument is evidence. Claims do not demonstrate themselves. Evidence is the introduction of facts into an argument.

Our evidence is rarely raw data. Raw data needs to be transformed into something else, something more compact, before it can be part of an argument: a graph, a model, a sentence, a map. It is rare that we can get an audience to understand something just from lists of facts. Transformations make data intelligible, allowing raw data to be incorporated into an argument.

A transformation puts an interpretation on data by highlighting things that we take to be essential. Counting all of the sales in a month is a transformation, as is plotting a graph or fitting a statistical model of page visits against age, or making a map of every taxi pickup in a city.

Returning to our transit example, if we just wanted to show that there is some relationship between transit access and apartment prices, a high-resolution map of apartment prices overlaid on a transit map would be reasonable evidence, as would a two-dimensional histogram or scatterplot of the right quantities. For the bolder claims (for example, that the relationship is predictable in some way and can be forecast into the future), we need more robust evidence, like sophisticated models. These, too, are ways of transforming data into evidence to serve as part of an argument.
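A transformation of this kind can be sketched in a few lines. The listings below are entirely invented, and the distance bands are an arbitrary choice; the point is only the shape of the move from raw records to something an audience can read:

```python
# A minimal sketch of turning raw listings into evidence: bin hypothetical
# apartment prices by distance to the nearest transit stop and summarize
# each band with its median. All numbers are made up for illustration.
from statistics import median

# (distance_to_nearest_stop_in_blocks, price_in_dollars) -- invented records
listings = [
    (1, 950_000), (1, 875_000), (2, 820_000), (2, 790_000),
    (3, 700_000), (4, 640_000), (5, 600_000), (6, 560_000),
]

def median_price_by_distance(listings):
    """Group listings into distance bands and take the median price of each."""
    bands = {}
    for distance, price in listings:
        bands.setdefault(distance, []).append(price)
    return {d: median(prices) for d, prices in sorted(bands.items())}

print(median_price_by_distance(listings))
```

The resulting table (or the bar chart drawn from it) is the evidence; the raw listings never appear in the argument at all.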

EVIDENCE AND TRANSFORMATIONS

A 5% reduction in average travel time to the rest of the city results in a 10% increase in apartment prices, with the reverse true as well, an effect which persists. We know this is true because we have looked back at historical apartment prices and transit access and seen that the effect persists. (Transformation) This graph, based on (Evidence) 20 years of raw price data from the City, demonstrates how strong the relationship is.

More importantly, when there have been big shocks in transit access, like the opening of a new train stop, or the closing of a bus route, there have been effects on apartment prices visible within a year. (Transformation, Evidence) Average prices of apartments for each of the following five blocks, with lines indicating the addition or closure of a new train route within two blocks of that street, demonstrate the rapid change. On average, closing a train stop results in a (Transformation, Evidence) 10% decline in apartment prices for apartments within two blocks over the next year.

If I claim that the moon is made of cheese and submit pictures of the moon as evidence, I have supplied two of the necessary ingredients, a claim and evidence, but am missing a third. We need some justification of why this evidence should compel the audience to believe our claim. We need a reason, some logical connection, to tie the evidence to the claim. The reason that connects the evidence to the claim is called the justification, or sometimes the warrant.

The simplest justification of all would be that the claim is self-evident from the evidence. If we claim that Tuesdays have the highest average sales in the past year, and our evidence is that we have tabulated sales for each day over the past year and found that Tuesday is the highest, our claim is self-evident. The simplest factual arguments are self-evident; these are basically degenerate arguments, where the claim is exactly the evidence, and the justification is extraneous.
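The Tuesday example is small enough to write out in full. The sales records below are invented; the claim is nothing more than the tabulation itself:

```python
# A self-evident argument in miniature: tabulate hypothetical sales by
# weekday and read off the highest average. The claim *is* the evidence.
from collections import defaultdict

# (weekday, sale_amount) -- invented records for illustration
sales = [
    ("Mon", 120), ("Tue", 340), ("Wed", 150), ("Tue", 280),
    ("Thu", 90), ("Fri", 200), ("Tue", 310), ("Mon", 130),
]

totals = defaultdict(float)
counts = defaultdict(int)
for day, amount in sales:
    totals[day] += amount
    counts[day] += 1

averages = {day: totals[day] / counts[day] for day in totals}
best_day = max(averages, key=averages.get)
print(best_day, averages[best_day])  # Tue 310.0
```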

Consider a slightly more sophisticated example: that map of home prices laid over a transit map. For this map, more expensive homes read as brighter blocks. The claim is that transit access is associated with higher prices. The evidence is the map itself, ready to be looked at. The justification might be termed visual inspection. By looking at the areas where there are highly priced homes, and seeing that transit stops are nearby, we are making a somewhat convincing argument that the two are related. Not all arguments are solid ones.

Or consider a regression model, where, for example, average prices are modeled by some function of distance to each transit line. The evidence is the model. One possible subclaim is that the model is accurate enough to be used; then the justification is cross-validation, with accuracy measured in dollars.

Another subclaim is that distance from the Lexington Avenue line has the largest effect on prices; then the justification might be a single-variable validation procedure[4]. This illustrates a crucial point: the same data can be used to support a variety of claims, depending on what justifications we draw out.
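One such single-variable procedure is the bootstrap: refit the effect of interest on many resampled copies of the data and see whether its sign and rough magnitude hold up. Again, every number below is invented:

```python
# A sketch of a bootstrap justification for a single-variable effect:
# resample the data with replacement, refit the slope each time, and
# look at the spread of the estimates. All data is synthetic.
import random

random.seed(1)
data = [(d, 900_000 - 60_000 * d + random.gauss(0, 20_000)) for d in range(1, 21)]

def slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return (sum((x - mx) * (y - my) for x, y in points)
            / sum((x - mx) ** 2 for x, _ in points))

def bootstrap_slopes(data, reps=1000):
    """Refit the slope on resampled-with-replacement copies of the data."""
    return [slope([random.choice(data) for _ in data]) for _ in range(reps)]

slopes = sorted(bootstrap_slopes(data))
low, high = slopes[25], slopes[-26]  # a rough 95% interval
print(low, high)
```

If the whole interval stays on one side of zero, the audience has a reason to believe the effect is not an artifact of this particular sample.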

There are always reasons why a justification won’t hold in a particular case, even if it is sound in general. Those reasons are called the rebuttals. A rebuttal is the yes-but-what-if question that naturally arises in any but the most self-evident arguments.

Consider an attempt to show that a medication is effective at treating warts. If our claim is that the medication cures warts, our evidence is a randomized controlled trial, and our justification is that randomized controlled trials are evidence of causal relationships, then common rebuttals would be that the randomization may have been done improperly, that the sample size may have been too small, or that there may be other confounding factors attached to the treatment. It pays to be highly aware of the rebuttals to our arguments, because those are the things we will need to answer when we need to make a watertight argument.

In the case that we are using visual inspection to justify our claim that there is a relationship between apartment prices and transit lines, the rebuttal is that visual inspection may not be particularly clear, given that the data will be noisy. There will be highly priced places that are not near public transit lines, and places that have low prices and are on public transit lines.

For a justification of cross-validation, a rebuttal might be that the data is outdated, that the error function we chose is not relevant, or that the sample size is too small. There are always some things that render even the best techniques incorrect.

Finally, all justifications provide some degree of certainty in their conclusions, ranging from possible, to probable, to very likely, to definite. This is known as the degree of qualification of an argument. Deductive logic (Tim O’Reilly is a man; All men are mortal; Therefore, Tim O’Reilly is mortal) provides definite certainty in its conclusions, but its use in practice is limited. Having some sense of how strong our result is keeps us from making fools of ourselves.

ADDING JUSTIFICATIONS, QUALIFICATIONS, AND REBUTTALS

A 5% reduction in average travel time to the rest of the city results in a 10% increase in apartment prices, with the reverse true as well, an effect which persists. We know this is true because we have looked back at historical apartment prices and transit access and seen that the effect persists. The graph (Justification) shows the predictive power of a model trained on each year’s data in predicting the following year (based on 20 years of raw price data from the City), and demonstrates that a relationship is (Qualification) very likely.

More importantly, when there have been big shocks in transit access, like the opening of a new train station or the closing of a bus route, there were effects on apartment prices visible within a year. (Justification) Because the changes are so tightly coupled and happen together far more frequently than would be expected by chance, we can conclude that the changes in transit access are causing the changes in apartment prices. (Qualification) This leads us to believe in a very strong probability of a relationship.

Average prices of apartments for each of the following five blocks, with lines indicating the addition or closure of a new train route within two blocks of that street, demonstrate the rapid change. On average, closing a train stop results in a 10% decline in apartment prices for apartments within two blocks over the next year, a relationship which roughly holds in at least (Qualification) 70% of cases.

(Rebuttal) There are three possible confounding factors. First, there may be nicer apartments being built closer to train lines. (Prior knowledge) This is clearly not at issue, because new construction or widespread renovation of apartments (which would raise their value) or letting apartments decline (which would lower their value) all take place over longer time scales than the price changes take place in. Second, there may be large price swings, even in the absence of changing transit access. (Prior knowledge) This is also not an issue, because the average price change between successive years for all apartments is actually only 5%. Third, it might be the case that transit improvements or reductions and changes in apartment price are both caused by some external change, like a general decline in neighborhood quality. (Prior knowledge) This is impossible to rule out, but generally speaking, the city in question has added transit options spottily and rent often goes up regardless of these transit changes.

Deep Dive: Improving College Graduation Rates

Another extended example is in order, this one slightly more technical. Suppose that a university is interested in starting a pilot program to offer special assistance to incoming college students who are at risk of failing out of school (the context). What it needs is a way to identify who is at risk and find effective interventions (the need). We propose to create a predictive model of failure rates and to assist in the design of an experiment to test several interventions (the vision). If we can make a case for our ideas, the administration will pay for the experiments to test them; if the experiments are successful, the program will be scaled up soon after (the outcome).

There are several parts to our work. The first major part is to build a model to predict the chance of failure. The second is to make a case for an experiment to test if the typical interventions (for example, guidance, free books on study habits, providing free tutoring) are effective for at-risk students. Just identifying students at risk of failing out isn’t enough.

When we think about building the model and designing the experiment, we need to put our findings in terms that the decision makers will understand. In this case, that’s probably a well-calibrated failure probability rather than a black-box failure predictor. It also entails presenting the experiments in terms of the range of cost per student retained, or the expected lift in graduation rates, not Type I and Type II errors.[5]

In order to make the experiment easy to run, we can restrict ourselves to students who fail out after the first year. If we’re lucky, the majority of failouts will happen in the first year, making the experiment even more meaningful. Performing a quick analysis of dropout rates will be enlightening, as evidence to support the short time span of our experiment (or counter-evidence to support a longer experiment, if the dropouts tend to happen at the end of a degree program).

It would be prudent to interview the decision makers and ask them what they think a reasonable range of expense would be for raising the graduation rate by 1%. By setting goals, we can assess whether our models and experiments are plausibly useful. It is important to find the current and historical graduation rate for the university, the general cost of each intervention, and to play around with those numbers. Having a general sense of what success would look like for a model will help set the bar for the remaining work. It is also important to recognize that not all students who drop out should or could have been helped enough to stay enrolled. Some admitted students were actually false positives to begin with.

To make this project useful, we need to define “failout” and differentiate a failout from a dropout, which might not be caused by poor academic performance. If the university already has a good definition of failout, we should use that. If not, we have to justify a definition. Something like: a failout is a student who drops out either when he is forced to do so by failing grades, or when he drops out on his own accord but was in the bottom quartile of grades for freshmen across the university. To justify this definition, we can argue that students who drop out on their own and had poor grades are probably dropping out in part because they performed poorly.
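The proposed definition translates directly into code. The field names and the quartile cutoff below are assumptions made for illustration, not any university's actual schema:

```python
# A sketch of the proposed "failout" definition. Field names and the
# example GPAs are hypothetical, chosen only to make the rule concrete.
def bottom_quartile_cutoff(gpas):
    """GPA at roughly the 25th percentile of the freshman class."""
    ordered = sorted(gpas)
    return ordered[len(ordered) // 4]

def is_failout(student, cutoff):
    """Forced out by failing grades, or left voluntarily with a bottom-quartile GPA."""
    if student["dismissed_for_grades"]:
        return True
    return student["dropped_out"] and student["gpa"] <= cutoff

# invented freshman GPAs for illustration
class_gpas = [3.8, 3.5, 3.2, 3.0, 2.7, 2.4, 2.0, 1.6]
cutoff = bottom_quartile_cutoff(class_gpas)

print(is_failout({"dismissed_for_grades": False, "dropped_out": True, "gpa": 1.8}, cutoff))
```

Writing the rule down this explicitly is itself part of the justification: anyone who disputes the definition knows exactly what to dispute.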

Now we can turn our attention to the modeling, trying to predict the probability that someone will fail out in her first year. Based on some brainstorming, we identify some potentially predictive facts about each student, plan our next steps, and collect data. If we have access to student records and applications, we might start with 10,000 records, including each student’s age, high school GPA, family history, initial courseload, declared major, and so on. This step, acquiring and cleaning the data, will probably account for half or three quarters of the time we spend, so it pays to make a series of mockups along the way to show what we expect to get out of our transformations.

Consider for a moment what we will find if we get a good classifier. Suppose that about 10% of first-year undergraduates fail out. A successful classifier will take that 10% and spread it among some people with a large risk and some with a small risk, retaining the average of 10% across all students. A good classifier will be as spread out as possible while simultaneously having a high correspondence with reality. So in the ideal scenario, for example, we can identify students with 50–60% risk of failing out according to the model, for whom 50–60% do actually fail out in reality.
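A calibration check of this kind can be sketched directly: bucket students by predicted risk and compare each bucket's average prediction with the actual failout rate. The predictions and outcomes below are invented:

```python
# A sketch of a calibration table. For a well-calibrated model, the mean
# predicted risk in each bucket should be close to the observed failout
# rate in that bucket. Predictions and outcomes here are synthetic.
from collections import defaultdict

# (predicted_failure_probability, actually_failed_out) -- invented pairs
students = [
    (0.05, 0), (0.08, 0), (0.06, 0), (0.07, 0),
    (0.55, 1), (0.58, 0), (0.52, 1), (0.60, 0),
]

def calibration_table(students, bin_width=0.1):
    """Map each risk bucket to (mean predicted risk, observed failout rate)."""
    buckets = defaultdict(list)
    for p, failed in students:
        buckets[int(p / bin_width)].append((p, failed))
    table = {}
    for b, pairs in sorted(buckets.items()):
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        actual = sum(f for _, f in pairs) / len(pairs)
        table[b] = (round(mean_pred, 2), round(actual, 2))
    return table

print(calibration_table(students))
```

A table like this, rather than a single accuracy number, is what a decision maker can actually reason about.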

Our argument will be the following: (Claim) We have the ability to forecast which students will fail out accurately enough to try to intervene. (Subclaim) Our definition of failout (students who either are asked to leave due to poor grades or leave on their own while in the bottom quartile of their freshman class) is consistent with what the administration is interested in. (Justification for the first claim) We can group students we did not fit the model on and see that the predictions closely match the reality. The error in our predictions is only a few percent on average, accurate enough to be acceptable. (Justification for the subclaim) Students who dropped out on their own but had poor grades have essentially failed out; this idea is consistent with how the term is used in other university contexts.

Let’s return to our second need, which was to design an experiment. Depending on how much money we have, how many students we have to work with, and how expensive each intervention might be, there are different experimental designs we can pursue. At a high level, we can imagine splitting students into groups by their risk level, as well as by intervention or combinations of interventions. If money is tight, splitting students into just high- and low-risk groups and choosing students at random from there would be fine. Based on the expectations laid out in discussions with decision makers, we can set minimum thresholds for detection and choose our sample sizes appropriately. And using our risk model, we can set appropriate prior probabilities for each category.
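The sample-size step can be roughed out with the standard two-proportion approximation. The baseline and target failout rates below are assumptions, not results:

```python
# A back-of-the-envelope sample-size sketch: students per arm needed to
# detect a drop in failout rate, using the common two-proportion formula
# at 5% significance and 80% power. The rates below are assumptions.
from math import ceil, sqrt

def n_per_arm(p_control, p_treatment, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per experimental arm."""
    p_bar = (p_control + p_treatment) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treatment * (1 - p_treatment))) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)

# e.g. a 10% baseline failout rate, hoping an intervention cuts it to 6%
print(n_per_arm(0.10, 0.06))
```

Running numbers like these before committing to a design is what keeps the experiment honest about what it can and cannot detect.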

With our model accuracy in hand, we can also estimate the range of effectiveness the various interventions might have, and the associated cost. If, for example, the best our risk model could do was separate out students who were above or below average risk, “at-risk” students could be anywhere from 11% to 100% likely to fail out. With the right distribution of students, it’s plausible that almost all of the money will be spent on students who were going to pass anyway. With some thought, we can derive more reasonable categories of high and low risk. With those, we can derive the rough range of cost and benefit in the best- and worst-case post-intervention outcomes.
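The cost-benefit arithmetic itself is back-of-the-envelope. Every number below (cohort size, true risk rate, effect size, per-student cost) is invented, but the shape of the calculation is the useful part:

```python
# A sketch of the cost-per-retained-student arithmetic. All inputs are
# hypothetical; the point is to bound the cost range before experimenting.
def cost_per_retained_student(cohort_size, risk_rate, effect, cost_per_student):
    """Total spend divided by the expected number of students retained."""
    total_cost = cohort_size * cost_per_student   # everyone flagged is treated
    retained = cohort_size * risk_rate * effect   # expected failouts prevented
    return total_cost / retained

# 500 flagged students, 40% truly at risk, the intervention saves 1 in 4
# of those, at $200 per student treated
print(round(cost_per_retained_student(500, 0.40, 0.25, 200)))  # -> 2000
```

Repeating this with the best- and worst-case values of the risk rate and effect size gives exactly the rough range of cost and benefit the paragraph above describes.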

Our argument for the second need is as follows, assuming that the math works out: (Claim) The experiment is worth performing. (Subclaim) The high and low ranges of costs for our experiment are low compared to a reasonable range of benefits. (Subclaim) Furthermore, their cost is fairly low overall, because we are able to target our interventions with accuracy and we can design the experiment to fit within the budget. (Claim) The right way to perform the experiment is to break up students into high- and low-risk groups, and then choose students from the population at random to minimize confounding factors…and so on.

In a fully worked-out project, these arguments would have actual data and graphs to provide evidence. It is also not right to expect that we will arrive at these arguments out of the blue. Like everything else, arguments are best arrived at iteratively. Early versions of arguments will have logical errors, misrepresentations, unclear passages, and so on. An effective countermeasure is to try to inhabit the mind of our friendly but skeptical audience and see where we can find holes. An even better approach is to find a real-life friendly skeptic to try to explain our argument to.


[2] Volume 1 of Principia Mathematica by Alfred North Whitehead and Bertrand Russell (Cambridge University Press, page 379). The proof was actually not completed until Volume 2.

[3] In this case, for example, Bayes factors or cross-validation.

[4] Such as bootstrapping or a t-test.

[5] For more information on modeling and measurement in a business context, and other in-depth discussions of topics raised throughout this book, see Data Science for Business by Foster Provost and Tom Fawcett (O’Reilly Media, 2013).