Patterns of Reasoning - Thinking with Data (2014)

Chapter 4. Patterns of Reasoning

One of the great benefits of studying arguments is that we can draw inspiration from patterns that have been noticed and explored by others. Instead of bushwhacking our way through the forest, we have a map to lead us to well-worn trails that take us where we need to go.

We can’t simply lift up the patterns that structure arguments in other disciplines and plop them down precisely into data science. There are big differences between a courtroom, a scientific dispute, a national policy debate, and the work that we do with data in a professional setting. Instead, it is possible to take insights from patterns in other fields and mold them to fit our needs.

There are three groups of patterns we will explore. The first group, called categories of disputes, provides a framework for understanding how to make a coherent argument. The next group, called general topics, gives general strategies for making arguments. The last group, called special topics (or special arguments), covers strategies for making arguments that are specific to working with data. Causal reasoning, which is a special topic, is so important that it is covered separately in Chapter 5.

Categories of Disputes

A very powerful way to organize our thoughts is by classifying each point of dispute in our argument. A point of dispute is the part of an argument where the audience pushes back, the point where we actually need to make a case to win over the skeptical audience. All but the most trivial arguments make at least one point that an audience will be rightfully skeptical of. Such disputes can be classified, and the classification tells us what to do next. Once we identify the kind of dispute we are dealing with, the issues we need to demonstrate follow naturally.

Ancient rhetoricians created a classification system for disputes. It has been adapted by successive generations of rhetoricians to fit modern needs. A point of dispute will fall into one of four categories: fact, definition, value, and policy.

Once we have identified what kind of dispute we are dealing with, automatic help arrives in the form of stock issues. Stock issues tell us what we need to demonstrate in order to overcome the point of contention. Once we have classified what kind of thing it is that is under dispute, there are specific subclaims we can demonstrate in order to make our case. If some of the stock issues are already believed by the audience, then we can safely ignore those. Stock issues greatly simplify the process of making a coherent argument.

We can also use any of these patterns of argument in the negative. Each list of stock issues also forms a natural list of rebuttals in the event that we want to argue against a policy or particular value judgment. It is a very flexible technique.

FACT

A dispute of fact turns on what is true, or on what has occurred. Such disagreements arise when there are concrete statements that the audience is not likely to believe without an argument. Disputes of fact are often smaller components of a larger argument. A particularly complicated dispute of fact may depend on many smaller disputes of fact to make its case.

Some examples of disputes of fact: Did we have more returning customers this month than the last? Do children who use antibiotics get sick more frequently? What is the favorite color of colorblind men? Is the F1 score of this model higher than that of the other model?

The typical questions of science are disputes of fact. Does this chemical combination in this order produce that reagent? Does the debt-to-GDP ratio predict GDP growth? Does a theorized subatomic particle exist?

What all of these questions have in common is that we can outline the criteria that would convince you to agree to an answer prior to any data being collected. The steps might be simple: count the people in group A, count the people in group B, report if A has more members than B. Or it might be highly complex, involving many parts: verify that a piece of monitoring machinery is working well, correctly perform some pipette work 100 times, correctly take the output of the monitoring machine, and finally, apply a Chi-square test to check the distribution of the results. We can make a case for why meeting such conditions would imply the claim.

There are thus two stock issues for disputes of fact. They are:

§ What is a reasonable truth condition?

§ Is that truth condition satisfied?

In other words, how would you know this fact was right, and did it turn out the way you expected?

We need to lay out the conditions for holding a fact to be true, and then show that those conditions are satisfied. If the conditions are already obvious and held by the audience, we can skip straight to demonstrating that they are satisfied. If they aren’t, then our first task is to make a case that we identified a series of steps or conditions that imply the claim.
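As a minimal sketch of fixing a truth condition before looking at the answer, consider the returning-customers question from earlier. The definition of "returning" used here (a customer who also purchased in an earlier month) and the tiny purchase log are assumptions invented for illustration; a real analysis would have to defend its own definition.

```python
# Hypothetical purchase log: (customer_id, month) pairs. Illustrative data only.
purchases = [
    ("ann", 1), ("bob", 1), ("cat", 1),
    ("ann", 2), ("bob", 2), ("dan", 2),
    ("ann", 3), ("dan", 3), ("cat", 3), ("eve", 3),
]

def returning_customers(log, month):
    """Truth condition (an assumption): a customer is 'returning' in a month
    if they purchased in that month and in at least one earlier month."""
    current = {c for c, m in log if m == month}
    earlier = {c for c, m in log if m < month}
    return current & earlier

last_month = returning_customers(purchases, 2)
this_month = returning_customers(purchases, 3)

# Second stock issue: is the truth condition satisfied?
more_returning = len(this_month) > len(last_month)
print(sorted(last_month), sorted(this_month), more_returning)
```

Writing the condition down as a function first makes it something the audience can accept or contest separately from the result.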

Take a famous example, the claim that a debt-to-GDP (gross domestic product) ratio of over 90% results in negative real GDP growth. That is, if a national government had debt equal to 90% or more of its GDP in one year, then on average its GDP in the following year would fall, even adjusting for inflation. For several years, this was taken as a fact in many policy circles, based on a paper by the Harvard economists Reinhart and Rogoff.[6]

Reinhart and Rogoff stipulated the following truth condition: collect debt-to-GDP ratios for a number of years, across dozens of countries. Group each country-year into four buckets by its debt-to-GDP ratio (below 30%, 30–60%, 60–90%, 90% and above). Calculate the growth between that year and the next year. Average within each country, and then average all countries together. Whatever the average growth was in each bucket is the expected growth rate.

Both their truth condition and claim to satisfy that condition turned out to be flawed. When the result was examined in depth by Herndon, Ash, and Pollin from the University of Massachusetts Amherst,[7] several issues were found.

First, the truth condition was misspecified. Data is not available equally for all countries in all years, so averaging first within each country and then across those country averages could weight one year in one country equally with several decades in another. Specifically, in Reinhart and Rogoff’s data, Greece and the UK each had 19 years with a debt-to-GDP ratio over 90% and growth around 2.5%, whereas New Zealand had one year with –7.9% growth. The three numbers were simply averaged.
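The effect of that weighting choice can be sketched with the three rounded figures quoted above. This is not a reproduction of either paper's analysis, and it uses only the numbers in this paragraph rather than the full data set.

```python
# Rounded figures quoted in the text (not the full Reinhart-Rogoff data set):
# country -> (years above 90% debt-to-GDP, average growth in those years, %)
high_debt = {
    "Greece": (19, 2.5),
    "UK": (19, 2.5),
    "New Zealand": (1, -7.9),
}

# Average within each country first, then across countries:
# every country counts equally, no matter how many years it contributes.
country_weighted = sum(g for _, g in high_debt.values()) / len(high_debt)

# Alternative: every country-year counts equally.
total_years = sum(n for n, _ in high_debt.values())
year_weighted = sum(n * g for n, g in high_debt.values()) / total_years

print(f"country-weighted: {country_weighted:.2f}%")
print(f"year-weighted:    {year_weighted:.2f}%")
```

With these inputs the country-weighted average is negative while the year-weighted average is above 2%. Neither weighting is automatically correct; the point is that the choice belongs to the truth condition and has to be argued for.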

Second, their execution turned out to be flawed. Excel errors excluded the first five countries (Australia, Austria, Belgium, Canada, and Denmark) entirely from the analysis. Additional country-year pairs were also omitted from the whole data set, the absence of which substantially distorted the result.

Herndon, Ash, and Pollin made a counter-claim. They declared that there is no sharp drop in GDP growth at a 90% debt-to-GDP ratio, and that in fact growth falls slowly from an average of 3% per year to around 2% over the 30% to 120% debt-to-GDP ratio range, beyond which the data thins out.

Their truth condition was simply a smoothed graph fit to the original data, without any bucketing. It is worth noting that Herndon, Ash, and Pollin carried out their examination in R, rather than Excel, which provided better statistical tools and far easier replication and debugging.

DEFINITION

Disputes of definition occur when there is a particular way we want to label something, and we expect that that label will be contested. Consider trying to figure out whether antibiotics reduce the incidence of sickness in kids. In the contemporary United States, children are humans under the age of 13 (or 18, in the case of the law). Antibiotics are a known class of drugs. But what does it mean to regard a child as having been sick? Viral load? Doctor’s visits? Absence from school? Each of these picks up something essential, but brings its own problems. Quite a lot can be on the line for determining the right definition if you’re a drug company. Definitions in a data context are about trying to make precise relationships in an imprecise world. If a definition is already precise and widely shared, there is nothing at issue and we will have nothing to defend.

Words also have prior meanings that are an important part of how we think about them. If humans thought with exact, axiomatic logic, naming things would make no difference. But, for the same reason, our ability to think would be rigidly stuck within narrowly defined categories. We would fall apart the moment we had to think through something outside our axioms.

There are two activities involving definitions that can happen in an argument. The first is making a case that a general, imprecise term fits a particular example. If we show some graphs and claim that they demonstrate that a business is “growing,” we are saying that this particular business fits into the category of growing businesses. That is something we have to make a case for, but we are not necessarily claiming to have made a precise definition.

The second activity is a stronger claim to make in an argument: that we have made an existing term more precise. If we say that we have developed some rules to determine which programming language is the most popular, or which users should be considered the most influential on Twitter, we are clarifying an existing idea into something that we can count with.

“Popularity,” “influence,” “engagement,” and the like are all loaded terms, which is a good thing. If we shy away from using meaningful language, it becomes difficult to make judgments in new scenarios. Meaningful definitions provide sensible default positions in new situations. There are certain things we expect of a “popular” language beyond it being highly used in open source code repositories, even if that is how we define popularity. And we have some mental model of how “engaged” users should behave, which can be very useful for debugging an argument.

A term that an audience has no prior understanding of, either in the form of examples (real or prototypical) or prior definitions, is not going to be contested. There will be no argument, because the word will be totally new.

There are three stock issues with disputes of definition:

§ Does this definition make a meaningful distinction?

§ How well does this definition fit with prior ideas?

§ What, if any, are the reasonable alternatives, and why is this one better?

We can briefly summarize these as Useful, Consistent, and Best. A good definition should be all three.

First, consider the issue of whether a definition makes a difference (Useful). What would you think of a definition that declared users to be influential on Twitter based on the result of a coin toss? It would add no predictive ability as to how a given tweet would be taken up by the larger network. Or a definition of a growing business that included every business that was not spiraling into bankruptcy? There have to be some useful distinctions made by the definition in order to be worthy of being cared about, and we often have to make a case for the utility of a definition.

Discussions about how well a definition fits with prior ideas can take many forms (Consistent). One is: how well does this definition agree with previously accepted definitions? Another: how well does this definition capture good examples? To justify a definition of influence on Twitter, it helps to both cite existing definitions of influence (even if they aren’t mathematically precise) and to pick out particular people that are known to be influential to show that they are accounted for. Our definition should capture those people. And if it does not, we should make a clear case for why their exclusion does not create a problem with fitting into precedent.

Finally, it makes sense to consider any alternatives (Best), lest our audience do it for us. If there are obvious definitions or classifications that we are missing or not using, it behooves us to explain why our particular definition is superior (or why they are not in conflict).

Disputes of definition are closely related to the idea of construct validity used in the social sciences, especially in psychology and sociology. A construct is a fancy term for definition, and construct validity refers to the extent to which a definition is reliably useful for causal explanation. When psychologists define neuroticism, psychosis, or depression, they’re trying to make a prior idea more precise and justify that definition to others.

Definitions are also where we typically introduce simplifying assumptions into our argument. For example, in our investigation into apartment prices and transit accessibility, we discussed sticking only to apartments that are advertised to the public. On one hand, that could be a failing of the final argument. On the other hand, it greatly simplifies the analysis, and as long as we are up front about our assumptions when a reasonable skeptic could disagree with them, it is better to have provisional knowledge than none at all.

VALUE

When we are concerned with judging something, the dispute is one of value.

For example, is a particular metric good for a business to use? We have to select our criteria of goodness, defend them, and check that they apply. A metric presents a balance of ease of interpretability, precision, predictive validity, elegance, completeness, and so on. Which of these values are the right ones to apply in this situation, and how well do they apply? At some point we may have to choose between this metric and another to guide a decision. Which is more important, customer satisfaction or customer lifetime value? We often have to justify a judgment call.

Consider the decision to pick a certain validation procedure for a statistical model. Different criteria[8] are useful for solving different problems. Which of these criteria are the right ones in a particular situation requires an argument. There are trade-offs involved between validity, interpretability, accuracy, and so on. By what criteria should our model be judged?

What else but by their fruits? For disputes of value, our two stock issues are:

§ How do our goals determine which values are the most important for this argument?

§ Has the value been properly applied in this situation?

For example, consider a scenario where we are deciding between two models (not validation procedures as before, but separate models), one of which is easy to interpret and another that is more accurate but hard to interpret. For some reason, we are restricted to using a single model.

Which values matter in this case will depend on what our goals are. If our goal is to develop understanding of a phenomenon as part of a longer project, the interpretable model is more important. Likewise, if our goal is to build something we can fit into our heads to reason off of in new situations, concision and elegance are important. Our next step would then be to make a case that the model in question is, in fact, interpretable, concise, elegant, and so on.

On the other hand, if our goal is to build something that is a component of a large, mostly or entirely automated process, concision and elegance are irrelevant. But are accuracy or robustness more important? That is, is it more important to be right often, or to be able to withstand change? When the US Post Office uses its handwriting recognition tools to automatically sort mail by zip code, accuracy is the most important value—handwriting is not likely to change substantially over the next few years. By contrast, when building an autonomous robot, it is more important that it can handle new scenarios than that it always walks in the most efficient way possible. Our values are dictated by our goals. Teasing out the implications of that relationship requires an argument.

POLICY

Disputes of policy occur whenever we want to answer the question, “Is this the right course of action?” or “Is this the right way of doing things?” Recognizing that a dispute is a dispute of policy can greatly simplify the process of using data to convince people of the necessity of making a change in an organization.

Should we be reaching out to paying members more often by email? Should the Parks Department do more tree trimming? Is relying on this predictive model the right way to raise revenue? Is this implementation of an encryption standard good enough to use? Does this nonprofit deserve more funding?

The four stock issues of disputes of policy are:

§ Is there a problem?

§ Where is credit or blame due?

§ Will the proposal solve it?

§ Will it be better on balance?

David Zarefsky distills these down into Ill, Blame, Cure, and Cost.[9]

Is there a problem? We need to show that there is something worth fixing. Is revenue growth not keeping up with expectation? Are there known issues with a particular algorithm? Are trees falling down during storms and killing people? In any of these cases, it is necessary to provide an argument as to why the audience should believe that there is a problem in the first place.

Where is credit or blame due? For example, is revenue not keeping up with what is expected because of weaker than normal growth in subscriptions? If the problem is that we have a seasonal product, as opposed to that our marketing emails are poorly targeted, proposing to implement a new targeting algorithm may be beside the point. We have to make the case that we have pinpointed a source of trouble.

Would our proposal solve the problem? Perhaps we have run some randomized tests, or we have compared before and after results for the same users, or we have many strong precedents for a particular action. We need to show that our proposed solution has a good chance of working.

Finally, is it worth it? Many solutions would solve a problem but are too expensive, unreliable, or hard to implement to be worth doing. If it takes three or four days to implement a new targeting model, and it will likely have clear gains, it is a no-brainer. If it might take weeks, and nobody’s ever done something like this before, and the payoff is actually rather low compared to other things that a team could be doing…well, it is hard to say that it would be a good decision.

Many, many sticky problems actually turn out to be policy issues. Having a framework to think through is invaluable.

General Topics

Discussions about patterns in reasoning often center around what Aristotle called general topics. General topics are patterns of argument that he saw repeatedly applied across every field. These are the “classic” varieties of arguments: specific-to-general, comparison, comparing things by degree, comparing sizes, considering the possible as opposed to the impossible, etc. Undergraduate literature courses often begin and end their discussion of patterns of argument with these.

Though these arguments might seem remote from data work, in fact they occur constantly. Some, like comparing sizes of things and discussing the possible as opposed to the impossible, are straightforward and require no explanation. Others, like specific-to-general reasoning, or reasoning by analogy, require more exposition.

SPECIFIC-TO-GENERAL

A specific-to-general argument is one concerned with reasoning from examples in order to make a point about a larger pattern. The justification for such an argument is that specific examples are good examples of the whole.

A particularly data-focused idea of a specific-to-general argument would be a statistical model. We are arguing from a small number of examples that a pattern will hold for a larger set of examples. This idea comes up repeatedly throughout this book.

Specific-to-general reasoning occurs even when reasoning from anecdotes. Whenever we make productive use of user experience testing, we are using specific-to-general reasoning. We have observed a relatively small portion of the user base of a given product in great detail, but by reasoning that these users are good examples of the larger user base (at least in some ways), we feel comfortable drawing conclusions based on such research.

Another example of this reasoning pattern is what might be termed an illustration. An illustration is one or more examples that have been selected to build intuition for a topic. Illustrations are extremely useful early in an argument to provide a grounding to the audience about what is relevant and possible for the subject matter.

There is an element of the imagination in every argument. If someone literally cannot imagine an example or the possibilities of the thing under discussion, it is less likely that they will be swayed by the more abstract bits of reasoning. Worse, it is less likely that the argument will actually stick with the person. Practically speaking, it is not enough to convince someone. They need to stay convinced when they walk away from your argument and not forget it moments later. Concrete examples, sticky graphics, and explanations of the prototypical will help ground arguments in ways that improve the chance of a strong takeaway, even when they are incidental to the body of the argument.

GENERAL-TO-SPECIFIC

General-to-specific arguments occur when we use beliefs about general patterns to infer results for particular examples. While it may not be true that a pattern holds for every case, it is at least plausible enough for us to draw the tentative conclusion that the pattern should hold for a particular example. That is, because a pattern generally holds for a larger category, it is plausible that it should hold for an example.

For example, it is widely believed that companies experiencing rapid revenue growth have an easy time attracting investment. If we demonstrate that a company is experiencing rapid revenue growth, it seems plausible to infer that the company will find it easy to raise money. Of course, that might not be true; the revenue growth may be short-term, or funding may be scarce. That doesn’t make it improper to tentatively draw such a conclusion.

This is the opposite of the specific-to-general pattern of reasoning. Using a statistical model to make inferences about new examples is a straightforward instance of general-to-specific reasoning that arises from reversing a specific-to-general argument. We use the justification of specific-to-general reasoning to claim that a sample of something can stand in for the whole; then we use general-to-specific reasoning when we go to apply that model to a new example.

The archetypal rebuttal of a general-to-specific argument is that this particular example may not have the properties of the general pattern, and may be an outlier.

Consider a retail clothing company with a number of departments. Menswear departments may overwhelmingly be in the business of supplying men, but a fair number of women shop in menswear departments for aesthetic reasons or on behalf of their partners or children. If we argue that higher sales of men’s dress shirts indicate more male customers, we are probably right. But if we argue that a particular customer is probably male because that customer purchased a men’s shirt, we may be wrong.
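A small sketch shows why the aggregate claim and the individual claim differ in strength. The purchase counts below are invented purely for illustration and do not come from any real retailer.

```python
# Hypothetical counts of men's dress shirt purchases by shopper gender.
# These numbers are invented for illustration only.
shirt_buyers = {"male": 700, "female": 300}

total = sum(shirt_buyers.values())
p_male = shirt_buyers["male"] / total

# Aggregate claim: most buyers of men's shirts are male.
mostly_male = p_male > 0.5

# Individual claim: guessing "male" for any one shopper is wrong this often.
error_rate = 1 - p_male

print(f"share male: {p_male:.0%}, error if we always guess male: {error_rate:.0%}")
```

Under these assumed counts the general pattern holds comfortably, yet applying it to a single shopper would be wrong nearly a third of the time.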

ARGUMENT BY ANALOGY

Arguments by analogy come in two flavors: literal and figurative. In a literal analogy, two things are actually of similar types. If we have two clients with a similar purchasing history to date, it seems reasonable to infer that after one client makes a big purchase, the other client may come soon after. The justification for argument by analogy is that if the things are alike in some ways, they will be alike in a new way under discussion.

In a figurative analogy, we have two things that are not of the same type, but we argue that they should still be alike. Traditionally, these kinds of analogies are highly abstract, like comparing justice to fire. This may seem out of place in a book on data analysis, but actually, figurative analogies are constantly present in working with data—just under a different banner.

Every mathematical model is an analogy. Just as justice is not a flame, a physical object moving in space is not an equation. No model is the same as the thing it models. No map is the territory. But behavior in one domain (math) can be helpful in understanding behavior in another domain (like the physical world, or human decision-making).

Whenever we create mathematical models as an explanation, we are making a figurative analogy. It may be a well-supported one, but it is an analogy nonetheless.

The rebuttal for argument by analogy is the same as the rebuttal for general-to-specific arguments—that what may hold for one thing does not necessarily hold for the other. Physical objects experience second-order effects that are not accounted for in the simplified physical model taken from an engineering textbook. People may behave rationally according to microeconomic models, but they might also have grudges that the models didn’t account for. Still, when the analogy holds, mathematical modeling is a very powerful way to reason.

Special Arguments

Every discipline has certain argument strategies that it shares with others. The special arguments of data science overlap with those of engineering, machine learning, business intelligence, and the rest of the mathematically inclined disciplines. There are patterns of argument that occur frequently in each of these disciplines that also pop up in settings where we are using data professionally. They can be mixed and matched in a variety of ways.

Optimization, bounding cases, and cost/benefit analysis are three special arguments that deserve particular focus, but careful attention to any case study from data science or related disciplines will reveal many more.

OPTIMIZATION

An argument about optimization is an argument that we have figured out the best way to do something, given certain constraints. When we create a process for recommending movies to someone, assigning people to advertising buckets, or packing boxes full of the right gifts for customers based on their taste, we are engaging in optimization.

Of course, optimization occurs whenever we fit a model through error minimization, but in practice we rarely talk about such activities as optimizations.
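To make the connection concrete, here is a minimal sketch of fitting a straight line by minimizing squared error. The observations are invented, and the closed-form least-squares solution stands in for whatever optimizer a modeling library would actually run.

```python
# Hypothetical (x, y) observations, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def sum_squared_error(slope, intercept):
    """The quantity being minimized: total squared prediction error."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares solution for a straight line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

best = sum_squared_error(slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, error={best:.4f}")
```

Nudging either parameter away from the fitted values increases the error, which is what it means for the fit to be the optimum under this particular choice of error measure.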

This is one of the least argument-laden activities in data science…assuming we already know what we are intending to optimize. If we don’t, we first have a dispute of value to deal with. The question of making a value judgment about the right thing to be optimizing is often far more interesting and controversial than the process itself.

BOUNDING CASE

Sometimes an argument is not about making a case for a specific number or model, but about determining what the highest or lowest reasonable values of something might be.

There are two major ways to make an argument about bounding cases. The first is called sensitivity analysis. All arguments are predicated on certain assumptions. In sensitivity analysis, we vary the assumptions to best- or worst-case values and see what the resulting answers look like. So if we are making a case that the number of new accounts for a small business could be (based on historical data) as low as two per month and as high as five, and that each account could bring in as little as $2,000 and as much as $10,000, then the worst case under these assumptions is new income of $4,000 a month and the best case is as high as $50,000 a month. That is a huge range, but if we’re only concerned that we have enough growth in the next month to pay off a $2,000 loan, then we’re already fine.
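The arithmetic of this example can be written out as a short sketch that evaluates every combination of extreme assumptions. The variable names are ours; the figures are the ones given above.

```python
from itertools import product

# Ranges taken from the example in the text.
accounts_per_month = (2, 5)             # low, high
revenue_per_account = (2_000, 10_000)   # low, high, in dollars
loan_payment = 2_000

# Evaluate every combination of extreme assumptions.
scenarios = [n * r for n, r in product(accounts_per_month, revenue_per_account)]
worst_case, best_case = min(scenarios), max(scenarios)

print(worst_case, best_case)        # 4000 50000
print(worst_case >= loan_payment)   # True
```

Enumerating all corners rather than only the two obvious ones matters more when some assumptions push the result in opposite directions.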

A more sophisticated approach to determining bounding cases is through simulation or statistical sensitivity analysis. If we make assumptions about the plausibility of each value in the preceding ranges (based on historical data), say that 2 and 5 clients are equally likely but that 3 or 4 are twice as likely as 2 or 5, and that $2,000 to $10,000 are uniformly likely, then we can start to say things about the lower and upper decile of likely results. The simplest way to do that would be to simulate a thousand months, simulating first a number of clients and then a value per client.
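A minimal simulation along these lines might look like the following. It assumes that each client's value is drawn independently and uniformly between $2,000 and $10,000, which is one reading of the description above, and it uses a fixed random seed only so that the sketch is repeatable.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def simulate_month(rng=random):
    # 3 or 4 clients are twice as likely as 2 or 5.
    clients = rng.choices([2, 3, 4, 5], weights=[1, 2, 2, 1])[0]
    # Assumption: each client's value is drawn independently and uniformly.
    return sum(rng.uniform(2_000, 10_000) for _ in range(clients))

# Simulate a thousand months and sort the results.
months = sorted(simulate_month() for _ in range(1_000))

# Simple empirical percentiles read off the sorted results.
lower_decile = months[len(months) // 10]
median = months[len(months) // 2]
upper_decile = months[9 * len(months) // 10]

print(f"10th: {lower_decile:,.0f}  50th: {median:,.0f}  90th: {upper_decile:,.0f}")
```

The simulated deciles fall well inside the $4,000 to $50,000 extremes, because it is unlikely that every assumption lands on its worst or best value in the same month.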

Based on a particular level of risk, we can calculate percentiles of revenue that suit the task at hand. Cost projections for investors might be based on median revenue, whereas cost projections for cash flow analysis may use the lower decile or quartile for safety reasons. Mission-critical servers, meanwhile, are only considered well-functioning if they can provide service 9,999 seconds out of every 10,000 (99.99% uptime) or better. Bounding cases need to be matched to the relevant risk profile of the audience.
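To get a feel for how strict the server threshold is, the following sketch converts it into allowed downtime. It assumes a 365-day year and nothing else beyond the 9,999-in-10,000 figure above.

```python
# 9,999 good seconds out of every 10,000, as in the text.
availability = 9_999 / 10_000

seconds_per_day = 24 * 60 * 60
# Assumption: a 365-day year, ignoring leap years.
seconds_per_year = 365 * seconds_per_day

allowed_downtime_per_day = (1 - availability) * seconds_per_day
allowed_downtime_per_year_min = (1 - availability) * seconds_per_year / 60

print(f"{availability:.2%} uptime")
print(f"{allowed_downtime_per_day:.2f} seconds of downtime per day")
print(f"{allowed_downtime_per_year_min:.1f} minutes of downtime per year")
```

That works out to under nine seconds a day, or a little under an hour over a whole year, a far tighter bound than a lower-decile revenue projection.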

Sensitivity analysis or simulation can provide rough bounds on where important numbers will go. Sometimes just providing orders of magnitude is enough to push a decision forward—for example, by demonstrating that the energy needs for a construction project are a hundred times higher than what is available.

COST/BENEFIT ANALYSIS

A variant on disputes of both value and policy is a cost/benefit analysis. In a cost/benefit analysis, each possible outcome from a decision or group of decisions is put in terms of a common unit, like time, money, or lives saved. The justification is that the right decision is the one that maximizes the benefit (or minimizes the cost). In some sense, such an argument also draws on the idea of optimization, but unlike optimization, there is not necessarily an assertion that the argument takes into account all possible decisions, or a mix of decisions and constraints. A cost/benefit analysis compares some number of decisions against each other, but doesn’t necessarily say anything about the space of possible decisions.

For a cost/benefit analysis, there is some agreement on the right things to value, and those things are conveniently countable. It may be quite an achievement to acquire those numbers, but once they are acquired, we can compare the proposed policy against alternatives.

Cost/benefit analyses can also be combined with bounding case analyses. If the lowest plausible benefit from one action is greater than the highest plausible benefit from another, the force of the argument is extremely strong. Even if one is just higher than the other on average, already the evidence is in its favor, but questions of the spread start to matter.
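The strongest form of this combined argument can be sketched as a simple comparison of ranges. The option names and dollar figures below are hypothetical and chosen only to show one clear-cut case and one overlapping case.

```python
# Hypothetical (lowest, highest) plausible benefits in dollars per month.
# The option names and figures are invented for illustration.
options = {
    "new targeting model": (12_000, 30_000),
    "extra email campaign": (3_000, 9_000),
    "site redesign": (5_000, 40_000),
}

def dominates(a, b):
    """True if the worst case of option a beats the best case of option b."""
    return options[a][0] > options[b][1]

print(dominates("new targeting model", "extra email campaign"))  # True
print(dominates("new targeting model", "site redesign"))         # False
```

When the ranges overlap, as in the second comparison, neither option dominates and the argument has to turn to averages and the spread of likely outcomes.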

The rebuttals to a cost/benefit analysis are that costs and benefits have been miscalculated, that this is generally not the right method to make such a decision, or that the calculations for cost and benefit do not take into account other costs or benefits that reorder the answers. Next-quarter cash earnings may conflict with long-term profitability, or with legal restrictions that would land a decision maker in jail.


[6] Reinhart, Carmen M., and Kenneth S. Rogoff. Growth in a Time of Debt. NBER Working Paper No. 15639, 2010. http://www.nber.org/papers/w15639.

[7] Herndon, Thomas, Michael Ash, and Robert Pollin. Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. PERI Working Paper No. 322, 2013. http://bit.ly/1gIDQfN.

[8] Such as cross-validation, external validation, analysis of variance, bootstrapping, t-tests, Bayesian evidence, and so on.

[9] Zarefsky, David. Argumentation: The Study of Effective Reasoning, 2nd ed. Audio course. http://www.thegreatcourses.com/tgc/courses/course_detail.aspx?cid=4294.