Practical Data Science with R (2014)

Appendix C. More tools and ideas worth exploring

In data science, you’re betting on the data and the process, not betting on any one magic technique. We advise designing your projects to be the pursuit of quantifiable goals that have already been linked to important business needs. To concretely demonstrate this work style, we emphasize building predictive models using methods that are easily accessible from R. This is a good place to start, but shouldn’t be the end.

There’s always more to do in a data science project. At the least, you can:

· Recruit new partners

· Research more profitable business goals

· Design new experiments

· Specify new variables

· Collect more data

· Explore new visualizations

· Design new presentations

· Test old assumptions

· Implement new methods

· Try new tools

The point being this: there’s always more to try. Minimize confusion by keeping a running journal of your actual goals and of things you haven’t yet had time to try. And don’t let tools and techniques distract you from your goals and data. Always work with “your hands in the data.” That being said, we close with some useful topics for further research (please see the bibliography for publication details).

C.1. More tools

The type of tool you need depends on your problem. If you’re being overwhelmed by data volume, you need to look into big data tools. If you’re having to produce a lot of custom processing, you want to look into additional programming languages. And if you have too little data, you want to study more sophisticated statistical procedures (that offer more statistically efficient inference than the simple cross-validation ideas we emphasize in this book).

C.1.1. R itself

We’ve only been able to scratch the surface of R. Table C.1 shows some important further topics for study.

Table C.1. R topics for follow-up

R topic	Points of interest
R programming and debugging	Our current favorite R book is Kabacoff’s R in Action, which presents a good mix of R and statistics. A good source for R programming and debugging is Matloff’s The Art of R Programming, which includes parallelism, cross-language calling, object-oriented programming, step debugging, and performance profiling. Other avenues to explore are various IDEs such as RStudio and Revolution R Enterprise.
R packages and documentation	R packages are easy for clients and partners to install, so learning how to produce them is a valuable skill. Package documentation files also let you extend R’s help() system to include details about your work. A starter lesson can be found at http://cran.r-project.org/doc/manuals/R-exts.html.

C.1.2. Other languages

R is designed to support statistical data analysis through its large environment of packages (over 5,000 packages are now available from CRAN). You always hope your task is close to a standard statistical procedure and you only need to write a small amount of adapting code. But if you’re going to produce a lot of custom code, you may want to consider using something other than R. Many other programming languages and environments exist and have different relative advantages and disadvantages. The following table surveys some exciting systems.

Table C.2. Other programming languages

Language	Description
Python	Python is a good scripting language with useful tools and libraries. Python has been making strong strides in the data science world with IPython interactive computing notebooks, pandas data frames, and RPy integration.
Julia	Julia is an expressive high-level programming language that compiles to very fast code. The idea is to write concise code (as in R) but then achieve performance comparable to raw C. Claims of 20x speedup aren’t uncommon. Julia also supports distributed parallelism and IJulia (an IPython-inspired notebook system).
J	J is a powerful data processing language inspired by APL and what’s called variable-free or function-level programming. In J, most of the work is done by operators used in a compact mathematical composition notation. J supports powerful vector operations (operations that work over a lot of data in parallel). APL-derived languages (in particular, K) have historically been popular in financial and time-series applications.

With so many exciting possibilities, why did we ever advocate using R? For most midsize data science applications, R is the best tool for the task. Each of these systems does something better than R does, but R can be thought of as a best compromise.

C.1.3. Big data tools

The practical definition of big data is data at a scale where storing and processing the data becomes an engineering problem unto itself. When you hit that scale, you’ll need to move away from pure R (which performs all work in memory) to packages that store results out of memory (such as ff storage, RHadoop, and others). But at some point you may have to move your data preparation out of R. Big data tools tend to be more painful to use than moderate-size tools (data frames and databases), so you don’t want to commit to them until you have an actual need. Table C.3 touches on some important big data tools.

Table C.3. Common big data tools

Tool	Description
Hadoop	The main open source implementation of Google’s MapReduce. MapReduce is the most common way to manipulate very large data. MapReduce organizes big data tasks into jobs consisting of an initial scan and transformation (the map step), followed by sorting and distributing data to aggregators (the reduce step). MapReduce is particularly good for preprocessing, report generation, and tasks like indexing (its original applications). Other machine learning and data science tasks can require managing a large number of MapReduce steps.
Mahout	Mahout is a collection of large-scale machine-learning libraries, many of which are hosted on top of Hadoop.
Drill, Impala	Drill and Impala (and Google’s Dremel) are large-scale data tools specializing in nested records (things like documents with content and attributes, or use records with annotations). They attempt to bring power and scale to so-called schemaless data stores and can interact with stores like Cassandra, HBase, and MongoDB.
Pig, Hive, Presto	Various tools to bring SQL or SQL-like data manipulation to the big data environment.
Storm	Storm (see http://storm-project.net) can be thought of as a complement to MapReduce. MapReduce does everything in batch jobs (very high latency, but good eventual throughput) and is suitable for tasks like model construction. Storm organizes everything into what it calls topologies, which represent a proposed flow of individual data records through many stages of processing. So Storm is an interesting candidate for deploying models into production.
HDF5	Hierarchical Data Format 5 is a method of storing and organizing large collections of numeric data (with support for sparse structures). You’re not likely to see HDF5 outside of scientific work, but there are R and Python libraries for working with HDF5 resources. So for some problems you may consider HDF5 in place of a SQL database.
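
To make the map and reduce steps described in the Hadoop row concrete, here is a minimal single-machine sketch in Python. It isn't Hadoop code and uses no Hadoop API: the function names (map_step, shuffle, reduce_step, map_reduce) and the word-count task are our own assumptions for illustration. A real cluster would run the same three stages spread across many machines.

```python
from collections import defaultdict

def map_step(record):
    """Map: scan one record and emit (key, value) pairs."""
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Sort and distribute: collect every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: aggregate all the values that share a key."""
    return key, sum(values)

def map_reduce(records):
    """Run the three stages in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_step(record))
    groups = shuffle(pairs)
    return dict(reduce_step(key, values) for key, values in groups.items())

# Hypothetical toy input: each string stands in for one stored record.
records = ["big data tools", "big data jobs", "data science"]
counts = map_reduce(records)
print(counts)
```

The point of the design is that map_step sees only one record at a time and reduce_step sees only one key at a time, which is what lets a framework distribute both steps.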

C.2. More ideas

Data science is an excellent introduction to a number of fields that can be rewarding avenues of further study. The fields overlap, but each has its own emphasis. You can’t be expert in all of these fields, but you should be aware of them and consider collaborating with partners expert in some of these fields. Here are some fields we find fascinating, with a few references if you want to learn more.

C.2.1. Adaptive learning

In this book, we use data science in a fairly static manner. The data has already been collected and the model is static after training is finished. Breaking this rigid model gives you a lot more to think about: online learning, transductive learning, and adversarial learning. In online (and stream) learning, you work with models that adapt to new data, often in environments where there’s too much data to store. With transductive learning, models are built after being told which test examples they will be used on (great for dealing with test examples that have missing variables). In adversarial learning, models have to deal with the world adapting against them (especially relevant in spam filtering and fraud detection). Adversarial learning has some exciting new material coming out soon in Joseph, Nelson, Rubinstein, and Tygar’s Adversarial Machine Learning (Cambridge University Press, projected publishing date 2014).
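
As a rough illustration of the online learning idea, the following Python sketch updates a one-variable linear model one record at a time and never stores the stream. Everything here is an assumption made for the example: the class OnlineLinearModel, the stream() data generator, the learning rate, and the simulated change in slope halfway through. It uses plain stochastic gradient descent on squared error, which is only one of many online methods.

```python
import random

class OnlineLinearModel:
    """Hypothetical model y ~ a*x + b, updated one record at a time."""

    def __init__(self, learning_rate=0.05):
        self.a = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.a * x + self.b

    def update(self, x, y):
        """Take one gradient step on squared error; the record is then discarded."""
        error = self.predict(x) - y
        self.a -= self.learning_rate * error * x
        self.b -= self.learning_rate * error

def stream(rng, slope, intercept, n):
    """Hypothetical data source: yields noisy (x, y) records one at a time."""
    for _ in range(n):
        x = rng.uniform(-1, 1)
        yield x, slope * x + intercept + rng.gauss(0, 0.1)

rng = random.Random(2014)
model = OnlineLinearModel()

# Phase 1: the true relation is y = 3x + 1.
for x, y in stream(rng, slope=3.0, intercept=1.0, n=5000):
    model.update(x, y)
before = (model.a, model.b)

# Phase 2: the world changes to y = -2x + 1, and the model follows.
for x, y in stream(rng, slope=-2.0, intercept=1.0, n=5000):
    model.update(x, y)
after = (model.a, model.b)

print("after phase 1:", before)
print("after phase 2:", after)
```

A model trained once on the first phase and then frozen would keep predicting with the old slope; the online version drifts toward the new relation as records arrive.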

C.2.2. Statistical learning

This is one of our favorite fields. In statistical learning, the primary goal is to build predictive models, and the main tools are statistics (for characterizing model performance) and optimization (for fitting parameters). Important concepts include ensemble methods, regularization, and principled dimension reduction.

The definitive book on the topic is Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning, Second Edition. The book has a mathematical bent, but unlike most references, it separates the common learning procedures from the more important proofs of solution properties. If you want to understand the consequences of a method, this is the book to study.

C.2.3. Computer science machine learning

The nonstatistical (computer science) view of machine learning includes concepts like expert systems, pattern recognition, clustering, association rules, version spaces, VC dimension, boosting, and support vector machines. In the classic computer science view of machine learning, nonstatistical quantities such as model complexity (measured in terms of VC dimension, minimum description length, or other measures) are used to prove theorems about model generalization performance. This is in contrast to the statistical view, where generalization error is seen as a form of training bias that you simply test for.

In our opinion, the last great book on the topic was Mitchell’s Machine Learning (1997), but it doesn’t cover enough of the current topics. Overall, we prefer the statistical learning treatment of topics, but there are some excellent books on specific topics. One such is Cristianini and Shawe-Taylor’s An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.

C.2.4. Bayesian methods

One of the big sins of common data science is using “point estimates” for everything. Unknowns are often modeled as single values, estimates are often single values, and even algorithm performance is often reported on a single test set (or even worse, just on the training set). Bayesian methods overcome these issues by working with explicit distributions before (prior) and after (posterior) learning.

Of particular interest are Bayesian hierarchical models, which are a great formal alternative to important tricks we use in the book (tricks like regularization, dimension reduction, and smoothing). Good books on the topic include Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin’s Bayesian Data Analysis, Third Edition, and Koller and Friedman’s Probabilistic Graphical Models: Principles and Techniques.
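
The contrast between a point estimate and an explicit prior and posterior distribution can be sketched with the simplest conjugate case: a Beta prior on an unknown success rate, updated by observed successes and failures. This Python example is only an illustration under our own assumptions (a uniform Beta(1, 1) prior, 7 successes in 10 trials, and the helper names beta_posterior and credible_interval); it isn't taken from the books cited above.

```python
import random

def beta_posterior(prior_alpha, prior_beta, successes, failures):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return prior_alpha + successes, prior_beta + failures

def credible_interval(alpha, beta, mass=0.95, draws=20000, seed=2014):
    """Approximate a central credible interval by sampling from Beta(alpha, beta)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    low = samples[int(draws * (1 - mass) / 2)]
    high = samples[int(draws * (1 + mass) / 2) - 1]
    return low, high

successes, failures = 7, 3  # hypothetical data: 7 successes in 10 trials

# The point estimate is a single number and hides how little data we have.
point_estimate = successes / (successes + failures)

# The Bayesian answer is a whole distribution over the unknown rate.
alpha, beta = beta_posterior(1, 1, successes, failures)
posterior_mean = alpha / (alpha + beta)
low, high = credible_interval(alpha, beta)

print("point estimate:", point_estimate)
print("posterior: Beta(%d, %d), mean %.3f" % (alpha, beta, posterior_mean))
print("approximate 95%% credible interval: (%.2f, %.2f)" % (low, high))
```

With only 10 trials the interval is wide, which is exactly the uncertainty that reporting 0.7 alone would conceal.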

C.2.5. Statistics

Statistics is a fascinating field in and of itself. Statistics covers a lot more about inference (trying to find the causes that are driving relations) than data science, and has a number of cool tools we aren’t able to get to in this book (such as ready-made significance tests and laws of large numbers).

There are a number of good books; for a good introductory text, we recommend Freedman, Pisani, and Purves’s Statistics, Fourth Edition.

C.2.6. Boosting

Boosting is a clever technique for reweighting training data to find submodels that are complementary to each other. You can think of boosting as a complement to bagging: bagging averages ensembles to reduce variance, and boosting manipulates weights to find more diverse models (good when you feel important effects may be hidden as interactions of variables). These ideas about data reweighting are interesting generalizations of the statistical ideas of offsets and orthogonality, but take some time to work through. We recommend trying the R package gbm (Generalized Boosted Regression Models): http://cran.r-project.org/web/packages/gbm/gbm.pdf.
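
To show what reweighting the training data looks like in practice, here is a small Python sketch of AdaBoost with one-variable decision stumps. Note the substitution: AdaBoost is a specific, classic boosting algorithm that we picked because it fits in a few lines, and it isn't the gradient boosting procedure implemented by gbm. The toy data and all function names are assumptions made for the illustration.

```python
import math

def stump_predict(x, threshold, sign):
    """A decision stump: predict sign below the threshold, -sign otherwise."""
    return sign if x < threshold else -sign

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) for the best stump."""
    best = None
    thresholds = [x + 0.5 for x in xs] + [min(xs) - 0.5]
    for threshold in thresholds:
        for sign in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds):
    """Fit stumps in sequence, upweighting the examples the last stump got wrong."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = fit_stump(xs, ys, weights)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, sign))
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all the stumps."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1

def training_errors(ensemble, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)

# Toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_round = adaboost(xs, ys, rounds=1)
three_rounds = adaboost(xs, ys, rounds=3)
print("errors with 1 stump:", training_errors(one_round, xs, ys))
print("errors with 3 stumps:", training_errors(three_rounds, xs, ys))
```

Each round forces the next stump to concentrate on the examples the earlier stumps got wrong, so the submodels end up complementary rather than redundant.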

C.2.7. Time series

Time series analysis can be a topic unto itself. Part of the issue is the need to ensure that non-useful correlations between time steps don’t interfere with inferring useful relations to external parameters. The obvious fix (differencing) introduces its own issues (unit root testing) and needs some care.
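
The effect of differencing is easy to see on simulated data. The Python sketch below builds a random walk, whose successive values are strongly correlated with each other for no useful reason, and then compares the lag-1 autocorrelation of the raw series with that of its first differences. The simulated series and the helper names (difference, lag1_autocorrelation) are assumptions made for this example.

```python
import random

def difference(series):
    """First differences: the change from each time step to the next."""
    return [b - a for a, b in zip(series, series[1:])]

def lag1_autocorrelation(series):
    """Sample correlation between the series and itself shifted by one step."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

# Hypothetical data: a random walk built from independent Gaussian steps.
rng = random.Random(2014)
walk = [0.0]
for _ in range(2000):
    walk.append(walk[-1] + rng.gauss(0, 1))

steps = difference(walk)
raw_acf = lag1_autocorrelation(walk)
diff_acf = lag1_autocorrelation(steps)

print("lag-1 autocorrelation of the raw walk:    %.3f" % raw_acf)
print("lag-1 autocorrelation of the differences: %.3f" % diff_acf)
```

The raw walk looks highly predictable from its own past, while the differenced series recovers the independent steps. On real data the choice is less clear cut, since differencing a series that didn’t need it introduces correlation of its own.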

Good books on the topic include Shumway and Stoffer’s Time Series Analysis and Its Applications, Third Edition, and Tsay’s Analysis of Financial Time Series, Second Edition.

C.2.8. Domain knowledge

The big fixes for hard data science problems are (in order): better variables, better experimental design, more data, and better machine learning algorithms. The main way to get better variables is through intuition and picking up some domain knowledge. You don’t have to work in the field to develop good domain knowledge, but you need partners who do, and you need to spend some time thinking about the actual problem (taking a break from thinking about procedures and algorithms). A very good example of this comes from the famous “Sears catalogue problem” (see John F. Magee, “Operations Research at Arthur D. Little, Inc.: The Early Years”), where clever consultants figured out that the variable most predictive of future customer value was past purchase frequency (outperforming measures like order size). The lesson is: you can build tools to try to automatically propose new features, but effective data science is more often done by having people propose potential features and letting the statistics work out their relative utility.