Agile Data Science (2014)
I wrote this book to get over a failed project and to ensure that others do not repeat my mistakes. In this book, I draw from and reflect upon my experience building analytics applications at two Hadoop shops.
Agile Data Science has three goals: to provide a how-to guide for building analytics applications with big data using Hadoop; to help teams collaborate on big data projects in an agile manner; and to give structure to the practice of applying Agile Big Data analytics in a way that advances the field.
Who This Book Is For
Agile Data Science is a course that helps big data beginners and budding data scientists become productive members of data science and analytics teams. It aims to help engineers, analysts, and data scientists work with big data in an agile way using Hadoop, and it introduces an agile methodology well suited to big data.
This book is targeted at programmers with some exposure to developing software and working with data. Designers and product managers might particularly enjoy Chapters 1, 2, and 5, which would serve as an introduction to the agile process without an excessive focus on running code.
Agile Data Science assumes you are working in a *nix environment. Windows-specific examples aren’t provided, but the examples can be run via Cygwin. A user-contributed Linux Vagrant image with all the prerequisites installed is available here. You can use Vagrant to quickly boot a Linux machine in VirtualBox.
How This Book Is Organized
This book is organized into two sections. Part I introduces the data- and toolset we will use in the tutorials in Part II. Part I is intentionally brief, taking only enough time to introduce the tools. We go more in-depth into their use in Part II, so don’t worry if you’re a little overwhelmed in Part I. The chapters that compose Part I are as follows:
Chapter 1, Theory
Introduces the Agile Big Data methodology.
Chapter 2, Data
Describes the dataset used in this book, and the mechanics of a simple prediction.
Chapter 3, Agile Tools
Introduces our toolset, and helps you get it up and running on your own machine.
Chapter 4, To the Cloud!
Walks you through scaling the tools in Chapter 3 to petabyte scale using the cloud.
Part II is a notebook-style tutorial in which we build an analytics application using Agile Big Data. We climb the data-value pyramid one level at a time, applying agile principles as we go. I’ll demonstrate a way of building value step by step in small, agile iterations. Part II comprises the following chapters:
Chapter 5, Collecting and Displaying Records
Helps you download your inbox and then connect or “plumb” emails through to a web application.
Chapter 6, Visualizing Data with Charts
Steps you through how to navigate your data by preparing simple charts in a web application.
Chapter 7, Exploring Data with Reports
Teaches you how to extract entities from your data and link between them to create interactive reports.
Chapter 8, Making Predictions
Helps you use what you’ve done so far to infer the response rate to emails.
Chapter 9, Driving Actions
Explains how to extend your predictions into a real-time ensemble classifier that helps you write emails that are more likely to get a reply.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.