Data Science For Dummies (2016)

Part 1

Getting Started with Data Science

IN THIS PART …

Get introduced to the field of data science.

Define big data.

Explore solutions for big data problems.

See how real-world businesses put data science to good use.

Chapter 1

Wrapping Your Head around Data Science

IN THIS CHAPTER

· Making use of data science in different industries

· Putting together different data science components

· Identifying viable data science solutions to your own data challenges

· Becoming more marketable by way of data science

For quite some time now, everyone has been absolutely deluged by data. It’s coming from every computer, every mobile device, every camera, and every imaginable sensor — and now it’s even coming from watches and other wearable technologies. Data is generated in every social media interaction we make, every file we save, every picture we take, and every query we submit; it’s even generated when we do something as simple as ask a favorite search engine for directions to the closest ice-cream shop.

Although data immersion is nothing new, you may have noticed that the phenomenon is accelerating. Lakes, puddles, and rivers of data have turned to floods and veritable tsunamis of structured, semistructured, and unstructured data that’s streaming from almost every activity that takes place in both the digital and physical worlds. Welcome to the world of big data!

If you’re anything like me, you may have wondered, “What’s the point of all this data? Why use valuable resources to generate and collect it?” Just a decade ago, no one was in a position to make much use of most of the data being generated, but the tide has definitely turned. Specialists known as data engineers are constantly finding innovative and powerful new ways to capture, collate, and condense unimaginably massive volumes of data, and other specialists, known as data scientists, are leading change by deriving valuable and actionable insights from that data.

In its truest form, data science represents the optimization of processes and resources. Data science produces data insights — actionable, data-informed conclusions or predictions that you can use to understand and improve your business, your investments, your health, and even your lifestyle and social life. Using data science insights is like being able to see in the dark. For any goal or pursuit you can imagine, you can find data science methods to help you predict the most direct route from where you are to where you want to be — and to anticipate every pothole in the road between both places.

Seeing Who Can Make Use of Data Science

The terms data science and data engineering are often misused and confused, so let me start off by clarifying that these two fields are, in fact, separate and distinct domains of expertise. Data science is the computational science of extracting meaningful insights from raw data and then effectively communicating those insights to generate value. Data engineering, on the other hand, is an engineering domain that’s dedicated to building and maintaining systems that overcome data processing bottlenecks and data handling problems for applications that consume, process, and store large volumes, varieties, and velocities of data. In both data science and data engineering, you commonly work with these three data varieties:

· Structured: Data that is stored, processed, and manipulated in a traditional relational database management system (RDBMS).

· Unstructured: Data that is commonly generated from human activities and doesn’t fit into a structured database format.

· Semistructured: Data that doesn’t fit into a structured database system but is nonetheless organized by tags that create a form of order and hierarchy in the data.
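To make the three varieties a little more concrete, here is a minimal Python sketch. The table name, field names, and sample records are invented purely for illustration; real data of each variety is usually far larger and messier.

```python
import json
import sqlite3

# Structured: rows that follow a fixed schema in a relational table.
# (An in-memory SQLite database stands in for a full RDBMS here.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'Ana', 19.99)")
structured_row = conn.execute("SELECT customer, total FROM orders").fetchone()
conn.close()

# Semistructured: no fixed table schema, but tags (JSON keys) give the
# record order and hierarchy.
raw_json = '{"customer": "Ana", "tags": ["gift"], "address": {"city": "Lima"}}'
semistructured = json.loads(raw_json)
city = semistructured["address"]["city"]

# Unstructured: free text written by a person; there are no fields to query,
# so even a simple measure like a word count has to be computed from the text.
unstructured = "Loved the ice cream, but delivery took forever!"
word_count = len(unstructured.split())

print(structured_row, city, word_count)
```

Notice that the structured and semistructured records can be addressed by column or key, whereas the unstructured text has to be processed before it yields anything you can count or compare.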

A lot of people believe that only large organizations that have massive funding are implementing data science methodologies to optimize and improve their business, but that’s not the case. The proliferation of data has created a demand for insights, and this demand is embedded in many aspects of our modern culture — from the Uber passenger who expects his driver to pick him up exactly at the time and location predicted by the Uber application, to the online shopper who expects the Amazon platform to recommend the best product alternatives so she can compare similar goods before making a purchase. Data and the need for data-informed insights are ubiquitous. Because organizations of all sizes are beginning to recognize that they’re immersed in a sink-or-swim, data-driven, competitive environment, data know-how emerges as a core and requisite function in almost every line of business.

What does this mean for the everyday person? First, it means that everyday employees are increasingly expected to support a progressively advancing set of technological requirements. Why? Well, that’s because almost all industries are becoming increasingly reliant on data technologies and the insights they spur. Consequently, many people are in continuous need of re-upping their tech skills, or else they face the real possibility of being replaced by a more tech-savvy employee.

The good news is that upgrading tech skills doesn’t usually require people to go back to college or (God forbid) get a university degree in statistics, computer science, or data science. The bad news is that, even with professional training or self-teaching, it always takes extra work to stay industry-relevant and tech-savvy. In this respect, the data revolution isn’t so different from any other change that has hit industry in the past. The fact is that staying relevant takes time and effort, but you need to acquire only the skills that keep you current. When you’re learning how to do data science, you can take some courses, educate yourself using online resources, read books like this one, and attend events where you can learn what you need to know to stay on top of the game.

Who can use data science? You can. Your organization can. Your employer can. Anyone who has a bit of understanding and training can begin using data insights to improve their lives, their careers, and the well-being of their businesses. Data science represents a change in the way you approach the world. When pursuing an outcome, people often used to make their best guess, act, and then hope for their desired result. With data insights, however, people now have access to the predictive vision that they need to truly drive change and achieve the results they need.

You can use data insights to bring about changes in the following areas:

· Business systems: Optimize returns on investment (those crucial ROIs) for any measurable activity.

· Technical marketing strategy development: Use data insights and predictive analytics to identify marketing strategies that work, eliminate under-performing efforts, and test new marketing strategies.

· Keep communities safe: Predictive policing applications help law enforcement personnel predict and prevent local criminal activities.

· Help make the world a better place for those less fortunate: Data scientists in developing nations are using social data, mobile data, and data from websites to generate real-time analytics that improve the effectiveness of humanitarian response to disaster, epidemics, food scarcity issues, and more.
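As a rough illustration of the first two items in this list, the following Python sketch computes return on investment for a few marketing campaigns and flags the ones that lose money. The campaign names and dollar figures are hypothetical, and the formula used, (revenue minus cost) divided by cost, is the common textbook definition of ROI rather than anything specific to this book.

```python
# Hypothetical campaign figures: (amount spent, revenue attributed to the campaign).
campaigns = {
    "email": (1000.0, 1500.0),
    "search_ads": (2000.0, 2600.0),
    "billboard": (3000.0, 2400.0),
}


def roi(cost, revenue):
    """Return on investment as a fraction of cost: (revenue - cost) / cost."""
    return (revenue - cost) / cost


# ROI for every campaign, keyed by campaign name.
roi_by_campaign = {
    name: roi(cost, revenue) for name, (cost, revenue) in campaigns.items()
}

# Campaigns with negative ROI are candidates for elimination.
underperformers = sorted(
    name for name, value in roi_by_campaign.items() if value < 0
)

# The campaign with the highest ROI is the one to study and repeat.
best = max(roi_by_campaign, key=roi_by_campaign.get)

print(roi_by_campaign, underperformers, best)
```

In practice, the hard part is attributing revenue to a campaign in the first place; the arithmetic shown here is the easy final step.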

Analyzing the Pieces of the Data Science Puzzle

To practice data science, in the true meaning of the term, you need the analytical know-how of math and statistics, the coding skills necessary to work with data, and an area of subject matter expertise. Without subject matter expertise, you might as well call yourself a mathematician or a statistician. Similarly, a software programmer without subject matter expertise and analytical know-how might better be considered a software engineer or developer, but not a data scientist.

Because the demand for data insights keeps growing, nearly every field is adopting data science. As such, different flavors of data science have emerged. The following are just a few titles under which experts of every discipline are using data science: ad tech data scientist, director of banking digital analyst, clinical data scientist, geoengineer data scientist, geospatial analytics data scientist, political analyst, retail personalization data scientist, and clinical informatics analyst in pharmacometrics. Given that it often seems that no one without a scorecard can keep track of who’s a data scientist, in the following sections I spell out the key components that are part of any data science role.

Collecting, querying, and consuming data

Data engineers have the job of capturing and collating large volumes of structured, unstructured, and semistructured big data — data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it doesn’t fit the structural requirements of traditional database architectures. Again, data engineering tasks are separate from the work that’s performed in data science, which focuses more on analysis, prediction, and visualization. Despite this distinction, whenever data scientists collect, query, and consume data during the analysis process, they perform work similar to that of the data engineer (the role you read about earlier in this chapter).

Although valuable insights can be generated from a single data source, often the combination of several relevant sources delivers the contextual information required to drive better data-informed decisions. A data scientist can work from several datasets that are stored in a single database, or even in several different data warehouses. (For more about combining datasets, see Chapter 3.) At other times, source data is stored and processed on a cloud-based platform that’s been built by software and data engineers.

No matter how the data is combined or where it’s stored, if you’re a data scientist, you almost always have to query data — write commands to extract relevant datasets from data storage systems, in other words. Most of the time, you use Structured Query Language (SQL) to query data. (Chapter 16 is all about SQL, so if the acronym scares you, jump ahead to that chapter now.)
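Here is a small sketch of what querying looks like in practice, using Python’s built-in sqlite3 module so that it runs without a database server. The two tables, their columns, and the sample rows are assumptions made up for this example; the query joins the two datasets and totals sales by region.

```python
import sqlite3

# Build a tiny in-memory database with two related (hypothetical) tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
    """
)

# The SQL query: combine both tables and extract only the summary you need.
query = """
    SELECT c.region, SUM(s.amount) AS total_sales
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

The same SELECT, JOIN, and GROUP BY pattern carries over to most relational database systems, although connection details and some syntax differ from one product to another.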

Whether you’re using an application or doing custom analyses by using a programming language such as R or Python, you can choose from a number of universally accepted file formats:

· Comma-separated values (CSV) files: Almost every brand of desktop and web-based analysis application accepts this file type, as do commonly used scripting languages such as Python and R.

· Scripts: Most data scientists know how to use either the Python or R programming language to analyze and visualize data. These script files end with the extension .py (Python) or .r (R); Python work saved as a Jupyter notebook uses the .ipynb extension.

· Application files: Excel is useful for quick-and-easy, spot-check analyses on small- to medium-size datasets. These application files have the .xls or .xlsx extension. Geospatial analysis applications such as ArcGIS and QGIS save with their own proprietary file formats (the .mxd extension for ArcGIS and the .qgs extension for QGIS).

· Web programming files: If you’re building custom, web-based data visualizations, you may be working in D3.js — or Data-Driven Documents, a JavaScript library for data visualization. When you work in D3.js, you use data to manipulate web-based documents using .html, .svg, and .css files.
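Of the formats in this list, CSV is the one you are likely to meet first. The sketch below reads CSV content with Python’s standard csv module and totals a numeric column per category. The column names and values are invented, and the data is held in a string so that the example runs without a file on disk.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from open("some_file.csv").
csv_text = "city,visits\nAustin,120\nBoise,45\nAustin,30\n"

# DictReader uses the header row to turn each record into a dictionary.
totals = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # Every CSV value arrives as text, so convert numbers explicitly.
    totals[row["city"]] = totals.get(row["city"], 0) + int(row["visits"])

print(totals)
```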

Applying mathematical modeling to data science tasks

Data science relies heavily on a practitioner’s math skills (and statistics skills, as described in the following section) precisely because these are the skills needed to understand your data and its significance. These skills are also valuable in data science because you can use them to carry out predictive forecasting, decision modeling, and hypothesis testing.

Remember: Mathematics uses deterministic methods to form a quantitative (or numerical) description of the world; statistics is a form of science that’s derived from mathematics, but it focuses on using a stochastic (probability-based) approach and inferential methods to form a quantitative description of the world. Both are discussed further in Chapter 5.
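The contrast between deterministic and stochastic methods can be sketched in a few lines of Python. Both halves of the example are illustrations I chose rather than methods from this book: a compound-growth formula stands in for a deterministic calculation, and a simulation of dice rolls stands in for a stochastic estimate.

```python
import random


# Deterministic: the same inputs always produce exactly the same answer.
def compound_value(principal, rate, years):
    return principal * (1 + rate) ** years


deterministic_result = compound_value(1000.0, 0.05, 3)


# Stochastic: estimate a probability by simulating many random trials.
# Here, the chance that two six-sided dice sum to 7 (the exact answer is 1/6).
def estimate_seven_probability(trials, seed):
    rng = random.Random(seed)  # seeded so that the run is repeatable
    hits = sum(
        1 for _ in range(trials) if rng.randint(1, 6) + rng.randint(1, 6) == 7
    )
    return hits / trials


estimate = estimate_seven_probability(20000, seed=42)

print(round(deterministic_result, 3), estimate)
```

The deterministic function returns one exact value, whereas the simulated estimate only approximates 1/6 and would shift slightly with a different seed or number of trials.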

Data scientists use mathematical methods to build decision models, generate approximations, and make predictions about the future. Chapter 5 presents many complex applied mathematical approaches that are useful when working in data science.

Remember: In this book, I assume that you have a fairly solid skill set in basic math; it would be beneficial if you’ve taken college-level calculus or even linear algebra. I try hard, however, to meet readers where they are. I realize that you may be working from limited mathematical knowledge (advanced algebra or maybe business calculus), so I convey advanced mathematical concepts using a plain-language approach that’s easy for everyone to understand.

Deriving insights from statistical methods

In data science, statistical methods are useful for better understanding your data’s significance, for validating hypotheses, for simulating scenarios, and for making predictive forecasts of future events. Advanced statistical skills are somewhat rare, even among quantitative analysts, engineers, and scientists. If you want to go places in data science, though, take some time to get up to speed in a few basic statistical methods, like linear and logistic regression, naïve Bayes classification, and time series analysis. These methods are covered in Chapter 5.
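To give a feel for the simplest of these methods, here is a minimal sketch of linear regression fitted by ordinary least squares in plain Python. The ad-spend and sales numbers are made up, and a real project would normally rely on an established statistics library rather than hand-rolled arithmetic.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: ad spend and resulting sales, both in thousands of dollars.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Use the fitted line to forecast sales at a spend level not yet observed.
forecast = intercept + slope * 6.0

print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The fitted slope estimates how much sales change for each additional unit of ad spend, which is the kind of predictive forecast described above.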

Coding, coding, coding — it’s just part of the game

Coding is unavoidable when you’re working in data science. You need to be able to write code so that you can instruct the computer how you want it to manipulate, analyze, and visualize your data. Programming languages such as Python and R are important for writing scripts for data manipulation, analysis, and visualization, and SQL is useful for data querying. The JavaScript library D3.js is a hot new option for making cool, custom, and interactive web-based data visualizations.

Although coding is a requirement for data science, it doesn’t have to be this big scary thing that people make it out to be. Your coding can be as fancy and complex as you want it to be, but you can also take a rather simple approach. Although these skills are paramount to success, you can pretty easily learn enough coding to practice high-level data science. I’ve dedicated Chapters 10, 14, 15, and 16 to helping you get up to speed in using D3.js for web-based data visualization, coding in Python and in R, and querying in SQL (respectively).

Applying data science to a subject area

Statisticians have exhibited some measure of obstinacy in accepting the significance of data science. Many statisticians have cried out, “Data science is nothing new! It’s just another name for what we’ve been doing all along.” Although I can sympathize with their perspective, I’m forced to stand with the camp of data scientists who firmly declare that data science is separate and distinct from the statistical approaches it draws on.

My position on the unique nature of data science is based to some extent on the fact that data scientists often use computer languages not used in traditional statistics and take approaches derived from the field of mathematics. But the main point of distinction between statistics and data science is the need for subject matter expertise.

Because statisticians usually have only a limited amount of expertise in fields outside of statistics, they’re almost always forced to consult with a subject matter expert to verify exactly what their findings mean and to decide the best direction in which to proceed. Data scientists, on the other hand, are required to have a strong subject matter expertise in the area in which they’re working. Data scientists generate deep insights and then use their domain-specific expertise to understand exactly what those insights mean with respect to the area in which they’re working.

This list describes a few ways in which subject matter experts are using data science to enhance performance in their respective industries:

· Engineers use machine learning to optimize energy efficiency in modern building design.

· Clinical data scientists work on the personalization of treatment plans and use healthcare informatics to predict and preempt future health problems in at-risk patients.

· Marketing data scientists use logistic regression to predict and preempt customer churn (the loss of customers from a product or service to a competitor’s). I tell you more about decreasing customer churn in Chapters 3 and 20.

· Data journalists scrape websites (extract data in-bulk directly off the pages on a website, in other words) for fresh data in order to discover and report the latest breaking-news stories. (I talk more about data journalism in Chapter 18.)

· Data scientists in crime analysis use spatial predictive modeling to predict, preempt, and prevent criminal activities. (See Chapter 21 for all the details on using data science to describe and predict criminal activity.)

· Data do-gooders use machine learning to classify and report vital information about disaster-affected communities for real-time decision support in humanitarian response, which you can read about in Chapter 19.
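As one example from this list, the web scraping that data journalists do can be sketched with Python’s standard html.parser module. The HTML snippet and the "headline" class name are assumptions invented for the example; a real scraper would first download the page and would need to respect the site’s terms of use.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a headline element.
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())


# Hypothetical page content standing in for a downloaded news page.
page = """
<html><body>
  <h2 class="headline">City council approves new budget</h2>
  <p>Details to follow.</p>
  <h2 class="headline">Local team wins championship</h2>
</body></html>
"""

parser = HeadlineParser()
parser.feed(page)

print(parser.headlines)
```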

Communicating data insights

As a data scientist, you must have sharp oral and written communication skills. If you can’t communicate, all the knowledge and insight in the world does nothing for your organization. Data scientists need to be able to explain data insights in a way that staff members can understand. Not only that, data scientists need to be able to produce clear and meaningful data visualizations and written narratives. Most of the time, people need to see something for themselves in order to understand. Data scientists must be creative and pragmatic in their means and methods of communication. (I cover the topics of data visualization and data-driven storytelling in much greater detail in Chapter 9 and Chapter 18, respectively.)

Exploring the Data Science Solution Alternatives

Organizations and their leaders are still grappling with how to best use big data and data science. Most of them know that advanced analytics is positioned to bring a tremendous competitive edge to their organizations, but few of them have any idea about the options that are available or the exact benefits that data science can deliver. In this section, I introduce three major data science solution alternatives and describe the benefits that a data science implementation can deliver.

Assembling your own in-house team

Many organizations find it makes financial sense for them to establish their own dedicated in-house team of data professionals. This saves them money they would otherwise spend achieving similar results by hiring independent consultants or deploying a ready-made cloud-based analytics solution. Three options for building an in-house data science team are:

· Train existing employees. If you want to equip your organization with the power of data science and analytics, data science training (the lower-cost alternative) can transform existing staff into data-skilled, highly specialized subject matter experts for your in-house team.

· Hire trained personnel. Some organizations fill their requirements by hiring either experienced data scientists or fresh data science graduates. The problem with this approach is that there aren’t enough of these people to go around, and if you do find people who are willing to come on board, they have high salary requirements. Remember, in addition to the math, statistics, and coding requirements, data scientists must have a high level of subject matter expertise in the specific field where they’re working. That’s why it’s extraordinarily difficult to find these individuals. Until universities make data literacy an integral part of every educational program, finding highly specialized and skilled data scientists to satisfy organizational requirements will be nearly impossible.

· Train existing employees and hire some experts. Another good option is to train existing employees to do high-level data science tasks and then bring on a few experienced data scientists to fulfill your more advanced data science problem-solving and strategy requirements.

Outsourcing requirements to private data science consultants

Many organizations prefer to outsource their data science and analytics requirements to an outside expert, using one of two general strategies:

· Comprehensive: This strategy serves the entire organization. To build an advanced data science implementation for your organization, you can hire a private consultant to help you with comprehensive strategy development. This type of service will likely cost you, but you can receive tremendously valuable insights in return. A strategist will know about the options available to meet your requirements, as well as the benefits and drawbacks of each one. With strategy in hand and an on-call expert available to help you, you can much more easily navigate the task of building an internal team.

· Individual: You can apply piecemeal solutions to specific problems that arise, or that have arisen, within your organization. If you’re not prepared for the rather involved process of comprehensive strategy design and implementation, you can contract out smaller portions of work to a private data science consultant. This spot-treatment approach could still deliver the benefits of data science without requiring you to reorganize the structure and financials of your entire organization.

Leveraging cloud-based platform solutions

A cloud-based solution can deliver the power of data analytics to professionals who have only a modest level of data literacy. Some have seen the explosion of big data and data science coming from a long way off. Although it’s still new to most, professionals and organizations in the know have been working fast and furiously to prepare. New, private cloud applications such as Trusted Analytics Platform, or TAP (http://trustedanalytics.org) are dedicated to making it easier and faster for organizations to deploy their big data initiatives. Other cloud services, like Tableau, offer code-free, automated data services — from basic clean-up and statistical modeling to analysis and data visualization. Though you still need to understand the statistical, mathematical, and substantive relevance of the data insights, applications such as Tableau can deliver powerful results without requiring users to know how to write code or scripts.

Remember: If you decide to use cloud-based platform solutions to help your organization reach its data science objectives, you still need in-house staff who are trained and skilled to design, run, and interpret the quantitative results from these platforms. The platform will not do away with the need for in-house training and data science expertise; it will merely augment your organization so that it can more readily achieve its objectives.

Letting Data Science Make You More Marketable

Throughout this book, I hope to show you the power of data science and how you can use that power to more quickly reach your personal and professional goals. No matter the sector in which you work, acquiring data science skills can transform you into a more marketable professional. The following list describes just a few key industry sectors that can benefit from data science and analytics:

· Corporations, small- and medium-size enterprises (SMEs), and e-commerce businesses: Production-costs optimization, sales maximization, marketing ROI increases, staff-productivity optimization, customer-churn reduction, customer lifetime-value increases, inventory requirements and sales predictions, pricing model optimization, fraud detection, collaborative filtering, recommendation engines, and logistics improvements

· Governments: Business-process and staff-productivity optimization, management decision-support enhancements, finance and budget forecasting, expenditure tracking and optimization, and fraud detection

· Academia: Resource-allocation improvements, student performance-management improvements, dropout reductions, business process optimization, finance and budget forecasting, and recruitment ROI increases