R Data Mining Blueprints (2016)

Preface

With the growth of data in volume and type, it is becoming essential to perform data mining in order to extract insights from large datasets. This is because organizations feel the need to get a return on investment (ROI) from large-scale data implementations. The fundamental reason behind data mining is to find hidden treasure in large databases so that business stakeholders can take action on future business outcomes. Data mining processes not only help organizations reduce cost and increase profit but also help them discover new avenues.

In this book, I am going to explain the fundamentals of data mining using R, an open source tool and programming language. R is a freely available language and environment for performing statistical computation, graphical data visualization, predictive modeling, and integration with other tools and platforms. I am going to explain the data mining concepts by working through example datasets in R.

For each topic, I am going to explain its mathematical formulation, its implementation in a software environment, and how it helps in solving a business problem. The book is designed so that the reader can start with data management techniques, exploratory data analysis, data visualization, and modeling, and progress to advanced predictive models such as recommendation engines, neural network models, and so on. It also gives an overview of the concept of data mining and its relationship with data science, analytics, statistical modeling, and visualization.

So let’s have a look at the chapters briefly!

What this book covers

Chapter 1, Data Manipulation Using In-built R Data, gives a glimpse of programming basics using R, how to read and write data, programming notations, and syntax understanding with the help of a real-world case study. It also includes R scripts for practice to get hands-on experience of the concepts, terminologies, and underlying reasons for performing certain tasks. The chapter is designed in such a way that any reader with little programming knowledge should be able to execute R commands to perform various data mining tasks. We will discuss in brief the meaning of data mining and its relations with other domains such as data science, analytics, and statistical modeling; apart from this, we will start the data management topics using R.

Chapter 2, Exploratory Data Analysis with Automobile Data, helps learners understand exploratory data analysis. It involves numerical as well as graphical representation of the variables in a dataset, for easy understanding and quick conclusions about that dataset. It is important to get an understanding of the dataset, the types of variables considered for analysis, the association between various variables, and so on. The chapter also covers creating cross-tabulations to understand the relationship between categorical variables and performing classical statistical tests to verify various hypotheses about the data.

Chapter 3, Visualize Diamond Dataset, covers the basics of data visualization along with how to create advanced visualizations using existing libraries in the R programming language. Numbers and statistics may tell a similar story for the variables we examine across different cuts; however, when we look at the relationship between variables and factors visually, a different story can emerge. Hence, data visualization conveys a message that numbers and statistics alone fail to convey.

Chapter 4, Regression with Automobile Data, introduces the basics of predictive analytics using regression methods, including various linear and nonlinear regression methods in R. You will be able to understand the theoretical background as well as get practical hands-on experience with all the regression methods using R.

Chapter 5, Market Basket Analysis with Groceries Data, shows the second method of product recommendation, popularly known as Market Basket Analysis (MBA) and also known as association rules. It associates items purchased together at the transaction level, finds sub-segments of users who buy similar products, and recommends products on that basis. MBA can also be used to form upsell and cross-sell strategies.
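As a rough illustration of what association rules measure, the following sketch computes support, confidence, and lift for a single rule in base R. The toy transactions, the rule {bread} => {butter}, and the helper name support are assumptions made for this example; they are not taken from the book's Groceries data or its code.

```r
# Toy transactions (hypothetical data, not the book's Groceries dataset)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)

# Hypothetical helper: fraction of transactions that contain all of `items`
support <- function(items, txns) {
  mean(vapply(txns, function(t) all(items %in% t), logical(1)))
}

# Rule under study: {bread} => {butter}
lhs <- "bread"
rhs <- "butter"

# Support: how often both sides of the rule occur together
rule_support <- support(c(lhs, rhs), transactions)

# Confidence: among transactions containing lhs, the share that also contain rhs
rule_confidence <- rule_support / support(lhs, transactions)

# Lift: confidence relative to the baseline frequency of rhs
rule_lift <- rule_confidence / support(rhs, transactions)

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

A lift below 1 suggests the two items co-occur less often than they would if purchased independently, while a lift above 1 suggests a positive association that may be worth a cross-sell recommendation.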

Chapter 6, Clustering with E-commerce Data, teaches what segmentation is, how clustering can be applied to perform segmentation, which methods are used for clustering, and how the various segmentation methods compare.

Chapter 7, Building a Retail Recommendation Engine, covers the following things and their implementation using the R programming language: what recommendation is and how it works, types and methods for performing recommendation, and implementation of product recommendation using R.

Chapter 8, Dimensionality Reduction, implements dimensionality reduction techniques such as principal component analysis (PCA), singular value decomposition (SVD), and iterative feature selection methods using a practical dataset and R. As data grows in volume and variety, the number of dimensions in datasets keeps rising. Dimensionality reduction techniques have many applications in different industries, such as image processing, speech recognition, recommendation engines, text processing, and so on.
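To make the idea of dimensionality reduction concrete, here is a minimal sketch of PCA using prcomp from base R's stats package. The built-in mtcars data and the 90% variance cutoff are assumptions chosen for illustration; the chapter itself may use a different dataset and criterion.

```r
# Built-in mtcars data stands in for the book's dataset (an assumption)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching the illustrative 90% cutoff
k <- which(cum_var >= 0.90)[1]

# Reduced representation: one row per car, k columns instead of 11
reduced <- pca$x[, seq_len(k), drop = FALSE]

print(round(cum_var, 3))
print(dim(reduced))
```

Scaling the variables first (scale. = TRUE) keeps variables measured in large units, such as displacement, from dominating the components.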

Chapter 9, Applying Neural Networks to Healthcare Data, teaches you various types of neural networks, methods, and variants of neural networks with different functions to control the training of artificial neural networks in performing standard data mining tasks such as these: prediction of real-valued output using regression-based methods, prediction of output levels in a classification-based task, forecasting future values of a numerical attribute based on historical data, and compressing features to recognize important ones in order to perform prediction or classification.

What you need for this book

To follow the examples and code shared along with this book, you need to download R from https://cran.r-project.org/ and install it on your machine (downloading RStudio from https://www.rstudio.com/ is optional). There are no specific hardware requirements; any computer with more than 2 GB of RAM will do, and R works on all platforms, including Mac, Linux, and Windows.

Who this book is for

This book is for readers who are starting their career in data mining, data science, or predictive modeling, or who are at an intermediate level with some degree of statistical and programming knowledge. Basic statistical knowledge is a must to understand the data mining concepts covered in this book. Prior programming knowledge is not mandatory, as the first couple of chapters cover data management and basic statistical analysis using R. This book is also for students, professionals, and experienced people aspiring to become data analysts.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In the current scenario from the ArtPiece dataset, we are trying to predict whether a work of art is a good purchase, or not, by taking a few business-relevant variables."

Any command-line input or output is written as follows:

> fit <- neuralnet(formula = CurrentAuctionAveragePrice ~ Critic.Ratings +
    Acq.Cost + CollectorsAverageprice + Min.Guarantee.Cost, data = train,
    hidden = 15, err.fct = "sse", linear.output = F)
> fit
Call: neuralnet(formula = CurrentAuctionAveragePrice ~ Critic.Ratings +
    Acq.Cost + CollectorsAverageprice + Min.Guarantee.Cost, data = train,
    hidden = 15, err.fct = "sse", linear.output = F)

1 repetition was calculated.

           Error Reached Threshold Steps
1 54179625353167    0.004727494957    23

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.

2. Hover the mouse pointer over the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you're looking to download the code files.

6. Choose from the drop-down menu where you purchased this book.

7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

· WinRAR / 7-Zip for Windows

· Zipeg / iZip / UnRarX for Mac

· 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Data-Mining-Blueprints. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!