Exploratory Data Analysis - Thinking in LINQ: Harnessing the power of functional programming in .NET applications (2014)


Chapter 8. Exploratory Data Analysis

Generalization is an extremely powerful concept when applied correctly. For example, in MATLAB even the most trivial addition is performed as a matrix addition. Data comes in many formats. Often these formats are not ready for analysis, so programmers, researchers, and data scientists need to write a lot of data-wrangling code to get the data into a useful form. However, LINQ has changed the way programmers interact with data. LINQ works on the generalization that data is a list—of something. For example, you can think of a database table as a list of rows, an XML file as a list of nodes, a CSV file as a list of comma-delimited string arrays, and so on.

Exploratory data analysis typically starts with a set of questions and then tries to obtain answers by examining the available data. Sometimes this is done just by finding statistics, but other times plotting data helps find trends or compare values.

In some of the examples in this chapter, you will see data visualizations that represent the findings in a concise manner. This approach often leads to insights that go beyond the initial answers to questions. You will see examples of using LINQ to analyze data and find such insights. For each problem presented here, a few questions will be posed, and then you’ll see example LINQ queries that can find the answers to those questions. Along the way, you should pick up some useful LINQ idioms that I hope are general enough to apply to your own data analysis problems. Anonymous types and anonymous methods will play a major role in most of the queries in this chapter.

8-1. Analyzing the Titanic Survivors Dataset

The Titanic dataset is a CSV file containing the list of passengers and their survival status. The dataset contains the following fields:

Field Name     Description
PassengerId    The ID of the passenger. This is an integer.
Survived       A Boolean field that has the value 1 or 0; 1 means the passenger survived.
Pclass         The class of the passenger: first, second, or third.
Name           Name of the passenger.
Sex            Gender of the passenger.
Age            Age of the passenger.
SibSp          Number of siblings or spouses of the passenger onboard.
Parch          Number of parents or children of the passenger onboard.
Ticket         Ticket number of the passenger.
Fare           The fare paid by the passenger.
Cabin          Cabin number of the passenger.
Embarked       The port the passenger boarded from: S (Southampton), C (Cherbourg), or Q (Queenstown)

Problem

The question I posed for this dataset is, “What’s the chance of survival for passengers in each class, grouped by gender?”

Solution

To begin, you need to load the Titanic dataset from the CSV file into an in-memory collection. Because the dataset has names that are wrapped in double quotes and contain embedded commas, I had to use regular expressions to protect those commas before splitting each row.

Open a new LINQPad tab and write the code in Listing 8-1. Change the value of the Language box drop-down to C# Statements.

Listing 8-1. Finding Titanic passenger survival rates

string text = File.ReadAllText(@"C:\titanic.csv");

Regex.Matches(text, "\"[A-Za-z ., ()'-/]+\"").Cast<Match>()
    .Select(m => m.Value)
    .ToList()
    .ForEach(z => text = text.Replace(z, z.Replace(",", "[__COMMA__]")));

text.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
    .Skip(1) //Skip the column header row of the CSV file
    .Select(t => t.Split(','))
    .Select(t => new
    {
        PassengerId = t[0],
        Survived = t[1] == "1" ? "Yes" : "No",
        Pclass = t[2],
        Name = t[3].Replace("[__COMMA__]", ","),
        Sex = t[4],
        Age = t[5].Length != 0 ? Convert.ToDouble(t[5]) : -1,
        SibSp = t[6],
        Parch = t[7],
        Ticket = t[8],
        Fare = Convert.ToDouble(t[9]),
        Cabin = t[10],
        Embarked = t[11]
    }) //At this point the CSV is loaded as a collection of an anonymous type
    .Select(f => new Tuple<string, double, string, string, double>(
        f.Pclass, Math.Round(f.Fare, 2), f.Survived, f.Sex, f.Age))
    .ToLookup(f => f.Item1)
    .OrderByDescending(f => f.Key)
    .ToDictionary(f => f.Key, f => new KeyValuePair<double, double>(
        100 * ((double)f.Count(x => x.Item4 == "female" && x.Item3 == "Yes") /
               (double)f.Count(j => j.Item4 == "female")),
        100 * ((double)f.Count(x => x.Item4 == "male" && x.Item3 == "Yes") /
               (double)f.Count(j => j.Item4 == "male"))))
    .Select(f => new
    {
        PClass = f.Key,
        FemaleSurvivalRate = Math.Round(f.Value.Key, 3),
        MaleSurvivalRate = Math.Round(f.Value.Value, 3)
    })
    .OrderByDescending(f => f.FemaleSurvivalRate)
    .Dump("Survivor Percentage per class");

This produces the output shown in Figure 8-1.


Figure 8-1. Survivor percentage per class

LINQPad has a cool feature that draws a horizontal bar chart. As you click the bar icons visible in the top row, LINQPad draws the bar charts as shown in Figure 8-2.


Figure 8-2. Bar charts for survivor percentage for each passenger class

How It Works

First, the code masks the commas within names with the special string [__COMMA__] so that splitting the values in any given row at the commas later in the process doesn't break the field values apart. Then the code parses the CSV, splitting each row (except the first row, which contains the header) at the commas, thus generating a list of values. These values then get assigned to properties of an anonymous type. Each anonymous type instance represents one row of the CSV file, and the column header names become properties of the anonymous type.
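The masking step can be tried in isolation. The sketch below uses a made-up row and a simplified quoted-field pattern ("[^"]+") rather than the character-class regex of Listing 8-1; MaskQuotedCommas is a helper name invented for this example:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class MaskDemo
{
    // Masks commas inside double-quoted CSV fields so a later Split(',')
    // can't break a quoted value into pieces.
    public static string MaskQuotedCommas(string text)
    {
        Regex.Matches(text, "\"[^\"]+\"").Cast<Match>()
             .Select(m => m.Value)
             .ToList()
             .ForEach(q => text = text.Replace(q, q.Replace(",", "[__COMMA__]")));
        return text;
    }

    static void Main()
    {
        string row = "1,0,3,\"Braund, Mr. Owen\",male";
        string masked = MaskQuotedCommas(row);

        Console.WriteLine(masked);                   // 1,0,3,"Braund[__COMMA__] Mr. Owen",male
        Console.WriteLine(masked.Split(',').Length); // 5
    }
}
```

Once the row is split, the original name is recovered by the reverse Replace("[__COMMA__]", ","), exactly as Listing 8-1 does in its Name property.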

Next, the collection of this anonymous type is projected with five fields—class, fare, survival status, sex, and age—by using the following code:

Select (f => new Tuple<string,double,string,string,double>
(f.Pclass,Math.Round(f.Fare,2),f.Survived,f.Sex,f.Age))

The code then creates a lookup table in which the key represents the passenger classes. Later it creates a dictionary from this lookup table: the dictionary keys are the passenger classes, and the values are key/value pairs of doubles that represent the survival percentages of females and males, respectively, for the associated passenger class. The keys of the key/value pair represent the female survival percentages, while the values represent the male survival percentages.

The final call to Select()

Select (f => new { PClass = f.Key,
FemaleSurvivalRate = Math.Round(f.Value.Key,3) ,
MaleSurvivalRate = Math.Round(f.Value.Value,3)} )

projects these dictionary values in a meaningful way, with three columns: PClass, FemaleSurvivalRate, and MaleSurvivalRate. This result is then sorted in descending order by FemaleSurvivalRate. As expected, the result shows that passengers in the higher classes were more likely to survive.
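On a toy dataset, the lookup-then-dictionary idiom reduces to a few lines. The rows and the FemaleSurvivalPercent helper below are invented for illustration; only the female rate is computed to keep the sketch short:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SurvivalDemo
{
    // Each tuple: (passenger class, sex, survived). Returns, per class,
    // female survivors as a percentage of female passengers in that class.
    public static Dictionary<string, double> FemaleSurvivalPercent(
        IEnumerable<Tuple<string, string, string>> rows)
    {
        return rows
            .ToLookup(r => r.Item1) // one group per passenger class
            .ToDictionary(
                g => g.Key,
                g => 100.0 * g.Count(r => r.Item2 == "female" && r.Item3 == "Yes")
                           / g.Count(r => r.Item2 == "female"));
    }

    static void Main()
    {
        var rows = new[]
        {
            Tuple.Create("1", "female", "Yes"),
            Tuple.Create("1", "female", "No"),
            Tuple.Create("1", "male",   "No"),
            Tuple.Create("3", "female", "Yes")
        };

        var rates = FemaleSurvivalPercent(rows);
        Console.WriteLine(rates["1"]); // 50
        Console.WriteLine(rates["3"]); // 100
    }
}
```

Listing 8-1 applies this same shape twice in one ToDictionary call, packing the female and male percentages into a KeyValuePair<double,double>.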

Problem

Another question you could pose using this data is, “What’s the survival percentage grouped by the following age ranges?”

·     0–2: Infants

·     2–6: Toddlers

·     6–12: Kids

·     13–19: Teenagers

·     20–30: Young adults

·     30–35: Early thirties

·     36–40: Late thirties

·     40–50: Middle-aged

·     50–60: Old

·     60+: Retired

Solution

To answer the question, you need a way to create the age groups by using the age value. The anonymous function shown in Listing 8-2 does that.

Listing 8-2. Creating age range groups

Func<double,string> AgeGroup =
    x => x == -1 ? "Unknown"
    : x < 2   ? "Infants"
    : x < 6   ? "Toddlers"
    : x < 13  ? "Kids"
    : x < 20  ? "Teenagers"
    : x < 30  ? "Young Adults"
    : x < 36  ? "Early thirties"
    : x < 40  ? "Late thirties"
    : x <= 50 ? "Middle Aged"
    : x < 60  ? "Old"
    : "Retired";

The data contains a few rows where the age value is missing, so for those rows the code assigns the sentinel value -1. The age-group function must treat that sentinel separately; otherwise, rows with missing ages would silently fall into one of the real age groups and skew the results.

As in the previous problem, you also want to project the CSV to a collection of an anonymous type. To avoid duplicating code, I won’t repeat that code here. Everything up to the comment //At this point the CSV is loaded as a collection of an anonymous type in Listing 8-1 shown previously is identical in this solution too. After that, the projection in Listing 8-3 does the job.

Listing 8-3. Calculating survivor percentages by age range

.Select(x => new
{
    Sex = x.Sex,
    Age = x.Age,
    Embarked = x.Embarked,
    AgeGroup = AgeGroup(x.Age),
    Survived = x.Survived
})
.ToLookup(x => x.AgeGroup)
.ToDictionary(x => x.Key, x => new KeyValuePair<double, double>(
    100 * ((double)x.Count(z => z.Sex == "female" && z.Survived == "Yes") / (double)x.Count()),
    100 * ((double)x.Count(z => z.Sex == "male" && z.Survived == "Yes") / (double)x.Count())))
.Select(x => new
{
    AgeGroup = x.Key,
    FemaleSurvival = Math.Round(x.Value.Key, 2),
    MaleSurvival = Math.Round(x.Value.Value, 2)
})
.OrderByDescending(x => x.FemaleSurvival)
.Dump("Agewise survival percentages");

Figure 8-3 shows the output sorted by female survival percentage in descending order for each age group.


Figure 8-3. Survivors in different age groups

How It Works

The anonymous function AgeGroup returns the age group for each row, given the age value. For example, if the age value is 14, AgeGroup returns Teenagers. Next, the code creates a lookup table in which the keys represent the age groups. ToDictionary() transforms the lookup table into a dictionary whose keys still represent the age groups, but whose values are key/value pairs of doubles. The first double is the number of female survivors in that age group expressed as a percentage of all passengers in the group, and the second double is the same percentage for male survivors.

However, up to this point in the code, all these values are internal; thus it is hard to make sense of the data. Therefore, a Select() call projects this data as an IEnumerable of an anonymous type with three attributes: AgeGroup, FemaleSurvival, and MaleSurvival; the last two are percentages.

Finally, to find out which age range of female passengers was most likely to survive, the result is sorted by the FemaleSurvival column in descending order.
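The classifier itself is easy to exercise in isolation. Below is a condensed variant of Listing 8-2 (the buckets above 30 are collapsed into one, and the -1 sentinel is given its own "Unknown" label; both are simplifications for this sketch, not the book's exact code):

```csharp
using System;

class AgeGroupDemo
{
    // Condensed age classifier: cascaded conditional expressions,
    // evaluated top to bottom, first match wins.
    public static string AgeGroup(double x)
    {
        return x == -1 ? "Unknown"
             : x < 2   ? "Infants"
             : x < 6   ? "Toddlers"
             : x < 13  ? "Kids"
             : x < 20  ? "Teenagers"
             : x < 30  ? "Young Adults"
             : "Older";
    }

    static void Main()
    {
        Console.WriteLine(AgeGroup(14));   // Teenagers
        Console.WriteLine(AgeGroup(0.42)); // Infants
        Console.WriteLine(AgeGroup(-1));   // Unknown
    }
}
```

Note that the Titanic data contains fractional ages (such as 0.42 for infants), which is why the classifier takes a double rather than an int.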

8-2. Converting SurveyMonkey Results to CSV

Last year during my company’s official outing, a colleague of mine tasked with handling the logistics created a survey on the popular free survey web site SurveyMonkey. He had to arrange for buses for people to travel from our office to the resort where the day’s outing was planned. However, he wanted to be sure of the number of people who would need the bus service before striking a deal with our travel vendor.

Unfortunately, after the survey was complete, he found the data unusable: it wasn’t in CSV format, which meant that running any kind of analysis was difficult. I wrote a converter to get the data into CSV format, but that converter wasn’t generic; it would work only for his particular survey. It later occurred to me that such conversions are a common issue. Therefore, in writing this chapter, I created a parser that can translate any SurveyMonkey results to CSV file format. The headers of the CSV file will be the questions used for the survey.

Problem

Write a parser that can parse SurveyMonkey results and return a CSV file.

Solution

SurveyMonkey results aren’t CSV files, so the first step is to get the data into CSV format. Listing 8-4 shows the code.

Listing 8-4. Turning SurveyMonkey results into CSV format

//Parsing Survey Monkey Results
string result = @"Will you come by bus
No
Name
Sam
Phone number
1234
Will you come by bus
Yes
Name
Ram
Phone number
3213
Will you come by bus
Yes
Name
Raul
Phone number
4245";

string[] questions = {"Will you come by bus","Name","Phone number"};
var allResponses = result.Split(questions,StringSplitOptions.RemoveEmptyEntries)
.Select (r => r.Trim());
int numberOfResponses = allResponses.Count ()/questions.Length;

string csv =
    //Headers
    questions
        .Select(q => "\"" + q + "\"")
        .Aggregate((h1, h2) => h1 + "," + h2) +
    //Insert Newline
    Environment.NewLine +
    //Rows
    Enumerable.Range(0, numberOfResponses)
        .Select(e => allResponses.Skip(e * questions.Length).Take(questions.Length))
        .Select(e => Enumerable.Range(0, questions.Length)
            .Select(en => e.ElementAt(en)))
        .Select(e => e.Select(x => "\"" + x + "\"")
            .Aggregate((m, n) => m + "," + n))
        .Aggregate((a, b) => a + Environment.NewLine + b);

csv.Dump("CSV representation");

This generates the output shown in Figure 8-4.


Figure 8-4. SurveyMonkey result converted to CSV format

If you save the preceding content in a text (.csv) file and then open it in Microsoft Excel, it shows up formatted as shown in Figure 8-5.


Figure 8-5. SurveyMonkey converted result shown in Excel

How It Works

Splitting the result by the questions leaves only the answers, which appear in the same order as the questions. The total number of responders is equal to the total number of answers divided by the number of questions. Each such set of answers represents one row.

The code allResponses.Skip(e*questions.Length).Take(questions.Length) returns the set of answers that belong to the current responder. On the first run, the value of e is zero, so the number of elements picked from the start is equal to the number of questions. At each iteration, the value of e increases by one. Thus the responses already consumed are skipped, and the next answer set is picked up to form the current row.

Skip() followed by Take() is a common idiom to progressively advance the scanning window of any algorithm that picks some elements, skipping a few from the beginning of a given collection.
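The idiom can be distilled into a reusable sketch; the Rows helper and the sample answers below are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ChunkDemo
{
    // Splits a flat answer list into rows of `size` answers each,
    // using the Skip/Take windowing idiom.
    public static IEnumerable<string> Rows(string[] answers, int size)
    {
        return Enumerable.Range(0, answers.Length / size)
            .Select(i => string.Join(",", answers.Skip(i * size).Take(size)));
    }

    static void Main()
    {
        string[] answers = { "No", "Sam", "1234", "Yes", "Ram", "3213" };
        foreach (var row in Rows(answers, 3))
            Console.WriteLine(row);
        // No,Sam,1234
        // Yes,Ram,3213
    }
}
```

Each iteration skips the answers already consumed and takes the next window, which is exactly how Listing 8-4 forms its CSV rows.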

As you can see, this code is almost completely generic: to run it against data from a different survey, you just need to change the values in the questions array, and you will be finished.

8-3. Analyzing Trends in Baby Names

Baby names always capture the imagination of new parents. Even though I have already named my son, I still find the trend analysis of baby names a fascinating topic. At one point, I found a list of baby names along with a measure for popularity. The data was a CSV file, which had the following fields:

Field      Description
Year       The year for the information in the current row. The range of the years is 1880 to 2008.
Name       The name of the baby.
Percent    A measure of popularity.
Sex        Gender of the baby.

Problem

Looking at this dataset, I posed a couple of questions that would help make sense of the data:

1.    What are the top ten boys’ and girls’ names?

2.    How has the popularity of a name changed over the course of the years?

3.    What are the top ten most popular boy and girl names for each decade?

Solution

Go to a new LINQPad tab and add the query shown in Listing 8-5.

Listing 8-5. Determining name popularity

var babyNames = File.ReadAllLines(@"C:\Personal\TableAPI\baby-names.csv")
    .Select(f => f.Split(','))
    .Skip(1) //Skip the header row
    .Select(a => new
    {
        Year = Convert.ToInt32(a[0]),
        Name = a[1].Trim(new char[] { '"', ' ' }),
        Percentage = Convert.ToDouble(a[2]),
        Sex = a[3].Trim(new char[] { '"', ' ' })
    });

babyNames
    .Where(n => n.Sex == "boy") //This analysis is being done for baby boy names.
    .ToLookup(n => n.Name)
    .ToDictionary(n => n.Key)
    .Select(n => new
    {
        Name = n.Key,
        Popularity = n.Value
            .Select(v => new { Year = v.Year, PopularityPercentage = v.Percentage })
            .ToList()
    })
    .OrderByDescending(n => n.Popularity.Select(p => p.PopularityPercentage).Average())
    .Take(10) //Show top 10 names as per the average popularity
    .Dump("Popularity of top 10 baby \"boy\" names over the years");

Figure 8-6 shows the overall result structure. I have collapsed the entries for each name.


Figure 8-6. The top ten baby boy names between 1880 and 2008

Figure 8-7 shows a partial view of the 129 items for the name John.


Figure 8-7. The decreasing popularity of the name John over time

Although these results show the top baby names between 1880 and 2008, the format makes it hard to compare the popularity of one name to another. It is also hard to figure out when a particular name started to become more popular than its contemporaries.

I could have found these insights immediately if the data were plotted, as you’ll see. For the example, I’ve used Highcharts—a popular JavaScript-based data-visualization engine. You can download it from highcharts.com.

After downloading Highcharts, you can explore all the chart types it supports, but this example uses only the area-inverted chart. Open the file index.html located under the examples/area-inverted/ folder. You will see that you provide data to this chart in JSON format:

series: [{
name: 'John',
data: [3, 4, 3, 5, 4, 10, 12]
}, {
name: 'Jane',
data: [1, 3, 4, 3, 3, 5, 4]
}]

In the previous sections, you saw how to get the data for each name. That makes it easy to generate data in JSON format so you can chart name popularity. Add the following code after the Take(10) call in the previous code snippet:

.Select (n => "{" + String.Format(@"
name: '{0}',
data: [{1}]
",n.Name,n.Popularity.Select (p => p.PopularityPercentage.ToString())
.Aggregate ((p,q) => p + "," + q ))+"}")
.Aggregate ((m,n) => m + "," + n )

This generates all the series using all the names and their associated popularity percentages for all the years in the range. Copy this generated string representing the series and replace the series section in the existing file. Change the name of the file to something you prefer so you don’t lose the original example.

The y axis of the chart needs to show the years, so edit the categories to reflect that change. Here is a partial list of categories:

categories: ['1880','1881','1882','1883','1884','1885','1886','1887','1888','1889','1890','1891','1892',
'1893','1894','1895','1896','1897','1898','1899','1900','1901','1902','1903','1904','1905',
'1906','1907','1908','1909','1910','1911','1912','1913','1914','1915','1916','1917','1918',
'1919','1920','1921','1922','1923','1924','1925','1926','1927','1928','1929','1930','1931', ... '1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002',
'2003','2004','2005','2006','2007','2008']

The final remaining task is to change the chart title to Baby Name Popularity from 1880 to 2008 and the x-axis caption to Baby Name Popularity. After completing those changes, open the new file in your browser. You will see something similar to Figure 8-8.


Figure 8-8. Popularity of baby boy names over time

You can see the color version of this visualization at https://twitter.com/samthecoder/status/502471431519608832/photo/1.

I did the same experiment with baby girl names and found the trends shown in Figure 8-9.


Figure 8-9. Popularity of baby girl names over time

The color visualization is here: https://twitter.com/samthecoder/status/502472516342784000/photo/1.

By looking at the charts, I can say with confidence that John and Mary have been by far the most popular names in the last 130 years.

How It Works

The CSV file is first loaded into a collection of an anonymous type representing each row of the data. Each row has four columns: Year, Name, Percentage, and Sex. The Skip(1) call skips the CSV header row, while the final Select() call generates the list of anonymous-type instances that represents the CSV as an in-memory collection.

The filter Where (n => n.Sex == "boy") filters out baby girl names, leaving only baby boy names in the collection. Next the code creates a lookup table with each name using .ToLookup (n => n.Name). This lookup table is converted to a dictionary where the keys of the dictionary represent baby boy names and the values represent the list of years and the percentage popularity of the name in that year. Figure 8-7 shows one such entry from this dictionary.

Finally, the dictionary entries are sorted based on the average popularity of the names, in descending order, ensuring that the most popular names appear at the top. The top ten entries are selected to discover the top ten most popular baby boy names.

To do this analysis for baby girl names, you need to change only the Where clause to filter out the baby boy names instead.
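The average-popularity ordering at the heart of this query can be sketched on invented numbers; ByAveragePopularity is a helper name made up for this example:

```csharp
using System;
using System.Linq;

class TrendDemo
{
    // Groups (name, percentage) rows by name and orders the names
    // by their average popularity, descending.
    public static string[] ByAveragePopularity(Tuple<string, double>[] rows)
    {
        return rows
            .ToLookup(r => r.Item1)
            .OrderByDescending(g => g.Average(r => r.Item2))
            .Select(g => g.Key)
            .ToArray();
    }

    static void Main()
    {
        var rows = new[]
        {
            Tuple.Create("John", 8.0), // average 7.0
            Tuple.Create("John", 6.0),
            Tuple.Create("Gary", 1.0), // average 1.5
            Tuple.Create("Gary", 2.0)
        };
        Console.WriteLine(string.Join(",", ByAveragePopularity(rows))); // John,Gary
    }
}
```

Because each IGrouping is itself enumerable, Average() can run directly over the rows of each name's group, which is what the OrderByDescending clause in Listing 8-5 relies on.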

8-4. Analyzing Stock Values

Data about the stock prices of many companies is available from Yahoo Finance. If you want to know historical stock prices for Google, you can find them at http://finance.yahoo.com/q/hp?s=GOOG. GOOG is the ticker symbol for Google. For Microsoft, it is MSFT; for Apple, it is AAPL.

If you visit this page, you will see a table showing the historical values for the company for which the symbol is provided. Below the table is a CSV file available for download. If you copy the link to the CSV file, it looks like this:

http://real-chart.finance.yahoo.com/table.csv?s=
GOOG&d=7&e=22&f=2014&g=d&a=2&b=27&c=2014&ignore=.csv

Note that the query string in the link has the symbol s=GOOG. So if you wanted to download historical stock prices for Microsoft, you would change the ticker symbol to MSFT, resulting in the following query string:

http://real-chart.finance.yahoo.com/table.csv?s=
MSFT&d=7&e=22&f=2014&g=d&a=2&b=27&c=2014&ignore=.csv

Problem

Given a list of ticker symbols, create a program to load the historical stock values for those companies.

Solution

The code in Listing 8-6 uses the historical values from the Yahoo CSV files mentioned above.

Listing 8-6. Loading historical values for specified stock symbols

string[] symbols = { "AAPL", "GOOG", "MSFT" };
WebClient wc = new WebClient();
//This structure will hold the stock values
List<Tuple<string, DateTime, double, double, double, double, double, Tuple<double>>> mapping =
    new List<Tuple<string, DateTime, double, double, double, double, double, Tuple<double>>>();
foreach (var symbol in symbols)
{
    File.Delete("temp.csv");
    //The URL string must appear on a single line; a line break inside it breaks the request
    wc.DownloadFile(
        String.Format(
            @"http://real-chart.finance.yahoo.com/table.csv?s={0}&d=7&e=19&f=2014&g=d&a=2&b=13&c=1986&ignore=.csv",
            symbol),
        "temp.csv");

    mapping.AddRange(File.ReadAllLines(@"temp.csv")
        .Skip(1) //Skip the header
        .Select(l =>
        {
            var toks = l.Split(',');
            return new Tuple<string, DateTime, double, double, double, double, double, Tuple<double>>(
                symbol,
                DateTime.Parse(toks[0]),
                Convert.ToDouble(toks[1]),
                Convert.ToDouble(toks[2]),
                Convert.ToDouble(toks[3]),
                Convert.ToDouble(toks[4]),
                Convert.ToDouble(toks[5]),
                //The last element must be a Tuple again.
                new Tuple<double>(Convert.ToDouble(toks[6])));
        }));
}

var stocks = mapping.Select(m => new
    {
        Symbol = m.Item1,
        Date = m.Item2,
        Open = m.Item3,
        High = m.Item4,
        Low = m.Item5,
        Close = m.Item6,
        Volume = m.Item7,
        AdjClose = m.Rest.Item1
    })
    .ToLookup(m => m.Symbol)
    .SelectMany(m => m.Take(7));

stocks.Dump("Stock values for last month");

Figure 8-10 shows a partial result of this query.


Figure 8-10. Values of stock prices for some companies during a week

Note: This code uses the WebClient class from the System.Net namespace, so you must add a reference to the containing assembly and import that namespace in LINQPad to get this code to work.

How It Works

This example demonstrates the special case of tuple creation. For a tuple with eight or more elements, the last element must be a tuple again. This element is denoted as Rest for the outer tuple, and its elements are accessed the usual way, as Rest.Item1 and so on.
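A minimal illustration of the eight-element rule, with made-up values:

```csharp
using System;

class TupleDemo
{
    static void Main()
    {
        // A Tuple with eight type parameters: the eighth (TRest) must itself
        // be a tuple, and it is exposed through the Rest property.
        var row = new Tuple<string, int, int, int, int, int, int, Tuple<int>>(
            "MSFT", 1, 2, 3, 4, 5, 6, new Tuple<int>(7));

        Console.WriteLine(row.Item1);      // MSFT
        Console.WriteLine(row.Rest.Item1); // 7
    }
}
```

Passing a non-tuple type as the eighth argument throws an ArgumentException at construction time, which is why Listing 8-6 wraps the adjusted close price in new Tuple<double>(...).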

Tuples are great for representing rows of CSV/database tables. However, one problem with using a tuple is that you can’t name its properties. It is difficult to understand the intent of code such as something.Item1, where something is a tuple. But by projecting a list of tuples by using Select(), you can create easy-to-understand code. This code employs two Select() calls. This is an idiom you’ll find frequently in LINQ code.

The second Select() call gives proper names to the tuple items.

After projecting the CSV as a list of anonymous objects, the code creates a lookup table using the symbol as the key. Because there are three symbols in this example, there will be three keys. So when SelectMany() is called with m => m.Take(7), it takes 7 rows from each entry in the lookup table, resulting in 21 rows. If necessary, you can dump the lookup table to visualize what’s happening. That’s a big benefit of using LINQPad. You can even save these internal views of your code for future reference.
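The arithmetic is easy to verify on a toy lookup; the symbols below are sample values, not downloaded data:

```csharp
using System;
using System.Linq;

class FlattenDemo
{
    // Takes the first `n` rows per key from a lookup and flattens the result.
    public static int RowsTaken(string[] symbols, int n)
    {
        return symbols
            .ToLookup(s => s)
            .SelectMany(g => g.Take(n))
            .Count();
    }

    static void Main()
    {
        // Two distinct keys, three rows each; taking 2 per key flattens to 4 rows.
        string[] symbols = { "AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT" };
        Console.WriteLine(RowsTaken(symbols, 2)); // 4
    }
}
```

With three keys and Take(7), the same pattern yields the 21 rows described above.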

8-5. Analyzing Git Logs

Contributors around the world participate in open source projects and contribute their code. It is interesting to see how these projects evolve, using code contributions from people who don’t even know each other. Git maintains logs of all the commits that have happened. Figure 8-11 shows a snapshot of a Git log for the jQuery project.


Figure 8-11. A portion of the JQuery Git log

Problem

Here are a couple of data analysis tasks that make sense for any project that uses Git, either publicly or on a private server. I will start with the simpler task:

Develop a leaderboard that shows the top contributors, sorted by number of commits in descending order. (Sometimes developers commit more often than they would prefer because of missing files and such. But for this example, I have assumed that all commits are genuine and not the result of fixing an earlier faulty commit.)

Solution

Save the JQuery Git log to a text file, jquerygitlogs.txt. Then write the query in Listing 8-7 in a new LINQPad tab. This query finds the contributors, including the project founder, who committed more than their peers to JQuery.

Listing 8-7. Creating a Jquery Git leaderboard

string log = File.ReadAllText("C:\\jquerygitlogs.txt");
string[] commits = Regex.Matches(log, "commit [a-zA-Z0-9]{40}")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
string[] authors = Regex.Matches(log, "Author: [a-zA-Z0-9-. @<>']+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
string[] dates = Regex.Matches(log, "Date: [a-zA-Z0-9-:+ ]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

//Take the shortest array length so the three arrays stay aligned
int length = new List<int> { commits.Length, authors.Length, dates.Length }.Min();

List<Tuple<string, string, string>> details = new List<Tuple<string, string, string>>();
Enumerable.Range(0, length).ToList()
    .ForEach(k => details.Add(new Tuple<string, string, string>(commits[k], authors[k], dates[k])));

details.Select(d => new
    {
        Author = d.Item2.Substring(d.Item2.IndexOf(':') + 1),
        Date = DateTime.ParseExact(
            d.Item3.Substring(d.Item3.IndexOf(' ')).Trim(),
            "ddd MMM d HH:mm:ss yyyy zzz", CultureInfo.InvariantCulture),
        Location = d.Item3.EndsWith("-0700") ? "USA/Canada" : "Elsewhere"
    })
    .ToLookup(d => d.Author)
    .Select(d => new { Author = d.Key, CommitCount = d.Count() })
    .OrderByDescending(d => d.CommitCount)
    .Take(10)
    .Dump("JQuery Leaderboard");

This produces the output shown in Figure 8-12. This result correctly shows John Resig at the top, because he is the main author of the JQuery library and has committed significantly more than other core team members.


Figure 8-12. JQuery leaderboard

How It Works

The leaderboard is created from a projection that holds data about the authors. I created a list of author names, the date when each commit was performed, and the author’s location. You can determine the rough location by examining the GMT offset. I haven’t used these variables; however, I encourage you to experiment with the information. One possible task is to identify how many contributions have been made from various continents.

Returning to the explanation at hand, the list of authors (which is basically a list of anonymous types representing author names, locations, and commit dates) is used to generate a lookup table keyed by author name. So there will be as many keys in the table as there are distinct author names in the log file. The values of this lookup table are all the commits performed by the author whose name is the key. You can dump this table to see what’s happening visually.

Later these values from the lookup table are used to generate a different projection with author names and the total number of commits each author performed. Finally, this is sorted by the number of commits in descending order. Taking the top ten such entries results in the leaderboard.
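The counting core of the leaderboard can be exercised on invented author names; Top is a helper made up for this sketch:

```csharp
using System;
using System.Linq;

class LeaderboardDemo
{
    // Counts commits per author and returns the top `n` as "author count" strings.
    public static string[] Top(string[] authors, int n)
    {
        return authors
            .ToLookup(a => a)                 // one group per distinct author
            .Select(g => new { Author = g.Key, CommitCount = g.Count() })
            .OrderByDescending(x => x.CommitCount)
            .Take(n)
            .Select(x => x.Author + " " + x.CommitCount)
            .ToArray();
    }

    static void Main()
    {
        string[] authors = { "alice", "bob", "alice", "carol", "alice" };
        foreach (var line in Top(authors, 2))
            Console.WriteLine(line);
        // alice 3
        // bob 1
    }
}
```

Because LINQ to Objects sorting is stable, authors with equal commit counts keep their first-seen order, which is why "bob" edges out "carol" here.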

Problem

Another, slightly more complex problem is to show a timeline of how the JQuery project has grown by monitoring commits.

Solution

For this, I decided to use the Highcharts line-time series chart. This chart is zoomable, which is convenient. In a new LINQPad query tab, write the query in Listing 8-8 to generate and save the number of commits done per day.

Listing 8-8. Creating a Jquery timeline by monitoring Git commits

string log = File.ReadAllText("C:\\jquerygitlogs.txt");
string[] commits = Regex.Matches(log, "commit [a-zA-Z0-9]{40}")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
string[] authors = Regex.Matches(log, "Author: [a-zA-Z0-9-. @<>']+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
string[] dates = Regex.Matches(log, "Date: [a-zA-Z0-9-:+ ]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
//A commit message can itself contain the word "commit" followed by a valid SHA ID.
//To bypass such false matches, take the minimum length of the three arrays.
var length = new List<int> { commits.Length, authors.Length, dates.Length }.Min();

List<Tuple<string, string, string>> details = new List<Tuple<string, string, string>>();
Enumerable.Range(0, length).ToList()
    .ForEach(k => details.Add(new Tuple<string, string, string>(commits[k], authors[k], dates[k])));

details.Count.Dump("Total Commits");

var logs = details.Select(d => new
{
    Author = d.Item2.Substring(d.Item2.IndexOf(':') + 1),
    Date = DateTime.ParseExact(
        d.Item3.Substring(d.Item3.IndexOf(' ')).Trim(),
        "ddd MMM d HH:mm:ss yyyy zzz",
        CultureInfo.InvariantCulture)
});

DateTime startDate = logs.OrderBy(l => l.Date).First().Date.Date;
DateTime endDate = logs.OrderBy(l => l.Date).Last().Date.Date;

startDate.Dump("Start Date");
var logMap = logs.ToLookup(l => l.Date.Date)
    .ToDictionary(l => l.Key, l => l.Count());

List<int> commitCounts = new List<int>();
for (; startDate <= endDate; startDate = startDate.AddDays(1))
{
    if (logMap.ContainsKey(startDate))
        commitCounts.Add(logMap[startDate]);
    else
        commitCounts.Add(0);
}
StreamWriter sw = new StreamWriter("C:\\dataJQuery.txt");
sw.WriteLine(commitCounts.Select(c => c.ToString()).Aggregate((m, n) => m + "," + n));
sw.Close();

This generates the following output:

5644
3/22/2006 12:00:00 AM

The output shows that there have been 5,644 commits, and that the first commit was performed on March 22, 2006. In addition to this output, the query saves the number of commits made per day in the file dataJQuery.txt. Open the file, and you will see the data shown in Figure 8-13.


Figure 8-13. The number of commits done per day in the JQuery project

Copy the entire content of this file. Go to the examples/line-time-series/ folder in the Highcharts examples directory. Make a copy of the file index.htm in the same directory, and open the copy in your favorite text editor or in Visual Studio. Locate the string data: [ and replace everything between the brackets that appear right after data: with the generated content that you copied earlier.

Right before data: you’ll see the following line:

pointStart: Date.UTC(2006, 0, 01)

Replace this line with the following:

pointStart: Date.UTC(2006, 2, 22)

This line marks the starting point of the zoomable timeline that gets rendered. The parameters Date.UTC takes are the year, an integer between 0 and 11 representing the month, and an integer between 1 and 31 representing the day of the month. Because the first commit of the JQuery project was made on March 22, 2006, the corresponding call is Date.UTC(2006, 2, 22).

Finally, change the chart title and axis titles, as follows:

Existing Text                                               Replace With
text: 'USD to EUR exchange rate from 2006 through 2008'    text: 'JQuery Growth from 2006 through 2014'
name: 'USD to EUR'                                          name: 'Commits'
text: 'Exchange rate'                                       text: 'Contributions'

At this stage, you are all set to see the visualization. Open the modified HTML file in your favorite browser. Figure 8-14 shows how it rendered in Mozilla Firefox.

image

Figure 8-14. JQuery commit pattern from inception until 2014 (the time this chapter was written)

How It Works

I have created a video of the visualization. You can see it at www.youtube.com/watch?v=i6s8hcIciUM.

The code works by identifying the dates between which commits have happened. The dictionary logMap keeps track of all the commits that were pushed to JQuery on a particular date. If a date isn't found as a key of logMap, no commit was made on that date. Using a loop, the list commitCounts is populated such that the first entry of the list stores the number of commits performed on the day the JQuery project started, and so on for every subsequent date. Finally, the series is written as comma-separated values to dataJQuery.txt.
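The gap-filling idea is language-neutral. Here is a minimal Python sketch of the same logic, using a made-up per-day commit map in place of the real logMap built from the JQuery history:

```python
from datetime import date, timedelta

# Hypothetical per-day commit counts, keyed by date. Like logMap, it is
# sparse: days with no commits are simply absent.
log_map = {
    date(2006, 3, 22): 4,
    date(2006, 3, 24): 1,
}

# Walk every day from the first commit to the last, emitting 0 for the
# missing days -- the role the C# loop plays when filling commitCounts.
start, end = min(log_map), max(log_map)
commit_counts = []
d = start
while d <= end:
    commit_counts.append(log_map.get(d, 0))
    d += timedelta(days=1)

# The same comma-joined series the C# code writes to dataJQuery.txt.
series = ",".join(str(c) for c in commit_counts)
```

With the toy map above, the dense series covers March 22-24, including the commit-free March 23.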

8-6. Analyzing Movie Ratings  

The MovieLens dataset contains movies of many genres along with ratings recorded from real users. This dataset is commonly used for testing movie recommendation engines. However, it is also just sheer fun to perform some exploratory analysis on this data. In doing so, I found some startling myth-busting information that I hadn't known before. For example, one myth is that women like romantic movies more than men do. I found the reality to be the opposite.

You can get the MovieLens files from http://grouplens.org/datasets/movielens/. There are several files in the ml-100k.zip file. For this example, I have used only the following files:

·        u.item: I renamed this file to movies.txt. (This file has details about the movies.)

·        u.user: I renamed this file to users.txt. (This file has demographic information about the users.)

·        u.data: I renamed this file to movieRatings.txt. (This file has the users' ratings of the movies.)

I recommend that you read the ReadMe.txt file available from the grouplens.org link previously provided so you understand the layout of the data being parsed for each file.

Problem

Here are the questions I tried to answer with the analysis:

1.    How many movies were made in each category?

2.    What types of movies (by category) do men and women like?

3.    What are the genre preferences of men and women?

Solution

To obtain the answer to the first question, create a new C# statement query in LINQPad and write the code in Listing 8-9.

Listing 8-9. Analyzing Movie Categories

string[] allStrings = File.ReadAllText(@"C:\Personal\TableAPI\movies.txt")
.Split(new string[]{"|","\r","\n"},StringSplitOptions.None);

var movies =
Enumerable.Range(0,allStrings.Length/24)
.ToList()
.Select ( s  => allStrings.Skip(s*24).Take(24))
.Select (s =>
{
return new
{
ID = s.ElementAt(0),
Title = s.ElementAt(1),
ReleaseDate = s.ElementAt(2).Trim(),
IMDBURL = s.ElementAt(4),
IsAction = s.ElementAt(5)=="1",
IsAdventure = s.ElementAt(6)=="1",
IsAnimation = s.ElementAt(7)=="1",
IsChildrens = s.ElementAt(8)=="1",
IsComedy = s.ElementAt(9)=="1",
IsCrime = s.ElementAt(10)=="1",
IsDocumentary = s.ElementAt(11)=="1",
IsDrama = s.ElementAt(12)=="1",
IsFantasy = s.ElementAt(13)=="1",
IsFilm_Noir = s.ElementAt(14)=="1",
IsHorror = s.ElementAt(15)=="1",
IsMusical = s.ElementAt(16)=="1",
IsMystery = s.ElementAt(17)=="1",
IsRomance = s.ElementAt(18)=="1",
IsSci_Fi = s.ElementAt(19)=="1",
IsThriller = s.ElementAt(20)=="1",
IsWar = s.ElementAt(21)=="1",
IsWestern = s.ElementAt(22)=="1"
};
});

Dictionary<string,int> moviesPerCategory = new Dictionary<string,int>();

foreach (var movie in movies)
{
movie
.GetType()
.GetProperties()
.Select (m => new KeyValuePair<string,object>(m.Name, m.GetValue(movie)))
.Skip(4)//Skipping ID,Title,ReleaseDate and IMDBURL field
.Where (f => Convert.ToBoolean(f.Value)==true)
.Select (f => f.Key.Substring(2))
.ToList()
.ForEach( k =>
{
if(!moviesPerCategory.ContainsKey(k))
moviesPerCategory.Add(k,1);
else
moviesPerCategory[k]++;
});
}
int totalMovieCount = moviesPerCategory.Select( t => t.Value).Sum();
moviesPerCategory.Select (pc =>
new {
Category = pc.Key,
Count = pc.Value,
Percentage = (100*Math.Round((double)pc.Value/(double)totalMovieCount,2))
})
.OrderByDescending (movie => movie.Percentage)
.Dump("Movie Categories");

Running this code produces the output shown in Figure 8-15 (I have enabled charting in LINQPad). It is clear from this result that during the early '90s (when this dataset was created), Fantasy movies were in great demand: one in every four movies made was a Fantasy movie.

image

Figure 8-15. Percentage of movies made in each genre

How It Works

The variable movies is a collection of anonymous-type instances, one per movie. Each public Boolean property, such as IsRomance, indicates whether the movie belongs to that genre. An individual movie can be associated with multiple genres. The goal in this example is to find the number of movies made in each genre. To represent this data, I needed a dictionary in which the keys are the movie genres and the values the number of movies in each genre. The dictionary moviesPerCategory holds this information.

The way this dictionary gets populated is particularly interesting. The loop that iterates over the movies collection uses reflection to obtain the anonymous type's public properties, keeping those that represent genres. The names of these properties are used as the keys of the moviesPerCategory dictionary. The call to Substring(2) drops the initial two letters (the Is prefix) from each property name.

To obtain the percentage of movies made in each genre, you need to know the total number of movies made. That value is saved in totalMovieCount. Finally, the dictionary entries are projected to show the genre of the movie, total number of movies made in that genre, and the percentage of movies made in that genre. This result is then shown sorted by the percentage of movies made in descending order.
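The count-flags-by-name-then-percentage pattern can be mimicked in any language that lets you enumerate field names at run time. Here is a small Python sketch of the same idea, using plain dicts in place of the anonymous type and a handful of made-up movies rather than the real dataset:

```python
# Each movie is a dict whose Is* keys play the role of the anonymous
# type's Boolean genre properties.
movies = [
    {"Title": "A", "IsComedy": True,  "IsRomance": True,  "IsWar": False},
    {"Title": "B", "IsComedy": True,  "IsRomance": False, "IsWar": False},
    {"Title": "C", "IsComedy": False, "IsRomance": True,  "IsWar": True},
]

movies_per_category = {}
for movie in movies:
    # Enumerate the fields by name, keep the true genre flags, and strip
    # the "Is" prefix -- the counterpart of GetProperties()/Substring(2).
    for name, value in movie.items():
        if name.startswith("Is") and value:
            genre = name[2:]
            movies_per_category[genre] = movies_per_category.get(genre, 0) + 1

# Project each genre to its share of all genre assignments, sorted
# descending -- the counterpart of the final Select/OrderByDescending.
total = sum(movies_per_category.values())
report = sorted(
    ({"Category": g, "Count": c, "Percentage": round(100 * c / total)}
     for g, c in movies_per_category.items()),
    key=lambda r: r["Percentage"],
    reverse=True,
)
```

Note that, as in the C# version, the percentages are shares of genre *assignments*, so a movie tagged with several genres contributes to several buckets.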

You can build on the code so far to find the answer for the second question: What types of movies (by category) do men and women like?

To find the answer, create the query in Listing 8-10 in LINQPad.

Listing 8-10. Discovering What Movies Men and Women Like

string[] allStrings = File.ReadAllText(@"C:\movies.txt")
.Split(new string[]{"|","\r","\n"},StringSplitOptions.None);
var movies =
Enumerable.Range(0,allStrings.Length/24)
.ToList().Select ( s  => allStrings.Skip(s*24).Take(24))
.Select (s =>
{
return new
{
ID = s.ElementAt(0),
Title = s.ElementAt(1),
ReleaseDate = s.ElementAt(2),
IMDBURL = s.ElementAt(4),
IsAction = s.ElementAt(5)=="1",
IsAdventure = s.ElementAt(6)=="1",
IsAnimation = s.ElementAt(7)=="1",
IsChildrens = s.ElementAt(8)=="1",
IsComedy = s.ElementAt(9)=="1",
IsCrime = s.ElementAt(10)=="1",
IsDocumentary = s.ElementAt(11)=="1",
IsDrama = s.ElementAt(12)=="1",
IsFantasy = s.ElementAt(13)=="1",
IsFilm_Noir = s.ElementAt(14)=="1",
IsHorror = s.ElementAt(15)=="1",
IsMusical = s.ElementAt(16)=="1",
IsMystery = s.ElementAt(17)=="1",
IsRomance = s.ElementAt(18)=="1",
IsSci_Fi = s.ElementAt(19)=="1",
IsThriller = s.ElementAt(20)=="1",
IsWar = s.ElementAt(21)=="1",
IsWestern = s.ElementAt(22)=="1"
};
})
.ToLookup (s => s.ID);

//Loading users in a collection
var users  = File.ReadAllText(@"C:\users.txt")
.Split(new char[]{'\r','\n'},StringSplitOptions.RemoveEmptyEntries)
.Select (f => f.Split('|'))
.Select (f => new { ID = f[0], Age = f[1], Sex = f[2],
Profession = f[3], ZIP  = f[4]} )
.ToLookup(f => f.ID);

//Loading movie ratings
var movieRatingTokens = File.ReadAllText(@"C:\movieRatings.txt")
.Split(new char[]{' ','\t','\r','\n'},StringSplitOptions.RemoveEmptyEntries);

var movieRatings = Enumerable.Range(0,movieRatingTokens.Length/4)
.Select( k => movieRatingTokens.Skip(4*k).Take(4))
.Select (k => new { UserID = k.ElementAt(0), MovieID = k.ElementAt(1), Rating =
Convert.ToInt32( k.ElementAt(2)), TimeStamp = k.ElementAt(3)
} )
.Select (k =>
{
var currentUser = users[k.UserID].First();
var movie  = movies[ k.MovieID].First();
return new { Age = currentUser.Age, Sex = currentUser.Sex,
Movie = movie.Title, Rating = k.Rating ,
IsAction = movie.IsAction,
IsAdventure = movie.IsAdventure,
IsAnimation = movie.IsAnimation,
IsChildrens = movie.IsChildrens,
IsComedy = movie.IsComedy,
IsCrime = movie.IsCrime,
IsDocumentary = movie.IsDocumentary,
IsDrama = movie.IsDrama,
IsFantasy = movie.IsFantasy,
IsFilm_Noir = movie.IsFilm_Noir,
IsHorror = movie.IsHorror,
IsMusical = movie.IsMusical,
IsMystery = movie.IsMystery,
IsRomance = movie.IsRomance,
IsSci_Fi = movie.IsSci_Fi,
IsThriller = movie.IsThriller,
IsWar = movie.IsWar,
IsWestern = movie.IsWestern
};
} );

Dictionary<string,Dictionary<string,int>> genderBias = new
Dictionary<string,Dictionary<string,int>>();
genderBias.Add("M",new Dictionary<string,int>());
genderBias.Add("F",new Dictionary<string,int>());
foreach (var mr in movieRatings)
{
string strRep = mr.ToString();
string key = strRep.Contains("Sex = M")?"M":"F";

var matches = Regex.Matches(strRep,"Is[A-Za-z_ ]+= True")
.Cast<Match>()
.Select (m => m.Value)
.Select (m => m.Substring(2,m.IndexOf('=')-2)
.Trim());
foreach (var m in matches)
if(!genderBias[key].ContainsKey(m))
genderBias[key].Add(m,1);
else
genderBias[key][m]++;

}

var pieData = genderBias.ToDictionary (b =>  b.Key,
b => b.Value
.Select
(
v =>
new
{
Key = v.Key ,
Liking = (double)v.Value/(double)(b.Value
.Select (va => va.Value).Sum ())
}
)
.OrderByDescending (v => v.Liking )
.ToDictionary (v => v.Key))
.Select (b => b.Key + "->" + b.Value.Select (v => "['" +
v.Key + "'," + 100 * Math.Round( v.Value.Liking,2) +"]")
.Aggregate ((m,n) => m + "," + n));

Console.WriteLine(pieData);

This generates the following output:

image

Using this data, you can draw a couple of pie charts showing the genre preferences of men and women. The Highcharts example directory has a pie chart folder. Copy the file you find there, and then make the following changes to the copied file:

Change This

To This

['Firefox',   45.0],

['IE',       26.8],

{

name: 'Chrome',

y: 12.8,

sliced: true,

selected: true

},

['Safari',    8.5],

['Opera',     6.2],

['Others',   0.7]

The first row M->['Fantasy',18]... etc. of the result obtained

'Browser market shares at a specific website, 2014'

'What kind of movies do men like?'

name: 'Browser share'

name: 'Movie Likings'

image

Figure 8-16. Showing movie genre preferences of men and women

Now, when you open the modified file in a browser, you will see a pie chart showing men’s movie preferences. The two pie charts in Figure 8-16 show the results for both men and women.

8-7. Identifying Flowers by Using Machine Learning

Classification is a common problem in machine learning. There are several supervised learning algorithms (http://en.wikipedia.org/wiki/Supervised_learning) for determining the class of a given element. One of the most commonly used supervised learning algorithms is k-Nearest Neighbors (k-NN). In this example, you will see how to use LINQ to implement k-NN to identify the species of an iris flower.

Image Note  See http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm for an explanation of the k–Nearest Neighbors algorithm.

Problem

The iris multivariate dataset (downloadable from http://en.wikipedia.org/wiki/Iris_flower_data_set) contains data about three varieties of iris flowers. The task is to identify the species of the iris flower from the given dataset by using the k-NN algorithm. You can download the dataset from http://aima.cs.berkeley.edu/data/iris.csv.

Solution

Enter the following as a new query in LINQPad:

//Nearest Neighbor
var trainingSet = File.ReadAllText(@"C:\iris.csv")
.Split(new char[]{'\r','\n'},StringSplitOptions.RemoveEmptyEntries)
.Select ( f => f.Split(','))
.Skip(1)
.Select (f =>
new
{
SepalLength = Convert.ToDouble( f[0]),
SepalWidth = Convert.ToDouble(f[1]),
PetalLength = Convert.ToDouble(f[2]),
PetalWidth = Convert.ToDouble(f[3]),
Name = f[4]
})
//RandomSubset is a method from MoreLINQ.
// So you have to reference that
//in LINQPad to use this method. Refer to Chapter 5.
.RandomSubset(100);

//Test data
double sepalLength = 5.5;
double sepalWidth  = 2.6;
double petalLength = 4;
double petalWidth = 1.2;

int k = 5;
//Euclidean distance function
Func<double,double,double,double,double,double,double,double,double> Distance =
(sl1,sl2,sw1,sw2,pl1,pl2,pw1,pw2)  => Math.Sqrt(Math.Pow(sl1-sl2,2)
+ Math.Pow(sw1-sw2,2)
+ Math.Pow(pl1-pl2,2)
+ Math.Pow(pw1-pw2,2));

//Figure out what flower it is.
trainingSet
.Select (s =>
new
{
Name = s.Name,
DistanceFromTestData =
Distance(sepalLength,s.SepalLength,sepalWidth,s.SepalWidth,
petalLength, s.PetalLength, petalWidth, s.PetalWidth)
})
.OrderBy (s => s.DistanceFromTestData )
//Take the first "k" elements
.Take(k)
//Create a lookup with the "Name"
.ToLookup (s => s.Name)
//Sort the elements as per the descending order of number of elements in that class
.OrderByDescending (s => s.Count())
//Pick the first one--with the highest count
.First ()
//Pick its class
.Key
.Dump("I think the flower is");

This generates the following output:

image

How It Works

At the heart of this query is the Euclidean distance function, represented by the delegate Distance.

The Distance method determines the distance between each point in the training set and the point in question—the target point. Then this projected result, along with the distance from the test point, is sorted by the distance, and the first k entries are picked. This projection is then used to create a lookup table with the Name property, which is the class or tag of that entry. The count of values for each tag/key in the lookup table is the number of the nearest neighbors whose tag is represented by the associated key. So the key with the highest number of values is probably the best guess for the class/tag of the given test point. I recommend you change the value of k to something big, such as 25, and dump the lookup table in LINQPad to see what’s happening visually.
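The sort/take/group/vote pipeline described above can be sketched compactly outside LINQ as well. Here is a Python analogue with a tiny, made-up training set (the real query draws a random subset of iris.csv):

```python
import math
from collections import Counter

# Toy rows: (sepal_length, sepal_width, petal_length, petal_width, name).
# These values are invented for illustration only.
training_set = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (5.7, 2.8, 4.1, 1.3, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.5, 2.4, 3.8, 1.1, "versicolor"),
]

test_point = (5.5, 2.6, 4.0, 1.2)
k = 3

def distance(a, b):
    # Euclidean distance over the four measurements.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Sort by distance to the test point, take the k nearest, and let the
# majority label win -- the same OrderBy/Take/ToLookup/OrderByDescending
# pipeline as the LINQ query.
nearest = sorted(training_set, key=lambda row: distance(row[:4], test_point))[:k]
guess = Counter(row[4] for row in nearest).most_common(1)[0][0]
```

Two of the three nearest neighbors here are versicolor, so that label wins the vote.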

Summary

Congratulations! You've covered a lot of ground in this chapter. The goal of this chapter was to show you how to perform several data-processing tasks by using LINQ in a functional way. Along the way, you should also have picked up some of the most frequently used LINQ idioms, including Skip() followed by Take(), Select() followed by ToLookup() and ToDictionary(), and so on. I urge you to experiment with these examples, pose other questions for yourself, and see how you can obtain the answers to those questions by using the techniques you have seen so far in this book.