Text Processing - Thinking in LINQ: Harnessing the power of functional programing in .NET applications (2014)

Thinking in LINQ: Harnessing the power of functional programing in .NET applications (2014)

Chapter 3. Text Processing

Text processing is a blanket term used to describe any kind of string processing. Checking whether a pair of words are anagrams of each other is one example of text processing. Generating suggestions for an autocomplete or assisted input process is another. Some types of text processing, such as spell-check and correction features, have become so commonplace that software users now expect them to be present in virtually every program they use.

LINQ changes the way developers deal with text because it lets you write intuitive code to solve complex text-processing tasks. In this chapter, you will have a chance to solve some fun—yet useful—text-processing challenges. The examples in this chapter vary widely in terms of difficulty. They are intended to serve as basic examples of how LINQ can help solve numerous text-processing problems.

The problems in this chapter fall into three broad but related categories:

·     Human-computer interactions that deal with various input strategies and spell-check

·     Text generation and manipulation, such as the anagram problem mentioned earlier

·     Information extraction, such as pulling content from a document

3-1. Simulating a T9 Word Suggestion

Typing on mobile phones wasn’t easy in the early days, so several schemes were invented to make it simpler. Autocompletion of words using a T9 dictionary was one such early solution. Figure 3-1 shows a typical basic mobile phone keypad.

image

Figure 3-1. The keypad of a mobile phone, with the letters on each key

A typical mobile phone has the entire alphabet on its keys. This type of keypad enables users to enter predictive text using a T9 dictionary.

Problem

Given a set of T9 key presses, select all possible matching words from a dictionary.

Solution

The code in Listing 3-1 implements an algorithm to pull all the words from T9 that match the keystrokes entered from the keypad. Words that share the same key combination are known as textonyms. For example, the key combination 4663 matches the words goodhomegone, and hood. Those words are textonyms of each other. So when a user types 4663 on a T9-enabled keypad of a mobile phone, the phone offers all the textonyms as suggestions to pick from. Sometimes these suggestions can be quite amusing. For example, select and reject are textonyms, because they both use the key combination 735328.

Listing 3-1. Find words that match T9 keypresses

string keyPad =    @"2 = ABC2
3 = DEF3
4 = GHI4
5 = JKL5
6 = MNO6
7 = PQRS7
8 = TUV8
9 = WXYZ9";

Dictionary<char,char> keyMap = new Dictionary<char,char>();

//4663 can lead to "good","home" etc
string key = "4663";

//"735328";//select/reject
//"select" and "reject" can be typed using the key combination "735328"

List<KeyValuePair<string,string>> keyAndLetters =
keyPad.ToLower()
.Split(new char[]{'\r','\n'},StringSplitOptions.RemoveEmptyEntries)
.Select
(
p =>
new KeyValuePair<string,string>(p.Split('=')[0].Trim(),p.Split('=')[1].Trim()))
.ToList();

foreach (var keyL in keyAndLetters)
{
foreach (char c in keyL.Value.ToCharArray())
keyMap.Add(c,Convert.ToChar(keyL.Key));
}

StreamReader sr = new StreamReader ("C:\\T9.txt");
string total = sr.ReadToEnd();
sr.Close();

var query = total
.Split(new char[]{'\r','\n',' '},StringSplitOptions.RemoveEmptyEntries)
.Where (t => t.Length==key.Length)
.Select (t => t.Trim());

query
.ToList()
.Select(w=> new KeyValuePair<string,string>(w,
new string(w.ToCharArray().Select (x => keyMap[x]).ToArray())))
.Where (w => w.Value==key)
.Select (w => w.Key)
.Dump("Word suggestions");

I have used this code in a Windows application that simulates T9 typing. You can see it in action on YouTube (www.youtube.com/watch?v=Su4-_v2qGvQ&feature=youtu.be). Of course, the program is also in the downloadable code that accompanies this book.

When you run the program, you will see a list of suggested words that correspond to the key combination 4663. On a T9-enabled mobile keypad, entering 4663 matches several words, including gonegoodgoof, and home, as you can see in Figure 3-2.

image

Figure 3-2. Suggestions for the keystrokes 4663 on a T9-enabled keyboard

How It Works

The task is to locate all the possible words from a given dictionary that match a given a key combination. If you approach this problem from another direction, you can think of it as replacing all the characters in a given word with the key on which that letter is available. For example, becauseab, and c are all available on the 2 key, you can replace all occurrences of these three letters with 2. Doing so re-creates the numeric combination used to enter the word.

For example, suppose you enter the word home. To determine the set of numeric keys required to enter home, replace each letter with its corresponding numeric digit from the keypad. Referring to the keypad shown in Figure 3-1, you can see that h is on key 4, o and m are both on key 6, and e is on key 3. Thus the numeric key for the word home is 4663.

Figure 3-3 represents the entire keypad. Each line shows the numeric key and the corresponding letters that can be entered by pressing that key. The Key column represents a key on the mobile phone’s keypad. The Value column represents all the characters a user can enter with that key. So the first line shows, for example, that by pressing the 2 key, users can enter an abc, or of course the number 2.

image

Figure 3-3. The keys and the corresponding letters that can be typed by using them

If you reverse this so that each character is a key, and each mobile phone key number is a value, you end up with a one-to-one mapping, in which each character is matched with its corresponding digit, as shown (partially) in Figure 3-4.

image

Figure 3-4. A one-to-one mapping of each character and its corresponding key on the mobile phone keypad

The following code creates a list of KeyValuePairs, where the keys are words from the dictionary and the values are the corresponding numeric keys:

query.ToList()
.Select(w=> new KeyValuePair<string,string>(w,
new string(w.ToCharArray().Select (x => keyMap[x]).ToArray())))

At the next stage, this list of key/value pairs is filtered by the following Where clause:

Where (w => w.Value==key)

This returns the list of those key/value pairs whose value is 4663. So keys of these filtered values are textonyms of each other. Projecting the keys of the filtered values by using the final Select() call Select (w => w.Key) leaves us with the textonyms.

3-2. Simulating a Gesture Keyboard  

Touch-enabled mobile devices introduced a new era of human-computer interaction. This posed several challenges for interaction designers. Because mobile devices come in all shapes and sizes (also known as form factors), it proved truly difficult to design a touch keyboard that worked well on all devices (an approach known as form factor–agnostic design). Simply porting traditional keyboard layout and interaction logic from desktops and laptops proved to be a poor solution for touch keypads on mobile devices.

Using an onscreen keyboard on small mobile devices had long been a pain point for users. The problem is largely one of size and accuracy. Users too often tap a different character than the one they intend, which can be annoying. To solve this frustration, interaction designers came up with the idea of gesture typing. This technology lets the user input (note that I didn’t write type) a word by sliding their fingers over the letters on the keypad, essentially drawing a line from one letter to the next. The Swype app, for example, uses this technology.

Gesture keyboards use multiple algorithms, including machine learning, to predict words that users might have intended. However, at the core of these algorithms is a simple string-processing algorithm called finding the longest common subsequence. A subsequence of a string is another string, where the letters of the latter occur in the first at monotonically increasing indices. In a monotonically increasing sequence, each element is followed by another greater than itself. For example, the sequence {1,2,4,6} is a monotonically increasing sequence because 1 is less than 2, 2 is less than 4, and 4 is less than 6. However, the sequence {1,2,0,5} isn’t monotonically increasing because 0 is less than 2.

Here are couple of amusing examples. The word wine is a subsequence of the phrase world is not enough. As another example, rental is a subsequence of ornamental. In the first example, the letters of the word wine (win, and e) occur at the indices {0,6,9,13}. These indices are monotonically increasing. Thus wine is a subsequence of the phrase world is not enough. I will leave it up to you to prove that the word rental is also a monotonically increasing subsequence of the word ornamental.

Suppose a user wants to type rental but the path she traces with her finger touches the letters rewsantdal. Several English words are subsequences of this character sequence—to list a few: ratwantrantsand, and rental. As you can see, rental is the longest common subsequence between the word rental and the character sequence rewsantdal. Thus, when the longest common subsequence of the word and the traced-character sequence is the word itself, that’s probably a correct entry. When it isn’t, however, the longest of such matches should be first in the list of suggested words, because that’s most likely what the user intended to input. Now let’s see how this can prove helpful in predicting words.

Problem

Given a gesture-typed string generated by a user, generate a list of suggested words by using LINQ to select the best matches from a dictionary.

Solution

Find the longest common monotonically increasing subsequences that are words in the dictionary, and list them for the user.

The code in this solution uses simulated gesture typing. Assume the user wants to type understands but touches the following characters along the way: ujnbvcderesdftrewazxcvbnhgfds. You will see that understands is the longest common subsequence of this character sequence that is also an English word.

This example in Listing 3-2 uses the LongestCommonSubsequence() method from the .NET open source string-processing API StringDefs (see www.codeplex.com/stringdefs). I started that project to get better string-processing capabilities in .NET.

Listing 3-2. Find closest-match words from a dictionary

void Main()
{
// Define other methods and classes here
StreamReader sr = new StreamReader ("C:\\T9.txt");
string total = sr.ReadToEnd();
sr.Close();

List<string> suggestions  = new List<string>();
var query = total.Split(new char[]{'\r','\n',' '},StringSplitOptions.RemoveEmptyEntries)
.Select (t => t.Trim());

//should show understand. See the bold characters.
string path = "ujnbvcderesdftrewazxcvbnhgfds";

query.Where(word => LongestCommonSubsequence(path,word).Equals(word))
.OrderByDescending (word => word.Length)
.ThenByDescending (word => word )
.Take(4)//Show first 4 suggestions.
.Dump("Suggestions");
}

The simulation generates the suggestions shown in Figure 3-5. These are all the words suggested/predicted by the gesture keyboard algorithm that the user might have intended to type.

image

Figure 3-5. All the suggestions derived from the gesture keyboard input

How It Works

Those words for which the longest common subsequence of it and the character sequence is the word itself are possible candidates for word suggestions/predictions. However, the longer the word, the higher the probability of it being the intended word. Thus the matching words are sorted by length, in descending order. Finally, the items are sorted in reverse alphabetical order. This is required because you want to preserve the prefix. Because the user started with a u, it is reasonable to assume that the intended word starts with a u.

3-3. Cloning Peter Norvig’s Spelling-Correction Algorithm  

Peter Norvig wrote a great spelling-correction program in about 20 lines of Python code. You can find the code and algorithm discussed at great length at http://norvig.com/spell-correct.html. At the heart of the Python code lies a concept called list comprehension (seehttp://en.wikipedia.org/wiki/List_comprehension).

LINQ is perfect for moving list comprehension from theory to practice in C#.

Problem

Use LINQ to clone Peter Norvig’s Python code in C#.

Solution

The code in Listing 3-3 shows the solution.

Listing 3-3. Implementing spelling correction

Dictionary<string,int> NWords = new Dictionary<string,int>();
public IEnumerable<string> Edits1(string word)
{
char[] alphabet = {'a','b','c','d','e','f','g','h','j','k','l','m','n','o',
'p','q','r','s','t','u','v','w','x','y','z'};
var splits = Enumerable.Range(1,word.Length)
.Select(i =>
new {First = word.Substring(0,i),
Second = word.Substring(i+1)});

var deletes = splits.Where (split  => !string.IsNullOrEmpty(split.Second))
.Select (split => split.First + split.Second.Substring(1));

var transposes = splits
.Where  (split => split.Second.Length>1)
.Select (split => split.First + split.Second[1] + split.Second[0]
+ split.Second.Substring(2));

var replaces = splits
.Where (split => !string.IsNullOrEmpty(split.Second))
.SelectMany(split => alphabet
.Select (c => split.First + c + split.Second.Substring(1)));

var inserts = splits
.Where (split     => !string.IsNullOrEmpty(split.Second))
.SelectMany(split => alphabet
.Select (c => split.First + c + split.Second));
return deletes
.Union(transposes)
.Union(replaces)
.Union(inserts);
}

public Dictionary<string,int> Train(IEnumerable<string> features)
{
Dictionary<string,int> model = new Dictionary<string,int>();
Features
.ToList()
.ForEach(f => {if (model.ContainsKey(f)) model[f] += 1; else model.Add(f,1);});
return model;
}
public IEnumerable<string> KnownEdits2(string word)
{
List<string> knownEdits2 = new List<string>();
return Edits1(word)
.SelectMany(e1 => Edits1(e1)
.Where (x => NWords.ContainsKey(x)));
}
public IEnumerable<string> Known(IEnumerable<string> words)
{
return words.Intersect(NWords.Select (v => v.Key));
}
public IEnumerable<string> Correct(string word)
{
List<string> candidates = new List<string>();
candidates.AddRange(Known(new List<string>(){word}));
candidates.AddRange(Known(Edits1(word)));
candidates.AddRange(Known(Edits1(word)));
candidates.AddRange(KnownEdits2(word));
candidates.Add(word);
return candidates
.Where (c => NWords.ContainsKey(c)).OrderByDescending (c => NWords[c]);
}
void Main()
{
StreamReader sr = new StreamReader ("big.txt");
string total = sr.ReadToEnd();
sr.Close();
NWords = Train(Regex.Matches(total,"[a-z]+")
.Cast<Match>()
.Select (m => m.Value.ToLower()));
string word = "mysgtry"; //should return "mystery"
Correct(word)
.Distinct()
.OrderByDescending (x => x.Length)
.Dump(“Did you mean”);
}

This produces the output shown in Figure 3-6.

image

Figure 3-6. Results of the spelling-correction algorithm

How It Works

At the heart of the Python implementation is the edits1() method, as shown in this Python code snippet:

def edits1(word):
splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes    = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts    = [a + c + b     for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)

The list splits contains list of key/value pairs, in which the keys are substrings of the words through the specified index i and the values are substrings of the words from the index i + 1.

Here is the clone of splits:

var splits = Enumerable.Range(1,word.Length).Select(i => new {First = word.Substring(0,i),
Second = word.Substring(i+1)});

In contrast, the deletes collection contains the keys of splits and the substring of the value from the second index onward should the value field exist. Note that every other variable in the edits1() method is declared in terms of splits(). LINQ achieves list comprehension by using filtering—a call to Where() followed by the required projection, followed by calls to Select() or SelectMany().

I have used the same variable names and method names as those used in the original Python code. You can see how the original Python code and my clone compares at http://consulttoday.com/PeterNorvigsSpellingCorrection.html.

3-4. Reversing a Sentence Word by Word

Reversing the words in a sentence may seem simple. However, the classical solution, which uses loops, requires using intermediate storage and looping constructs.

Problem

Write LINQ code to reverse the words in a given sentence.

Solution

Listing 3-4 shows the solution.

Listing 3-4. Reversing a sentence

//Reversing a sentence word-by-word
string line = "nothing know I";

line.Split(' ').Aggregate ((a,b) => b + " " + a).Dump();

This program prints the following:

"I know nothing"

How It Works

Line.Split(' ') returns a list of the words in the sentence nothing know I. The Aggregate() function then takes one pair at a time and writes them in reverse order. At the end, the string is completely reversed. You might also find Reverse() used to reverse a word collection and then Aggregate() used to stitch those reversed words together to get the reversed sentence. However, you see that’s not needed, because you can do the same with Aggregate() just by swapping the arguments in the body of the lambda function.

3-5. Creating a Word Triangle

word triangle is one of those puzzles that nearly everyone encountered during their formative years. This simple problem sets the context for more-sophisticated problems later.

Problem

Write a program using LINQ to generate a word triangle from a given word.

Solution

The LINQ statement in Listing 3-5 prints a word triangle for the word umbrella.

Listing 3-5. Create a word triangle

//Word Triangle
string word = "umbrella";
Enumerable
.Range(1, word.Length)
.Select (k => new string(word.ToCharArray().Take(k).ToArray()))
.Concat
(
Enumerable.Range(1, word.Length)
.Select(k => new string(word.ToCharArray()
.Take(word.Length - k)
.ToArray()))
)
.Aggregate ((m,n) => m + Environment.NewLine + n)
.Dump("Word Triangle");

This generates the output shown in Figure 3-7. A triangle is formed with the letters of the given word. The program starts with the first letter. Then it takes the first two, then the first three, and so on, until it prints the whole word. Then the program repeats, but taking letters from the end, instead, until only one letter remains.

image

Figure 3-7. Output for the word triangle

How It Works

The line Enumerable.Range(1, word.Length) creates a range from 1 to the length of the given word. The resultant values are projected to obtain a sequence of characters that increases in length by one each time, starting from the left side of the given word. Here’s the projection code:

Select (k => new string(word.ToCharArray().Take(k).ToArray()))

As the value of k changes from 1 to word.Length, this leaves you with a substring of the given word, of length k starting from the zero index position, at each stage. So when k is 1, you get u. When k is 2, you get um, and so on. When k finally hits the word.Length value, you get the entire word, umbrella. So that portion of the code provides the upper part of the triangle.

For the bottom part of the word triangle, you just need to reverse this process. You take every character from the left, except the last one in the first call, except the last two in the next call, and so on. You concatenate these two sequences by using Concat().

Finally, you can use the Aggregate() operator to place the incremental and decremented substrings of the given words so that each one appears right after the previous one in a new line.

3-6. Finding Anagrams

Anagrams are fascinating. Anagrams, which are words created by transposing the letters of another word, are also useful to point out and correct obvious spelling mistakes. For example, if you type hte in Microsoft Word, it will be autocorrected to the, because hte is not a word in the dictionary, and the is an anagram that is frequently used in English.

Problem

Write a query using LINQ that takes two phrases and returns true if they are anagrams of each other; otherwise, it returns false.

Solution

The query in Listing 3-6 checks whether two phrases are anagrams of each other. It considers only the alphabetic characters—not any punctuation. The easiest way to tell whether two words are anagrams of each other is to sort the characters and then compare the resultant sequences for equality. For example, the words oriental and relation are anagrams. When sorted alphabetically, the characters in both words are aeilnort.

Listing 3-6. Determining whether two phrases are anagrams of each other

string phrase1 = "the eyes";
string phrase2 = "they see";

phrase1.ToCharArray().Where (p => Char.IsLetterOrDigit(p))
.OrderBy (p => p)
.SequenceEqual(phrase2.ToCharArray().Where (p => Char.IsLetterOrDigit(p))
.OrderBy (p => p))
.Dump();

This query outputs true, because the phrases are anagrams of each other.

How It Works

IsLetterOrDigit()filters anything except a letter or a digit.

The code phrase1.ToCharArray().Where (p => Char.IsLetterOrDigit(p)) returns a list of characters only. Because in this case all the characters match the condition, you end up with IEnumerable<char> {'t','h','e','e','y','e','s'}. TheOrderBy() clause sorts this character sequence, generating eeehsty. The same call for phrase2 generates the same character sequence. Using the SequenceEqual() operator proves these two sequences are equal.

SequenceEqual() cares about the order of the elements in the source collections. That’s why you have to sort the sequences first, using the OrderBy() clauses.

3-7. Checking for Anagrams Without Sorting Characters

The solution in the previous section, which sorts the characters and then checks for anagrams, works fine. However, it’s resource-intensive when the input phrases grow big. In this section, you’ll try another approach.

Problem

Write a method to determine whether two strings are anagrams of each other, without sorting the characters of the phrases.

Solution

In such situations, you can compare the character histograms of both phrases instead. A character histogram is a character frequency table that stores the number of times each character occurs in the phrase. For example, in the phrase the eyes, the character frequency for the letter e is 3, because e occurs three times. Listing 3-7 creates character histograms and then compares them to determine whether the phrases are anagrams of each other instead of sorting. For big strings, this method yields faster results.

Listing 3-7. Checking for anagrams without sorting

string phrase1 = "the eyes";
string phrase2 = "they see";

var characterHistogram1 = phrase1
.ToCharArray()
.Where (p => Char.IsLetterOrDigit(p))
.ToLookup (p => p)
.ToDictionary (p => p.Key, p=>p.Count ());
var characterHistogram2 = phrase2
.ToCharArray()
.Where (p => Char.IsLetterOrDigit(p))
.ToLookup (p => p)
.ToDictionary (p => p.Key, p=>p.Count ());

bool isAnagram = characterHistogram1.All(d =>
characterHistogram2[d.Key] == characterHistogram1 [d.Key]);

How It Works

As you can see, the filtering part is same as the solution in the previous section. The code ignores anything except a letter or digit. Later, this is projected as a lookup table by calling the ToLookup() operator. Figure 3-8 shows the lookup table for the first phrase, the eyes.

image

Figure 3-8. Lookup table for the phrase the eyes

At the next stage, this lookup table is converted to a dictionary by using the ToDictionary() operator. p.Count() returns the number of elements in each group. Thus the process generates the character histogram shown in Figure 3-9.

image

Figure 3-9. The character histogram for the phrase the eyes

In Figure 3-9, the Key column represents the characters of the phrase, while the Value column represents the frequency of that character in the given phrase.

After obtaining a character histogram for each phrase, you can test their equality to see whether the phrases are anagrammatic pairs.

The operator All() returns true when all the values in the calling collection match the given predicate. If the two phrases are anagrams, their values will match for every key. The predicate characterHistogram2[d.Key] == dic1[d.Key] validates that.

3-8. Creating a Rudimentary Programming Language Identifier and Automatic Syntax Highlighter

Syntax highlighting is important because it improves the readability of code. Using LINQ and SyntaxHighlighter, you can create an automatic syntax highlighter that can parse code in any language and highlight the code. SyntaxHighlighter is an open source JavaScript API for providing lightweight syntax highlighting support. Most blogging engines already support it.

Problem

From raw text input, identify the programming language and then apply a proper syntax highlighter brush for that language.

Solution

Listing 3-8 reads raw code in any language from the file SampleCode.txt and generates a syntax-highlighted version. LINQ is used to identify the language based on keywords.

Listing 3-8. Adding syntax highlighting to code

Dictionary<string, List<string>> langKeywords = new Dictionary<string, List<string>>();

string directory = @"C:\syntaxhighlighter_3.0.83";
string[] files = Directory.GetFiles(directory,"*.js");

foreach (string file in files)
{
try
{
string key = new FileInfo(file).Name.Replace("shBrush", string.Empty)
.Replace(".js", string.Empty);
langKeywords.Add(key, new List<string>());
langKeywords[key] = Regex.Matches(File.ReadAllText(file)
.Split(';')
.FirstOrDefault(m => m.Contains("var keywords"))
.Split('=')[1], "[a-z]+")
.Cast<Match>()
.Select(m => m.Value).ToList();
}
catch
{
continue;
}
}

string sampleCode = File.ReadAllText("C:\\SampleCode.txt");

Dictionary<string,int> confiMap = new Dictionary<string,int>();
foreach (var lang in langKeywords.Keys)
{
foreach(var kw in langKeywords[lang])
{
if(sampleCode.Contains(kw))
{
if(!confiMap.ContainsKey(lang))
confiMap.Add(lang,1);
else
confiMap[lang]++;
}
}
}
Dictionary<string,string> brushAliases = new Dictionary<string,string>();
brushAliases.Add("CSharp","csharp");
brushAliases.Add("Python","python");
brushAliases.Add("Ruby","ruby");
brushAliases.Add("Perl","perl");
brushAliases.Add("CPP","cpp");

StreamReader sr = new StreamReader ("C:\\rudiSynTemplate.html");
string total = sr.ReadToEnd();
sr.Close();
total = total.Replace("{brushAlias}",brushAliases[confiMap.OrderByDescending (m => m.Value )

.First ().Key]);
total = total.Replace("<code>",sampleCode);

StreamWriter sw = new StreamWriter (@"C:\syntaxhighlighter_3.0.83\synh.html");
sw.WriteLine(total);
sw.Close();

System.Diagnostics.Process.Start(@"C:\syntaxhighlighter_3.0.83\synh.html");

How It Works

I tested this code with the edits1() method from Peter Norvig’s spell-checker. The automatic syntax highligher correctly identified that the code is written in Python and applied SyntaxHighlighter’s Python brush. Figure 3-10 shows the result as rendered on the browser.

image

Figure 3-10. Result of applying automatic syntax highlighting on Python code

To execute this, you need to download SyntaxHighlighter. SyntaxHighlighter has a highlighter brush for several programming languages, and each brush has a keyword section that stores all the keywords in the language. For example, here is the list of Ruby keywords from the Ruby brush:

var keywords = 'alias and BEGIN begin break case class def define_method defined ' +
'do each else elsif END end ensure false for if in module new next nil not or raise ' +
'redo rescue retry return self super then throw true undef unless until when while yield';

The program keeps track of all the keywords for all the languages in the one-to-many dictionary langKeywords.Figure 3-11 shows a collapsed view of the dictionary for a few languages.

image

Figure 3-11. Dictionary holding language names and their keywords

Next, the program finds these keywords that have to be highlighted in the sample code. Based on the number of keywords found in the sample code, the program guesses the programing language. The frequency of matches are maintained in a dictionary called confiMap. For the preceding Python example, Figure 3-12 shows the content of confiMap.

image

Figure 3-12. Confidence score for the code to apply syntax highlighting

You see that seven Python keywords were found in the sample code, which is the highest number. Thus the program guesses that the sample code is written in Python.

3-9. Creating a Word-Ladder Solver

Word ladder, or doublets, (http://en.wikipedia.org/wiki/Word_ladder) is a game in which players change one letter at a time while trying to find a path between a given pair of start and end words of the same length. One interesting example is the path from myth to fact, as shown here:

myth image math image mate image fate image facimage fact

Notice that at each step, only one letter is changed to get to the next word.

Problem

Write a program to solve a word-ladder game. Given two words, print the path between these two words if one exists.

Solution

Listing 3-9 provides a program that prints the word-ladder path between two given words if it exists; otherwise, it prints no path. To run this, you have to change the language combo value to C# Program in LINQPad.

Listing 3-9. A LINQ-based word-ladder solver

/// <summary>
/// Calculates the Hamming Distance between two strings
/// </summary>
// <param name="first">The first string</param>
/// <param name="second">The second string</param>
/// <returns></returns>
public static int HammingDistance(string first, string second)
{
return first.ToCharArray().Where((f,i) => second[i]!=f).Count();
}
void Main()
{
<string> transitions = new List<string>();

List<string> allWords = new List<string>();
StreamReader t9Reader = new StreamReader(@"C:\T9.txt");
string total = t9Reader.ReadToEnd();
t9Reader.Close();
//Start and End words
string start = "myth";
string end = "fact";

string startCopy = start;

transitions.Add(start);

allWords.AddRange(total.Split(new char[] { ' ', '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries));

allWords = allWords    .Where(word => word.Length == start.Length)
.ToList();

allWords.Add(end);

Dictionary<string, List<string>> wordEditDistanceMap =

allWords.ToLookup (w => w)
.ToDictionary
(
//key selector
w => w.Key,
//value selector
w => allWords.Where(a =>
HammingDistance(a,w.Key)==1).ToList()
);

//At this point we have the dictionary separated by edit distance 1
bool noPath = false;

List<string> currentList = new List<string>();
do
{

string[] currents = wordEditDistanceMap[start]
.Where(word => HammingDistance(word, end) ==
wordEditDistanceMap[start].Min(c => HammingDistance(end, c))).ToArray();
do
{
foreach (string c in currents)
{
if (!currentList.Contains(c))
{
currentList.Add(c);
break;
}
if ((currents.Length == 1 && currentList.Contains(c)))
{
Console.WriteLine("There is no such path !");
noPath = true;
break;
}
}

if (noPath)
break;
} while (currentList.Count == 0);
if (noPath)
break;
transitions.Add(currentList[currentList.Count - 1]);
if(transitions.Count >=2 && transitions[transitions.Count -2]==transitions.Last ( ))
{
Console.WriteLine("There is no such path");
noPath=true;
break;
}

start = currentList[currentList.Count - 1];
} while (!start.Equals(end) || noPath==true );
(!noPath)
transitions.Dump("Transition");// from \"" + startCopy + "\" to \"" + end +"\"");
}

How It Works

In this solution, you find the next word through a neighborhood forest of words that are one edit distance away at each step. The edit distance of two equal-length strings is defined as the number of characters that differ (this is also called the hamming distance). For example, the edit distance between myth and math is 1. Although the entire solution doesn’t use LINQ; however, key elements are expressed using LINQ, illustrating a couple of idiomatic LINQ usages that you will find useful elsewhere too.

The first LINQ usage is the edit distance calculation. This uses the indexed version of the Where() operator.

Consider the following code:

first.ToCharArray().Where((f,i) => second[i]!=f).Count();

This checks the number of occasions that the ith element of the second string doesn’t match its corresponding character from the first string. The indexed version of Where() is handy for removing a nested looping situation like this one. Or, in other words, in the absence of an indexed version, you would have to code it like this:

int count = 0;
foreach(var f in first)
for(int i = 0; i<second.Length;i++)
if(f != second[i])
count++;

The second idiom is ToLookup() followed by ToDictionary():

In this example, this idiom is used to create a dictionary in which the keys are all the words in the T9 dictionary and the values are all other words that are one edit distance away from the T9 words. Figure 3-13 shows an entry of this dictionary.

image

Figure 3-13. Words that are one edit distance  away are stored in the dictionary

Notice that the words facefast, and pact are all one edit distance away from the word fact. So these words end up in the value list of the word fact.

The next challenge in solving this puzzle is to locate the best candidate, the one that is closest to the target word. So if we start with myth and are going toward fact, the next word we should pick is math and not moth, because moth is further from fact than math. The hamming distance between moth and fact is 4. The hamming distance between math and fact is 3. Thus, in this case, math is a better choice than moth in our journey from myth to fact. To determine this next candidate, the following code is used:

string[] currents = wordEditDistanceMap[start]
.Where(word => HammingDistance(word, end) ==
wordEditDistanceMap[start].Min(c => HammingDistance(end, c))).ToArray();

This code returns all words that are one edit distance away from the current starting word, and for which the edit distance is the minimum from the target word.

At each step, these identified next-best-candidate words get added to the transition path by the following code:

transitions.Add(currentList[currentList.Count - 1]);

The remaining code is used to avoid infinite looping. So whenever we come back to any word we’ve already visited and that is not our target word, then there can’t be any path.

3-10. Formatting on the Fly

To format free-form text, you have likely been writing specific functions. For example, assume you have a list of social security numbers from all users, but those numbers are not formatted well. Another example might be a list of unformatted phone numbers. Although these examples are similar, you would typically have to write separate functions to apply the right formatting to them. By using LINQ, you can generate these random formatting functions on the fly. Using LINQ in this manner is much like teaching a human how to format a given type of string and then asking that person to format a bunch of those strings. If you have used the new Flash Fill feature of Microsoft Excel, this solution will look similar. However Flash Fill is a much more complex feature that uses machine learning.

Problem

Create a function that returns a transformer given an example transformation. Later, this generated transformer function can be applied to other values.

Solution

Listing 3-10 shows how to create a function that can transform given text to another format.

Listing 3-10. On-the-fly formatting

public Func<string,string> FormatLikeThis(string transformation)
{
string[] tokens = transformation.Split(new string[]{"=>"},StringSplitOptions.RemoveEmptyEntries);
string start = tokens[0];
string end = tokens[1];
Dictionary<int,char> insertCharMap = new Dictionary<int,char>();
Enumerable.Range(0,end.Length).Where(k => !start.Contains(end[k]))
.ToList()
.ForEach(k => insertCharMap.Add(k,end[k]));

Func<string,string> transformer = x =>
{
insertCharMap.ToList().ForEach(z => x = x.Insert(z.Key,z.Value.ToString()));
return x;
};
return transformer;
}

void Main()
{
string[] someVals = {"234567890","345678901","456789012"};
List<string> modifiedVals = new List<string>();
var transformer = FormatLikeThis("123456789=>123-456-789");
someVals.ToList().ForEach(k => modifiedVals.Add(transformer.Invoke(k)));
someVals.Dump("Before");
modifiedVals.Dump("After");
}

This generates the output shown in Figure 3-14.

image

Figure 3-14. Values before and after formatting

How It Works

This code works by inserting characters that aren’t present in the current string. At the first stage, the given transformation is broken into two pieces. The first part is the starting text, and the last part is the transformed text. The transformed text may have characters that aren’t part of the starting string. Consider the following code:

Enumerable.Range(0,end.Length).Where(k => !start.Contains(end[k]))
.ToList()
.ForEach(k => insertCharMap.Add(k,end[k]));

This code stores the locations and new characters that have been introduced. For the example, the transformations 123456789=>123-456-789 and the content of insertCharMap are shown in Figure 3-15.

image

Figure 3-15. Characters to insert and indices

After this character-insert mapping is created, this is used to generate a function by implementing the definition of this function dynamically.

Consider the following code:

insertCharMap.ToList().ForEach(z => x = x.Insert(z.Key,z.Value.ToString()));

This code inserts characters in the starting text to make it the target text. Another way to think about this is that the strategy to transform one text to another remains the same; only the implementations change during runtime. I encourage you to experiment with other formatting problems.

3-11. Solving Eric Lippert’s Comma-Quibbling Problem

Eric Lippert posed a question on his blog that received more attention than he originally expected. The question asks readers to insert a comma between all words in a given collection, but to insert the word and (instead of a comma) before the last word, and then wrap the output in curly braces.

Here are four scenarios from Eric’s blog:

1.    If the sequence is empty, the resulting string is {}.

2.    If the sequence is a single item ABC, the resulting string is {ABC}.

3.    If the sequence is the two-item sequence ABC, DEF, the resulting string is {ABC and DEF}.

4.    If the sequence has more than two items—for example, ABC, DEF, G, H—the resulting string is {ABC, DEF, G and H}. (Note that the G is not followed by a comma.)

Problem

Write a method to return comma-delimited text by using the preceding rules.

Solution

The program in Listing 3-11 solves this comma-quibbling problem.

Listing 3-11. One comma-quibbling problem solution

string[] input = {"ABC", "DEF", "G", "H"};
string result =
"{" //starting/opening brace
+ input.Take(input.Length - 1).Aggregate((f,s) => f + ", " + s)
+ " and "
+  input.Last()
+ "}";//closing brace
result.Dump("Eric's Comma Quibbling");

This query generates the output shown in Figure 3-16.

image

Figure 3-16. Sample output from Eric Lippert’s comma-quibbling challenge

How It Works

The challenge is to place a comma between every pair except the last word/entry in the input sequence. input.Length gives the length of the input sequence. In this case, that’s 4. So input.Take(input.Length – 1) returns an IEnumerable<string> that contains the first three elements: ABC, DEF, and G in this case.

The call to Aggregate in Aggregate((f,s) => f + ", " + s) takes a pair of elements, starting from the leftmost, one at a time, and joins these with a comma followed by a space. This returns the string ABC, DEF, G. To make the text read well, you need to insert an andbefore the last word in the series. The code does just that: it adds the and and then calls Last() to get the last element of the input sequence (H, in this case), and appends that to the end of the result string.

Finally, the code places curly braces at the start and end of the result.

This solution will work only when the list contains two or more values. To completely solve the problem, you would also need to handle an empty sequence or one containing only a single value as described in the previous scenarios.

3-12. Generating Random Serials

Random serial generation is a common problem. Two examples are random password generation and serial key generation to be used as license keys.

Problem

Write a program to generate random serials.

Solution

Listing 3-12 shows a way to generate random serials of length 8. The serials will consist of letters (lowercase and uppercase) and numbers.

Listing 3-12. Generating random serials

//Serial generation
for(int i=0;i<5;i++)
{
Enumerable
.Range(65,26)
.Select (e => ((char)e).ToString())
.Concat(Enumerable.Range(97,26).Select (e => ((char)e).ToString()))
.Concat(Enumerable.Range(0,10).Select (e => e.ToString()))
.OrderBy (e => Guid.NewGuid())
.Take(8)
.ToList().ForEach (e => Console.Write(e));
//Give a line break between two random serials
Console.WriteLine();
}

This generates five random serials, as shown in Figure 3-17. Each random serial consists of uppercase and lowercase characters and numbers, and has a length of 8.

image

Figure 3-17. Five generated random serials

How It Works

At the heart of this algorithm is the ability to randomly and easily sort a collection by using OrderBy(). In the following code, OrderBy (e => Guid.NewGuid()) sorts the collection in random order, because for each element, a new GUID is generated. The ASCII code for a is 97, and the code for A is 65. Take a look at the following two lines:

Enumerable.Range(65,26)
.Select (e => ((char)e).ToString())

This code generates the sequence ABCDEFGHIJKLMNOPQRSTUVWZYZabcdefghijklmnopqrstuvwxyz.

Next, we have the following line:

.Concat(Enumerable.Range(97,26).Select(e => ((char)e).ToString()))

This line appends the digits 0 to 9, so the sequence becomes ABCDEFGHIJKLMNOPQRSTUVWZYZabcdefghijklmnopqrstuvwxyz0123456789.

Then the final call to OrderBy() randomly sorts the characters in this sequence, creating a different sequence every time. Finally, it picks the first eight characters—and there you have it: a random serial.

3-13. Generating All Substrings of a Given String

Generating all substrings of a given string is a problem you are likely to encounter in many different situations. For example, in some plagiarism-detection algorithms, substrings from several sources are extracted and then the number of tokens are checked for a match. The total percentage of tokens that are same in both sources give an impression of how much those two sources match. This is sometime referred as a similarity measure or a proximity score.

Problem

Write a program to generate all the substrings of a given string.

Solution

The program in Listing 3-13 prints all the substrings of the given string.

Listing 3-13. Generating all substrings of a given string

public static List<string> NGrams(string sentence, int q)
{
int total = sentence.Length - q;
List<string> tokens = new List<string>();
for (int i = 0; i <= total; i++)
tokens.Add(sentence.Substring(i, q));
return tokens;
}
void Main()
{
string name  = "LINQ";
Enumerable.Range(0,name.Length+1)
.SelectMany(z => NGrams(name,z))
.Distinct()
.Where (b => b.Length!=0)
.Dump("All substrings of 'LINQ'");
}

This generates the output shown in Figure 3-18: all substrings of the string LINQ.

image

Figure 3-18. All substrings of the string LINQ

How It Works

At the heart of this implementation is the NGrams() method. This method returns all the tokens of size q. So if q is 2, and you call NGrams() with the string LINQ, it will generate three substrings: LI, IN, and NQ. Some computing texts refer to this process as a sliding window.

So if you call the NGrams() method with all possible sizes (in this case, that’s 1, 2, 3, and 4) because the string LINQ has a length of 4), you will end up with all the possible substrings of LINQ.

Enumerable.Range(0,name.Length+1) calculates the length range for which you want to generate N-Grams. Because NGrams() returns a collection of items, you get a one-to-many relationship from one integer length to a set of substrings. This relationship is aptly represented by the projection using SelectMany():

SelectMany(z => NGrams(name,z))

Because there might be duplicates, it is good to clean the results with a call to Distinct().

3-14. Creating a Scrabble Cheater

Scrabble is a game to showcase word power. The more words you can make from a set of given letters, the more you score.

Problem

Write a program that can generate all other possible words that can be formed by using all the letters or a subset of those letters of the given word.

Solution

The following program generates a list of all words that can be made from a subset of the letters of the given word. For example, if the given word is what, players can form the words hatthawa, and at }.

Listing 3-14 finds all words that can be formed using the letters of the given word or a subset of those letters. It uses a T9 dictionary to locate words.

Listing 3-14. Find all words that can be formed from a given set of letters

Func<string,Dictionary<char,int>> ToHist =
word => word.ToCharArray()
.ToLookup (w => w)
.ToDictionary(w => w.Key, w => w.Count());
//Scrabble Cheat
string GivenWord = "what";

StreamReader sr = new StreamReader("C:\\T9.txt");
string total = sr.ReadToEnd();
sr.Close();
List<string> allWords = Regex.Matches(total,"[a-z]+")
.Cast<Match>()
.Select (m => m.Value)
.Distinct()
.ToList();

Dictionary<string,Dictionary<char,int>> forest =
new Dictionary<string,Dictionary<char,int>>();

allWords.ForEach(w => forest.Add(w, ToHist(w)));

Dictionary<char,int> hist = ToHist(GivenWord);

List<string> scrabbleCheats = new List<string>();

foreach (string w in forest.Keys)
{
if(
//keys should match
forest[w].Select (x => x.Key).All(x => hist.Select (h => h.Key).Contains(x))
&&
//values should be less than or equal to that of hist
forest[w].All (x => hist[x.Key]-forest[w][x.Key] >= 0)
)
scrabbleCheats.Add(w);
}

scrabbleCheats.OrderBy (c => c.Length).Dump("Scrabble Cheats");

This generates the words shown in Figure 3-19: all the words that can be generated from the word what.

image

Figure 3-19. All the words that can be formed using the letters of the word what

How It Works

To locate words that can be formed from all the characters of a given word or a subset of those characters, you locate the words for which the character histogram is equal to or is a subset of the character histogram of the given word.

OK, that might sound heavy. Here’s an example. Let’s say we want to find all words that can be formed with the letters of the word what. The first step is to construct a character histogram of what. This histogram is shown in Table 3-1.

Table 3-1. The character histogram of the word what

Character

Frequency

W

1

H

1

A

1

T

1

Note that the histogram for the word hat is a subset of this histogram. In other words, to create hat, we need one h, one a, and one t. We have one of each in what. So the word hat can be formed using the letters of the word what.

In the code, the C# dictionary forest holds all the character histograms for all words in the T9 dictionary. The forest keys represent the words in the dictionary, and the values store their character histograms.

The delegate ToHist generates the histogram of a given word. ToHist() uses ToLookup() to generate a lookup table of characters. This call is followed by a ToDictionary() call, where each key is a character of the word, and each value is a character’s frequency (the number of times that character occurs in a given string).

Figure 3-20 shows the intermediate lookup table, created by ToLookup(), for the word cool. Here all the unique characters are keys, and the values represent the occurrences.

image

Figure 3-20. The lookup table created for the word cool

So this lookup table is converted to a dictionary, displayed as a table in Figure 3-21. The table has two columns: Key and Value. The Key column lists the distinct characters of the word cool. The Value column stores the frequency of the characters in the word.

image

Figure 3-21. The character histogram of the word cool generated from the lookup table

Iterating over the keys in the dictionary forest, we find those keys where all the characters in the histogram match those of the given word by using the following statement:

forest[w].Select (x => x.Key).All(x => hist.Select (h => h.Key).Contains(x)) forest[w] gives us the character histogram. Projecting the keys of this dictionary by using Select (x => x.Key) gives us all the distinct characters. All(x => hist.Select (h => h.Key).Contains(x)) checks whether all these characters are also present in the histogram of the given word.

The last call to All(), forest[w].All (x => hist[x.Key]-forest[w][x.Key] >= 0) checks whether the character frequencies match up for all characters. hist[x.Key] represents the frequency of the character represented by x.Key. forest[w][x.Key]represents the frequency of the character x.Key for the word forest[w].

3-15. Finding All the Subsequences of a Given String

As you learned earlier in this chapter, a subsequence is a string that can be formed from a subset of another string, where these characters appear in monotonically increasing indices in the latter string. For example, the word wine is a subsequence of the phrase world is not enough. Another example is the word rental is a subsequence of ornamental.

Problem

Write a program to find all the subsequences of the given string.

Solution

Listing 3-15 finds all the sub-sequences of a given word.

Listing 3-15. Finding all sub-sequences of a given word

StreamReader sr = new StreamReader ("C:\\T9.txt");
var allWords = Regex.Matches(sr.ReadToEnd(),"[a-z]+").Cast<Match>().Select (m => m.Value).ToList();
sr.Close();
List<string> subsequences = new List<string>();
string bigWord = "awesome";
foreach (string smallWord in allWords)
{
var q = smallWord
.ToCharArray()
.Select (x => bigWord
.ToCharArray()
.ToList()
.LastIndexOf(x));

if(q.All (x => x !=-1)
&&
q.Take(q.Count () - 1)
.Select ((x,i) => new {CurrentIndex = x, NextIndex = q.ElementAt(i+1)})
.All (x => x.NextIndex - x.CurrentIndex > 0))
{
subsequences.Add(smallWord);
}
}
subsequences.Dump("All subsequences");

This generates all the subsequences of the word awesome, as shown in Figure 3-22.

image

Figure 3-22. All the sub-sequences of the word awesome

How It Works

In this example, the letters of the word some appear at monotonically increasing indices in awesome. (The letters of some appear at indices 4, 5, 6, and 7 in awesome.) Thus some is a subsequence of awesome. The following query finds all the indices of the bigWord word where the letters of the smallWord occur.

var q  = smallWord .ToCharArray()
.Select (x => bigWord
.ToCharArray()
.ToList()
.LastIndexOf(x));

If smallWord is rental and bigWord is ornamental, q will have the indices shown in Figure 3-23.

image

Figure 3-23. All the indices of ornamental where the letters of rental occur

To find the subsequence, it’s enough to check for monotonicity of these indices. Also because all characters of the smaller subsequence word have to occur in the bigger word, none of these indices can’t be -1. The first statement q.All (x => x !=-1) verifies that statement.

The second statement checks for the monotonicity of these indices. The first part of the second statement is as follows:

q.Take(q.Count () - 1)
.Select ((x,i) => new {CurrentIndex = x, NextIndex = q.ElementAt(i+1)})

This code projects a list of key/value pairs, where each key represents the current index at each level and the value represents the next index at each level. So the preceding list of indices gets projected as shown in Figure 3-24.

image

Figure 3-24. Current and next index at each level

The final call to All(), shown below, checks whether CurrentIndex is less than and not equal to the next level, for all these projected values.

All (x => x.NextIndex - x.CurrentIndex > 0)

The preceding call will return true only when the indices are monotonically increasing.

Here is a negative test case to prove that this code works. The word saw is not a sub-sequence of awesome. Figure 3-25 shows a table for the current and next index for saw.

image

Figure 3-25. The gap (current index) and next index of the word saw

In this example, the last All() call will return false, so saw can’t be a subsequence of awesome.

3-16. Squeezing a Paragraph to Fill Tightly

Many text editors support a functionality to fill a paragraph tightly so that the number of words in each line becomes almost the same. This is useful for wrapping text to properly utilize screen area.

Problem

Write a program to wrap a given paragraph of multiple lines so that the lengths of the lines become almost the same.

Solution

This program, shown in Listing 3-16, uses average line length and a LINQ operator to generate a nicely distributed paragraph in which all lines are of almost the same length.

Listing 3-16. Line-wrapping algorithm to equalize line lengths

string text =
@"Almost any text editor provides a fill
operation. The fill operation transforms raggedy-looking text
with lines of
different lengths into nicely formatted text with lines
nearly the same length.";
text.Dump("Before");

var words = text.Split(new char[]{' ','\r','\n'},StringSplitOptions.RemoveEmptyEntries)
.Where (t => t.Trim().Length!=0);

var lines = text.Split(new char[]{'\r','\n'},StringSplitOptions.RemoveEmptyEntries);

int max  = lines
.Select(l => l.Split(new char[]{' '},
StringSplitOptions.RemoveEmptyEntries).Count ())
.OrderByDescending (l => l)
.First();

max  = max + max / 2;//Maximum width is 1.5 times that of the current maximum width

Enumerable.Range(0,words.Count ()/max + 1)//decide how many lines need to be there.
.Select(k => words.Skip(k*max).Take(max).Aggregate ((u,v) => u + " " + v))
.Aggregate ((m,n) => m + Environment.NewLine + n)//provide line breaks
.Dump("After");

This generates the output shown in Figure 3-26.

image

Figure 3-26. Results before and after the Fill Paragraph routine

How It Works

The first step to lay lines with even lengths is to determine a decent width for each line. The line with the maximum width is a good starting point. The following code finds the length of the longest line:

lines.Select(l => l.Split(new char[]{' '},StringSplitOptions.RemoveEmptyEntries).Count ())
.OrderByDescending (l => l)
.First();

Using 1.5 times this number should be a good-enough length for the longest line.

The code max = max + max / 2; calculates the maximum line length. The words list contains all the words in the paragraph. Thus words.Length / max + 1 gives the total number of lines required to display the paragraph in the result. Next, the following code creates all the new lines with the required number of words:

Select(k => words.Skip(k*max).Take(max).Aggregate ((u,v) => u + " " + v))

When k is 0, no word is skipped. At the next stage, when the value of k is increased, that many (k*max) words are skipped. The call to Aggregate() stitches the words together to generate the new line. The final call to Aggregate() inserts the line breaks.

3-17. Printing the Lines of a Song

There are many programming languages, some popular, and some not so popular. Some are considered esoteric; however, that doesn’t mean some people don’t use them. 99 Bottles of Beer is a web site that challenges people to write a program in any language to generate the lines of the song “99 Bottles of Beer.” You can find the lyrics at http://99-bottles-of-beer.net/lyrics.html.

Problem

Write a program to generate the lines of the song “99 Bottles of Beer.” Although you can use for loops, please don’t; try to do it using LINQ operators instead.

Solution

I wrote the code in Listing 3-17 by using LINQ to generate the song lines. You can find my full solution on the 99 Bottles of Beer web site, at http://99-bottles-of-beer.net/language-csharp-2549.html.

Listing 3-17. Generating the lyrics to “99 Bottles of Beer”

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace _99Bottlez
{
class Program
{
static void Main(string[] args)
{
int countOfBottles = 10;//Number of bottles
string lineTemplate = @"{X} bottles of beer on the wall, {X} bottles
of beer. Take one down and pass it around, {Y}
bottles of beer on the wall.";

string lastLine = @"No more bottles of beer on the wall, no more
bottles of beer.Go to the store and buy some
more, {X} bottles of beer on the wall.";

List<string> songLines = new List<string> ();
Enumerable.Range(1, countOfBottles)
.Reverse()
.ToList()
.ForEach
(c => songLines.Add(lineTemplate.Replace("{X}",
c.ToString()).Replace("{Y}", (c-1)!=0?(c - 1).ToString():@" No
more bottles of beer on the wall.")));

//Add the last line
songLines.Add(lastLine.Replace("{X}", countOfBottles.ToString()));

songLines.ForEach(c => Console.WriteLine(c));
Console.ReadLine();
}
}
}

How It Works

The program first generates a list of all line numbers and then reverses the list. Finally, the placeholders in the line template (all but last line) get replaced with the proper numbers and added to the songLines list. The code creates the final line by replacing the placeholders in the last-line template, and appends that to the songLines list. The final lines print the result to the Console.

3-18. Mining Abbreviations and Full Forms from News Articles

Newspaper articles often set abbreviations apart from the rest of the text. Normally, they wrap abbreviations in parentheses. The full form of the abbreviations are typically provided just before the abbreviations appear. For example, a newspaper article might state, “Today United Nations (UN) officials….” If a program can be written to read news articles and mine them for abbreviations and their full forms, making glossaries and indexes would be easy. This program could also be used to automatically tag a news article using abbreviations that appear in the text, making it possible to use an indexed search to find news articles related to an abbreviation.

Problem

Write a program that reads a news article in the form of plain text and returns all abbreviations and their full forms.

Solution

The following program finds all the abbreviations and their full forms from a given paragraph of a newspaper article. I have made the abbreviations and their expanded forms bold in the sample input in Listing 3-18.

Listing 3-18. Identifying abbreviations and their expanded forms

string sentence = @"This is an effort by the World Health Organization (WHO) and
Fédération Internationale de Football Association (FIFA) to help
footballers in poor nations. Associated Press (AP) reports.";
//Add all stop words in this list
List<string> stopWords = new List<string>() {"of", "de"};
foreach(string sw in stopWords)
{
List<string> matches = Regex
.Matches(sentence,sw + " " + "[A-Z][a-z]+")
.Cast<Match>()
.Select(m => m.Value).ToList();

foreach (string m in matches)
{
sentence = sentence.Replace(m,m.Replace(sw+" ",string.Empty)+"_"+sw);
}
}
List<string> all = sentence.Split(new char[]{' ',',','!',';','[',']','(',')',
'-','\'','"','\r','\n'},StringSplitOptions.RemoveEmptyEntries).ToList();
List<string> abbs = all
.Where (s => s.ToCharArray().All (x => x>='A' && x<'Z'))
.Distinct()
.ToList();
abbs
.Select (a => new KeyValuePair<int,int>(all.IndexOf(a),a.Length))
.Select(a => new { Abbreviation = all[a.Key], Expansion =
Enumerable.Range(a.Key-a.Value,a.Value)

.Select (e => all[e])
.Aggregate((f,g) => f.Split('_').Aggregate ((x,y) => y +  " " + x )
+ " " + g.Split('_').Aggregate ((x,y) => y +  " " + x )).Trim()})
.OrderBy (a => a.Abbreviation).Dump("Abbreviations with Expansions");

This program generates the output shown in Figure 3-27. In the table, the first column shows the abbreviation, and the second column shows the expanded form of the abbreviation.

image

Figure 3-27. Extracted abbreviations and their expanded forms

How It Works

The program first loads all the words from the input into a list. Using a regular expression, the program finds the abbreviations within that list and creates a list of them as well. (Of course, those abbreviations are also stored in the first list, because all the words of the given text are stored there.)

The program relies on the length of the abbreviations to backtrack and find the full expanded text for the abbreviation. However, stop words such as of or in can jeopardize that process, because those words aren’t reflected in the abbreviation. For example the abbreviation for “United States of America” is typically USA, not USoA. To fix this potential problem, the example uses a regular expression to locate patterns in which a stop word appears followed by a word with a capital letter. The program transforms this combination to a single word by concatenating the stop word to the following word with a leading underscore. For example, in this example, the phrase of International is transformed to International_of. That may look weird, but the transformation ensures that backtracking will find the complete expanded form of the abbreviation.

Consider this code snippet:

abbs
.Select (a => new KeyValuePair<int,int>(all.IndexOf(a),a.Length))

This returns IEnumerable<KeyValuePair<int,int>>, where the first integer represents the index of the abbreviation in the list of all words. The second integer stores the length of all abbreviations as values. This value determines how many words we should check back (backtrack) to find the expansion of the abbreviation.

The following snippet returns as many words as the length of the abbreviation, which are immediately before the abbreviation in the list of all words:

Enumerable.Range(a.Key-a.Value,a.Value)
.Select (e => all[e])

a.Key represents the index where the abbreviation is found, and a.Value represents its length. So a.Key - a.Value returns an index which is a.Value ahead of the abbreviation in the list of all words.

The following call to Aggregate deals with the stop word tokens properly:

.Aggregate((f,g) => f.Split('_').Aggregate ((x,y) => y +  " " + x )
+ " " + g.Split('_').Aggregate ((x,y) => y +  " " + x )).Trim()})

This call transforms International_of back to of International. It also stitches together the words that form the expanded form of the abbreviation. Finally, the list is sorted alphabetically by abbreviations.

Summary

Congratulations on finishing yet another long chapter. I hope you had fun trying out the examples. Also, you’ve seen how diverse problems such as human-computer interaction and data-mining tasks can be implemented using LINQ. And you have picked up a few LINQ idioms along the way. These idioms will be useful in later chapters—and beyond this book, when you deal with your own code. The next chapter shows how to refactor code by using LINQ to make it cleaner and more concise.