Interacting with the File System - Thinking in LINQ: Harnessing the Power of Functional Programming in .NET Applications (2014)

Chapter 9. Interacting with the File System

You can use LINQ much like a scripting language to analyze the file system. For .NET developers, LINQ can be as useful as PowerShell, and sometimes more so, because LINQ queries can still draw on everything else that the host language (C# in this case) has to offer.

In this chapter, you will see recipes that illustrate how you can use LINQ to perform various file operations, including the following:

· Comparing CSV files

· Finding the total size of a set of files in a directory

· Simulating some common Linux commands

· Finding duplicate files, duplicate file names, and zero-length files

9-1. Comparing Two CSV Files

Most file diff utilities compare files line by line. However, this scheme doesn’t work for CSV file comparison. Two CSV files are the same if their rows are the same, irrespective of the order in which the rows appear. The same is true for columns: if two CSV files have the same columns, even in a different sequence, the files can be considered to have the same column headers.

Problem

Write a general-purpose function to check whether two CSV files are the same or different.

Solution

Listing 9-1 provides a short solution.

Listing 9-1. Determine whether two CSV files are the same.

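//Requires: using System; using System.Collections.Generic;
//          using System.IO; using System.Linq;

//The column names, taken from the first line of the file
Func<string,IEnumerable<string>> GetHeaders =
(fileName) => File.ReadAllLines(fileName)
.First()
.Split(new char[]{','},StringSplitOptions.None);

//The non-blank data rows that follow the header line
Func<string,IEnumerable<string>> GetBody =
(fileName) => File.ReadAllLines(fileName)
.Skip(1)
.Where (f => f.Trim().Length!=0);

Func<string,string,bool> IsSameCSV =
(firstFile,secondFile) =>
//Match column headers (in both directions, so neither file has extras)
GetHeaders(firstFile)
.All (x => GetHeaders(secondFile).Contains(x))
&& GetHeaders(secondFile)
.All (x => GetHeaders(firstFile).Contains(x))
//Match the body (again in both directions)
&& GetBody(firstFile)
.All (x => GetBody(secondFile).Contains(x))
&& GetBody(secondFile)
.All (x => GetBody(firstFile).Contains(x));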

When this code is run using the two CSV files shown in Figure 9-1, the function IsSameCSV returns true.

Figure 9-1. Two CSV files that are considered the same

How It Works

Comparing two CSV files requires comparing both the headers and the rows. When the header order and row order are the same, the problem is trivial, but the task becomes more complicated when the headers and/or the rows appear in a different order in one file than in the other. Therefore, when comparing two CSV files, you must make sure that you’re comparing only the values in the lines—irrespective of the order in which the lines appear.

Note: When referring to the order of lines, the code still assumes that the header line is the first line in the file. That is reasonable, because the common CSV format described in RFC 4180 places the optional header line first.

The GetHeaders and GetBody delegates return the headers (columns) and the body (rows) of the CSV file, respectively. Note that GetBody skips the header line by calling Skip(1), and then discards blank lines.

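IsSameCSV takes two file names as arguments, and returns true if the headers and body of these two CSV files are the same. It compares the headers and bodies separately. The following expression checks that every header in the first file also appears in the second; as you can see from the “Solution” section, the code that compares the bodies follows exactly the same logic. Because All() combined with Contains() tests containment in one direction only, a full equality check also needs the same test with the two files swapped. Rows are compared as whole lines of text, so two files whose columns appear in a different order match on headers but not on body rows.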

GetHeaders(firstFile)
.All (x => GetHeaders(secondFile).Contains(x))
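
The comparison described above can also be expressed with sets. The following is a minimal sketch, not the book's listing: it assumes C# 9 or later (top-level statements), and the names LoadCsv and IsSameCsvStrict are hypothetical. It reads each file once and uses HashSet<string>.SetEquals(), so extra headers or rows in either file make the comparison fail.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: loads one CSV file once and returns its header set and row set.
Func<string, (HashSet<string> Headers, HashSet<string> Rows)> LoadCsv = fileName =>
{
    string[] lines = File.ReadAllLines(fileName);
    var headers = new HashSet<string>(lines.First().Split(','));
    var rows = new HashSet<string>(lines.Skip(1).Where(l => l.Trim().Length != 0));
    return (headers, rows);
};

// Two files match when both header sets and both row sets are equal, in any order.
Func<string, string, bool> IsSameCsvStrict = (firstFile, secondFile) =>
{
    var first = LoadCsv(firstFile);
    var second = LoadCsv(secondFile);
    return first.Headers.SetEquals(second.Headers)
        && first.Rows.SetEquals(second.Rows);
};

// Usage with two throwaway files whose rows appear in a different order.
string a = Path.GetTempFileName();
string b = Path.GetTempFileName();
File.WriteAllLines(a, new[] { "id,name", "1,Ann", "2,Bob" });
File.WriteAllLines(b, new[] { "id,name", "2,Bob", "1,Ann", "" });
Console.WriteLine(IsSameCsvStrict(a, b)); // True
```

Like the listing, this sketch compares rows as whole strings and treats repeated rows as one, so it does not reconcile reordered columns or differing duplicate counts.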

9-2. Finding the Total File Size in a Directory

Knowing the total size of the files in each directory, expressed in megabytes or gigabytes, is an important part of understanding disk space consumption. To cover a whole directory tree, you must visit the files in every subdirectory of the starting directory.

Problem

Write a LINQ script that lists the total file size (measured in megabytes) of every directory inside a given directory.

Solution

Listing 9-2 shows a complete solution.

Listing 9-2. Calculate the total size of files in a directory.

Directory.GetFiles(@"C:\Users\mukhsudi\Downloads","*.*",
SearchOption.AllDirectories)
.Select (d => new FileInfo(d))
.Select (d => new { Directory = d.DirectoryName,
FileSize = d.Length} )
.ToLookup (d => d.Directory )
.Select (d => new { Directory = d.Key, TotalSizeInMB =
Math.Round(d.Select (x => x.FileSize).Sum () /
Math.Pow(1024.0,2),2)})
.OrderByDescending (d => d.TotalSizeInMB)
.Dump();

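Note: You will need to edit the path in the preceding code to be a valid directory on your computer. Dump() is an extension method provided by LINQPad; outside LINQPad, write the results out another way, for example with Console.WriteLine().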

How It Works

Calling GetFiles with SearchOption.AllDirectories returns the full names of all the files in the specified directory and in all of its subdirectories. The operating system reports file sizes in bytes, which you can read from the Length property of FileInfo. ToLookup() groups the files by the directory that directly contains them, so d.Select(x => x.FileSize) yields the sizes, in bytes, of the files in one directory, and Sum() adds them up. Dividing that sum by 1024 raised to the power of 2 converts it to megabytes, and Math.Round() keeps two decimal places. Note that each total covers only the files directly inside a directory; the sizes of its subdirectories are listed separately rather than rolled up into the parent.
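
To make the order of operations concrete (sum the bytes first, then convert to megabytes), here is a small sketch of the same idea. It assumes C# 9 or later, uses GroupBy() in place of ToLookup(), and the name SizeByDirectory and the throwaway demo directory are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: total size, in megabytes, of the files directly inside
// each directory under rootPath. Sizes are summed in bytes, then converted.
Func<string, (string Folder, double TotalSizeInMB)[]> SizeByDirectory = rootPath =>
    Directory.EnumerateFiles(rootPath, "*", SearchOption.AllDirectories)
        .Select(path => new FileInfo(path))
        .GroupBy(file => file.DirectoryName)
        .Select(dir => (
            Folder: dir.Key,
            TotalSizeInMB: Math.Round(dir.Sum(file => file.Length) / Math.Pow(1024.0, 2), 2)))
        .OrderByDescending(entry => entry.TotalSizeInMB)
        .ToArray();

// Usage: a throwaway tree with a 1 MB file at the top and a 0.5 MB file in "sub".
string demoRoot = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
string demoSub = Path.Combine(demoRoot, "sub");
Directory.CreateDirectory(demoSub);
File.WriteAllBytes(Path.Combine(demoRoot, "a.bin"), new byte[1024 * 1024]);
File.WriteAllBytes(Path.Combine(demoSub, "b.bin"), new byte[512 * 1024]);
foreach (var total in SizeByDirectory(demoRoot))
    Console.WriteLine($"{total.Folder}: {total.TotalSizeInMB} MB");
```

The top-level directory is reported with the size of a.bin only, which illustrates that the totals are per directory and not cumulative.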

9-3. Cloning Linux Head and Tail Commands

Listing the first few or the last few lines of a file is a common task. Linux includes the commands head and tail to do this; however, the Windows command prompt doesn’t include any equivalent commands. Using LINQ, you can easily brew up your own version of head and tail.

Problem

Write a program to clone the Linux head and tail commands for Windows.

Solution

Listing 9-3 creates clones of the Linux head and tail commands.

Listing 9-3. Clone the Linux head and tail commands

Func<IEnumerable<string>,int,IEnumerable<string>> TakeLast =
(list, count) => list.Skip(list.Count()-count);
//Cloning head
Func<string,int,IEnumerable<string>> Head = (fileName, lineCount)=>
File.ReadAllLines(fileName).Take(lineCount);
//Cloning tail
Func<string,int,IEnumerable<string>> Tail = (fileName, lineCount)=>
TakeLast(File.ReadAllLines(fileName),lineCount);

Head("C:\\conf.txt",4).Dump();
Tail("C:\\conf.txt",4).Dump();

Figure 9-2 shows the content of the conf.txt file.

Figure 9-2. Content of the conf.txt file

The output of the code is shown in Figure 9-3.

Figure 9-3. Output of the cloned Linux head and tail commands

How It Works

Showing the first few lines of a text file maps directly onto the Take() operator, which you’ve encountered numerous times earlier in this book. Showing the last few lines is the same as skipping the total number of lines minus the number of lines you want to show, and then returning whatever remains. The TakeLast delegate does exactly that. If the file has fewer lines than requested, the count passed to Skip() is negative, Skip() skips nothing, and the whole file is returned.
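
For large files, reading every line into memory just to show a few of them can be wasteful. The following sketch is an alternative under stated assumptions, not the book's code: it assumes C# 9 or later, and the names HeadLazy and TailLazy are hypothetical. It streams the file with File.ReadLines() and, for the tail, keeps only a sliding window of the most recent lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Head: File.ReadLines streams the file, so reading stops after lineCount lines.
Func<string, int, IEnumerable<string>> HeadLazy = (fileName, lineCount) =>
    File.ReadLines(fileName).Take(lineCount);

// Tail: keep a sliding window of at most lineCount lines while streaming the file.
Func<string, int, IEnumerable<string>> TailLazy = (fileName, lineCount) =>
{
    var window = new Queue<string>();
    foreach (string line in File.ReadLines(fileName))
    {
        window.Enqueue(line);
        if (window.Count > lineCount) window.Dequeue();
    }
    return window;
};

// Usage with a throwaway ten-line file.
string demoFile = Path.GetTempFileName();
File.WriteAllLines(demoFile, Enumerable.Range(1, 10).Select(i => "line " + i));
Console.WriteLine(string.Join(", ", HeadLazy(demoFile, 3))); // line 1, line 2, line 3
Console.WriteLine(string.Join(", ", TailLazy(demoFile, 3))); // line 8, line 9, line 10
```

Recent .NET versions also ship a built-in Enumerable.TakeLast() operator, which could replace the hand-written window where it is available.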

9-4. Locating Files with the Same Name (Possible Duplicates)

Sometimes the same file gets copied to multiple destinations. From a file management perspective, it’s important to be able to find such duplicate files and delete all the unneeded copies. The first step in doing that is to locate all files that have the same name.

Caution: Just because two files have the same name doesn’t necessarily mean that they’re duplicates. For example, two very different software installations might use the file name license.txt, but the content of the two files is likely to be different.

Problem

Write a LINQ script to find files residing in different directories with the same name.

Solution

Listing 9-4 shows how to find identically named files in different directories.

Listing 9-4. Find files of the same name.

//Locating duplicate files
Directory.GetFiles(@"C:\Users\mukhsudi\Downloads",
"*.*",SearchOption.AllDirectories)
.Select (d => new FileInfo(d))
.Select (d => new {FileName = d.Name,
Directory = d.DirectoryName})
.ToLookup (d => d.FileName)
.Where (d => d.Count ()>=2)
.Dump();

How It Works

The code first finds the files and projects each one to its name and directory. Next, it creates a lookup table using the file name as the key. Any file name that occurs in more than one directory therefore has at least two entries under its key, and the filter .Where (d => d.Count ()>=2) keeps only those keys.
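
One detail the lookup does not address is letter case: the default string comparison is case-sensitive, whereas Windows file names usually are not. The sketch below shows one way to group names while ignoring case. It assumes C# 9 or later, uses GroupBy() rather than ToLookup(), and the name FindSameNames and the demo directory are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: maps each file name that occurs more than once under
// searchRoot to the full paths that share it, ignoring letter case.
Func<string, Dictionary<string, string[]>> FindSameNames = searchRoot =>
    Directory.EnumerateFiles(searchRoot, "*", SearchOption.AllDirectories)
        .GroupBy(path => Path.GetFileName(path), StringComparer.OrdinalIgnoreCase)
        .Where(sameName => sameName.Count() >= 2)
        .ToDictionary(sameName => sameName.Key,
                      sameName => sameName.ToArray(),
                      StringComparer.OrdinalIgnoreCase);

// Usage: two directories that each hold a license file, spelled differently.
string namesRoot = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(Path.Combine(namesRoot, "one"));
Directory.CreateDirectory(Path.Combine(namesRoot, "two"));
File.WriteAllText(Path.Combine(namesRoot, "one", "license.txt"), "first");
File.WriteAllText(Path.Combine(namesRoot, "two", "LICENSE.TXT"), "second");
File.WriteAllText(Path.Combine(namesRoot, "two", "readme.txt"), "unique");
foreach (var pair in FindSameNames(namesRoot))
    Console.WriteLine($"{pair.Key}: {string.Join(" | ", pair.Value)}");
```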

9-5. Finding Exact-Duplicate Files

This is an extension of the preceding recipe. Sometimes people rename duplicate files without changing the contents. Unfortunately, that means the same file—but with different names—may exist in several different folders. The code from the previous recipe finds only duplicate names, not duplicate files. This recipe finds exact file duplicates—even if the file names are different.

Problem

Write a LINQ script to find duplicate files with different names, even if the duplicate files reside in different folders.

Solution

This solution, shown in Listing 9-5, complements the previous solution by finding files with identical content, even if the file names are different.

Listing 9-5. Find files with identical content

//Locating exact-duplicate files
Directory.GetFiles(@"C:\Program Files"
,"*.*",SearchOption.AllDirectories)
.Where (d => d.EndsWith(".txt"))
.Select (d => new { FileName = d,
ContentHash = File.ReadAllText(d).GetHashCode()})
.ToLookup (d => d.ContentHash)
.Where (d => d.Count ()>=2)
.Dump();

Note: The preceding code will raise an error (an UnauthorizedAccessException) if you do not have sufficient rights to access all the files in the specified directory.

How It Works

Locating exact-duplicate files is an expensive process, because to determine whether the content of two files is identical, you need to read the files, create a hash code for each, and then compare the hash codes. Unlike the previous example, this example uses the hash code of each file’s content as the lookup table key. If the result contains two or more elements for any given hash code, those files are very likely duplicates. String.GetHashCode() produces only a 32-bit value, so different contents can occasionally share a hash code; compare the contents of the matching files before deleting anything.
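
Where accidental hash collisions are a concern, a cryptographic digest of the raw bytes is a common substitute for String.GetHashCode(). The following sketch swaps in SHA-256 and groups by file size first so that only files of equal length are hashed. It is an illustration under assumptions, not the book's method: it assumes C# 9 or later, and the names ContentDigest and FindDuplicateContent are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper: hex-encoded SHA-256 digest of a file's raw bytes.
Func<string, string> ContentDigest = fileName =>
{
    using (SHA256 sha = SHA256.Create())
    using (FileStream stream = File.OpenRead(fileName))
    {
        return BitConverter.ToString(sha.ComputeHash(stream));
    }
};

// Group by length first so that only files of equal size are hashed at all.
Func<string, List<string[]>> FindDuplicateContent = contentRoot =>
    Directory.EnumerateFiles(contentRoot, "*", SearchOption.AllDirectories)
        .GroupBy(path => new FileInfo(path).Length)
        .Where(sameSize => sameSize.Count() >= 2)
        .SelectMany(sameSize => sameSize.GroupBy(path => ContentDigest(path)))
        .Where(sameDigest => sameDigest.Count() >= 2)
        .Select(sameDigest => sameDigest.ToArray())
        .ToList();

// Usage: two identical files plus one of the same length but different content.
string dupRoot = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(dupRoot);
File.WriteAllText(Path.Combine(dupRoot, "a.txt"), "same content");
File.WriteAllText(Path.Combine(dupRoot, "copy-of-a.txt"), "same content");
File.WriteAllText(Path.Combine(dupRoot, "b.txt"), "same length!");
foreach (string[] duplicates in FindDuplicateContent(dupRoot))
    Console.WriteLine(string.Join(" == ", duplicates));
```

Because the digest is computed from bytes rather than decoded text, this approach is not limited to .txt files.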

9-6. Organizing Downloads Automatically

If you’re like me, you probably download a lot of files—and then forget about them. Over time, it becomes a pain to organize all these files in proper directories. Using a LINQ script, you can bring order to this chaos.

Problem

Organize downloaded files by copying them into directories named after a keyword and a file type.

Solution

Listing 9-6 organizes downloaded files by keyword and file type, storing them in appropriate directories.

Listing 9-6. Programmatically organize downloaded files.

string[] keywords = {"Roslyn","Rx","LINQ","F#"};
string[] videoFormats = {".mp4",".mpg",".mpeg",".flv"};
string[] slides = {".pptx",".ppt"};
string[] articles = {".pdf",".doc",".docx"};
string[] blogs = {".html",".htm"};

Directory.GetFiles(@"C:\Users\mukhsudi\Downloads")
.ToLookup (d => keywords.FirstOrDefault(x => d.Contains(x)))
.Where (d => d.Key != null )
.Select (d =>
new
{
Key = d.Key,
//Find all the videos for the given keyword
Videos = d.Where (x => videoFormats
.Any (f => x.EndsWith(f))),
//Find all the articles
Articles = d.Where (x => articles
.Any (f => x.EndsWith(f)))

//I omitted Slides and Blogs because those will be similar.

})
.ToList()
.ForEach(z =>
{
Directory.CreateDirectory(z.Key + " Videos");
z.Videos
.ToList()
.ForEach(f =>
File.Copy(f,Path.Combine(z.Key + " Videos",
new FileInfo(f).Name)));
Directory.CreateDirectory(z.Key + " Articles");
z.Articles.ToList().ForEach(f =>
File.Copy(f,Path.Combine(z.Key + " Articles",
new FileInfo(f).Name)));
});

How It Works

At the heart of this solution are the keyword list and the lists of file extensions, which together determine how the files are classified. Each extension list names the file types that belong to one category, and therefore to one kind of directory. In this case, the file type lists are as follows:

string[] videoFormats = {".mp4",".mpg",".mpeg",".flv"};
string[] slides = {".pptx",".ppt"};
string[] articles = {".pdf",".doc",".docx"};
string[] blogs = {".html",".htm"};

These arrays define the file types in each category. For example, if a file’s extension is either .pptx or .ppt, it’s a presentation file. I want to keep these files in a folder called XYZ Slides, where XYZ is a placeholder for one of the keywords defined in the first line of the listing: in this case, Roslyn, Rx, LINQ, and F#. The goal is that if the file name contains one of the keywords, for example, LINQ, and has a .pptx extension, then that file will be copied into a LINQ Slides folder. The idea is the same for all the other keywords and for all the other extensions.

The call to ToLookup() finds the first keyword that each file path contains and uses it as the key of a lookup table, so each keyword maps to the files that mention it. For files that don’t contain any of the specified keywords, the key is null, and the next Where() call filters out those files. Finally, the Select() call projects each group into an object that holds the keyword together with the videos and articles associated with it.

At the end, the call to ForEach creates the destination folders and copies the videos and articles into them.
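
The keyword-plus-category rule can be isolated into a single function, which also covers the Slides and Blogs categories without repeating code. The sketch below is one possible arrangement, assuming C# 9 or later; the dictionary layout and the name DestinationFolder are hypothetical, and the function only computes a folder name without touching the disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

string[] topicKeywords = { "Roslyn", "Rx", "LINQ", "F#" };

// Category name -> extensions, so that every category is handled by the same code.
var categories = new Dictionary<string, string[]>
{
    ["Videos"] = new[] { ".mp4", ".mpg", ".mpeg", ".flv" },
    ["Slides"] = new[] { ".pptx", ".ppt" },
    ["Articles"] = new[] { ".pdf", ".doc", ".docx" },
    ["Blogs"] = new[] { ".html", ".htm" }
};

// Hypothetical helper: returns a folder name such as "LINQ Slides", or an empty
// string when the file name matches no keyword or no known extension.
Func<string, string> DestinationFolder = filePath =>
{
    var name = Path.GetFileName(filePath);
    var extension = Path.GetExtension(filePath);
    var keyword = topicKeywords.FirstOrDefault(k => name.Contains(k));
    var category = categories
        .Where(c => c.Value.Contains(extension, StringComparer.OrdinalIgnoreCase))
        .Select(c => c.Key)
        .FirstOrDefault();
    return keyword == null || category == null ? "" : keyword + " " + category;
};

Console.WriteLine(DestinationFolder("Intro to LINQ.pptx"));  // LINQ Slides
Console.WriteLine(DestinationFolder("Rx in action.MP4"));    // Rx Videos
Console.WriteLine(DestinationFolder("holiday.mp4") == "");   // True
```

Matching the keyword against the file name alone, rather than the full path, avoids accidental matches on directory names; the resulting folder name can then be passed to Directory.CreateDirectory() and File.Copy() as in Listing 9-6.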

9-7. Finding Files Modified Last Week

While doing forensic analysis on a file system, you often need to know when a file was last modified. Using LINQ and the System.IO file APIs, it’s easy to find all files modified within the last week.

Problem

Find all files in a directory that were modified during the past week.

Solution

Listing 9-7 finds all the files modified within the previous week.

Listing 9-7. Find modified files within a date/time range

Directory.GetFiles(@"C:\Program Files","*.*",SearchOption.AllDirectories)
.Select (d => new FileInfo(d))
.OrderByDescending (d => d.LastWriteTime)
.Select (d => new {Name = d.FullName ,
LastModifiedTime = d.LastWriteTime})
.Where (d => d.LastModifiedTime.AddDays(7)
.CompareTo(DateTime.Today)>=0 )
.Dump("Files modified during last week");

Note: The preceding code will raise an error (an UnauthorizedAccessException) if you do not have sufficient rights to access all the files in the specified directory.

How It Works

Whenever a file is modified, the last write time changes. Thus you can use the last write time to determine when a file was last changed. Knowing that, you can find all files where the last write time is within seven days of the current date, using the filter call shown here:

.Where (d => d.LastModifiedTime.AddDays(7)
.CompareTo(DateTime.Today)>=0 )
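
The filter is equivalent to asking whether the last write time falls on or after a cutoff of midnight seven days before today. The sketch below spells out that cutoff and filters before sorting, so that only the matching files are ordered. It assumes C# 9 or later; the name ModifiedSince and the demo files are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: files under recentRoot whose last write time is on or
// after cutoff, newest first. Filtering before sorting keeps the sort small.
Func<string, DateTime, FileInfo[]> ModifiedSince = (recentRoot, cutoff) =>
    new DirectoryInfo(recentRoot)
        .EnumerateFiles("*", SearchOption.AllDirectories)
        .Where(file => file.LastWriteTime >= cutoff)
        .OrderByDescending(file => file.LastWriteTime)
        .ToArray();

// Usage: one file written just now and one back-dated by thirty days.
string recentDemo = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(recentDemo);
string freshFile = Path.Combine(recentDemo, "fresh.txt");
string staleFile = Path.Combine(recentDemo, "stale.txt");
File.WriteAllText(freshFile, "edited today");
File.WriteAllText(staleFile, "edited last month");
File.SetLastWriteTime(staleFile, DateTime.Now.AddDays(-30));

// Same cutoff as the recipe: midnight seven days before today.
foreach (FileInfo recent in ModifiedSince(recentDemo, DateTime.Today.AddDays(-7)))
    Console.WriteLine($"{recent.FullName}  {recent.LastWriteTime}");
```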

9-8. Locating Dead Files (Files with Zero Bytes)

A dead file is a file that has nothing in it. Such files are far more common in a typical file system than you might think.

Problem

Locate dead files in your file system.

Solution

This short solution, shown in Listing 9-8, finds dead files in a specified directory and all its subdirectories.

Listing 9-8. Find zero-length files in a directory tree

Directory.GetFiles(@"C:\Program Files","*.*",SearchOption.AllDirectories)
.Select (d => new FileInfo(d))
.Where (d => d.Length == 0)
.Dump("Dead Files");

Note: The preceding code will raise an error (an UnauthorizedAccessException) if you do not have sufficient rights to access all the files in the specified directory.

How It Works

Files with nothing in them are generally not useful. GetFiles() returns a string array containing the full names of the files, Select() projects that array as an IEnumerable of FileInfo, and Where() keeps only the files whose Length property is zero.
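
On newer runtimes, the access errors mentioned in the notes for the last few recipes can be avoided by asking the enumeration to skip inaccessible entries. The following sketch assumes .NET Core 2.1 or later (where System.IO.EnumerationOptions is available) and C# 9 or later; the name FindDeadFiles and the demo directory are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

// Assumption: System.IO.EnumerationOptions is available (.NET Core 2.1 or later).
var skipInaccessible = new EnumerationOptions
{
    RecurseSubdirectories = true,
    IgnoreInaccessible = true,  // skip entries the current user cannot read
    AttributesToSkip = 0        // include hidden and system files, which the defaults skip
};

// Hypothetical helper: full paths of all zero-byte files under deadRoot.
Func<string, string[]> FindDeadFiles = deadRoot =>
    new DirectoryInfo(deadRoot)
        .EnumerateFiles("*", skipInaccessible)
        .Where(file => file.Length == 0)
        .Select(file => file.FullName)
        .ToArray();

// Usage: a throwaway tree with one empty file and one non-empty file.
string deadDemo = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(Path.Combine(deadDemo, "logs"));
File.WriteAllBytes(Path.Combine(deadDemo, "logs", "empty.log"), new byte[0]);
File.WriteAllText(Path.Combine(deadDemo, "notes.txt"), "not empty");
foreach (string dead in FindDeadFiles(deadDemo))
    Console.WriteLine(dead);
```

EnumerateFiles() yields results lazily, so large directory trees do not need to be loaded into an array before filtering begins.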