F# Deep Dives (2015)

Part 2. Developing analytical components

Chapter 6. Integrating stock data into the F# language

Keith Battocchi

From its initial release, F# has always been a good language for manipulating data. The standard F# library provides a wide range of functions that help to reduce the complexity of data-processing tasks. For example, grouping, aggregation, and floating windows can be implemented in a single line of code. F#’s rich libraries, concise syntax, and support for functional programming can significantly reduce the time to market for data-driven applications. Despite these features, many data-access tasks require some amount of boilerplate code.

For example, consider the task of reading historical stock prices from the Yahoo! Finance site, which returns data as a comma-separated value (CSV) file. When pulling data, you create a simple data type representing the data stored in the rows (for example, a StockQuote type with appropriately typed properties for the date, opening price, and so on) along with a corresponding method for creating an instance of the type from the strings pulled from the columns of the file. Given the columns of the file, creating this data type is completely mechanical; but before F# 3.0, there wasn’t a good way to automate this kind of metaprogramming. CSV files are just one example of a data source that has a logical schema that can be mapped to F# types—the same type of problem arises when dealing with databases, web services, XML and JSON documents, and even entire REST services.
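To make the boilerplate concrete, here is a minimal sketch of the hand-written approach described above. The record name StockQuote comes from the text; the field names, the FromColumns member, and the parseLine helper are assumptions chosen for illustration, and the naive split on commas ignores quoting.

```fsharp
open System
open System.Globalization

// Hypothetical record mirroring the columns of one row of the CSV file.
type StockQuote =
    { Date : DateTime
      Open : float
      High : float
      Low : float
      Close : float
      Volume : int64
      AdjClose : float }

    // Builds a quote from the raw column strings of a single row.
    static member FromColumns(cols : string[]) =
        let inv = CultureInfo.InvariantCulture
        { Date = DateTime.Parse(cols.[0], inv)
          Open = Double.Parse(cols.[1], inv)
          High = Double.Parse(cols.[2], inv)
          Low = Double.Parse(cols.[3], inv)
          Close = Double.Parse(cols.[4], inv)
          Volume = Int64.Parse(cols.[5], inv)
          AdjClose = Double.Parse(cols.[6], inv) }

// Naive split on commas; a real parser must also handle quoted fields.
let parseLine (line : string) = StockQuote.FromColumns(line.Split(','))

let quote = parseLine "2011-12-30,26.00,26.12,25.91,25.96,27395700,25.04"
```

Every line of this type follows mechanically from the header row, which is exactly the kind of code a type provider can produce for you.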

In this chapter, you’ll learn how to use a new feature in F# 3.0 called type providers to make accessing structured data from F# even easier. To simplify accessing stock data, you’ll first build a CSV type provider, which makes it frictionless to access the data from an arbitrary CSV file.[1] Then you’ll build a specialized type provider that focuses on the data exposed by Yahoo! Finance. The example demonstrates a number of business cases for F#. First, repeating boilerplate code is an error-prone approach to processing data, so removing this duplication helps you write correct code. Second, data access and parsing is a tedious task. By integrating Yahoo! Finance data directly into the language, you significantly reduce the time to market for analytical components based on this data source. In summary, type providers allow you to eliminate all the boilerplate just described and dive right in.

1 The CSV type provider is a slightly simplified version of the open source CSV type provider included with the F# 3.0 Sample Pack, which can be found at http://fsharp3sample.codeplex.com.

Introducing type providers

As you saw in the introduction, type providers are a new feature in F# 3.0, designed to make accessing external data easier. Type providers map external data sources into types that the F# compiler can understand. For example, the WorldBank type provider (see the following sidebar) provides a type WorldBank, which has an automatically generated property for each country of the world: WorldBank.``United States``, WorldBank.Ukraine, and so on. These members are generated by the type provider based on the current response from the WorldBank REST API.

F# data type providers

The standard F# distribution comes with type providers for a few data sources, including WSDL (web services) and SQL (using LINQ to Entities and LINQ to SQL). But a large number of type providers have been developed and maintained by the F# community as part of the F# Data project. The type providers include the following:

· XML, JSON, and CSV are type providers for three common file formats. They work by inferring the file schema from a sample file and generating types for accessing documents with the same structure.

· WorldBank and Freebase are specialized type providers that provide access to information provided by the World Bank (country development indicators) and Freebase (a graph-based online knowledge database).

F# Data is available on GitHub, and the package can be installed from NuGet using the package named FSharp.Data.

Before looking at the implementation of your own type provider, let’s explore how type providers are used. This will help you understand the Yahoo! Finance type provider you’ll implement in this chapter.

Using the CSV type provider

The CSV type provider is available in F# Data. In this section, you’ll explore how it works. Providing access to CSV files will be one part of the Yahoo! Finance type provider, so you’ll reimplement part of the CSV type provider later in this chapter (you could also reuse the code from F# Data, but reimplementing the CSV part will be a good example).

The files that you’ll later obtain from the Yahoo! Finance service are CSV files containing stock data for a particular ticker symbol. A sample file starts with the following three lines:

Date,Open,High,Low,Close,Volume,Adj Close

2011-12-30,26.00,26.12,25.91,25.96,27395700,25.04

2011-12-29,25.95,26.05,25.86,26.02,22616900,25.10

The file starts with a header row that specifies the names of individual columns, followed by a number of lines with data that matches the schema of the first row.

Without preprocessing the file or doing anything else, the CSV provider allows you to write a script like that in the following listing. It prints the date together with daily high and daily low prices from each row in the file that came from a day in December.

Listing 1. Using the CSV provider to print stock data from December

Listing 1 assumes that you obtained the F# Data package using NuGet, so the first line references FSharp.Data.dll from the packages directory. The key part is the line that defines the Stocks type alias. The type provider looks at the specified sample file and generates a type with a Rows property, where each row has properties corresponding to the headers in the CSV file, such as Date, Low, and High.

Here you’re running the script in F# Interactive (FSI), but you could equally compile it in a standard F# application and use the type provider to parse other CSV files by using a different filename in the call to the Load method. The CSV provider gives similar benefits to using a code generator, but without the code generation! This means you can retain all of F#’s benefits as a great data-scripting language. Type providers even play nicely with the features of the integrated development environment (IDE) that make it easy to explore unfamiliar APIs from within an F# script, such as code completion.

How the CSV provider works

Perhaps the easiest way to think of a type provider is as a form of compiler plug-in. Type providers are components that match a particular interface and that are loaded from a referenced .NET assembly by the compiler at design time. At a high level, the type provider interface just specifies a set of .NET System.Type instances representing the types to provide, and these types expose provided members (such as methods, properties, and nested types). Each provided type has a corresponding .NET type that takes its place in the compiled code, and each call to a method or property has a corresponding code quotation that’s inserted in its place.[2]

2 There are two kinds of type providers: erasing and generative. In this chapter I’ll only be describing the mechanics of erasing providers, because generative providers are typically used only to wrap existing .NET code generators.

Note

An important part of the type-provider story is that type providers work well with editors (IDEs). When entering listing 1 into an IDE with F# support such as Visual Studio, you can type stockRow followed by a dot, and you’ll get a completion list with the available columns. This means you need to consider design time (the type provider running in the editor), compile time (when the compiler is run), and runtime (when the compiled program runs).

If you look closely again at listing 1, you’ll see one additional feature of type providers: the provided type (CsvProvider in this case) can be parameterized by statically known information (such as the location of the CSV file in this case, or a database connection string in the case of a type provider for connecting to SQL databases). This powerful mechanism allows the same type provider to be used to access a variety of similarly structured data sources.

The philosophy behind type providers

As listing 1 indicates, type providers fill a niche in F# that’s filled by other technologies in other languages. For instance, you might use code generation to achieve something like what the type provider does in listing 1 if you were programming in another language (for example, C# developers use LINQ to Entities this way). To understand the unique benefits of type providers for F#, consider the following list of criteria:

· Code-focused —You want to minimize context switching. If you’re using FSI, you should be able to reference a type provider and program against it interactively without having to switch to another tool first.

· Uniform integration —You want a single, consistent mechanism, regardless of the data source being accessed, and you want the types representing data sources to be treated just like any user-defined types. Together with the previous item, this motivates the use of .NET Type objects as the “interface” to type providers. Invoking a type provider is seamless, as opposed to some metaprogramming techniques that require explicit splicing syntax or preprocessing steps.

· Compile-time checking —Requiring strongly typed access means you want compile-time checking to make sure you’re accessing data sources in a meaningful way. Moreover, you want to integrate with modern tooling, using code completion, tooltips, and so forth in an IDE to make you more productive. Doing so rules out techniques like metaobject protocols in dynamic languages.

· Large-scale data sources —You want something that scales to large data sources containing thousands or millions of logical types. For instance, exposing data in the Freebase online database (www.freebase.com) to F# programmers in a strongly typed way can’t be done using naive code generation.

· Live and evolving data sources —You want to enable support for invalidation (for example, if the provider recognizes that the schema has changed on the server) while you’re using your IDE. If you use a technique that involves a single compilation step, then there’s no way to interact with a language service.

Type providers satisfy all of these criteria, whereas other techniques generally fail to satisfy some of them. For example, code generation can’t scale to millions of logical types, and using dynamic languages doesn’t provide the safety of a strongly typed system.

Designing and implementing the CSV type provider

Getting back to the main thread of this chapter, you want to implement a type provider that will give you type-safe, scalable access to the data available from Yahoo! Finance. The service returns data in the CSV format, so we’ll start by looking at the part of the type provider that handles CSV files. As outlined in the following sidebar, the example combines two approaches to writing type providers.

General and specialized type providers

F# type providers for accessing data sources can be classified into two categories. General-purpose type providers provide access to data in a specific format or data store but work for any instance of the data. This includes SQL databases (where the type provider can directly use database schema) and XML, JSON, and CSV type providers (where the schema is inferred from a sample file). In all these cases, the provider can be used with any concrete database or data file.

Specialized type providers such as the WorldBank and Freebase providers are specialized to one particular kind of data. They provide type-safe access to properties from the data store (like countries, chemical elements, and so on), but they’re bound directly to one data store.

The Yahoo! Finance type provider developed in this chapter is essentially an instance of the second class—it provides data access to a particular data store. But let’s start with the CSV parsing part, which is an example of the first kind.

Design strategy

In my experience, the best way to design a type provider is to iterate through the following simple process:

1. Write some sample code you’d expect the user of your type provider to be able to write. In the CSV provider case, this would be the sample in listing 1.

2. Write the corresponding F# code that you’d expect the type provider to translate this to. For the CSV provider, this would use an F# runtime type that provides access to columns. For example, if the rows are represented as tuples, the translation for stockRow.Date would be stockRow.Item1.

3. Determine how to automate the translation from one to the other. At this stage, you’ll sometimes discover an aspect of your proposed design that makes this hard or impossible. If that happens, back up and consider alternative designs for steps 1 and 2.

4. Implement the translation using the type provider API. The raw API (defined by the ITypeProvider interface) is difficult to master. I strongly recommend building on top of the ProvidedTypes API defined by the F# 3.0 Sample Pack, which exposes a much more straightforward interface.

This approach works particularly well for a few reasons. First, because type providers involve running code at design time that constructs expressions that will be invoked at runtime, it can be tricky to mentally track what’s being computed at any given time. Explicitly writing down what the user will write and what the type provider should generate in response helps to keep these separate phases straight and ensures that the logic for the type provider is possible to implement.

Tip

The ProvidedTypes API is formed by two files that you can include in your type-provider project. The easiest way to get it is to add the FSharp.TypeProviders.StarterPack NuGet package to your project. The code from the package is used by F# Data and other F# community libraries (http://github.com/fsprojects), which are both good examples of how to use it.

To show how this works, consider again the code in listing 1. Listing 2 shows roughly how this code is translated by the CSV type provider.[3] If you want to see the code that a type provider generates, you can quote the code that uses the provider with the F# quotation operator (<@ @>) and inspect the result. Doing so can be extremely valuable for understanding how existing type providers work as well as for debugging a type provider that you’re in the midst of writing.

3 I’ve taken a few liberties with the translation. For example, the actual generated code contains additional information to ensure that string parsing happens in a globalization-friendly way.

Listing 2. Code emitted by the type provider

First, note that the provided type CsvFile<"stocks.csv"> has been replaced by the real underlying type Runtime.CsvFile<'T>, which is a type that’s included in the type provider’s assembly. The runtime type is parameterized by the type of rows. When using the type provider, this is a type with named properties, but at runtime, it’s erased to a tuple type.

The call to the Parse method has been replaced by a call to Create, which takes several arguments. These arguments specify the CSV file’s location and the function used to convert the string array read out of each row into a strongly typed representation of the row. In this case, you’re using a tuple type to represent the type of data in each row, with one field per column of the CSV file. Based on the contents of the file, the type provider has ensured that the conversion function converts each column’s data into the right type based on the data that’s in that column (that is, DateTime for the first column, double for the second column, and so on). The actual F# Data provider also takes delimiters, quotation characters, and culture information, but I omitted them here for simplicity.

Next, note that each property access on the row data (specifically the Date, Low, and High columns in listing 1) has been converted into a call to the tuple’s corresponding item property[4] (the Item1, Item4, and Item3 properties, respectively).

4 Although the ItemN properties on tuples are public, they’re normally hidden from F# user code, so the literal translation isn’t valid F# code.
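The erasure described above can be sketched without any type provider at all. The snippet below is only an illustration of what the erased code works with: the ErasedRow alias and the sample values are assumptions, and pattern matching stands in for the hidden ItemN properties.

```fsharp
open System

// Erased representation of a row: one tuple field per CSV column
// (Date, Open, High, Low, Close, Volume, Adj Close).
type ErasedRow = DateTime * float * float * float * float * int * float

let rows : ErasedRow list =
    [ (DateTime(2011, 12, 30), 26.00, 26.12, 25.91, 25.96, 27395700, 25.04)
      (DateTime(2011, 12, 29), 25.95, 26.05, 25.86, 26.02, 22616900, 25.10) ]

// What the user writes as stockRow.Date, stockRow.High, and stockRow.Low
// corresponds to reading the first, third, and fourth tuple items.
let decemberHighLow =
    [ for (date, _, high, low, _, _, _) in rows do
        if date.Month = 12 then
            yield date, high, low ]
```

The named properties exist only at design time and compile time; at runtime there is nothing but the tuple.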

Inferring column types

The CSV type provider uses the names from the header row as the names of the properties, but it also infers the type of values in the columns. For example, in listing 1 you were able to directly access the month by writing row.Date.Month.

In this section, you’ll implement a simple type-inference algorithm that works similarly to the one used in F# Data. You probably shouldn’t expect to be able to infer the type from the first row: it’s possible that you’ll see a value like 5 (which you might naively take to be an integer value) even in a column containing floating-point values. This means you’ll need a way to infer the type of a single value and a way to combine types: given two types, you want to generalize and produce a single type that can represent all values representable by the two types.

Additionally, it would be nice to infer whether or not data is required (based on whether there are any missing values in the rows you look at). This would lead to the same issue when trying to infer from only one row: just because you have a value present in the first row doesn’t mean a value is required. Although this is an important aspect of the real CSV type provider, we’ll ignore this aspect to make the chapter simpler.

Representing and Inferring Types

First, you define a simple discriminated union that defines the types that your type provider supports. As mentioned earlier, this ignores missing values, and you only consider a few primitive types:

Using a discriminated union means you’ll be able to easily implement the generalization function later on. You have a case for strings and dates. The case for numeric types has a parameter specifying whether the numeric value can be decimal (represented as float) or integer (int).

Finally, you define a case called TopType, which you’ll use as a sentinel to indicate an unknown type of value to start with. Whenever you generalize between TopType and any other type, the other type will always be chosen (so you’ll only end up with TopType as the inferred type for a column when the data has no rows, in which case you’ll default to a string).
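The discriminated union described in the preceding paragraphs might look like the following sketch. The case names TopType, StringType, DateType, and NumericType are the ones used later in this chapter; the type name InferredType also appears later, while the toClrType helper is an assumption added here to show the "default to a string" rule.

```fsharp
open System

type InferredType =
    | TopType                 // sentinel: nothing is known about the column yet
    | StringType
    | DateType
    | NumericType of bool     // true = may be decimal (float), false = integer (int)

// Hypothetical helper: the .NET type used to represent each inferred type.
// A column that is still TopType (no rows seen) falls back to string.
let toClrType ty =
    match ty with
    | TopType | StringType -> typeof<string>
    | DateType -> typeof<DateTime>
    | NumericType true -> typeof<float>
    | NumericType false -> typeof<int>
```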

Next, let’s define the function that will infer a type from a particular string value. The actual code in the F# Data type providers is more complicated, because it needs to handle globalization using CultureInfo, but the idea is the same:

open System

let inferStringType (str : string) =
    // Try the most specific type first and fall back to more general ones.
    if fst (Int32.TryParse str) then NumericType(false)
    elif fst (Double.TryParse str) then NumericType(true)
    elif fst (DateTime.TryParse str) then DateType
    else StringType

In the inferStringType function, you repeatedly try to parse the string, starting with the most specific type (integer) and working your way toward more general types (floating-point number, and so on), stopping whenever you find a type for which you can parse the value. The code uses the fact that methods taking out arguments can be treated as methods returning a tuple of a Boolean success value and the parsed value. You get the Boolean (indicating success) using the fst function, ignoring the parsed value.
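The treatment of out arguments mentioned above can be seen in isolation in this small sketch; the input strings are arbitrary examples.

```fsharp
open System

// Int32.TryParse has an out parameter in .NET; F# lets you omit it and
// receive a (success, value) tuple instead.
let ok, value = Int32.TryParse("42")

// When only the success flag matters, take the first element of the tuple.
let failed = fst (Int32.TryParse("forty-two"))
```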

Generalizing types

The most interesting part of the type-inference algorithm is the function that implements generalization. Given two types, you want to find a type that’s suitable for representing values of both types. Compare the implementation in the following listing with the diagram showing the relationships between types in figure 1.

Figure 1. The subtyping relationship between inferred types. An arrow pointing from one type to another (for example, from Date to String) means a value of the first type can also be treated as a value of the other type. Any value can be treated as String, so you can always generalize any two types.

Listing 3. Finding the generalized type

Just as you inferred the most specific applicable type when parsing individual values from strings, you likewise try to generalize two types to the most specific type that can hold both of their values. The implementation uses the following rules:

1. If the types are the same, then you’re finished—the generalized type is the same, too.

2. If one of the types is TopType, then you use the other type, which must be more informative.

3. If either of the types is StringType, then you use that as the result, because anything can be read as a string.

4. If you get DateType on one side and some other type on the other (which isn’t a date, because that would be handled by rule 1), you return StringType, because there’s no other common type.

5. Finally, if you have two numerical types, then you return the “wider” type. You only support floating-point numbers and integers, so you treat floating-points as wider than integers (because an integer can be parsed as a float).

The rules respect the relationship between types demonstrated in figure 1. For any two types, you always return their common supertype—that is, a type to which there’s an arrow from both of the source types.

As a concrete example of how this inference process works, assume the values in a column are 5, 6.5, and 1 million. Before reading any data, you start with TopType as your initial guess at the type because you have no information to go on. After reading the first value, 5, you refine your inferred type to NumericType(false) because that’s the most specific type for which you can parse this value. Then you read the floating-point value and infer the type NumericType(true). When you generalize the combination of NumericType(false) and NumericType(true), you get NumericType(true). Finally, you read 1 million and infer type StringType, because the value can’t be automatically treated as numerical. You generalize NumericType(true) and StringType to get the final inferred type for the column: StringType.
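The five rules translate almost directly into a single pattern match. The following is a sketch written from the rules as stated, not a copy of listing 3; the function name generalizeTypes is the one used later in the chapter, and the list of inferred types corresponds to the values 5, 6.5, and 1 million from the example above.

```fsharp
type InferredType =
    | TopType
    | StringType
    | DateType
    | NumericType of bool   // true = float, false = int

let generalizeTypes ty1 ty2 =
    match ty1, ty2 with
    | a, b when a = b -> a                                      // rule 1: same type
    | TopType, other | other, TopType -> other                  // rule 2: TopType carries no information
    | StringType, _ | _, StringType -> StringType               // rule 3: anything can be read as a string
    | DateType, _ | _, DateType -> StringType                   // rule 4: a date and a number share only string
    | NumericType f1, NumericType f2 -> NumericType (f1 || f2)  // rule 5: the wider numeric type wins

// Types inferred for the values 5, 6.5, and "1 million", folded from TopType.
let columnType =
    [ NumericType false; NumericType true; StringType ]
    |> List.fold generalizeTypes TopType
```

Starting the fold from TopType means that a column with no rows stays TopType, which matches the sentinel role described earlier.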

Exercise

The type inference implemented here is basic. You can extend it in a number of interesting ways. First, try extending the hierarchy of the numerical types. When a value doesn’t fit into the Int32 type, it could be a large integer that can be represented as an Int64 value. Also, if a value isn’t an integer but is small enough, you can fit it in a more precise Decimal type. To do this, you’ll need to change NumericType to carry another flag (or perhaps a number) as an argument. You can also add support for Booleans.

Your second task is more difficult: currently, a missing value will be treated as an empty string (""), so all columns with missing values will become strings. Change the definition of InferredType to carry an optional flag (this isn’t needed for strings), and update the inference and generalization functions accordingly.

Let’s take a step back and look at what you’ve implemented so far. You wrote an inference algorithm that looks at a sequence of string values and infers the most appropriate type. Even though it’s only a few lines, it’s an amazing achievement! The key part of the implementation was handled by a single pattern-matching expression—this is another example of how F# lets you implement fairly complex logic in a few lines of code.

The next step toward building a type provider is implementing the runtime and writing code that generates types using the runtime types.

Implementing the runtime and type provider

The type provider implemented in this chapter works by generating provided types that are erased to actual runtime types. For example, the Stocks type declared in listing 1 was erased to runtime type CsvFile<'T> as shown in listing 2. In this section, you first implement the target runtime type and then look at the type-provider component that generates the erased type.

Implementing the CSV provider runtime

At runtime, you need to have some representation for your CSV data. The CSV format supports quoting (for strings containing commas and newlines) and escaping, so writing a proper parser isn’t easy. Fortunately, you can reuse the parser that’s already available in F# Data to make your life easier. The functionality you’ll need at runtime is simple. In particular, you’ll need to do the following:

1. Read the contents of the specified file, either from a local file or from a URL.

2. Split the rows at the delimiters (ignoring delimiters that occur inside a quoted string). You use the CSV parser from F# Data for this.

3. Convert the column data to the row type. You do this by iterating over the rows and invoking a specified function on each row. The function is dynamically generated by the type provider based on the inferred type.

To encapsulate this functionality, you’ll use a generic type, with the generic parameter indicating the type of the data for each row. Your type provider will always use a tuple type as the generic argument. Listing 4 contains the definition for the type used for CSV runtime support. The type uses the nongeneric CsvFile type from the F# Data library (which you can reference using the FSharp.Data package from NuGet).

Listing 4. CSV runtime support

The type would be more complicated if you had to reimplement CSV parsing, but it would still be simple. Why do you need to implement it? Because you now want to write code that generates a type that’s erased to CsvFile<'T> as its runtime representation, so you need a specific structure of the type. In particular, the constructor takes the row parser and file to load, and the Rows property returns the rows as a sequence of parsed values of the 'T type.
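To show the shape of such a runtime type, here is a simplified, self-contained sketch. It is not listing 4: instead of the F# Data parser it splits lines naively on commas, it takes the lines directly rather than a file location, and the CsvRow record is a stand-in assumed for illustration.

```fsharp
open System
open System.Globalization

// Stand-in for the parser's row type: a row is just its column strings.
type CsvRow = { Columns : string[] }

// Generic runtime type: 'T is the (tuple) type each row is converted to.
type CsvFile<'T>(parser : CsvRow -> 'T, lines : seq<string>) =
    let rows =
        lines
        |> Seq.skip 1   // skip the header row
        |> Seq.map (fun line ->
            let row = { Columns = line.Split(',') }   // naive split, no quoting support
            parser row)
        |> Seq.cache
    member this.Rows = rows

let inv = CultureInfo.InvariantCulture
let sample = [ "Date,Open,High"; "2011-12-30,26.00,26.12" ]

// The conversion function is what the type provider would generate.
let file =
    CsvFile<DateTime * float>(
        (fun (row : CsvRow) ->
            DateTime.Parse(row.Columns.[0], inv), Double.Parse(row.Columns.[2], inv)),
        sample)
```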

Building code quotations

The final piece of the puzzle for the type provider’s implementation is to generate a type representing the specific CSV file structure (like Stocks) and the type representing its rows. The only difficult part is creating the conversion function, which maps the columns passed as strings into a tuple type that’s used to represent the row at runtime.

The first thing you need to do is to map column types to parsing functions for individual columns. The function in listing 5 takes InferredType and returns its .NET representation together with a function that builds a code quotation representing the parsing of another quotation provided as an argument. This part is tricky, so look at the code first and then I’ll explain what’s going on.

Listing 5. Building quotations for parsing types

The parsing functions are returned as the second element of the tuple. They take expressions representing the string to parse and return other expressions representing the code that parses the specified parameter. You build an untyped quotation using <@@ ... @@>; this represents some F# code that can be compiled. The parameter e is another quoted expression, and you splice it into the entire expression using %%e. The parameter will be provided when you want to build the parsing function. For example, given row.Columns.[0], the expression for parsing dates will be DateTime.Parse(row.Columns.[0]).
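A function with the shape described above could be sketched as follows. The name getTypeAndParser is the one the chapter uses; the body is an assumption reconstructed from the description, and the explicit string annotation on the splice is added here so that overload resolution picks the string overloads of the Parse methods.

```fsharp
open System
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Linq.RuntimeHelpers

type InferredType =
    | TopType
    | StringType
    | DateType
    | NumericType of bool   // true = float, false = int

// Returns the .NET type of a column plus a function that wraps a quoted
// string expression in the code needed to parse it.
let getTypeAndParser ty : Type * (Expr -> Expr) =
    match ty with
    | NumericType false -> typeof<int>, (fun e -> <@@ Int32.Parse((%%e : string)) @@>)
    | NumericType true -> typeof<float>, (fun e -> <@@ Double.Parse((%%e : string)) @@>)
    | DateType -> typeof<DateTime>, (fun e -> <@@ DateTime.Parse((%%e : string)) @@>)
    | TopType | StringType -> typeof<string>, (fun e -> e)   // strings need no parsing

// Build the parsing code around a constant string and evaluate it.
let intTy, intParser = getTypeAndParser (NumericType false)
let parseCall = intParser (Expr.Value("42"))
let parsed = LeafExpressionConverter.EvaluateQuotation parseCall :?> int
```

In the type provider the argument is not a constant but a quotation that reads a column from the row, so the same parser function can be reused for every row.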

F# code quotations

F# code quotations are similar to LINQ expression trees. They represent code as a data structure that can be manipulated, processed, and compiled. In this chapter, you’re using them to generate blocks of code in the type provider that are then passed to the compiler and used in the produced code.

Another important use of quotations is translation of F# code to some other runtime. For example, queries can be translated to run in SQL databases (as in LINQ to SQL). But F# code can also be translated to run as JavaScript (using FunScript or WebSharper) or as GPU code using Alea.cuBase.

The function that generates types based on a sample CSV file is a bit more complicated, so let’s first explore its structure. It takes a CsvFile as a sample. This is the sample file that’s specified as the static parameter, and its headers and rows are used to generate the type for a concrete CSV file. The structure of the function is as follows:

Let’s start with the first part of the function. The code in the next listing reads individual fields and uses the NumberOfColumns property of the CsvFile file together with the Headers property to get the names of the headers.

Listing 6. Getting the names and types of fields

For each field, you get its name (using sample.Headers) and then infer the type based on the rows of the sample CSV file. To infer the type, you first call inferStringType to get the type of individual values and then combine the types using the generalizeTypes function, starting with the TopType. Finally, you return the index, name, and type together with a parser quotation. Here you’re running through the rows of data, splitting on delimiters, and using the generalizeTypes function to combine the inferred type for a given row with the inferred type based on all previous rows.

Next, you use F#’s reflection facilities to get the type of a tuple containing all of these fields and create a Row type that will erase to the tuple type, with a named property for each column.

Listing 7. Generating the provided Row type

The code starts by using F# reflection to build a tuple type that’s used to represent the rows at runtime in the erased code. The most interesting part is using ProvidedTypeDefinition to build a new type called Row. The second parameter specifies the runtime representation of the type, which is your tuple built earlier.

The rest of the code iterates over all the known columns and generates a new ProvidedProperty of the type obtained from the type inference. Before adding the property to the Row type, you set its getter. This is a function that takes a quotation representing the this instance and returns the code of the body. Here, the getter gets the ith field from the tuple.
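The reflection and quotation calls involved can be tried on their own, without the ProvidedTypes API. This sketch covers only the tuple type and the getter body; the column types and the makeGetter helper are assumptions, and the ProvidedTypeDefinition and ProvidedProperty parts of listing 7 are deliberately left out.

```fsharp
open System
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Reflection

// Column types as they might come out of type inference (assumed here).
let fieldTypes = [| typeof<DateTime>; typeof<float>; typeof<float> |]

// The runtime representation of a row: a tuple with one field per column.
let tupleTy = FSharpType.MakeTupleType fieldTypes

// Getter body for the i-th column: given the quoted 'this' instance,
// produce code that reads the i-th field of the tuple.
let makeGetter i = fun (this : Expr) -> Expr.TupleGet(this, i)

// Apply the getter to a variable standing in for the 'this' instance.
let rowVar = Var("row", tupleTy)
let secondColumn = makeGetter 1 (Expr.Var rowVar)
```

A real provider may additionally have to coerce the this expression to the tuple type before reading a field.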

Now you need to create a function that converts CsvRow to a typed tuple so that you can later pass it to the CsvFile<'T> runtime as the first argument. In principle, you need to generate code that looks something like this:

fun row ->
    (DateTime.Parse(row.Columns.[0]), Double.Parse(row.Columns.[1]))

The parts that you need to build now are the lambda function, the tuple construction, and the column accesses. The rest of the code, the parsing, is code you already have from the getTypeAndParser function in listing 5. The code generation is shown next.

Listing 8. Generating the parsing function

You start by creating a variable named row of type CsvRow that represents the input of the lambda function. Next you build a list with expressions representing the individual tuple arguments. Here, you use a quotation literal <@@ ... @@> with splicing using %% to build an expression that takes the variable you built earlier and accesses the current column. The quotation is then passed to parser, which generates the parsing code around it, turning String into the appropriate element type. Finally, you create an expression representing the body of the lambda function and the lambda function itself.
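The steps just described can be sketched in a self-contained form as follows. This is not listing 8: the CsvRow record is a stand-in for the parser's row type, and the two hard-coded parsers are assumptions standing in for the results of getTypeAndParser.

```fsharp
open System
open Microsoft.FSharp.Quotations

// Stand-in for the parser's row type.
type CsvRow = { Columns : string[] }

// Parsers for two columns, in the shape produced by getTypeAndParser.
let parsers : (Expr -> Expr) list =
    [ (fun e -> <@@ DateTime.Parse((%%e : string)) @@>)
      (fun e -> <@@ Double.Parse((%%e : string)) @@>) ]

// Variable representing the input of the lambda function.
let rowVar = Var("row", typeof<CsvRow>)
let rowExpr = Expr.Var rowVar

// One expression per tuple element: access column i, then wrap it in parsing code.
let tupleArgs =
    parsers
    |> List.mapi (fun i parser ->
        parser <@@ (%%rowExpr : CsvRow).Columns.[i] @@>)

// Body of the lambda (the tuple) and the lambda itself.
let body = Expr.NewTuple tupleArgs
let convert = Expr.Lambda(rowVar, body)
```

The resulting quotation has the function type CsvRow -> DateTime * float, which is what the runtime type expects as its row converter.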

Completing the CSV type provider

The last step in the implementation of the CSV type provider would be to write a new type that inherits from TypeProviderForNamespaces, annotate it with the TypeProvider attribute, and write code to generate the concrete type representing the entire file (such as Stocks in the earlier example).

Accomplishing this isn’t that much work, and you can find the complete example on the book’s website. But the main goal of this chapter is to build a type provider for Yahoo! Finance that uses the CSV provider functionality to provide access to financial data. So, let’s skip the final part of building the CSV provider and instead look at how to obtain information about companies. Then you’ll wrap the financial data into a type provider for Yahoo! Finance, which will rely on all the functionality you’ve implemented so far. The CSV data handling that you implemented in this section may be simple and doesn’t handle various corner cases, but it’s a great example of what needs to be done to implement the internals of a type provider.

Implementing the Yahoo! Finance type provider

General-purpose type providers, like the CSV type provider, are useful because they can be used with a wide variety of data sources. Indeed, our CSV type provider can work not only with local CSV files, but also with CSV files served via the web. And fortunately, there are providers such as Yahoo! Finance that make such data publicly available.

Unfortunately, the URLs for Yahoo! Finance’s data are a bit esoteric and take stock tickers as parameters. To get data for IT companies like Microsoft and Yahoo!, you need to know that their ticker names are MSFT and YHOO. Moreover, the URL can also encode other information. For example, to get daily stock prices for Yahoo!’s stock between January and March of 2013, you’d use a URL like

http://ichart.yahoo.com/table.csv?s=YHOO&a=0&b=1&c=2013&d=2&e=31&f=2013&g=d&ignore=.csv

Although it’s no problem to use these complicated URLs with the CSV type provider, the need to create them adds room for error and makes it hard to explore the data.
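To make the encoding concrete, here is a small sketch of a helper that assembles such a URL. The helper name historicalUrl is hypothetical, and the parameter meanings are assumptions read off the sample URL: a, b, and c are taken to be the start month (zero-based), day, and year; d, e, and f the end month (zero-based), day, and year; and g the sampling interval.

```fsharp
open System

// Hypothetical helper; parameter meanings are inferred from the sample URL
let historicalUrl (ticker: string) (fromDate: DateTime) (toDate: DateTime) (interval: char) =
    sprintf "http://ichart.yahoo.com/table.csv?s=%s&a=%d&b=%d&c=%d&d=%d&e=%d&f=%d&g=%c&ignore=.csv"
        ticker
        (fromDate.Month - 1) fromDate.Day fromDate.Year   // a, b, c: start month (zero-based), day, year
        (toDate.Month - 1) toDate.Day toDate.Year         // d, e, f: end month (zero-based), day, year
        interval                                          // g: 'd' in the sample URL

let url = historicalUrl "YHOO" (DateTime(2013, 1, 1)) (DateTime(2013, 3, 31)) 'd'
printfn "%s" url
```

Even with a helper like this, you still need to know the ticker symbol up front, which is the gap the specialized type provider fills.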

In cases like these, it’s often nice to create a type provider that makes accessing a specific data source more convenient. Toward that end, in this section you’ll create a type provider that makes it easy to navigate through a directory of companies and get their stock prices. For example, you’ll be able to write the following code:

open DeepDives

type tech = Yahoo.Technology
type goods = Yahoo.``Consumer Goods``

let companies =
  [ tech.``Technical & System Software``.``Adobe Systems Inc``
    tech.``Internet Information Providers``.``Google Inc.``
    goods.``Electronic Equipment``.``Apple Inc`` ]

If you haven’t done so already, have a look at the complete sample on the book’s website. When you use the type provider in an F# editor, you’ll get autocompletion as you navigate through the different sectors and companies. This makes the type provider a fantastic tool for exploratory data programming. The remainder of this section shows you how to build a specialized provider like this one.

Getting company information using YQL

To implement this type provider, you’ll build on top of a Yahoo! service called the Yahoo! Query Language (YQL), which you can use to access Yahoo! Finance service endpoints for enumerating sectors and industries. YQL enables you to encode queries to those services in the form of specially formatted URLs. When you issue web requests to those URLs, you get XML responses representing the query results. As an example, to get the results of the YQL query

"select * from yahoo.finance.sectors"

you issue a request to the corresponding URL:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.sectors&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys

The response you get looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
       yahoo:count="9" yahoo:created="2013-04-19T10:14:58Z">
  <results>
    <sector name="Basic Materials">
      <industry id="112" name="Agricultural Chemicals"/>
      <industry id="132" name="Aluminum"/>
      <industry id="110" name="Chemicals - Major Diversified"/>
      <!-- ... -->
    </sector>
    <sector name="Conglomerates">
      <!-- ... -->
    </sector>
  </results>
</query>

You could read the data using a standard .NET library for working with XML, but because your code already uses F# Data, you can use the XML type provider in the implementation of your higher-level Yahoo! Finance type provider. Before drilling into the implementation, let’s write a simple script that prints the relevant information:

You’re using the code in a new script file, so you need to start by referencing F# Data. Then you use the XML type provider to get a type for type-safe access to the YQL query results (based on the sample query discussed earlier). Note that the sample URL is printed as multiline here, but it needs to be on a single line without spaces.

The rest of the code is simple thanks to the fact that the XML type provider generates a nice type for reading the data. You use the GetSample method to read the XML document specified as the sample and iterate over all sectors and the industries they contain, and then you print the details.
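For comparison, the same traversal can be sketched with the standard System.Xml.Linq library instead of the XML type provider, which keeps the example runnable without extra packages. The XML here is an abbreviated, inline copy of the response shown earlier, and readSectors is a hypothetical helper name.

```fsharp
open System.Xml.Linq

// Abbreviated copy of the YQL response shown earlier
let sampleXml = """<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
       yahoo:count="9" yahoo:created="2013-04-19T10:14:58Z">
  <results>
    <sector name="Basic Materials">
      <industry id="112" name="Agricultural Chemicals"/>
      <industry id="132" name="Aluminum"/>
    </sector>
    <sector name="Conglomerates"/>
  </results>
</query>"""

let xn (name: string) = XName.Get name

// Returns each sector name with the (id, name) pairs of its industries
let readSectors (xml: string) =
    let doc = XDocument.Parse xml
    [ for sector in doc.Descendants(xn "sector") do
        let industries =
            [ for industry in sector.Elements(xn "industry") do
                let industryId = int (industry.Attribute(xn "id").Value)
                let industryName = industry.Attribute(xn "name").Value
                yield (industryId, industryName) ]
        yield (sector.Attribute(xn "name").Value, industries) ]

// Print every sector followed by its industries
for (sectorName, industries) in readSectors sampleXml do
    printfn "%s" sectorName
    for (industryId, industryName) in industries do
        printfn "  %s (%d)" industryName industryId
```

The untyped version has to spell out element and attribute names as strings and convert the ID by hand. The XML type provider removes exactly this kind of code.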

You can iterate through the industry nodes in the same way. When you do so, you extract the industry ID and issue another YQL query along the lines of "select * from yahoo.finance.industry where id="112"" (depending on the ID of the industry you’re drilling into). From there you can get the set of companies associated with that industry.
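A request URL for such a query can be assembled along these lines. This is a sketch under assumptions: yqlUrl and industryQuery are hypothetical helper names, and the endpoint and env value are copied from the sample URL shown earlier. Uri.EscapeDataString may percent-encode more characters than the hand-written URL does (for instance the asterisk), which yields an equivalent query string.

```fsharp
open System

// Hypothetical helper: wraps a YQL query in the public endpoint URL shown earlier
let yqlUrl (query: string) =
    sprintf "http://query.yahooapis.com/v1/public/yql?q=%s&env=%s"
        (Uri.EscapeDataString query)
        (Uri.EscapeDataString "store://datatables.org/alltableswithkeys")

// Query for the companies of one industry, given its ID
let industryQuery (industryId: int) =
    sprintf "select * from yahoo.finance.industry where id=\"%d\"" industryId

let industryUrl = yqlUrl (industryQuery 112)
printfn "%s" industryUrl
```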

Implementing the type provider

As the chapter nears its end, let’s look at the code needed to build a type provider! In this subsection, you start by writing a type provider, YahooProvider, that contains one nested type for each sector; each sector type in turn contains a nested type for each industry. The following listing implements the part of the type provider that lets you easily navigate through different sectors. You’ll add the companies and get the data in the next two subsections.

Listing 9. Type provider for navigating through industries

The code starts by wrapping the previous snippet in the YahooRuntime module, which contains a single lazy value that returns the list of sectors and their industries with the name and the ID. You use a lazy value to avoid downloading the list multiple times.
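The effect of the lazy value can be seen in a small sketch. The download is simulated here: downloadSectors is a hypothetical stand-in for the YQL request, and a counter records how often it runs.

```fsharp
// Counts how often the (simulated) download runs
let downloads = ref 0

// Hypothetical stand-in for the YQL request
let downloadSectors () =
    downloads.Value <- downloads.Value + 1
    [ ("Basic Materials", [ ("Agricultural Chemicals", 112); ("Aluminum", 132) ])
      ("Conglomerates", []) ]

// Nothing is downloaded until the value is first requested
let sectors = lazy (downloadSectors ())

let first = sectors.Value    // runs downloadSectors
let second = sectors.Value   // reuses the cached result
printfn "downloads: %d" downloads.Value
```

The counter stays at 1 no matter how many times sectors.Value is read, which is why the type provider can consult the list repeatedly without issuing further requests.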

The type provider is implemented as a type YahooProvider, which is marked with the TypeProvider attribute and inherits from TypeProviderForNamespaces. The type definition, followed by the code to get the current .NET assembly, is boilerplate code shared by most type-provider implementations.

Most of the interesting code is implemented in the generateType and generateIndustry functions. The former is the main function that generates the global type named Yahoo, and the latter is a simple function that builds a type representing each specific industry. In the global type, you build a new type for each sector and add it as a direct child of the global type. The industry types are added as children of the sector type, which gives you a nice nested structure that can be explored interactively by typing a dot (.) and looking at the child types.

So far, you’ve built a type provider for exploring the hierarchy of Yahoo! Finance sectors and industries. You can see how it looks in figure 2. The next step is to get the list of companies in the generateIndustry function.

Figure 2. The Yahoo! Finance type provider in action. So far, the type provider lets you browse different sectors (such as Technology). When you select a sector, you can then navigate through the industries. In Visual Studio or Xamarin Studio, you’ll see the different available sectors and industries in the autocompletion lists.

Generating company names lazily

If you use the type provider, you’ll notice that it takes some time before the autocompletion list appears initially. This is because the type provider first needs to run the YQL query to get the sectors and industries. Now imagine that you also had to run another YQL query to get the companies for each industry!

Fortunately, the F# type-provider mechanism is designed to scale to extremely large data sources. The mechanism has the ability to expose members of provided types lazily. Taking advantage of laziness is easy: instead of using the AddMember method on ProvidedTypeDefinition, as you did in the previous example, you call the AddMembersDelayed method. Listing 10 shows a new version of the generateIndustry function, together with one more XML type for reading YQL results (you’ll need to put this into the YahooRuntime module).

Listing 10. Lazily loading the industry contents

Listing 10 starts by defining a type, Companies, which is later used for parsing the results of a YQL query returning the list of companies in a given industry. Note that the URL in the first case needs to be on a single line, but we had to split it into multiple lines here. When getting the list of companies, you build the URL dynamically and include the ID of the current industry.

The part that generates the type first creates ProvidedTypeDefinition (as before), but now you use the AddMembersDelayed method to add an industry’s companies in a lazy fashion. The operation takes a lambda function as an argument and calls it when the members are needed (such as when the user looks at the members in the code editor). The function needs to return a list of members to be added. Here, you iterate over all the companies and add a static property for each company.

In this listing, all properties (representing the companies) are of type string, and their getter code always returns the "TODO" string. The final step is to replace this with a call to the CSV type provider runtime that you implemented earlier in this chapter.

Reusing the CSV provider

Once you get down to the individual company level, you’re ready to read the data from Yahoo!’s CSV files. This gives you an opportunity to reuse your existing CSV provider in the implementation of the Yahoo! Finance provider. Fortunately, because the structure of the CSV files is fully known, it’s much easier to generate the right quotation this time around. Listing 11 shows the new version of the generateIndustry and generateType functions. The former is now a bit more complex, because it needs to generate a body for each of the properties. The latter is similar to the code shown earlier. The only difference is that it fetches sample CSV data and generates a row type that’s shared by all the properties accessing specific company stock prices.

Listing 11. Using the CSV components in the Yahoo! Finance provider

In the generateType function, you call the generateTypesAndBuilder function from earlier in this chapter. This gives you two types: rowTy, which represents the generated type for rows (with named properties), and tupleTy, which is the underlying tuple type. You also get a quotation representing a function for converting from a CsvRow to the underlying tuple type. The rest of the function is the same as earlier, with the only difference being that all the information is passed to generateIndustry.

When generating types for an industry, you first build a type CsvFile<'T>, where 'T is the underlying tuple type, and you get the constructor of the type. The most interesting part is the code that builds body. This is an expression that reads the CSV file and returns the rows. The expression that you want to build looks like this:

(new CsvFile<DateTime * float * float>(
    (fun row ->
        DateTime.Parse(row.Columns.[0]),
        Double.Parse(row.Columns.[1]),
        Double.Parse(row.Columns.[2])),
    "http://ichart.yahoo.com/table.csv?s=MSFT")).Rows

The code to construct the quotation uses two methods from the Expr type. You call Expr.NewObject to build an expression that represents a constructor call with the function and the string value as arguments. Then you create an expression that returns the Rows property of the type using Expr.PropertyGet.
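The two calls can be sketched in isolation as follows. This is an illustration under assumptions, not listing 11: CsvRow and CsvFile<'T> are simplified stand-ins (the stand-in CsvFile downloads nothing and always returns an empty sequence), and buildRowsExpr is a hypothetical helper name.

```fsharp
open System
open Microsoft.FSharp.Quotations

// Stand-ins for the chapter's runtime types (assumptions for this sketch)
type CsvRow(columns: string[]) =
    member this.Columns = columns

type CsvFile<'T>(parse: CsvRow -> 'T, location: string) =
    member this.Parse = parse
    member this.Location = location
    // The real type would download and parse the file; omitted here
    member this.Rows : seq<'T> = Seq.empty

let buildRowsExpr (tupleTy: Type) (parser: Expr) (url: string) : Expr =
    // CsvFile<'T> instantiated with the underlying tuple type
    let csvDef = typedefof<CsvFile<_>>
    let csvTy = csvDef.MakeGenericType(tupleTy)
    let ctor = csvTy.GetConstructors().[0]
    let rowsProp = csvTy.GetProperty("Rows")
    // new CsvFile<...>(parser, url)
    let file = Expr.NewObject(ctor, [ parser; Expr.Value(url) ])
    // (...).Rows
    Expr.PropertyGet(file, rowsProp)

// Parsing function written as a quotation literal for brevity
let parser =
    <@@ fun (row: CsvRow) ->
            (DateTime.Parse(row.Columns.[0]),
             Double.Parse(row.Columns.[1]),
             Double.Parse(row.Columns.[2])) @@>

let body =
    buildRowsExpr typeof<DateTime * float * float> parser
        "http://ichart.yahoo.com/table.csv?s=MSFT"
printfn "%O" body.Type
```

Expr.NewObject checks that the argument expressions match the constructor's parameter types, so a parser quotation of the wrong function type fails when the expression is built, not later when the provided property is used.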

To conclude the chapter, let’s have a quick look at the completed type provider.

Yahoo! Finance provider in action

The Yahoo! Finance type provider is focused on providing easy access to one specific data source. This makes it a great fit for interactive scripting. The general-purpose CSV provider might be a better fit for writing applications and libraries.

The main strength of the Yahoo! Finance provider is that it lets you explore the data available to you through the type system. This makes it easy to write scripts that work with data of specific companies. For example:

In this example, you first define two type aliases for types representing Technology and Consumer Goods sectors. You then get a list of several companies from each of the sectors and find the latest opening price for each of the companies. The types help you in two ways. First, they greatly reduce the time to market for such interactive analyses. Imagine the code you’d have to write to issue the YQL queries by hand, parse the returned XML, and then read the CSV data!

Second, having the names available in the types also reduces the potential for bugs. Misspelling a stock ticker could easily give you code that returns data for a completely different company than you intended. Using a type provider, you use the full name, and the compiler makes sure the company exists.

I hope this chapter has convinced you that writing a type provider isn’t black magic but rather a fun problem! Of course, in many cases you can use the existing type providers for formats like JSON, XML, and CSV, but there are certainly some good uses for writing custom providers.

Building real-world type providers

To make this chapter reasonably short, I ignored a couple of issues that you may have to face when writing real-world type providers. The first issue is caching. When you use our sample provider, you’ll notice that it occasionally takes some time before you get the autocompletion list. To improve this, you should cache the YQL query results and use the cached values.

The second issue appears when you want to write type providers that work as portable libraries (for mobile apps). In that case, you need to separate the runtime component (which is portable) from the type-provider component (which isn’t portable and runs as a compiler plug-in). Both of these issues are solved, for example, in the F# Data type providers, so the project source code might give you a good hint as to how to solve them.

Summary

In this chapter, you learned about type providers and how to use them. You saw how type providers make accessing strongly typed data simple and safe. Although writing a type provider isn’t trivial, it’s easier than you might think. When you repeatedly need to access a specific data source, writing a custom type provider can significantly reduce the time you spend writing the data-access code.

You implemented two kinds of type providers in this chapter. A general CSV type provider lets you access arbitrary CSV files in a strongly typed fashion, regardless of whether the files are found locally or on the internet. This is a great way to handle common file formats, and it can be used in both scripting and compiled applications.

Then you implemented a type provider specific to Yahoo! Finance, to make it easier to navigate the vast set of financial data that’s available. This type provider is more suitable for scripting purposes when you’re interested in getting data for a specific company. It lets you do interesting financial analyses without leaving the F# editor.

Along the way, you learned how to design type providers. You saw how to build code quotations to express the code that a type provider emits, and you also learned how to take advantage of lazy loading to ensure that large spaces of provided types can be explored in an ad hoc manner without pulling back large quantities of data over the web.

About the author

Keith Battocchi is a developer in the Conversational Understanding team in Microsoft’s Applications and Services Group. Previously he spent two years working on the design and applications of type providers with the F# team at Microsoft Research.