Joe Celko's Complete Guide to NoSQL: What Every SQL Professional Needs to Know about NonRelational Databases (2014)

Chapter 7. Textbases

Abstract

Textbase is the current buzzword for document management systems, which deal with data kept as text rather than in the traditional structured, relational, or temporal data models. Text is the oldest form of data we use. Documents can be free text or semi-structured. The problem is that text can be treated as strings, which have only syntax: patterns of characters that can be mathematically defined and mechanically manipulated by relatively simple algorithms. Words, however, have semantics, and semantics requires human judgment or insanely complicated algorithms that are able to learn and make humanlike judgments. Most of the important business rules (laws, contracts, regulations, definitions, communications, etc.) are in text.

Keywords

CQL (Contextual Query Language); KWOC; microfilm; NISO (National Information Standards Organization); regular expressions; Unicode

Introduction

I coined the term textbase decades ago to define an evolution that was just beginning. The most important business data is not in databases or files; it is in text. It is in contracts, warranties, correspondence, manuals, and reference material. Traditionally, data processing (another old term) dealt with highly structured data on machine-usable media. Storage was expensive, and you did not waste it on text. Text by its nature is fuzzy and bulky; traditional data is encoded to be precise and compact.

Text and printed documents also have legal problems. Thanks to having been invented several thousand years ago, they carry requirements and traditions that are enforced by law and social mechanisms. As storage got cheaper and we got smarter, the printed word started to be automated. And not just the printed word: reading and understanding it is also becoming automated.

7.1 Classic Document Management Systems

Textbases began as document management systems. The early ancestors were microfilm and microfiche. The text was stored as a physical image with a machine searchable index. In 1938, University Microfilms International (UMI) was established by Eugene Power to distribute microfilm editions of current and past publications and academic dissertations. They dominated this field into the 1970s.

The big changes were legal rather than technological. First, fax copies of a document became legal, then microfilm copies of signed documents became legal, and, finally, electronically signed copies of documents became legal. The legal definition of an electronic signature can vary from a simple checkbox or a graphic image of a signature to an encryption protocol that can be verified by a third party. But the final result is that you no longer need warehouses full of paper to have legal documents.

7.1.1 Document Indexing and Storage

Roll microfilm had physical “blips” between frames so that a microfilm reader could count and locate frames. Microfiche uses a Hollerith card (yes, they actually say “Hollerith” in the literature!) with a window of photographic film in it. Punch card sorters and storage cabinets can handle them. These are the original magnetic tape and punch card logical models! This should not be a surprise; new technology mimics the old technology. You do not jump to a new mindset all at once.

Later, we got hardware that could physically move the microfilm or microfiche around for us. They whiz, spin, and hum with lots of mechanical parts. If you can find an old video of these machines, you will get the feeling of what “steam punk” would be like if this science fiction genre were set in the 1950s instead of the Victorian era. We would call it “electromechanical punk” and everyone would wear gray flannel suits with narrow neckties.

These are a version of the (key, value) model used by NoSQL, using a more physical technology. But there is a difference; in textbases, the final judgment is made by a human reading and understanding the document. For semi-structured documents, such as insurance policies, there are policy numbers and other traditional structured data elements. But there are also semi-structured data elements, such as medical exams. And there are completely free text elements, such as a note like “Investigate this guy for fraud! We think he killed his wife for the insurance!” or worse.

7.1.2 Keyword and Keyword in Context

How do you help a human understand the document? The first contact a human has with a document is the title. This sounds obvious and it is why journals and formal documents had dull but descriptive titles. Before the early 20th century, books also had a secondary title to help someone decide to buy the book. Juvenile fiction was particularly subject to this (e.g., The Boy Aviators in Nicaragua or In League with Insurgents), but so was adult fiction (e.g., Moby-Dick or The Whale) and scholarly books (e.g., An Investigation of the Laws of Thought on Which Are Founded the Mathematical Theories of Logic and Probabilities).

The next step in the evolutionary process was a list of keywords taken from a specialized vocabulary. This is popular in technical journals. The obvious problem is picking that list of words. In a technical field, just keeping up with the terminology is hard. Alternate spellings, acronyms, and replacement terms occur all the time. Talk to a medical records person about the names of diseases over time.

Since the real data is in the semantics, the next step in the evolution is the keyword in context (KWIC) index. It was the most common format for concordance lines in documents. A concordance is an alphabetical list of the principal words used in a document with their immediate contexts. Before computers, they were difficult, time consuming, and expensive to produce. This is why they were done only for religious texts or major works of literature. The first concordance, to the Vulgate Bible, was compiled by Hugh of St. Cher (d. 1262), who employed 500 monks to assist him. In 1448, Rabbi Mordecai Nathan completed a concordance to the Hebrew Bible; it took him ten years. Today, we use a text search program with much better response times.

The KWIC system and its relatives are based on a concept called keyword in titles, which was first proposed for Manchester libraries in 1864 by Andrea Crestadoro. A KWIC index divides each line vertically into two columns with a strong gutter, with the keywords on the right side of the gutter in alphabetical order. For the nontypesetters, a gutter is a vertical band of whitespace in a body of text.
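To make the mechanics concrete, here is a minimal sketch in Python that builds a KWIC-style listing from a handful of titles. It is illustrative only; the stop-word list and the fixed-width gutter are assumptions, not part of any standard.

def kwic(titles, width=30):
    # Assumed noise words; real systems use controlled vocabularies
    stop = {"the", "a", "an", "of", "in", "or", "and", "on"}
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in stop:
                continue
            left = " ".join(words[:i])     # text before the keyword
            right = " ".join(words[i:])    # keyword and what follows it
            entries.append((word.lower(), left, right))
    # Sort alphabetically by keyword and align the entries on a vertical gutter
    for _, left, right in sorted(entries):
        print(left.rjust(width) + " | " + right)

kwic(["The Complete Dinosaur", "An Investigation of the Laws of Thought"])

Each title appears once per keyword, and the keywords line up on the right side of the gutter, which is what makes the listing scannable.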

There are some variations on KWIC, such as KWOC (keyword out of context), KWAC (keyword augmented in context), and KEYTALPHA (key term alphabetical). The differences are in the display to the user. The good side of the keyword indexing methods is that they are fast to create once you have a keyword list; it is just as easy as indexing a traditional file or database. Many systems do not even require a controlled vocabulary; they do a controlled scan and build the list.

The bad news (and another step in the evolution) is that there are no semantics in the search. The concept of related subjects, either narrower or broader concepts and topics in some recognized hierarchical structure, does not exist outside the minds of the users.

7.1.3 Industry Standards

Serious document handling began with librarians, not with computer people. This is no surprise; books existed long before databases. The National Information Standards Organization (NISO), formerly the ANSI Z39 committee, was founded in 1939, long before we had computers. It is the umbrella for over 70 organizations in the fields of publishing, libraries, IT, and media. They have standards for many things, but the important ones for us deal with documents, not library bookshelves.

Contextual Query Language

NISO has defined a minimal text query language, the Contextual Query Language (CQL, originally known as the Common Query Language; http://zing.z3950.org/cql/intro.html). It assumes that there is a set of documents with a computerized interface that can be queried. Queries are formed with the usual three Boolean operators (AND, OR, and NOT) to find search words or phrases (strings are enclosed in double quotation marks) in a document. For example:

dinosaur NOT reptile

(bird OR dinosaur) AND (feathers OR scales)

"feathered dinosaur" AND (yixian OR jehol)

All the Boolean operators have the same precedence and associate from left to right. This is not what a programmer expects! It means you need to use a lot of extra parentheses. These two queries are equivalent:

dinosaur AND bird OR dinobird

(dinosaur AND bird) OR dinobird
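As a sanity check on that rule, here is a toy Python evaluator (a hypothetical helper, not part of any CQL implementation) that applies AND, OR, and NOT with equal precedence, strictly left to right, against the set of words in one document.

def eval_flat(tokens, doc_words):
    # tokens alternate: term, operator, term, operator, term, ...
    result = tokens[0] in doc_words
    for i in range(1, len(tokens), 2):
        op, hit = tokens[i], tokens[i + 1] in doc_words
        if op == "AND":
            result = result and hit
        elif op == "OR":
            result = result or hit
        elif op == "NOT":
            result = result and not hit
    return result

doc = {"bird"}
# CQL groups left to right: (bird OR dinosaur) AND feathers -> False
print(eval_flat(["bird", "OR", "dinosaur", "AND", "feathers"], doc))   # False
# A programmer expecting AND to bind tighter would read it as
# bird OR (dinosaur AND feathers) -> True; hence the extra parentheses.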

Proximity operators select candidates based on the positional relationship among words in the documents. Here is the BNF (Backus Normal Form or Backus–Naur Form, a formal notation for defining the grammars of programming languages) for the infix operator:

PROX/[<relation>]/[<distance>]/[<unit>]/[<ordering>]

However, any or all of the parameters may be omitted if the default values are required. Further, any trailing part of the operator consisting entirely of slashes (because the defaults are used) is omitted. This is easier to explain with the following examples.

foo PROX bar

Here, the words foo and bar are immediately adjacent to each other, in either order.

foo PROX///SENTENCE bar

The words foo and bar occur anywhere in the same sentence. This means that the document engine can detect sentences; this is on the edge of syntax and semantics. The default distance is zero when the unit is not a word.

foo PROX//3/ELEMENT bar

The words foo and bar must occur within three elements of each other; for example, if a record contains a list of authors, and author number 4 contains foo and author number 7 contains bar, then this search will find that record.

foo PROX/=/2/PARAGRAPH bar

Here, the words foo and bar must appear exactly two paragraphs apart—it is not good enough for them to appear in the same paragraph or in adjacent paragraphs. And we now have a paragraph as a search unit in the data.

foo PROX/>/4/WORD/ORDERED bar

This finds records in which the words appear, with foo first, followed more than four words later by bar, in that order. The other searches are not ordered.
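A rough Python sketch of that last case shows how little machinery an ordered word-distance test needs. The function name and the simple list scan are assumptions for illustration; real engines work from positional indexes rather than rescanning the text.

def prox_ordered(words, first, second, min_distance):
    # True if `second` occurs more than `min_distance` words after `first`
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(b - a > min_distance
               for a in first_positions
               for b in second_positions)

text = "foo one two three four five bar".split()
print(prox_ordered(text, "foo", "bar", 4))   # True: bar is six words after foo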

A document can have index fields (year, author, ISBN, title, subject, etc.) that can also be searched. Again, examples will give you an overview:

YEAR > 1998

TITLE ALL "complete dinosaur"

TITLE ANY "dinosaur bird reptile"

TITLE EXACT "the complete dinosaur"

The ALL option looks for all of the words in the string, without regard to their order. The ANY option looks for any of the words in the string. The EXACT option looks for all of the words in the string, in the order given. Exact searches are most useful on structured fields, such as ISBN codes, telephone numbers, and so forth. But we are getting into semantics with the modifiers. The usual slash modifier notation is used:

◆ STEM: The words in the search term are scanned and derived words from the same stem or root word are matched. For example, “walked,” “walking,” “walker,” and so forth would be reduced to the stem word “walk” in the search. Obviously this is language-dependent.

◆ RELEVANT: The words in the search term are, in an implementation-dependent sense, relevant to those in the records being searched. For example, the search subject ANY/RELEVANT "fish frog" would find records whose subject field included any of the words “shark,” “tuna,” “coelacanth,” “toad,” “amphibian,” and so forth.

◆ FUZZY: A catch-all modifier indicating that the server can apply an implementation-dependent form of “fuzzy matching” between the specified search term and its records. This may be useful for badly spelled search terms.

◆ PHONETIC: This tries to match the term not only against words that are spelled the same but also those that sound the same with implementation-dependent rules. For example, with SUBJECT =/PHONETIC, “rose” might match “rows” and “roes” (fish eggs).

The modifier EXACT/FUZZY appears strange, but is very useful for structured fields with errors. For example, for a telephone number that might have incorrect digits, the structure is the exact part; the digits are the fuzzy part:

telephone_nbr EXACT/FUZZY "404-845-7777"
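One plausible way to implement that split is sketched below in Python with the standard difflib module: require the digit/punctuation pattern (the exact part) to match and let the digits themselves (the fuzzy part) be merely close. The helper function and the cutoff value are assumptions; commercial engines use their own implementation-dependent rules.

import difflib

def fuzzy_phone_match(query, candidate, cutoff=0.8):
    # The structure (where digits and punctuation fall) is the exact part
    if [c.isdigit() for c in query] != [c.isdigit() for c in candidate]:
        return False
    # The digit values are the fuzzy part: accept near matches
    ratio = difflib.SequenceMatcher(None, query, candidate).ratio()
    return ratio >= cutoff

print(fuzzy_phone_match("404-845-7777", "404-845-7771"))   # True, one bad digit
print(fuzzy_phone_match("404-845-7777", "512-111-2222"))   # False, too different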

Commercial Services and Products

There have been many commercial services that provide access to data. The most heavily used are the LexisNexis and Westlaw legal research services. They have terabytes of online data sold by subscription. There are also other industry-specific database services, university-based document textbases, and so forth.

These services have a proprietary search language, but most of these languages are similar to the CQL with minor differences in syntax. They have the Boolean and proximity constructs, but often include support for their particular specialization, such as a legal or medical vocabulary.

Later came products that can create textbases from raw documents for a user. ZyIndex has been the leader in this area for decades. It was written in 1983 in Pascal as a full-text search program for files on IBM-compatible PCs running DOS. Over the years the company added optical character recognition (OCR), full-text search of email and attachments, XML support, and other features. Competition followed and the internals vary, but most of the search languages stuck to the CQL model.

These products have two phases. The first phase is to index the documents so they can be searched. Indexing can also mean that the original text can be compressed; this is important! The second phase is the search engine. Indexing either ignores “noise words” or uses the full text. The concept of noise words is linguistic; Chinese grammar distinguishes “full words” from “empty words.” The noise or empty words are structural things like conjunctions, prepositions, and articles, as opposed to the verbs and nouns that form the skeleton of a concept. Virtually every sentence will have a noise word in it, so looking for them is a waste. The trade-off is that a phrase like “to be or not to be” consists entirely of noise words and might not be found.

There are systems that scan every word when they index. The usual model is to associate each word with a list of data page numbers instead of a precise location within the file. Grabbing the document in data pages is easy because that is how disk storage works. It is cheap and simple to decompress and display the text once it is in main storage.
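A minimal Python sketch of this kind of index shows both ideas in miniature: noise words are skipped, and every other word is mapped to the pages (here, just line numbers) where it occurs. The noise-word list and the helper function are assumptions for illustration; ZyIndex and its competitors are far more elaborate.

NOISE = {"to", "be", "or", "not", "the", "a", "an", "of", "and", "in"}

def build_index(pages):
    index = {}
    for page_no, text in enumerate(pages, start=1):
        for word in text.lower().split():
            if word in NOISE:
                continue
            index.setdefault(word, set()).add(page_no)
    return index

pages = ["The feathered dinosaur", "Feathers or scales", "To be or not to be"]
print(build_index(pages))
# {'feathered': {1}, 'dinosaur': {1}, 'feathers': {2}, 'scales': {2}}
# Page 3 vanishes from the index: every word in it is a noise word.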

Regular Expressions

Regular expressions came from UNIX by way of the mathematician Stephen Cole Kleene. They are abstract pattern strings that describe sets of strings. ANSI/ISO standard SQL has a simple LIKE predicate and a more complex SIMILAR TO predicate.

A vertical bar separates alternatives: gray|grey can match “gray” or “grey” as a string. Parentheses are used to define the scope and precedence of the operators (among other uses). A quantifier after a token (e.g., a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark (?), asterisk (*) (derived from the Kleene star), and the plus sign (+) (Kleene cross). In English, they mean:

?

Zero or one of the preceding element: colou?r matches both “color” and “colour.”

*

Zero or more of the preceding element: ab*c matches “ac”, “abc”, “abbc”, “abbbc”, etc.

+

One or more of the preceding element: ab+c matches “abc”, “abbc”, “abbbc”, etc.

These constructions can be combined to form arbitrarily complex expressions. The actual syntax for regular expressions varies among tools. Regular expressions are pure syntax and have no semantics. The SQL people love this, but textbase people think in semantics, not syntax. This is why textbases do not use full regular expressions—a fundamental difference.
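Here is a quick check of those quantifiers, plus alternation, using Python's re module (one dialect among many, as just noted):

import re

print(bool(re.fullmatch(r"colou?r", "color")))    # True:  ? means zero or one
print(bool(re.fullmatch(r"colou?r", "colour")))   # True
print(bool(re.fullmatch(r"ab*c", "ac")))          # True:  * means zero or more
print(bool(re.fullmatch(r"ab+c", "ac")))          # False: + means one or more
print(bool(re.fullmatch(r"gray|grey", "grey")))   # True:  | separates alternatives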

The IEEE POSIX Basic Regular Expressions (BRE) standard (ISO/IEC 9945-2:1993) was designed mostly for backward compatibility with the traditional (Simple Regular Expression) syntax, but it provided a common standard that has since been adopted as the default syntax of many UNIX regular expression tools, though there are often variations or additional features. In the BRE syntax, most characters are treated as literals; they match only themselves (e.g., "a" matches “a”). The exceptions, listed in Table 7.1, are called metacharacters or metasequences.

Table 7.1

Exceptions to BRE (Basic Regular Expressions)

Metacharacter

Description

.

Matches any single character (many applications exclude new lines, and exactly which characters are considered new lines is flavor-, character encoding–, and platform-specific, but it is safe to assume that the line-feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, but [a.c] matches only “a”, “.”, or “c”.

[ ]

Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range that matches any lowercase letter from “a” to “z”. These forms can be mixed: [abcx-z] matches “a”, “b”, “c”, “x”, “y”, or “z”, as does [a-cx-z].

[^ ]

Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. Likewise, literal characters and ranges can be mixed.

^

Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

$

Matches the ending position of the string or the position just before a string-ending new line. In line-based tools, it matches the ending position of any line.

\n

Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.

*

Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.; [xyz]* matches the empty string, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on; (ab)* matches the empty string, “ab”, “abab”, “ababab”, and so on.

7.2 Text Mining and Understanding

So far, what we have discussed are simple tools that use a local textbase to do retrievals. While retrieval of text is important, this is not what we really want to do with text. We want meaning, not strings. This is why the first tools were used by lawyers and other professionals who have special knowledge, use a specialized vocabulary, and draw text from limited sources.

If you want to play with a large textbase, Google has an online tool (http://books.google.com/ngrams) that will present a simple graph of the occurrences of a search phrase in books from 1800 to 2009 in many languages. This lets researchers look for the spread of a term, when it fell in and out of fashion and so forth.

We had to get out of this limited scope and search everything—this led to Google, Yahoo, and other web search engines. The volume of text and the constant flow of it are the obvious problems. The more subtle problems are in the mixed nature of text, but even simple searches on large volumes of text are time consuming.

Google found a strong correlation between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for “flu” is actually sick, but a pattern emerges when all the flu-related search queries are added together. Some of the details are given in http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html.

7.2.1 Semantics versus Syntax

Strings are simply patterns. You do not care if the alphabet used to build strings is Latin, Greek, or symbols that you invented. The most important property of a string is linear ordering of the letters. Machines love linear ordering and they are good at parsing when given a set of syntax rules. We have a firm mathematical basis and lots of software for parsing. Life is good for strings.

Words are different—they have meaning. They form sentences; sentences form paragraphs; paragraphs form a document. This is semantics. This is reading for understanding. This was one of the goals for artificial intelligence (AI) in computer science. The computer geek joke has been that AI is everything in computer research that almost works, but not quite.

Programs that read and grade student essays are called robo-readers and they are controversial. On March 15, 2013, Les Perelman, the former director of writing at the Massachusetts Institute of Technology, presented a paper at the Conference on College Composition and Communication in Las Vegas, NV. It was about his critique of the 2012 paper by Mark D. Shermis, the dean of the college of education at the University of Akron. Shermis and co-author Ben Hamner, a data scientist, found automated essay scoring was capable of producing scores similar to human scores when grading SAT essays.

Perelman argued the methodology and the data in the Shermis–Hamner paper do not support their conclusions. The reason machine grading is important is that the vast majority of states are planning to introduce new high-stakes tests for K–12 students that require writing sections. Educational software is marketed now because there simply are not enough humans to do the work.

Essay grading software is produced by a number of major vendors, including McGraw-Hill and Pearson. There are two consortia of states preparing to introduce totally new high-stakes standardized exams in 2014 to match the common core curriculum, which has swept the nation. The two consortia—the Partnership for Assessment of Readiness for College and Careers, and Smarter Balanced Assessment Consortium—want to use machines to drive down costs.

Perelman thinks teachers will soon teach students to write to please robo-readers, which he argues disproportionately give students credit for length and loquacious wording, even when the words do not quite make sense. He notes, “The machine is rigged to try to get as close to the human scores as possible, but machines don’t understand meaning.” In 2010, when the common core exams were developed for its 24 member states, the group wanted to use machines to grade all of the writing. Today, this has changed: now 40% of the writing section, 40% of the written responses in the reading section, and 25% of the written responses in the math section will be scored by humans.

Many years ago, MAD magazine did a humor piece on high school teachers grading famous speeches and literature. Abraham Lincoln was told to replace “four score and seven years” with “87 years” for clarity; Ernest Hemingway needed to add more descriptions; Shakespeare was told that “to be or not to be” was contradictory and we needed a clear statement of this question. “The technology hasn’t moved ahead as fast as we thought,” said Jacqueline King, a director at Smarter Balanced. I am not sure that it can really move that far ahead.

7.2.2 Semantic Networks

A semantic network or semantic web is a graph model (see Chapter 3 on graph databases) of a language. Roughly speaking, the grammar is shown in the arcs of the graph and the words are the nodes. The nodes need a disambiguation process because, to quote the lyrics from “Stairway to Heaven” (1971) by Led Zeppelin, “cause you know sometimes words have two meanings,” and the software has to learn how to pick a meaning.

Word-sense disambiguation (WSD) is an open problem of natural language processing that has two variants: the lexical sample task and the all-words task. The lexical sample task uses a small set of preselected words. The all-words task has to disambiguate every word in a piece of running text; this is a more realistic form of evaluation, but it is more expensive to produce.

Consider the (written) word “bass”:

◆ A type of fish

◆ Low-frequency musical tones

If you heard the word spoken, there would be a difference in pronunciation, but these sentences give no such clue:

◆ I went bass fishing.

◆ The bass line of the song is too weak.

You need a context. If the first sentence appears in text, a parser can see that this might be an adjective–noun pattern. But if it appears in an article in Field & Stream magazine, you have more confidence that “bass” means a fish than if it were in DownBeat music reviews.

WordNet is a lexical database for the English language that puts English words into sets of synonyms called synsets. There are also short, general definitions and various semantic relations between these synonym sets. For example, the concept of “car” is encoded as {“car”, “auto”, “automobile”, “machine”, “motorcar”}. This is an open-source database that is widely used.
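If you want to poke at WordNet yourself, the NLTK library exposes it in a couple of lines. This assumes nltk is installed and the WordNet corpus has been fetched with nltk.download('wordnet').

from nltk.corpus import wordnet as wn

# The first synset for "car" should list the synonyms given above
print(wn.synsets("car")[0].lemma_names())

# Every sense of "bass" that WordNet knows, with its short definition
for syn in wn.synsets("bass"):
    print(syn.name(), "-", syn.definition())
# The listing mixes the fish senses and the musical senses of "bass";
# word-sense disambiguation is the job of picking the right one.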

7.3 Language Problem

Americans tend not to learn foreign languages. Culturally, we believe that everyone learns American English and this is largely true today. We inherited the old British Empire’s legacy and added U.S. technical and economic domination to it. Our technical domination gave us another advantage when computers came along. We had the basic Latin alphabet embedded in ASCII so it was easy to store English text on computers. Alphabets are linear and have a small character set; other languages are not so lucky. Chinese requires a huge character set; Korean and other languages require characters that are written in two dimensions.

7.3.1 Unicode and ISO Standards

The world is made up of more alphabets, syllabaries, and symbol systems than just Latin-1. We have Unicode today. It began as a 16-bit code and now has a repertoire of more than 110,000 characters covering 100 scripts, representing most of the world’s writing systems. The standard shows the printed version of each character and gives rules for normalization (how to assemble a character from accent marks and other overlays), decomposition, collation, and display order (right-to-left scripts versus left-to-right scripts).

Unicode is the de facto and de jure tool for internationalization and localization of computer software. It is part of XML, Java, and the Microsoft .NET framework. Before Unicode, the ISO 8859 standards defined 8-bit codes for most of the European languages, with either complete or partial coverage. The problem was that a textbase with arbitrary scripts mixed together could not be stored without serious programming. The first 256 Unicode code points are identical to ISO 8859-1, which lets us move existing Western text to Unicode and still store ISO standard encodings that are limited to this subset of characters.

Unicode provides a unique numeric code point for each character. It does not deal with display, so SQL programmers will like that abstraction. The presentation layer has to decide on size, shape, font, colors, or any other visual properties.
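Normalization is the part that bites programmers first. A short Python example with the standard unicodedata module shows why: the same printed character can be one code point or a base letter plus a combining mark, and the two compare as unequal until they are normalized.

import unicodedata

composed = "\u00e9"        # 'é' as a single precomposed code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True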

7.3.2 Machine Translation

Languages can have inflectional and agglutinative parts in their grammar. An inflectional language changes the forms of its words to show grammatical functions. In linguistics, declension is the inflection of nouns, pronouns, adjectives, and articles to indicate number (at least singular and plural, but Arabic also has a dual), case (nominative, subjective, genitive, possessive, instrumental, locative, etc.), and gender. Old English was a highly inflected language, but its declensions greatly simplified as it evolved into modern English.

An example of a Latin noun declension is given in Table 7.2, using the singular forms of the word homo.

Table 7.2

Latin Declension for Homo

Nominative: homo
Genitive: hominis
Dative: homini
Accusative: hominem
Ablative: homine
Vocative: homo

There are more declensions and Latin is not the worst such language. These languages are hard to parse and tend to be highly irregular. They often have totally different words for related concepts, such as plurals. English is notorious for its plurals (mouse versus mice, hero versus heroes, cherry versus cherries, etc.).

At the other extreme, an agglutinative language forms words by prefixing or suffixing morphemes to a root or stem. In contrast to Latin, the most agglutinative language is Esperanto. The advantage is that such languages are much easier to parse. For example:

“I see two female kittens”—Mi vidas du katinidojn

where kat/in/id/o/j/n means “female kittens in the accusative” and is made up of:

kat (cat = root word) || in (female gender affix) || id (“a child” affix) || o (noun affix) || j (plural affix) || n (accusative case affix)

The root and the affixes never change. This is why projects like Distributed Language Translation (DLT) use a version of Esperanto as an intermediate language. The textbase is translated into the intermediate language and then from the intermediate language into one of many target languages. Esperanto is so simple that the intermediate language can be translated on the target computer. DLT was used for scrolling video news text on cable television in Europe.
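Here is a toy Python illustration of why that regularity matters: a parser can simply peel known affixes off the end of a word until only the root remains. The affix table and the function are hypothetical; the real DLT project used full grammars, not a ten-line loop.

# Longer affixes are listed first so "in" is not mistaken for a bare "n"
AFFIXES = [("in", "female"), ("id", "child of"), ("o", "noun"),
           ("j", "plural"), ("n", "accusative")]

def strip_affixes(word):
    found = []
    while True:
        for affix, meaning in AFFIXES:
            if word.endswith(affix) and len(word) > len(affix):
                found.append(meaning)
                word = word[:-len(affix)]
                break
        else:
            # No affix matched: what is left is the root
            return word, list(reversed(found))

print(strip_affixes("katinidojn"))
# ('kat', ['female', 'child of', 'noun', 'plural', 'accusative'])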

Concluding Thoughts

We are doing a lot of things with text that we never thought would be possible. But the machines cannot make intuitive jumps; they are narrow in specialized worlds.

In 2011, IBM entered its Watson AI computer system on the television quiz show Jeopardy. The program had to answer questions posed in natural language. It was very impressive and won a million dollars in prize money. The machine was specifically developed to answer questions on Jeopardy; it was not a general-purpose tool. Question categories whose clues contained only a few words were problematic for it. In February 2013, Watson got its first commercial application at the Memorial Sloan–Kettering Cancer Center, in conjunction with the health insurance company WellPoint. It is now tuned to make utilization management decisions in lung cancer treatment.

With improved algorithms and computing power, we are getting to the point of reading and understanding text. That is a whole different game, but it is not human intelligence yet. These algorithms find existing relationships; they do not create new ones. My favorite example from medicine was a burn surgeon who looked at the expanded metal grid on his barbecue and realized that the same pattern of slits and stretching could be used with human skin to repair burns. Machines do not do that.

LexisNexis and Westlaw teach law students how to use their textbases for legal research. The lessons include some standard exercises based on famous and important legal decisions. Year after year, the law students overwhelmingly fail to do a correct search. They fetch documents they do not need and fail to find documents they should have found. Even smart people working in a special niche are not good at asking questions of textbases.

References

1. Perelman, L. C. Critique (Ver. 3.4) of Mark D. Shermis and Ben Hamner, “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” 2012. http://www.scoreright.org/NCME_2012_Paper3_29_12.pdf

2. Rivard, R. “Humans Fight over Robo-Readers.” Inside Higher Ed, March 15, 2013. http://www.insidehighered.com/news/2013/03/15/professors-odds-machine-graded-essays#ixzz2cnutkboG

3. Shermis, M. D., and Hamner, B. “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” The University of Akron, 2012. http://www.scoreright.org/NCME_2012_Paper3_29_12.pdf