
Chapter 4. MapReduce Model

Abstract

The MapReduce model was developed by Google and Yahoo for their internal use. Google invented the MapReduce technique and the distributed file system behind it; Yahoo turned those ideas into Hadoop and its Hadoop distributed file system (HDFS), and developed Pig Latin to handle its volume of data. These products became open source. Hadoop dominates the NoSQL market as part of the SMAQ stack, the NoSQL counterpart of the LAMP stack for websites. The process has two phases: mapping and reducing. The mapping phase gets the data in a parallelized fashion; the reduce phase filters and aggregates that data to produce a final result.

Keywords

ETL (extract transform load); Google; Hadoop; HDFS (Hadoop distributed file system); LAMP stack; MapReduce; Pig Latin; RAID storage systems; SMAQ stack; Yahoo

Introduction

This chapter discusses the MapReduce model of data processing developed by Google and Yahoo for their internal use. This is a data retrieval model rather than a query model.

The Internet as we know it today, or Web 2.0 if you prefer, really started with the LAMP stack of open-source software that anyone could use to put up a website: Linux (operating system), Apache (HTTP server), MySQL (database), and PHP, Perl, or Python for the application language. MySQL is now controlled by Oracle Corporation, and the distrustful open-source community is moving to its open-source fork, MariaDB.

There is a similar stack in the Big Data storage world called the SMAQ stack, for storage, MapReduce, and query, rather than particular products per se. Like the LAMP stack, the tools that implement the layers of the SMAQ stack are usually open source and run on commodity hardware. The operative word here is “commodity”: it means more shops can afford to move to Big Data models.

This leads to the obvious question as to what Big Data is. The best answer I have found is that you have Big Data when the size of the data involved in the project is a major concern for whatever reason. We are looking for projects driven by the data, not by computations or analysis of the data. The first web applications that hit this problem were web search engines. This makes sense; they were trying to keep up with the growth of the Web and not drop or leave out anything.

Today, there are other players on the Web with size problems. The obvious ones are social networks, multiplayer games and simulations, as well as large retailers and auction sites. But outside of the Web, mobile phones, sensors, and other constant data flows can create petabytes of data.

Google invented the basic MapReduce technique, but Yahoo actually turned it into the Hadoop storage tools. As of this writing, Hadoop-based systems hold the majority of these storage architectures. The query part can be done in Java, since Hadoop is written in Java, but there are higher-level query languages for these platforms (more on that later).

The MapReduce part is the heart of this model. Imagine a large open office with clerks sitting at their desks (commodity hardware), with piles of catalogs in front of them. Putting the catalogs on their desks is a batch process; it is not like an SQL transaction model with interactive insert, update, and delete actions on the database.

Keeping with the office clerks image, once a day (or whatever temporal unit), the mail clerks dump the day’s catalogs on the clerks’ desks. What the clerks do not see is that the mail room (the data sources) has to be cleaned up, filtered, and sorted a bit before the mail gets put in the mail cart and distributed. The ETL (extract, transform, load) tools from data warehousing work in Big Data, too, but the data sources are often not the clean, traditional structured ones that commercial data warehouses use. That is a whole topic in itself.

But assume we are ready for business. A boss at the front of the room shouts out the query: “Hey, find me a pair of red ballet flats!” to everyone, at the same time. Some of the clerks might realize that they do not have shoe catalogs in the pile on their desk and will ignore the request. The rest of the clerks will snap to attention and start looking through their catalogs. But what are they using as a match? A human being knows that we asked for a particular kind and color of women’s shoes. A human being will look at a picture and understand it. A computer has to be programmed to do this, and that might include a weighted match score and not a yes/no result. The smarter the algorithm, the longer it takes to run, and the more it costs in resources.

This is the mapping part. The query has to be parallelized. In this analogy, shouting out a query is enough, but the real world is not that simple. You have to have tasks that can be done independently of each other and yet be consolidated into a single result. Another mail clerk has to run down the rows of desks and pick up the hits from the clerks, since they finish at different rates. Some clerks will have no matches, and we can skip them. Some clerks will have an exact match to “red ballet flats” in their catalog; some clerks will have “ballet flats” or “red flats” as near-matches.

Now it is time for the reduce phase. The mail clerk gets the catalog clerks’ notes to the boss at the front of the room. But their handwriting is sloppy, so the mail clerk has to summarize and sort these notes. More algorithms, and a presentation layer now! Finally the boss has his or her answer and we are ready for another query.

Notice that this is more of a retrieval than what an SQL programmer would think of as a query. It is not elaborate like a relational division, rollup, cube, or other typical aggregation in SQL. This leads us to the storage used and, finally, to the query languages.

4.1 Hadoop Distributed File System

The standard storage mechanism used by Hadoop is the Hadoop distributed file system (HDFS). It is built from commodity hardware arranged to be fault tolerant. The nature of commodity hardware is that when we have a failure, the bad unit can be swapped out. This is the reason that RAID storage works. But we want extreme scalability, up to petabytes. This is more data than the usual RAID storage system handles.

The next assumption is that the data will be streamed rather than accessed randomly. The data is simply stuffed onto the disks, while RAID systems keep deliberate redundancy in the data that has to be managed by the RAID controller. HDFS is a write-once model that assumes data never changes after it is written. This model simplifies replication and speeds up data throughput. But it means that the front end has to do any validation and integrity checking before the data gets into the system.

RDBMS people hate this lack of data integrity. We want CHECK() constraints and referential integrity enforced by FOREIGN KEY constraints in the database. But HDFS is a file system, not a database. The Big Data model is that we might get data integrity eventually; in the meantime, we assume that we can live with some level of incorrect and missing data.

HDFS is portable across operating systems, but you will find that Linux is the most popular platform. This should be no surprise, since Linux was already so well established on the Web.

The huge data volumes make it much faster to move the program near the data, and HDFS has features to facilitate this. HDFS provides an interface similar to that of a regular file system. Unlike a database, however, HDFS can only store and retrieve data, not index it, so simple random access to the data is not possible.

4.2 Query Languages

While it is possible to use a native API to get to HDFS, developers prefer a higher-level interface. Higher-level languages are faster to code in, they document the process, and the code can be ported to another platform or compiler.

4.2.1 Pig Latin

Pig Latin, or simply Pig, was developed by Yahoo and is now part of the Hadoop project. It is aimed at developers who use a workflow or directed graph programming model. That model can be parallelized, but each path has to be executed in order.

The typical Pig program has a LOAD command at the front and a STORE command at the end. Another characteristic that is not quite like procedural programming is that assignments are permanent; you cannot change a name, although, unfortunately, you can reuse one without losing the prior object. Think of each statement as a station in an assembly line: it fetches a record from its source, passes it to the next station, and fetches the next record. The next station does its task with whatever data it has been handed.

For example, the fields in a record are referenced by position using a dollar sign and a number, starting with $0. Following is the example Pig program from the Wikipedia article on the language. It extracts words by pulling lines from text and filtering out the whitespace. The data is grouped by word, each group is counted, and the final counts go to a file:

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag

-- datatype, then flatten the bag to get one word on each row

words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- FILTER out any words that are just white spaces

filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a GROUP for each word

word_Groups = GROUP filtered_words BY word;

-- count the entries in each GROUP

word_count = FOREACH word_Groups GENERATE COUNT(filtered_words) AS count, GROUP AS word;

-- order the records BY count

ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

Developers can write user-defined functions (UDFs) in Java to get more expressive power. While SQL also has CREATE FUNCTION and can call external languages, SQL is expressive and powerful enough that good SQL programmers seldom do this.

The LOAD command is a cross between the SELECT command and the Data Declaration Language (DDL) of SQL. It not only fetches records, but has extra clauses that format the data into something the following statements can handle. The USING clause invokes a library procedure and uses it to read from the data source. The AS (<field pair list>) clause breaks the records into fields and assigns a data type to each of them; the elements of the field pair list are pairs of the form <field name>:<data type>, separated by commas. There are also options for structured data of various kinds. DUMP will show you the current contents of an object for debugging; it does not send it to persistent storage. Do not worry about it until you need to look at your code.
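Here is a minimal sketch of those clauses in action; the file name and the field list are inventions for illustration, and PigStorage() is the standard loader that takes a field delimiter:

Raw_Sales = LOAD '/data/sales_2013.csv' USING PigStorage(',')
    AS (store_id:int, sale_date:chararray, sale_amt:float);

DUMP Raw_Sales; -- a debugging peek at the relation; nothing goes to persistent storage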

The FILTER command is the Pig version of the SQL search condition. It uses a mix of SQL and C symbols, but with the same logical operators and semantics as SQL. For example:

Users_20 = FILTER Users BY age > 19 AND age < 30;

Predicates in the BY clause use the C equality operators == and != rather than the SQL <>. The rest of the theta operators are >, >=, <, and <=. These comparators can be used on any scalar data type, and == and != can also be applied to maps and tuples. To use them on two tuples, both tuples must have the same schema or both must have no schema. None of the equality operators can be applied to bags. Pig has the usual operator precedence and basic math functions, but it is not a computational language, so you do not get a large math function library as part of the language.

Strings are called chararrays (arrays of characters) after the C family model, and they use Java’s regular expression syntax and semantics. Since Pig is written in Java, this is no surprise, but it can throw off an SQL or other language programmer. SQL is based on POSIX regular expressions, which have lots of shorthands. Perl regular expressions will match a portion of a string, while Java’s must match the entire string. For example, if you are looking for all fields that contain the string “Celko” you must use '.*Celko.*' and not 'Celko', which is an exact match. Only SQL uses '%' and '_' for wildcards; everyone else uses '.' for a single character and '.*' for a varying-length match.
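As a small sketch of the difference, assume a hypothetical Authors relation with an author_name field:

Celko_Rows = FILTER Authors BY author_name MATCHES '.*Celko.*'; -- contains 'Celko'
Exact_Rows = FILTER Authors BY author_name MATCHES 'Celko'; -- the whole string must be 'Celko'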

The usual logical AND, OR, and NOT operators are here with the standard precedence. Pig has SQL’s three-valued logic and NULLs, so an UNKNOWN will be treated as FALSE in a FILTER. Pig will short-circuit logical operations when possible; a predicate is evaluated from left to right, and when the remaining terms in the FILTER cannot affect its value, evaluation stops.

Since Pig allows UDFs and is not a functional language, some code might never execute, and its expected side effects will never happen. As a silly example, consider a UDF that returns TRUE, but not before it has done something outside the application, like reformat all the hard drives:

Doomed = FILTER Foobar BY (1 == 2) AND Format_All_Drives_Function(x);

The first term, (1 == 2), is FALSE, so Format_All_Drives_Function(x) will never be invoked. But if we write it as

Doomed = FILTER Foobar BY Format_All_Drives_Function(x) AND (1 == 2);

then Format_All_Drives_Function(x) is invoked. The system will disappear if nothing stops this piece of insanity.

Side effects also prevent other optimizations that would be possible if we could be sure there were none. SQL/PSM gets around this by requiring that a procedure or function declare whether or not it is deterministic. A deterministic function returns the same output for the same inputs; think of a mathematical function like sin() or cos(). Now think of a FETCH statement that gets whatever the next record in a file happens to be.

There is a repository of Java UDFs for all kinds of things; the name was too cute to resist, so this repository is the Piggybank. The packages are organized by the type of function. The current top-level packages correspond to the function types and are listed below; a sketch of calling one of these functions follows the list.

org.apache.pig.piggybank.comparison—for a custom comparator used by the ORDER operator.

org.apache.pig.piggybank.evaluation—for evaluation functions like aggregates and column transformations.

org.apache.pig.piggybank.filtering—for functions used in the FILTER operator.

org.apache.pig.piggybank.grouping—for grouping functions.

org.apache.pig.piggybank.storage—for LOAD/STORE functions.
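As promised, here is a minimal sketch of pulling a function out of the Piggybank; the jar path, the UPPER function, and the Customers relation are all assumptions for illustration:

REGISTER '/usr/lib/pig/piggybank.jar'; -- the path depends on your installation
DEFINE ToUpper org.apache.pig.piggybank.evaluation.string.UPPER(); -- give the UDF a short alias
Shouting = FOREACH Customers GENERATE ToUpper(cust_name);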

The FOREACH statement applies an operation to every record in the data pipeline. Unix programmers who have written piped commands will be familiar with this model. In the example below, FOREACH reads the records of the relation named Alpha and outputs one record for each of them to send down the pipeline; the result is assigned to a new relation named Beta. For the RDBMS people, this is (sort of) how Pig implements the relational projection operator. For example, the following code loads an entire record, but then removes all but the user name and ID fields from each record:

Alpha = LOAD 'input' as (user_name:chararray, user_id:long, address:chararray, phone_nbr:chararray);

Beta = FOREACH Alpha GENERATE user_name, user_id;

But this is not quite projection in the relational sense. RDBMS is set-oriented, so the projection occurs all at once. Pig is a workflow model—we get a flow of generated tuples as output. Subtle, but important.

FOREACH has a lot of tools, the simplest of which are constants and field references. Field references can be by name (the SQL model) or by position (the record-oriented model). Positional references are preceded by a $ and start from 0. I strongly recommend against positional references, since positions do not document the process and will change if the data source changes. Referencing a nonexistent positional field in the tuple will return NULL. Referencing a field name that does not exist in the tuple will produce an error.

Use descriptive names and not positional numbers for safety and as the best modern programming practice. Today, we have text editors and can cut and paste lists and we do not use 80-column punch cards. For example:

Prices = LOAD 'NYSE_Daily_Ticker' as (exchange, symbol, date, open, high, low, close, volume, adj_close);

Gain = FOREACH Prices GENERATE close - open; -- simple math

Gain2 = FOREACH Prices GENERATE $6 - $3; -- the same math using positions

Gain and Gain2 will contain the same values. In addition to using names and positions, you can refer to all fields in a record using an *, as in the SQL convention; this produces a tuple that contains all the fields. Beginning in version 0.9, you can also refer to ranges of fields using two periods and the syntax [<start field>]..[<end field>]. The syntax expands the fields in the order declared. If there is no explicit starting field, the range begins with the first field; if there is no explicit ending field, it ends with the last field; otherwise, the range includes all fields between the start field and the end field, based on the declared field list.
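Reusing the Prices relation from the example above, a quick sketch of the star and range notations looks like this:

Leading = FOREACH Prices GENERATE ..open; -- exchange, symbol, date, open
Middle = FOREACH Prices GENERATE open..close; -- open, high, low, close
Trailing = FOREACH Prices GENERATE volume..; -- volume, adj_close
Whole = FOREACH Prices GENERATE *; -- every field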

The very useful question mark operator, with the syntax <predicate> ? <true value> : <false value>, will be familiar to older C programmers. It is the Pig version of the C conditional expression, the ancestor of SQL’s CASE expression! The predicate is tested; the expression returns the true value after the question mark if the predicate is TRUE and the false value after the colon if it is FALSE. That is how the C question mark works in the two-valued logic of its parent language, but Pig has a NULL, so quasi-SQL three-valued logic comes into play. Perhaps it is easier to see in the examples that follow; a usage sketch comes after them:

2 == 2 ? 1 : 4 -- returns 1

2 == 3 ? 1 : 4 --returns 4

NULL == 2 ? 1 : 4 -- returns NULL

2 == 2 ? 1 : 'Celko' -- type error, string vs integer
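In real code, the question mark operator usually lives inside a FOREACH. A minimal sketch, with a made-up ticker file and both branches returning chararrays so the data types agree:

Quotes = LOAD 'NYSE_Daily_Ticker' AS (symbol:chararray, open:float, close:float);
Flags = FOREACH Quotes GENERATE symbol, (close >= open ? 'gain' : 'loss') AS direction;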

Pig has bags, which are what SQL tables are really based on: collections of tuples that have no ordering and allow duplicates. But SQL—good SQL—will have a PRIMARY KEY that assures the bag is actually a real set.

The GROUP statement is not the GROUP BY clause used by SQL! The GROUP BY in SQL is a summarizing construct that returns a table. The Pig GROUP statement collects records with the same key together; the result is not a summary, but the intermediate step of building a collection of bags. You can then apply aggregate functions to those bags if you wish. For example:

Daily_Ticker = LOAD 'Daily_Stock_Prices' AS (stock_sym, stock_price);

Daily_Stock_Groups = GROUP Daily_Ticker BY stock_sym;

Ticker_Cnt = FOREACH Daily_Stock_Groups GENERATE GROUP, COUNT(Daily_Ticker);

This example groups records by the stock’s ticker symbol and then counts them. The records coming out of the GROUP statement have two fields: the key and the bag of collected records. The key field is named GROUP, and the bag is named for the alias that was grouped; so in the previous example it will be named Daily_Ticker and inherit the schema from Daily_Ticker. If the relation Daily_Ticker has no schema, then the bag Daily_Ticker will have no schema. For each record in the group, the entire record, including the key, is in the bag.

You can also use GROUP on multiple keys. The keys must be in parentheses, just like an SQL row constructor. But unlike SQL, we can use tuples as fields in the results; so we still have records with two fields, but the fields are more complicated than SQL’s scalar columns.

At this point, it is easier to show this with an example. Let’s make up two data sets, Alpha and Beta:

Alpha = LOAD 'Alpha' USING PigStorage();

Beta = LOAD 'Beta' USING PigStorage();

PigStorage is a standard library routine that will let us read in records from a standard source. Assume the data looks like this:

Alpha:

a A 1

b B 2

c C 3

a AA 11

a AAA 111

b BB 22

Beta:

x X a

y Y b

x XX b

z Z c

Now we can use some of the fancy commands that resemble their relational cousins. We have already discussed GENERATE, but here is what a DUMP of its result looks like. Pay attention to the parentheses and to the zero-based positions:

Alpha_0_2 = FOREACH Alpha GENERATE $0, $2;

(a, 1)

(b, 2)

(c, 3)

(a, 11)

(a, 111)

(b, 22)

The GROUP statement will display the grouping key first; then the bag of grouped tuples appears in curly brackets. Math majors will be delighted with this because the curly bracket is the standard notation for an enumerated set. Notice also that the fields are in their original order:

Alpha_Grp_0 = GROUP Alpha BY $0;

(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})

(b, {(b, B, 2), (b, BB, 22)})

(c, {(c, C, 3)})

When the grouping key is more than one field, the row constructor is in parentheses, but the curly brackets still enclose the list of grouped tuples:

Alpha_Grp_0_1 = GROUP Alpha BY ($0, $1);

((a, A), {(a, A, 1)})

((a, AA), {(a, AA, 11)})

((a, AAA), {(a, AAA, 111)})

((b, B), {(b, B, 2)})

((b, BB), {(b, BB, 22)})

((c, C), {(c, C, 3)})

Pig has three basic aggregate functions that look like their SQL cousins: SUM(), COUNT(), and AVG(). The rounding and presentation rules are not unexpected, but Pig does not have all of the fancy SQL operators that have been added to the ANSI/ISO standard SQL over the years. For example:

Alpha_Grp_0_Sum = FOREACH Alpha_Grp_0 GENERATE GROUP, SUM(Alpha.$2);

(a, 123.0)

(b, 24.0)

(c, 3.0)

Alpha_Grp_0_Cnt = FOREACH Alpha_Grp_0 GENERATE GROUP, COUNT(Alpha);

(a, 3)

(b, 2)

(c, 1)

Alpha_Grp_0_Avg = FOREACH Alpha_Grp_0 GENERATE GROUP, AVG(Alpha.$2);

(a, 41.0)

(b, 12.0)

(c, 3.0)

Now we get into the fancy stuff! FLATTEN will look familiar to LISP programmers; LISP has a common function of the same name. It takes the tuples out of the bag (the curly brackets) and puts them into a flat list of rows. This is why having the key in the tuples is important: you do not destroy information. For example:

Alpha_Grp_0_Flat = FOREACH Alpha_Grp_0 GENERATE FLATTEN(Alpha);

(a, A, 1)

(a, AA, 11)

(a, AAA, 111)

(b, B, 2)

(b, BB, 22)

(c, C, 3)

The COGROUP is a sort of join. You wind up with three or more fields. The first is the key value common to the tuples that follow, and each of the BY clauses tells you which field in its tuples is used as that key. NULLs are treated as equal, just as they are in SQL’s grouping operators. For example:

Alpha_Beta_Cogrp = COGROUP Alpha BY $0, Beta BY $2;

(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})

(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})

(c, {(c, C, 3)}, {(z, Z, c)})

Again, notice that this is a nested structure. The style in Pig is to build a chain of steps so that the engine can take advantage of parallelism in the workflow model. But that can often mean un-nesting these structures. Look at this example and study it:

Alpha_Beta_Cogrp_Flat = FOREACH Alpha_Beta_Cogrp GENERATE FLATTEN(Alpha.($0, $2)), FLATTEN(Beta.$1);

(a, 1, X)

(a, 11, X)

(a, 111, X)

(b, 2, Y)

(b, 22, Y)

(b, 2, XX)

(b, 22, XX)

(c, 3, Z)

JOIN is the classic relational equijoin, but where SQL’s natural join would drop one of the redundant join columns from the result table, Pig keeps both. This example has the join fields at the ends of the rows, so you can see them. Also notice how Alpha and Beta retain their identities, so the $ position notation does not apply to the result:

Alpha_Beta_Join = JOIN Alpha BY $0, Beta BY $2;

(a, A, 1, x, X, a)

(a, AA, 11, x, X, a)

(a, AAA, 111, x, X, a)

(b, B, 2, y, Y, b)

(b, BB, 22, y, Y, b)

(b, B, 2, x, XX, b)

(b, BB, 22, x, XX, b)

(c, C, 3, z, Z, c)

CROSS is the classic relational cross-join, or Cartesian product if you prefer classic set theory. This can be dangerous for SQL programmers. In SQL, the SELECT.. FROM.. statement is defined as a cross-join in the FROM clause followed by a projection in the SELECT clause. No SQL engine actually does it this way in the real world, but since Pig is a step-by-step language, you can do exactly that! Essentially, the Pig programmer has to be his or her own optimizer. For example:

Alpha_Beta_Cross = CROSS Alpha, Beta;

(a, AA, 11, z, Z, c)

(a, AA, 11, x, XX, b)

(a, AA, 11, y, Y, b)

(a, AA, 11, x, X, a)

(c, C, 3, z, Z, c)

(c, C, 3, x, XX, b)

(c, C, 3, y, Y, b)

(c, C, 3, x, X, a)

(b, BB, 22, z, Z, c)

(b, BB, 22, x, XX, b)

(b, BB, 22, y, Y, b)

(b, BB, 22, x, X, a)

(a, AAA, 111, x, XX, b)

(b, B, 2, x, XX, b)

(a, AAA, 111, z, Z, c)

(b, B, 2, z, Z, c)

(a, AAA, 111, y, Y, b)

(b, B, 2, y, Y, b)

(b, B, 2, x, X, a)

(a, AAA, 111, x, X, a)

(a, A, 1, z, Z, c)

(a, A, 1, x, XX, b)

(a, A, 1, y, Y, b)

(a, A, 1, x, X, a)

SPLIT was one of Dr. Codd’s original relational operators. It never caught on because it returns two tables and can be done with other relational operators (Maier, 1983, pp. 37–38). But Pig has a version of it that lets you split the data into several different “buckets” in one statement, as follows:

SPLIT Alpha INTO Alpha_Under IF $2 < 10, Alpha_Over IF $2 >= 10;

-- Alpha_Under:

(a, A, 1)

(b, B, 2)

(c, C, 3)

-- Alpha_Over:

(a, AA, 11)

(a, AAA, 111)

(b, BB, 22)

Did you notice that you could have rows that do not fall into a bucket?

There is a trade-off in this model. In SQL, the optimizer has statistics and uses that knowledge to create an execution plan. If the stats change, the execution plan can change. There is no way to collect statistics in Pig or any web-based environment. Once the program is written, you have to live with it.

This also means there is no way to distribute the workload evenly over the “reducers” in the system. If one of them has a huge workload, everyone has to wait until everyone is ready to pass data to the next step in the workflow. In fact, it might be impossible for one reducer to manage that much data.

Hadoop has a “combiner phase” that does not remove all data skew, but it places a bound on it. And since, in most jobs, the number of mappers will be at most in the tens of thousands, even if the reducers get a skewed number of records, the absolute number of records per reducer will be small enough that the reducers can handle them quickly.

Some calculations like SUM that can be decomposed into any number of steps are called distributive and they work with the combiner. Remember your high school algebra? This is the distributive property and we like it.

Calculations that can be decomposed into an initial step, any number of intermediate steps, and a final step are called algebraic. Distributive calculations are a special case of algebraic, where the initial, intermediate, and final steps are all the same. COUNT is an example of such a function, where the initial step is a count and the intermediate and final steps are sums (more counting) of the individual counts. The median is not algebraic. You must have all the records sorted by some field(s) before you can find the middle value.

The real work in Pig is building UDFs that use the combiner whenever possible, because of its skew-reducing features and because early aggregation greatly reduces the amount of data shipped over the network and written to disk, thus speeding up performance significantly. This is not easy, so I am not even going to try to cover it.

4.2.2 Hive and Other Tools

Just as Pig was a Yahoo project, Hive is an open-source Hadoop language from Facebook. It is closer to SQL than Pig and can be used for ad-hoc queries without being compiled like Pig. It is the representative product in a family that includes Cassandra and Hypertable. These tools use HDFS as a storage system, but put a table-based abstraction over HDFS, so it is easy to load structured data. Hive QL is the SQL-like query language that executes MapReduce jobs. Hive can also use Sqoop to import data from relational databases into Hadoop. Sqoop was developed by Cloudera for its Hadoop platform products and is database-agnostic, since it uses the Java JDBC database API. Tables can be imported either wholesale, or using queries to restrict the data import. Sqoop also offers the ability to reinject the results of MapReduce from HDFS back into a relational database. All of this means that Hive is used for analysis and not for online transaction processing (OLTP) or real-time queries.

You declare tables with columns as in SQL, using a simple set of data types: INT, FLOAT, DATE, STRING, and BOOLEAN. The real strength comes from also using simple data structures:

Structs: The elements within the type can be accessed using the dot (.) notation. For example, for a column c of type STRUCT {a INT; b INT} the a field is accessed by the expression c.a.

Maps (key–value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' → gid, the gid value can be accessed as M['group'].

Arrays (indexable one-dimensional lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b':

CREATE TABLE Foo
(something_string STRING,
 something_float FLOAT,
 my_array ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>);

SELECT something_string, something_float,
       my_array[0]['foobar'].p1
FROM Foo;

The tables default to text fields separated by a control-A character and records separated by a newline character. You can add more clauses to define the delimiters and file layout. For example, a simple CSV file can be defined by adding the following to the end of the CREATE TABLE statement; a complete sketch follows the clause:

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE
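Put together, a hypothetical CSV-backed table declaration would look something like this sketch:

CREATE TABLE Daily_Quotes
 (symbol STRING,
  quote_date STRING,
  close_price FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;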

Another important part of the declarations is that a table can be partitioned and bucketed over storage, and you manipulate it by those divisions. For example, there is a clause to draw random samples from the buckets; a usage sketch follows it:

TABLESAMPLE(BUCKET x OUT OF y)
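Here is a sketch of the clause inside a query, assuming a hypothetical Quote_Buckets table that was declared with CLUSTERED BY (symbol) INTO 32 BUCKETS:

SELECT symbol, close_price
FROM Quote_Buckets TABLESAMPLE(BUCKET 1 OUT OF 32 ON symbol);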

The SELECT can also use infixed joins, but it allows only equijoins. You have INNER JOIN, LEFT OUTER, RIGHT OUTER, and FULL OUTER syntax.

There is also a LEFT SEMI JOIN to check whether there is a reference in one table to another; this cannot be done declaratively in the SQL subset used. The compiler does not do the same optimizations you are used to in SQL, so put the largest table on the rightmost side of the join to get the best performance. There is also the usual UNION ALL, but none of the other set operators.
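A minimal join sketch with made-up tables; notice that the bigger table, Daily_Quotes, is written last:

SELECT D.symbol, D.dividend, Q.close_price
FROM Dividends D
JOIN Daily_Quotes Q
ON (D.symbol = Q.symbol);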

The usual SQL aggregate functions are included. The other built-in functions are a mix of SQL and C-family syntax, often with both options—that is, UPPER() and UCASE() are the same function with two names, and so forth.

Concluding Thoughts

If you can read an execution plan in the SQL you know, you should not have any trouble with the basic concepts in the MapReduce model of data. Your learning curve will come from having to use Java and other lower-level tools for the complicated tasks that an SQL compiler would have handled for you.

The lack of an SQL-style optimizer with statistics, transaction levels, and built-in data integrity will be a major jump. You will find that you have to do manually those things that used to be done for you.

Older programming languages have idioms and conventions, just like human languages. A COBOL programmer will solve a problem in a different way than an SQL programmer would approach the same problem. MapReduce is still developing its idioms and conventions. In fact, it is still trying to find a standard language to match the position that SQL has in RDBMS.

References

1. Maier D. The Theory of Relational Databases. Rockville, MD: Computer Science Press; 1983.

2. Capriolo E, Wampler D. Programming Hive. Cambridge, MA: O'Reilly Media; 2012.

3. Gates A. Programming Pig. Cambridge, MA: O'Reilly Media; 2012.