Introduction to SQL - Leveraging SQL for Business Intelligence - Microsoft Business Intelligence Tools for Excel Analysts (2014)

Microsoft Business Intelligence Tools for Excel Analysts (2014)

PART II: Leveraging SQL for Business Intelligence

Chapter 9: Introduction to SQL

In This Chapter

· Starting off with SQL basics

· Going further with advanced SQL concepts

Standard Query Language (SQL) is an industry-standard, fourth-generation programming language designed for working with databases. Most database vendors, including SQL Server, have adopted this language into their implementations, making it easier for data analysts to work on different platforms.

You'll use SQL on a frequent basis. A lot of times that occurs through a reporting interface that translates your requests into SQL. This makes it important for you to at least learn the basics of SQL. Understanding SQL allows you to go beyond the canned reports available to you and develop your own queries directly against the various databases that may exist in your organization.

SQL Basics

Learning SQL is easy after you understand the basic structure of the language. In this section we cover the SQL statements that allow you to retrieve data and then group and filter it.

The Select statement

At its core, SQL revolves around the Select statement. This statement enables you to retrieve records from a database table. The basic syntax of a Select statement is:

Select ‘Hello World’

To see the Select statement in action, open a Query window in Microsoft SQL Server Management Studio with the AdventureWorkdsDW database loaded and run the Hello World statement.

note.eps Use a single quote in SQL to indicate the beginning and end. Numeric values are entered without quotes in an SQL statement.

The statement returns one row in the result set with the words Hello World.

The From clause

The Select statement is most often used with the From clause. The From clause allows you to specify the tables and views you would like to retrieve data from. Here's how you use a From clause with a Select statement:

Select AccountType, Operator
From dbo.DimAccount

After you add a From clause in your Select statement, you can call out specific columns you want to return in the output and ask for all columns using the * wildcard:

Select *
From dbo.DimAccount

note.eps You can mix columns and constants in your Select statement to produce more sophisticated outputs. You can also give aliases to columns or constants by adding a name after each one. Here Hello World is repeated in each row in your output and the name of that column is MyAlias. The column Operator is also renamed to OperatorName.

· Select AccountType, Operator OperatorName, ‘Hello World’ MyAlias
From dbo.DimAccount

Joins basics

The From clause in your query can be extended and made more sophisticated with joins. Joins allow you to bring data from multiple related tables in your database and gives you fine control over how to relate those tables together. There are several types of joins in SQL; in this section we go over the three basic ones:

· Inner: This is the most often-used join type and, as shown in Figure 9-1, it returns the intersections of two tables or datasets.

9781118821527-fg0901.tif

Figure 9-1: Three basic joins.

· Left: A left join returns all the records in the table on the left of the join and only those related records from the right. You may also perform a right join if you want the opposite outcome. You can stick to a left join to keep things simpler. All you have to do is switch the order of the tables in the left join to achieve the same outcome as a right join.

· Full: Returns the union of two tables or datasets.

note.eps In the case of a left join or full join, you may get Null columns back for rows that do not have a relationship in the non-primary table or dataset.

When you define a join, you need to also specify the join criteria in your SQL statement. You accomplish this by using the On keyword. To simplify your SQL statement, use an Alias in the join.

Select da.AccountType, da.Operator, ff.Amount
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey

This query returns two columns from DimAccount and one from FactFinance. The alias given to each table can be specified before every column in your Select statement. (You can skip qualifying a column with an alias if that column name is unique in all the tables in your join.) As a best practice, you should always pre-qualify your columns with the proper alias and use a standard abbreviation to define those aliases.

The Where clause

Use a Where clause in the Select statement to filter the result set and conditionally return specific records. The Where clause is always used in combination with an operator, such as = (equal), <> (not equal), > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), Between (within general range).

This SQL statement returns only records with account types equal to Assets.

Select da.AccountType, da.Operator, ff.Amount
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey
Where da.AccountType=‘Assets’

This SQL statement returns only records with amounts greater than or equal to 10.

Select da.AccountType, da.Operator, ff.Amount
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey
Where ff.Amount >=10

Grouping

Grouping is a useful SQL feature for analytics. It allows you to aggregate your result set and perform advanced filtering. You perform grouping on your result set by using the Group By clause in your Select statement. The Group By clause is often used with anAggregation function and/or a Having clause. Here's a basic Group By clause:

Select da.AccountType, sum(ff.Amount) AmountTotal, avg(ff.Amount) AmountAverage
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey
Group By da.AccountType

This Select statement returns a list of values in AccountType along with the total amount for each value. This example uses the Sum (sum of records) and Avg (average of records) Aggregation functions. SQL includes several Aggregation functions that can be used with a Group By clause, including Count (count of records), Min (minimum value), and Max (maximum value).

You can set an advanced filter on a query using the Having clause:

Select da.AccountType, sum(ff.Amount) AmountTotal, avg(ff.Amount) AmountAverage
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey
Group By da.AccountType
Having sum(ff.Amount)>=10

This Select statement returns the AccountType records that have a sum of an amount greater than or equal to 10. The difference between this example and a filter applied using the Where clause is that the filter is applied after the aggregation.

The Order By clause

The Order By clause allows you to set the order of the returned records in your result set. The default sort order in SQL is ascending; you can change that to descending. Here's a basic Order By example:

Select da.AccountType, sum(ff.Amount) AmountTotal, avg(ff.Amount) AmountAverage
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey
Group By da.AccountType
Having sum(ff.Amount)>=10
Order By da.AccountType Asc, sum(ff.Amount) Desc

You can specify the Order By clause by referencing the specific column names or the column ordinal as it is listed in the Select clause:

Select da.AccountType, sum(ff.Amount) AmountTotal, avg(ff.Amount) AmountAverage
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey
Group By da.AccountType
Having sum(ff.Amount)>=10
Order By 1 Asc, 2 Desc

Selecting Distinct records

The Distinct keyword allows you to return the unique occurrences of the values in your record set without using a Group By clause. This example uses a Distinct keyword in the Select statement.

Select Distinct da.AccountType
From dbo.DimAccount da
Join dbo.FactFinance ff
On da.AccountKey = ff.AccountKey

Without the Distinct keyword, AccountType is repeated for every row in FactFinance (that is related to DimAccount). However, with Distinct you would only get the unique occurrences of AccountType.

tip.eps A common use of the Distinct keyword is to validate the unique key of a dataset or a table to make sure your understanding of the data matches what exists in your database. Take the DimDate table in the AdventureWorksDW, for example. FullDateAlternateKey is defined, according to the name of the column, as the alternate key for that table. Therefore, running the distinct set of FullDateAlternateKey produces a record count equal to Count(*) on the table (see Figure 9-2). Examining the results in the Results pane of the Query window, you can validate this is true.

9781118821527-fg0902.tif

Figure 9-2: Using the Distinct keyword to validate uniqueness.

Selecting Top records

Use the Top expression in the Select statement (or other data manipulation statements) to allow you to restrict the result set to a number or percent of rows.

This SQL statement returns the top 10 records in the DimAccount table:

Select Top 10 AccountType
From dbo.DimAccount

This SQL statement returns the top 10 percent of records in the DimAccount table:

Select Top 10 Percent AccountType
From dbo.DimAccount

Advanced SQL Concepts

Now that we’ve covered the basics, we review some of the more advanced functions in SQL, including data manipulation statements.

The Union operator

As the name implies, the Union operator allows you to combine the result set of two Select statements into one output. There are two variations on this operator: Union and Union All. The difference between the two is that Union returns a distinct output of the result set. Here's how to use the Union All operator:

Select ‘First Row’
Union All
Select ‘Second Row’
Union All
Select ‘Third Row’

The Select statements with Union or Union All operators can include all the options available to regular Select statements. The only restriction is that the result set in each Select statement must have the same structure.

Case expression

The Case expression is referred to as a Scalar function in SQL. There are several other useful functions in this category and they all follow a similar concept — they are applied on a single value in your result set (or the data you are querying) to manipulate that value. Here's the basic structure of a Case expression:

Case
When <condition>
Then <output>
Else <output>
End

You can use the Case expression in the Select statement in various places, including the columns list, the join, or the Where clause. This query changes the values of AccountType from Balances to Balance and leaves all other values the same.

Select Case
When AccountType=‘Balances’
Then ‘Balance’
Else AccountType
End ModifiedAccountType
From dbo.DimAccount

Like operator

The Like operator is a versatile SQL function that allows you to perform sophisticated string pattern searching in your query. Here's a basic Like operator:

Select AccountType
From dbo.DimAccount
Where AccountType Like ‘Balances’

This query returns only records where the AccountType value is exactly equal to Balances. So far this does not seem like anything special; you can achieve the same results using an = (equal) operator. However, the power of the Like operator stems from the use of wildcard characters. SQL Server includes four wildcard characters.

See Table 9-1 for how to use wildcards.

0901
0901a

Subqueries

Just as the name implies, a subquery is a query within a query. It can be nested anywhere an SQL expression can be used. In most cases, you can accomplish your objective without the use of subqueries, but they tend to be a style preference. In the following sections, we show you when you'd use a subquery.

In a From clause

The subquery in this example returns all the records that exist both in FactFinance and DimScenario.

Select da.AccountType, sub.ScenarioName,sum(sub.Amount)
From dbo.DimAccount da
Left Join (
Select ff.Amount, ff.AccountKey,ds.ScenarioName
From dbo.FactFinance ff
Join dbo.DimScenario ds
On ff.ScenarioKey=ds.ScenarioKey
) sub
On da.AccountKey=sub.AccountKey
Group By da.AccountType, sub.ScenarioName

The result set is left joined to DimAccount. The final output returns all rows from DimAccount and only those rows that match from the subquery.

Correlated subqueries

A correlated subquery means that the subquery runs once for every row in the main query. You can put correlated subqueries in several places inside a Select statement. This subquery runs once for every row in FactFinance and is restricted by the relationship on ScenarioKey with the outer query.

Select Distinct
ff.AccountKey,
(
Select ds.ScenarioName
From dbo.DimScenario ds
Where ff.ScenarioKey=ds.ScenarioKey
)
From dbo.FactFinance ff

note.eps This query can be much more simply achieved by putting DimScenario in the From clause and creating an inner join on ScenarioKey. The output would be exactly the same. Always avoid making your SQL more complicated whenever possible.

Advanced joins

When you start working in SQL, basic joins work fine for you. However, as you start getting into advanced SQL and sophisticated analytics requirements, you may find you need to go beyond the basics to accomplish your objectives. Depending on your situation, you may need to use one or more of the advanced joins described in this section.

Cross join

A cross join allows you to perform a Cartesian product between two tables returning all combinations of rows. You may want to use this type of join when you're trying to create a list of data combinations to use in a lookup table or on a specialized report. For example, this query results in the unique list of all combinations of AccountTypes and ScenarioNames.

Select Distinct da.AccountType, ds.ScenarioName
From dbo.DimAccount da
Cross Join dbo.DimScenario ds

Self join

In an SQL statement you may join the same table to itself using different aliases to facilitate a self join. This is mostly used when you have a parent-child relationship in a table. A common example is a hierarchy of some sort (for example, an organization hierarchy). This self join example unravels one level in the parent-child hierarchy in the DimAccount table. To unravel all the levels, you must keep self joining once for every level, restricting the first level to the top-most parent in the hierarchy.

Select Distinct a1.AccountDescription, a2.AccountDescription
From dbo.DimAccount a1
Join dbo.DimAccount a2
On a1.AccountKey=a2.ParentAccountKey
Order by 1,2

Inner joins

An inner join is not just limited to an equality relationship. Several situations in analytics require you to go beyond that basic relationship and into more advanced ones. This requires a strong ability to visualize data. A common example occurs in what is referred to as a Type II dimension. This type of dimension shows a point-in-time snapshot of the data. Take the example of an organization hierarchy. In certain views you may want to see the organization metrics based on how the organization hierarchy looked at the point that metric was taken. In that case, you have to use a between join in your query.

Advanced grouping

SQL includes a very powerful grouping functionality. These functions are most commonly used in analytics and allow you to perform things such as ranking or partitioning your result set. Following is the basic construct of these functions:

· Over Clause: Used to determine how the dataset should be partitioned and ordered. Inside the Over Clause, use the Partition By or Order By clauses to define the columns involved and the output:

Over (Partition By col1, col2, col3)

· Window Functions: The Over Clause is used with Window Functions such as Rank or Row_Number, or with an Aggregation function:

Select SalesTerritoryKey, ProductKey,
Sum(OrderQuantity) Over(Partition By SalesTerritoryKey)
SalesTerritoryOrderQuantitySum
From dbo.FactInternetSales

The output of this query returns the sum of OrderQuantity by SalesTerritoryKey and repeats those numbers for all ProductKeys in that SalesTerritoryKey (see Table 9-2).

Table 9-2: Over Clause Output

SalesTerritoryKey

ProductKey

SalesTerritoryOrderQuantitySum

9

562

13345

9

482

13345

9

574

13345

9

541

13345

9

535

13345

9

214

13345

9

538

13345

9

537

13345

9

528

13345

6

483

7620

6

477

7620

6

479

7620

6

222

7620

6

480

7620

6

539

7620

Manipulating data

You can perform three basic types of data manipulation activities on your analytics system. After you have learned the basics, you can get more sophisticated with your data updates by combining the other SQL components in this chapter with data manipulation commands to fulfill the requirements.

Insert

An Insert statement appends or adds data to a table. There are three primary ways to insert data into a table:

· Insert with values: This consists of providing a list of values (a single row) that you want to insert into specific table columns:

insert into dbo.sample (id, name)
values(1,‘sample’)

· Insert with a Select statement: This involves inserting the result set (multiple rows) of a Select statement into a table. This is a fast way to insert large amounts of data into your table:

insert into dbo.sample (id, name)
select 1, ‘sample’

· Using a Select Into statement: This is used to create and fill a table based on the result set of a Select statement.

select 1 id, ‘sample’ name
into sample

note.eps The Select statement can be as sophisticated as you need to accomplish the required transformation.

Delete and Update

Delete and Update statements allow you to remove or modify rows in a table that fit certain specified conditions. They can get pretty sophisticated and allow you fine control over your data manipulation.

Delete
From dbo.sample
Where name like ‘s%’

Update dbo.sample
Set name=‘testing’
Where name=‘test’

You can expand the capabilities of your Delete and Update statements by adding joins in the From clause to base your change on values in other tables. We go over some advanced data manipulation patterns specific to analytics in the next chapter.