Monday, 28 November 2011

PIVOT and UNPIVOT T-SQL Operator : Part 2 (UNPIVOT)


In my previous post I explained about Pivoting using PIVOT operator or Standard SQL. In this post I will explain Unpivoting in SQL. I suggest you to read my previous post on Pivoting (http://sqlenthusiast.blogspot.com/2011/11/pivot-and-unpivot-t-sql-operator-part-1.html). This will help you to understand unpivoting better.


Unpivoting Data :
Unpivoting is the technique to rotate from state of column to state of rows. It involves querying the pivoted state of data, producing out of each input to multiple output rows. It is very difficult to understand Unpivoting by definition, here I will explain unpivoting with example.

For demonstration puppose please run the following code in Listing 1.1. Hear I am creating projects table and filling it with sample data. Then I create 'MgrClientProject' table. Which is filled with rows of resultset that we got by Pivoting data of Projects table for manager and clients. 
 

Listing1.1
USE tempdb;

IF OBJECT_ID('dbo.Projects', 'U') IS NOT NULL DROP TABLE dbo.Projects;

CREATE TABLE dbo.Projects
(
projectid INT NOT NULL,
admissiondate DATE NOT NULL,
mgrid INT NOT NULL,
clientid VARCHAR(5) NOT NULL,
daysCount INT NOT NULL,
CONSTRAINT PK_Projects PRIMARY KEY(projectid)
);

INSERT INTO dbo.Projects(projectid, admissiondate, mgrid, clientid, daysCount)
VALUES
(30001, '20090802', 3, 'A', 10),
(10001, '20091224', 2, 'A', 12),
(10005, '20091224', 1, 'B', 20),
(40001, '20100109', 2, 'A', 40),
(10006, '20100118', 1, 'C', 14),
(20001, '20100212', 2, 'B', 12),
(40005, '20110212', 3, 'A', 10),
(20002, '20110216', 1, 'C', 20),
(30003, '20110418', 2, 'B', 15),
(30004, '20090418', 3, 'C', 22),
(30007, '20110907', 3, 'D', 30);

Select mgrid, [A], [B], [C], [D]
INTO MgrClientProject
from (Select mgrid, clientid, daysCount from dbo.Projects) as D
PIVOT (SUM(daysCount) FOR clientid in (A,B,C,D)) as P;

Select * from MgrClientProject


Here is the output of the query to select rows of MgrClientProject table.
mgrid A B C D
1 NULL 20 34 NULL
2 52 27 NULL NULL
3 20 NULL 22 30
Fig1.1


The table has rows for each manager and column for each client and in the intersection of mgrid and clientid we have daysCount values for the projects. For each irrelevent combination of manager and client we have null in the intersection. We are requested to give unpivoted data from this table. We need to generate output having rows for each mgrid and client along with their dayscount value of their project. The desired output should be this.

mgrid clientid daysCount
1 B 20
1 C 34
2 A 52
2 B 27
3 A 20
3 C 22
3 D 30
Fig1.2


Unpivoting involves implementing three logical processing phases : producing copies, extracting elements and eliminating irrelevent intersections.

We can do Unpivoting in two ways.
a) By standard SQL
b) By T-SQL UNPIVOT operator


a) By standard SQL
Unpivoting with Standard Sql implements all three logical phases in very explicit manner.

The first step involves producing multiple copies of input rows. We need to create copies of input rows for each column in the source table. In our our example we have four colums : A, B, C, D. Each colum represents clientid values. We need to do cross join between MgrClientProject table with the table having rows for each client.

To get table having rows for each client we will use table value constructor in the form of a VALUES clause to create a virtual table with a row for each client. Table value constructor is new feature in SQL Server 2008. Following is the query to do cross Join of MgrClientProject table with the table values constructor having each client value.

Select * from dbo.MgrClientProject
CROSS JOIN (Values('A'),('B'),('C'),('D')) AS P(clientid);

The output of this query is.
mgrid A B C D clientid
1 NULL 20 34 NULL A
1 NULL 20 34 NULL B
1 NULL 20 34 NULL C
1 NULL 20 34 NULL D
2 52 27 NULL NULL A
2 52 27 NULL NULL B
2 52 27 NULL NULL C
2 52 27 NULL NULL D
3 20 NULL 22 30 A
3 20 NULL 22 30 B
3 20 NULL 22 30 C
3 20 NULL 22 30 D
Fig1.3


As you can see that four copies of each input rows is created for each Client : A, B, C, D

Next Step is to create new column, let say daysCount, for each mgrid and and custid (newly created colum) combination. The daysCount value will be based on custid value. For example when custid values is 'A' then daysCount value will be the value of colum A, when custid values is 'B' then daysCount value will be the value of colum B. We will use case expression for each custid value to get daysCount value. Here is the query.

Select mgrid, clientid,
CASE WHEN clientid='A' then A
WHEN clientid='B' then B
WHEN clientid='C' then C
WHEN clientid='D' then D END AS daysCount
from dbo.MgrClientProject
CROSS JOIN (Values('A'),('B'),('C'),('D')) AS P(clientid)

The output of this query is.
mgrid clientid daysCount
1 A NULL
1 B 20
1 C 34
1 D NULL
2 A 52
2 B 27
2 C NULL
2 D NULL
3 A 20
3 B NULL
3 C 22
3 D 30
Fig1.4


The third step is to eliminate irrelevent rows where we have NULL value in daysCount column.
Following is final query is to produce Unpivoted result that I have shown in Fig 1.2.

Select * from (Select mgrid, clientid,
CASE WHEN clientid='A' then A
WHEN clientid='B' then B
WHEN clientid='C' then C
WHEN clientid='D' then D END AS daysCount
from dbo.MgrClientProject
CROSS JOIN (Values('A'),('B'),('C'),('D')) AS P(clientid)) AS D
WHERE daysCount is not NULL


b) By T-SQL UNPIVOT operator
Unpivoting data means producing two colums from all colums of input table other then group by colum. In our example input table is MgrClientProject and two output colums created will be custid having values(A, B, C, D) and daysCount.

The standard format of T-SQL UNPIVOT operator is.
SELECT ...
FROM <source_table_or_table_expression>
UNPIVOT(<target_col_to_hold_source_col_values>
FOR <target_col_to_hold_source_col_names> IN(<list_of_source_columns>)) AS
<result_table_alias>

Like PIVOT operator UNPIVOT operator also works as a table operator. In the UNPIVOT operator we need to specify result column name (daysCount in our case) that will have input table spreading column values(A, B, C, D column values in our case). We will specify result coulmn name that will have input table colum names(A, B, C, D column name in our case) and in parenthesis we need to specify input colum names(A, B, C, D column name in our case). We will close the parenthesis and give alias name to the table resulting from UNPIVOT table operator. Following is the Final query to Unpivot MgrClientProject table using T-SQL UNPIVOT operator. This query will result Output exactly like Fig1.2.

Select mgrid, clientid, daysCount
from dbo.MgrClientProject
UNPIVOT(daysCount FOR clientid in (A,B,C,D)) as P;

Note that the UNPIVOT operator implements the same logical processing phases described earlier -- generating copies, extracting elements, and eliminating NULL intersections. The last phase is not an optional phase as in the solution based on standard SQL.

Unpivoting does not necessarilly bring back original table before pivoting. Because Pivoting can cause data loss which can't be bring back by pivoting the data.


Post comment and queries on this topic and share your thoughts with me.
Hope you will like this post.

PIVOT and UNPIVOT T-SQL Operator : Part 1 (PIVOT)

Pivoting Data
 
Pivoting data means rotating data from the state of rows to state of column and possibly, aggregate data along the way.


To explain the definition of pivoting data i will use an example. For demonstration purpose please run the below code in new query window of SSMS (SQL Server Management Studio). In this example I use a Sample Projects table that you create in tempdb database and populate it with sample data by running the code below.


Listing 1.1
USE tempdb;

IF OBJECT_ID('dbo.Projects', 'U') IS NOT NULL DROP TABLE dbo.Projects;

CREATE TABLE dbo.Projects
(
projectid INT NOT NULL,
admissiondate DATE NOT NULL,
mgrid INT NOT NULL,
clientid VARCHAR(5) NOT NULL,
daysCount INT NOT NULL,
CONSTRAINT PK_Projects PRIMARY KEY(projectid)
);

INSERT INTO dbo.Projects(projectid, admissiondate, mgrid, clientid, daysCount)
VALUES
(30001, '20090802', 3, 'A', 10),
(10001, '20091224', 2, 'A', 12),
(10005, '20091224', 1, 'B', 20),
(40001, '20100109', 2, 'A', 40),
(10006, '20100118', 1, 'C', 14),
(20001, '20100212', 2, 'B', 12),
(40005, '20110212', 3, 'A', 10),
(20002, '20110216', 1, 'C', 20),
(30003, '20110418', 2, 'B', 15),
(30004, '20090418', 3, 'C', 22),
(30007, '20110907', 3, 'D', 30);

SELECT * FROM dbo.Projects;


Before explaing pivoting, consider the query to get total daysCount for each mgrid and clientid combination. The solution query for this request is simple...

Select mgrid, clientid, SUM(daysCount) as total_daysCount
from dbo.Projects group by mgrid, clientid

the output is of this query is :
mgrid clientid total_daysCount
2 A 52
3 A 20
1 B 20
2 B 27
1 C 34
3 C 22
3 D 30
Fig 1.1


Now Suppose that the requirement is to generate output in this format :
mgrid A B C D
1 NULL 20 34 NULL
2 52 27 NULL NULL
3 20 NULL 22 30
Fig : 1.2


Here we have rows for each mgrid value and column for each clientd value and in the intersaction of each combination we have toal_daysCount value. This view of data (Fig1.2) is called pivoted view of data of Projects table and the technique to generate output in this format is called Pivoting.


Every pivoting request involves three logical processing phases, each with associated elements: a grouping phase with an associated grouping or on rows element, a spreading phase with an associated spreading or on cols element, and an aggregation phase with an associated aggregation element and aggregate function.
In our example, you need to produce a single row in the result for each unique manager ID. This means that the rows from the Projects table need to be grouped by the mgrid attribute, and therefore the grouping element in our case is the mgrid attribute.
The Projects table has a single column holding all client ID values, and a single column holding their daysCount values. The pivoting process is supposed to produce a different result column for each unique client ID, and each column contains the aggregated quantities for that client. You can think of this process as "spreading" quantities by client IDs. The spreading element in our case is the clientid attribute.
Finally, because pivoting involves grouping, you will need to aggregate data to produce the result values in the "intersection" of the grouping and spreading elements. You will need to identify the aggregate function (SUM in our case), and the aggregation element (daysCount attribute in our case).


We can pivot data in two ways.
a) By using Standard Sql Commands.
b) By using T-SQL PIVOT operator.

Here I will explain both solution.

a) By using Standard Sql Commands

Pivoting using Standard Sql follows all three pivoting phases.

We use mgrid as our grouping element.

The spreading phase is achieved in the select clause by spreading all spreading colum value. Here spreading colum value is clientid. For pivoting you must know all possible spreading column values before writing pivoting query because you need to specify those values for each column in output query. In our example spreading cloumn(clientid) has for values which are A, B, C and D. We need to specify seperate CASE expression in Sqlect clause for each spreading column value. For example, here's the CASE expression for client A:

CASE WHEN clientid='A' THEN daysCount

This expression returns the daysCount from the current row only when the current row represents an project for client A. Otherwise the expression returns null.

Finally, the aggregation phase is achieved by applying the relevant aggregate function (SUM in our case) to the result of each CASE expression. For example, here's the expression that produces the result column for client A:

SUM(CASE WHEN clientid = 'A' THEN daysCount END) AS A

Here's the complete solution query pivoting project data, returning the total days on projects for each manager (on rows) and client (on cols):

SELECT mgrid,
SUM(CASE WHEN clientid = 'A' THEN daysCount END) AS A,
SUM(CASE WHEN clientid = 'B' THEN daysCount END) AS B,
SUM(CASE WHEN clientid = 'C' THEN daysCount END) AS C,
SUM(CASE WHEN clientid = 'D' THEN daysCount END) AS D
FROM dbo.Projects
GROUP BY mgrid;


This query retun pivoted days show in the Fig 1.2.


b) By using T-SQL PIVOT operator

SQL Server 2005 introduced a T-SQL–specific table operator called PIVOT. The PIVOT operator operates in the context of the FROM clause of a query. It operates on some source table or table expression, pivots the data, and returns a result table. The PIVOT operator involves the same logical processing phases as described earlier (grouping, spreading, and aggregating) with the same pivoting elements, but uses different, native syntax.

The general form of a query with the PIVOT operator is:

SELECT ...
FROM <source_table_or_table_expression>
PIVOT(<agg_func>(<aggregation_element>)
FOR <spreading_element>
IN (<list_of_target_columns>)) AS <result_table_alias>;

In the parentheses of the PIVOT operator you specify the aggregate function (SUM in our example), aggregation element (daysCount), spreading element (clientid), and the list of target column names (A, B, C, D). Following the parentheses of the PIVOT operator you specify an alias for the result table.

It is important to note that with the PIVOT operator, you do not explicitly specify the grouping elements, removing the need for a GROUP BY in the query. The PIVOT operator figures out the grouping elements implicitly as all attributes from the source table (or table expression) that were not specified as either the spreading element or the aggregation element. You need to ensure that the source table for the PIVOT operator has no attributes besides the grouping, spreading, and aggregation elements, so that after specifying the spreading and aggregation elements, the only attributes left are those you intend as grouping elements. You achieve this is by not applying the PIVOT operator to the original table directly (Projects in our case), but instead to a table expression that includes only the attributes representing the pivoting elements and no others. For example, here's the solution query to our original pivoting request, using the native PIVOT operator:

SELECT mgrid, A, B, C, D
FROM (SELECT mgrid, clientid, daysCount
FROM dbo.Projects) AS D
PIVOT(SUM(daysCount) FOR clientid IN(A, B, C, D)) AS P;

As another example for a pivoting request, suppose that instead of returning managers on rows and clients on columns, you want it the other way around: the grouping element is clientid, the spreading element is mgrid, and the aggregation element and aggregate function remain SUM(daysCount). After you learn the "template" for a pivoting solution (standard or native), it's just a matter of fitting those elements in the right places. The following solution query uses the native PIVOT operator:

SELECT clientid, [1], [2], [3]
FROM (SELECT mgrid, clientid, daysCount
FROM dbo.Projects) AS D
PIVOT(SUM(daysCount) FOR mgrid IN([1], [2], [3])) AS P;


The client IDs 1, 2, and 3 are values in the clientid column in the source table, but in terms of the result, these values become target column names. Therefore, in the PIVOT IN clause you must refer to them as identifiers. When identifiers are irregular (for example, when they start with a digit), you need to delimit them—hence the use of square brackets.

This query returns the following output:
clientid 1 2 3
A NULL 52 20
B 20 27 NULL
C 34 NULL 22
D NULL NULL 30
Fig. 1.3




Hope u like this explaination of PIVOT operator. If u have any confusion or want to share ur thoughts on this topic please write or ask me in the comment section of this post.

In my next post I will explain about UNPIVOT operator with example.