Madeira  

CASE Study

תאריך: 9 ינואר, 2012 | מאת: גיא גלנצר

Sometimes we want to populate a table with random data, mainly for testing purposes. For example, let's consider the following table:

CREATE TABLE
    Sales.Orders
(
    Id            INT             NOT NULL    IDENTITY(1,1) ,
    CustomerId    INT             NOT NULL ,
    DateAndTime   DATETIME2(0)    NOT NULL ,
    OrderStatus   NVARCHAR(50)    NULL
);
GO

Let's say we want to insert 100,000 rows with random data. We want to generate random values between 1 and 1,000,000 for the "CustomerId" column, and we want to generate random date & time values in the past 5 years for the "DateAndTime" column. For now, let's ignore the "OrderStatus" column. Here is how I would do it:

INSERT INTO
    Sales.Orders WITH (TABLOCK)
(
    CustomerId ,
    DateAndTime
)
SELECT TOP (100000)
    CustomerId    = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
    DateAndTime   = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ())
FROM
    sys.all_columns AS T1
CROSS JOIN
    sys.all_columns AS T2;
GO

Notice a few things:

  1. I'm using the TABLOCK hint in order to use minimal logging. For more information about minimally logged inserts, I recommend this article by Itzik Ben-Gan.
  2. I'm retrieving rows from sys.all_columns cross joined to itself. It really doesn’t matter what table/s I'm using in this case. All I need is to make sure that the data set contains at least 100,000 rows. The TOP clause limits the number of rows returned to 100,000.
  3. I'm using the NEWID function in order to generate random values. If you try to use the RAND function, it will return the same value for the whole recordset, because the RAND function is evaluated only once for the query. If you want a different value for each row (of course you do), then you need to use the NEWID function.

Here is how the table looks like (10 first rows) after the insert:

Id CustomerId DateAndTime OrderStatus
1 127693 2009-09-30 02:18:25 NULL
2 135764 2008-04-26 07:25:29 NULL
3 516346 2010-04-17 13:19:24 NULL
4 523668 2009-05-20 21:12:51 NULL
5 249878 2010-04-14 10:43:30 NULL
6 720186 2008-11-20 07:00:06 NULL
7 80792 2011-03-08 02:39:24 NULL
8 46122 2011-08-24 10:47:45 NULL
9 737056 2009-12-15 02:14:43 NULL
10 304384 2008-03-17 04:51:10 NULL

Now, let's add the "OrderStatus" column to the game. This column should contain one of the following values: "New", "Open", "Closed", "Cancelled" and "Archived". Generating random values for this column is a bit trickier. We can achieve this by using the CASE statement. But first, a few words about this statement…

The CASE statement has two flavors. One is called "Simple CASE", and it compares a single expression to a set of simple expressions. The second is called "Searched CASE", and it evaluates a set of general Boolean expressions.

Here is an example of a simple CASE:

CASE CustomerTypeId
    WHEN 1    THEN N'Regular'
    WHEN 2    THEN N'VIP'
    WHEN 3    THEN N'New'
    WHEN 4    THEN N'Blocked'
    ELSE N'Unknown'
END

And here is an example of a searched CASE:

CASE
    WHEN Price <= $100.00                       THEN N'Cheap'
    WHEN Price > $100.00 AND ProductType = 1    THEN N'Medium'
    WHEN Price > $100.00 AND ProductType = 2    THEN N'Expensive'
    WHEN Price IS NULL                          THEN N'Unknown Price'
    ELSE N'Unknown Product Type'
END

In both cases, the first condition that evaluates to TRUE determines the result. The ELSE clause is optional. If all conditions evaluate to FALSE and there is no ELSE clause, then the result is NULL.

Now, let's go back to our example. In order to generate a random value for the "OrderStatus" column, we can generate a random value between 1 and 5, and use the simple CASE statement in order to translate these numbers into text values, like this:

TRUNCATE TABLE
    Sales.Orders;
GO

INSERT INTO
    Sales.Orders WITH (TABLOCK)
(
    CustomerId ,
    DateAndTime ,
    OrderStatus
)
SELECT TOP (100000)
    CustomerId    = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
    DateAndTime   = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) ,
    OrderStatus   = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1
                        WHEN 1    THEN N'New'
                        WHEN 2    THEN N'Open'
                        WHEN 3    THEN N'Closed'
                        WHEN 4    THEN N'Cancelled'
                        WHEN 5    THEN N'Archived'
                    END
FROM
    sys.all_columns AS T1
CROSS JOIN
    sys.all_columns AS T2;
GO

Here are the results (first 10 rows):

Id CustomerId DateAndTime OrderStatus
1 116062 2008-09-24 19:50:01 NULL
2 3469 2008-02-24 00:59:53 Cancelled
3 635318 2009-07-23 09:44:25 Open
4 413157 2010-04-27 01:14:15 NULL
5 505098 2009-08-20 22:42:10 NULL
6 113948 2010-01-26 21:47:57 NULL
7 981939 2007-10-20 21:12:36 NULL
8 258577 2007-05-20 20:59:16 Open
9 704882 2007-10-01 16:13:24 Cancelled
10 668201 2010-04-08 15:12:40 NULL

Notice that there are many NULL values in the "OrderStatus" column. This is strange, cause the expression we used (ABS (CHECKSUM (NEWID ())) % 5 + 1) always returns a number between 1 and 5, so one of the conditions in the CASE clause should have been evaluated to TRUE. What’s going on?

Well, I'm glad you asked, because this is what I wanted to blog about this time. SQL Server doesn't really have two flavors of the CASE statement. It's actually two interfaces for our convenience. When we use the simple CASE statement, SQL Server translates it into the searched CASE format, because this is actually the only format that exists for the query processor. So the previous INSERT statement is translated by SQL Server to this:

TRUNCATE TABLE
    Sales.Orders;
GO

INSERT INTO
    Sales.Orders WITH (TABLOCK)
(
    CustomerId ,
    DateAndTime ,
    OrderStatus
)
SELECT TOP (100000)
    CustomerId    = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
    DateAndTime   = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) ,
    OrderStatus   = CASE
                        WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1    THEN N'New'
                        WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2    THEN N'Open'
                        WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3    THEN N'Closed'
                        WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4    THEN N'Cancelled'
                        WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5    THEN N'Archived'
                    END
FROM
    sys.all_columns AS T1
CROSS JOIN
    sys.all_columns AS T2;
GO

Now you can see that the expression ABS (CHECKSUM (NEWID ())) % 5 + 1 is calculated separately for each condition in the Searched CASE statement. For the first condition, a random value between 1 and 5 is calculated and compared to "1". The probability of a match is 20%. If it's FALSE, then we move one to the second condition, for which another (different) random value between 1 and 5 is calculated and, this time, compared to "2". Again, we have a probability of 20% for a match, and so on. It is very likely that all conditions will evaluate to FALSE, and in that case the result will be NULL. Actually, the probability of getting NULL from this CASE expression is 0.85 = 32.768%. If you count the number of NULL values in the table, you'll find that there are around 32,768 such rows (out of 100,000 rows in total).

OK, so we learned that the Simple CASE is actually a Searched CASE behind the scenes, and that for this reason, using a random expression with the Simple CASE is problematic. That's good to know (I think), but I still want to populate the "OrderStatus" column with random values. How do I do that?

Well, I'm glad you asked again. There are several options to solve this problem. One of them is to use a small table with a single column containing the 5 text values, cross join to this table as well and sort the whole recordset randomly. Like this:

CREATE TABLE
    Sales.OrderStatuses
(
    OrderStatus NVARCHAR(50) NOT NULL
);
GO

INSERT INTO
    Sales.OrderStatuses
(
    OrderStatus
)
VALUES
    (N'New') ,
    (N'Open') ,
    (N'Closed') ,
    (N'Cancelled') ,
    (N'Archived');
GO

TRUNCATE TABLE
    Sales.Orders;
GO

INSERT INTO
    Sales.Orders WITH (TABLOCK)
(
    CustomerId ,
    DateAndTime ,
    OrderStatus
)
SELECT TOP (100000)
    CustomerId    = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
    DateAndTime   = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) ,
    OrderStatus   = OrderStatus
FROM
    sys.all_columns AS T1
CROSS JOIN
    sys.all_columns AS T2
CROSS JOIN
    Sales.OrderStatuses
ORDER BY
    NEWID () ASC;
GO

This method works, but it's very inefficient, because the query processor has to complete all the join operations first, then calculate the NEWID function for every row, then sort the whole recordset according to the values of NEWID, and only then retrieve the top 100,000 rows. Since the recordset is huge (around 134,000,000 in my case), the sort operation is going to take forever. You can try to replace sys.all_columns with another table that has fewer rows in order to get a recordset of a size close to 100,000, but it's still not going to be an easy task.

Another solution to the problem is to insert the 100,000 rows one by one in a loop. Inside the loop we can calculate the random value once and put it in a variable, and then we can use the Simple CASE statement with the variable. Like this:

DECLARE
    @Counter                    AS INT        = 100000 ,
    @RandomValueBetween1And5    AS TINYINT;

TRUNCATE TABLE
    Sales.Orders;

WHILE
    @Counter > 0
BEGIN

    SET @RandomValueBetween1And5 = ABS (CHECKSUM (NEWID ())) % 5 + 1;

    INSERT INTO
        Sales.Orders WITH (TABLOCK)
    (
        CustomerId ,
        DateAndTime ,
        OrderStatus
    )
    SELECT TOP (100000)
        CustomerId    = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
        DateAndTime   = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) ,
        OrderStatus   = CASE @RandomValueBetween1And5
                            WHEN 1    THEN N'New'
                            WHEN 2    THEN N'Open'
                            WHEN 3    THEN N'Closed'
                            WHEN 4    THEN N'Cancelled'
                            WHEN 5    THEN N'Archived'
                        END;

    SET @Counter -= 1;

END;
GO

This method also works, of course, but it is also inefficient, because there are 100,000 INSERT operations instead of one.

There are more inefficient ways to solve this problem, but let's look at an efficient one. Let's go back to the original query with the Simple CASE statement. We can still use the same query, but we need to make sure that the expression in the Simple CASE statement is calculated only once, and then compared to the other values (1 through 5). We can achieve this by performing the calculation of the random value in a derived table, like this:

TRUNCATE TABLE
    Sales.Orders;
GO

INSERT INTO
    Sales.Orders WITH (TABLOCK)
(
    CustomerId ,
    DateAndTime ,
    OrderStatus
)
SELECT TOP (100000)
    CustomerId    = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
    DateAndTime   = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) ,
    OrderStatus   = CASE RandomValueTable.RandomValue
                        WHEN 1    THEN N'New'
                        WHEN 2    THEN N'Open'
                        WHEN 3    THEN N'Closed'
                        WHEN 4    THEN N'Cancelled'
                        WHEN 5    THEN N'Archived'
                    END
FROM
    sys.all_columns AS T1
CROSS JOIN
    sys.all_columns AS T2
CROSS JOIN
    (
        SELECT
            RandomValue = ABS (CHECKSUM (NEWID ())) % 5 + 1
    )
    AS
        RandomValueTable;
GO

Now the random value is calculated once inside the derived table, and it is exposed to the outer query as a column. When SQL Server transforms the Simple CASE statement into a Searched CASE format, it's still OK, because the "RandomValue" column is not calculated separately for each condition. It's actually similar to the solution using a loop and a variable to hold the random value. Here, we use a column in a derived table to hold the random value instead of a variable.

Now we have a solution that works and is also very efficient. Here are the results:

Id CustomerId DateAndTime OrderStatus
1 369111 2011-01-01 03:31:42 Archived
2 204392 2009-05-02 09:00:00 Cancelled
3 997423 2010-12-09 00:22:51 Archived
4 542308 2009-12-04 03:10:26 Cancelled
5 960822 2009-03-12 07:26:03 Closed
6 216770 2007-05-02 21:03:30 Closed
7 1314 2008-07-23 22:26:21 Archived
8 886849 2007-08-27 15:56:29 Archived
9 176455 2009-08-31 20:43:28 Archived
10 949195 2010-07-31 02:41:29 New


תגיות מתוך הפוסט


תגובות הגולשים

  1. מאת Shay Attiya:

    Very cool stuff !!

  2. [...] http://www.madeira.co.il/case-study/ פורסם: Jan 12 2012, 03:26 PM by Madeira | with no comments תגים:SQL Server, CASE [...]

  3. גיא גלנצר מאת גיא גלנצר:

    Shay and Geri,

    Thanks for your feedback.

  4. מאת חן שאוליאן:

    יפה מאוד, לא ידעתי על הבעיתיות הזו הקיימת ב- CASE … פוסט ממש מעניין ויעיל :)

  5. מאת צ'יקו דרורי:

    Nice one!
    Tnx

על-מנת להשאיר תגובה אנא הכנס את הפרטים או התחבר




הערה: התגובות עוברות סינון, לאחריו תופיע התגובה. אין צורך להכניס שוב את התגובה.

תגובות אחרונות