PBKDF2 in Oracle 12c


When I wrote my article on Oracle passwords I included a small, limited implementation of the PBKDF2 algorithm to illustrate the 12c hashing methodology. Here I’m publishing a fuller implementation with parameters for the derived key length, the number of hashing iterations, and a choice of HMAC hashing methods (SHA-1, SHA-256, SHA-384, or SHA-512). I’ve included functions that return a RAW value (up to 32K octets/bytes) and a VARCHAR2 (up to 16K octets, 32K hex characters), as well as BLOB and CLOB variations if you happen to need a large hash value.

The VARCHAR2 version is simply RAWTOHEX wrapped around get_raw. The BLOB and CLOB versions follow the same overall logic but include some small internal optimizations to help with extremely large results and/or numerous iterations.

My implementation is a fairly straightforward adaptation of the algorithm documented in section 5.2 of RFC 2898. The only significant deviation is in the BLOB and CLOB functions, where I batch up intermediate concatenations within a RAW or VARCHAR2 before concatenating to the BLOB or CLOB result. I do this for efficiency because lob operations are slower.

First, a couple of use cases to test the functionality.  These were both taken from the test vectors published in RFC6070.

SQL> select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),1,20,2) from dual;

PBKDF2.GET_HEX('PASSWORD',UTL_RAW.CAST_TO_RAW('SALT'),1,20,2)
-------------------------------------------------------------------------------------------
0C60C80F961F0E71F3A9B524AF6012062FE037A6

SQL> select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),4096,20,2) from dual;

PBKDF2.GET_HEX('PASSWORD',UTL_RAW.CAST_TO_RAW('SALT'),4096,20,2)
-------------------------------------------------------------------------------------------
4B007901B765489ABEAD49D926F721D065A429C1
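If you have Python handy, the same RFC 6070 vectors can be cross-checked against its built-in PBKDF2 routine, which makes a convenient sanity check any time you modify the package:

```python
import hashlib

# RFC 6070 test vectors for PBKDF2-HMAC-SHA1: (password, salt, iterations, dkLen, expected)
vectors = [
    (b"password", b"salt", 1, 20, "0c60c80f961f0e71f3a9b524af6012062fe037a6"),
    (b"password", b"salt", 2, 20, "ea6c014dc72d6f8ccd1ed92ace1d41f0d8de8957"),
    (b"password", b"salt", 4096, 20, "4b007901b765489abead49d926f721d065a429c1"),
]

for pw, salt, iters, dklen, expected in vectors:
    dk = hashlib.pbkdf2_hmac("sha1", pw, salt, iters, dklen)
    assert dk.hex() == expected

print("all vectors match")
```

The hex strings here match the SQL output above, only in lowercase.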

And now, the PBKDF2 package

CREATE OR REPLACE PACKAGE pbkdf2
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    -- Implementation of algorithm described in section 5.2 of RFC2898
    -- https://tools.ietf.org/html/rfc2898

    -- dk_length refers to number of octets returned for the desired key
    -- regardless of whether the result is raw/blob or hex characters in varchar2/clob
    --   So, a 20-octet key returned by get_raw, would be a 40 character hex string
    --   returned by get_hex.  The dk_length parameter would be 20 in both cases.
    
    -- The following HMAC algorithms are supported
    --   DBMS_CRYPTO.HMAC_SH1    = 2
    --   DBMS_CRYPTO.HMAC_SH256  = 3 
    --   DBMS_CRYPTO.HMAC_SH384  = 4
    --   DBMS_CRYPTO.HMAC_SH512  = 5

    -- Test vectors
    --   https://tools.ietf.org/html/rfc6070

    --  select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),1,20,2) from dual;
    --      0C60C80F961F0E71F3A9B524AF6012062FE037A6 
    --  select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),2,20,2) from dual;
    --      EA6C014DC72D6F8CCD1ED92ACE1D41F0D8DE8957
    --  select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),4096,20,2) from dual;
    --      4B007901B765489ABEAD49D926F721D065A429C1
    --  select pbkdf2.get_hex('passwordPASSWORDpassword',utl_raw.cast_to_raw('saltSALTsaltSALTsaltSALTsaltSALTsalt'),4096,25,2) from dual;
    --      3D2EEC4FE41C849B80C8D83662C0E44A8B291A964CF2F07038

    FUNCTION get_raw(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN RAW
        DETERMINISTIC;

    FUNCTION get_hex(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN VARCHAR2
        DETERMINISTIC;

    FUNCTION get_blob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN BLOB
        DETERMINISTIC;

    FUNCTION get_clob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN CLOB
        DETERMINISTIC;
END;
/

CREATE OR REPLACE PACKAGE BODY pbkdf2
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    -- Implementation of algorithm described in section 5.2 of RFC2898
    -- https://tools.ietf.org/html/rfc2898

    -- dk_length refers to number of octets returned for the desired key
    -- regardless of whether the result is raw/blob or hex characters in varchar2/clob
    --   So, a 20-octet key returned by get_raw, would be a 40 character hex string
    --   returned by get_hex.  The dk_length parameter would be 20 in both cases.

    -- The following HMAC algorithms are supported
    --   DBMS_CRYPTO.HMAC_SH1    = 2
    --   DBMS_CRYPTO.HMAC_SH256  = 3
    --   DBMS_CRYPTO.HMAC_SH384  = 4
    --   DBMS_CRYPTO.HMAC_SH512  = 5

    c_max_raw_length   CONSTANT PLS_INTEGER := 32767;
    c_max_hex_length   CONSTANT PLS_INTEGER := 32767;

    SUBTYPE t_maxraw IS RAW(32767);

    SUBTYPE t_maxhex IS VARCHAR2(32767);

    SUBTYPE t_hmac_result IS RAW(64);           -- must be big enough to hold largest supported HMAC

    FUNCTION iterate_hmac_xor(
        p_salt             IN RAW,
        p_iterations       IN PLS_INTEGER,
        p_hmac             IN PLS_INTEGER,
        p_block_iterator   IN PLS_INTEGER,
        p_raw_password     IN RAW
    )
        RETURN t_hmac_result
    IS
        v_u           t_maxraw;
        v_f_xor_sum   t_hmac_result;
    BEGIN
        -- The RFC describes the U(1)...U(c) values recursively
        -- but the implementation below simply loops with a stored value
        -- to achieve the same functionality.
        v_u :=
            UTL_RAW.CONCAT(
                p_salt,
                UTL_RAW.cast_from_binary_integer(p_block_iterator, UTL_RAW.big_endian)
            );

        v_u := DBMS_CRYPTO.mac(src => v_u, typ => p_hmac, key => p_raw_password);
        v_f_xor_sum := v_u;

        FOR c IN 2 .. p_iterations
        LOOP
            v_u := DBMS_CRYPTO.mac(src => v_u, typ => p_hmac, key => p_raw_password);
            v_f_xor_sum := UTL_RAW.bit_xor(v_f_xor_sum, v_u);
        END LOOP;

        RETURN v_f_xor_sum;
    END iterate_hmac_xor;

    FUNCTION get_raw(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN RAW
        DETERMINISTIC
    IS
        c_hlen           CONSTANT PLS_INTEGER
            := CASE p_hmac
                   WHEN DBMS_CRYPTO.hmac_sh1 THEN 20
                   WHEN DBMS_CRYPTO.hmac_sh256 THEN 32
                   WHEN DBMS_CRYPTO.hmac_sh384 THEN 48
                   WHEN DBMS_CRYPTO.hmac_sh512 THEN 64
               END ;

        c_octet_blocks   CONSTANT PLS_INTEGER := CEIL(p_dk_length / c_hlen);
        v_t_concat                t_maxraw := NULL;
        v_block_iterator          PLS_INTEGER := 1;
    BEGIN
        -- raise exception message per rfc
        -- but this limit is rather moot since the function
        -- is capped by raw limits
        IF p_dk_length > (POWER(2, 32) - 1) * c_hlen
        THEN
            raise_application_error(-20001, 'derived key too long');
        ELSIF p_dk_length > c_max_raw_length
        THEN
            raise_application_error(-20001, 'raw output must be less than 32K bytes');
        END IF;

        IF p_iterations < 1
        THEN
            raise_application_error(-20001, 'must iterate at least once');
        END IF;

        -- Loop one block of hlen-octets at a time of the derived key.
        -- If we build a key past the desired length then exit early, no need to continue
        WHILE     v_block_iterator <= c_octet_blocks
              AND (v_t_concat IS NULL OR UTL_RAW.LENGTH(v_t_concat) < p_dk_length)
        LOOP
            v_t_concat :=
                UTL_RAW.CONCAT(
                    v_t_concat,
                    iterate_hmac_xor(
                        p_salt,
                        p_iterations,
                        p_hmac,
                        v_block_iterator,
                        UTL_RAW.cast_to_raw(p_password)
                    )
                );
            v_block_iterator := v_block_iterator + 1;
        END LOOP;

        -- The final block may extend past the desired length,
        -- return only the first dk_length octets
        RETURN UTL_RAW.SUBSTR(v_t_concat, 1, p_dk_length);
    END get_raw;

    FUNCTION get_hex(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN VARCHAR2
        DETERMINISTIC
    IS
    BEGIN
        -- Two hex characters per octet, so the key may only be half the raw maximum
        IF p_dk_length > c_max_raw_length / 2
        THEN
            raise_application_error(-20001, 'hex representation must be less than 32K characters');
        END IF;

        RETURN RAWTOHEX(get_raw(p_password, p_salt, p_iterations, p_dk_length, p_hmac));
    END get_hex;

    FUNCTION get_blob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN BLOB
        DETERMINISTIC
    IS
        c_hlen           CONSTANT PLS_INTEGER
            := CASE p_hmac
                   WHEN DBMS_CRYPTO.hmac_sh1 THEN 20
                   WHEN DBMS_CRYPTO.hmac_sh256 THEN 32
                   WHEN DBMS_CRYPTO.hmac_sh384 THEN 48
                   WHEN DBMS_CRYPTO.hmac_sh512 THEN 64
               END ;

        c_octet_blocks   CONSTANT PLS_INTEGER := CEIL(p_dk_length / c_hlen);
        v_t_concat                BLOB;
        v_block_iterator          PLS_INTEGER := 1;
        v_temp                    t_maxraw;
    BEGIN
        -- raise exception message per rfc
        IF p_dk_length > (POWER(2, 32) - 1) * c_hlen
        THEN
            raise_application_error(-20001, 'derived key too long');
        END IF;

        IF p_iterations < 1
        THEN
            raise_application_error(-20001, 'must iterate at least once');
        END IF;

        DBMS_LOB.createtemporary(lob_loc => v_t_concat, cache => FALSE, dur => DBMS_LOB.session);

        -- Loop one block of hlen-octets at a time of the derived key.
        -- If we build a key past the desired length then exit early, no need to continue
        WHILE     v_block_iterator <= c_octet_blocks
              AND (DBMS_LOB.getlength(v_t_concat) < p_dk_length)
        LOOP
            v_temp :=
                UTL_RAW.CONCAT(
                    v_temp,
                    iterate_hmac_xor(
                        p_salt,
                        p_iterations,
                        p_hmac,
                        v_block_iterator,
                        UTL_RAW.cast_to_raw(p_password)
                    )
                );

            -- Batch blocks in a RAW buffer, appending to the BLOB only
            -- when the buffer can't hold another block; lob writes are slower
            IF UTL_RAW.LENGTH(v_temp) > c_max_raw_length - c_hlen
            THEN
                DBMS_LOB.writeappend(v_t_concat, UTL_RAW.LENGTH(v_temp), v_temp);
                v_temp := NULL;
            END IF;

            v_block_iterator := v_block_iterator + 1;
        END LOOP;

        DBMS_LOB.writeappend(v_t_concat, UTL_RAW.LENGTH(v_temp), v_temp);
        DBMS_LOB.TRIM(v_t_concat, p_dk_length);
        RETURN v_t_concat;
    END get_blob;

    FUNCTION get_clob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN CLOB
        DETERMINISTIC
    IS
        c_hlen           CONSTANT PLS_INTEGER
            := CASE p_hmac
                   WHEN DBMS_CRYPTO.hmac_sh1 THEN 20
                   WHEN DBMS_CRYPTO.hmac_sh256 THEN 32
                   WHEN DBMS_CRYPTO.hmac_sh384 THEN 48
                   WHEN DBMS_CRYPTO.hmac_sh512 THEN 64
               END ;

        c_octet_blocks   CONSTANT PLS_INTEGER := CEIL(p_dk_length / c_hlen);
        v_t_concat                CLOB;
        v_block_iterator          PLS_INTEGER := 1;
        v_temp                    t_maxhex;
    BEGIN
        -- raise exception message per rfc
        IF p_dk_length > (POWER(2, 32) - 1) * c_hlen
        THEN
            raise_application_error(-20001, 'derived key too long');
        END IF;

        IF p_iterations < 1
        THEN
            raise_application_error(-20001, 'must iterate at least once');
        END IF;

        DBMS_LOB.createtemporary(lob_loc => v_t_concat, cache => FALSE, dur => DBMS_LOB.session);

        -- Loop one block of hlen-octets at a time of the derived key.
        -- If we build a key past the desired length then exit early, no need to continue
        -- The end result is a HEX string, so double the length (2 characters to represent one byte)
        WHILE     v_block_iterator <= c_octet_blocks
              AND (DBMS_LOB.getlength(v_t_concat) < p_dk_length * 2)
        LOOP
            v_temp :=
                v_temp
                || RAWTOHEX(
                       iterate_hmac_xor(
                           p_salt,
                           p_iterations,
                           p_hmac,
                           v_block_iterator,
                           UTL_RAW.cast_to_raw(p_password)
                       )
                   );

            -- Batch hex blocks in a VARCHAR2 buffer, appending to the CLOB only
            -- when the buffer can't hold another block; lob writes are slower
            IF LENGTH(v_temp) > c_max_hex_length - 2 * c_hlen
            THEN
                v_t_concat := v_t_concat || v_temp;
                v_temp := NULL;
            END IF;

            v_block_iterator := v_block_iterator + 1;
        END LOOP;

        v_t_concat := v_t_concat || v_temp;
        DBMS_LOB.TRIM(v_t_concat, p_dk_length * 2);
        RETURN v_t_concat;
    END get_clob;
END;
/
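For reference, the block structure the package implements — each T(i) block is the XOR-sum of c chained HMAC applications, and the blocks are concatenated then trimmed to dkLen — can be mirrored in a few lines of Python. This sketch (my own helper names, not part of the package) follows the same loop shape as iterate_hmac_xor and get_raw, and agrees with the RFC 6070 vectors:

```python
import hashlib
import hmac

def iterate_hmac_xor(password, salt, iterations, block_number, digest="sha1"):
    # U1 = HMAC(P, S || INT(i)); Uj = HMAC(P, U(j-1)); F = U1 xor ... xor Uc
    u = hmac.new(password, salt + block_number.to_bytes(4, "big"), digest).digest()
    f = u
    for _ in range(2, iterations + 1):
        u = hmac.new(password, u, digest).digest()
        f = bytes(a ^ b for a, b in zip(f, u))
    return f

def pbkdf2(password, salt, iterations, dk_length, digest="sha1"):
    hlen = hashlib.new(digest).digest_size
    blocks = -(-dk_length // hlen)  # CEIL(dk_length / hlen)
    t = b"".join(
        iterate_hmac_xor(password, salt, iterations, i, digest)
        for i in range(1, blocks + 1)
    )
    return t[:dk_length]  # trim the final block, like UTL_RAW.SUBSTR

# single-block and multi-block RFC 6070 vectors
assert pbkdf2(b"password", b"salt", 4096, 20).hex() == \
    "4b007901b765489abead49d926f721d065a429c1"
assert pbkdf2(b"passwordPASSWORDpassword",
              b"saltSALTsaltSALTsaltSALTsaltSALTsalt", 4096, 25).hex() == \
    "3d2eec4fe41c849b80c8d83662c0e44a8b291a964cf2f07038"
```

The 25-octet vector is the interesting one: it forces two blocks with SHA-1 (hlen = 20), exercising both the concatenation and the final trim.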

The PBKDF2 algorithm is supposed to be slow to execute in order to discourage brute-force attacks. While I did use a few coding techniques to help performance, this is still not a fast implementation. If you need maximum speed then I recommend a C library. For short strings and fewer than 100,000 iterations the package should have adequate speed for most use cases.
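That slowness scales linearly with the iteration count, which is easy to see using Python’s built-in routine as a stand-in for the package:

```python
import hashlib
import time

def time_pbkdf2(iterations):
    # derive a 64-octet key with HMAC-SHA512, timing only the derivation
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha512", b"password", b"salt", iterations, 64)
    return time.perf_counter() - start

t_small = time_pbkdf2(1_000)
t_large = time_pbkdf2(100_000)
print(f"1,000 iterations:   {t_small:.4f}s")
print(f"100,000 iterations: {t_large:.4f}s")
```

A hundred times the iterations means roughly a hundred times the HMAC work, which is exactly the property that makes the iteration count a tunable brake on offline guessing.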

The code presented above requires at least 12cR1, but it could run on earlier versions by changing the supported hashing methods.

I hope it helps, questions and comments are always welcome.

How Oracle Stores Passwords article updated through 12.2


I’ve updated the content through 12.2 and added a section on SQLNET.ORA parameters affecting the protocols (and thereby the hashing algorithms) supported.

The pdf file can be downloaded from my dropbox here.

How to pass a vector or data.frame from R to Oracle as a bind variable in a select statement using ROracle.


Recently I was asked how to bind an R variable to a SQL statement executed with ROracle.  In this case it wasn’t a simple  “select * from table where x = :b1 and y = :b2.”

You can do those quite easily with a data.frame with only one row in it using dbGetQuery(connection, sqlstatement, data.frame).  It will even work for some bulk binding, such as a multi-row insert from a data.frame or a vector.

Here though, the developer wanted to pass in an R set as an In-List for a where clause.  That is, something like “select * from table where (x,y) in (:b1)” where the bind variable was a data.frame consisting of a few hundred observations of 2 variables each.  Of course, the syntax shown is only pseudo-code, which further complicated the requirements: how do you create the query with valid syntax, and how do you pass the data.frame to the query as a bind variable?

First, in the context of a select statement, neither a data.frame nor a vector is legal as an input parameter if it has more than one row.  Unfortunately that means we’ll have to take the extra step of converting the data into a legal data type to pass it through the dbGetQuery interface, and then do something with that converted data within the SQL statement.

A simple and, more importantly, legal option is to convert the data into a delimited chr variable.  On the Oracle side this will be interpreted as a VARCHAR2 or a CLOB depending on the size, either of which is easily parsed back into the original rows and columns as needed.

First, let’s use a simple vector with just a few elements. We’ll look up a few employees from the sample SCOTT.EMP table.  We can’t use “in (:b1)” as shown above, but if we can generate a collection we can use “member of :b1.”

Since we’re going to pass a delimited chr value from R, we need to convert that text into a collection.  Previously I posted how to split a clob into a nested table of varchar2 values.  We’ll use that here too.  For this first example the clob handling will be a bit of overkill since our text string will be small; but it will still illustrate the process.

So, first, we’ll construct a vector, then make a chr variable from the vector.  Finally, pass the chr into a query using the split_clob function to create a nested table collection for use with the member of condition.

> employees_vector <- c("SCOTT","KING","ADAMS")
> employees_vector
[1] "SCOTT" "KING" "ADAMS"
> employees_chr <- paste(employees_vector,collapse=",")
> employees_chr
[1] "SCOTT,KING,ADAMS"
> employees_df <- dbGetQuery(con,"select * from emp where ename member of split_clob(:b1,',')",employees_chr)
> employees_df
  EMPNO ENAME       JOB   MGR            HIREDATE  SAL COMM DEPTNO
1  7788 SCOTT   ANALYST  7566 1987-04-19 00:00:00 3000   NA     20
2  7839 KING  PRESIDENT    NA 1981-11-16 23:00:00 5000   NA     10
3  7876 ADAMS     CLERK  7788 1987-05-23 00:00:00 1100   NA     20

It is possible to use an IN clause in the query, but you must create a subquery for the condition.  So, using the same chr variable we still use split_clob, but we then use the TABLE function to use the resulting collection as a data source for the subquery.

> employees_df2 <- dbGetQuery(con,"select * from emp where ename in (select * from table(split_clob(:b1,',')))",employees_chr)
> employees_df2
   EMPNO ENAME       JOB  MGR            HIREDATE  SAL COMM DEPTNO
1   7788 SCOTT   ANALYST 7566 1987-04-19 00:00:00 3000   NA     20
2   7839 KING  PRESIDENT   NA 1981-11-16 23:00:00 5000   NA     10
3   7876 ADAMS     CLERK 7788 1987-05-23 00:00:00 1100   NA     20

It is also possible to use the text field directly with a simple INSTR (where instr(‘KING,SCOTT,ADAMS’,ename) > 0), but doing so reliably is more difficult.  Also, by leaving the table column untouched, indexes on that column can be used more reliably.  Due to these limitations I’m not providing examples.  Again, it is possible, but not recommended.

The examples above are for sets where each row contains only a single value (a vector, or a data.frame with a single column).  But what if you need multi-column checks?  That is, something of the form “select * from table1 where (a,b) in (select x,y from table2),” where “table2” is somehow based on our data.frame contents.

While the basic idea is the same, using multiple columns in the condition creates an additional challenge: we need to encode the rows and columns into a chr field such that the fields are distinct but still grouped by row.  Also, the split_clob function will only generate one value for each row, rather than reconstructing the individual fields.

First, on the R side, we’ll use the paste function again, but twice: once with the separation delimiter and then again with the collapse delimiter.

> input_df <- data.frame(jobs=c("CLERK","CLERK","ANALYST","PRESIDENT") ,depts=c("SALES","ACCOUNTING","RESEARCH","ACCOUNTING"))
> input_df
       jobs      depts
1     CLERK      SALES
2     CLERK ACCOUNTING
3   ANALYST   RESEARCH
4 PRESIDENT ACCOUNTING
> input_chr <- paste(paste(input_df$jobs,input_df$depts,sep="|"),collapse=",")
> input_chr
[1] "CLERK|SALES,CLERK|ACCOUNTING,ANALYST|RESEARCH,PRESIDENT|ACCOUNTING"
>

Now we have a single chr field, where each row is delimited with commas and the fields within the row are delimited with pipes.  Using split_clob we can separate the string into 4 fields and then parse each of those into a row of 2 fields.

Removing R for a moment, we can test the splitting and parsing to see what the subquery will return.

SELECT SUBSTR(COLUMN_VALUE, 1, INSTR(COLUMN_VALUE, '|') - 1) jobs,
 SUBSTR(COLUMN_VALUE, INSTR(COLUMN_VALUE, '|') + 1) depts
 FROM TABLE(split_clob('CLERK|SALES,CLERK|ACCOUNTING,ANALYST|RESEARCH,PRESIDENT|ACCOUNTING', ','));

JOBS         DEPTS 
------------ --------------
CLERK        SALES 
CLERK        ACCOUNTING 
ANALYST      RESEARCH 
PRESIDENT    ACCOUNTING

4 rows selected.
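The round trip — collapse rows into one delimited string on the client, then split on commas and parse on pipes inside the query — is easy to sanity-check outside the database. Here is the same encode/decode logic sketched in Python (standing in for the R paste calls and the SUBSTR/INSTR parsing, respectively):

```python
rows = [
    ("CLERK", "SALES"),
    ("CLERK", "ACCOUNTING"),
    ("ANALYST", "RESEARCH"),
    ("PRESIDENT", "ACCOUNTING"),
]

# Client side: fields joined with "|", rows joined with ","
# (paste(..., sep = "|") then paste(..., collapse = ",") in R)
encoded = ",".join("|".join(r) for r in rows)

# Database side: split_clob splits on ",", then SUBSTR/INSTR splits each piece on "|"
decoded = [tuple(piece.split("|", 1)) for piece in encoded.split(",")]

assert encoded == "CLERK|SALES,CLERK|ACCOUNTING,ANALYST|RESEARCH,PRESIDENT|ACCOUNTING"
assert decoded == rows
```

As long as the two delimiters never appear in the data itself, the encoding is lossless, which is why the parsed subquery above returns exactly the original four rows.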

Now we put it all together and pull some information from the emp and dept tables about our job/department pairs.

> emp_dept_df <- dbGetQuery(con,"
+   SELECT e.ename,
+          e.job,
+          e.sal,
+          d.dname,
+          d.loc
+     FROM emp e INNER JOIN dept d ON e.deptno = d.deptno
+    WHERE (e.job, d.dname) IN
+             (SELECT SUBSTR(COLUMN_VALUE, 1, INSTR(COLUMN_VALUE, '|') - 1) jobs,
+                     SUBSTR(COLUMN_VALUE, INSTR(COLUMN_VALUE, '|') + 1) depts
+                FROM TABLE(split_clob(:b1, ',')))
+  ORDER BY d.dname, e.ename",input_chr)
> emp_dept_df
   ENAME       JOB  SAL      DNAME      LOC
1   KING PRESIDENT 5000 ACCOUNTING NEW YORK
2 MILLER     CLERK 1300 ACCOUNTING NEW YORK
3   FORD   ANALYST 3000   RESEARCH DALLAS
4  SCOTT   ANALYST 3000   RESEARCH DALLAS
5  JAMES     CLERK  950      SALES CHICAGO
>

Obviously, as the number of fields in a data.frame expands, the parsing will get longer and more complicated.  If you’ll be performing similar queries often, you may want to build a dedicated function within the database that combines the split_clob functionality with the parsing of the individual fields and returns a collection of user-defined types.  Also, these examples used text fields and columns, but the same approach works for dates or numeric values.  In that case you would need to ensure consistent formatting in the string construction so the parsing can be done correctly and reliably.  The overall technique remains the same.

As mentioned above, these steps do entail extra processing on both the R side and the Oracle side of the interface; but for most use cases the extra resource consumption should be minor, and this method provides a workaround to a syntax limitation in the ROracle/DBI functionality suite.

I hope you find it useful.

Setting up the Star Schema Benchmark (SSB) in Oracle


In my previous two posts I showed how to setup a schema for the TPC-H tables and test data. A related test system called the Star Schema Benchmark (SSB) from Pat O’Neil, Betty O’Neil, and Xuedong Chen at the University of Massachusetts at Boston alters the TPC-H structures to create a warehousing data model.

The SSB is documented here. As with the TPC-H setup, a modified version of the dbgen utility that creates the data can also be found on the UMass site here.

Unzip the dbgen file somewhere on your Linux system. As of this writing, the June 5, 2009 Revision 3 is the current version of the SSB. In the future the steps may change slightly, but they should remain fairly similar to what is described below.
When you unzip, a dbgen directory will be created. Change to that directory, make a copy of the makefile template, and edit it. The SSB version of dbgen doesn’t support Oracle, but that doesn’t matter for the data setup; only the query generation requires a database. So you can pick any of the supported databases for the purposes of data creation.

$ cd dbgen
$ cp makefile.suite makefile
$ vi makefile

Within the makefile, change the following lines

CC = gcc
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = SSBM

Then run make to create the dbgen utility. You’ll get several compiler warnings, but since we’re not distributing the resulting binary and control how it is used, they are safe to ignore.

Oracle offers a training lab for the 12c In-Memory option utilizing the SSB schema. The training guide states the test system is based on a 50GB scale, but the query results illustrated in the guide show only a 4GB data set, so I’ll demonstrate the same here.
We’ll generate a small (approximately 4GB) set of test data and place it in a directory we’ll use later for the upload into the Oracle tables. Unlike the TPC-H dbgen, you must generate each file type individually.

$ ./dbgen -s 4 -T c
$ ./dbgen -s 4 -T p
$ ./dbgen -s 4 -T s
$ ./dbgen -s 4 -T d
$ ./dbgen -s 4 -T l
$ mv *.tbl /home/oracle/ssb

Within the database, set up the SSB schema. All tables will be created according to the layouts described in section 2 of the SSB specification. For the purposes of data import, a set of external tables will also be created. These are not part of SSB itself and may be left in place or dropped after the data load is complete.

The SSB specification describes primary keys and foreign keys but no check constraints. The TPC-H schema allows NOT NULL declarations, so I’ve included them here as well. If you prefer, simply remove the “NOT NULL” in the DDL below to allow nulls; the dbgen utility will populate every column though.

The descriptions of a couple of tables have some errors, which I’ve noted in the comments below. One is a partially documented but seemingly unused column, which I have removed. The other is in the date table, which defines a day-of-week column of 8 characters even though dbgen creates some day names 9 letters long (“Wednesday”).

One final change, the “DATE” table has been renamed “DATE_DIM” since DATE is an Oracle keyword. This change also makes the schema compatible with the Oracle In-Memory lab.

CREATE USER ssb IDENTIFIED BY ssb;

GRANT CREATE SESSION,
      CREATE TABLE,
      CREATE ANY DIRECTORY,
      UNLIMITED TABLESPACE
    TO ssb;

CREATE OR REPLACE DIRECTORY ssb_dir AS '/home/oracle/ssb';

GRANT READ, WRITE ON DIRECTORY ssb_dir TO ssb;

CREATE TABLE ssb.ext_lineorder
(
    lo_orderkey        INTEGER,
    lo_linenumber      NUMBER(1, 0),
    lo_custkey         INTEGER,
    lo_partkey         INTEGER,
    lo_suppkey         INTEGER,
    lo_orderdate       INTEGER,
    lo_orderpriority   CHAR(15),
    lo_shippriority    CHAR(1),
    lo_quantity        NUMBER(2, 0),
    lo_extendedprice   NUMBER,
    lo_ordtotalprice   NUMBER,
    lo_discount        NUMBER(2, 0),
    lo_revenue         NUMBER,
    lo_supplycost      NUMBER,
    --lo_ordsupplycost   NUMBER, -- this is mentioned in 2.2 Notes(c) but isn't in the layout or sample queries, so not needed?
    lo_tax             NUMBER(1, 0),
    lo_commitdate      INTEGER,
    lo_shipmode        CHAR(10)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineorder.tbl*'))
          PARALLEL 4;

CREATE TABLE ssb.lineorder
(
    lo_orderkey        INTEGER NOT NULL,
    lo_linenumber      NUMBER(1, 0) NOT NULL,
    lo_custkey         INTEGER NOT NULL,
    lo_partkey         INTEGER NOT NULL,
    lo_suppkey         INTEGER NOT NULL,
    lo_orderdate       NUMBER(8,0) NOT NULL,
    lo_orderpriority   CHAR(15) NOT NULL,
    lo_shippriority    CHAR(1) NOT NULL,
    lo_quantity        NUMBER(2, 0) NOT NULL,
    lo_extendedprice   NUMBER NOT NULL,
    lo_ordtotalprice   NUMBER NOT NULL,
    lo_discount        NUMBER(2, 0) NOT NULL,
    lo_revenue         NUMBER NOT NULL,
    lo_supplycost      NUMBER NOT NULL,
    --lo_ordsupplycost   NUMBER not null, -- this is mentioned in 2.2 Notes(c) but isn't in the layout or sample queries, so not needed?
    lo_tax             NUMBER(1, 0) NOT NULL,
    lo_commitdate      NUMBER(8,0) NOT NULL,
    lo_shipmode        CHAR(10) NOT NULL
);

CREATE TABLE ssb.ext_part
(
    p_partkey     INTEGER,
    p_name        VARCHAR2(22),
    p_mfgr        CHAR(6),
    p_category    CHAR(7),
    p_brand1      CHAR(9),
    p_color       VARCHAR2(11),
    p_type        VARCHAR2(25),
    p_size        NUMBER(2, 0),
    p_container   CHAR(10)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl'));

CREATE TABLE ssb.part
(
    p_partkey     INTEGER NOT NULL,
    p_name        VARCHAR2(22) NOT NULL,
    p_mfgr        CHAR(6) NOT NULL,
    p_category    CHAR(7) NOT NULL,
    p_brand1      CHAR(9) NOT NULL,
    p_color       VARCHAR2(11) NOT NULL,
    p_type        VARCHAR2(25) NOT NULL,
    p_size        NUMBER(2, 0) NOT NULL,
    p_container   CHAR(10) NOT NULL
);

CREATE TABLE ssb.ext_supplier
(
    s_suppkey   INTEGER,
    s_name      CHAR(25),
    s_address   VARCHAR2(25),
    s_city      CHAR(10),
    s_nation    CHAR(15),
    s_region    CHAR(12),
    s_phone     CHAR(15)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl'));

CREATE TABLE ssb.supplier
(
    s_suppkey   INTEGER NOT NULL,
    s_name      CHAR(25) NOT NULL,
    s_address   VARCHAR2(25) NOT NULL,
    s_city      CHAR(10) NOT NULL,
    s_nation    CHAR(15) NOT NULL,
    s_region    CHAR(12) NOT NULL,
    s_phone     CHAR(15) NOT NULL
);

CREATE TABLE ssb.ext_customer
(
    c_custkey      INTEGER,
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(25),
    c_city         CHAR(10),
    c_nation       CHAR(15),
    c_region       CHAR(12),
    c_phone        CHAR(15),
    c_mktsegment   CHAR(10)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl'));

CREATE TABLE ssb.customer
(
    c_custkey      INTEGER NOT NULL,
    c_name         VARCHAR2(25) NOT NULL,
    c_address      VARCHAR2(25) NOT NULL,
    c_city         CHAR(10) NOT NULL,
    c_nation       CHAR(15) NOT NULL,
    c_region       CHAR(12) NOT NULL,
    c_phone        CHAR(15) NOT NULL,
    c_mktsegment   CHAR(10) NOT NULL
);

CREATE TABLE ssb.ext_date_dim
(
    d_datekey            NUMBER(8,0),
    d_date               CHAR(18),
    d_dayofweek          CHAR(9),    -- defined in Section 2.6 as Size 8, but Wednesday is 9 letters
    d_month              CHAR(9),
    d_year               NUMBER(4, 0),
    d_yearmonthnum       NUMBER(6, 0),
    d_yearmonth          CHAR(7),
    d_daynuminweek       NUMBER(1, 0),
    d_daynuminmonth      NUMBER(2, 0),
    d_daynuminyear       NUMBER(3, 0),
    d_monthnuminyear     NUMBER(2, 0),
    d_weeknuminyear      NUMBER(2, 0),
    d_sellingseason      CHAR(12),
    d_lastdayinweekfl    NUMBER(1, 0),
    d_lastdayinmonthfl   NUMBER(1, 0),
    d_holidayfl          NUMBER(1, 0),
    d_weekdayfl          NUMBER(1, 0)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('date.tbl'));

CREATE TABLE ssb.date_dim
(
    d_datekey            NUMBER(8,0) NOT NULL,
    d_date               CHAR(18) NOT NULL,
    d_dayofweek          CHAR(9) NOT NULL,    -- defined in Section 2.6 as Size 8, but Wednesday is 9 letters
    d_month              CHAR(9) NOT NULL,
    d_year               NUMBER(4, 0) NOT NULL,
    d_yearmonthnum       NUMBER(6, 0) NOT NULL,
    d_yearmonth          CHAR(7) NOT NULL,
    d_daynuminweek       NUMBER(1, 0) NOT NULL,
    d_daynuminmonth      NUMBER(2, 0) NOT NULL,
    d_daynuminyear       NUMBER(3, 0) NOT NULL,
    d_monthnuminyear     NUMBER(2, 0) NOT NULL,
    d_weeknuminyear      NUMBER(2, 0) NOT NULL,
    d_sellingseason      CHAR(12) NOT NULL,
    d_lastdayinweekfl    NUMBER(1, 0) NOT NULL,
    d_lastdayinmonthfl   NUMBER(1, 0) NOT NULL,
    d_holidayfl          NUMBER(1, 0) NOT NULL,
    d_weekdayfl          NUMBER(1, 0) NOT NULL
);

Now load the data. As you scale up into larger volumes, these steps are still valid; but you may want to split the loads into separate steps and alter the LINEORDER external table to read multiple files in parallel and use parallel dml on insert in order to speed up the process. The truncate lines aren’t necessary for the first time data load; but are included for future reloads of the dbgen data with other scaling.

TRUNCATE TABLE ssb.lineorder;
TRUNCATE TABLE ssb.part;
TRUNCATE TABLE ssb.supplier;
TRUNCATE TABLE ssb.customer;
TRUNCATE TABLE ssb.date_dim;

ALTER TABLE ssb.lineorder PARALLEL 4;
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND */ INTO  ssb.part      SELECT * FROM ssb.ext_part;
commit;
INSERT /*+ APPEND */ INTO  ssb.supplier  SELECT * FROM ssb.ext_supplier;
commit;
INSERT /*+ APPEND */ INTO  ssb.customer  SELECT * FROM ssb.ext_customer;
commit;
INSERT /*+ APPEND */ INTO  ssb.date_dim  SELECT * FROM ssb.ext_date_dim;
commit;
INSERT /*+ APPEND */ INTO  ssb.lineorder SELECT * FROM ssb.ext_lineorder;
commit;

And finally, add the constraints and indexes.

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT pk_lineorder PRIMARY KEY(lo_orderkey, lo_linenumber);

ALTER TABLE ssb.part
    ADD CONSTRAINT pk_part PRIMARY KEY(p_partkey);

ALTER TABLE ssb.supplier
    ADD CONSTRAINT pk_supplier PRIMARY KEY(s_suppkey);

ALTER TABLE ssb.customer
    ADD CONSTRAINT pk_customer PRIMARY KEY(c_custkey);

ALTER TABLE ssb.date_dim
    ADD CONSTRAINT pk_date_dim PRIMARY KEY(d_datekey);

---

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_customer FOREIGN KEY(lo_custkey) REFERENCES ssb.customer(c_custkey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_part FOREIGN KEY(lo_partkey) REFERENCES ssb.part(p_partkey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_supplier FOREIGN KEY(lo_suppkey) REFERENCES ssb.supplier(s_suppkey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_orderdate FOREIGN KEY(lo_orderdate) REFERENCES ssb.date_dim(d_datekey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_commitdate FOREIGN KEY(lo_commitdate) REFERENCES ssb.date_dim(d_datekey);

And that’s it; you should now have a complete SSB scale-4 data set with which to run the SSB suite of test queries, Oracle labs, or your own tests.

If you want to generate larger data sets, you can do so with syntax similar to that used for TPC-H. The DDL and inserts above are already set up for either parallel or single-file loads of the LINEORDER table. The other tables are relatively small in comparison; they can still be split if desired, but you probably won’t need to.

$ ./dbgen -s 10 -T c
$ ./dbgen -s 10 -T p
$ ./dbgen -s 10 -T s
$ ./dbgen -s 10 -T d
$ ./dbgen -s 10 -T l -C 4 -S 1
$ ./dbgen -s 10 -T l -C 4 -S 2
$ ./dbgen -s 10 -T l -C 4 -S 3
$ ./dbgen -s 10 -T l -C 4 -S 4
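The eight dbgen calls above can also be scripted. A minimal sketch; `ssb_dbgen_cmds` is a hypothetical helper (not part of dbgen) that emits the commands so you can review them, pipe them to sh, or parallelize them:

```shell
# Emit one dbgen command per small table (-T c/p/s/d), then one per
# lineorder chunk, for the given scale and chunk count.
ssb_dbgen_cmds() {
  local scale=$1 chunks=$2 t chunk
  for t in c p s d; do
    echo "./dbgen -s $scale -T $t"
  done
  for chunk in $(seq 1 "$chunks"); do
    echo "./dbgen -s $scale -T l -C $chunks -S $chunk"
  done
}
ssb_dbgen_cmds 10 4
```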

Enjoy!

Setting up TPC-H test data with Oracle on Linux (Part 2 – Large data sets)


In my previous post I showed how to set up data loads for the TPC-H test data. All of the steps provided should work regardless of the data volume. However, for large data sets you may want to parallelize the data generation and data import in order to speed up the process.

The user, directory, permissions, target tables, constraints, and indexes from the previous post will be the same. As with the small data sets the constraints and indexes won’t be added until after the data is loaded.

As with the small sets, dbgen is used, but instead of scale 4 I’ll use scale 100 (approximately 100GB of data). I’ll split it into 4 parts for parallelization, which means calling dbgen 4 times, once for each part.

$ ./dbgen -s 100 -S 1 -C 4
$ ./dbgen -s 100 -S 2 -C 4
$ ./dbgen -s 100 -S 3 -C 4
$ ./dbgen -s 100 -S 4 -C 4

These can be run sequentially or in 4 separate sessions. If you have more CPU capacity, I suggest using more than 4 parts and/or running them in parallel sessions.
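If you’d rather not open four terminals, background jobs give the same effect. A minimal sketch; `run_chunks` is a hypothetical wrapper, and the real dbgen invocation replaces the usage example:

```shell
# Launch one copy of the given command per chunk in the background,
# appending -S <n> to each, then block until all of them finish.
run_chunks() {
  local chunks=$1 chunk
  shift
  for chunk in $(seq 1 "$chunks"); do
    "$@" -S "$chunk" &
  done
  wait
}
# usage: run_chunks 4 ./dbgen -s 100 -C 4
```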

Note: sessions other than the first will each try to replace the nation.tbl file, so you’ll be prompted whether to replace it. Either Y or N is fine; the same data is generated regardless.

$ ./dbgen -s 100 -S 2 -C 4
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Do you want to overwrite ./nation.tbl ? [Y/N]: n
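If you script the chunk runs, the prompt can be answered automatically by piping a response in. A sketch; `auto_no` is a hypothetical wrapper, and either answer works since the generated data is identical:

```shell
# Feed an endless stream of "n" answers to the wrapped command so the
# nation.tbl overwrite prompt needs no keyboard input.
auto_no() { yes n | "$@"; }
# usage: auto_no ./dbgen -s 100 -S 2 -C 4
```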

When all 4 calls are complete you’ll have the following files.

$ ls *.tbl.*
customer.tbl.1  orders.tbl.1    part.tbl.2
customer.tbl.2  orders.tbl.2    part.tbl.3
customer.tbl.3  orders.tbl.3    part.tbl.4
customer.tbl.4  orders.tbl.4    region.tbl
lineitem.tbl.1  partsupp.tbl.1  supplier.tbl.1
lineitem.tbl.2  partsupp.tbl.2  supplier.tbl.2
lineitem.tbl.3  partsupp.tbl.3  supplier.tbl.3
lineitem.tbl.4  partsupp.tbl.4  supplier.tbl.4
nation.tbl      part.tbl.1

The NATION and REGION tables are small and fixed in size (25 and 5 rows respectively) regardless of the scale chosen, so each has only one file and will not be loaded in parallel.
The other tables will use PARALLEL 4 with multi-file locations in their corresponding external tables.

CREATE TABLE tpch.ext_part
(
    p_partkey       NUMBER(10, 0),
    p_name          VARCHAR2(55),
    p_mfgr          CHAR(25),
    p_brand         CHAR(10),
    p_type          VARCHAR2(25),
    p_size          INTEGER,
    p_container     CHAR(10),
    p_retailprice   NUMBER,
    p_comment       VARCHAR2(23)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_supplier
(
    s_suppkey     NUMBER(10, 0),
    s_name        CHAR(25),
    s_address     VARCHAR2(40),
    s_nationkey   NUMBER(10, 0),
    s_phone       CHAR(15),
    s_acctbal     NUMBER,
    s_comment     VARCHAR2(101)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_partsupp
(
    ps_partkey      NUMBER(10, 0),
    ps_suppkey      NUMBER(10, 0),
    ps_availqty     INTEGER,
    ps_supplycost   NUMBER,
    ps_comment      VARCHAR2(199)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('partsupp.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_customer
(
    c_custkey      NUMBER(10, 0),
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(40),
    c_nationkey    NUMBER(10, 0),
    c_phone        CHAR(15),
    c_acctbal      NUMBER,
    c_mktsegment   CHAR(10),
    c_comment      VARCHAR2(117)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl*'))
    PARALLEL 4;

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_orders
(
    o_orderkey        NUMBER(10, 0),
    o_custkey         NUMBER(10, 0),
    o_orderstatus     CHAR(1),
    o_totalprice      NUMBER,
    o_orderdate       CHAR(10),
    o_orderpriority   CHAR(15),
    o_clerk           CHAR(15),
    o_shippriority    INTEGER,
    o_comment         VARCHAR2(79)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('orders.tbl*'))
    PARALLEL 4;

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        CHAR(10),
    l_commitdate      CHAR(10),
    l_receiptdate     CHAR(10),
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineitem.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('nation.tbl'));

CREATE TABLE tpch.ext_region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('region.tbl'));

In the previous post, the target tables weren’t created with a parallel clause, so add that now.

ALTER TABLE tpch.part     PARALLEL 4;
ALTER TABLE tpch.supplier PARALLEL 4;
ALTER TABLE tpch.partsupp PARALLEL 4;
ALTER TABLE tpch.customer PARALLEL 4;
ALTER TABLE tpch.orders   PARALLEL 4;
ALTER TABLE tpch.lineitem PARALLEL 4;

Clear out any test data from prior runs and load the new data, using parallel DML for maximum load speed.

TRUNCATE TABLE tpch.part;
TRUNCATE TABLE tpch.supplier;
TRUNCATE TABLE tpch.partsupp;
TRUNCATE TABLE tpch.customer;
TRUNCATE TABLE tpch.orders;
TRUNCATE TABLE tpch.lineitem;
TRUNCATE TABLE tpch.nation;
TRUNCATE TABLE tpch.region;

ALTER SESSION SET nls_date_format='YYYY-MM-DD';
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND */ INTO  tpch.part     SELECT * FROM tpch.ext_part;
INSERT /*+ APPEND */ INTO  tpch.supplier SELECT * FROM tpch.ext_supplier;
INSERT /*+ APPEND */ INTO  tpch.partsupp SELECT * FROM tpch.ext_partsupp;
INSERT /*+ APPEND */ INTO  tpch.customer SELECT * FROM tpch.ext_customer;
INSERT /*+ APPEND */ INTO  tpch.orders   SELECT * FROM tpch.ext_orders;
INSERT /*+ APPEND */ INTO  tpch.lineitem SELECT * FROM tpch.ext_lineitem;
INSERT /*+ APPEND */ INTO  tpch.nation   SELECT * FROM tpch.ext_nation;
INSERT /*+ APPEND */ INTO  tpch.region   SELECT * FROM tpch.ext_region;

As with the dbgen step, you can run these sequentially or in parallel sessions. Because the selects and inserts are themselves parallel, I chose to run them sequentially. If you have the hardware to run them in parallel sessions, you can do so, or simply increase the number of input files and the PARALLEL setting on the tables to match your available CPU capacity.

Finally, add the constraints and indexes as shown in the previous post and you’re done.

Setting up TPC-H test data with Oracle on Linux (Part 1 – Small data sets)


While Oracle has its own SCOTT, SH, OE, HR, etc. sample schemas, it’s often useful to be able to scale data volumes for different experiments. The TPC-H schema and sample data sets provide a convenient means of doing so. They are especially helpful in scenarios such as blogging, where readers may have a database but not the same data. Setting up the TPC-H data is fairly simple.

First, go to the TPC home, click Downloads, and select the tools zip file. You may also want to read the PDF file documenting the schema and the data scales. Unzip the file somewhere on your Linux system. At the time of this writing, 2.17.2 is the current version; the steps may change slightly in the future but should remain similar to what is described below.
When you unzip, a directory 2.17.2 will be created. Change to that directory, make a copy of the makefile template, and edit it.

$ cd 2.17.2/dbgen
$ cp makefile.suite makefile
$ vi makefile

Within the makefile, change the following lines

CC = gcc
DATABASE= ORACLE
MACHINE = LINUX
WORKLOAD = TPCH
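If you’d rather not edit the file by hand, the same four changes can be applied with sed. A minimal sketch; `patch_makefile` is a hypothetical helper, and it assumes the variable names appear at the start of their lines as they do in the makefile.suite template:

```shell
# Rewrite the four build variables on stdin and emit the patched makefile.
patch_makefile() {
  sed -e 's/^CC *=.*/CC = gcc/' \
      -e 's/^DATABASE *=.*/DATABASE= ORACLE/' \
      -e 's/^MACHINE *=.*/MACHINE = LINUX/' \
      -e 's/^WORKLOAD *=.*/WORKLOAD = TPCH/'
}
# usage: patch_makefile < makefile.suite > makefile
```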

Then running make will build the dbgen (and qgen) utilities.

$ make
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o build.o build.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o driver.o driver.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o bm_utils.o bm_utils.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o rnd.o rnd.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o print.o print.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o load_stub.o load_stub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o bcd2.o bcd2.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o speed_seed.o speed_seed.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o text.o text.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o permute.o permute.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o rng64.o rng64.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64  -O -o dbgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o -lm
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o qgen.o qgen.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o varsub.o varsub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64  -O -o qgen build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm

We’ll generate a small set of test data (approximately 4GB) and place it in a directory we’ll use later for loading into the Oracle tables.

$ ./dbgen -s 4
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
$ ls *.tbl
customer.tbl  lineitem.tbl  nation.tbl  orders.tbl  partsupp.tbl  part.tbl  region.tbl  supplier.tbl
$ mv *.tbl /home/oracle/tpch

Within the database, set up the tpch schema. All tables will be created according to the layouts described in section 1.4.1 of the TPC-H standard specification. For data import, a set of external tables will also be created; these are not part of TPC-H itself and may be left in place or dropped after the data load is complete. Also, per section 1.4.2, constraints are optional; I’m including the allowable primary key, foreign key, not null, and check constraints described in the 1.4.2 subsections. Other than the default indexes created to support the primary key constraints, no other indexes are included in the steps below. For loading efficiency, the constraints and indexes will be added after the data load is complete.

CREATE USER tpch IDENTIFIED BY tpch;

GRANT CREATE SESSION,
      CREATE TABLE,
      UNLIMITED TABLESPACE
    TO tpch;

CREATE OR REPLACE DIRECTORY tpch_dir AS '/home/oracle/tpch';

GRANT READ ON DIRECTORY tpch_dir TO tpch;

-- 1.4.1
--  per 1.4.2.1  all table columns may be defined NOT NULL

CREATE TABLE tpch.ext_part
(
    p_partkey       NUMBER(10, 0),
    p_name          VARCHAR2(55),
    p_mfgr          CHAR(25),
    p_brand         CHAR(10),
    p_type          VARCHAR2(25),
    p_size          INTEGER,
    p_container     CHAR(10),
    p_retailprice   NUMBER,
    p_comment       VARCHAR2(23)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl'));

CREATE TABLE tpch.part
(
    p_partkey       NUMBER(10, 0) NOT NULL,
    p_name          VARCHAR2(55) NOT NULL,
    p_mfgr          CHAR(25) NOT NULL,
    p_brand         CHAR(10) NOT NULL,
    p_type          VARCHAR2(25) NOT NULL,
    p_size          INTEGER NOT NULL,
    p_container     CHAR(10) NOT NULL,
    p_retailprice   NUMBER NOT NULL,
    p_comment       VARCHAR2(23) NOT NULL
);


CREATE TABLE tpch.ext_supplier
(
    s_suppkey     NUMBER(10, 0),
    s_name        CHAR(25),
    s_address     VARCHAR2(40),
    s_nationkey   NUMBER(10, 0),
    s_phone       CHAR(15),
    s_acctbal     NUMBER,
    s_comment     VARCHAR2(101)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl'));

CREATE TABLE tpch.supplier
(
    s_suppkey     NUMBER(10, 0) NOT NULL,
    s_name        CHAR(25) NOT NULL,
    s_address     VARCHAR2(40) NOT NULL,
    s_nationkey   NUMBER(10, 0) NOT NULL,
    s_phone       CHAR(15) NOT NULL,
    s_acctbal     NUMBER NOT NULL,
    s_comment     VARCHAR2(101) NOT NULL
);

CREATE TABLE tpch.ext_partsupp
(
    ps_partkey      NUMBER(10, 0),
    ps_suppkey      NUMBER(10, 0),
    ps_availqty     INTEGER,
    ps_supplycost   NUMBER,
    ps_comment      VARCHAR2(199)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('partsupp.tbl'));

CREATE TABLE tpch.partsupp
(
    ps_partkey      NUMBER(10, 0) NOT NULL,
    ps_suppkey      NUMBER(10, 0) NOT NULL,
    ps_availqty     INTEGER NOT NULL,
    ps_supplycost   NUMBER NOT NULL,
    ps_comment      VARCHAR2(199) NOT NULL
);

CREATE TABLE tpch.ext_customer
(
    c_custkey      NUMBER(10, 0),
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(40),
    c_nationkey    NUMBER(10, 0),
    c_phone        CHAR(15),
    c_acctbal      NUMBER,
    c_mktsegment   CHAR(10),
    c_comment      VARCHAR2(117)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl'));

CREATE TABLE tpch.customer
(
    c_custkey      NUMBER(10, 0) NOT NULL,
    c_name         VARCHAR2(25) NOT NULL,
    c_address      VARCHAR2(40) NOT NULL,
    c_nationkey    NUMBER(10, 0) NOT NULL,
    c_phone        CHAR(15) NOT NULL,
    c_acctbal      NUMBER NOT NULL,
    c_mktsegment   CHAR(10) NOT NULL,
    c_comment      VARCHAR2(117) NOT NULL
);

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_orders
(
    o_orderkey        NUMBER(10, 0),
    o_custkey         NUMBER(10, 0),
    o_orderstatus     CHAR(1),
    o_totalprice      NUMBER,
    o_orderdate       CHAR(10),
    o_orderpriority   CHAR(15),
    o_clerk           CHAR(15),
    o_shippriority    INTEGER,
    o_comment         VARCHAR2(79)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('orders.tbl'));

CREATE TABLE tpch.orders
(
    o_orderkey        NUMBER(10, 0) NOT NULL,
    o_custkey         NUMBER(10, 0) NOT NULL,
    o_orderstatus     CHAR(1) NOT NULL,
    o_totalprice      NUMBER NOT NULL,
    o_orderdate       DATE NOT NULL,
    o_orderpriority   CHAR(15) NOT NULL,
    o_clerk           CHAR(15) NOT NULL,
    o_shippriority    INTEGER NOT NULL,
    o_comment         VARCHAR2(79) NOT NULL
);

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        CHAR(10),
    l_commitdate      CHAR(10),
    l_receiptdate     CHAR(10),
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineitem.tbl'));

CREATE TABLE tpch.lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        DATE,
    l_commitdate      DATE,
    l_receiptdate     DATE,
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
);

CREATE TABLE tpch.ext_nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('nation.tbl'));

CREATE TABLE tpch.nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
);

CREATE TABLE tpch.ext_region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('region.tbl'));

CREATE TABLE tpch.region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
);

Now load the data. The external tables read the date values as text, so we must set NLS_DATE_FORMAT before loading so the text is parsed correctly, or embed the formatting within each sql statement. For the small data set in this example, the steps described here should complete within a few minutes. As you scale up to larger volumes these steps remain valid, but you may want to split the loads into separate steps, alter the external tables to read multiple files in parallel, and use parallel DML on insert to speed up the process. The TRUNCATE statements aren’t necessary for the first data load but are included for future reloads of the dbgen data at other scales.

TRUNCATE TABLE tpch.part;
TRUNCATE TABLE tpch.supplier;
TRUNCATE TABLE tpch.partsupp;
TRUNCATE TABLE tpch.customer;
TRUNCATE TABLE tpch.orders;
TRUNCATE TABLE tpch.lineitem;
TRUNCATE TABLE tpch.nation;
TRUNCATE TABLE tpch.region;

ALTER SESSION SET nls_date_format='YYYY-MM-DD';

INSERT /*+ APPEND */ INTO  tpch.part     SELECT * FROM tpch.ext_part;
INSERT /*+ APPEND */ INTO  tpch.supplier SELECT * FROM tpch.ext_supplier;
INSERT /*+ APPEND */ INTO  tpch.partsupp SELECT * FROM tpch.ext_partsupp;
INSERT /*+ APPEND */ INTO  tpch.customer SELECT * FROM tpch.ext_customer;
INSERT /*+ APPEND */ INTO  tpch.orders   SELECT * FROM tpch.ext_orders;
INSERT /*+ APPEND */ INTO  tpch.lineitem SELECT * FROM tpch.ext_lineitem;
INSERT /*+ APPEND */ INTO  tpch.nation   SELECT * FROM tpch.ext_nation;
INSERT /*+ APPEND */ INTO  tpch.region   SELECT * FROM tpch.ext_region;
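As an alternative to setting NLS_DATE_FORMAT for the session, the conversion can be embedded in each statement. A sketch for the ORDERS load only; the LINEITEM load would wrap its three date columns the same way:

```sql
INSERT /*+ APPEND */ INTO tpch.orders
    SELECT o_orderkey,
           o_custkey,
           o_orderstatus,
           o_totalprice,
           TO_DATE(o_orderdate, 'YYYY-MM-DD'),  -- explicit conversion of the text date
           o_orderpriority,
           o_clerk,
           o_shippriority,
           o_comment
      FROM tpch.ext_orders;
```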

And finally, add the constraints and indexes.

ALTER TABLE tpch.part
    ADD CONSTRAINT pk_part PRIMARY KEY(p_partkey);

ALTER TABLE tpch.supplier
    ADD CONSTRAINT pk_supplier PRIMARY KEY(s_suppkey);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT pk_partsupp PRIMARY KEY(ps_partkey, ps_suppkey);

ALTER TABLE tpch.customer
    ADD CONSTRAINT pk_customer PRIMARY KEY(c_custkey);

ALTER TABLE tpch.orders
    ADD CONSTRAINT pk_orders PRIMARY KEY(o_orderkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT pk_lineitem PRIMARY KEY(l_linenumber, l_orderkey);

ALTER TABLE tpch.nation
    ADD CONSTRAINT pk_nation PRIMARY KEY(n_nationkey);

ALTER TABLE tpch.region
    ADD CONSTRAINT pk_region PRIMARY KEY(r_regionkey);

-- 1.4.2.3

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT fk_partsupp_part FOREIGN KEY(ps_partkey) REFERENCES tpch.part(p_partkey);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT fk_partsupp_supplier FOREIGN KEY(ps_suppkey) REFERENCES tpch.supplier(s_suppkey);

ALTER TABLE tpch.customer
    ADD CONSTRAINT fk_customer_nation FOREIGN KEY(c_nationkey) REFERENCES tpch.nation(n_nationkey);

ALTER TABLE tpch.orders
    ADD CONSTRAINT fk_orders_customer FOREIGN KEY(o_custkey) REFERENCES tpch.customer(c_custkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_order FOREIGN KEY(l_orderkey) REFERENCES tpch.orders(o_orderkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_part FOREIGN KEY(l_partkey) REFERENCES tpch.part(p_partkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_supplier FOREIGN KEY(l_suppkey) REFERENCES tpch.supplier(s_suppkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_partsupp FOREIGN KEY(l_partkey, l_suppkey)
        REFERENCES tpch.partsupp(ps_partkey, ps_suppkey);

-- 1.4.2.4 - 1

ALTER TABLE tpch.part
    ADD CONSTRAINT chk_part_partkey CHECK(p_partkey >= 0);

ALTER TABLE tpch.supplier
    ADD CONSTRAINT chk_supplier_suppkey CHECK(s_suppkey >= 0);

ALTER TABLE tpch.customer
    ADD CONSTRAINT chk_customer_custkey CHECK(c_custkey >= 0);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT chk_partsupp_partkey CHECK(ps_partkey >= 0);

ALTER TABLE tpch.region
    ADD CONSTRAINT chk_region_regionkey CHECK(r_regionkey >= 0);

ALTER TABLE tpch.nation
    ADD CONSTRAINT chk_nation_nationkey CHECK(n_nationkey >= 0);

-- 1.4.2.4 - 2

ALTER TABLE tpch.part
    ADD CONSTRAINT chk_part_size CHECK(p_size >= 0);

ALTER TABLE tpch.part
    ADD CONSTRAINT chk_part_retailprice CHECK(p_retailprice >= 0);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT chk_partsupp_availqty CHECK(ps_availqty >= 0);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT chk_partsupp_supplycost CHECK(ps_supplycost >= 0);

ALTER TABLE tpch.orders
    ADD CONSTRAINT chk_orders_totalprice CHECK(o_totalprice >= 0);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_quantity CHECK(l_quantity >= 0);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_extendedprice CHECK(l_extendedprice >= 0);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_tax CHECK(l_tax >= 0);

-- 1.4.2.4 - 3

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_discount CHECK(l_discount >= 0.00 AND l_discount <= 1.00);

-- 1.4.2.4 - 4

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_ship_rcpt CHECK(l_shipdate <= l_receiptdate);

And that’s it; you should now have a complete TPC-H scale-4 data set with which to run the TPC-H suite of test queries, Oracle labs, or your own tests.

Enjoy!

12cR2 finally has an on-premises release!


12.2.0.1 has been released for on-premises installation.

It’s only available for Linux and Solaris so far, but that’s good enough to get some test installations started.

http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html
