Setting up TPC-H test data with Oracle on Linux (Part 2 – Large data sets)


In my previous post I showed how to set up data loads for the TPC-H test data. All of the steps provided should work regardless of the data volume. However, for large data sets you may want to parallelize the data generation and data import in order to speed up the process.

The user, directory, permissions, target tables, constraints, and indexes from the previous post remain the same. As with the small data sets, the constraints and indexes won’t be added until after the data is loaded.

As with the small sets, dbgen is used, but instead of scale 4 I’ll use scale 100 (approximately 100GB of data).
I’ll split the generation into 4 parts for parallelization, which means calling dbgen 4 times, once for each part.

$ ./dbgen -s 100 -S 1 -C 4
$ ./dbgen -s 100 -S 2 -C 4
$ ./dbgen -s 100 -S 3 -C 4
$ ./dbgen -s 100 -S 4 -C 4

These can be run sequentially or in 4 separate sessions. If you have more CPU capacity, I suggest using more than 4 parts and/or running them in parallel sessions.

Note that sessions other than the first will each try to regenerate the nation.tbl file, so you’ll be prompted whether to overwrite it. Either Y or N is fine; the same data will be generated regardless.

$ ./dbgen -s 100 -S 2 -C 4
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Do you want to overwrite ./nation.tbl ? [Y/N]: n

When all 4 calls are complete you’ll have the following files.

$ ls *.tbl.*
customer.tbl.1  orders.tbl.1    part.tbl.2
customer.tbl.2  orders.tbl.2    part.tbl.3
customer.tbl.3  orders.tbl.3    part.tbl.4
customer.tbl.4  orders.tbl.4    region.tbl
lineitem.tbl.1  partsupp.tbl.1  supplier.tbl.1
lineitem.tbl.2  partsupp.tbl.2  supplier.tbl.2
lineitem.tbl.3  partsupp.tbl.3  supplier.tbl.3
lineitem.tbl.4  partsupp.tbl.4  supplier.tbl.4
nation.tbl      part.tbl.1

The NATION and REGION tables are small and fixed in size (25 and 5 rows respectively) regardless of the scale value chosen, so they each have a single file and will not be processed in parallel.
The other tables will use PARALLEL 4 with multi-file locations in their corresponding external tables.

CREATE TABLE tpch.ext_part
(
    p_partkey       NUMBER(10, 0),
    p_name          VARCHAR2(55),
    p_mfgr          CHAR(25),
    p_brand         CHAR(10),
    p_type          VARCHAR2(25),
    p_size          INTEGER,
    p_container     CHAR(10),
    p_retailprice   NUMBER,
    p_comment       VARCHAR2(23)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_supplier
(
    s_suppkey     NUMBER(10, 0),
    s_name        CHAR(25),
    s_address     VARCHAR2(40),
    s_nationkey   NUMBER(10, 0),
    s_phone       CHAR(15),
    s_acctbal     NUMBER,
    s_comment     VARCHAR2(101)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_partsupp
(
    ps_partkey      NUMBER(10, 0),
    ps_suppkey      NUMBER(10, 0),
    ps_availqty     INTEGER,
    ps_supplycost   NUMBER,
    ps_comment      VARCHAR2(199)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('partsupp.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_customer
(
    c_custkey      NUMBER(10, 0),
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(40),
    c_nationkey    NUMBER(10, 0),
    c_phone        CHAR(15),
    c_acctbal      NUMBER,
    c_mktsegment   CHAR(10),
    c_comment      VARCHAR2(117)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl*'))
    PARALLEL 4;

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_orders
(
    o_orderkey        NUMBER(10, 0),
    o_custkey         NUMBER(10, 0),
    o_orderstatus     CHAR(1),
    o_totalprice      NUMBER,
    o_orderdate       CHAR(10),
    o_orderpriority   CHAR(15),
    o_clerk           CHAR(15),
    o_shippriority    INTEGER,
    o_comment         VARCHAR2(79)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('orders.tbl*'))
    PARALLEL 4;

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        CHAR(10),
    l_commitdate      CHAR(10),
    l_receiptdate     CHAR(10),
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineitem.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('nation.tbl'));

CREATE TABLE tpch.ext_region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('region.tbl'));
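
Before running the full load, it’s worth a quick check that the external tables can actually read the generated files. This spot check is my own addition rather than a step from the original walkthrough; REGION should return exactly 5 rows.

-- Sanity check (my addition): confirm the external tables can see the dbgen output
SELECT COUNT(*) AS region_rows FROM tpch.ext_region;   -- expect 5

SELECT n_nationkey, n_name
  FROM tpch.ext_nation
 WHERE ROWNUM <= 3;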

In the previous post the target tables weren’t created with a parallel clause, so add that now.

ALTER TABLE tpch.part     PARALLEL 4;
ALTER TABLE tpch.supplier PARALLEL 4;
ALTER TABLE tpch.partsupp PARALLEL 4;
ALTER TABLE tpch.customer PARALLEL 4;
ALTER TABLE tpch.orders   PARALLEL 4;
ALTER TABLE tpch.lineitem PARALLEL 4;
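
If you want to confirm the degree took effect, a quick data dictionary query shows it (again, my own check, not from the original post).

SELECT table_name, degree
  FROM all_tables
 WHERE owner = 'TPCH'
 ORDER BY table_name;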

Clear out any test data from prior runs and load the new data. Use parallel DML to get the fastest possible load.

TRUNCATE TABLE tpch.part;
TRUNCATE TABLE tpch.supplier;
TRUNCATE TABLE tpch.partsupp;
TRUNCATE TABLE tpch.customer;
TRUNCATE TABLE tpch.orders;
TRUNCATE TABLE tpch.lineitem;
TRUNCATE TABLE tpch.nation;
TRUNCATE TABLE tpch.region;

ALTER SESSION SET nls_date_format='YYYY-MM-DD';
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND */ INTO  tpch.part     SELECT * FROM tpch.ext_part;
INSERT /*+ APPEND */ INTO  tpch.supplier SELECT * FROM tpch.ext_supplier;
INSERT /*+ APPEND */ INTO  tpch.partsupp SELECT * FROM tpch.ext_partsupp;
INSERT /*+ APPEND */ INTO  tpch.customer SELECT * FROM tpch.ext_customer;
INSERT /*+ APPEND */ INTO  tpch.orders   SELECT * FROM tpch.ext_orders;
INSERT /*+ APPEND */ INTO  tpch.lineitem SELECT * FROM tpch.ext_lineitem;
INSERT /*+ APPEND */ INTO  tpch.nation   SELECT * FROM tpch.ext_nation;
INSERT /*+ APPEND */ INTO  tpch.region   SELECT * FROM tpch.ext_region;

COMMIT;

As with the dbgen step, you can run these inserts sequentially or in parallel sessions. Because the selects and inserts are themselves parallel, I chose to run them sequentially. If you have the hardware, you can run them concurrently, or simply increase the number of input files and the parallel degree on the tables to match your available CPU capacity.
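
To verify that a given statement really did use parallel DML, you can look at the session’s parallel execution statistics immediately after an insert; the “DML Parallelized” row should be non-zero for the last statement. This is my own sanity check, not part of the original steps.

SELECT statistic, last_query, session_total
  FROM v$pq_sesstat;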

Finally, add the constraints and indexes as shown in the previous post and you’re done.
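
Once the constraints and indexes are in place, a final sanity check (my own habit, not a step from the original posts) is to compare row counts against the TPC-H cardinalities, which scale linearly with the scale factor.

SELECT 'customer' table_name, COUNT(*) row_count FROM tpch.customer   -- expect 150,000 x SF = 15,000,000
UNION ALL
SELECT 'orders', COUNT(*) FROM tpch.orders                            -- expect 1,500,000 x SF = 150,000,000
UNION ALL
SELECT 'lineitem', COUNT(*) FROM tpch.lineitem;                       -- roughly 6,000,000 x SF (not an exact multiple)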

12cR2 finally has an on-premises release!


12.2.0.1 has been released for on-premises installation.

It’s only available for Linux and Solaris so far, but that’s good enough to get some test installations started.

http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html

See you in Las Vegas!


I’m flying out tomorrow for Collaborate 16.
Looking forward to another great conference.

I’m presenting again this year.
I’ll be speaking on Tuesday at 2:15:
“Why Developers Need to Think like DBAs, Why DBAs Need to Think like Developers”
Session 1355 in Jasmine C

Splitting a clob into rows


I’ve used this tool for a wide variety of other parsing projects. One of the interesting tuning techniques I used was to pull the CLOB apart into 32K VARCHAR2 chunks.
It is possible to split the CLOB directly using the DBMS_LOB package or the overloaded SQL functions, but CLOBs are expensive objects. VARCHAR2 variables, on the other hand, are relatively lightweight, which makes the sub-parsing within them much faster. Doing this does take a little care, though, to make sure the chunks don’t accidentally split a line in two.

I also assume that no single line will be more than 32K long, which is fine for this function anyway since the output is a SQL collection with a VARCHAR2 limit of 4000 bytes.
The returned VCARRAY type is a simple table collection type.

CREATE OR REPLACE TYPE vcarray AS TABLE OF VARCHAR2(4000);
/

I wrote this originally in 9i. With 12c’s support for 32K VARCHAR2 in SQL, I may need to revisit it and make a new version.

CREATE OR REPLACE FUNCTION split_clob(p_clob IN CLOB, p_delimiter IN VARCHAR2 DEFAULT CHR(10))
    RETURN vcarray
    PIPELINED
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    c_chunk_limit   CONSTANT INTEGER := 32767;
    v_clob_length            INTEGER := DBMS_LOB.getlength(p_clob);
    v_clob_index             INTEGER;
    v_chunk                  VARCHAR2(32767);
    v_chunk_end              INTEGER;
    v_chunk_length           INTEGER;
    v_chunk_index            INTEGER;
    v_delim_len              INTEGER := LENGTH(p_delimiter);
    v_line_end               INTEGER;
BEGIN
    v_clob_length := DBMS_LOB.getlength(p_clob);
    v_clob_index := 1;

    WHILE v_clob_index <= v_clob_length
    LOOP
        /*
            Pull one 32K chunk off the clob at a time.
            This is because it's MUCH faster to use built in functions
            on a varchar2 type than to use dbms_lob functions on a clob.
        */
        v_chunk := DBMS_LOB.SUBSTR(p_clob, c_chunk_limit, v_clob_index);

        IF v_clob_index > v_clob_length - c_chunk_limit
        THEN
            -- if we walked off the end of the clob,
            -- then the chunk is whatever we picked up at the end,
            -- delimited or not
            v_clob_index := v_clob_length + 1;
        ELSE
            v_chunk_end := INSTR(v_chunk, p_delimiter, -1);

            IF v_chunk_end = 0
            THEN
                DBMS_OUTPUT.put_line('No delimiters found!');
                RETURN;
            END IF;

            v_chunk := SUBSTR(v_chunk, 1, v_chunk_end);
            v_clob_index := v_clob_index + v_chunk_end + v_delim_len - 1;
        END IF;

        /*
            Given a varchar2 chunk split it into lines
        */

        v_chunk_index := 1;
        v_chunk_length := NVL(LENGTH(v_chunk), 0);

        WHILE v_chunk_index <= v_chunk_length
        LOOP
            v_line_end := INSTR(v_chunk, p_delimiter, v_chunk_index);

            IF v_line_end = 0 OR (v_line_end - v_chunk_index) > 4000
            THEN
                PIPE ROW (SUBSTR(v_chunk, v_chunk_index, 4000));
                v_chunk_index := v_chunk_index + 4000;
            ELSE
                PIPE ROW (SUBSTR(v_chunk, v_chunk_index, v_line_end - v_chunk_index));
                v_chunk_index := v_line_end + v_delim_len;
            END IF;
        END LOOP;
    END LOOP;

    RETURN;
EXCEPTION
    WHEN no_data_needed
    THEN
        NULL;
END split_clob;
/
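
For reference, here’s a minimal call pattern (my own example, not from the original article): the pipelined function plugs straight into a TABLE() collection expression and returns one row per line.

-- Example usage (my own): split a small three-line CLOB into rows
SELECT column_value AS line
  FROM TABLE(split_clob(TO_CLOB('alpha' || CHR(10) || 'beta' || CHR(10) || 'gamma')));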

Thank you, thank you, thank you!


A little while ago Oracle announced the winners of the Oracle Database Developer Choice Awards, and I was a winner in both of the categories in which I was nominated:

SQL and PL/SQL

I was surprised and overjoyed when I was notified that I had not only been nominated but named a finalist.
I’m truly humbled by the supportive votes I received.

I’m also inspired to try to give back even more, and I’ve got a few ideas brewing for my next few articles.

Thank you again!

Only a few hours left in the Oracle Database Developer Choice Awards


If you haven’t voted yet, please do so soon! Voting ends at 3pm US/Pacific time.

I’ve been nominated in the SQL and PL/SQL categories

https://community.oracle.com/community/database/awards/sql-voting
https://community.oracle.com/community/database/awards/plsql-voting

A day late but still fun


Yesterday Kim Berg Hansen posted a couple of entries on his blog about scraping and parsing the Oracle Developer Choice Awards voting pages and building summaries of the results.

I’ve been doing something similar since the voting began, but took a slightly different approach. I like the idea of storing the results in a table to build up a history, but I never did that; I simply looked at a snapshot in time and didn’t bother to keep trends.

Kim took two approaches: one using an APEX call, which he said didn’t work very reliably, and another where he simply pastes the HTML in himself.

My method is similar, but I use UTL_HTTP calls to extract the HTML.
Also, my function returns the raw HTML and I do the cleanup in my SQL statement, whereas he cleans his up within his scraping function.
Since he’s storing the results, that makes sense: why incur the cost of cleanup each time he reads his table?

Here is my version of the scraping function:

CREATE OR REPLACE FUNCTION read_oracle_dev_choice(p_category IN VARCHAR2)
    RETURN CLOB
IS
    v_http_request    UTL_HTTP.req;
    v_http_response   UTL_HTTP.resp;
    v_buffer          VARCHAR2(32767);
    v_clob            CLOB;
    v_req_ctx         UTL_HTTP.request_context_key;
BEGIN
    v_req_ctx := UTL_HTTP.create_request_context('file:/home/oracle/devchoice', 'pa55w0rd', FALSE);

    v_http_request :=
        UTL_HTTP.begin_request('https://community.oracle.com/community/database/awards/' || p_category || '-voting/',
                               'GET',
                               'HTTP/1.1',
                               v_req_ctx);

    v_http_response := UTL_HTTP.get_response(v_http_request);
    --DBMS_OUTPUT.put_line('Response status code: ' || v_http_response.status_code);
    --DBMS_OUTPUT.put_line('Response reason phrase: ' || v_http_response.reason_phrase);
    --DBMS_OUTPUT.put_line('Response HTTP version: ' || v_http_response.http_version);

    DBMS_LOB.createtemporary(v_clob, TRUE);

    BEGIN
        LOOP
            UTL_HTTP.read_text(v_http_response, v_buffer, 32767);
            DBMS_LOB.writeappend(v_clob, LENGTH(v_buffer), v_buffer);
        END LOOP;
    EXCEPTION
        WHEN UTL_HTTP.end_of_body
        THEN
            UTL_HTTP.end_response(v_http_response);
    END;

    UTL_HTTP.destroy_request_context(v_req_ctx);

    RETURN v_clob;
END;
/
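
As a reminder of the prerequisites, the network ACL grant would look roughly like the sketch below. This uses the 12c DBMS_NETWORK_ACL_ADMIN host-ACE API (on 11g you’d use CREATE_ACL/ASSIGN_ACL instead), and YOUR_SCHEMA is just a placeholder for whichever account owns the function; it is not code from the original post.

BEGIN
    -- Sketch only: YOUR_SCHEMA is a placeholder for the account that owns the function
    DBMS_NETWORK_ACL_ADMIN.append_host_ace(
        host       => 'community.oracle.com',
        lower_port => 443,
        upper_port => 443,
        ace        => xs$ace_type(privilege_list => xs$name_list('connect', 'resolve'),
                                  principal_name => 'YOUR_SCHEMA',
                                  principal_type => xs_acl.ptype_db));
END;
/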

Once this scraping function has been created and you have your corresponding ACL and wallet entries, you can extract the results with a single SQL statement.

  SELECT *
    FROM (SELECT x.*, RANK() OVER(PARTITION BY category ORDER BY (score_rank + upvote_rank + uppct_rank)) avg_rank
            FROM (SELECT x.*,
                         RANK() OVER(PARTITION BY category ORDER BY score DESC) score_rank,
                         RANK() OVER(PARTITION BY category ORDER BY upvotes DESC) upvote_rank,
                         RANK() OVER(PARTITION BY category ORDER BY uppct DESC) uppct_rank
                    FROM (SELECT x.*, ROUND(100 * upvotes / totalvotes, 2) uppct
                            FROM (SELECT category,
                                         finalist,
                                         score,
                                         (score + 10 * votes) / 20 upvotes,
                                         (10 * votes - score) / 20 downvotes,
                                         votes totalvotes
                                    FROM (   SELECT cat category,
                                                    TO_NUMBER(REGEXP_SUBSTR(score, '[0-9]+')) score,
                                                    TO_NUMBER(REGEXP_SUBSTR(votes, '[0-9]+')) votes,
                                                    TRIM(finalist) finalist
                                               FROM (SELECT 'sql' cat, read_oracle_dev_choice('sql') html FROM DUAL
                                                     UNION ALL
                                                     SELECT 'plsql', read_oracle_dev_choice('plsql') FROM DUAL
                                                     UNION ALL
                                                     SELECT 'apex', read_oracle_dev_choice('apex') FROM DUAL
                                                     UNION ALL
                                                     SELECT 'db-design', read_oracle_dev_choice('db-design') FROM DUAL
                                                     UNION ALL
                                                     SELECT 'ords', read_oracle_dev_choice('ords') FROM DUAL),
                                                    XMLTABLE(
                                                        '//div[@class="jive-content-ideas-list-item clearfix"]'
                                                        PASSING xmltype(
                                                                    REGEXP_REPLACE(
                                                                        REGEXP_REPLACE(
                                                                            REGEXP_REPLACE(
                                                                                html,
                                                                                '^.+?
', '', 1, 0, 'n'),
                                                                             '.+$', '', 1, 0, 'n'),
                                                                         CHR(38) || '[^;]+?;'))
                                                         COLUMNS
                                                             score    VARCHAR2(20) PATH './div/div/strong',
                                                             votes    VARCHAR2(20) PATH './div/span/span[@class="idea-vote-count"]',
                                                             finalist VARCHAR2(20) PATH './div[@class="jive-content"]/div/div[@class="jive-content-title"]') x)) x) x) x)
  ORDER BY category DESC, score DESC;

And this produces results like this:


CATEGORY  FINALIST                  SCORE    UPVOTES  DOWNVOTES TOTALVOTES      UPPCT SCORE_RANK UPVOTE_RANK UPPCT_RANK   AVG_RANK
--------- -------------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
sql       Emrah Mete                 3010        335         34        369      90.79          1           1          1          1
sql       Sayan Malakshinov          1780        210         32        242      86.78          2           2          2          2
sql       Sean Stuber                 790         98         19        117      83.76          3           3          3          3
sql       Kim Berg Hansen             680         86         18        104      82.69          4           4          4          4
sql       Erik Van Roon               400         55         15         70      78.57          5           5          6          5
sql       Justin Cave                 300         44         14         58      75.86          6           6          8          7
sql       Matthias Rogel              290         38          9         47      80.85          7           7          5          6
sql       Stew Ashton                 260         38         12         50         76          8           7          7          8
plsql     Adrian Billington          1080        112          4        116      96.55          1           2          1          1
plsql     Roger Troller              1030        117         14        131      89.31          2           1          4          2
plsql     Sean Stuber                 910         97          6        103      94.17          3           3          2          3
plsql     Patrick Barel               690         79         10         89      88.76          4           4          5          4
plsql     Morten Braten               620         68          6         74      91.89          5           5          3          4
plsql     Kim Berg Hansen             440         51          7         58      87.93          6           6          7          6
plsql     Bill Coulam                 400         46          6         52      88.46          7           7          6          7
ords      Dietmar Aust               1240        128          4        132      96.97          1           1          1          1
ords      Dimitri Gielis              800         86          6         92      93.48          2           2          2          2
ords      Morten Braten               430         50          7         57      87.72          3           3          4          3
ords      Kiran Pawar                 350         39          4         43       90.7          4           4          3          4
ords      Anton Nielsen               240         28          4         32       87.5          5           5          5          5
ords      Tim St. Hilaire             130         16          3         19      84.21          6           6          6          6
db-design Heli Helskyaho             1030        119         16        135      88.15          1           1          2          1
db-design Michelle Kolbe              630         75         12         87      86.21          2           2          3          2
db-design Rob Lockard                 520         57          5         62      91.94          3           3          1          2
db-design Mark Hoxey                  160         23          7         30      76.67          4           4          4          4
apex      Jari Laine                  720         78          6         84      92.86          1           1          4          1
apex      Morten Braten               680         73          5         78      93.59          2           2          3          2
apex      Juergen Schuster            560         58          2         60      96.67          3           3          1          2
apex      Kiran Pawar                 430         48          5         53      90.57          4           4          5          5
apex      Karen Cannell               280         30          2         32      93.75          5           5          2          4
apex      Paul MacMillan              130         21          8         29      72.41          6           6          7          6
apex      Trent Schafer               120         17          5         22      77.27          7           7          6          7

I hope you find this helpful and a fun exercise to pursue.

If you haven’t voted, I encourage you to do so here…

I’m a finalist in the SQL and PL/SQL categories and would, of course, appreciate a vote, but be sure to check the other categories too and vote up the finalists there who have helped you in some way.
Also note that you can vote up more than one person in a category, so by all means vote up as many as you want, including the other finalists in my categories.

p.s. – an interesting “feature” of the web scraping is that the URL works better when it includes a trailing slash “/”. I have no idea why this should help, but without it the returned CLOB will sometimes (often) contain “Unexpected Error Occurred” rather than the expected results. Occasionally the reverse happens, though. If anyone can explain why this is, I would appreciate it.
