Setting up the Star Schema Benchmark (SSB) in Oracle


In my previous two posts I showed how to set up a schema for the TPC-H tables and test data. A related test system called the Star Schema Benchmark (SSB), from Pat O’Neil, Betty O’Neil, and Xuedong Chen at the University of Massachusetts at Boston, alters the TPC-H structures to create a warehousing data model.

The SSB is documented here. As with the TPC-H setup, the data is created with a modified version of the dbgen utility, which can also be found on the UMass site here.

Unzip the dbgen file somewhere on your linux system. As of this writing, the June 5, 2009 Revision 3 is the current version of the SSB. The steps may change slightly in the future but should remain fairly similar to what is described below.
When you unzip, a dbgen directory will be created. Change to that directory, make a copy of the makefile template, and edit it. The SSB version of dbgen doesn’t support Oracle as a DATABASE setting, but that doesn’t matter here: only query generation depends on the database, not data generation. So you can pick any of the supported databases for the purposes of data creation.

$ cd dbgen
$ cp makefile.suite makefile
$ vi makefile

Within the makefile, change the following lines

CC = gcc
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = SSBM

Then make will create the dbgen utility. You’ll get several compiler warnings, but since we’re not distributing the resulting binary and control its usage, they’re safe to ignore.

Oracle offers a training lab for the 12c In-Memory option utilizing the SSB schema. Their training guide states the test system is based on a 50GB scale, but the query results illustrated in the guide show only a 4GB data set, so I’ll demonstrate the same here.
We’ll generate a small, approximately 4GB, set of test data and place it in a directory we’ll use later for the upload into the Oracle tables. Unlike the TPC-H dbgen, you must generate each file type individually.

$ ./dbgen -s 4 -T c
$ ./dbgen -s 4 -T p
$ ./dbgen -s 4 -T s
$ ./dbgen -s 4 -T d
$ ./dbgen -s 4 -T l
$ mv *.tbl /home/oracle/ssb

Within the database, set up the SSB schema. All tables will be created according to the layouts described in section 2 of the SSB specification. For the purposes of data import, a set of external tables will also be created; these are not part of the SSB itself and may be left in place or dropped after the data load is complete. The SSB specification describes primary keys and foreign keys but no check constraints. The TPC-H schema allows NOT NULL declarations, so I’ve included them here as well; if you prefer, simply remove the “NOT NULL” in the DDL below to allow nulls, though the dbgen utility will populate every column regardless. The descriptions of a couple of tables have errors, which I’ve noted in the comments below. One is a partially documented but seemingly unused column, which I have removed. The other is in the date table, which defines a day-of-week column of 8 characters even though dbgen creates some day names 9 letters long (“Wednesday”).

One final change: the “DATE” table has been renamed “DATE_DIM” since DATE is an Oracle keyword. This change also makes the schema compatible with the Oracle In-Memory lab.

CREATE USER ssb IDENTIFIED BY ssb;

GRANT CREATE SESSION,
      CREATE TABLE,
      CREATE ANY DIRECTORY,
      UNLIMITED TABLESPACE
    TO ssb;

CREATE OR REPLACE DIRECTORY ssb_dir AS '/home/oracle/ssb';

GRANT READ, WRITE ON DIRECTORY ssb_dir TO ssb;

CREATE TABLE ssb.ext_lineorder
(
    lo_orderkey        INTEGER,
    lo_linenumber      NUMBER(1, 0),
    lo_custkey         INTEGER,
    lo_partkey         INTEGER,
    lo_suppkey         INTEGER,
    lo_orderdate       INTEGER,
    lo_orderpriority   CHAR(15),
    lo_shippriority    CHAR(1),
    lo_quantity        NUMBER(2, 0),
    lo_extendedprice   NUMBER,
    lo_ordtotalprice   NUMBER,
    lo_discount        NUMBER(2, 0),
    lo_revenue         NUMBER,
    lo_supplycost      NUMBER,
    --lo_ordsupplycost   NUMBER, -- this is mentioned in 2.2 Notes(c) but isn't in the layout or sample queries, so not needed?
    lo_tax             NUMBER(1, 0),
    lo_commitdate      INTEGER,
    lo_shipmode        CHAR(10)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineorder.tbl*'))
          PARALLEL 4;

CREATE TABLE ssb.lineorder
(
    lo_orderkey        INTEGER NOT NULL,
    lo_linenumber      NUMBER(1, 0) NOT NULL,
    lo_custkey         INTEGER NOT NULL,
    lo_partkey         INTEGER NOT NULL,
    lo_suppkey         INTEGER NOT NULL,
    lo_orderdate       NUMBER(8,0) NOT NULL,
    lo_orderpriority   CHAR(15) NOT NULL,
    lo_shippriority    CHAR(1) NOT NULL,
    lo_quantity        NUMBER(2, 0) NOT NULL,
    lo_extendedprice   NUMBER NOT NULL,
    lo_ordtotalprice   NUMBER NOT NULL,
    lo_discount        NUMBER(2, 0) NOT NULL,
    lo_revenue         NUMBER NOT NULL,
    lo_supplycost      NUMBER NOT NULL,
    --lo_ordsupplycost   NUMBER not null, -- this is mentioned in 2.2 Notes(c) but isn't in the layout or sample queries, so not needed?
    lo_tax             NUMBER(1, 0) NOT NULL,
    lo_commitdate      NUMBER(8,0) NOT NULL,
    lo_shipmode        CHAR(10) NOT NULL
);

CREATE TABLE ssb.ext_part
(
    p_partkey     INTEGER,
    p_name        VARCHAR2(22),
    p_mfgr        CHAR(6),
    p_category    CHAR(7),
    p_brand1      CHAR(9),
    p_color       VARCHAR2(11),
    p_type        VARCHAR2(25),
    p_size        NUMBER(2, 0),
    p_container   CHAR(10)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl'));

CREATE TABLE ssb.part
(
    p_partkey     INTEGER NOT NULL,
    p_name        VARCHAR2(22) NOT NULL,
    p_mfgr        CHAR(6) NOT NULL,
    p_category    CHAR(7) NOT NULL,
    p_brand1      CHAR(9) NOT NULL,
    p_color       VARCHAR2(11) NOT NULL,
    p_type        VARCHAR2(25) NOT NULL,
    p_size        NUMBER(2, 0) NOT NULL,
    p_container   CHAR(10) NOT NULL
);

CREATE TABLE ssb.ext_supplier
(
    s_suppkey   INTEGER,
    s_name      CHAR(25),
    s_address   VARCHAR2(25),
    s_city      CHAR(10),
    s_nation    CHAR(15),
    s_region    CHAR(12),
    s_phone     CHAR(15)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl'));

CREATE TABLE ssb.supplier
(
    s_suppkey   INTEGER NOT NULL,
    s_name      CHAR(25) NOT NULL,
    s_address   VARCHAR2(25) NOT NULL,
    s_city      CHAR(10) NOT NULL,
    s_nation    CHAR(15) NOT NULL,
    s_region    CHAR(12) NOT NULL,
    s_phone     CHAR(15) NOT NULL
);

CREATE TABLE ssb.ext_customer
(
    c_custkey      INTEGER,
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(25),
    c_city         CHAR(10),
    c_nation       CHAR(15),
    c_region       CHAR(12),
    c_phone        CHAR(15),
    c_mktsegment   CHAR(10)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl'));

CREATE TABLE ssb.customer
(
    c_custkey      INTEGER NOT NULL,
    c_name         VARCHAR2(25) NOT NULL,
    c_address      VARCHAR2(25) NOT NULL,
    c_city         CHAR(10) NOT NULL,
    c_nation       CHAR(15) NOT NULL,
    c_region       CHAR(12) NOT NULL,
    c_phone        CHAR(15) NOT NULL,
    c_mktsegment   CHAR(10) NOT NULL
);

CREATE TABLE ssb.ext_date_dim
(
    d_datekey            NUMBER(8,0),
    d_date               CHAR(18),
    d_dayofweek          CHAR(9),    -- defined in Section 2.6 as Size 8, but Wednesday is 9 letters
    d_month              CHAR(9),
    d_year               NUMBER(4, 0),
    d_yearmonthnum       NUMBER(6, 0),
    d_yearmonth          CHAR(7),
    d_daynuminweek       NUMBER(1, 0),
    d_daynuminmonth      NUMBER(2, 0),
    d_daynuminyear       NUMBER(3, 0),
    d_monthnuminyear     NUMBER(2, 0),
    d_weeknuminyear      NUMBER(2, 0),
    d_sellingseason      CHAR(12),
    d_lastdayinweekfl    NUMBER(1, 0),
    d_lastdayinmonthfl   NUMBER(1, 0),
    d_holidayfl          NUMBER(1, 0),
    d_weekdayfl          NUMBER(1, 0)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY ssb_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('date.tbl'));

CREATE TABLE ssb.date_dim
(
    d_datekey            NUMBER(8,0) NOT NULL,
    d_date               CHAR(18) NOT NULL,
    d_dayofweek          CHAR(9) NOT NULL,    -- defined in Section 2.6 as Size 8, but Wednesday is 9 letters
    d_month              CHAR(9) NOT NULL,
    d_year               NUMBER(4, 0) NOT NULL,
    d_yearmonthnum       NUMBER(6, 0) NOT NULL,
    d_yearmonth          CHAR(7) NOT NULL,
    d_daynuminweek       NUMBER(1, 0) NOT NULL,
    d_daynuminmonth      NUMBER(2, 0) NOT NULL,
    d_daynuminyear       NUMBER(3, 0) NOT NULL,
    d_monthnuminyear     NUMBER(2, 0) NOT NULL,
    d_weeknuminyear      NUMBER(2, 0) NOT NULL,
    d_sellingseason      CHAR(12) NOT NULL,
    d_lastdayinweekfl    NUMBER(1, 0) NOT NULL,
    d_lastdayinmonthfl   NUMBER(1, 0) NOT NULL,
    d_holidayfl          NUMBER(1, 0) NOT NULL,
    d_weekdayfl          NUMBER(1, 0) NOT NULL
);
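
Before running the full load, a quick probe of one of the external tables confirms the directory, grants, and data files are all in order; this is just a sanity check and not part of the benchmark:

SELECT COUNT(*) FROM ssb.ext_date_dim;   -- the date table holds 7 years of days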

Now load the data. As you scale up into larger volumes these steps remain valid, but you may want to split the loads into separate steps, alter the LINEORDER external table to read multiple files in parallel, and use parallel DML on insert to speed up the process. The TRUNCATE lines aren’t necessary for the first data load but are included for future reloads of the dbgen data at other scales.

TRUNCATE TABLE ssb.lineorder;
TRUNCATE TABLE ssb.part;
TRUNCATE TABLE ssb.supplier;
TRUNCATE TABLE ssb.customer;
TRUNCATE TABLE ssb.date_dim;

ALTER TABLE ssb.lineorder PARALLEL 4;
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND */ INTO  ssb.part      SELECT * FROM ssb.ext_part;
commit;
INSERT /*+ APPEND */ INTO  ssb.supplier  SELECT * FROM ssb.ext_supplier;
commit;
INSERT /*+ APPEND */ INTO  ssb.customer  SELECT * FROM ssb.ext_customer;
commit;
INSERT /*+ APPEND */ INTO  ssb.date_dim  SELECT * FROM ssb.ext_date_dim;
commit;
INSERT /*+ APPEND */ INTO  ssb.lineorder SELECT * FROM ssb.ext_lineorder;
commit;
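
Before adding constraints, it’s worth a quick sanity check that each table loaded; exact counts depend on the scale factor, but at scale 4 lineorder should be in the tens of millions of rows with each dimension table far smaller:

SELECT 'lineorder' tname, COUNT(*) cnt FROM ssb.lineorder
UNION ALL SELECT 'part',     COUNT(*) FROM ssb.part
UNION ALL SELECT 'supplier', COUNT(*) FROM ssb.supplier
UNION ALL SELECT 'customer', COUNT(*) FROM ssb.customer
UNION ALL SELECT 'date_dim', COUNT(*) FROM ssb.date_dim;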

And finally, add the constraints and indexes.

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT pk_lineorder PRIMARY KEY(lo_orderkey, lo_linenumber);

ALTER TABLE ssb.part
    ADD CONSTRAINT pk_part PRIMARY KEY(p_partkey);

ALTER TABLE ssb.supplier
    ADD CONSTRAINT pk_supplier PRIMARY KEY(s_suppkey);

ALTER TABLE ssb.customer
    ADD CONSTRAINT pk_customer PRIMARY KEY(c_custkey);

ALTER TABLE ssb.date_dim
    ADD CONSTRAINT pk_date_dim PRIMARY KEY(d_datekey);

---

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_customer FOREIGN KEY(lo_custkey) REFERENCES ssb.customer(c_custkey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_part FOREIGN KEY(lo_partkey) REFERENCES ssb.part(p_partkey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_supplier FOREIGN KEY(lo_suppkey) REFERENCES ssb.supplier(s_suppkey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_orderdate FOREIGN KEY(lo_orderdate) REFERENCES ssb.date_dim(d_datekey);

ALTER TABLE ssb.lineorder
    ADD CONSTRAINT fk_lineitem_commitdate FOREIGN KEY(lo_commitdate) REFERENCES ssb.date_dim(d_datekey);

And that’s it, you should now have a complete SSB scale-4 data set to complete either the SSB suite of test queries, oracle labs, or run your own tests.

If you want to generate larger data sets, you can, with syntax similar to that used for TPC-H. The DDL and inserts above are already defined to handle either parallel or single-file loads of the lineorder table. The other tables are relatively small in comparison; they can still be split if desired, but you probably won’t need to.

$ ./dbgen -s 10 -T c
$ ./dbgen -s 10 -T p
$ ./dbgen -s 10 -T s
$ ./dbgen -s 10 -T d
$ ./dbgen -s 10 -T l -C 4 -S 1
$ ./dbgen -s 10 -T l -C 4 -S 2
$ ./dbgen -s 10 -T l -C 4 -S 3
$ ./dbgen -s 10 -T l -C 4 -S 4
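
Because the ext_lineorder table above uses the wildcard LOCATION('lineorder.tbl*'), the split files are picked up automatically once they’re moved into the directory; no DDL change is needed for four parts. If you generate more parts, you may also want to raise the parallel degree to match, for example:

ALTER TABLE ssb.ext_lineorder PARALLEL 8;
ALTER TABLE ssb.lineorder PARALLEL 8;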

Enjoy!

Setting up TPC-H test data with Oracle on Linux (Part 2 – Large data sets)


In my previous post I showed how to set up data loads for the TPC-H test data. All of the steps provided should work regardless of the data volume. However, for large data sets you may want to parallelize the data generation and data import in order to speed up the process.

The user, directory, permissions, target tables, constraints, and indexes from the previous post will be the same. As with the small data sets the constraints and indexes won’t be added until after the data is loaded.

As with the small sets, dbgen is used, but instead of scale 4 I’ll use scale 100 (approximately 100GB of data).
I’ll split the generation into 4 parts for parallelization, which means calling dbgen 4 times, once for each part.

$ ./dbgen -s 100 -S 1 -C 4
$ ./dbgen -s 100 -S 2 -C 4
$ ./dbgen -s 100 -S 3 -C 4
$ ./dbgen -s 100 -S 4 -C 4

These can be run sequentially or in 4 separate sessions. If you have more CPU capacity, I suggest running with more than 4 parts and/or running them in parallel sessions.

Note: runs other than the first will each try to recreate the nation.tbl file, so you’ll be prompted whether to replace it. Y or N is fine; the same data will be generated regardless.

$ ./dbgen -s 100 -S 2 -C 4
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Do you want to overwrite ./nation.tbl ? [Y/N]: n
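
If you’re scripting the generation, the prompt gets in the way. The dbgen builds I’ve used also accept a -f flag to force overwriting existing files; assuming your build supports it, all four parts can be launched as background jobs in one line:

$ for i in 1 2 3 4; do ./dbgen -s 100 -S $i -C 4 -f & done; wait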

When all 4 calls are complete you’ll have the following files.

$ ls *.tbl.*
customer.tbl.1  orders.tbl.1    part.tbl.2
customer.tbl.2  orders.tbl.2    part.tbl.3
customer.tbl.3  orders.tbl.3    part.tbl.4
customer.tbl.4  orders.tbl.4    region.tbl
lineitem.tbl.1  partsupp.tbl.1  supplier.tbl.1
lineitem.tbl.2  partsupp.tbl.2  supplier.tbl.2
lineitem.tbl.3  partsupp.tbl.3  supplier.tbl.3
lineitem.tbl.4  partsupp.tbl.4  supplier.tbl.4
nation.tbl      part.tbl.1

The NATION and REGION tables are small and fixed in size (25 and 5 rows respectively) regardless of the scale value chosen, so they have only one file each and will not be processed in parallel.
The other tables will use PARALLEL 4 with multi-file wildcard locations in their corresponding external tables.

CREATE TABLE tpch.ext_part
(
    p_partkey       NUMBER(10, 0),
    p_name          VARCHAR2(55),
    p_mfgr          CHAR(25),
    p_brand         CHAR(10),
    p_type          VARCHAR2(25),
    p_size          INTEGER,
    p_container     CHAR(10),
    p_retailprice   NUMBER,
    p_comment       VARCHAR2(23)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_supplier
(
    s_suppkey     NUMBER(10, 0),
    s_name        CHAR(25),
    s_address     VARCHAR2(40),
    s_nationkey   NUMBER(10, 0),
    s_phone       CHAR(15),
    s_acctbal     NUMBER,
    s_comment     VARCHAR2(101)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_partsupp
(
    ps_partkey      NUMBER(10, 0),
    ps_suppkey      NUMBER(10, 0),
    ps_availqty     INTEGER,
    ps_supplycost   NUMBER,
    ps_comment      VARCHAR2(199)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('partsupp.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_customer
(
    c_custkey      NUMBER(10, 0),
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(40),
    c_nationkey    NUMBER(10, 0),
    c_phone        CHAR(15),
    c_acctbal      NUMBER,
    c_mktsegment   CHAR(10),
    c_comment      VARCHAR2(117)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl*'))
    PARALLEL 4;

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_orders
(
    o_orderkey        NUMBER(10, 0),
    o_custkey         NUMBER(10, 0),
    o_orderstatus     CHAR(1),
    o_totalprice      NUMBER,
    o_orderdate       CHAR(10),
    o_orderpriority   CHAR(15),
    o_clerk           CHAR(15),
    o_shippriority    INTEGER,
    o_comment         VARCHAR2(79)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('orders.tbl*'))
    PARALLEL 4;

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        CHAR(10),
    l_commitdate      CHAR(10),
    l_receiptdate     CHAR(10),
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineitem.tbl*'))
    PARALLEL 4;

CREATE TABLE tpch.ext_nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('nation.tbl'));

CREATE TABLE tpch.ext_region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('region.tbl'));

In the previous post the target tables weren’t created with a parallel clause. So, add that now.

ALTER TABLE tpch.part     PARALLEL 4;
ALTER TABLE tpch.supplier PARALLEL 4;
ALTER TABLE tpch.partsupp PARALLEL 4;
ALTER TABLE tpch.customer PARALLEL 4;
ALTER TABLE tpch.orders   PARALLEL 4;
ALTER TABLE tpch.lineitem PARALLEL 4;

Clear out any test data from prior runs and load the new data. Use parallel dml to ensure maximum speed on load.

TRUNCATE TABLE tpch.part;
TRUNCATE TABLE tpch.supplier;
TRUNCATE TABLE tpch.partsupp;
TRUNCATE TABLE tpch.customer;
TRUNCATE TABLE tpch.orders;
TRUNCATE TABLE tpch.lineitem;
TRUNCATE TABLE tpch.nation;
TRUNCATE TABLE tpch.region;

ALTER SESSION SET nls_date_format='YYYY-MM-DD';
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND */ INTO  tpch.part     SELECT * FROM tpch.ext_part;
INSERT /*+ APPEND */ INTO  tpch.supplier SELECT * FROM tpch.ext_supplier;
INSERT /*+ APPEND */ INTO  tpch.partsupp SELECT * FROM tpch.ext_partsupp;
INSERT /*+ APPEND */ INTO  tpch.customer SELECT * FROM tpch.ext_customer;
INSERT /*+ APPEND */ INTO  tpch.orders   SELECT * FROM tpch.ext_orders;
INSERT /*+ APPEND */ INTO  tpch.lineitem SELECT * FROM tpch.ext_lineitem;
INSERT /*+ APPEND */ INTO  tpch.nation   SELECT * FROM tpch.ext_nation;
INSERT /*+ APPEND */ INTO  tpch.region   SELECT * FROM tpch.ext_region;
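
Once the loads complete, the fixed-cardinality tables can be checked against the counts defined in the spec: at scale 100 that’s 15,000,000 customers, 150,000,000 orders, 20,000,000 parts, 1,000,000 suppliers, and 80,000,000 partsupp rows, with lineitem at roughly 600,000,000 (its cardinality varies slightly by design). For example:

SELECT COUNT(*) FROM tpch.customer;   -- expect 15,000,000 at scale 100
SELECT COUNT(*) FROM tpch.orders;     -- expect 150,000,000
SELECT COUNT(*) FROM tpch.lineitem;   -- roughly 600,000,000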

As with the dbgen step, you can run the insert statements sequentially or in parallel sessions. Because the selects and inserts will all run in parallel internally, I chose to run the statements sequentially. If you have the hardware to run in parallel sessions, you can do so, or simply increase the number of input files and the parallel setting on the tables to match your available CPU capacity.

Finally, add the constraints and indexes as shown in the previous post and you’re done.

Setting up TPC-H test data with Oracle on Linux (Part 1 – Small data sets)


While Oracle has its own SCOTT, SH, OE, HR, etc. sample schemas, it’s often useful to be able to scale data volumes for different experiments. The TPC-H schema and sample data sets provide a convenient means of doing so. They are especially helpful in scenarios such as this one, blogging, where readers may have a database but not the same data. Setting up the TPC-H data is fairly simple.

First, go to the TPC home, click Downloads, and select the tools zip file. You may also want to read the PDF file documenting the schema and the data scales. Unzip the file somewhere on your linux system. As of this writing, 2.17.2 is the current version. The steps may change slightly in the future but should remain fairly similar to what is described below.
When you unzip, a directory 2.17.2 will be created. Change to that directory, make a copy of the makefile template, and edit it.

$ cd 2.17.2/dbgen
$ cp makefile.suite makefile
$ vi makefile

Within the makefile, change the following lines

CC = gcc
DATABASE= ORACLE
MACHINE = LINUX
WORKLOAD = TPCH

Then make will create the dbgen utility.

$ make
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o build.o build.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o driver.o driver.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o bm_utils.o bm_utils.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o rnd.o rnd.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o print.o print.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o load_stub.o load_stub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o bcd2.o bcd2.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o speed_seed.o speed_seed.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o text.o text.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o permute.o permute.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o rng64.o rng64.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64  -O -o dbgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o -lm
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o qgen.o qgen.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o varsub.o varsub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64  -O -o qgen build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm

We’ll generate a small, approximately 4GB, set of test data and place it in a directory we’ll use later for the upload into the Oracle tables.

$ ./dbgen -s 4
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
$ ls *.tbl
customer.tbl  lineitem.tbl  nation.tbl  orders.tbl  partsupp.tbl  part.tbl  region.tbl  supplier.tbl
$ mv *.tbl /home/oracle/tpch

Within the database, set up the tpch schema. All tables will be created according to the layouts described in section 1.4.1 of the TPC-H standard specification. For the purposes of data import, a set of external tables will also be created; these are not part of TPC-H itself and may be left in place or dropped after the data load is complete. Also, per section 1.4.2, constraints are optional; I’m including the allowable primary key, foreign key, not null, and check constraints described in the 1.4.2 subsections. Other than the default indexes created to support the primary key constraints, no other indexes are included in the steps below. For efficiency of data loading, the constraints and indexes will be added after the load is complete.

CREATE USER tpch IDENTIFIED BY tpch;

GRANT CREATE SESSION,
      CREATE TABLE,
      UNLIMITED TABLESPACE
    TO tpch;

CREATE OR REPLACE DIRECTORY tpch_dir AS '/home/oracle/tpch';

GRANT READ, WRITE ON DIRECTORY tpch_dir TO tpch;  -- WRITE lets the access driver create its log files

-- 1.4.1
--  per 1.4.2.1  all table columns may be defined NOT NULL

CREATE TABLE tpch.ext_part
(
    p_partkey       NUMBER(10, 0),
    p_name          VARCHAR2(55),
    p_mfgr          CHAR(25),
    p_brand         CHAR(10),
    p_type          VARCHAR2(25),
    p_size          INTEGER,
    p_container     CHAR(10),
    p_retailprice   NUMBER,
    p_comment       VARCHAR2(23)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('part.tbl'));

CREATE TABLE tpch.part
(
    p_partkey       NUMBER(10, 0) NOT NULL,
    p_name          VARCHAR2(55) NOT NULL,
    p_mfgr          CHAR(25) NOT NULL,
    p_brand         CHAR(10) NOT NULL,
    p_type          VARCHAR2(25) NOT NULL,
    p_size          INTEGER NOT NULL,
    p_container     CHAR(10) NOT NULL,
    p_retailprice   NUMBER NOT NULL,
    p_comment       VARCHAR2(23) NOT NULL
);


CREATE TABLE tpch.ext_supplier
(
    s_suppkey     NUMBER(10, 0),
    s_name        CHAR(25),
    s_address     VARCHAR2(40),
    s_nationkey   NUMBER(10, 0),
    s_phone       CHAR(15),
    s_acctbal     NUMBER,
    s_comment     VARCHAR2(101)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('supplier.tbl'));

CREATE TABLE tpch.supplier
(
    s_suppkey     NUMBER(10, 0) NOT NULL,
    s_name        CHAR(25) NOT NULL,
    s_address     VARCHAR2(40) NOT NULL,
    s_nationkey   NUMBER(10, 0) NOT NULL,
    s_phone       CHAR(15) NOT NULL,
    s_acctbal     NUMBER NOT NULL,
    s_comment     VARCHAR2(101) NOT NULL
);

CREATE TABLE tpch.ext_partsupp
(
    ps_partkey      NUMBER(10, 0),
    ps_suppkey      NUMBER(10, 0),
    ps_availqty     INTEGER,
    ps_supplycost   NUMBER,
    ps_comment      VARCHAR2(199)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('partsupp.tbl'));

CREATE TABLE tpch.partsupp
(
    ps_partkey      NUMBER(10, 0) NOT NULL,
    ps_suppkey      NUMBER(10, 0) NOT NULL,
    ps_availqty     INTEGER NOT NULL,
    ps_supplycost   NUMBER NOT NULL,
    ps_comment      VARCHAR2(199) NOT NULL
);

CREATE TABLE tpch.ext_customer
(
    c_custkey      NUMBER(10, 0),
    c_name         VARCHAR2(25),
    c_address      VARCHAR2(40),
    c_nationkey    NUMBER(10, 0),
    c_phone        CHAR(15),
    c_acctbal      NUMBER,
    c_mktsegment   CHAR(10),
    c_comment      VARCHAR2(117)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('customer.tbl'));

CREATE TABLE tpch.customer
(
    c_custkey      NUMBER(10, 0) NOT NULL,
    c_name         VARCHAR2(25) NOT NULL,
    c_address      VARCHAR2(40) NOT NULL,
    c_nationkey    NUMBER(10, 0) NOT NULL,
    c_phone        CHAR(15) NOT NULL,
    c_acctbal      NUMBER NOT NULL,
    c_mktsegment   CHAR(10) NOT NULL,
    c_comment      VARCHAR2(117) NOT NULL
);

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_orders
(
    o_orderkey        NUMBER(10, 0),
    o_custkey         NUMBER(10, 0),
    o_orderstatus     CHAR(1),
    o_totalprice      NUMBER,
    o_orderdate       CHAR(10),
    o_orderpriority   CHAR(15),
    o_clerk           CHAR(15),
    o_shippriority    INTEGER,
    o_comment         VARCHAR2(79)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('orders.tbl'));

CREATE TABLE tpch.orders
(
    o_orderkey        NUMBER(10, 0) NOT NULL,
    o_custkey         NUMBER(10, 0) NOT NULL,
    o_orderstatus     CHAR(1) NOT NULL,
    o_totalprice      NUMBER NOT NULL,
    o_orderdate       DATE NOT NULL,
    o_orderpriority   CHAR(15) NOT NULL,
    o_clerk           CHAR(15) NOT NULL,
    o_shippriority    INTEGER NOT NULL,
    o_comment         VARCHAR2(79) NOT NULL
);

-- read date values as yyyy-mm-dd text

CREATE TABLE tpch.ext_lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        CHAR(10),
    l_commitdate      CHAR(10),
    l_receiptdate     CHAR(10),
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('lineitem.tbl'));

CREATE TABLE tpch.lineitem
(
    l_orderkey        NUMBER(10, 0),
    l_partkey         NUMBER(10, 0),
    l_suppkey         NUMBER(10, 0),
    l_linenumber      INTEGER,
    l_quantity        NUMBER,
    l_extendedprice   NUMBER,
    l_discount        NUMBER,
    l_tax             NUMBER,
    l_returnflag      CHAR(1),
    l_linestatus      CHAR(1),
    l_shipdate        DATE,
    l_commitdate      DATE,
    l_receiptdate     DATE,
    l_shipinstruct    CHAR(25),
    l_shipmode        CHAR(10),
    l_comment         VARCHAR2(44)
);

CREATE TABLE tpch.ext_nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('nation.tbl'));

CREATE TABLE tpch.nation
(
    n_nationkey   NUMBER(10, 0),
    n_name        CHAR(25),
    n_regionkey   NUMBER(10, 0),
    n_comment     VARCHAR(152)
);

CREATE TABLE tpch.ext_region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
)
ORGANIZATION EXTERNAL
    (TYPE oracle_loader
          DEFAULT DIRECTORY tpch_dir
              ACCESS PARAMETERS (
                  FIELDS
                      TERMINATED BY '|'
                  MISSING FIELD VALUES ARE NULL
              )
          LOCATION('region.tbl'));

CREATE TABLE tpch.region
(
    r_regionkey   NUMBER(10, 0),
    r_name        CHAR(25),
    r_comment     VARCHAR(152)
);
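
Before loading, a quick query against one of the external tables verifies the directory object, grants, and data files are all in order; REGION is handy for this because its size is fixed:

SELECT COUNT(*) FROM tpch.ext_region;   -- expect 5 rows at any scale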

Now load the data. The external tables read the date values as text, so we must either set NLS_DATE_FORMAT prior to loading so the text is parsed correctly, or embed the formatting within each SQL statement (an example of the embedded approach follows the load statements below). For the small data set in this example, the steps described here should complete within a few minutes. As you scale up into larger volumes these steps remain valid, but you may want to split the loads into separate steps, alter the external tables to read multiple files in parallel, and use parallel DML on insert to speed up the process. The TRUNCATE lines aren’t necessary for the first data load but are included for future reloads of the dbgen data at other scales.

TRUNCATE TABLE tpch.part;
TRUNCATE TABLE tpch.supplier;
TRUNCATE TABLE tpch.partsupp;
TRUNCATE TABLE tpch.customer;
TRUNCATE TABLE tpch.orders;
TRUNCATE TABLE tpch.lineitem;
TRUNCATE TABLE tpch.nation;
TRUNCATE TABLE tpch.region;

ALTER SESSION SET nls_date_format='YYYY-MM-DD';

INSERT /*+ APPEND */ INTO  tpch.part     SELECT * FROM tpch.ext_part;
INSERT /*+ APPEND */ INTO  tpch.supplier SELECT * FROM tpch.ext_supplier;
INSERT /*+ APPEND */ INTO  tpch.partsupp SELECT * FROM tpch.ext_partsupp;
INSERT /*+ APPEND */ INTO  tpch.customer SELECT * FROM tpch.ext_customer;
INSERT /*+ APPEND */ INTO  tpch.orders   SELECT * FROM tpch.ext_orders;
INSERT /*+ APPEND */ INTO  tpch.lineitem SELECT * FROM tpch.ext_lineitem;
INSERT /*+ APPEND */ INTO  tpch.nation   SELECT * FROM tpch.ext_nation;
INSERT /*+ APPEND */ INTO  tpch.region   SELECT * FROM tpch.ext_region;
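
As mentioned above, the session date format can be avoided by embedding the conversion in each statement instead. A sketch for the orders load, with the column list mirroring the DDL above:

INSERT /*+ APPEND */ INTO tpch.orders
    SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice,
           TO_DATE(o_orderdate, 'YYYY-MM-DD'),
           o_orderpriority, o_clerk, o_shippriority, o_comment
      FROM tpch.ext_orders;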

And finally, add the constraints and indexes.

ALTER TABLE tpch.part
    ADD CONSTRAINT pk_part PRIMARY KEY(p_partkey);

ALTER TABLE tpch.supplier
    ADD CONSTRAINT pk_supplier PRIMARY KEY(s_suppkey);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT pk_partsupp PRIMARY KEY(ps_partkey, ps_suppkey);

ALTER TABLE tpch.customer
    ADD CONSTRAINT pk_customer PRIMARY KEY(c_custkey);

ALTER TABLE tpch.orders
    ADD CONSTRAINT pk_orders PRIMARY KEY(o_orderkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT pk_lineitem PRIMARY KEY(l_orderkey, l_linenumber);

ALTER TABLE tpch.nation
    ADD CONSTRAINT pk_nation PRIMARY KEY(n_nationkey);

ALTER TABLE tpch.region
    ADD CONSTRAINT pk_region PRIMARY KEY(r_regionkey);

-- 1.4.2.3

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT fk_partsupp_part FOREIGN KEY(ps_partkey) REFERENCES tpch.part(p_partkey);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT fk_partsupp_supplier FOREIGN KEY(ps_suppkey) REFERENCES tpch.supplier(s_suppkey);

ALTER TABLE tpch.customer
    ADD CONSTRAINT fk_customer_nation FOREIGN KEY(c_nationkey) REFERENCES tpch.nation(n_nationkey);

ALTER TABLE tpch.orders
    ADD CONSTRAINT fk_orders_customer FOREIGN KEY(o_custkey) REFERENCES tpch.customer(c_custkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_order FOREIGN KEY(l_orderkey) REFERENCES tpch.orders(o_orderkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_part FOREIGN KEY(l_partkey) REFERENCES tpch.part(p_partkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_supplier FOREIGN KEY(l_suppkey) REFERENCES tpch.supplier(s_suppkey);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT fk_lineitem_partsupp FOREIGN KEY(l_partkey, l_suppkey)
        REFERENCES tpch.partsupp(ps_partkey, ps_suppkey);

-- 1.4.2.4 - 1

ALTER TABLE tpch.part
    ADD CONSTRAINT chk_part_partkey CHECK(p_partkey >= 0);

ALTER TABLE tpch.supplier
    ADD CONSTRAINT chk_supplier_suppkey CHECK(s_suppkey >= 0);

ALTER TABLE tpch.customer
    ADD CONSTRAINT chk_customer_custkey CHECK(c_custkey >= 0);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT chk_partsupp_partkey CHECK(ps_partkey >= 0);

ALTER TABLE tpch.region
    ADD CONSTRAINT chk_region_regionkey CHECK(r_regionkey >= 0);

ALTER TABLE tpch.nation
    ADD CONSTRAINT chk_nation_nationkey CHECK(n_nationkey >= 0);

-- 1.4.2.4 - 2

ALTER TABLE tpch.part
    ADD CONSTRAINT chk_part_size CHECK(p_size >= 0);

ALTER TABLE tpch.part
    ADD CONSTRAINT chk_part_retailprice CHECK(p_retailprice >= 0);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT chk_partsupp_availqty CHECK(ps_availqty >= 0);

ALTER TABLE tpch.partsupp
    ADD CONSTRAINT chk_partsupp_supplycost CHECK(ps_supplycost >= 0);

ALTER TABLE tpch.orders
    ADD CONSTRAINT chk_orders_totalprice CHECK(o_totalprice >= 0);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_quantity CHECK(l_quantity >= 0);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_extendedprice CHECK(l_extendedprice >= 0);

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_tax CHECK(l_tax >= 0);

-- 1.4.2.4 - 3

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_discount CHECK(l_discount >= 0.00 AND l_discount <= 1.00);

-- 1.4.2.4 - 4

ALTER TABLE tpch.lineitem
    ADD CONSTRAINT chk_lineitem_ship_rcpt CHECK(l_shipdate <= l_receiptdate);

And that’s it, you should now have a complete TPC-H scale-4 data set to complete either the TPC-H suite of test queries, oracle labs, or run your own tests.

Enjoy!

12cR2 finally has an on-premises release!


12.2.0.1 has been released for on-premises installation.

It’s only available for Linux and Solaris so far, but that’s good enough to get some test installations started.

http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html

Nontechnical Skills Important To Being A Good Developer


What’s it take to be a good developer?

Brains? Of course you have to be able to think.
Logic? Sure, you need to have logical constructions.
Detailed? Maybe, you do need to pick out the intricacies of requirements.
Thorough? Yeah, I’ll give you that. The cause of all bugs is failing to cover some scenario.

I could go on about features of good developers and I’m sure readers have their own favorites; but the thing that I think stands out among the best developers is a “need to be right.” Now, of course this doesn’t mean that a developer is always right. I certainly am not. Nor does it mean that a developer needs to win every argument, although we may try.

No, the need to be right is a pursuit of correctness, or of the best available option. One of the nice features of the software world is that many of our problems have absolutes, correct or incorrect; or, if not, they at least have objective comparisons of better or worse.

Does your calculation produce the correct answer? yes or no?
Does your procedure/function/method fulfill the requirement? yes or no?
Does subroutine A consume more memory than subroutine B while getting the same answer?
Does query X perform more IO operations than query Y to produce the same result?

Of course, a good developer wants to write good code; so it can be humbling when a bug is found, or a better way obsoletes your method. The need to be right though, supersedes the challenge to your ego. The need to be right is a relentless curiosity. If you find a bug in my code, thank you, if you identify why the bug occurs and how to fix it… THANK YOU! If you and I both have functioning SQL statements but yours is more efficient, THANK YOU! When I learn from a mistake or a superior design I “become right.” The need to be right is not about somebody else being wrong, it’s about you having, provably, the most correct and most efficient method possible. If that means you have to revise your solution based on new information, so be it.

One of the danger phrases that worries me is “I got it to work.” There’s nothing wrong, per se, with that phrase. I use it myself after a difficult challenge. The worrisome nature of it is that it’s often used as a declaration of finality. The task strained the brain and finally after much struggle, there is a successful result. “I got it to work, YAY!” But… does it work well? If you have the need to be right then you’ll want that answer. If you don’t, then you’ll call it done, check it in, and ship it off to production.

I’ve had many people ask me why I think it’s not a success. That’s simple. The problem was difficult, right? So difficult that you found it to be monumental when you got a successful run out of it. So then, what makes you think the thing that was so hard it couldn’t even produce the correct results until just now has simultaneously achieved maximum efficiency? Were you so busy tuning it while it was still wrong that it couldn’t possibly be any faster or less resource consuming? If so, that’s kind of an odd work style. Maybe that’s why it seemed so difficult?

It’s all related to the classic engineering joke/truism: Good/Fast/Cheap – pick 2. I believe in a corollary to that rule, it’s cheaper to doing something right once than twice wrong. So, while it might take more effort and cost to build a correct and efficient solution. In the long run that will end up begin cheaper because you don’t have to revisit as often.

However, there are traps in the need to be right. First, and probably the most obvious – you might fool yourself. If you try options A, B, C and determine C is best, that’s great; but if you never thought to look at option D you still might not be right. All you’ve done is prove you’re not the most wrong.

Second, there are problems that don’t have absolutes. “Most easily supportable”, “Best documentation”, “Most pleasing UX” – these ideals have leanings. There are approaches that seem more reasonable (follow some kind of standard, documentation that actually describes the objects, fewer interruptions in navigation) but at some point the differences just become preferences.

Third, you can be a jerk about it, forgetting that being “right” isn’t about somebody else being wrong. If you fall into this trap then you’re no longer helpful to your team, even if you do happen to be “right,” because attitude precludes successful cooperation.

The last trap is particularly tricky because it’s self-conflicting. Sometimes you should call it “good enough.” If the cost to improve some code is $1000 but you’ll only get $900 of benefit out of it, then don’t waste the money. It might indeed be possible to make a more efficient version; but if doing so is a net loss then don’t do it. Another problem with this approach is that it might not be that easy to produce an accurate cost/benefit analysis. If you call it quits early, you may be missing an opportunity. If you pursue too long you may be wasting time, money and other resources. When I run into this gray area I talk to my manager or client or whomever is footing the bill and give them an honest status. “I think this is working well, but could probably be improved. I’m willing to continue working on it; but it’s up to you how much to spend before accepting the results.” If you can give an estimate of cost to implement and expectation of improvement that’s even better.

Some might dismiss the idea of the need to be right as just an obnoxious way of saying “try to write good code.” Any programmer who cares about their results will try to be both correct and efficient. Yes, that’s true; I agree with that. However, the differentiating factor between merely being conscientious and having the need to be right is the proving.

If you ask any developer “Are you trying to write good code or bad code?” They will, of course, answer “good code.” In my opinion though, a good programmer is willing to show why their code is good. This “need” has two important benefits. First, by proving their design qualities they demonstrate they really do care about the quality of their output, they aren’t simply giving the desired response. Second, the act of proving correctness and efficiency requires technical insight and skill.

My favorite developers to work with always demonstrate this need to be right by showing their work and asking for review.

I shied away from writing a philosophical article because it’s one of those areas without an absolute. This is merely an opinion, but I think it’s an interesting and important piece of successful development. I hope you enjoyed reading.

Sean

Why does my code break with no changes and no invalid objects?


Recently I encountered an interesting situation where a job had been running successfully for months and then one day suddenly started failing every time with an ORA-06502: PL/SQL: numeric or value error. The same error can also manifest in bulk operations as ORA-06502: PL/SQL: numeric or value error: Bulk Bind: Truncated Bind.

The more confusing part is that no changes had been made to the code and there were no invalid objects. Digging into the problem some more: the job called a procedure, which read from a view, which in turn selected from remote tables via a database link.

So, the obvious next step was to check the remote side; but again, I found no invalid objects there, and looking at the LAST_DDL_TIME in DBA_OBJECTS, none of the tables had been modified in a couple of months either.

Those old remote changes seemed innocuous, but it had been even longer since I had changed anything in the view or procedure within the local db. I recompiled the view in the local db and the job started working again. So, what changed, and why did it only fail now instead of a couple of months ago when the remote tables changed?

For this I had to go to SYS and query DBA_TAB_COLS with a flashback query to check the columns of my view, and sure enough, one of the VARCHAR2 columns had grown. The reason nothing failed at the time of the remote change was that no data using the new, bigger limit came in right away.
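
For reference, the check was along these lines. This is a sketch rather than the exact statement: the timestamp offset is a placeholder, the owner and view name below are from the replication example later in this post, and in my case the flashback query against the dictionary view worked when connected as SYS.

SELECT column_name, data_type, data_length
  FROM dba_tab_cols AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '60' DAY)
 WHERE owner = 'SDS'
   AND table_name = 'TEST_VIEW';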

Once I saw the problem, replicating it was fairly easy. Another interesting quirk between SQL and PL/SQL: the query within the procedure would run without error when executed as plain SQL, because plain SQL doesn’t carry the precompiled size limitations, so it adapts on its own.

The example objects and data to replicate the problem are fairly simple but it can be a little tricky following along since the session alternates between the local and remote database.  I’ve left the connect statements and altered the SQL> prompt to include the database name to help highlight where the action is taking place.

sds@REMOTE_DB> connect sds/pa55w0rd@remote_db
Connected.
sds@REMOTE_DB> create table sds.testtable(test_id integer, testtext varchar2(10));

Table created.

sds@REMOTE_DB> insert into sds.testtable(test_id,testtext) values (1,'abcd');

1 row created.

sds@REMOTE_DB> insert into sds.testtable(test_id,testtext) values (2,'efgh');

1 row created.

sds@REMOTE_DB> commit;

Commit complete.

sds@REMOTE_DB> connect sds/pa55w0rd@local_db
Connected.
sds@LOCAL_DB> CREATE DATABASE LINK remote_db_link CONNECT TO sds IDENTIFIED BY "pa55w0rd" USING 'remote_db';

Database link created.

sds@LOCAL_DB> CREATE OR REPLACE VIEW test_view
AS
SELECT * FROM sds.testtable@remote_db_link;

View created.

sds@LOCAL_DB> SELECT * FROM test_view;

   TEST_ID TESTTEXT
---------- ----------
         1 abcd
         2 efgh

sds@LOCAL_DB> CREATE OR REPLACE PROCEDURE show_remote_data
IS
BEGIN
    FOR x IN (SELECT test_id, testtext
                FROM test_view
               ORDER BY test_id)
    LOOP
        DBMS_OUTPUT.put_line(x.test_id || ' ' || x.testtext);
    END LOOP;
END;
/

Procedure created.

sds@LOCAL_DB> set serveroutput on
sds@LOCAL_DB> exec show_remote_data;
1 abcd
2 efgh

PL/SQL procedure successfully completed.

Up to here everything is working normally; now we’ll make the text column larger.
The procedure will still work correctly even though the local and remote sizes don’t match. The error doesn’t occur until new data shows up that exceeds the prior limit.

sds@LOCAL_DB> connect sds/pa55w0rd@remote_db
Connected.
sds@REMOTE_DB> alter table sds.testtable modify (testtext varchar2(30));

Table altered.

sds@REMOTE_DB> connect sds/pa55w0rd@local_db
Connected.
sds@LOCAL_DB> exec show_remote_data;

PL/SQL procedure successfully completed.

sds@LOCAL_DB> set serveroutput on
sds@LOCAL_DB> exec show_remote_data;
1 abcd
2 efgh

PL/SQL procedure successfully completed.

sds@LOCAL_DB> connect sds/pa55w0rd@remote_db
Connected.
sds@REMOTE_DB> insert into sds.testtable(test_id,testtext) values (3,'abcdefghijklmnopqrstuvwxyz');

1 row created.

sds@REMOTE_DB> commit;

Commit complete.

sds@REMOTE_DB> connect sds/pa55w0rd@local_db
Connected.
sds@LOCAL_DB> set serveroutput on
sds@LOCAL_DB> exec show_remote_data;
1 abcd
2 efgh
BEGIN show_remote_data; END;

*
ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "SDS.SHOW_REMOTE_DATA", line 4
ORA-06512: at line 1

sds@LOCAL_DB>
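
And, as described at the top, the fix is simply recompiling the view on the local side so the column sizes are re-derived from the remote table; the dependent procedure then recompiles itself on its next invocation. The expected result is sketched below.

sds@LOCAL_DB> alter view test_view compile;

View altered.

sds@LOCAL_DB> exec show_remote_data;
1 abcd
2 efgh
3 abcdefghijklmnopqrstuvwxyz

PL/SQL procedure successfully completed.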

Missing peaks in ASH results


The ASH charts in OEM are great utilities for getting a quick summary of your system’s activity. However, the results can be misleading because of how the data is represented on screen. First, ASH data is collected by sampling, so it’s not a complete picture of everything that runs. Another thing to consider is that the charting in OEM doesn’t plot every ASH data point; instead, it averages them across time slices. Within Top Activity and the ASH Analytics summary charts these points are then connected by curves or straight lines, which further dilutes the results.

Some example snapshots will help illustrate these issues.

The OEM Top Activity screen may produce a chart like this…
Top Activity

First, note the large spike around 1:30am on the 16th. This spike was largely composed of RMAN backups and represents a significant increase in overall activity on the server, with approximately 9 active sessions at its peak and a sustained activity level of 8 for most of that period.

Next, let’s look at that same database using ASH Analytics and note how that spike is drawn as a pyramid of activity. While the slope of the sides is fairly steep, it’s still significantly more gradual than that illustrated by the Top Activity chart. The peak activity is still approximately 9 active sessions at its highest but it’s harder to determine when and where it tapers off because the charting simply draws straight lines between time slices.

ASH Analytics

But ASH Analytics offers a zoom window feature, and using it to highlight the 1am-2am hour gives a different picture, one that more closely reflects the story told in the Top Activity chart. Note the sharp increase at 1:30, as seen in Top Activity. Also note the higher peaks approaching and exceeding 12 active sessions, whereas each of the previous charts indicated a peak of 9. The last curiosity is that when the activity declines it does so more gradually than in the Top Activity chart but more steeply than in the Analytics overall chart.

ASH Analytics wall

The charts above demonstrate some ambiguities in using any one visualization. In those examples though, the data was mostly consistent in magnitude, but differing on rate of change due to resolution of the time slices.

Another potential problem with the averaging is losing accuracy by dropping information. For instance, in the first chart above, note the brief IO spike around 9:30am with a peak of 6 active sessions. The ASH Analytics summary chart averages that curve down to approximately 2 active sessions. If we now go to the ASH Analytics page and zoom in to only the 9am-10am hour, we see the spike was in fact much larger: 24 active sessions! That is 4 to 12 times the previous values and, more importantly, twice the number of available processors. It was a brief surge and the system recovered fine, but if you were looking for potential trouble areas of resource contention, the first two charts could be misleading.

ASH Analytics peak

I definitely don’t want to discourage readers from using OEM’s ASH tools, nor do I want to suggest you need to zoom in on every single time range to get an accurate picture. Instead, I want readers to be aware of the limitations inherent in data averaging. If you have reason to inspect activity in a narrow time range, then by all means zoom in with ASH Analytics to get the best picture. If you need larger-scale summary views, consider querying the ASH data yourself to find extreme values that may have been hidden by the averaging; an example of that kind of query is below.
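
For instance, a sketch of such a query against V$ACTIVE_SESSION_HISTORY, which holds one row per active session per one-second sample. The timestamps are placeholders for the 9am-10am window discussed above, and FETCH FIRST is 12c syntax; adjust both to your environment.

SELECT sample_time, COUNT(*) AS active_sessions
  FROM v$active_session_history
 WHERE sample_time BETWEEN TIMESTAMP '2017-03-16 09:00:00'
                       AND TIMESTAMP '2017-03-16 10:00:00'
 GROUP BY sample_time
 ORDER BY active_sessions DESC
 FETCH FIRST 10 ROWS ONLY;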
