MERGE, Triggers, and Statement Restart


The MERGE command in Oracle is a great tool for modifying data in a table, efficiently solving the old problem of insert collisions and updates of non-existent rows. Most of the time we use it because it "just works." But what really goes on when you issue a merge? In particular, which triggers fire? How does it interact with other sessions, and how does statement restart impact it?

To examine these I’ll set up a small test table, a bunch of triggers, a logging table, and some code to populate the log as a statement executes.

CREATE TABLE mergetest
(
    id     INTEGER PRIMARY KEY,
    val    VARCHAR2(50)
);

CREATE TABLE mergetest_log
(
    logtime       TIMESTAMP,
    logmessage    VARCHAR2(100)
);

CREATE OR REPLACE PROCEDURE write_mergetest_log(p_message IN VARCHAR2)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    INSERT INTO mergetest_log(logtime, logmessage)
         VALUES (SYSTIMESTAMP, SYS_CONTEXT('userenv', 'client_info') || ' ' || p_message);

    COMMIT;
END;

CREATE OR REPLACE FUNCTION trigger_action
    RETURN VARCHAR2
IS
BEGIN
 RETURN CASE
            WHEN INSERTING
            THEN
                'INSERT'
            WHEN UPDATING
            THEN
                'UPDATE'
            WHEN DELETING
            THEN
                'DELETE'
            ELSE
                'UNKNOWN'
        END;
END;

CREATE OR REPLACE TRIGGER trg_before_delete
    BEFORE DELETE
    ON mergetest
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_before_insert
    BEFORE INSERT
    ON mergetest
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_before_update
    BEFORE UPDATE
    ON mergetest
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_after_delete
    AFTER DELETE
    ON mergetest
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_after_insert
    AFTER INSERT
    ON mergetest
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_after_update
    AFTER UPDATE
    ON mergetest
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_before_row_delete
    BEFORE DELETE
    ON mergetest
    FOR EACH ROW
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_before_row_insert
    BEFORE INSERT
    ON mergetest
    FOR EACH ROW
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_before_row_update
    BEFORE UPDATE
    ON mergetest
    FOR EACH ROW
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_after_row_delete
    AFTER DELETE
    ON mergetest
    FOR EACH ROW
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_after_row_insert
    AFTER INSERT
    ON mergetest
    FOR EACH ROW
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

CREATE OR REPLACE TRIGGER trg_after_row_update
    AFTER UPDATE
    ON mergetest
    FOR EACH ROW
BEGIN
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;

So, that's simple enough: two tables, and all the triggers do the same thing, simply logging who they are and what they are doing. The logging procedure uses an autonomous transaction so the progress can be captured even if the calling statement is rolled back.

You might wonder why I used so many triggers instead of a single combined trigger, or at least combining INSERT/UPDATE/DELETE within each trigger type. In fact, my early tests when writing this article did use a single trigger. The problem with that was that I lost some granularity of information when statement restart occurred and some actions became undefined. I'll highlight this in later example code and output.
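
For reference, here is a minimal sketch of the kind of consolidated trigger I mean; it reuses the same logging procedure and trigger_action function, and the trigger name is illustrative only (it is not part of the final test setup):

CREATE OR REPLACE TRIGGER trg_mergetest_combined
    BEFORE INSERT OR UPDATE OR DELETE
    ON mergetest
    FOR EACH ROW
BEGIN
    -- One trigger handling all three actions. On statement restart the
    -- INSERTING/UPDATING/DELETING predicates may not identify the action,
    -- so the log loses the granularity of knowing which event fired.
    write_mergetest_log($$plsql_unit || ' ' || trigger_action);
END;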

Now we'll execute 3 merge statements: the first into the empty table, then a merge that updates the row we just inserted, and then another merge that should create a second new row. Then we'll check the log table and see what happened.

clear screen
TRUNCATE TABLE mergetest;
TRUNCATE TABLE mergetest_log;
exec dbms_application_info.set_client_info('SESSION ONE');
exec write_mergetest_log('MERGE 1');
MERGE INTO mergetest t
     USING (SELECT 1 id, 'mergeupdate' updateval, 'mergeinsert' insertval FROM DUAL) s
        ON (t.id = s.id)
WHEN MATCHED
THEN
    UPDATE SET t.val = s.updateval
WHEN NOT MATCHED
THEN
    INSERT     VALUES (s.id, s.insertval);
exec write_mergetest_log('COMMIT');
COMMIT;
exec write_mergetest_log('MERGE 2');
MERGE INTO mergetest t
     USING (SELECT 1 id, 'mergeupdate' updateval, 'mergeinsert' insertval FROM DUAL) s
        ON (t.id = s.id)
WHEN MATCHED
THEN
    UPDATE SET t.val = s.updateval
WHEN NOT MATCHED
THEN
    INSERT     VALUES (s.id, s.insertval);
exec write_mergetest_log('COMMIT');
COMMIT;
exec write_mergetest_log('MERGE 3');
MERGE INTO mergetest t
     USING (SELECT 2 id, 'mergeupdate' updateval, 'mergeinsert' insertval FROM DUAL) s
        ON (t.id = s.id)
WHEN MATCHED
THEN
    UPDATE SET t.val = s.updateval
WHEN NOT MATCHED
THEN
    INSERT     VALUES (s.id, s.insertval);
exec write_mergetest_log('COMMIT');
COMMIT;

select * from mergetest_log order by logtime;

Note that both the insert and update statement-level triggers fire, but for the row-level triggers only those for the operation actually performed fire.

LOGTIME                      LOGMESSAGE
---------------------------- -----------------------------------------
2018-01-07 15:41:43.337157   SESSION ONE MERGE 1
2018-01-07 15:41:43.386852   SESSION ONE TRG_BEFORE_INSERT INSERT
2018-01-07 15:41:43.387073   SESSION ONE TRG_BEFORE_UPDATE UPDATE
2018-01-07 15:41:43.404012   SESSION ONE TRG_BEFORE_ROW_INSERT INSERT
2018-01-07 15:41:43.437701   SESSION ONE TRG_AFTER_ROW_INSERT INSERT
2018-01-07 15:41:43.437873   SESSION ONE TRG_AFTER_INSERT INSERT
2018-01-07 15:41:43.438004   SESSION ONE TRG_AFTER_UPDATE UPDATE
2018-01-07 15:41:43.491317   SESSION ONE COMMIT
2018-01-07 15:41:43.524916   SESSION ONE MERGE 2
2018-01-07 15:41:43.552057   SESSION ONE TRG_BEFORE_INSERT INSERT
2018-01-07 15:41:43.552230   SESSION ONE TRG_BEFORE_UPDATE UPDATE
2018-01-07 15:41:43.552452   SESSION ONE TRG_BEFORE_ROW_UPDATE UPDATE
2018-01-07 15:41:43.552599   SESSION ONE TRG_AFTER_ROW_UPDATE UPDATE
2018-01-07 15:41:43.552715   SESSION ONE TRG_AFTER_INSERT INSERT
2018-01-07 15:41:43.552827   SESSION ONE TRG_AFTER_UPDATE UPDATE
2018-01-07 15:41:43.609417   SESSION ONE COMMIT
2018-01-07 15:41:43.641952   SESSION ONE MERGE 3
2018-01-07 15:41:43.669958   SESSION ONE TRG_BEFORE_INSERT INSERT
2018-01-07 15:41:43.670092   SESSION ONE TRG_BEFORE_UPDATE UPDATE
2018-01-07 15:41:43.670259   SESSION ONE TRG_BEFORE_ROW_INSERT INSERT
2018-01-07 15:41:43.670442   SESSION ONE TRG_AFTER_ROW_INSERT INSERT
2018-01-07 15:41:43.670551   SESSION ONE TRG_AFTER_INSERT INSERT
2018-01-07 15:41:43.670643   SESSION ONE TRG_AFTER_UPDATE UPDATE
2018-01-07 15:41:43.726528   SESSION ONE COMMIT

24 rows selected.

Next, let's try it with a second session involved and see how locking impacts the merge. The CLIENT_INFO labels ('SESSION ONE' and 'SESSION TWO') in the scripts and log output indicate which session is producing which output.


clear screen
exec dbms_application_info.set_client_info('SESSION ONE');
TRUNCATE TABLE mergetest;
TRUNCATE TABLE mergetest_log;
INSERT INTO mergetest VALUES (1, 'start');
COMMIT;
exec write_mergetest_log('UPDATE');
UPDATE mergetest set val = 'update' WHERE id = 1;
pause "Run merge in other session"



clear screen
exec dbms_application_info.set_client_info('SESSION TWO');
SELECT * FROM mergetest;
exec write_mergetest_log('MERGE');
MERGE INTO mergetest t
     USING (SELECT 1 id, 'mergeupdate' updateval, 'mergeinsert' insertval FROM DUAL) s
        ON (t.id = s.id)
WHEN MATCHED
THEN
    UPDATE SET t.val = s.updateval
WHEN NOT MATCHED
THEN
    INSERT     VALUES (s.id, s.insertval);
exec write_mergetest_log('COMMIT');
COMMIT;



exec write_mergetest_log('COMMIT');
COMMIT;
exec write_mergetest_log('INSERT');
INSERT INTO mergetest VALUES (1, 'insert');

When run, the output looked like this…


15:54:31 SQL> exec dbms_application_info.set_client_info('SESSION ONE');

PL/SQL procedure successfully completed.

15:54:31 SQL> TRUNCATE TABLE mergetest;

Table truncated.

15:54:32 SQL> TRUNCATE TABLE mergetest_log;

Table truncated.

15:54:32 SQL> INSERT INTO mergetest VALUES (1, 'start');

1 row created.

15:54:32 SQL> COMMIT;

Commit complete.

15:54:32 SQL> exec write_mergetest_log('UPDATE');

PL/SQL procedure successfully completed.

15:54:32 SQL> UPDATE mergetest set val = 'update' WHERE id = 1;

1 row updated.

15:54:32 SQL> pause "Run merge in other session"
"Run merge in other session"



15:54:41 SQL> exec dbms_application_info.set_client_info('SESSION TWO');

PL/SQL procedure successfully completed.

15:54:41 SQL> SELECT * FROM mergetest;

        ID VAL
---------- --------------------------------------------------
         1 start

15:54:41 SQL> exec write_mergetest_log('MERGE');

PL/SQL procedure successfully completed.

15:54:41 SQL> MERGE INTO mergetest t
15:54:41   2       USING (SELECT 1 id, 'mergeupdate' updateval, 'mergeinsert' insertval FROM DUAL) s
15:54:41   3          ON (t.id = s.id)
15:54:41   4  WHEN MATCHED
15:54:41   5  THEN
15:54:41   6      UPDATE SET t.val = s.updateval
15:54:41   7  WHEN NOT MATCHED
15:54:41   8  THEN
15:54:41   9      INSERT     VALUES (s.id, s.insertval);

1 row merged.

15:54:54 SQL> exec write_mergetest_log('COMMIT');

PL/SQL procedure successfully completed.

15:54:54 SQL> COMMIT;

Commit complete.



exec write_mergetest_log('COMMIT');
15:54:54 SQL> COMMIT;

Commit complete.

15:54:54 SQL> exec write_mergetest_log('INSERT');

PL/SQL procedure successfully completed.

15:54:54 SQL> INSERT INTO mergetest VALUES (1, 'insert');
INSERT INTO mergetest VALUES (1, 'insert')
*
ERROR at line 1:
ORA-00001: unique constraint (SDS.SYS_C0048605) violated

The logs again show that the statement-level triggers fire for both insert and update, regardless of what the merge actually does; but look at the merge in session two.
The before-statement triggers fire a second time due to statement restart, but the activity is reported as "UNKNOWN" (see the log below). Using before triggers is always tricky with statement restart, but this adds an extra layer of complication. If you have a trigger that uses the INSERTING/UPDATING/DELETING predicates to determine the action, on statement restart those predicates won't identify the action and the trigger logic can fail. This is what happened to me when I used a single consolidated trigger in my initial tests. By using individual triggers for each action I could track exactly which trigger fired, even when the activity itself was undefined.

15:54:54 SQL> select * from mergetest_log order by logtime;

LOGTIME                       LOGMESSAGE
----------------------------- --------------------------------------------
2018-01-07 15:54:31.449622    SESSION ONE TRG_BEFORE_INSERT INSERT
2018-01-07 15:54:31.450485    SESSION ONE TRG_BEFORE_ROW_INSERT INSERT
2018-01-07 15:54:31.544272    SESSION ONE TRG_AFTER_ROW_INSERT INSERT
2018-01-07 15:54:31.544476    SESSION ONE TRG_AFTER_INSERT INSERT
2018-01-07 15:54:31.581790    SESSION ONE UPDATE
2018-01-07 15:54:31.599431    SESSION ONE TRG_BEFORE_UPDATE UPDATE
2018-01-07 15:54:31.599642    SESSION ONE TRG_BEFORE_ROW_UPDATE UPDATE
2018-01-07 15:54:31.599823    SESSION ONE TRG_AFTER_ROW_UPDATE UPDATE
2018-01-07 15:54:31.599954    SESSION ONE TRG_AFTER_UPDATE UPDATE
2018-01-07 15:54:40.794142    SESSION TWO MERGE
2018-01-07 15:54:40.820151    SESSION TWO TRG_BEFORE_INSERT INSERT
2018-01-07 15:54:40.820316    SESSION TWO TRG_BEFORE_UPDATE UPDATE
2018-01-07 15:54:53.233075    SESSION TWO TRG_BEFORE_ROW_UPDATE UPDATE
2018-01-07 15:54:53.233366    SESSION TWO TRG_BEFORE_UPDATE UNKNOWN
2018-01-07 15:54:53.233495    SESSION TWO TRG_BEFORE_INSERT UNKNOWN
2018-01-07 15:54:53.233703    SESSION TWO TRG_BEFORE_ROW_UPDATE UPDATE
2018-01-07 15:54:53.233857    SESSION TWO TRG_AFTER_ROW_UPDATE UPDATE
2018-01-07 15:54:53.233978    SESSION TWO TRG_AFTER_INSERT INSERT
2018-01-07 15:54:53.234079    SESSION TWO TRG_AFTER_UPDATE UPDATE
2018-01-07 15:54:53.262496    SESSION ONE INSERT
2018-01-07 15:54:53.296106    SESSION TWO COMMIT
2018-01-07 15:54:59.362455    SESSION ONE TRG_BEFORE_INSERT INSERT
2018-01-07 15:54:59.362682    SESSION ONE TRG_BEFORE_ROW_INSERT INSERT
23 rows selected.

While I've used merge many times, I'll admit I hadn't really thought about the trigger implications as much as I should have. In particular, I hadn't considered the possibility of an UNKNOWN action inside a trigger and what I'd do to accommodate one when encountered. I hope you've found this useful as well.


My favorite database feature: the “Collapsing Index”


One of the interesting features of Oracle is that if an index entry would consist entirely of NULLs, then that entry isn't created.
A simple test can verify this assertion.

SQL> CREATE TABLE null_test
  2  (
  3      id              INTEGER NOT NULL,
  4      dummy_string    VARCHAR2(10)
  5  );

Table created.

SQL> INSERT INTO null_test(id, dummy_string)
  2          SELECT LEVEL, 'populated'
  3            FROM DUAL
  4      CONNECT BY LEVEL <= 10;

10 rows created.

SQL> INSERT INTO null_test(id, dummy_string)
  2      SELECT 11, NULL
  3        FROM DUAL;

1 row created.

SQL> INSERT INTO null_test(id, dummy_string)
  2      SELECT 12, NULL
  3        FROM DUAL;

1 row created.

SQL> CREATE INDEX null_test_dummy_string_idx
  2      ON null_test(dummy_string);

Index created.

SQL> BEGIN
  2      DBMS_STATS.gather_table_stats(USER, 'NULL_TEST');
  3  END;
  4  /

PL/SQL procedure successfully completed.

So, at this point I have a table with 12 rows in it. Of the indexed column, 10 rows are populated and 2 are left NULL. Now that I've gathered stats, let's see what the contents of the table vs the index look like.

SQL> SELECT num_rows
  2    FROM user_tables
  3   WHERE table_name = 'NULL_TEST';

  NUM_ROWS
----------
        12

SQL> SELECT num_rows
  2    FROM user_ind_statistics
  3   WHERE index_name = 'NULL_TEST_DUMMY_STRING_IDX';

  NUM_ROWS
----------
        10

As expected, the index is missing the 2 null rows.

So, what can we do with this feature? How can missing data be part of my favorite feature? Don’t we want data to be in an index for easy lookup?

Yes, we do… BUT what we really want is for the data we want to look up to be in our index. An index that has extra, unimportant data in it is just wasted space and more data to sift through. So, what if we could have an index that contained exactly the rows we were interested in? Then, as data became less interesting, it would disappear from the index and the index would "collapse" to once again contain only the important data. Sounds ideal, right? But what sort of use case would have data like that and could use such an index?

Processing queues are a classic example of this kind of data. Over time a table will accumulate many records that have been completed and don't need to be revisited, or if they do, they don't have the same urgency as new, unprocessed records. In this case, the ideal index would point only to the new records and exclude all the old ones.

To illustrate I'll create a simple task queue containing just 2 columns: a TASK_ID, and a flag indicating whether the record has been PROCESSED or not.
Then I'll fill the table with 100,000 rows of old data, where the processed flag = 'Y', and insert 10 new rows with processed flag = 'N'.
Last, I'll create an index on the processed flag, gather stats, and check the numbers.

SQL> CREATE TABLE task_queue
  2  (
  3      task_id      INTEGER NOT NULL,
  4      processed    CHAR(1) NOT NULL
  5  );

Table created.

SQL> INSERT INTO task_queue(task_id, processed)
  2          SELECT LEVEL, 'Y'
  3            FROM DUAL
  4      CONNECT BY LEVEL <= 100000;

100000 rows created.

SQL> INSERT INTO task_queue(task_id, processed)
  2          SELECT 100000 + LEVEL, 'N'
  3            FROM DUAL
  4      CONNECT BY LEVEL <= 10;

10 rows created.

SQL> CREATE INDEX task_queue_processed_idx
  2      ON task_queue(processed);

Index created.

SQL> BEGIN
  2      DBMS_STATS.gather_table_stats(USER, 'TASK_QUEUE');
  3  END;
  4  /

PL/SQL procedure successfully completed.

SQL> SELECT column_name, num_distinct, num_nulls
  2    FROM user_tab_col_statistics
  3   WHERE table_name = 'TASK_QUEUE';

COLUMN_NAME               NUM_DISTINCT  NUM_NULLS
------------------------- ------------ ----------
TASK_ID                         100010          0
PROCESSED                            2          0

SQL> SELECT index_name,
  2         leaf_blocks,
  3         distinct_keys,
  4         num_rows
  5    FROM user_ind_statistics
  6   WHERE table_name = 'TASK_QUEUE';

INDEX_NAME                     LEAF_BLOCKS  DISTINCT_KEYS   NUM_ROWS
------------------------------ -----------  ------------- ----------
TASK_QUEUE_PROCESSED_IDX               182             2     100010

All looks pretty good at this point, but when we try to query the table to find the new rows we get poor results.
The index isn't even used, and we need 192 consistent gets to find just the 10 rows we want.

SQL> SELECT *
  2    FROM task_queue
  3   WHERE processed = 'N';

   TASK_ID P
---------- -
    100001 N
    100002 N
    100003 N
    100004 N
    100005 N
    100006 N
    100007 N
    100008 N
    100009 N
    100010 N

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2224684298

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            | 50005 |   341K|    69   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TASK_QUEUE | 50005 |   341K|    69   (2)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("PROCESSED"='N')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        192  consistent gets
          0  physical reads
          0  redo size
        538  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         10  rows processed

Let's see if we can do better. Going back to the original assertion that null-only entries won't be indexed: we can throw out all the processed records if we treat them as NULLs. Using the NULLIF function we achieve exactly the functionality we're looking for: 'Y' values become NULL, and 'N' values are the only ones left to index.

SQL> CREATE INDEX task_queue_unprocessed_idx
  2      ON task_queue(NULLIF(processed,'Y'));

Index created.

SQL> SELECT index_name,
  2         leaf_blocks,
  3         distinct_keys,
  4         num_rows
  5    FROM user_ind_statistics
  6   WHERE table_name = 'TASK_QUEUE';

INDEX_NAME                      LEAF_BLOCKS DISTINCT_KEYS   NUM_ROWS
------------------------------- ----------- ------------- ----------
TASK_QUEUE_PROCESSED_IDX                182             2     100010
TASK_QUEUE_UNPROCESSED_IDX                1             1         10

That looks promising. The new index only has 10 rows in it and it's much smaller; all 10 entries fit into a single leaf block.

SQL> SELECT *
  2    FROM task_queue
  3   WHERE NULLIF(processed,'Y') = 'N';

   TASK_ID P
---------- -
    100001 N
    100002 N
    100003 N
    100004 N
    100005 N
    100006 N
    100007 N
    100008 N
    100009 N
    100010 N

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1345274010

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                            |  1000 |  7000 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TASK_QUEUE                 |  1000 |  7000 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TASK_QUEUE_UNPROCESSED_IDX |    10 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(CASE  WHEN "PROCESSED"='Y' THEN NULL ELSE "PROCESSED" END ='N')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        538  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         10  rows processed

That's much better: only 4 gets needed instead of 192. Also note that the NULLIF is rewritten as a CASE expression, both in the function-based index definition and in the execution plan's access predicate.

That’s somewhat inconvenient syntax, so I often create a view to encapsulate the function filter as well as provide a meaningful name. Here we’re looking for unprocessed tasks, so that’s what I name the view.

SQL> CREATE OR REPLACE VIEW unprocessed_tasks
  2  AS
  3      SELECT *
  4        FROM task_queue
  5       WHERE NULLIF(processed,'Y') = 'N';

View created.

SQL> SELECT * FROM unprocessed_tasks;

   TASK_ID P
---------- -
    100001 N
    100002 N
    100003 N
    100004 N
    100005 N
    100006 N
    100007 N
    100008 N
    100009 N
    100010 N

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1345274010

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                            |  1000 |  7000 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TASK_QUEUE                 |  1000 |  7000 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TASK_QUEUE_UNPROCESSED_IDX |    10 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(CASE  WHEN "PROCESSED"='Y' THEN NULL ELSE "PROCESSED" END ='N')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        538  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         10  rows processed

Some might wonder: “What about that normal index we created at the beginning?” Well, the optimizer doesn’t want to use it by default; but we can use a HINT to tell the optimizer to use it instead of the full table scan.

SQL> SELECT  /*+ INDEX(task_queue task_queue_processed_idx) */
  2         *
  3    FROM task_queue
  4   WHERE processed = 'N';

   TASK_ID P
---------- -
    100001 N
    100002 N
    100003 N
    100004 N
    100005 N
    100006 N
    100007 N
    100008 N
    100009 N
    100010 N

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2535177461

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                          | 50005 |   341K|   175   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TASK_QUEUE               | 50005 |   341K|   175   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TASK_QUEUE_PROCESSED_IDX | 50005 |       |    91   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PROCESSED"='N')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          5  consistent gets
          0  physical reads
          0  redo size
        562  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         10  rows processed

5 gets isn’t bad. It’s definitely a lot better than the 192 of the full table scan, but it’s not as good as the function-based “collapsing” index approach. But, this is a very small table with not all that many rows in it. Let’s look at the relative sizes of the objects created.

SQL> SELECT segment_name,
  2         segment_type,
  3         bytes,
  4         blocks
  5    FROM user_segments
  6   WHERE segment_name LIKE 'TASK_QUEUE%';

SEGMENT_NAME                    SEGMENT_TYPE            BYTES     BLOCKS
------------------------------- ------------------ ---------- ----------
TASK_QUEUE                      TABLE                 2097152        256
TASK_QUEUE_PROCESSED_IDX        INDEX                 2097152        256
TASK_QUEUE_UNPROCESSED_IDX      INDEX                   65536          8

As expected, the function-based index has collapsed to just a small fraction of the size of the table, whereas the normal index is just as big as the table itself. Of course, with a more complex table the sizes will differ; the intent is simply to show the relative scale of the two indexes. So, while it is possible to get nearly the same execution performance with a hinted normal index, it will never be quite as good and it consumes more space. Furthermore, the normal index will continue to grow forever, while the collapsing index should stay small as long as you regularly process the new records. It won't matter if your table grows into the trillions of rows; if you never have more than a few dozen unprocessed records at a time, the index will never require more than a few blocks to hold them.
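
To see the collapsing behavior itself, a minimal sketch (no output shown): once the outstanding tasks are marked processed, the indexed expression becomes NULL for those rows and their entries drop out of TASK_QUEUE_UNPROCESSED_IDX.

-- Completing the outstanding tasks; NULLIF(processed,'Y') becomes NULL for
-- these rows, so their entries are removed from the collapsing index even
-- though the table itself keeps growing.
UPDATE task_queue
   SET processed = 'Y'
 WHERE NULLIF(processed, 'Y') = 'N';

COMMIT;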

Function-based indexes were introduced in 8i and I’ve used this technique in every db version since.
I hope you find it as useful as I have.

Note 1: All of the autotrace results shown above were taken after repeated executions and the results stabilized after the effects of caching were removed. You should get comparable results on your systems; but not necessarily on the first execution of any of the queries.

Note 2: This isn't just an Oracle feature. Not indexing NULLs is fairly standard behavior across modern relational databases. DB2 has similar functionality in version 10.5 and above via expression-based indexes. SQL Server doesn't have quite the same functionality, but indexes on non-persisted, computed columns can achieve somewhat similar benefits.

Oracle ACE


Oracle gave me an unexpected present for Christmas this year by naming me an Oracle ACE!

I’m truly honored to be included in the ranks of so many I’ve read and learned from.
Of course I'm also extremely happy to know my efforts to return the favor and share my knowledge and experience have been appreciated.

Thank you!

Unexpected change in CHR() functionality after upgrading to 12c.


I ran into an interesting bug in some code today due to a math error which caused CHR() to be called with non-integer input values.  According to the Oracle documentation, this is perfectly acceptable; but it doesn’t make much sense.  Unfortunately, the documentation doesn’t specify what should happen for such inputs.  This ambiguity is the cause of the surprise.

Furthermore, the functionality is different based on where the function is invoked.

Within SQL: CHR(x) = CHR(FLOOR(x))

Within PL/SQL prior to 12c: CHR(x) = CHR(ROUND(x))

In 12c, PL/SQL and SQL are consistent: CHR(x) = CHR(FLOOR(x))

Here are some examples to illustrate.

SQL> connect system@db_ver_12c
Enter password:
Connected.
SQL> set serveroutput on
SQL> SELECT CHR(65), CHR(65.5), CHR(66), CHR(66.5) FROM DUAL;

C C C C
- - - -
A A B B

SQL> BEGIN
 2 DBMS_OUTPUT.put_line(CHR(65) || ' ' || CHR(65.5) || ' ' || CHR(66) || ' ' || CHR(66.5));
 3 END;
 4 /
A A B B

PL/SQL procedure successfully completed.

SQL> connect system@db_ver_11g
Enter password:
Connected.
SQL> set serveroutput on
SQL> SELECT CHR(65), CHR(65.5), CHR(66), CHR(66.5) FROM DUAL;

C C C C
- - - -
A A B B

SQL> BEGIN
 2 DBMS_OUTPUT.put_line(CHR(65) || ' ' || CHR(65.5) || ' ' || CHR(66) || ' ' || CHR(66.5));
 3 END;
 4 /
A B B C

Note that in 11g the SQL and PL/SQL calls produce different results. I think the 12c functionality is what should be considered correct; but in this case there was some legacy code relying on the old rounding behavior to compensate for a math error.

The silver lining in this story is that identifying the CHR() functionality change highlighted the math error; and once the math was corrected, the fix didn't depend on either FLOOR or ROUND because only integer values would be provided as input.
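
If you can't avoid non-integer inputs and need consistent behavior across versions and contexts, being explicit removes the ambiguity entirely; a minimal sketch:

-- Explicitly choose truncation or rounding rather than relying on
-- what CHR() happens to do with a non-integer argument.
SELECT CHR(FLOOR(65.5)) floored_result,
       CHR(ROUND(65.5)) rounded_result
  FROM DUAL;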

PBKDF2 in Oracle 12c


When I wrote my article on Oracle passwords I included a small, limited implementation of the PBKDF2 algorithm in order to illustrate the 12c hashing methodology. Here I'm publishing a fuller implementation with parameters for the derived key length, the number of hashing iterations, and a choice of HMAC hashing methods (SH1, SH256, SH384, or SH512). I've included methods to return a RAW value (up to 32K octets/bytes) and a VARCHAR2 (up to 16K octets, 32K hex characters) as well as BLOB and CLOB variations if you happen to need a large hash value.

The VARCHAR2 version is simply RAWTOHEX wrapped around get_raw. The BLOB and CLOB versions follow the same overall logic but have some small internal optimizations to help with extremely large results and/or numerous iterations.

My implementation is a fairly straightforward adaptation of the algorithm described in section 5.2 of RFC2898. The only significant deviation is in the BLOB and CLOB functions, where I batch up intermediate concatenations within a RAW or VARCHAR2 before concatenating to the BLOB or CLOB result. I do this for efficiency because LOB operations are slower.

First, a couple of test cases to verify the functionality. These were both taken from the test vectors published in RFC6070.

SQL> select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),1,20,2) from dual;

PBKDF2.GET_HEX('PASSWORD',UTL_RAW.CAST_TO_RAW('SALT'),1,20,2)
-------------------------------------------------------------------------------------------
0C60C80F961F0E71F3A9B524AF6012062FE037A6

SQL> select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),4096,20,2) from dual;

PBKDF2.GET_HEX('PASSWORD',UTL_RAW.CAST_TO_RAW('SALT'),4096,20,2)
-------------------------------------------------------------------------------------------
4B007901B765489ABEAD49D926F721D065A429C1

And now, the PBKDF2 package

CREATE OR REPLACE PACKAGE pbkdf2
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    -- Implementation of algorithm described in section 5.2 of RFC2898
    -- https://tools.ietf.org/html/rfc2898

    -- dk_length refers to number of octets returned for the desired key
    -- regardless of whether the result is raw/blob or hex characters in varchar2/clob
    --   So, a 20-octet key returned by get_raw, would be a 40 character hex string
    --   returned by get_hex.  The dk_length parameter would be 20 in both cases.
    
    -- The following HMAC algorithms are supported
    --   DBMS_CRYPTO.HMAC_SH1    = 2
    --   DBMS_CRYPTO.HMAC_SH256  = 3 
    --   DBMS_CRYPTO.HMAC_SH384  = 4
    --   DBMS_CRYPTO.HMAC_SH512  = 5

    -- Test vectors
    --   https://tools.ietf.org/html/rfc6070

    --  select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),1,20,2) from dual;
    --      0C60C80F961F0E71F3A9B524AF6012062FE037A6 
    --  select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),2,20,2) from dual;
    --      EA6C014DC72D6F8CCD1ED92ACE1D41F0D8DE8957
    --  select pbkdf2.get_hex('password',utl_raw.cast_to_raw('salt'),4096,20,2) from dual;
    --      4B007901B765489ABEAD49D926F721D065A429C1
    --  select pbkdf2.get_hex('passwordPASSWORDpassword',utl_raw.cast_to_raw('saltSALTsaltSALTsaltSALTsaltSALTsalt'),4096,25,2) from dual;
    --      3D2EEC4FE41C849B80C8D83662C0E44A8B291A964CF2F07038

    FUNCTION get_raw(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN RAW
        DETERMINISTIC;

    FUNCTION get_hex(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN VARCHAR2
        DETERMINISTIC;

    FUNCTION get_blob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN BLOB
        DETERMINISTIC;

    FUNCTION get_clob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN CLOB
        DETERMINISTIC;
END;
/
CREATE OR REPLACE PACKAGE BODY pbkdf2
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    -- Implementation of algorithm described in section 5.2 of RFC2898
    -- https://tools.ietf.org/html/rfc2898

    -- dk_length refers to number of octets returned for the desired key
    -- regardless of whether the result is raw/blob or hex characters in varchar2/clob
    --   So, a 20-octet key returned by get_raw, would be a 40 character hex string
    --   returned by get_hex.  The dk_length parameter would be 20 in both cases.

    -- The following HMAC algorithms are supported
    --   DBMS_CRYPTO.HMAC_SH1    = 2
    --   DBMS_CRYPTO.HMAC_SH256  = 3
    --   DBMS_CRYPTO.HMAC_SH384  = 4
    --   DBMS_CRYPTO.HMAC_SH512  = 5

    c_max_raw_length   CONSTANT PLS_INTEGER := 32767;
    c_max_hex_length   CONSTANT PLS_INTEGER := 32767;

    SUBTYPE t_maxraw IS RAW(32767);

    SUBTYPE t_maxhex IS VARCHAR2(32767);

    SUBTYPE t_hmac_result IS RAW(64);           -- must be big enough to hold largest supported HMAC

    FUNCTION iterate_hmac_xor(
        p_salt             IN RAW,
        p_iterations       IN PLS_INTEGER,
        p_hmac             IN PLS_INTEGER,
        p_block_iterator   IN PLS_INTEGER,
        p_raw_password     IN RAW
    )
        RETURN t_hmac_result
    IS
        v_u           t_maxraw;
        v_f_xor_sum   t_hmac_result;
    BEGIN
        -- The RFC describes the U(1)...U(c) values recursively
        -- but the implementation below simply loops with a stored value
        -- to achieve the same functionality.
        v_u :=
            UTL_RAW.CONCAT(
                p_salt,
                UTL_RAW.cast_from_binary_integer(p_block_iterator, UTL_RAW.big_endian)
            );

        v_u := DBMS_CRYPTO.mac(src => v_u, typ => p_hmac, key => p_raw_password);
        v_f_xor_sum := v_u;

        FOR c IN 2 .. p_iterations
        LOOP
            v_u := DBMS_CRYPTO.mac(src => v_u, typ => p_hmac, key => p_raw_password);
            v_f_xor_sum := UTL_RAW.bit_xor(v_f_xor_sum, v_u);
        END LOOP;

        RETURN v_f_xor_sum;
    END iterate_hmac_xor;

    FUNCTION get_raw(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN RAW
        DETERMINISTIC
    IS
        c_hlen           CONSTANT PLS_INTEGER
            := CASE p_hmac
                   WHEN DBMS_CRYPTO.hmac_sh1 THEN 20
                   WHEN DBMS_CRYPTO.hmac_sh256 THEN 32
                   WHEN DBMS_CRYPTO.hmac_sh384 THEN 48
                   WHEN DBMS_CRYPTO.hmac_sh512 THEN 64
               END ;

        c_octet_blocks   CONSTANT PLS_INTEGER := CEIL(p_dk_length / c_hlen);
        v_t_concat                t_maxraw := NULL;
        v_block_iterator          PLS_INTEGER := 1;
    BEGIN
        -- raise exception message per rfc
        -- but this limit is rather moot since the function
        -- is capped by raw limits
        IF p_dk_length > (POWER(2, 32) - 1) * c_hlen
        THEN
            raise_application_error(-20001, 'derived key too long');
        ELSIF p_dk_length > c_max_raw_length
        THEN
            raise_application_error(-20001, 'raw output must be less than 32K bytes');
        END IF;

        IF p_iterations < 1
        THEN
            raise_application_error(-20001, 'must iterate at least once');
        END IF;

        -- Loop one block of hlen-octets at a time of the derived key.
        -- If we build a key past the desired length then exit early, no need to continue
        WHILE     v_block_iterator <= c_octet_blocks
              AND (v_t_concat IS NULL OR UTL_RAW.LENGTH(v_t_concat) < p_dk_length)
        LOOP
            v_t_concat :=
                UTL_RAW.CONCAT(
                    v_t_concat,
                    iterate_hmac_xor(
                        p_salt,
                        p_iterations,
                        p_hmac,
                        v_block_iterator,
                        UTL_RAW.cast_to_raw(p_password)
                    )
                );
            v_block_iterator := v_block_iterator + 1;
        END LOOP;

        RETURN UTL_RAW.SUBSTR(v_t_concat, 1, p_dk_length);
    END get_raw;

    FUNCTION get_hex(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN VARCHAR2
        DETERMINISTIC
    IS
    BEGIN
        -- raise exception message per rfc
        -- but this limit is rather moot since the function
        -- is capped by varchar2 limits
        IF p_dk_length > c_max_raw_length / 2
        THEN
            raise_application_error(-20001, 'hex representation must be less than 32K characters');
        END IF;

        RETURN RAWTOHEX(
                   get_raw(
                       p_password,
                       p_salt,
                       p_iterations,
                       p_dk_length,
                       p_hmac
                   )
               );
    END get_hex;

    FUNCTION get_blob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN BLOB
        DETERMINISTIC
    IS
        c_hlen           CONSTANT PLS_INTEGER
            := CASE p_hmac
                   WHEN DBMS_CRYPTO.hmac_sh1 THEN 20
                   WHEN DBMS_CRYPTO.hmac_sh256 THEN 32
                   WHEN DBMS_CRYPTO.hmac_sh384 THEN 48
                   WHEN DBMS_CRYPTO.hmac_sh512 THEN 64
               END ;

        c_octet_blocks   CONSTANT PLS_INTEGER := CEIL(p_dk_length / c_hlen);
        v_t_concat                BLOB;
        v_block_iterator          PLS_INTEGER := 1;
        v_temp                    t_maxraw;
    BEGIN
        -- raise exception message per rfc
        IF p_dk_length > (POWER(2, 32) - 1) * c_hlen
        THEN
            raise_application_error(-20001, 'derived key too long');
        END IF;

        IF p_iterations < 1
        THEN
            raise_application_error(-20001, 'must iterate at least once');
        END IF;

        DBMS_LOB.createtemporary(lob_loc => v_t_concat, cache => FALSE, dur => DBMS_LOB.session);

        -- Loop one block of hlen-octets at a time of the derived key.
        -- If we build a key past the desired length then exit early, no need to continue
        WHILE v_block_iterator <= c_octet_blocks AND (DBMS_LOB.getlength(v_t_concat) < p_dk_length)
        LOOP
            v_temp :=
                UTL_RAW.CONCAT(
                    v_temp,
                    RAWTOHEX(
                        iterate_hmac_xor(
                            p_salt,
                            p_iterations,
                            p_hmac,
                            v_block_iterator,
                            UTL_RAW.cast_to_raw(p_password)
                        )
                    )
                );

            -- Concatenating raw is faster than blob
            -- only concatenate to the blob when the temp raw fills up
            IF UTL_RAW.LENGTH(v_temp) > c_max_raw_length - c_hlen
            THEN
                DBMS_LOB.writeappend(v_t_concat, UTL_RAW.LENGTH(v_temp), v_temp);
                v_temp := NULL;
            END IF;

            v_block_iterator := v_block_iterator + 1;
        END LOOP;

        DBMS_LOB.writeappend(v_t_concat, UTL_RAW.LENGTH(v_temp), v_temp);

        DBMS_LOB.TRIM(v_t_concat, p_dk_length);

        RETURN v_t_concat;
    END get_blob;

    FUNCTION get_clob(
        p_password     IN VARCHAR2,
        p_salt         IN RAW,
        p_iterations   IN PLS_INTEGER,
        p_dk_length    IN PLS_INTEGER,
        p_hmac         IN PLS_INTEGER DEFAULT DBMS_CRYPTO.hmac_sh512
    )
        RETURN CLOB
        DETERMINISTIC
    IS
        c_hlen           CONSTANT PLS_INTEGER
            := CASE p_hmac
                   WHEN DBMS_CRYPTO.hmac_sh1 THEN 20
                   WHEN DBMS_CRYPTO.hmac_sh256 THEN 32
                   WHEN DBMS_CRYPTO.hmac_sh384 THEN 48
                   WHEN DBMS_CRYPTO.hmac_sh512 THEN 64
               END ;

        c_octet_blocks   CONSTANT PLS_INTEGER := CEIL(p_dk_length / c_hlen);
        v_t_concat                CLOB;
        v_block_iterator          PLS_INTEGER := 1;
        v_temp                    t_maxhex;
    BEGIN
        -- raise exception message per rfc
        IF p_dk_length > (POWER(2, 32) - 1) * c_hlen
        THEN
            raise_application_error(-20001, 'derived key too long');
        END IF;

        IF p_iterations < 1
        THEN
            raise_application_error(-20001, 'must iterate at least once');
        END IF;

        DBMS_LOB.createtemporary(lob_loc => v_t_concat, cache => FALSE, dur => DBMS_LOB.session);

        -- Loop one block of hlen-octets at a time of the derived key.
        -- If we build a key past the desired length then exit early, no need to continue
        -- The end result is a HEX string, so double the length (2 characters to represent one byte)
        WHILE     v_block_iterator <= c_octet_blocks
              AND (DBMS_LOB.getlength(v_t_concat) < p_dk_length * 2)
        LOOP
            v_temp :=
                   v_temp
                || RAWTOHEX(
                       iterate_hmac_xor(
                           p_salt,
                           p_iterations,
                           p_hmac,
                           v_block_iterator,
                           UTL_RAW.cast_to_raw(p_password)
                       )
                   );

            -- Concatenating varchar2 is faster than clob
            -- only concatenate to the clob when the temp varchar2 fills up
            IF LENGTH(v_temp) > c_max_hex_length - 2 * c_hlen
            THEN
                v_t_concat := v_t_concat || v_temp;
                v_temp := NULL;
            END IF;

            v_block_iterator := v_block_iterator + 1;
        END LOOP;

        v_t_concat := v_t_concat || v_temp;
        DBMS_LOB.TRIM(v_t_concat, p_dk_length * 2);

        RETURN v_t_concat;
    END get_clob;
END;
/

The PBKDF2 algorithm is supposed to be slow to execute in order to discourage brute-force attacks. While I did use a few coding techniques to try to help performance, this is still not a fast implementation. If you need maximal speed then I recommend a C library. For short strings and fewer than 100,000 iterations the package should have adequate speed for most use cases.
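
As a usage sketch, here is a call that takes the default SH512 HMAC and produces a 64-octet key; the input values are only illustrative (there is no published test vector for this combination):

DECLARE
    v_key   RAW(64);
BEGIN
    -- p_hmac defaults to DBMS_CRYPTO.hmac_sh512, so the 64-octet key
    -- is produced by a single HMAC block.
    v_key :=
        pbkdf2.get_raw(p_password     => 'password',
                       p_salt         => UTL_RAW.cast_to_raw('salt'),
                       p_iterations   => 4096,
                       p_dk_length    => 64);

    DBMS_OUTPUT.put_line(RAWTOHEX(v_key));
END;
/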

The code presented above requires at least 12cR1, but could run on lower versions by changing the supported hashing methods.

I hope it helps, questions and comments are always welcome.

How Oracle Stores Passwords article updated through 12.2


Added section on 11g and 12c SQLNET.ORA parameters.

WordPress article here.

A pdf of this article can be downloaded from my dropbox here.

How to pass a vector or data.frame from R to Oracle as a bind variable in a select statement using ROracle.


Recently I was asked how to bind an R variable to a SQL statement executed with ROracle.  In this case it wasn’t a simple  “select * from table where x = :b1 and y = :b2.”

You can do those quite easily with a data.frame with only one row in it using dbGetQuery(connection, sqlstatement, data.frame).  It will even work for some bulk binding, such as a multi-row insert from a data.frame or a vector.

Here though, the developer wanted to pass in an R set as an in-list for a WHERE clause. That is, something like "select * from table where (x,y) in (:b1)", where the bind variable was a data.frame consisting of a few hundred observations of 2 variables each. Of course, the syntax shown is only pseudo-code, which further complicated the requirements: how do you create the query with valid syntax, and how do you pass the data.frame to the query as a bind variable?

First, in the context of a select statement neither a data.frame nor a vector is legal as an input parameter if they have more than one row.  Unfortunately that means we’ll have to take the extra step of converting the data into a legal data type to pass it through the dbGetQuery interface and then do something with that converted data within the SQL statement.

A simple and, more importantly, legal option is to convert the data into a delimited chr variable.  On the Oracle side this will be interpreted as a VARCHAR2 or a CLOB depending on the size, either of which is easily parsed back into the original rows and columns as needed.

First, let’s use a simple vector with just a few elements. We’ll look up a few employees from the sample SCOTT.EMP table.  We can’t use “in (:b1)” as shown above, but if we can generate a collection we can use “member of :b1.”

Since we're going to pass a delimited chr value from R, we need to convert that text into a collection. Previously I posted how to split a clob into a nested table of varchar2 values; we'll use that here too. For this first example the clob handling will be a bit of overkill since our text string is small, but it still illustrates the process.
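
The split_clob function from that earlier post isn't reproduced here. As a rough idea of what is assumed, here's a minimal sketch of a function that splits a delimited CLOB into a nested table of VARCHAR2 values (the type name, function signature, and VARCHAR2(4000) element size are illustrative and may differ from the original):

CREATE OR REPLACE TYPE vc_tab AS TABLE OF VARCHAR2(4000);
/

CREATE OR REPLACE FUNCTION split_clob(p_text IN CLOB, p_delimiter IN VARCHAR2)
    RETURN vc_tab
IS
    v_result   vc_tab := vc_tab();
    v_start    PLS_INTEGER := 1;
    v_pos      PLS_INTEGER;
BEGIN
    LOOP
        v_pos := DBMS_LOB.INSTR(p_text, p_delimiter, v_start);
        v_result.EXTEND;

        IF v_pos IS NULL OR v_pos = 0
        THEN
            -- no more delimiters, keep whatever remains and stop
            v_result(v_result.COUNT) :=
                DBMS_LOB.SUBSTR(p_text, DBMS_LOB.GETLENGTH(p_text) - v_start + 1, v_start);
            EXIT;
        END IF;

        v_result(v_result.COUNT) := DBMS_LOB.SUBSTR(p_text, v_pos - v_start, v_start);
        v_start := v_pos + LENGTH(p_delimiter);
    END LOOP;

    RETURN v_result;
END;
/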

So, first, we’ll construct a vector, then make a chr variable from the vector.  Finally, pass the chr into a query using the split_clob function to create a nested table collection for use with the member of condition.

> employees_vector <- c("SCOTT", "KING", "ADAMS")
> employees_vector
[1] "SCOTT" "KING" "ADAMS"
> employees_chr <- paste(employees_vector, collapse = ",")
> employees_chr
[1] "SCOTT,KING,ADAMS"
> employees_df <- dbGetQuery(con, "select * from emp where ename member of split_clob(:b1,',')", employees_chr)
> employees_df
  EMPNO ENAME       JOB   MGR            HIREDATE  SAL COMM DEPTNO
1  7788 SCOTT   ANALYST  7566 1987-04-19 00:00:00 3000   NA     20
2  7839 KING  PRESIDENT    NA 1981-11-16 23:00:00 5000   NA     10
3  7876 ADAMS     CLERK  7788 1987-05-23 00:00:00 1100   NA     20

It is possible to use an IN clause in the query, but you must create a subquery for the condition. So, using the same chr variable we still use split_clob, but we then use the TABLE function to turn the resulting collection into a data source for the subquery.

> employees_df2 <- dbGetQuery(con,"select * from emp where ename in (select * from table(split_clob(:b1,',')))",employees_chr)
> employees_df2
   EMPNO ENAME       JOB  MGR            HIREDATE  SAL COMM DEPTNO
1   7788 SCOTT   ANALYST 7566 1987-04-19 00:00:00 3000   NA     20
2   7839 KING  PRESIDENT   NA 1981-11-16 23:00:00 5000   NA     10
3   7876 ADAMS     CLERK 7788 1987-05-23 00:00:00 1100   NA     20

It is also possible to use the text field directly with a simple INSTR (e.g. where instr('KING,SCOTT,ADAMS', ename) > 0), but doing so reliably is more difficult. Also, by leaving the table column untouched, indexes on the table can be used more reliably. Due to these limitations I'm not providing examples. Again, it is possible to do so, but not recommended.

The examples above are for sets where each row contains only a single value (a vector, or a data.frame with a single column). But what if you need multi-column checks? I.e. something of the form "select * from table1 where (a,b) in (select x,y from table2)", where "table2" is somehow based on our data.frame contents.

While the basic idea is the same, the use of multiple columns in the condition creates an additional challenge because we need to somehow encode the rows and columns into a chr field such that the fields are distinct but still grouped by row.  Also the split_clob function will only generate one value for each row instead of reconstructing all of the individual fields.

First, on the R side, we'll use the paste function again, but twice: once with the separation delimiter and then again with the collapse delimiter.

> input_df <- data.frame(jobs  = c("CLERK", "CLERK", "ANALYST", "PRESIDENT"),
+                        depts = c("SALES", "ACCOUNTING", "RESEARCH", "ACCOUNTING"))
> input_df
       jobs      depts
1     CLERK      SALES
2     CLERK ACCOUNTING
3   ANALYST   RESEARCH
4 PRESIDENT ACCOUNTING
> input_chr <- paste(paste(input_df$jobs, input_df$depts, sep = "|"), collapse = ",")
> input_chr
[1] "CLERK|SALES,CLERK|ACCOUNTING,ANALYST|RESEARCH,PRESIDENT|ACCOUNTING"
>

Now we have a single chr field, where each row is delimited with commas and the fields within the row are delimited with pipes.  Using split_clob we can separate the string into 4 fields and then parse each of those into a row of 2 fields.

Removing R for a moment, we can test the splitting and parsing to see what the subquery will return.

SELECT SUBSTR(COLUMN_VALUE, 1, INSTR(COLUMN_VALUE, '|') - 1) jobs,
 SUBSTR(COLUMN_VALUE, INSTR(COLUMN_VALUE, '|') + 1) depts
 FROM TABLE(split_clob('CLERK|SALES,CLERK|ACCOUNTING,ANALYST|RESEARCH,PRESIDENT|ACCOUNTING', ','));

JOBS         DEPTS 
------------ --------------
CLERK        SALES 
CLERK        ACCOUNTING 
ANALYST      RESEARCH 
PRESIDENT    ACCOUNTING

4 rows selected.

Now we put it all together and pull some information from the emp and dept tables about our job/department pairs.

> emp_dept_df <- dbGetQuery(con,
+     "select e.ename, e.job, e.sal, d.dname, d.loc
+        from emp e, dept d
+       where e.deptno = d.deptno
+         and (e.job, d.dname) in
+                 (select substr(column_value, 1, instr(column_value, '|') - 1),
+                         substr(column_value, instr(column_value, '|') + 1)
+                    from table(split_clob(:b1, ',')))",
+     input_chr)
> emp_dept_df
   ENAME       JOB  SAL      DNAME      LOC
1   KING PRESIDENT 5000 ACCOUNTING NEW YORK
2 MILLER     CLERK 1300 ACCOUNTING NEW YORK
3   FORD   ANALYST 3000   RESEARCH DALLAS
4  SCOTT   ANALYST 3000   RESEARCH DALLAS
5  JAMES     CLERK  950      SALES CHICAGO
>

Obviously, as the number of fields in a data.frame expands, the parsing will get longer and more complicated. If you'll be performing similar queries often, you may want to build a dedicated function within the database that combines the split_clob functionality with the parsing of the individual fields and returns a collection of user-defined types. Also, these examples used text fields and columns, but the data could include dates or numeric values; in that case you would need to ensure consistent formatting in the string construction so the parsing can be accomplished correctly and reliably. The overall technique remains the same.
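
Here's a minimal sketch of such a dedicated function, assuming a split_clob function like the sketch shown earlier; the object type, collection type, and function names (job_dept_t, job_dept_tab, split_job_dept) are hypothetical:

CREATE OR REPLACE TYPE job_dept_t AS OBJECT (job VARCHAR2(50), dname VARCHAR2(50));
/

CREATE OR REPLACE TYPE job_dept_tab AS TABLE OF job_dept_t;
/

CREATE OR REPLACE FUNCTION split_job_dept(p_list IN CLOB)
    RETURN job_dept_tab
IS
    v_result   job_dept_tab := job_dept_tab();
BEGIN
    -- Split the comma-delimited rows, then parse each row on the pipe delimiter.
    FOR r IN (SELECT column_value cv
                FROM TABLE(split_clob(p_list, ',')))
    LOOP
        v_result.EXTEND;
        v_result(v_result.COUNT) :=
            job_dept_t(SUBSTR(r.cv, 1, INSTR(r.cv, '|') - 1),
                       SUBSTR(r.cv, INSTR(r.cv, '|') + 1));
    END LOOP;

    RETURN v_result;
END;
/

With something like that in place, the subquery reduces to: select job, dname from table(split_job_dept(:b1)).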

As mentioned above, these steps do entail extra processing on both the R side and the Oracle side of the interface; but for most use cases the extra resource consumption should be minor, and this method provides a workaround for a syntax limitation in the ROracle/DBI functionality suite.

I hope you find it useful.
