Timing is Everything, but Recording it is Even Better


I got back from speaking at Collaborate 18 a week ago. My topic was performance diagnostics.
The recurring theme of my talk is that developers should leave themselves some breadcrumbs to mark their trail.

The Oracle database has wonderful diagnostic tools like ASH/AWR, SQL Monitor, tracing, and the hierarchical profiler; but as good as these tools are, the best way of measuring performance issues in your code should be your own logs. In particular, logs with timestamps.

It always blows my mind when a developer asks me to do initial triage on a system because they don’t know where the long-running processes are. Why don’t you have logs that capture the begin and end points of important processes? Even if you don’t have a reason to suspect a performance issue, including a simple “Starting ABC”/”Ending ABC” set of bookends provides scope to other debug or error messages. Later, if you do have a reason to check performance, having the start and end points of your process logged gives you a means of checking how long it ran.

Coincidentally, the day I got back to the office after the conference, I was contacted about a process running inordinately slowly. Of course I asked where the slow part was; they couldn’t really say, but they took a guess at one particular procedure. As it happens, about 10 years ago I wrote a lot of the code behind that procedure. Naturally, I was interested to see if I had learned anything in the past 10 years that I could use to make it faster; but first, I simply wrapped the procedure with the app team’s own logging framework and asked them to run the application.

What we found was the old procedure was actually running quite fast, averaging about 0.13 seconds per run with a worst case of about 0.82 seconds. The problem was it was being called repeatedly inside an expensive loop with each iteration feeding the procedure a small amount of data to process. The fix was to remove the iterative process and allow the old procedure to run on larger data sets. The total execution time dropped from minutes to seconds.

While I had full DBA access to this system, I didn’t use an AWR report, 10046 tracing, SQL Monitor, a pl/sql profiler, ASH data, or any sort of “Top SQL” scripts and tools to determine the problem. I didn’t need an extensive instrumentation system for this, all I needed was a set of simple timestamped begin and end messages to tell me where the application was spending its time. Of course I could add even more logging to get more granular timings; but the point is, it’s often not necessary. If you can track the scope of work progression, you can determine, without extended privileges to administrative views or tools, where your application is spending its time.

You can then talk to your DBA about “why” a particular SQL statement or procedure is slow, and determine what to do about it. I’m not going to recommend any particular instrumentation framework or tool because the specific choice isn’t as important as the methodology. It’s quite simple: create a procedure to write messages with timestamps to a table, preferably within an autonomous transaction. Then, adding a mere 3 lines of code to each procedure or function will be sufficient for tracking most performance-related issues.

The most basic logging framework could be nothing more than this:

CREATE TABLE app_log
(
    logtime       TIMESTAMP WITH TIME ZONE,
    logmessage    VARCHAR2(4000)
);

CREATE OR REPLACE PROCEDURE write_log(p_message IN VARCHAR2)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    INSERT INTO app_log(logtime, logmessage)
         VALUES (SYSTIMESTAMP, p_message);

    COMMIT;
END;

Usage of the above could be as simple as begin and end messages. The end will have 2 or more variants: one for success and one for failure. If you have multiple exception clauses, then each should have its own end-message.

BEGIN
    write_log('Begin process ....');
    --- Do Process ....
    write_log('End process ....');
EXCEPTION
    WHEN OTHERS THEN
        write_log('End process .... with error');
        --- Handle or reraise exception
END;

You can, of course, expand on the above, and you would most likely want to include other information as well to help properly scope execution. For example, you might want to include the username, machine, session id, or instance number. You can also include intermediate step messages, which can likewise be used to produce performance diagnostic information. If you already have a logging/debug message framework, that’s great; simply extend its usage to include begin/end messages and you’ve turned the debugging into performance instrumentation.
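As a hedged sketch of what such an extension might look like (the extra columns are illustrative choices of mine; SESSION_USER, HOST, and SID are standard SYS_CONTEXT USERENV parameters):

```sql
-- Expanded variant of the app_log table and write_log procedure above.
-- Column names are illustrative; the SYS_CONTEXT attributes are
-- standard USERENV parameters.
CREATE TABLE app_log
(
    logtime       TIMESTAMP WITH TIME ZONE,
    username      VARCHAR2(128),
    machine       VARCHAR2(64),
    session_id    NUMBER,
    logmessage    VARCHAR2(4000)
);

CREATE OR REPLACE PROCEDURE write_log(p_message IN VARCHAR2)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    INSERT INTO app_log(logtime, username, machine, session_id, logmessage)
         VALUES (SYSTIMESTAMP,
                 SYS_CONTEXT('USERENV', 'SESSION_USER'),
                 SYS_CONTEXT('USERENV', 'HOST'),
                 TO_NUMBER(SYS_CONTEXT('USERENV', 'SID')),
                 p_message);

    COMMIT;
END;
```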

It’s such a small amount of work for immense potential benefit. Unfortunately, performance instrumentation is often underutilized.
Also note, these logs can be defensive in nature. If your application is just one cog in a larger machine, perhaps a web service or a process inside a larger ETL system, having logs of when your portion of the work began and when it ended can help resolve finger-pointing as to where a problem lies.

A reliable set of performance metrics also makes for easier audits of SLA/QoS contracts or can be used internally for alerts when an SLA threshold is near breaching.

Also, as mentioned earlier, these logs don’t need extended privileges, so it is easier to perform investigation on systems where you may have less access than others.
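To illustrate, assuming the simple app_log table above and a consistent “Begin X”/“End X” message convention, elapsed times can be pulled from the log without any special privileges; a sketch:

```sql
-- Pair each "Begin" message with the next matching "End" message and
-- compute the elapsed time. Assumes the 'Begin <name>' / 'End <name>'
-- message convention used in this article.
SELECT b.logmessage,
       b.logtime             start_time,
       e.logtime - b.logtime elapsed
  FROM app_log b, app_log e
 WHERE b.logmessage LIKE 'Begin%'
   AND e.logmessage = REPLACE(b.logmessage, 'Begin', 'End')
   AND e.logtime =
          (SELECT MIN(x.logtime)
             FROM app_log x
            WHERE x.logmessage = e.logmessage
              AND x.logtime >= b.logtime)
 ORDER BY elapsed DESC;
```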

I hope this helps inspire you to track your own application’s execution with some sort of logging framework. Which one isn’t as important as consistently using it.


18c JSON TO_UTC_TIMESTAMP_TZ bug?


I tried the following query on 18c via Oracle’s LiveSQL site and the results don’t correspond to the documented functionality; or, at least, not my reading of the functionality.

select to_utc_timestamp_tz('2018-04-27') t from dual
union all
select to_utc_timestamp_tz('2018-04-27T17:40:25+00:00') from dual
union all
select to_utc_timestamp_tz('2018-04-27T17:40:25Z') from dual
union all
select to_utc_timestamp_tz('20180427T174025Z') from dual
union all
select to_utc_timestamp_tz('2018-04-27T17:40:25+05:00') from dual --- This one is not correct per documented functionality

producing the following results

T
27-APR-18 12.00.00.000000 AM +00:00
27-APR-18 05.40.25.000000 PM +00:00
27-APR-18 05.40.25.000000 PM +00:00
27-APR-18 05.40.25.000000 PM +00:00
27-APR-18 05.40.25.000000 PM +05:00

Note the first four rows all return the UTC (+00:00) time zone for the corresponding input value. This is to be expected based not only on the name of the function but also the documentation.
The fifth row, though, preserves the +05:00 offset. This seems contrary to the documentation:

https://docs.oracle.com/en/database/oracle/oracle-database/18/adjsn/changes.html#GUID-7F43AE55-4176-477E-8D3B-021ED838106D
or
MOS Doc ID 2357879.1

“SQL function to_UTC_timestamp_tz takes as input an ISO 8601 date format string and returns an instance of SQL data type TIMESTAMP WITH TIMEZONE. It normalizes the input to UTC time (Coordinated Universal Time, formerly Greenwich Mean Time).”

By my reading, I would expect any successful transformation to return a timestamp with time zone value in the UTC (+00:00) time zone. So is this a misreading on my part, a functionality bug, or a documentation bug? Interestingly, the function is not included in the SQL Language Reference, only in the JSON Developer’s Guide; that isn’t a bug per se, but it seems like a documentation oversight.
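If the fifth case is indeed a bug, one possible workaround (assuming the offset itself is parsed correctly and only the normalization step is skipped) is to normalize the result yourself:

```sql
-- Convert the result to UTC explicitly, regardless of whether the
-- function preserved the input offset.
SELECT to_utc_timestamp_tz('2018-04-27T17:40:25+05:00') AT TIME ZONE 'UTC' t
  FROM dual;

-- Or, if a plain TIMESTAMP (without time zone) is acceptable:
SELECT SYS_EXTRACT_UTC(to_utc_timestamp_tz('2018-04-27T17:40:25+05:00')) t
  FROM dual;
```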

Performance Diagnostics and Tuning With and Without Your DBA


I presented this topic at the Collaborate 2018 conference.

My slides can be found here.

A couple of the slides included simplified queries so the text would fit on a single screen and remain legible.

To find who is using temp, a more functionally robust query can be found here:
https://seanstuber.wordpress.com/2018/03/08/who-is-using-temp-and-what-are-they-doing/

Similarly, the who is blocking whom query has a fuller version here:
https://seanstuber.wordpress.com/2018/03/04/who-is-blocking-whom-and-what-are-they-doing/

Thanks to all who attended and asked questions and/or commented.

Reading a PDF blob with PL/SQL and SQL


This is another article inspired by a recent request. That is, given a PDF file with tabular data, could I extract the data into a more human-usable format?
Yes, it is possible but it requires multiple steps and does come with some caveats.

The first step is converting the PDF into a format that is more easily processed with SQL and PL/SQL. To that end, Oracle Text includes functionality to parse a PDF BLOB into an XHTML CLOB.

With a BLOB containing a PDF with contents that look like this: [image of a sample table with Item, Qty, Price, and Total columns]

We can convert that to text of this form (the output shown here is reconstructed and abbreviated; the exact XHTML header the filter produces will vary):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style type="text/css">
div.c {
 white-space:pre;
 position:absolute;
}
</style>
</head>
<body>
...
</body>
</html>

XHTML can’t be parsed directly using native functions; but simply removing the !DOCTYPE declaration is sufficient to turn the contents into usable XML.
Within the BODY of the document, the PDF text will be stored in DIV tags such as these (the top and left pixel values shown are illustrative reconstructions):

<div class="c" style="top:126;left:70">Pocket</div>
<div class="c" style="top:126;left:195">3</div>
<div class="c" style="top:126;left:322">$1.68</div>
<div class="c" style="top:126;left:430">$5.04</div>
<div class="c" style="top:152;left:70">Wocket</div>
<div class="c" style="top:152;left:195">5</div>
<div class="c" style="top:152;left:322">$2.00</div>
<div class="c" style="top:152;left:430">$10.00</div>
<div class="c" style="top:178;left:70">Widget</div>
<div class="c" style="top:178;left:188">13</div>
<div class="c" style="top:178;left:322">$2,342.23</div>
<div class="c" style="top:178;left:430">$30,448.99</div>

However, there is no concept of a “table” associated with these nodes. To create rows and columns, the style attribute must be parsed based on the top and left pixel positions. Data in the same “row” will usually be written with the same top location. Columns are a little more difficult because numeric data will typically be right-justified, but the text is stored within the PDF based on its left-most position. To construct more traditional columns you must count values within a row: the 1st position within a row, regardless of its pixel location, is the 1st column, the 2nd position is the 2nd column, and so on.

Thus, in the example above, column one is “Pocket”, “Wocket”, and “Widget” all at position 70. Column two is the number 3 at position 195, number 5 also at position 195 and then number 13 at position 188. The same process is repeated for the Price and Total columns. Pivoting by the column counters can turn the sequential DIV values into traditional rows and columns, producing results like this:

A                    B                    C                    D                   
-------------------- -------------------- -------------------- --------------------
Sample Table                                                                       
Item                 Qty                  Price                Total               
Pocket               3                    $1.68                $5.04               
Wocket               5                    $2.00                $10.00              
Widget               13                   $2,342.23            $30,448.99                                                                                     

Note, the table header became left-justified as a result of the parsing technique because there was only one “column” in that row. Similar issues will result if there are blank (null) values within a table, since there will be no placeholder to count. So, this technique is not guaranteed to produce immediately useful results; the output may require manual adjustment if data is out of sequence.

I have loaded a few blobs into a table, simply named “my_blobs” and extract the contents with queries of this form, extending the PIVOT clause to add columns as needed. The PDF referenced below is the same one linked to above.

  SELECT a,
         b,
         c,
         d
    FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val
            FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
                         TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
                         val
                    FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                                  PASSING (SELECT pdf_to_xml(content)
                                             FROM my_blobs
                                            WHERE name = 'SampleTable.pdf')
                                  COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x))
         PIVOT (MAX(val) FOR coln IN (1 a, 2 b, 3 c, 4 d))
ORDER BY rown;

The same query logic works with larger PDFs too. Here’s one from the publicly available Statistical Annex of the European Economy.

SELECT a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
    FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val
            FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
                         TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
                         TRIM(val) val
                    FROM XMLTABLE(
                             xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                             PASSING (SELECT pdf_to_xml(content)
                                        FROM my_blobs
                                       WHERE name = 'https://ec.europa.eu/info/sites/info/files/saee_autumn_2017_en.pdf')
                             COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(10) PATH 'text()') x))
         PIVOT (MAX(val)
               FOR coln
               IN (1 a,2 b,3 c,4 d,5 e,6 f,7 g,8 h,9 i,10 j,11 k,12 l,13 m,14 n,15 o,16 p,17 q,18 r,19 s,20 t,21 u,22 v,23 w,24 x,25 y,26 z))
ORDER BY rown

Scrolling through the output we can pick out individual tables, such as this one listing Population and Labor market numbers for various countries by year. Note this also provides an example of the problem with missing values in a table: 1991 has a summary row in addition to a row of the individual country values, and the values from the right side of the table are shifted to the left by the column counting.

1960       9153       55433      1221       2835       8332       30263      46636      50200      573        2121       2779       332        333        11487      7047       8858       1585       4068       4430       247685     235006                                                
1965       9464       58619      1305       2877       8550       31962      49825      52112      591        2266       2971       350        323        12295      7271       8999       1649       4371       4564       260363     246887                                                
1970       9656       60651      1371       2951       8793       33761      51939      53822      614        2359       3140       358        327        13032      7467       8680       1725       4538       4606       269790     255715                                                
1975       9801       61829      1441       3178       9047       35758      53926      55441      502        2456       3302       365        309        13660      7579       9093       1794       4739       4711       278930     264388                                                
1980       9859       61566      1489       3402       9643       37491      55150      56434      509        2512       3413       364        321        14148      7549       9766       1901       4980       4780       285277     270152                                                
1981       9858       61682      1500       3444       9729       37759      55462      56502      515        2519       3433       365        324        14247      7569       9851       1907       5016       4800       286481     271268                                                
1982       9856       61638      1510       3481       9790       37986      55797      56544      521        2531       3457       366        331        14312      7574       9912       1910       5055       4827       287397     272081                                                
1983       9854       61423      1521       3506       9847       38172      56096      56564      528        2546       3485       366        335        14368      7562       9958       1922       5092       4856       288000     272571                                                
1984       9855       61175      1531       3530       9896       38330      56369      56577      535        2562       3514       366        335        14423      7561       9996       1932       5127       4882       288497     272961                                                
1985       9857       61024      1541       3541       9934       38470      56649      56593      542        2579       3545       367        341        14488      7565       10024      1942       5162       4902       289064     273413                                                
1986       9859       61066      1552       3542       9967       38585      56937      56596      548        2600       3579       368        347        14567      7570       10033      1966       5194       4918       289794     274008                                                
1987       9870       61077      1565       3543       10001      38685      57244      56602      554        2627       3616       371        350        14664      7575       10030      1990       5223       4932       290516     274593                                                
1988       9904       61449      1574       3531       10037      38767      57572      56629      560        2653       3655       374        353        14760      7585       10020      1995       5251       4947       291615     275574                                                
1989       9940       62063      1581       3510       10089      38828      57912      56672      568        2667       3684       378        356        14846      7620       10005      1996       5276       4964       292954     276826                                                
1990       9968       63253      1582       3506       10197      38867      58227      56719      580        2663       3698       382        360        14947      7678       9983       1998       5299       4986       294894     278714                                                
1991       640                   280354                                                                                                                                                                                                                                                      
1991       10006      79973      1574       3526       10320      38966      58520      56759      595        2651       3704       387        364        15068      7755       9960       1999       5303       5014       312444     296253                                                
1992       10047      80500      1545       3549       10399      39158      58811      56797      611        2614       3700       392        367        15182      7841       9952       1996       5305       5042       313811     297671                                                
1993       10086      80946      1506       3563       10460      39361      59066      56832      626        2563       3683       398        371        15290      7906       9965       1992       5325       5066       315006     298940                                                
1994       10116      81147      1474       3571       10513      39549      59286      56843      639        2521       3657       404        375        15381      7936       9992       1989       5347       5088       315828     299826                                                
1995       10137      81308      1448       3601       10562      39719      59501      56844      651        2485       3629       410        378        15460      7948       10026      1989       5363       5108       316567     300624                                                
1996       10157      81466      1425       3626       10609      39884      59713      56860      661        2457       3602       414        380        15526      7959       10064      1990       5374       5125       317292     301403                                                
1997       10181      81510      1406       3661       10661      40050      59926      56890      671        2433       3575       420        383        15608      7968       10109      1986       5383       5140       317962     302124                                                
1998       10203      81446      1393       3714       10721      40214      60147      56907      679        2410       3549       425        385        15703      7977       10160      1982       5391       5154       318560     302770                                                
1999       10226      81422      1379       3755       10762      40370      60457      56916      687        2391       3524       431        387        15809      7992       10218      1984       5396       5166       319272     303524                                                
2000       10251      81457      1401       3804       10806      40554      60872      56942      694        2368       3500       437        390        15922      8012       10290      1989       5401       5176       320266     304523                                                
2001       10287      81517      1393       3864       10862      40766      61317      56980      702        2338       3471       442        393        16043      8042       10363      1992       5380       5188       321338     305670                                                
2002       10333      81578      1384       3932       10902      41424      61764      57100      710        2310       3443       447        396        16147      8082       10420      1995       5379       5201       322944     307328                                                
2003       10376      81549      1375       3997       10928      42196      62202      57413      718        2288       3415       452        399        16223      8118       10459      1996       5379       5213       324697     309126                                                
2004       10421      81456      1366       4067       10955      42859      62661      57845      728        2263       3377       459        401        16276      8169       10484      1997       5382       5228       326396     310880                                                
2005       10479      81337      1359       4160       10987      43663      63133      58191      739        2239       3323       466        404        16317      8225       10503      2001       5387       5246       328157     312707                                                
2006       10548      81173      1351       4270       11020      44361      63574      58428      751        2219       3270       473        405        16341      8268       10522      2008       5391       5266       329639     314245                                                
2007       10626      80992      1343       4400       11049      45236      63967      58787      767        2201       3231       481        407        16378      8295       10543      2019       5397       5289       331407     316042                                                
2008       10710      80764      1338       4496       11078      45983      64324      59242      787        2178       3198       489        409        16440      8322       10558      2022       5406       5313       333057     317719                                                
2009       10797      80483      1336       4539       11107      46368      64655      59578      808        2142       3163       498        412        16526      8341       10568      2042       5418       5339       334120     318800                                                
2010       10896      80284      1333       4560       11121      46562      64974      59830      829        2097       3097       508        414        16612      8361       10573      2049       5430       5363       334894     319643                                                
2011       10994      80275      1330       4577       11105      46736      65294      60060      851        2059       3028       519        416        16693      8389       10558      2053       5398       5388       335722     320588                                                
2012       11068      80426      1325       4590       11045      46766      65615      60339      864        2034       2988       532        419        16752      8426       10515      2057       5406       5414       336581     321488                                                
2013       11125      80646      1320       4602       10965      46593      65953      60646      862        2013       2958       545        423        16800      8477       10457      2060       5413       5439       337298     322249                                                
2014       11180      80983      1316       4615       10892      46455      66290      60789      853        1994       2932       558        427        16863      8544       10401      2062       5419       5463       338036     323033                                                
2015       11239      81687      1313       4642       10821      46410      66590      60731      848        1977       2905       569        432        16932      8630       10358      2063       5422       5481       339050     324089                                                
2016       11295      82491      1316       4684       10784      46450      66858      60628      852        1959       2868       584        438        17030      8740       10326      2065       5431       5495       340292     325364                                                
2017       11358      83171      1320       4734       10734      46491      67219      60587      852        1940       2825       598        442        17104      8810       10305      2068       5434       5516       341509     326629                                                
2018       11416      83575      1321       4782       10677      46529      67582      60578      857        1921       2785       613        445        17174      8871       10295      2070       5438       5537       342467     327630                                                
2019       11475      83784      1321       4826       10620      46565      67947      60559      862        1905       2749       627        449        17239      8925       10289      2072       5442       5558       343215     328415                                                

Another option: if you know the ranges of horizontal pixel locations that define the columns, you can use those ranges in a CASE statement instead of counting from left to right. While more complex, this allows you to maintain the correct position even with missing values. The counting method makes the query more reusable across varied documents, but specifying ranges of left-values may produce more desirable results.
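A sketch of the CASE approach against the sample table, with hypothetical pixel boundaries (the real ranges would be read from the document’s own left values):

```sql
-- The BETWEEN boundaries below are hypothetical; inspect the style
-- attributes of your own document to choose real ones.
SELECT rown,
       MAX(CASE WHEN coln BETWEEN   0 AND 180 THEN val END) item,
       MAX(CASE WHEN coln BETWEEN 181 AND 300 THEN val END) qty,
       MAX(CASE WHEN coln BETWEEN 301 AND 420 THEN val END) price,
       MAX(CASE WHEN coln >= 421              THEN val END) total
  FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
               TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
               val
          FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                        PASSING (SELECT pdf_to_xml(content)
                                   FROM my_blobs
                                  WHERE name = 'SampleTable.pdf')
                        COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x)
 GROUP BY rown
 ORDER BY rown;
```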

In both of the queries above I use a function to convert the BLOB to an XMLTYPE. I’ve used two forms of the function. The first, and simpler form uses the deprecated IFILTER function.

CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB)
    RETURN XMLTYPE
IS
    v_clob   CLOB;
    v_xml    XMLTYPE;
BEGIN
    DBMS_LOB.createtemporary(v_clob, TRUE);
    DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite);
    ctx_doc.ifilter(p_pdf, v_clob);

    DBMS_LOB.close(v_clob);

    v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE[^>]+>'));

    DBMS_LOB.freetemporary(v_clob);
    RETURN v_xml;
END;

The second form requires the creation of a text policy and preference before creating the function. For this purpose, some of the simple defaults from the Oracle documentation are sufficient.

BEGIN
    ctx_ddl.create_preference(preference_name => 'fast_filter', object_name => 'AUTO_FILTER');
    ctx_ddl.set_attribute(preference_name   => 'fast_filter',
                          attribute_name    => 'OUTPUT_FORMATTING',
                          attribute_value   => 'FALSE');
    ctx_ddl.create_policy(policy_name => 'my_policy', filter => 'fast_filter');
END;

CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB)
    RETURN XMLTYPE
IS
    v_clob   CLOB;
    v_xml    XMLTYPE;
BEGIN
    DBMS_LOB.createtemporary(v_clob, TRUE);
    DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite);
    ctx_doc.policy_filter('my_policy',
                          p_pdf,
                          v_clob,
                          FALSE);

    DBMS_LOB.close(v_clob);

    v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE[^>]+>'));

    DBMS_LOB.freetemporary(v_clob);
    RETURN v_xml;
END;

If you use the second form and later decide the policy and preference are no longer needed you can drop them:

BEGIN
    ctx_ddl.drop_policy('my_policy');
    ctx_ddl.drop_preference('fast_filter');
END;

Hopefully the function and query above will provide a solid framework for pursuing similar searches within your own PDF files. One last caveat I’ve found is that rows are not always exactly aligned. Fortunately, the worst case I’ve encountered was some columns within a “row” being offset from the others by a single pixel. I was able to compensate by rounding the row value to the nearest 10 with ROUND(…,-1). For documents with small font sizes or tightly spaced rows that may be insufficient, requiring more sophisticated grouping math. Also, I expect that offset of 1 is specific to some of the files I’ve encountered; other files may have more widely varied values, again requiring more complicated rounding or grouping rules to aggregate to consistent values.
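For reference, that ROUND compensation is a one-expression change to the inner query used earlier; a sketch:

```sql
-- Same inner query as above, but rounding the top position to the
-- nearest 10 pixels so slightly misaligned divs share a row value.
SELECT ROUND(TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)), -1) rown,
       TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
       val
  FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                PASSING (SELECT pdf_to_xml(content)
                           FROM my_blobs
                          WHERE name = 'SampleTable.pdf')
                COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x;
```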

NULL vs NOT NULL columns and Index usage


Recently I was asked to look at a problem where a simple query refused to use an index.

The table had several million rows in it with a single index on one column populated with 3 values (in my sample the values are A, B, and C.)

CREATE TABLE null_test
(
    col1           NUMBER,
    col2           NUMBER,
    col3           NUMBER,
    col4           NUMBER,
    test_column    VARCHAR2(1),
    col6           NUMBER,
    col7           NUMBER,
    col8           NUMBER,
    col9           NUMBER,
    col10          NUMBER
);

INSERT INTO null_test
        SELECT LEVEL,
               LEVEL,
               LEVEL,
               LEVEL,
               CHR(ASCII('A') + MOD(LEVEL, 3)),
               LEVEL,
               LEVEL,
               LEVEL,
               LEVEL,
               LEVEL
          FROM DUAL
    CONNECT BY LEVEL <= 3000000;

create index null_test_index on null_test(test_column) ;

exec dbms_stats.gather_table_stats(user,'NULL_TEST');

When a query only references columns in an index then it seems logical the optimizer would decide to use the index only and not reference the table at all. However, as my friend discovered, the exact opposite happened.

SQL> explain plan for select test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Plan hash value: 3898976153

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3000K|  5859K|  6519   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| NULL_TEST |  3000K|  5859K|  6519   (1)| 00:00:01 |
-------------------------------------------------------------------------------

8 rows selected.

Then my friend followed up with another simple query specifying all of the values found in the table. This time, the index was used.

SQL> explain plan for select test_column from null_test where test_column in ('A','B','C');

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------

Plan hash value: 107206137

----------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  3000K|  5859K|  1415   (2)| 00:00:01 |
|*  1 |  INDEX FAST FULL SCAN| NULL_TEST_INDEX |  3000K|  5859K|  1415   (2)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("TEST_COLUMN"='A' OR "TEST_COLUMN"='B' OR "TEST_COLUMN"='C')

13 rows selected.

This is where I was contacted to determine why this seemingly odd behavior would occur.
The reason is that TEST_COLUMN is defined to allow nulls. It just so happens the table is fully populated, but Oracle can't be sure of that at query time. A single-column B-tree index has no entries for rows where that column is NULL, so if someone inserted a row with a null value, the index would miss it. Therefore, the optimizer can't trust the index to return reliable results for "select test_column from null_test" because there could be one or more NULL values among the results, and it specifies a full table scan as the only reliable answer.

The second query imposes an implicit not-null condition on the results: the only rows allowed are those "in ('A','B','C')", and a NULL can never satisfy that condition. Therefore the optimizer knows it can trust the index as a reliable search mechanism.
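To make the risk concrete, here is a quick illustrative sketch (not from the original session) building on the null_test table created above: add a single NULL row and the two kinds of query no longer agree on the data, which is exactly what the optimizer must guard against.

```sql
-- Illustrative sketch (assumes the null_test table built above).
INSERT INTO null_test (col1, test_column) VALUES (0, NULL);

-- Counts every row, including the new NULL -- only the table has it.
SELECT COUNT(*) FROM null_test;                               -- 3000001

-- This predicate excludes NULLs, so the index alone can answer it.
SELECT COUNT(*) FROM null_test
 WHERE test_column IS NOT NULL;                               -- 3000000

ROLLBACK;  -- undo the test row so the later examples still work
```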

Even if you supply a hint telling the optimizer you really want the index, the optimizer is obliged to ignore it because the index could produce invalid results.

SQL> explain plan for select /*+ INDEX(null_test null_test_index) */ test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------

Plan hash value: 3898976153

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3000K|  5859K|  6519   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| NULL_TEST |  3000K|  5859K|  6519   (1)| 00:00:01 |
-------------------------------------------------------------------------------

8 rows selected.

So now the question becomes: "Are NULLs really valid for the test column?" If they are, then a table scan is the only way to include the NULL rows along with the populated ones.
If NULLs are not valid, then the column definition should reflect that rule.

SQL> alter table null_test modify (test_column varchar2(1) not null);

Table altered.

SQL> explain plan for select test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------

Plan hash value: 107206137

----------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  3000K|  5859K|  1407   (1)| 00:00:01 |
|   1 |  INDEX FAST FULL SCAN| NULL_TEST_INDEX |  3000K|  5859K|  1407   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

8 rows selected.

Now, with the column defined as NOT NULL, the optimizer knows it can trust the index to contain every row in the table because there can't be any NULLs.

Many developers think of column definitions merely in terms of data "correctness," but it's important to remember those same data-quality rules are metadata for the optimizer as well. They can, and will, influence execution plans.

SQL for the fun of it: finding typing patterns in words (Part 3 of 3)


Part 1 Part 2

I’ve continued to enjoy this little puzzle and trying to find different ways of solving it. I’ve used the CONNECT BY clause to unpivot words into lists of letters, but in this article I’ll show a more direct recursion on the word itself and then wrap up the series with the smallest and most efficient solution.

First, a recursive technique using CONNECT BY.
I’ll use TRANSLATE to map each letter to its corresponding hand and then recursively check whether each letter alternates hands from the previous one. On the first repeat of a hand, the recursion stops and the search ends for that word. Finding words ending with the left hand is achieved with a simple LIKE condition (WHERE hand LIKE '%L'.)

    SELECT word,
           len,
           LEVEL lvl,
           hand
      FROM (SELECT word,
                   LENGTH(word) len,
                   TRANSLATE(word, 'qwertasdfgzxcvb yuiophjklnm', 'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
              FROM words)
     WHERE hand LIKE '%L'
CONNECT BY word = PRIOR word
       AND LEVEL <= len
       AND (LEVEL = 1 OR SUBSTR(hand, LEVEL - 1, 2) IN ('RL', 'LR'))
       AND PRIOR SYS_GUID() IS NOT NULL;

This parses words into a series like this:

WORD       LEN   LVL HAND     
--------- ---- ----- ---------
aid          3     1 LRL      
aid          3     2 LRL      
aid          3     3 LRL      
baklava      7     1 LLRRLLL  
chair        5     1 LRLRL    
chair        5     2 LRLRL    
chair        5     3 LRLRL    
chair        5     4 LRLRL    
chair        5     5 LRLRL    
purchase     8     1 RRLLRLLL

Finding words that alternate from the first to the last letter is then just a matter of pulling rows where the level reaches the length of the word. In the example above that would be aid and chair, dropping baklava and purchase.

SELECT word
  FROM (    SELECT word, len, LEVEL lvl
              FROM (SELECT word,
                           LENGTH(word)
                               len,
                           TRANSLATE(
                               word,
                               'qwertasdfgzxcvb yuiophjklnm',
                               'LLLLLLLLLLLLLLL RRRRRRRRRRR'
                           )
                               hand
                      FROM words)
             WHERE hand LIKE '%L'
        CONNECT BY     word = PRIOR word
                   AND LEVEL <= len
                   AND (LEVEL = 1 OR SUBSTR(hand, LEVEL - 1, 2) IN ('RL', 'LR'))
                   AND PRIOR SYS_GUID() IS NOT NULL)
 WHERE lvl = len;

That works pretty well. While CONNECT BY can be somewhat CPU- and memory-intensive, I do like that the search exits early when a word repeats a hand, unlike the previously explored methods.
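The early-exit idea can be sketched outside the database as well. Here is a hypothetical Python rendering of the same walk (not part of the original post): map each letter to a hand using the same key layout as the TRANSLATE call, stop scanning at the first repeated hand, and accept only words ending on the left.

```python
# Sketch of the early-exit scan: map letters to hands, bail out at the
# first repeated hand, and accept only words that end on the left hand.
LEFT, RIGHT = set('qwertasdfgzxcvb'), set('yuiophjklnm')

def alternates_ending_left(word):
    hands = ['L' if c in LEFT else 'R' if c in RIGHT else '?' for c in word]
    for prev, cur in zip(hands, hands[1:]):
        if prev == cur or '?' in (prev, cur):  # repeated hand: exit early
            return False
    return hands[-1] == 'L' if hands else False

words = ['aid', 'chair', 'baklava', 'purchase']
print([w for w in words if alternates_ending_left(w)])  # ['aid', 'chair']
```

As in the SQL, 'baklava' fails on its second letter and 'purchase' on its second, so neither is scanned to the end.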

The next, perhaps obvious, method to try is a recursive subquery factoring clause (the WITH clause.) The syntax changes, but the logic of the recursion is essentially the same: translate the letters to hands; for each letter, check whether it alternates from the previous hand; if it does, continue, and if not, end early for that word. Keep only those rows where the counter reaches the length of the word.

WITH
    translated(word, hand)
    AS
        (SELECT word,
                TRANSLATE(word,
                          'qwertasdfgzxcvb yuiophjklnm', 
                          'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
           FROM words),
    x(word, hand, n)
    AS
        (SELECT word, hand, 1
           FROM translated
          WHERE hand LIKE '%L'
         UNION ALL
         SELECT word, hand, n + 1
           FROM x
          WHERE SUBSTR(hand, n, 2) IN ('RL', 'LR'))
SELECT word
  FROM x
 WHERE n = LENGTH(word);

The syntax is slightly more cumbersome, but I think it’s an easier read than the CONNECT BY version for those new to recursive queries. It also doesn’t require any special tricks like SYS_GUID() to properly connect the children of each word.

And finally, while I have enjoyed playing around with syntax to find different methods to solve the same original question, all of them are unnecessarily complex. The two answers above provide a hint as to the direction I would go if I had to implement this in production instead of just for academic curiosity.

First, simply translate the letters to hands and then check whether the resulting “word” has the form LRLR…RL or RLRL…RL. That is, does it consist of alternating RL pairs (so it ends with an L), possibly preceded by a single leading L?

So, a single call to translate and a regular expression filter provides the answer, both simply and efficiently.

SELECT word
  FROM (SELECT word, TRANSLATE(word, 
                               'qwertasdfgzxcvb yuiophjklnm',
                               'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
          FROM words)
 WHERE REGEXP_LIKE(hand, '^L?(RL)*$');
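As a sanity check on the pattern, the same translate-then-match idea can be expressed in a few lines of Python (a hypothetical sketch, not part of the original post), using str.translate and re in place of TRANSLATE and REGEXP_LIKE:

```python
import re

# Same idea as the SQL: translate letters to hands, then test the
# shape of the resulting string against ^L?(RL)*$.
HANDS = str.maketrans('qwertasdfgzxcvb' + 'yuiophjklnm',
                      'L' * 15 + 'R' * 11)

def alternating(word):
    return re.fullmatch('L?(RL)*', word.translate(HANDS)) is not None

words = ['aid', 'chair', 'baklava', 'purchase']
print([w for w in words if alternating(w)])  # ['aid', 'chair']
```

re.fullmatch anchors both ends, so the '^' and '$' from the SQL pattern are implicit.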

If you have an older version of the database that doesn’t support regular expressions, or if you want to port this puzzle and answer to non-Oracle systems, then you can simply construct the alternating string with padding functions. Remember to check both left- and right-hand starting letters.

This is slightly more complex than using REGEXP_LIKE; but this version should be supported back to at least version 7.3.

SELECT word
  FROM (SELECT word,
               TRANSLATE(word, 'qwertasdfgzxcvb yuiophjklnm', 'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
          FROM words)
 WHERE     hand LIKE '%L'
       AND hand IN ('L' || RPAD('RL', LENGTH(word) - 1, 'RL'), RPAD('RL', LENGTH(word), 'RL'));
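The padding logic can be spot-checked the same way (again a hypothetical Python sketch): for a hand-string of a given length, the only acceptable values are 'RLRL…' truncated to that length, or 'L' followed by that pattern, and either way it must end with 'L'.

```python
# Sketch of the RPAD logic: build the two legal alternating strings for
# the string's length and compare against them.
def acceptable(hand):
    n = len(hand)
    rl = ('RL' * n)[:n]          # mirrors RPAD('RL', n, 'RL')
    lrl = ('L' + 'RL' * n)[:n]   # mirrors 'L' || RPAD('RL', n - 1, 'RL')
    return hand.endswith('L') and hand in (rl, lrl)

hands = ['LRL', 'LRLRL', 'LLRRLLL', 'RL']
print([h for h in hands if acceptable(h)])  # ['LRL', 'LRLRL', 'RL']
```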

I hope you enjoyed this little exercise as much as I have. Thank you for reading!

SQL for the fun of it: finding typing patterns in words (Part 2 of 3: using Model Clause)


Part 1 Part 3

In my last post I solved the task of finding words that can be typed with alternating hands by using collections, MATCH_RECOGNIZE, and analytics. I closed with the comment that I was confident MODEL could be used to solve it too, but at the time I wasn’t sure how.

I have a hard time letting a good puzzle go, so today I decided to confront my limits and learn to use MODEL a bit better.
For my first pass, I started with the same idea as I used before. That is, generate rows of letters for each word in order.

SELECT word,
       LENGTH(word) len,
       COLUMN_VALUE n,
       SUBSTR(word, COLUMN_VALUE, 1) c
  FROM words,
       TABLE(    SELECT COLLECT(LEVEL)
                   FROM DUAL
             CONNECT BY LEVEL <= LENGTH(word))

In my MODEL clause I create a couple of dummy measures: HAND and ALTERNATING. Then I create rules such that HAND is defined to be left or right (L or R) depending on which set of keys is used, with the same vctab collection construction as in the previous solutions. That part is fairly straightforward. ALTERNATING is a little trickier: I define the rule such that the first letter is considered alternating (the first keystroke is always fine.) Then, for each letter after that, I check whether the previous letter was considered alternating and, if so, whether the current letter is on the opposite hand from the previous one. Any letter that repeats a hand therefore causes itself and all subsequent letters to be flagged as non-alternating; so if the last letter is still flagged alternating, every letter must have alternated.
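The two rules can be sketched procedurally before reading the MODEL syntax. This is a hypothetical Python rendering of the rule logic, not the actual MODEL evaluation (and it assumes every letter belongs to one of the two key sets):

```python
# Mimic the two MODEL rules: first assign a hand per letter, then flag
# each letter 'yes' only if the previous letter was 'yes' AND the hand
# actually changed; the first letter is always 'yes'.
LEFT = set('qwertasdfgzxcvb')

def flags(word):
    hand = ['L' if c in LEFT else 'R' for c in word]
    alternating = []
    for i in range(len(word)):
        if i == 0:
            alternating.append('yes')
        elif alternating[i - 1] == 'yes' and hand[i] != hand[i - 1]:
            alternating.append('yes')
        else:
            alternating.append('no')
    return hand, alternating

# 'fidelity' breaks at the double left-hand 'd','e' pair, and every
# letter after the break stays 'no':
print(flags('fidelity')[1])
# ['yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no']
```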

My query at this point looked like this:

SELECT *
  FROM (SELECT word,
               LENGTH(word) len,
               COLUMN_VALUE n,
               SUBSTR(word, COLUMN_VALUE, 1) c
          FROM words,
               TABLE(    SELECT COLLECT(LEVEL)
                           FROM DUAL
                     CONNECT BY LEVEL <= LENGTH(word)))
MODEL
    PARTITION BY (word)
    DIMENSION BY(n)
    MEASURES(c,
             len,
             '-' hand,
             '---' alternating)
    RULES UPDATE SEQUENTIAL ORDER
    (
        hand [ANY] = CASE
                         WHEN c[CV(n)]
                                   MEMBER OF vctab('q','w','e','r','t','a','s','d','f','g','z','x','c','v','b')
                         THEN
                             'L'
                         WHEN c[CV(n)]
                                   MEMBER OF vctab('y','u','i','o','p','h','j','k','l','n','m')
                         THEN
                             'R'
                     END,
        alternating [ANY]
        ORDER BY n =
            CASE
                WHEN CV(n) = 1
                THEN
                    'yes'
                WHEN alternating[CV(n) - 1] = 'yes'
                 AND (hand[CV(n)], hand[CV(n) - 1]) IN (('L', 'R'), ('R', 'L'))
                THEN
                    'yes'
                ELSE
                    'no'
            END)

Producing results as follows:

WORD        N C    LEN HAND ALTERNATING
--------- --- -- ----- ---- -----------
chair       1 c      5 L    yes        
chair       2 h      5 R    yes        
chair       3 a      5 L    yes        
chair       4 i      5 R    yes        
chair       5 r      5 L    yes 
       
iambic      1 i      6 R    yes        
iambic      2 a      6 L    yes        
iambic      3 m      6 R    yes        
iambic      4 b      6 L    yes        
iambic      5 i      6 R    yes        
iambic      6 c      6 L    yes

fidelity    1 f      8 L    yes        
fidelity    2 i      8 R    yes        
fidelity    3 d      8 L    yes        
fidelity    4 e      8 L    no         
fidelity    5 l      8 R    no         
fidelity    6 i      8 R    no         
fidelity    7 t      8 L    no         
fidelity    8 y      8 R    no

Finally, wrap the entire query as an inline view and filter the results by applying the last condition: the words must end with the left hand.

SELECT word
  FROM (SELECT *
          FROM (SELECT word,
                       LENGTH(word) len,
                       COLUMN_VALUE n,
                       SUBSTR(word, COLUMN_VALUE, 1) c
                  FROM words,
                       TABLE(    SELECT COLLECT(LEVEL)
                                   FROM DUAL
                             CONNECT BY LEVEL <= LENGTH(word)))
        MODEL
            PARTITION BY (word)
            DIMENSION BY(n)
            MEASURES(c,
                     len,
                     '-' hand,
                     '---' alternating)
            RULES UPDATE SEQUENTIAL ORDER
            (
                hand [ANY] = CASE
                                 WHEN c[CV(n)]
                                           MEMBER OF vctab('q','w','e','r','t','a','s','d','f','g','z','x','c','v','b')
                                 THEN
                                     'L'
                                 WHEN c[CV(n)]
                                           MEMBER OF vctab('y','u','i','o','p','h','j','k','l','n','m')
                                 THEN
                                     'R'
                             END,
                alternating [ANY]
                ORDER BY n =
                    CASE
                        WHEN CV(n) = 1
                        THEN
                            'yes'
                        WHEN alternating[CV(n) - 1] = 'yes'
                         AND (hand[CV(n)], hand[CV(n) - 1]) IN (('L', 'R'), ('R', 'L'))
                        THEN
                            'yes'
                        ELSE
                            'no'
                    END))
 WHERE n = len AND hand = 'L' AND alternating = 'yes';

That worked pretty well, but I didn’t think I was fully utilizing the MODEL syntax. Most significantly, I was using CONNECT BY to generate the character rows, and that’s something MODEL could do for me. So I took another stab at it, using a FOR expression to count from 1 to the word’s length. Here I ran into some syntax quirks. The word column, as a partitioning value, is an invalid identifier within the rules clause, so I had to create a measure (w) from the partitioning word. Also, since the length measure is referenced in the outer query, its value needs to be set for every generated row (len [n] = len[1].) The hand and alternating measures I set with the same logic as shown above, and then again use the same filter in the outer WHERE clause.

SELECT word
  FROM (SELECT *
          FROM words
        MODEL
            PARTITION BY (word)
            DIMENSION BY(1 n)
            MEASURES(word w,
                     '-' c,
                     LENGTH(word) len,
                     '-' hand,
                     '---' alternating)
            RULES
            (
                c [FOR n FROM 1 TO len[1] INCREMENT 1] = SUBSTR(w[1], CV(n), 1),
                len [n] = len[1],
                hand [ANY] = CASE
                                 WHEN c[CV(n)]
                                           MEMBER OF vctab('q','w','e','r','t','a','s','d','f','g','z','x','c','v','b')
                                 THEN
                                     'L'
                                 WHEN c[CV(n)]
                                           MEMBER OF vctab('y','u','i','o','p','h','j','k','l','n','m')
                                 THEN
                                     'R'
                             END,
                alternating [ANY]
                ORDER BY n =
                    CASE
                        WHEN CV(n) = 1
                        THEN
                            'yes'
                        WHEN alternating[CV(n) - 1] = 'yes'
                         AND (hand[CV(n)], hand[CV(n) - 1]) IN (('L', 'R'), ('R', 'L'))
                        THEN
                            'yes'
                        ELSE
                            'no'
                    END))
 WHERE n = len AND hand = 'L' AND alternating = 'yes';

I like this one better because it seems truer to the spirit of the task of using MODEL.

I’m always open for comments, but on this one in particular I welcome suggestions and variations from others on better or simply alternate implementations using MODEL.
