Fast * Many = Slow, or, the Problem of Fast Code


Several years ago I made a little math joke about a process that performed millions of look-ups and took forever to complete. That joke, just a short one-liner, "Fast Times Many Equals Slow," has since become a bit of a mantra.

The problem can be particularly pernicious because the problematic code might be well designed, extensively tested, and extremely fast. All three qualities are desirable and, in some cases, indisputable for a given routine. The performance problem arises when that fast code is executed repeatedly, and it can sneak up on you. Maybe you write a small routine that executes in 14 microseconds and then you invoke it a thousand times. That means your code contributes 0.014 total seconds to your process. Without any other context, that sounds pretty good to me and would often be considered acceptable, maybe even excellent.

However, if the process is ramped up to a billion iterations, now that fast routine adds almost 4 hours to your processing time.

0.000014 * 1000000000 = 14000 seconds = 3:53:20

If 4 hours (plus whatever time it takes to perform the rest of the process) is considered slow… is the problem that your routine is slow or is the problem that you ran it many times?

While I’ve seen this problem repeatedly over the years, it’s been particularly prevalent lately across a variety of applications from multiple sources (i.e. it wasn’t just one developer implementing the same design over and over). I’m sure it’s merely coincidence that so many instances were revealed in a short period of time; but that doesn’t negate the underlying cause.

Iteration is a common programming technique and is often necessary. Sometimes you can mask it by hiding behind an API that performs the iteration for you, but doing so doesn’t remove the problem, it simply shoves it somewhere else. In some cases though, pushing the work somewhere else might be the right solution. There is nothing inherently wrong with iteration itself; the problem is the choice of what is included within each cycle. A few weeks ago I wrote about the importance of timing instrumentation and how I used it to solve a performance problem.

The problem discussed in that article was caused by misplaced iteration. Program A would read information from an external source and pass it to Program B to parse and process the individual records retrieved from the remote service. The iteration issue occurred because A would only feed the data to B in small increments, so B still needed to iterate over each record. That part was unavoidable (at some point “somebody” needed to act on each record); the problem was that B already had iteration built in and A simply added extra work by pre-iterating. Not only that, each cycle carried additional overhead: a new database call, additional network traffic, and additional response processing. Internal caching benefits within B were reduced because it was forced to start over with each new invocation.

It might have been possible to make the iteration in A and/or B a little faster, or to rewrite B to support multi-call caching when processing within a larger set, but the simplest and most beneficial change was to simply pass all of the data in one call from A to B. This completely eliminated the overhead of making distinct calls. The internal work of B didn’t change at all; it iterated the same as it always did. Admittedly, its internal caching did become more effective, but the reduction of the back-and-forth communication with A was the biggest savings. The entire process improved by about a factor of 10.

It’s important to note this was not a scaling factor in the billions as in my earlier example. While such volumes are not uncommon, many processes have no need to cycle that many times. In this example the code was only iterating over tens of thousands of rows, a magnitude many people and processes would consider “normal.” It’s also important to note that no one cycle was slow. Program A was able to parse and split the data quickly; B was able to parse and process the records quickly.

As noted above, this is why iteration problems can be a subtle threat. Everything passed unit testing and seemed fast, because it was. It was only in the larger scale testing with full external volumes that the problem revealed itself.

So, what is one to do about it?

  1. Check to see if you are iterating needlessly. As seen in my parsing example, there was no need to add an extra layer of iteration. In the context of databases, are you iterating row by row when you could be performing set-based operations? The database engine may still need to iterate across a data set; but it’s highly optimized for doing so. Adding row-by-row processing in your client is just another instance of layering iteration on top of iteration already in place. This is sometimes the hardest path to implement because it’s essentially a redesign; but it can be the most effective because it completely eliminates some steps (see the sketch after this list).
  2. Upon what are you iterating? If you are processing a year’s worth of data and loop through each month, do you perform duplicate work in each cycle? If so, can you move some of that work into a step of its own and only do it once? This is a good place to check for unnecessary “fast” steps such as key-value look-ups. Maybe for a given customer id you look up the customer name. It may be very fast with a nice, indexed, unique-key retrieval; but if you do it for every order, every day, of every month, it adds up, especially when you do all of that again for the next customer, and the next, and so on. If you must iterate, can you sort the data before starting to remove the need to process identical data repeatedly? If you sort orders by customer, you can look up the customer information once on the first order for that customer and reuse it until all of that customer’s data is exhausted.
  3. Is each cycle as fast as it can be? This may seem obvious; but it’s still worth a look. The same multiplicative factor making iteration painful can be used to your advantage as well. I worked with another DBA on a process iterating over hundreds of millions of rows. While it would have been nice to ask the vendor to redesign their product, that wasn’t really feasible as an immediate solution. Instead we shaved a few milliseconds off each iteration. One millisecond saved across one million executions is over 16 minutes removed from the total processing time. Multiply that by hundreds and we get DAYS of processing time improvement. Again, this was still not ideal, but it was appreciated by the end users waiting on the data.
  4. Are you including extra overhead in your iterations? Can you consolidate steps? This check is a special-case combination of the previous two. As an example, consider a process that reads a web service, then passes the results to another process, and then takes the results of that and passes them to a third process to insert or update some repository – a fairly straightforward ETL process. If you watch the data in an ETL flow, do you “E”xtract one row, “T”ransform that row, and then “L”oad that row? Might it be faster to Extract all of the rows, or a large volume of them, then pass those to a transform process, which then passes them in bulk to a loader? You can eliminate the overhead of invoking each step. This overhead could take many forms, including SQL vs PL/SQL context switches, network latency, file open/close operations, and more. Maybe instead of processing 1000 rows individually, you can process all 1000 together in each step, or if that’s too expensive then maybe 10 batches of 100, 4 batches of 250, 20 batches of 50… whatever the hardware can handle.
  5. Last option – can you parallelize? If individual iteration is required, can you do 2 or more at the same time? Parallelizing is the most resource-intensive plan of attack but that doesn’t necessarily mean it’s inappropriate. If you can isolate data into independent chunks and have the hardware available to process them, then why not? The most appealing part of parallel processing is that it can be combined with any or all of the other solutions. The most important thing to remember when implementing parallelization is that it doesn’t actually fix anything. If your process is flawed, it might be possible to throw enough hardware (money) at the problem to reduce the wall-clock time to something acceptable, but doing so means your scaling is determined entirely by your wallet and it’s usually not going to be linear. That is, doubling the hardware doesn’t mean you’ll double the processing speed. There are problem sets where parallelizing adds benefits beyond reducing processing time. If you have independent data, such as regional sales, it makes sense to process the East region separately from the West region. Not only can you do both sets of work at the same time, but you also get a bonus of resiliency: a failure in the East doesn’t necessarily impact the West processing.
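
To illustrate points 1 and 2, here is a minimal sketch contrasting the two approaches. The table, column, and function names (orders, customers, order_report, get_customer_name) are made up for illustration; the point is that the “fast” per-row lookup disappears entirely when the database is allowed to do the iteration in a single set-based statement.

-- Row-by-row: a fast, indexed customer-name lookup is repeated once per order.
BEGIN
    FOR r IN (SELECT customer_id, order_total FROM orders)
    LOOP
        INSERT INTO order_report (customer_id, customer_name, order_total)
             VALUES (r.customer_id, get_customer_name(r.customer_id), r.order_total);
    END LOOP;

    COMMIT;
END;

-- Set-based: one statement; the engine iterates internally and the lookup
-- becomes a join instead of a call per row.
INSERT INTO order_report (customer_id, customer_name, order_total)
    SELECT o.customer_id, c.customer_name, o.order_total
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id;

COMMIT;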

It might be more accurate to say “Fast * Many = A lot of time” but “Fast * Many = Slow” is a better mnemonic. Plus, the phrase is used when there is a performance issue in some looping process, which is most commonly described as the process being “slow.” My point was not to suggest all iteration is bad; but rather to consider the larger scale impacts when it occurs and what, if anything, you can or should do about it.

This has been a bit of a rant; but hopefully I was able to include some useful information along the way.


Timing is Everything, but Recording it is Even Better


I got back from speaking at Collaborate 18 a week ago. My topic was performance diagnostics.
The recurring theme of my talk is for developers to leave themselves some bread crumbs to mark their trail.

The Oracle database has wonderful diagnostic tools like ASH/AWR, SQL Monitor, tracing, and the hierarchical profiler; but as good as these tools are, the best way of measuring performance issues in your code should be your own logs. In particular, logs with timestamps.

It always blows my mind when a developer asks me to do initial triage on a system because they don’t know where the long running processes are. Why don’t you have logs that capture the begin and end points of important processes? Even if you don’t have a reason to suspect a performance issue, including a simple “Starting ABC”/”Ending ABC” set of bookends provides scope to other debug or error messages. Later, if you do have a reason to check performance, having the start and end points of your process logged gives you a means of checking how long it ran.

Coincidentally, the day I got back to the office after the conference, I was contacted about a process running inordinately slowly. Of course I asked where the slow part was; they couldn’t really say, but they took a guess at one particular procedure. As it happens, about 10 years ago I wrote a lot of the code behind that procedure. Naturally, I was interested to see if I had learned anything in the past 10 years that I could use to make it faster; but first, I simply wrapped the procedure with the app team’s own logging framework and asked them to run the application.

What we found was the old procedure was actually running quite fast, averaging about 0.13 seconds per run with a worst case of about 0.82 seconds. The problem was it was being called repeatedly inside an expensive loop with each iteration feeding the procedure a small amount of data to process. The fix was to remove the iterative process and allow the old procedure to run on larger data sets. The total execution time dropped from minutes to seconds.

While I had full DBA access to this system, I didn’t use an AWR report, 10046 tracing, SQL Monitor, a PL/SQL profiler, ASH data, or any sort of “Top SQL” scripts and tools to determine the problem. I didn’t need an extensive instrumentation system for this; all I needed was a set of simple timestamped begin and end messages to tell me where the application was spending its time. Of course I could add even more logging to get more granular timings; but the point is, it’s often not necessary. If you can track the scope of work progression, you can determine, without extended privileges to administrative views or tools, where your application is spending its time.

You can then talk to your DBA about “why” a particular SQL or procedure is slow, and determine what to do about it. I’m not going to recommend any particular instrumentation framework or tool because the specific choice isn’t as important as the methodology. It’s quite simple: create a procedure to write messages with timestamps to a table, preferably within an autonomous transaction. Then, adding a mere 3 lines of code to each procedure or function will be sufficient for tracking most performance-related issues.

The most basic logging framework could be nothing more than this:

CREATE TABLE app_log
(
    logtime       TIMESTAMP WITH TIME ZONE,
    logmessage    VARCHAR2(4000)
);

CREATE OR REPLACE PROCEDURE write_log(p_message IN VARCHAR2)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    INSERT INTO app_log(logtime, logmessage)
         VALUES (SYSTIMESTAMP, p_message);

    COMMIT;
END;

Usage of the above could be just begin and end messages. The end will have 2 or more messages: one for success and one for each failure path. If you have multiple exception clauses, then each should have its own end-message.

BEGIN
    write_log('Begin process ....');
    --- Do Process ....
    write_log('End process ....');
EXCEPTION
    WHEN OTHERS THEN
        write_log('End process .... with error');
        --- Handle or reraise exception
END;
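
Once those bookend messages exist, turning them into elapsed times is a simple query. Here is a minimal sketch, assuming each process writes exactly one matching “Begin …”/“End …” pair with the same message text after the keyword:

SELECT b.logmessage          AS process,
       b.logtime             AS started,
       e.logtime             AS ended,
       e.logtime - b.logtime AS elapsed
  FROM app_log b
  JOIN app_log e
    ON e.logmessage = REPLACE(b.logmessage, 'Begin', 'End')
 WHERE b.logmessage LIKE 'Begin%'
 ORDER BY b.logtime;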

You can, of course, expand on the above and would most likely want to include other information as well to help properly scope execution. For example, you might want to include username, machine, session id, or instance number. You can also include intermediate step messages, which can be used to produce more granular performance diagnostic information. If you already have a logging/debug message framework, that’s great; simply extend the usage to include begin/end messages and you’ve turned the debugging into performance instrumentation.
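
A sketch of that kind of extension is below. The extra columns and their sizes are just my own choices; the USERENV attributes are standard SYS_CONTEXT parameters.

ALTER TABLE app_log ADD (username VARCHAR2(128), machine VARCHAR2(64), sid NUMBER);

CREATE OR REPLACE PROCEDURE write_log(p_message IN VARCHAR2)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    INSERT INTO app_log(logtime, logmessage, username, machine, sid)
         VALUES (SYSTIMESTAMP,
                 p_message,
                 SYS_CONTEXT('USERENV', 'SESSION_USER'),
                 SYS_CONTEXT('USERENV', 'HOST'),
                 TO_NUMBER(SYS_CONTEXT('USERENV', 'SID')));

    COMMIT;
END;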

It’s such a small amount of work for immense potential benefit. Unfortunately, performance instrumentation is often underutilized.
Also note, these logs can be defensive in nature. If your application is just one cog in a larger machine, perhaps a web service or a process inside a larger ETL system, having logs of when your portion of the work began and when it ended can help resolve finger-pointing as to where a problem lies.

A reliable set of performance metrics also makes for easier audits of SLA/QoS contracts or can be used internally for alerts when an SLA threshold is near breaching.

Also, as mentioned earlier, these logs don’t need extended privileges, so it is easier to perform investigation on systems where you may have less access than others.

I hope this helps inspire you to track your own application’s execution with some sort of logging framework. Which one isn’t as important as consistently using it.

18c JSON TO_UTC_TIMESTAMP_TZ bug?


I tried the following query on 18c via Oracle’s LiveSQL site and the results don’t correspond to the documented functionality; or, at least, not my reading of the functionality.

select to_utc_timestamp_tz('2018-04-27') t from dual
union all
select to_utc_timestamp_tz('2018-04-27T17:40:25+00:00') from dual
union all
select to_utc_timestamp_tz('2018-04-27T17:40:25Z') from dual
union all
select to_utc_timestamp_tz('20180427T174025Z') from dual
union all
select to_utc_timestamp_tz('2018-04-27T17:40:25+05:00') from dual --- This one is not correct per documented functionality

producing the following results

T
27-APR-18 12.00.00.000000 AM +00:00
27-APR-18 05.40.25.000000 PM +00:00
27-APR-18 05.40.25.000000 PM +00:00
27-APR-18 05.40.25.000000 PM +00:00
27-APR-18 05.40.25.000000 PM +05:00

Note the first four rows all return the UTC (+00:00) time zone for the corresponding input value. This is to be expected based not only on the name of the function but also on the documentation.
The fifth row, though, preserves the +05:00 offset. This seems contrary to the documentation:

https://docs.oracle.com/en/database/oracle/oracle-database/18/adjsn/changes.html#GUID-7F43AE55-4176-477E-8D3B-021ED838106D
or
MOS Doc ID 2357879.1

“SQL function to_UTC_timestamp_tz takes as input an ISO 8601 date format string and returns an instance of SQL data type TIMESTAMP WITH TIMEZONE. It normalizes the input to UTC time (Coordinated Universal Time, formerly Greenwich Mean Time).”

By my reading, I would expect any successful transformation to return a timestamp with time zone value in the UTC (+00:00) time zone; that is, the fifth row should have come back as 27-APR-18 12.40.25.000000 PM +00:00. So is this a misreading on my part, a functionality bug, or a documentation bug? Interestingly, the function is not included in the SQL Language Reference, only in the JSON Developer’s Guide, which isn’t a bug per se but seems like a documentation oversight.
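
If the goal is simply to get UTC regardless of whether this turns out to be a bug or a documentation issue, one workaround (just a sketch, not an official fix) is to normalize the result explicitly:

select (to_utc_timestamp_tz('2018-04-27T17:40:25+05:00') at time zone 'UTC') t from dual;

-- or, if the time zone component isn't needed at all
select sys_extract_utc(to_utc_timestamp_tz('2018-04-27T17:40:25+05:00')) t from dual;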

Performance Diagnostics and Tuning With and Without Your DBA


I presented this topic at the Collaborate 2018 conference.

My slides can be found here.

A couple of the slides included simplified queries so the text would fit on a single screen and remain legible.

To find who is using temp, a more functionally robust query can be found here:
https://seanstuber.wordpress.com/2018/03/08/who-is-using-temp-and-what-are-they-doing/

Similarly, the who is blocking whom query has a fuller version here:
https://seanstuber.wordpress.com/2018/03/04/who-is-blocking-whom-and-what-are-they-doing/

Thanks to all who attended and asked questions and/or commented.

Reading a PDF blob with PL/SQL and SQL


This is another article inspired by a recent request. That is, given a PDF file with tabular data, could I extract the data into a more human-usable format?
Yes, it is possible but it requires multiple steps and does come with some caveats.

The first step is converting the PDF into a format that is more easily processed with SQL and PL/SQL. To that end, Oracle Text includes functionality to parse a PDF BLOB into an XHTML CLOB.

With a BLOB containing a PDF whose contents look like this (an image of a sample table with Item, Qty, Price, and Total columns):

We can convert that to text of this form (abbreviated here to show just the overall structure):

<html>
 <head>
  ...
  <style type="text/css">
   div.c {
    white-space:pre;
    position:absolute;
   }
  </style>
 </head>
 <body>
  ...
 </body>
</html>

XHTML can’t be parsed directly using native functions; but simply removing the !DOCTYPE declaration is sufficient to turn the contents into usable XML.
Within the BODY of the document, the PDF text will be stored in DIV tags such as these (the top and left pixel values shown here are only illustrative of the general shape):

<div class="c" style="top:141;left:70">Pocket</div>
<div class="c" style="top:141;left:195">3</div>
<div class="c" style="top:141;left:327">$1.68</div>
<div class="c" style="top:141;left:434">$5.04</div>
<div class="c" style="top:163;left:70">Wocket</div>
<div class="c" style="top:163;left:195">5</div>
<div class="c" style="top:163;left:327">$2.00</div>
<div class="c" style="top:163;left:434">$10.00</div>
<div class="c" style="top:185;left:70">Widget</div>
<div class="c" style="top:185;left:188">13</div>
<div class="c" style="top:185;left:320">$2,342.23</div>
<div class="c" style="top:185;left:427">$30,448.99</div>

However, there is no concept of a “table” associated with these nodes. To create rows and columns, the style attribute must be parsed based on the top and left pixel positions. Data that is in the same “row” will usually be written to the same top location. Columns are a little more difficult because numeric data will typically be right-justified; but the text is stored within the PDF based on its left-most position. To construct more traditional columns you must count values within a row. The 1st position within a row, regardless of the pixel location, is the 1st column, the 2nd position is the 2nd column, and so on.

Thus, in the example above, column one is “Pocket”, “Wocket”, and “Widget” all at position 70. Column two is the number 3 at position 195, number 5 also at position 195 and then number 13 at position 188. The same process is repeated for the Price and Total columns. Pivoting by the column counters can turn the sequential DIV values into traditional rows and columns, producing results like this:

A                    B                    C                    D                   
-------------------- -------------------- -------------------- --------------------
Sample Table                                                                       
Item                 Qty                  Price                Total               
Pocket               3                    $1.68                $5.04               
Wocket               5                    $2.00                $10.00              
Widget               13                   $2,342.23            $30,448.99                                                                                     

Note, the table header became left-justified as a result of the parsing technique because there was only one “column” in that row. Similar issues will result if there are blank (null) values within a table, since there will be no placeholder to count. So, this technique is not guaranteed to produce immediately useful results; the output may require manual adjustment if data is out of sequence.

I have loaded a few blobs into a table, simply named “my_blobs”, and extract the contents with queries of this form, extending the PIVOT clause to add columns as needed. The PDF referenced below is the same sample table shown above.

  SELECT a,
         b,
         c,
         d
    FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val
            FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
                         TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
                         val
                    FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                                  PASSING (SELECT pdf_to_xml(content)
                                             FROM my_blobs
                                            WHERE name = 'SampleTable.pdf')
                                  COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x))
         PIVOT (MAX(val) FOR coln IN (1 a, 2 b, 3 c, 4 d))
ORDER BY rown;

The same query logic works with larger PDFs too. Here’s one from the publicly available Statistical Annex of the European Economy.

SELECT a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
    FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val
            FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
                         TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
                         TRIM(val) val
                    FROM XMLTABLE(
                             xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                             PASSING (SELECT pdf_to_xml(content)
                                        FROM my_blobs
                                       WHERE name = 'https://ec.europa.eu/info/sites/info/files/saee_autumn_2017_en.pdf')
                             COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(10) PATH 'text()') x))
         PIVOT (MAX(val)
               FOR coln
               IN (1 a,2 b,3 c,4 d,5 e,6 f,7 g,8 h,9 i,10 j,11 k,12 l,13 m,14 n,15 o,16 p,17 q,18 r,19 s,20 t,21 u,22 v,23 w,24 x,25 y,26 z))
ORDER BY rown

Scrolling through the output, we can pick out individual tables such as this one listing Population and Labor market numbers for various countries by year. Note this also provides an example of the problem with missing values in a table: 1991 has a summary row in addition to a row of the individual country values, and the values on the right side of that summary row are parsed into columns on the left by the column counting.

1960       9153       55433      1221       2835       8332       30263      46636      50200      573        2121       2779       332        333        11487      7047       8858       1585       4068       4430       247685     235006                                                
1965       9464       58619      1305       2877       8550       31962      49825      52112      591        2266       2971       350        323        12295      7271       8999       1649       4371       4564       260363     246887                                                
1970       9656       60651      1371       2951       8793       33761      51939      53822      614        2359       3140       358        327        13032      7467       8680       1725       4538       4606       269790     255715                                                
1975       9801       61829      1441       3178       9047       35758      53926      55441      502        2456       3302       365        309        13660      7579       9093       1794       4739       4711       278930     264388                                                
1980       9859       61566      1489       3402       9643       37491      55150      56434      509        2512       3413       364        321        14148      7549       9766       1901       4980       4780       285277     270152                                                
1981       9858       61682      1500       3444       9729       37759      55462      56502      515        2519       3433       365        324        14247      7569       9851       1907       5016       4800       286481     271268                                                
1982       9856       61638      1510       3481       9790       37986      55797      56544      521        2531       3457       366        331        14312      7574       9912       1910       5055       4827       287397     272081                                                
1983       9854       61423      1521       3506       9847       38172      56096      56564      528        2546       3485       366        335        14368      7562       9958       1922       5092       4856       288000     272571                                                
1984       9855       61175      1531       3530       9896       38330      56369      56577      535        2562       3514       366        335        14423      7561       9996       1932       5127       4882       288497     272961                                                
1985       9857       61024      1541       3541       9934       38470      56649      56593      542        2579       3545       367        341        14488      7565       10024      1942       5162       4902       289064     273413                                                
1986       9859       61066      1552       3542       9967       38585      56937      56596      548        2600       3579       368        347        14567      7570       10033      1966       5194       4918       289794     274008                                                
1987       9870       61077      1565       3543       10001      38685      57244      56602      554        2627       3616       371        350        14664      7575       10030      1990       5223       4932       290516     274593                                                
1988       9904       61449      1574       3531       10037      38767      57572      56629      560        2653       3655       374        353        14760      7585       10020      1995       5251       4947       291615     275574                                                
1989       9940       62063      1581       3510       10089      38828      57912      56672      568        2667       3684       378        356        14846      7620       10005      1996       5276       4964       292954     276826                                                
1990       9968       63253      1582       3506       10197      38867      58227      56719      580        2663       3698       382        360        14947      7678       9983       1998       5299       4986       294894     278714                                                
1991       640                   280354                                                                                                                                                                                                                                                      
1991       10006      79973      1574       3526       10320      38966      58520      56759      595        2651       3704       387        364        15068      7755       9960       1999       5303       5014       312444     296253                                                
1992       10047      80500      1545       3549       10399      39158      58811      56797      611        2614       3700       392        367        15182      7841       9952       1996       5305       5042       313811     297671                                                
1993       10086      80946      1506       3563       10460      39361      59066      56832      626        2563       3683       398        371        15290      7906       9965       1992       5325       5066       315006     298940                                                
1994       10116      81147      1474       3571       10513      39549      59286      56843      639        2521       3657       404        375        15381      7936       9992       1989       5347       5088       315828     299826                                                
1995       10137      81308      1448       3601       10562      39719      59501      56844      651        2485       3629       410        378        15460      7948       10026      1989       5363       5108       316567     300624                                                
1996       10157      81466      1425       3626       10609      39884      59713      56860      661        2457       3602       414        380        15526      7959       10064      1990       5374       5125       317292     301403                                                
1997       10181      81510      1406       3661       10661      40050      59926      56890      671        2433       3575       420        383        15608      7968       10109      1986       5383       5140       317962     302124                                                
1998       10203      81446      1393       3714       10721      40214      60147      56907      679        2410       3549       425        385        15703      7977       10160      1982       5391       5154       318560     302770                                                
1999       10226      81422      1379       3755       10762      40370      60457      56916      687        2391       3524       431        387        15809      7992       10218      1984       5396       5166       319272     303524                                                
2000       10251      81457      1401       3804       10806      40554      60872      56942      694        2368       3500       437        390        15922      8012       10290      1989       5401       5176       320266     304523                                                
2001       10287      81517      1393       3864       10862      40766      61317      56980      702        2338       3471       442        393        16043      8042       10363      1992       5380       5188       321338     305670                                                
2002       10333      81578      1384       3932       10902      41424      61764      57100      710        2310       3443       447        396        16147      8082       10420      1995       5379       5201       322944     307328                                                
2003       10376      81549      1375       3997       10928      42196      62202      57413      718        2288       3415       452        399        16223      8118       10459      1996       5379       5213       324697     309126                                                
2004       10421      81456      1366       4067       10955      42859      62661      57845      728        2263       3377       459        401        16276      8169       10484      1997       5382       5228       326396     310880                                                
2005       10479      81337      1359       4160       10987      43663      63133      58191      739        2239       3323       466        404        16317      8225       10503      2001       5387       5246       328157     312707                                                
2006       10548      81173      1351       4270       11020      44361      63574      58428      751        2219       3270       473        405        16341      8268       10522      2008       5391       5266       329639     314245                                                
2007       10626      80992      1343       4400       11049      45236      63967      58787      767        2201       3231       481        407        16378      8295       10543      2019       5397       5289       331407     316042                                                
2008       10710      80764      1338       4496       11078      45983      64324      59242      787        2178       3198       489        409        16440      8322       10558      2022       5406       5313       333057     317719                                                
2009       10797      80483      1336       4539       11107      46368      64655      59578      808        2142       3163       498        412        16526      8341       10568      2042       5418       5339       334120     318800                                                
2010       10896      80284      1333       4560       11121      46562      64974      59830      829        2097       3097       508        414        16612      8361       10573      2049       5430       5363       334894     319643                                                
2011       10994      80275      1330       4577       11105      46736      65294      60060      851        2059       3028       519        416        16693      8389       10558      2053       5398       5388       335722     320588                                                
2012       11068      80426      1325       4590       11045      46766      65615      60339      864        2034       2988       532        419        16752      8426       10515      2057       5406       5414       336581     321488                                                
2013       11125      80646      1320       4602       10965      46593      65953      60646      862        2013       2958       545        423        16800      8477       10457      2060       5413       5439       337298     322249                                                
2014       11180      80983      1316       4615       10892      46455      66290      60789      853        1994       2932       558        427        16863      8544       10401      2062       5419       5463       338036     323033                                                
2015       11239      81687      1313       4642       10821      46410      66590      60731      848        1977       2905       569        432        16932      8630       10358      2063       5422       5481       339050     324089                                                
2016       11295      82491      1316       4684       10784      46450      66858      60628      852        1959       2868       584        438        17030      8740       10326      2065       5431       5495       340292     325364                                                
2017       11358      83171      1320       4734       10734      46491      67219      60587      852        1940       2825       598        442        17104      8810       10305      2068       5434       5516       341509     326629                                                
2018       11416      83575      1321       4782       10677      46529      67582      60578      857        1921       2785       613        445        17174      8871       10295      2070       5438       5537       342467     327630                                                
2019       11475      83784      1321       4826       10620      46565      67947      60559      862        1905       2749       627        449        17239      8925       10289      2072       5442       5558       343215     328415                                                

Another option: if you know the ranges of horizontal pixel locations that define the columns, you can use those ranges with a CASE expression instead of counting from left to right. While more complex, this would allow you to maintain correct positions even with missing values. The counting method allows the query to be more reusable across varied documents, but specifying ranges of left-values may produce more desirable results.
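
For the sample table, such a CASE expression might replace the DENSE_RANK step like this; the pixel boundaries (150, 280, 400) are purely illustrative and would need to come from the actual document:

SELECT rown,
       CASE
           WHEN leftpx < 150 THEN 1   -- Item
           WHEN leftpx < 280 THEN 2   -- Qty
           WHEN leftpx < 400 THEN 3   -- Price
           ELSE 4                     -- Total
       END coln,
       val
  FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
               TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) leftpx,
               val
          FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                        PASSING (SELECT pdf_to_xml(content)
                                   FROM my_blobs
                                  WHERE name = 'SampleTable.pdf')
                        COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x);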

In both of the queries above I use a function to convert the BLOB to an XMLTYPE. I’ve used two forms of the function. The first, and simpler, form uses the deprecated IFILTER procedure.

CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB)
    RETURN XMLTYPE
IS
    v_clob   CLOB;
    v_xml    XMLTYPE;
BEGIN
    DBMS_LOB.createtemporary(v_clob, TRUE);
    DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite);
    ctx_doc.ifilter(p_pdf, v_clob);

    DBMS_LOB.close(v_clob);

    v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE[^>]+>'));

    DBMS_LOB.freetemporary(v_clob);
    RETURN v_xml;
END;

The second form requires the creation of a text policy and preference before creating the function. For this purpose, some of the simple defaults from the Oracle documentation are sufficient.

BEGIN
    ctx_ddl.create_preference(preference_name => 'fast_filter', object_name => 'AUTO_FILTER');
    ctx_ddl.set_attribute(preference_name   => 'fast_filter',
                          attribute_name    => 'OUTPUT_FORMATTING',
                          attribute_value   => 'FALSE');
    ctx_ddl.create_policy(policy_name => 'my_policy', filter => 'fast_filter');
END;

CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB)
    RETURN XMLTYPE
IS
    v_clob   CLOB;
    v_xml    XMLTYPE;
BEGIN
    DBMS_LOB.createtemporary(v_clob, TRUE);
    DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite);
    ctx_doc.policy_filter('my_policy',
                          p_pdf,
                          v_clob,
                          FALSE);

    DBMS_LOB.close(v_clob);

    v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE[^>]+>'));

    DBMS_LOB.freetemporary(v_clob);
    RETURN v_xml;
END;

If you use the second form and later decide the policy and preference are no longer needed you can drop them:

BEGIN
    ctx_ddl.drop_policy('my_policy');
    ctx_ddl.drop_preference('fast_filter');
END;

Hopefully the function and queries above will provide a solid framework for pursuing similar extractions within your own PDF files. One last caveat I’ve found is that rows are not always exactly aligned. Fortunately, the worst case I’ve encountered was some columns within a “row” being offset from the others by a single pixel. I was able to compensate by rounding the row value to the nearest 10 with ROUND(…,-1). For documents with a small font size or rows with tight spacing that may be insufficient, requiring more sophisticated grouping math. Also, I expect that offset of 1 is simply the case for some of the specific files I’ve encountered; other files may have more widely varied values, again requiring more complicated rounding or grouping rules to aggregate to consistent values.
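
In terms of the queries above, that compensation is just a matter of wrapping the row expression, for example:

ROUND(TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)), -1) rown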

NULL vs NOT NULL columns and Index usage


Recently I was asked to look at a problem where a simple query refused to use an index.

The table had several million rows in it with a single index on one column populated with 3 values (in my sample the values are A, B, and C).

CREATE TABLE null_test
(
    col1           NUMBER,
    col2           NUMBER,
    col3           NUMBER,
    col4           NUMBER,
    test_column    VARCHAR2(1),
    col6           NUMBER,
    col7           NUMBER,
    col8           NUMBER,
    col9           NUMBER,
    col10          NUMBER
);

INSERT INTO null_test
        SELECT LEVEL,
               LEVEL,
               LEVEL,
               LEVEL,
               CHR(ASCII('A') + MOD(LEVEL, 3)),
               LEVEL,
               LEVEL,
               LEVEL,
               LEVEL,
               LEVEL
          FROM DUAL
    CONNECT BY LEVEL <= 3000000;

create index null_test_index on null_test(test_column) ;

exec dbms_stats.gather_table_stats(user,'NULL_TEST');

When a query only references columns in an index then it seems logical the optimizer would decide to use the index only and not reference the table at all. However, as my friend discovered, the exact opposite happened.

SQL> explain plan for select test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Plan hash value: 3898976153

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3000K|  5859K|  6519   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| NULL_TEST |  3000K|  5859K|  6519   (1)| 00:00:01 |
-------------------------------------------------------------------------------

8 rows selected.

Then my friend followed up with another simple query specifying all of the values found in the table. This time, the index was used.

SQL> explain plan for select test_column from null_test where test_column in ('A','B','C');

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------

Plan hash value: 107206137

----------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  3000K|  5859K|  1415   (2)| 00:00:01 |
|*  1 |  INDEX FAST FULL SCAN| NULL_TEST_INDEX |  3000K|  5859K|  1415   (2)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("TEST_COLUMN"='A' OR "TEST_COLUMN"='B' OR "TEST_COLUMN"='C')

13 rows selected.

This is where I was contacted to determine why this seemingly odd behavior would occur.
The reason is that TEST_COLUMN is defined to allow nulls. It just so happens the table is fully populated, but Oracle can’t be sure of that at query time. A B-tree index doesn’t include entries where the indexed value is NULL, so if someone inserts a row with a null value, the index would miss it. Therefore, the optimizer can’t trust the index to return reliable results for “select test_column from null_test” because there could be one or more null values in the results, so it specifies a table scan as the reliable answer.

The second query imposes an implicit not-null constraint on the results. The only results allowed are those “in (‘A’,’B’,’C’)”, and a NULL will, obviously, not satisfy that condition. Therefore the optimizer knows it can trust the index as a reliable search mechanism.
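
As a side note, if nulls are legitimate in the column but a particular query has no need of them, stating the not-null condition explicitly gives the optimizer the same guarantee. A quick sketch of the idea, following the same reasoning as the IN list above:

-- the IS NOT NULL predicate excludes exactly the rows the index cannot see,
-- so an index-only plan is once again a valid option for this query
select test_column from null_test where test_column is not null;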

Even if you supply a hint to tell the optimizer you really want to use the index, the optimizer will be obliged to ignore that hint because using the index could produce invalid results.

SQL> explain plan for select /*+ INDEX(null_test null_test_index) */ test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------

Plan hash value: 3898976153

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3000K|  5859K|  6519   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| NULL_TEST |  3000K|  5859K|  6519   (1)| 00:00:01 |
-------------------------------------------------------------------------------

8 rows selected.

So, now the question becomes: “Are nulls really valid for the test column?” If they are, then a table scan is the only way to include the rows with NULL values along with the populated rows.
If the nulls are not valid, then the column should be defined to reflect that rule.

SQL> alter table null_test modify (test_column varchar2(1) not null);

Table altered.

SQL> explain plan for select test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------

Plan hash value: 107206137

----------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  3000K|  5859K|  1407   (1)| 00:00:01 |
|   1 |  INDEX FAST FULL SCAN| NULL_TEST_INDEX |  3000K|  5859K|  1407   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

8 rows selected.

Now, with the column defined as not null, the optimizer knows it can trust the index to contain all of the values in the table because the table can’t have any NULLs.

Many developers will think of column definitions merely in terms of data “correctness”; but it’s important to remember those same data quality rules are metadata for the optimizer as well. Thus they can and will influence the optimizer’s plans.

SQL for the fun of it: finding typing patterns in words (Part 3 of 3)


Part 1 Part 2

I’ve continued to enjoy this little puzzle and trying to find different ways of solving it. I’ve used the CONNECT BY clause to unpivot words into lists of letters but in this article I’ll show a more direct recursion on the word and then wrap up the series with the smallest and most efficient solution.

First, a recursive technique using CONNECT BY.
I’ll use TRANSLATE to map each letter to its corresponding hand and then recursively check whether each letter alternates hands from the previous one. On the first repeat of a hand, the recursion stops and the search ends for that word. Finding words that end with the left hand is achieved with a simple LIKE condition (WHERE hand LIKE ‘%L’).

    SELECT word,
           len,
           LEVEL lvl,
           hand
      FROM (SELECT word,
                   LENGTH(word) len,
                   TRANSLATE(word, 'qwertasdfgzxcvb yuiophjklnm', 'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
              FROM words)
     WHERE hand LIKE '%L'
CONNECT BY word = PRIOR word
       AND LEVEL <= len
       AND (LEVEL = 1 OR SUBSTR(hand, LEVEL - 1, 2) IN ('RL', 'LR'))
       AND PRIOR SYS_GUID() IS NOT NULL;

This parses words into a series like this:

WORD       LEN   LVL HAND     
--------- ---- ----- ---------
aid          3     1 LRL      
aid          3     2 LRL      
aid          3     3 LRL      
baklava      7     1 LLRRLLL  
chair        5     1 LRLRL    
chair        5     2 LRLRL    
chair        5     3 LRLRL    
chair        5     4 LRLRL    
chair        5     5 LRLRL    
purchase     8     1 RRLLRLLL

Finding words that alternate from the first to the last letter is then just a matter of pulling rows where the level reaches the length of the word. In the example above that would be aid and chair, dropping baklava and purchase.

SELECT word
  FROM (    SELECT word, len, LEVEL lvl
              FROM (SELECT word,
                           LENGTH(word)
                               len,
                           TRANSLATE(
                               word,
                               'qwertasdfgzxcvb yuiophjklnm',
                               'LLLLLLLLLLLLLLL RRRRRRRRRRR'
                           )
                               hand
                      FROM words)
             WHERE hand LIKE '%L'
        CONNECT BY     word = PRIOR word
                   AND LEVEL <= len
                   AND (LEVEL = 1 OR SUBSTR(hand, LEVEL - 1, 2) IN ('RL', 'LR'))
                   AND PRIOR SYS_GUID() IS NOT NULL)
 WHERE lvl = len;

That works pretty well. While CONNECT BY can be somewhat CPU and memory intensive, I do like that the search exits early when a word repeats a hand, unlike the previously explored methods.

The next, perhaps obvious, method to try is a recursive sub-query factoring clause (WITH clause). The syntax changes but the logic of the recursion is essentially the same. Translate the letters to hands; for each letter, check whether it alternates from the previous hand. If the hands alternate, continue; if they do not, end early for that word. Keep only those rows where the counter reaches the length of the word.

WITH
    translated(word, hand)
    AS
        (SELECT word,
                TRANSLATE(word,
                          'qwertasdfgzxcvb yuiophjklnm', 
                          'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
           FROM words),
    x(word, hand, n)
    AS
        (SELECT word, hand, 1
           FROM translated
          WHERE hand LIKE '%L'
         UNION ALL
         SELECT word, hand, n + 1
           FROM x
          WHERE SUBSTR(hand, n, 2) IN ('RL', 'LR'))
SELECT word
  FROM x
 WHERE n = LENGTH(word);

It’s slightly more cumbersome syntax, but I think it’s an easier read for those new to recursive queries than the CONNECT BY version. It also doesn’t require any special tricks like the PRIOR SYS_GUID() condition to properly connect the children of each word.

And finally, while I have enjoyed playing around with syntax to find different methods to solve the same original question, all of them are unnecessarily complex. The two answers above provide a hint as to the direction I would go if I had to implement this in production instead of just for academic curiosity.

First, simply translate the letters to hands and then check if the resulting “word” was of the form LRLR…RL or RLRL…RL. That is, does it alternate with RL (so it ends with an L) and possibly start with an L?

So, a single call to translate and a regular expression filter provides the answer, both simply and efficiently.

SELECT word
  FROM (SELECT word, TRANSLATE(word, 
                               'qwertasdfgzxcvb yuiophjklnm',
                               'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
          FROM words)
 WHERE REGEXP_LIKE(hand, '^L?(RL)*$');

If you have an older version of the database that doesn’t support regular expressions, or if you want to port this puzzle and answer to non-Oracle systems, then you can simply construct an alternating string using padding functions. Remember to check for both right- and left-hand starting letters.

This is slightly more complex than using REGEXP_LIKE; but this version should be supported back to at least version 7.3.

SELECT word
  FROM (SELECT word,
               TRANSLATE(word, 'qwertasdfgzxcvb yuiophjklnm', 'LLLLLLLLLLLLLLL RRRRRRRRRRR') hand
          FROM words)
 WHERE     hand LIKE '%L'
       AND hand IN ('L' || RPAD('RL', LENGTH(word) - 1, 'RL'), RPAD('RL', LENGTH(word), 'RL'));

I hope you enjoyed this little exercise as much as I have. Thank you for reading!
