Splitting a clob into rows


I’ve used this tool for a wide variety of other parsing projects. One of the interesting tuning techniques I used was to pull the clob apart into 32K varchar2 chunks.
It is possible to split the clob directly using the DBMS_LOB package or the overloaded SQL functions, but clobs are expensive objects. Varchar2 variables, on the other hand, are relatively lightweight, making the sub-parsing within them much faster. Doing this does take a little bit of care, though, to make sure the chunks don't accidentally split a line in two.

Also, I make the assumption that no single line will be more than 32K long, which is fine for this function anyway since the output is a SQL collection with a varchar2 limit of 4000 bytes.
The returned VCARRAY type is a simple table collection type.

CREATE OR REPLACE TYPE VCARRAY as table of varchar2(4000)
/

I wrote this originally in 9i. With 12c support for 32K varchar2 in SQL, I may need to revisit it and make a new version.

CREATE OR REPLACE FUNCTION split_clob(p_clob IN CLOB, p_delimiter IN VARCHAR2 DEFAULT CHR(10))
    RETURN vcarray
    PIPELINED
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    c_chunk_limit   CONSTANT INTEGER := 32767;
    v_clob_length            INTEGER := DBMS_LOB.getlength(p_clob);
    v_clob_index             INTEGER;
    v_chunk                  VARCHAR2(32767);
    v_chunk_end              INTEGER;
    v_chunk_length           INTEGER;
    v_chunk_index            INTEGER;
    v_delim_len              INTEGER := LENGTH(p_delimiter);
    v_line_end               INTEGER;
BEGIN
    v_clob_index := 1;

    WHILE v_clob_index <= v_clob_length
    LOOP
        /*
            Pull one 32K chunk off the clob at a time.
            This is because it's MUCH faster to use built in functions
            on a varchar2 type than to use dbms_lob functions on a clob.
        */
        v_chunk := DBMS_LOB.SUBSTR(p_clob, c_chunk_limit, v_clob_index);

        IF v_clob_index > v_clob_length - c_chunk_limit
        THEN
            -- if we walked off the end of the clob,
            -- then the chunk is whatever we picked up at the end,
            -- delimited or not
            v_clob_index := v_clob_length + 1;
        ELSE
            v_chunk_end := INSTR(v_chunk, p_delimiter, -1);

            IF v_chunk_end = 0
            THEN
                -- an entire 32K chunk with no delimiter violates the
                -- no-line-longer-than-32K assumption; give up rather
                -- than mangle the data
                DBMS_OUTPUT.put_line('No delimiters found!');
                RETURN;
            END IF;

            -- keep the entire trailing delimiter so a multi-character
            -- delimiter isn't cut off mid-sequence
            v_chunk := SUBSTR(v_chunk, 1, v_chunk_end + v_delim_len - 1);
            v_clob_index := v_clob_index + v_chunk_end + v_delim_len - 1;
        END IF;

        /*
            Given a varchar2 chunk split it into lines
        */

        v_chunk_index := 1;
        v_chunk_length := NVL(LENGTH(v_chunk), 0);

        WHILE v_chunk_index <= v_chunk_length
        LOOP
            v_line_end := INSTR(v_chunk, p_delimiter, v_chunk_index);

            IF v_line_end = 0 OR (v_line_end - v_chunk_index) > 4000
            THEN
                PIPE ROW (SUBSTR(v_chunk, v_chunk_index, 4000));
                v_chunk_index := v_chunk_index + 4000;
            ELSE
                PIPE ROW (SUBSTR(v_chunk, v_chunk_index, v_line_end - v_chunk_index));
                v_chunk_index := v_line_end + v_delim_len;
            END IF;
        END LOOP;
    END LOOP;

    RETURN;
EXCEPTION
    WHEN no_data_needed
    THEN
        NULL;
END split_clob;
/
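
For anyone who wants to try it out, here's a quick smoke test; the three-line clob is just a made-up sample:

SELECT ROWNUM line_no, COLUMN_VALUE line_text
  FROM TABLE(split_clob(TO_CLOB('line one' || CHR(10) || 'line two' || CHR(10) || 'line three')));

This should return three rows, one per line, in order.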

Thank you, thank you, thank you!


A little while ago Oracle announced the winners of the Oracle Database Developer Choice Awards, and I was a winner in both of the categories in which I was nominated:

SQL and PL/SQL

I was surprised and overjoyed when I was notified that I had not only been nominated, but named a finalist.
I’m truly humbled by the supportive votes I received.

I’m also inspired to try to give back even more, and I’ve got a few ideas brewing for my next few articles.

Thank you again!

Only a few hours left in the Oracle Database Developer Choice Awards


If you haven’t voted yet, please do so soon! Voting ends at 3pm US/Pacific time.

I’ve been nominated in the SQL and PL/SQL categories:

https://community.oracle.com/community/database/awards/sql-voting
https://community.oracle.com/community/database/awards/plsql-voting

A day late but still fun


Yesterday Kim Berg Hansen posted a couple of entries on his blog about scraping and parsing the Oracle Developer Choice Awards voting pages and building summaries of the results.

I’ve been doing something similar since the voting began, but took a slightly different approach. I do like the idea of storing the results in a table to be able to derive a history, but I never did that; I simply looked at a snapshot in time and didn’t bother to keep trends.

Kim took two approaches: one using an APEX call, which he said didn’t work very reliably, and another where he simply pastes the html in himself.

My method is similar, but I use UTL_HTTP calls to extract the html.
Also, my function returns the raw html and I do the cleanup in my SQL statement, whereas he cleans his up within his scraping function.
Since he’s storing the results, that makes sense; why incur the cost of cleanup each time he reads his table?

Here is my version of the scraping function:

CREATE OR REPLACE FUNCTION read_oracle_dev_choice(p_category IN VARCHAR2)
    RETURN CLOB
IS
    v_http_request    UTL_HTTP.req;
    v_http_response   UTL_HTTP.resp;
    v_buffer          VARCHAR2(32767);
    v_clob            CLOB;
    v_req_ctx         UTL_HTTP.request_context_key;
BEGIN
    v_req_ctx := UTL_HTTP.create_request_context('file:/home/oracle/devchoice', 'pa55w0rd', FALSE);

    v_http_request :=
        UTL_HTTP.begin_request('https://community.oracle.com/community/database/awards/' || p_category || '-voting/',
                               'GET',
                               'HTTP/1.1',
                               v_req_ctx);

    v_http_response := UTL_HTTP.get_response(v_http_request);
    --DBMS_OUTPUT.put_line('Response status code: ' || v_http_response.status_code);
    --DBMS_OUTPUT.put_line('Response reason phrase: ' || v_http_response.reason_phrase);
    --DBMS_OUTPUT.put_line('Response HTTP version: ' || v_http_response.http_version);

    DBMS_LOB.createtemporary(v_clob, TRUE);

    BEGIN
        LOOP
            UTL_HTTP.read_text(v_http_response, v_buffer, 32767);
            DBMS_LOB.writeappend(v_clob, LENGTH(v_buffer), v_buffer);
        END LOOP;
    EXCEPTION
        WHEN UTL_HTTP.end_of_body
        THEN
            UTL_HTTP.end_response(v_http_response);
    END;

    UTL_HTTP.destroy_request_context(v_req_ctx);

    RETURN v_clob;
END;
/
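
Before running the full query, a quick sanity check is worthwhile (this assumes the ACL and wallet mentioned below are already in place):

SELECT DBMS_LOB.getlength(read_oracle_dev_choice('sql')) html_length
  FROM DUAL;

A non-trivial length back means a page was retrieved; scan the text for “Unexpected Error Occurred” if the numbers look odd (see the p.s. at the end of this post).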

Once this scraping function has been created and you have your corresponding ACL and wallet entries, you can extract the results with a single SQL statement.
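
For completeness, the ACL piece might look something like this on 12c. This is only a sketch under assumptions: “SCOTT” stands in for whichever schema owns the function, and on 11g you’d use the older create_acl/add_privilege/assign_acl calls instead.

BEGIN
    -- allow the owning schema to resolve and connect to the voting site
    DBMS_NETWORK_ACL_ADMIN.append_host_ace(
        host => 'community.oracle.com',
        ace  => xs$ace_type(privilege_list => xs$name_list('connect', 'resolve'),
                            principal_name => 'SCOTT',
                            principal_type => xs_acl.ptype_db));
    COMMIT;
END;
/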

  SELECT *
    FROM (SELECT x.*, RANK() OVER(PARTITION BY category ORDER BY (score_rank + upvote_rank + uppct_rank)) avg_rank
            FROM (SELECT x.*,
                         RANK() OVER(PARTITION BY category ORDER BY score DESC) score_rank,
                         RANK() OVER(PARTITION BY category ORDER BY upvotes DESC) upvote_rank,
                         RANK() OVER(PARTITION BY category ORDER BY uppct DESC) uppct_rank
                    FROM (SELECT x.*, ROUND(100 * upvotes / totalvotes, 2) uppct
                            FROM (SELECT category,
                                         finalist,
                                         score,
                                         (score + 10 * votes) / 20 upvotes,
                                         (10 * votes - score) / 20 downvotes,
                                         votes totalvotes
                                    FROM (   SELECT cat category,
                                                    TO_NUMBER(REGEXP_SUBSTR(score, '[0-9]+')) score,
                                                    TO_NUMBER(REGEXP_SUBSTR(votes, '[0-9]+')) votes,
                                                    TRIM(finalist) finalist
                                               FROM (SELECT 'sql' cat, read_oracle_dev_choice('sql') html FROM DUAL
                                                     UNION ALL
                                                     SELECT 'plsql', read_oracle_dev_choice('plsql') FROM DUAL
                                                     UNION ALL
                                                     SELECT 'apex', read_oracle_dev_choice('apex') FROM DUAL
                                                     UNION ALL
                                                     SELECT 'db-design', read_oracle_dev_choice('db-design') FROM DUAL
                                                     UNION ALL
                                                     SELECT 'ords', read_oracle_dev_choice('ords') FROM DUAL),
                                                    XMLTABLE(
                                                        '//div[@class="jive-content-ideas-list-item clearfix"]'
                                                        PASSING xmltype(
                                                                    REGEXP_REPLACE(
                                                                        REGEXP_REPLACE(
                                                                            REGEXP_REPLACE(
                                                                                html,
                                                                                '^.+?
', '', 1, 0, 'n'),
                                                                            '.+$', '', 1, 0, 'n'),
                                                                        CHR(38) || '[^;]+?;'))
                                                        COLUMNS score VARCHAR2(20) PATH './div/div/strong',
                                                                votes VARCHAR2(20) PATH './div/span/span[@class="idea-vote-count"]',
                                                                finalist VARCHAR2(20) PATH './div[@class="jive-content"]/div/div[@class="jive-content-title"]') x)) x) x) x)
ORDER BY category DESC, score DESC;

And this produces results like the following:


CATEGORY  FINALIST                  SCORE    UPVOTES  DOWNVOTES TOTALVOTES      UPPCT SCORE_RANK UPVOTE_RANK UPPCT_RANK   AVG_RANK
--------- -------------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
sql       Emrah Mete                 3010        335         34        369      90.79          1           1          1          1
sql       Sayan Malakshinov          1780        210         32        242      86.78          2           2          2          2
sql       Sean Stuber                 790         98         19        117      83.76          3           3          3          3
sql       Kim Berg Hansen             680         86         18        104      82.69          4           4          4          4
sql       Erik Van Roon               400         55         15         70      78.57          5           5          6          5
sql       Justin Cave                 300         44         14         58      75.86          6           6          8          7
sql       Matthias Rogel              290         38          9         47      80.85          7           7          5          6
sql       Stew Ashton                 260         38         12         50         76          8           7          7          8
plsql     Adrian Billington          1080        112          4        116      96.55          1           2          1          1
plsql     Roger Troller              1030        117         14        131      89.31          2           1          4          2
plsql     Sean Stuber                 910         97          6        103      94.17          3           3          2          3
plsql     Patrick Barel               690         79         10         89      88.76          4           4          5          4
plsql     Morten Braten               620         68          6         74      91.89          5           5          3          4
plsql     Kim Berg Hansen             440         51          7         58      87.93          6           6          7          6
plsql     Bill Coulam                 400         46          6         52      88.46          7           7          6          7
ords      Dietmar Aust               1240        128          4        132      96.97          1           1          1          1
ords      Dimitri Gielis              800         86          6         92      93.48          2           2          2          2
ords      Morten Braten               430         50          7         57      87.72          3           3          4          3
ords      Kiran Pawar                 350         39          4         43       90.7          4           4          3          4
ords      Anton Nielsen               240         28          4         32       87.5          5           5          5          5
ords      Tim St. Hilaire             130         16          3         19      84.21          6           6          6          6
db-design Heli Helskyaho             1030        119         16        135      88.15          1           1          2          1
db-design Michelle Kolbe              630         75         12         87      86.21          2           2          3          2
db-design Rob Lockard                 520         57          5         62      91.94          3           3          1          2
db-design Mark Hoxey                  160         23          7         30      76.67          4           4          4          4
apex      Jari Laine                  720         78          6         84      92.86          1           1          4          1
apex      Morten Braten               680         73          5         78      93.59          2           2          3          2
apex      Juergen Schuster            560         58          2         60      96.67          3           3          1          2
apex      Kiran Pawar                 430         48          5         53      90.57          4           4          5          5
apex      Karen Cannell               280         30          2         32      93.75          5           5          2          4
apex      Paul MacMillan              130         21          8         29      72.41          6           6          7          6
apex      Trent Schafer               120         17          5         22      77.27          7           7          6          7

I hope you find this helpful and a fun exercise to pursue.

If you haven’t voted, I encourage you to do so here…

I’m a finalist in the SQL and PL/SQL categories and would, of course, appreciate a vote; but be sure to check the other categories too and vote up the finalists there that have helped you in some way.
Also note that you can vote up more than one person in a category, so by all means vote up as many as you want, including the other finalists in my categories.

p.s. – an interesting “feature” of the web scraping is that the URL works better when it includes a trailing slash “/”. I have no idea why this should help, but without it the returned CLOB will sometimes (often) contain “Unexpected Error Occurred” rather than the expected results. Occasionally the reverse happens, though. If anyone can explain why this is, I would appreciate it.

Wow! I’m a finalist in the Oracle Developer Choice Awards!!!


Yesterday I was happily surprised by an email from Steven Feuerstein letting me know I hadn’t responded to my finalist announcement. Well, the reason for that was I never saw it! I’m glad he let me know, though; what an honor.

I am a finalist in both the SQL and PL/SQL categories. If you’ve enjoyed any of my articles, or if I’ve answered a question for you or otherwise helped, I would appreciate an up-vote in the polls.

My profiles can be found here: PL/SQL and SQL.

Thank you very much for your support!

Using SQL to query USGS Data to monitor recent earthquakes


I live in Ohio, which is fairly stable as far as earthquakes go; but I have coworkers and friends who live in other, more quake-prone areas of the country. So sometimes I like to look up what’s going on in those areas. Naturally, I decided to see how I could use Oracle features to assist in my monitoring.

First, I needed a reliable source of information. The U.S. Geological Survey web site is my data source, and they have a variety of free data feeds available. If you’re going to do a lot of data analysis, I recommend pulling the data into a local table and saving it rather than hammering their server again and again for each query.

So, while these examples show how to access and parse the data, your production use should read from local storage; you’ll get better performance on repeated access than pulling across the internet anyway. A sketch of saving to a local table appears after the sample results below.

The USGS has several different formats available. I’ll be using the Quakeml formatted feed. Quakeml is simply xml with a predefined schema. Since I’m not a seismologist, I’m only interested in the basic information of when?/where?/how big? and not all of the minute details like sampling method, uncertainty, etc.

In the examples below I’ll use a rectangular box to define the area I want to examine, and I’ll request all of the data for the current and previous two days. For more complete information on constructing other query urls, see the reference here: http://earthquake.usgs.gov/fdsnws/event/1

The service returns a single xml document consisting of multiple event tags.  Each event represents one earthquake.  We’ll extract the text within the following nodes to get our result set.

/q:quakeml/eventParameters/event/description/text
/q:quakeml/eventParameters/event/origin/time/value
/q:quakeml/eventParameters/event/origin/longitude/value
/q:quakeml/eventParameters/event/origin/latitude/value
/q:quakeml/eventParameters/event/origin/depth/value
/q:quakeml/eventParameters/event/magnitude/mag/value

Note that text here is an xml tag, not the xquery function text().

The url construction requires some cumbersome syntax but isn’t really all that complicated. It basically boils down to a base url followed by parameters for the start and end times (in GMT), a bounding box of latitude/longitude coordinates, a minimum magnitude, and an ordering criterion.

In this case I’m looking for a bounding box that includes the state of Oklahoma, thus yielding a url construction of:

'http://earthquake.usgs.gov/fdsnws/event/1/query.quakeml?starttime='
|| TO_CHAR(FROM_TZ(CAST(TRUNC(SYSDATE - 2) AS TIMESTAMP), 'US/Eastern') AT TIME ZONE 'GMT', 'yyyy-mm-dd')
|| '&endtime='
|| TO_CHAR(SYSTIMESTAMP AT TIME ZONE 'GMT', 'yyyy-mm-dd"%20"hh24:mi:ss')
|| '&maxlatitude=37&minlatitude=33&maxlongitude=-94&minlongitude=-103'
|| '&minmagnitude=0&orderby=time'
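
For concreteness, if this ran at 2:30pm GMT on July 27, 2015, the expression would render roughly as follows (a hypothetical illustration; your dates will differ):

http://earthquake.usgs.gov/fdsnws/event/1/query.quakeml?starttime=2015-07-25&endtime=2015-07-27%2014:30:00&maxlatitude=37&minlatitude=33&maxlongitude=-94&minlongitude=-103&minmagnitude=0&orderby=time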


Using HTTPURITYPE we can read the USGS site for that url and extract the xml. Using XMLTABLE we can then extract each of the event nodes into individual rows and parse out the values we’re interested in.

SELECT FROM_TZ(TO_TIMESTAMP(time, 'yyyy-mm-dd"T"hh24:mi:ssxff"Z"'), 'GMT') 
            AT TIME ZONE 'US/Central' quake_time,
       latitude,
       longitude,
       magnitude,
       depth_in_meters,
       location_name,
       XMLSERIALIZE(DOCUMENT eventxml AS CLOB INDENT SIZE = 3)
  FROM XMLTABLE(
           xmlnamespaces(DEFAULT 'http://quakeml.org/xmlns/bed/1.2',
                         'http://anss.org/xmlns/catalog/0.1' AS "catalog",
                         'http://quakeml.org/xmlns/quakeml/1.2' AS "q"),
           '/q:quakeml/eventParameters/event'
           PASSING httpuritype(
                          'http://earthquake.usgs.gov/fdsnws/event/1/query.quakeml?starttime='
                       || TO_CHAR(FROM_TZ(CAST(TRUNC(SYSDATE - 2) AS TIMESTAMP), 'US/Eastern')
                                AT TIME ZONE 'GMT','yyyy-mm-dd')
                       || '&endtime='
                       || TO_CHAR(SYSTIMESTAMP AT TIME ZONE 'GMT', 'yyyy-mm-dd"%20"hh24:mi:ss')
                       || '&maxlatitude=37&minlatitude=33&maxlongitude=-94&minlongitude=-103'
                       || '&minmagnitude=0&orderby=time').getxml()
           COLUMNS 
              eventxml XMLTYPE PATH '.',
              time VARCHAR2(100) PATH './origin/time/value',
              longitude NUMBER PATH './origin/longitude/value',
              latitude NUMBER PATH './origin/latitude/value',
              magnitude NUMBER PATH './magnitude/mag/value',
              depth_in_meters NUMBER PATH './origin/depth/value',
              location_name VARCHAR2(100) PATH './description/text')


Since Oklahoma is in the US/Central time zone, I’ll extract the data in time local to the events.

Running the query above yields a result set like the following…

QUAKE_TIME                            LATITUDE LONGITUDE MAGNITUDE DEPTH_IN_METERS LOCATION_NAME
------------------------------------ -------- --------- --------- --------------- -------------------------------
2015-07-27 19:24:03.770000000 -05:00  35.9938  -97.5664       3.2            3290 5km NNE of Crescent, Oklahoma
2015-07-27 16:07:46.590000000 -05:00  36.0019  -97.5183       2.5            4490 8km NE of Crescent, Oklahoma
2015-07-27 13:12:15.330000000 -05:00   36.006  -97.5761       4.5            3180 6km NNE of Crescent, Oklahoma
2015-07-27 12:49:28.000000000 -05:00  36.0018  -97.5667         4            5590 6km NNE of Crescent, Oklahoma
2015-07-26 08:26:50.480000000 -05:00  35.9983  -97.5693       3.2            4270 5km NNE of Crescent, Oklahoma
2015-07-26 06:30:48.720000000 -05:00  36.5294  -98.8821         3            5000 30km ENE of Mooreland, Oklahoma
2015-07-26 04:54:33.520000000 -05:00  36.0082  -97.5709       3.9            5860 6km NNE of Crescent, Oklahoma
2015-07-26 04:15:06.100000000 -05:00  35.9656  -96.8124       2.8            4160 4km WSW of Cushing, Oklahoma
2015-07-26 01:59:44.610000000 -05:00  36.0016  -97.5775       2.9            7130 5km NNE of Crescent, Oklahoma
2015-07-25 06:14:44.200000000 -05:00  36.0034   -97.568       3.9            5570 6km NNE of Crescent, Oklahoma
2015-07-25 03:47:53.420000000 -05:00  36.3478  -96.8192       2.9            1100 1km NW of Pawnee, Oklahoma
2015-07-25 01:48:21.880000000 -05:00   36.136  -97.0685       1.8            4280 2km NNW of Stillwater, Oklahoma
2015-07-24 23:49:02.820000000 -05:00  36.1675   -96.967         3            5000 10km NE of Stillwater, Oklahoma
2015-07-24 23:10:33.690000000 -05:00  36.5804  -97.6208       3.1            5000 27km SSE of Medford, Oklahoma

New version of LDAP package published


I let this project sit on the back burner for too many months, but tonight I finally got around to cleaning it up and getting it out for people to use.

Do note, the authentication is a little different and hence not compatible with prior versions.
While I don’t like breaking compatibility, it was a necessary change, since the earlier versions made too many assumptions about administrative distinguished names.
Other than that, most of the api should still look and behave much as it did before.

Version 2 supports aliases, so if you have multiple names for the same database service, you can now consolidate them into a single entry with aliases pointing to it.

Enjoy!

Version 2 can be found here
