Pro C bulk insert error

  • From: Lou Avrami <avramil@xxxxxxxxxxxxxx>
  • To: <oracle-l@xxxxxxxxxxxxx>
  • Date: Thu, 07 Oct 2010 20:43:41 -0400 (EDT)

Hello all,

We're experiencing a very interesting error and I'm hoping someone might have 
some insight into a solution.

We have an application from a third-party vendor which collects "messages" from 
various remote devices.  The messages are loaded into an Oracle 11.2.0.1 
Enterprise Edition database, running on a Solaris 10 server.

Every week or so, we have an issue that causes these messages to queue up in 
the application.  When the bottleneck is resolved, the queued messages are 
"bulk loaded" into the database.  During these bulk load periods, the 
application receives the following error message:

------ Tue Sep 21 05:49:40 2010
pid     = 28927
conn    = [wfc]
process = ideasLog
module  = DB_ALG_UpdateChunk
sqlcode = -12152
mesg    = ORA-12152: TNS:unable to send break message

The connection to the database is apparently dropped, as the following message 
is:

mesg    = ORA-03114: not connected to ORACLE

The application eventually reconnects:

mesg    = Attempting to reconnect to database.
...
mesg    = reconnecting, retries 30, interval 10
...
mesg    = reconnect attempt #1
...
mesg    = reconnect successful

but in the interim, many of the queued messages are lost.

In a non-production environment the application folks discovered a way to cause 
messages to queue up and subsequently recreate the error.  I set up client and 
server side tracing to capture the SQL that was actually producing the 
ORA-12152 errors, but when tracing was enabled, we could not recreate the 
application errors.  When tracing was removed, we COULD reproduce the errors.  
Within the application, we also had the ability to set up a diagnostic mode and 
capture transactions, but again, when the app tracing was turned on, we could 
not reproduce the errors.  Turn the app tracing off though and the ORA-12152 
errors could be reproduced at will.

When we ran these conditions by the vendor support, they came back with a 
recommendation that we turn off the "bulkDBOperation" application setting.  The 
default setting is "ON", which means that 250 rows of messages are processed at 
a time and are inserted into several tables.  Turning bulkDBOperation OFF only 
processes 1 row of messages at a time, significantly slower.

Turn bulkDBOperation ON, the errors can be reproduced on demand, by executing 
certain functions and forcing a bulk load of messages.  Turn it off and these 
same functions do not produce the ORA-12152 errors and lost messages.

So it seems that when the application is "slow" - either because application 
tracing is on, Oracle tracing is on or the application is configured to only 
process one message at a time - it "works", there are no ORA-12152 and no 
messages are lost.  But if no tracing is enabled and the default processing 
mode is used, errors are produced when messages are queued up and the 
application has to process a backlog.

What I find especially interesting is that if bulkDBOperation is ON and some 
sort of tracing is also enabled, there are no errors.  The application runs 
significantly slower, for example it takes about 10 minutes to restart, as 
opposed to 1-2 minutes when there is no tracing.  But turn tracing off, with 
bulkDBOperation ON, and an ORA-12152 error occurs within the first minute of 
processing backlog messages.

The app folks are of the opinion that this is an "Oracle' problem.  Maybe ... 
it just seems very strange that when there is tracing overload on the database 
there are no errors, but when tracing is off errors occur.

The vendor support folks did send along the Pro C code that does the message 
processing.  I am pasting it below, preceded by the explanation by the vendor.  
Previously, they had suggested that we reconfigure the SDU and TDU packet sizes 
within SQLNET.  That had no affect - there ORA-12152 errors continued.  

Reading the vendor's explanation and looking at the code, one thing kind of 
struck me - what happens if one or more of the several "buffers" that are 
processed at the end of the 5 second period do NOT have 250 rows?  Does the 
code still try to load it 250 times?  What happens if some buffers have 250 
rows and some do not?

Looking at the nearest Pro C manual, I came across the following:

http://www.cis.njit.edu/~oracle/manuals/proc/ch13.html

Using the FOR Clause 
....
The FOR clause variable specifies the number of array elements to be processed. 
Make sure the number is not larger than the smallest array dimension. Also, the 
number must be positive. If it is negative or zero, no rows are processed and 
Oracle issues an error message.

This may not be applicable ...
I just find it very strange that when there is overload from runing in a 
diagnostic mode, there are no errors.

If anyone has some thoughts or insights into this problem, please let me know 
what they are.

Thanks,
Lou Avrami

========================
By changing how long it takes to make a DB call or how long it takes IDEAS to 
send a message to ideasLog, you are changing how many message ideasLog can 
process. ideasLog will spend 5 seconds in its IPMH loop gathering messages from 
IDEAS. If it takes longer to process those messages in any way, then we can 
send as many rows to the DB at a time.
 
We have buffers of 250 rows across several tables, alertlog, eventatt, 
alertprop, …, that we are attempting to fill up during this 5 second period. If 
one of the buffers fills, we flush all of them to the DB using Oracle host 
arrays. This is where the bigger packets are coming from. When IDEAS or the DB 
slows down, we can’t get as many rows into those buffers during that 5 seconds. 
So, when we flush, the packets sizes down at the oracle layer are probably 
smaller. 
 
My guess is that if you turn off the bulkDBOperation, you will not see this 
error from Oracle because the packet sizes will be greatly decreased. Steve is 
correct though. ideasLog will be significantly slower. 
 
Here is an example of the code that gets used when the bulkDBOperation option 
is turned on. Notice that we have the ability to update 250 rows at a time. 
This is one of the methods that seems to trigger the network packet errors.

nxint16
DB_ALG_UpdateChunk_r( ALG *chunk, nxuint32 HowMany)
{
   ALG *pAlgPtr;
 
   EXEC SQL BEGIN DECLARE SECTION;
        nxint32     arr_alertlogUpd_alert[250];
        nxint32     arr_alertlogUpd_alertid[250];
        nxint32     arr_alertlogUpd_mom[250];
        nxint32     arr_alertlogUpd_mo[250];
        nxint32     arr_alertlogUpd_times[250];
        nxint32     arr_alertlogUpd_ackoper[250];
        nxint32     arr_alertlogUpd_curroper[250];
        nxint32     arr_alertlogUpd_cleartype[250];
        nxint16     arr_alertlogUpd_severity[250];
        nxint16     arr_alertlogUpd_status[250];
        nxint16     arr_alertlogUpd_clearreason[250];
        char        arr_alertlogUpd_first1[250][26];
        char        arr_alertlogUpd_last1[250][26];
        char        arr_alertlogUpd_cleared[250][26];
        char        arr_alertlogUpd_acknowledge[250][26];
        char        arr_alertlogUpd_text1[250][256];
        nxint32     rowcount;
 
   EXEC SQL END DECLARE SECTION;
 
   rowcount = HowMany;
 
   for (int i=0; i<HowMany; i++) {
 
        pAlgPtr = (ALG *) chunk;
 
        arr_alertlogUpd_alert[i]       = pAlgPtr->alert;
        arr_alertlogUpd_alertid[i]     = pAlgPtr->alertid;
        arr_alertlogUpd_mom[i]         = pAlgPtr->mom;
        arr_alertlogUpd_mo[i]          = pAlgPtr->mo;
        arr_alertlogUpd_times[i]       = pAlgPtr->times;
        arr_alertlogUpd_ackoper[i]     = pAlgPtr->ackoper;
        arr_alertlogUpd_curroper[i]    = pAlgPtr->curroper;
        arr_alertlogUpd_cleartype[i]   = pAlgPtr->cleartype;
        arr_alertlogUpd_severity[i]    = pAlgPtr->severity;
        arr_alertlogUpd_status[i]      = pAlgPtr->status;
        arr_alertlogUpd_clearreason[i] = pAlgPtr->clearreason;
 
        dbstrcpyin(arr_alertlogUpd_first1[i],      pAlgPtr->first1 );
        dbstrcpyin(arr_alertlogUpd_last1[i],       pAlgPtr->last1 );
        dbstrcpyin(arr_alertlogUpd_cleared[i],     pAlgPtr->cleared );
        dbstrcpyin(arr_alertlogUpd_acknowledge[i], pAlgPtr->acknowledge );
        dbstrcpyin(arr_alertlogUpd_text1[i],       pAlgPtr->text1 );
 
        chunk++;
   }
 
   EXEC SQL FOR :rowcount UPDATE alertlog
            SET alertid     =      :arr_alertlogUpd_alertid,
                mom         =      :arr_alertlogUpd_mom,
                mo          =      :arr_alertlogUpd_mo,
                times       =      :arr_alertlogUpd_times,
                ackoper     =      :arr_alertlogUpd_ackoper,
                curroper    =      :arr_alertlogUpd_curroper,
                cleartype   =      :arr_alertlogUpd_cleartype,
                severity    =      :arr_alertlogUpd_severity,
                status      =      :arr_alertlogUpd_status,
                clearreason =      :arr_alertlogUpd_clearreason,
                first1      =      :arr_alertlogUpd_first1,
                last1       =      :arr_alertlogUpd_last1,
                cleared     =      :arr_alertlogUpd_cleared,
                acknowledge =      :arr_alertlogUpd_acknowledge,
                text1       =      :arr_alertlogUpd_text1
           WHERE alert = :arr_alertlogUpd_alert;
 
 
   if (DB_DB_SQLErr(&sqlca,"DB_ALG_UpdateChunk")) {
        DB_DB_Commit();
        return 1;
   }
 
   DB_DB_Commit();
   return 0;
}
--
//www.freelists.org/webpage/oracle-l


Other related posts: