RE: OT - Getting fired for database oops

  • From: "Taylor, Chris David" <Chris.Taylor@xxxxxxxxxxxxxxx>
  • To: <stephenbooth.uk@xxxxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
  • Date: Mon, 18 May 2009 07:24:07 -0500

Sometimes I'm 'fearful' of getting fired for actually doing my job! :)
 
 
 
Chris Taylor
Sr. Oracle DBA
Ingram Barge Company
Nashville, TN 37205
Office: 615-517-3355
Cell: 615-354-4799
Email: chris.taylor@xxxxxxxxxxxxxxx
 

CONFIDENTIALITY NOTICE: This e-mail and any attachments are confidential and 
may also be privileged. If you are not the named recipient, please notify the 
sender immediately and delete the contents of this message without disclosing 
the contents to anyone, using them for any purpose, or storing or copying the 
information on any medium.

 

________________________________

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of Stephen Booth
Sent: Monday, May 18, 2009 7:08 AM
To: oracle-l@xxxxxxxxxxxxx
Subject: Re: OT - Getting fired for database oops




On 05/18/2009, John Hallas <John.Hallas@xxxxxxxxxxxxxxxxxx> wrote: 

        I do know of a DBA who deleted the test database ready for a refresh 
from production. The 578 datafiles took a long time to delete but slightly 
longer (36 hours)  to recover once he realised that he was logged onto 
production.


Something very similar happened in one of my past jobs.  A consultant DBA at a 
customer site (employed by the customer through an agency) trashed the main 
production finance system at 17:00 one Friday, thinking he was dropping the QA 
one ready for a restore from the production backup over the weekend.  I then 
had to spend the entire weekend restoring the production system and rolling it 
forward, plus restoring the QA system.  This was a 23:55 by 7 system (i.e. 5 
minutes of permitted downtime a day); fortunately weekends were slow and there 
was provision to cache transactions locally and apply them as a batch later, 
but the whole weekend's transactions amounted to roughly the average for 10 
minutes on a Monday morning, so getting it fixed before Monday was vital.
 




        The company got a £1.8 million fine for the outage  - government 
supplier etc


Fortunately we were able to get the system back by the early hours of Monday 
morning, so losses were minimal (about £1 million, pocket change for this 
organisation).
 


        He kept his job though

         

I suspect the DBA who trashed the database would have been sacked, but from 
what we could piece together (some forensic unpicking of events, phone logs, 
statements from people on site at the time and CCTV footage) he spent about 30 
minutes trying to fix it, phoned his agency for 10 minutes, cleared his desk 
and left for a destination unknown.  When contacted, his agency denied any 
knowledge of him.

The key lessons we learned from this were:

* Don't use the same passwords on production and QA (OS and Oracle).
* For any regular destructive job (e.g. deleting datafiles to clear down QA 
ready for a restore from prod), have a pre-written script that exists only on 
the server it's needed on, rather than running the commands by hand (a rough 
sketch of the idea follows below).
* When you've broken a mirror from a 3-way stack to back up from, consider not 
resilvering it until the last possible moment (had that been the case here, we 
could have restored by resilvering from the detached copy to the other two 
'disks' and rolling forward through the logfiles, for a total downtime of less 
than 3 hours).
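
For illustration only, here is a minimal sketch of that second point in Python: 
a clear-down script that lives only on the QA box and refuses to run anywhere 
else.  The hostname and datafile directory are made-up placeholders, not 
anything from the incident above.

    #!/usr/bin/env python3
    """Clear down QA datafiles ahead of a refresh from production.

    Sketch only: the hostname and directory below are hypothetical placeholders.
    """
    import socket
    import sys
    from pathlib import Path

    EXPECTED_HOST = "qa-db01"                     # the only box this should ever run on (hypothetical)
    QA_DATAFILE_DIR = Path("/u02/oradata/QAFIN")  # hypothetical QA datafile location

    def main() -> None:
        host = socket.gethostname().split(".")[0]
        if host != EXPECTED_HOST:
            # Refuse to do anything destructive anywhere but the QA server.
            sys.exit(f"Refusing to run: this host is {host!r}, not {EXPECTED_HOST!r}")

        datafiles = sorted(QA_DATAFILE_DIR.glob("*.dbf"))
        print(f"About to delete {len(datafiles)} datafiles under {QA_DATAFILE_DIR}")
        if input("Type the hostname to confirm: ").strip() != EXPECTED_HOST:
            sys.exit("Confirmation failed, nothing deleted.")

        for f in datafiles:
            f.unlink()
            print(f"deleted {f}")

    if __name__ == "__main__":
        main()

The same effect could be had with a shell wrapper that checks `hostname` first; 
the point is that the destructive step is scripted, host-checked and never 
present on the production server.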

We did try to get the customer to agree to us doing the trashing of the 
database as part of our restore process on the Saturday, but they insisted on 
keeping control of the process and having it done by their own staff.

Stephen

-- 
It's better to ask a silly question than to make a silly assumption.

http://stephensorablog.blogspot.com/ | 
http://www.linkedin.com/in/stephenboothuk | Skype: stephenbooth_uk

Apparently I'm an 'Eierlegende Woll-Milch-Sau' (literally an egg-laying 
wool-milk-sow, i.e. an all-in-one do-everything creature); I think it was meant 
as a compliment. 
