Re: AWS RDS Detecting failover

From: "Steve T. Baldwin" <stbaldwin@xxxxxxxx>
To: Maris Elsins <elmaris@xxxxxxxxx>, "knecht.stefan@xxxxxxxxx" <knecht.stefan@xxxxxxxxx>
Date: Tue, 3 Oct 2017 19:18:44 +0000

Thanks all.  Maris is correct.  I'm trying to react to a 'real' failover, not a
planned one.

I've tinkered with the event notification plumbing, but in my experience the
notification doesn't arrive until well after the actual failover event, around
8-10 minutes later.  I queried AWS support about this and it is 'expected
behaviour'.  The other issue is what do you do
with the event?  How do you discover and notify every connected client that
they need to abort their connection and reconnect?  Particularly if they are in
the middle of a 'wait-for-2-hours' tcp response.  If the clients are known
processes you can always kill+restart them, but what if your client is your
on-prem DB server, connected to the RDS instance over a DB link?  Messy.

That's why I landed on the socat solution.  If I have a single server proxying
all DB connections, I can go to that server and kill the socat processes
serving the DB that failed over.  The clients behind those processes then see
their connection drop and can react immediately, or the next time they access
the DB.  No more wait-for-2-hours problem.  However, it just feels like a messy
hack and I was hoping someone has invented a more elegant wheel.
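
For illustration only, here is a rough Python sketch of the "kill the socat processes serving the DB that failed over" step. It assumes a Linux proxy host (it reads /proc) and that each socat command line contains the RDS endpoint name. The endpoint `mydb.example.rds.amazonaws.com` and all function names are made up for the example, not taken from my actual setup.

```python
import os
import signal


def socat_pids_for(processes, endpoint):
    """Pick the PIDs of socat processes whose command line mentions `endpoint`.

    `processes` is an iterable of (pid, command_line) pairs.
    """
    pids = []
    for pid, cmdline in processes:
        argv = cmdline.split()
        if argv and os.path.basename(argv[0]) == "socat" and endpoint in cmdline:
            pids.append(pid)
    return pids


def list_processes():
    """Read (pid, command line) pairs from /proc (Linux only)."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # process exited while we were scanning
        found.append((int(entry), raw.replace(b"\0", b" ").decode(errors="replace").strip()))
    return found


def kill_proxies(endpoint):
    """Send SIGTERM to every socat process proxying `endpoint`."""
    # Note: this also matches a listening parent started with `fork`,
    # so something would need to restart the listener afterwards.
    for pid in socat_pids_for(list_processes(), endpoint):
        os.kill(pid, signal.SIGTERM)


# Hypothetical process table, used instead of the live one for the demo.
sample = [
    (101, "socat TCP-LISTEN:1521,fork TCP:mydb.example.rds.amazonaws.com:1521"),
    (102, "/usr/bin/socat TCP-LISTEN:1522,fork TCP:otherdb.example.rds.amazonaws.com:1521"),
    (103, "python app.py mydb.example.rds.amazonaws.com"),
]
matched = socat_pids_for(sample, "mydb.example.rds.amazonaws.com")
print(matched)  # [101]
```

Only `socat_pids_for` is exercised here; `kill_proxies` is deliberately not called in the example.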

Thanks,

Steve

________________________________
From: Maris Elsins <elmaris@xxxxxxxxx>
Sent: Tuesday, 3 October 2017 11:27:39 PM
To: knecht.stefan@xxxxxxxxx
Cc: Steve T. Baldwin; Oracle-L (E-mail)
Subject: Re: AWS RDS Detecting failover

Hi,

I think the purpose is to come up with something that'd work in case of a real,
unplanned failover, so killing sessions prior to failover wouldn't be possible,
as the "failing over" is controlled by AWS.
This would require some additional work, but, there's an Event notification
raised at the time of RDS failover, which you could subscribe to and process it
to clean up/restart the old processes/connections when failover happens (look
here<http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Events.html#USER_Events.Messages>)
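
As a minimal sketch of the "subscribe and process" idea, the Python function below could serve as a Lambda handler behind an SNS topic that receives RDS events. The `Records[].Sns.Message` wrapper is the usual SNS-to-Lambda shape; the `Source ID` and `Event Message` field names, and the sample message text, are assumptions about the RDS payload and should be checked against a real notification. What to do with the result (restart clients, kill proxies) is left out.

```python
import json


def handler(event, context=None):
    """Return the source IDs of RDS instances whose event mentions a failover."""
    failed_over = []
    for record in event.get("Records", []):
        # SNS delivers the RDS event as a JSON string in the Message field.
        message = json.loads(record["Sns"]["Message"])
        # Assumed field names; verify against an actual RDS notification.
        if "failover" in message.get("Event Message", "").lower():
            failed_over.append(message.get("Source ID"))
    return failed_over


# Hypothetical event, shaped like an SNS-triggered Lambda invocation.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({
            "Source ID": "mydb",
            "Event Message": "Multi-AZ instance failover completed",
        })}},
        {"Sns": {"Message": json.dumps({
            "Source ID": "mydb",
            "Event Message": "DB instance restarted",
        })}},
    ]
}
result = handler(sample_event)
print(result)  # ['mydb']
```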

---
Maris Elsins
@MarisElsins<https://twitter.com/MarisElsins>
www.facebook.com/maris.elsins<https://www.facebook.com/maris.elsins>

On Tue, Oct 3, 2017 at 3:05 PM, Stefan Knecht
<knecht.stefan@xxxxxxxxx<mailto:knecht.stefan@xxxxxxxxx>> wrote:
If you're saying that the connected clients do not realize that the database
connection has died, have you tried forcefully killing all client sessions
prior to doing the failover?

That, combined with the appropriate tnsnames settings (e.g. listing both the
primary and failover sites with appropriate connection timeouts) should get you
what you need?

On Tue, Oct 3, 2017 at 2:48 AM, Steve T. Baldwin
<stbaldwin@xxxxxxxx<mailto:stbaldwin@xxxxxxxx>> wrote:

Hi all,

In my testing, when a client is connected to a multi-az RDS instance and I
force failover, that client doesn't 'see' it.  If it makes any DB request after
or during the failover it ends up timing out - eventually.  Unfortunately this
timeout is controlled by the tcp keepalive setting which defaults to 2 hours.
Not very helpful when the actual failover can be complete in a couple of
minutes.
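
To put numbers on that, here is a small Python sketch of how the keepalive settings translate into detection time for an idle connection whose peer has gone away. The 7200/75/9 figures are the stock Linux defaults; the "aggressive" values are arbitrary examples, not a recommendation.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Worst-case seconds before an idle connection to a dead peer is dropped.

    The kernel waits `idle` seconds before the first probe, then sends
    `probes` probes spaced `interval` seconds apart before giving up.
    """
    return idle + interval * probes


# Stock Linux defaults: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
linux_default = keepalive_detection_seconds(idle=7200, interval=75, probes=9)

# Hypothetical aggressive settings, for comparison only.
aggressive = keepalive_detection_seconds(idle=30, interval=5, probes=3)

print(linux_default)  # 7875
print(aggressive)     # 45
```

So with the defaults it is a little over two hours before the client notices, versus well under a minute with the example values.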

I'm wondering what other RDS users are doing to handle this scenario.

I've tried tinkering with sqlnet.ora params but couldn't find any that would
allow a connected client to detect the failover.

We may have many connected clients - both on-prem and from AWS - including our
on-prem DB servers using DB links.  I'm reluctant to muck with OS-level tcp
keepalive params, and in some cases that may not even be possible (e.g. from
another RDS instance).

My current solution involves using socat (http://www.dest-unreach.org/socat/)
as a proxy.  I can easily adjust the tcp keepalive parameters with this and
depending on the values I set for those parameters I can detect failover almost
immediately.
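
For clients that open their own sockets, roughly the same effect can be had per connection without touching system-wide settings. The Python sketch below shows the idea; the 30/5/3 values are illustrative, and the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT option names are the Linux ones, so they are only applied where the platform exposes them. This does not help clients such as DB links, where we don't control the socket.

```python
import socket


def enable_keepalive(sock, idle=30, interval=5, probes=3):
    """Turn on TCP keepalive for one socket with short, per-socket timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket tunables are Linux names; skip any the platform lacks.
    tunables = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the drop
    )
    for name, value in tunables:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```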

However, it means either running socat on every client, or having a dedicated
container/ec2 running socat - which I then have to monitor.  If that
container/ec2 fails but the DB doesn't, all in-flight connections are lost.

I'm thinking there has to be a better way.  I've contacted AWS support but they
suggested mucking with the tcp keepalive settings on all clients or,
alternatively, using SNS and Lambdas to notify/kill connected clients.  The
latter wasn't ideal because the Lambda didn't get fired until well after the
failover (8-10 mins), and I have a mixture of AWS and on-prem clients, so the
notification part is messy.

Thanks for any suggestions.

Steve

------------------------------------------------------------------
This email is intended solely for the use of the
addressee and may contain information that is
confidential, proprietary, or both. If you receive
this email in error please immediately notify
the sender and delete the email.
------------------------------------------------------------------
