RE: Overhead of load-balanced microservices architecture

From: "Mark W. Farnham" <mwf@xxxxxxxx>
To: <Clay.Jackson@xxxxxxxxx>, <dougk5@xxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
Date: Thu, 13 Aug 2020 11:28:39 -0400

Keeping in mind the possible death-by-inches problem of TOO many open Oracle
sessions (clarified, if you need it, by Graham Wood’s real-world demos and
videos), the 1980s implementation of leaving a service daemon running with an
open Oracle connection is a fast-response, low-cost way to do this. Back in the
day, we programmed these in OCI (Oracle Call Interface) to make it easier to
implement the daemons in C, which was the natural programming language for UNIX
and things built by copying the architecture and design of UNIX.

So your health check would:

1) Ping the service daemon to see if it is running, and only if it is not
running do a login to Oracle to start it.

2) Fire an action code at the service daemon with a return vector for the
answer.

In olden days each question code had to be built as a C program subroutine
included in the daemon code. Now, of course, you would make stored packaged
PL/SQL procedures and/or functions, so the C harness would be quite simple.
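
To make the action-code idea concrete, here is a minimal sketch in Python rather than C. Everything named here is hypothetical: FakeSession stands in for the one long-lived Oracle session, health_pkg.version_check is an invented stored function name, and the PING/VERSION/STOP codes are illustrative. A real harness would make the stored PL/SQL call through whatever Oracle client library you use. The STOP code anticipates the disconnect-and-stop action discussed next.

```python
from typing import Callable, Dict


class FakeSession:
    """Stand-in for the single long-lived Oracle session (hypothetical)."""

    def __init__(self) -> None:
        self.open = True
        self.calls = 0

    def call(self, procedure: str) -> str:
        # A real daemon would invoke a stored PL/SQL function here.
        if not self.open:
            raise RuntimeError("session is closed")
        self.calls += 1
        return f"{procedure}: OK"

    def close(self) -> None:
        self.open = False


class HealthDaemon:
    """Maps action codes to handlers that reuse one open session."""

    def __init__(self, session: FakeSession) -> None:
        self.session = session
        self.running = True
        self.actions: Dict[str, Callable[[], str]] = {
            "PING": lambda: "ALIVE",
            "VERSION": lambda: self.session.call("health_pkg.version_check"),
            "STOP": self._stop,
        }

    def _stop(self) -> str:
        # Disconnect and stop on request, so maintenance is not fought
        # by an automatic restart.
        self.session.close()
        self.running = False
        return "STOPPED"

    def handle(self, action_code: str) -> str:
        handler = self.actions.get(action_code)
        if handler is None:
            return "ERROR unknown action code"
        return handler()
```

The point of the table is that adding a new question means adding one entry that names a stored procedure, not writing a new subroutine in the harness.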

One of the action codes probably should be to disconnect and stop. While it is
possible to have a supervisor at the OS level to check for and restart the
service daemon, that creates the potential for a vampire that won’t die when
you are trying to do routine maintenance on the database. (This is the same
logic as having a quick swap-in URL picture for web application logins that
says: Please bear with us. Services are expected to resume at YYMMDD HH24:MI
[reading the expected return time from a file you control].)

Eliminating incessant restart attempt traffic by persistent machines and
frustrated humans is an important thing to do, both for your convenience and
for the worldwide zeitgeist.

Conversely, if you have a start script for a database with a flag for whether
or not to start your list of daemons, you can save a lot of time and energy and
still allow individual requests to start the daemon if it is not running. A
configuration file you control would indicate whether to attempt to honor the
start code, and you would slap that to “NO” when you also swap in the
out-of-service URL entry point screens.
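
A minimal sketch of that flag check, in Python for brevity. The file format (a HONOR_START=YES/NO line), the function names, and the choice to treat a missing file as “NO” are all my assumptions for illustration, not an existing tool.

```python
from pathlib import Path


def start_allowed(config_path: Path) -> bool:
    """Return True only if the file explicitly says HONOR_START=YES."""
    try:
        text = config_path.read_text()
    except FileNotFoundError:
        return False  # assumption: no file means do not start anything
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().upper() == "HONOR_START":
            return value.strip().upper() == "YES"
    return False  # assumption: flag absent is treated as "NO"


def handle_start_request(config_path: Path, daemon_running: bool) -> str:
    """Decide what to do with a request to start the service daemon."""
    if daemon_running:
        return "ALREADY_RUNNING"
    if not start_allowed(config_path):
        return "REFUSED"  # maintenance window: no Oracle login attempted
    return "START"  # caller would log in to Oracle and launch the daemon
```

With the flag at “NO”, a start request is refused before any login is attempted, which is what keeps restart traffic off the database during maintenance.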

A small number of daemons (often 1 per database) will help you avoid
death-by-inches problems of TOO many open Oracle sessions, especially if the
queries are pre-parsed and ready to execute. If any of the queries can be
lengthy or the system is prone to service request storms, then you might need
to build FIFO (First In, First Out) queueing of service request messages into
the daemon harness.
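
A rough sketch of FIFO queueing in the harness, again in Python and using only the standard library. The names (worker, serve, run_query) and the queue bound of 100 are assumptions for illustration. One worker thread owns the single database session and answers requests strictly in arrival order, with a per-request reply queue playing the role of the return vector.

```python
import queue
import threading
from typing import Callable, Iterable, List


def worker(requests: queue.Queue, run_query: Callable[[str], str]) -> None:
    """Drain requests in arrival order; None is the shutdown signal."""
    while True:
        item = requests.get()
        if item is None:
            break
        action_code, reply_to = item
        reply_to.put(run_query(action_code))


def serve(action_codes: Iterable[str],
          run_query: Callable[[str], str]) -> List[str]:
    """Queue every request, let one worker answer them, collect replies."""
    requests: queue.Queue = queue.Queue(maxsize=100)  # bound is arbitrary
    thread = threading.Thread(target=worker, args=(requests, run_query))
    thread.start()
    reply_queues = []
    for code in action_codes:
        reply_to: queue.Queue = queue.Queue(maxsize=1)
        requests.put((code, reply_to))  # blocks if the queue is full
        reply_queues.append(reply_to)
    requests.put(None)
    thread.join()
    return [reply.get() for reply in reply_queues]
```

The bounded queue means a request storm backs up in the harness rather than turning into new Oracle sessions.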

The alternative of starting a whole bunch of dedicated listeners to avoid
queueing delays is unappealing, at least to me, and you can skip the argument
about whether or not you really need the extra listeners, which you won’t until
the storm hits your aforementioned radar.

I don’t know whether this software harness is available off the shelf, and
yeah, you need to be able to control DoS (Denial of Service) attacks if your
health checks face the public internet. (Hint: they probably don’t, but if you
might have to troubleshoot them, an off-“LAN_CAMPUS” VPN or something might
save you going to the office.)

Good luck. Whether or not your team can reduce the frequency of any particular
check is a reasonable question. Running through the full login, security check,
session start overhead of an RDBMS session multiple times per second is begging
for a storm to hit your radar. Being lucky that you never will hit the radar
might be the cheapest solution, but do you feel lucky?

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On
Behalf Of Clay Jackson (cjackson)
Sent: Thursday, August 13, 2020 1:03 AM
To: dougk5@xxxxxxx; oracle-l@xxxxxxxxxxxxx
Subject: RE: Overhead of load-balanced microservices architecture

I’m by no means an expert on either F5 or Exadata hardware, and things have
changed in the last 10 years.

That said, what you might run into (and what I DID run into almost 10 years ago
with F5s and Oracle in “another life”) is network queuing. At the network and
OS level (“below” Oracle), the (Oracle) listener tells the OS to start
listening for connections on a specified port. 7/second is not THAT large, but
consider what happens when each connect request is received: several network
round trips as TCP negotiates the higher-level connections, then a message “up”
to the Oracle process at some point that tells the Oracle listener process to
actually set up the database connection. Some of those steps are “single
threaded”, so you may start to see queuing for some of those connection
requests, and when that happens, it can “cascade” very quickly.

I’ll dig back in my notes, and if I can find something that specifically
relates to what happened, I’ll post it.

Clay Jackson

From: oracle-l-bounce@xxxxxxxxxxxxx <oracle-l-bounce@xxxxxxxxxxxxx> On Behalf
Of DOUG KUSHNER
Sent: Wednesday, August 12, 2020 9:34 PM
To: oracle-l@xxxxxxxxxxxxx
Subject: Overhead of load-balanced microservices architecture

Our dev team recently rolled out an application using an F5 load-balanced
microservices architecture. There are several microservices, each load
balanced on up to 4 servers, and each with a health-check API that hits
the database. While this may have looked good on paper, just the overhead of
the health checks with no work being processed has resulted in roughly 7
connection attempts per second to the database. This results in a version
check query about 40K times per hour. The database is on an Exadata (2-node
RAC) with several other production databases.

Of course the Exadata has been handling it, so unless you are looking for
anomalies (which I always am), this will fly under the radar until it doesn't.
:)

I'm wondering if anyone knows how to determine the theoretical max
connections/sec that a listener can handle based on the number of cores
licensed in the system?

Also wondering if anyone here has encountered this scenario before and how they
dealt with it.  I'm also looking for a good reference on the subject.

My immediate focus will be on determining why these health check connections do
not appear to be utilizing the services' connection pools, while the dev team
determines whether they can relax the frequency of these health checks.

Regards,

Doug
