[isapros] ISA EE Disaster Recovery - Apologies, a bit long!

  • From: "Jason Jones" <Jason.Jones@xxxxxxxxxxxxxxxxx>
  • To: <isapros@xxxxxxxxxxxxx>
  • Date: Thu, 3 Aug 2006 14:09:35 +0100

Hi All,

Just got back from spending a few days trying to do a disaster recovery
exercise for ISA EE. Had a few problems, so wondered if anyone could
provide some ideas.

Before the test, we had been writing our own DR procedures based on
testing in our virtual labs. During research for this, I found that
there seems to be very little information of any value available (from
what I can tell), so I have had to base a lot of the process on "what I
have seen/found" rather than "how it works". Any info on detailed
recovery for ISA would be really handy (specifically on how ISA deals
with server objects and GUIDs).

Anyhow, the CSS recovery went fine, but I had lots of issues trying to
recover array members. The biggest problem seemed to be related to the
use of NLB. For background, the servers are also acting as multinetwork
firewalls so have 5 NICs, each with one or more VIPs.

The process I followed was:

*       Reinstall OS with same computer name and IP settings etc.
*       Configure pre-install configuration changes (edit hosts
file, reinstall SSL certs etc.)
*       Reinstall ISA Server software and join the original array
*       Apply ISA-specific service packs and updates (SP2 + KB916106)
*       Array cleanup (delete old server object with new GUID)
*       Configure post-install manual configuration changes (reg
changes, PMTU, add missing VIPs etc.)

Are there any flaws in this approach? Should I be doing something
different?

From the testing in the labs, I noticed that when you try to install a
new array member with the same computer name, ISA shows a new object
under the array members called ServerName{GUID} and the existing object
is still shown as ServerName. After a short time (maybe once the
firewall service is restarted) the properties of the old server are
moved back to the original object and you can then delete the temporary
ServerName{GUID} object. In the actual DR test, this didn't happen, and
the correct server object was the one labelled ServerName{GUID}.
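
To make the naming pattern above concrete, here is a minimal Python sketch that takes a list of array member names and reports which computer names appear both as a plain ServerName object and as a ServerName{GUID} object. It is only an illustration: it does not talk to ISA or the CSS, the member names and GUID are made up, and the function names are hypothetical. It deliberately does not decide which of the two objects is safe to delete, since the lab and the DR test disagreed on that.

```python
import re
from collections import defaultdict

# Matches names such as "ISA01{3F2504E0-4F89-11D3-9A0C-0305E82C3301}".
# Assumption: the suffix is a standard 8-4-4-4-12 hex GUID in braces.
GUID_SUFFIX = re.compile(
    r"^(?P<base>.+?)"
    r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}$"
)


def group_members(names):
    """Group member names by base computer name (case-insensitive).

    Returns {base: {"plain": [...], "guid": [...]}}.
    """
    groups = defaultdict(lambda: {"plain": [], "guid": []})
    for name in names:
        match = GUID_SUFFIX.match(name)
        if match:
            groups[match.group("base").lower()]["guid"].append(name)
        else:
            groups[name.lower()]["plain"].append(name)
    return dict(groups)


def duplicated_servers(names):
    """Return base names that have both a plain and a GUID-suffixed object."""
    return sorted(
        base
        for base, found in group_members(names).items()
        if found["plain"] and found["guid"]
    )


# Made-up example data: ISA01 was reinstalled, ISA02 was not.
members = [
    "ISA01",
    "ISA01{3F2504E0-4F89-11D3-9A0C-0305E82C3301}",
    "ISA02",
]
print(duplicated_servers(members))
```

Any name this flags would still need checking by hand in the console to see which object actually holds the live server properties before anything is deleted.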

The key problem seems to lie with NLB, because as soon as the
reinstalled server contacted the CSS and downloaded the array config,
NLB was enabled on the interfaces and at that point the server lost
network connectivity. I managed to get the server working again by
running the 'RemoveallNLBsettings' script and restarting the firewall
services, but this didn't work every time and on occasion the script
needed to be run several times before it worked.

Funnily enough, the customer wasn't happy with this "try it a few times
until it works" approach :-))) 

Any ideas, really appreciated...

Cheers

JJ
