[THIN] Re: Random ICA disconnects!

  • From: "Gabe Knuth" <me@xxxxxxxxxxxxx>
  • To: <thin@xxxxxxxxxxxxx>
  • Date: Thu, 20 May 2004 21:53:51 -0400

Nah...we just found out that the users quit letting us know about them.  We had
to dig through logs and found about 55 disconnects in two days on a 300-user
farm.  I'm somewhat relieved, considering I can't see how a file server could
fix that problem anyway.
 
So, problem still there...no clue how to proceed.
________________________________

From: thin-bounce@xxxxxxxxxxxxx on behalf of Steve Raffensberger
Sent: Thu 5/20/2004 8:12 PM
To: thin@xxxxxxxxxxxxx
Subject: [THIN] Re: Random ICA disconnects!



Gabe Knuth seems to have fixed his disconnection problem by replacing a file
server.

The following is a copy of a message from Rick Mack to this forum a few
weeks ago. Rick mentions all the standard disconnect troubleshooting steps
one should take.

Hope this helps,

Raff

-------------------- From Rick ------
Hi People,

Had fun on a site with lots of disconnections and the fix turned out to be
something we didn't suspect at all. Thought it might be interesting if I
gave you a quick tour of what we did.

We had just consolidated a bunch of Citrix MetaFrame (Win2K SP2 / MF XPa FR2)
servers into one location (previously each regional office had its own
server). Gigabit backbone, 1-2 MB ADSL connections to each office (8-30
users per office).

WAN performance was a bit ordinary at times 'til we put in a Thinprint
gateway server, and protocol queuing, getting away from ICA client-based
printing. Everything looked reasonably good, but there were a fair few
disconnections happening on the WAN. Some users were getting disconnected up
to 5-6 times a day and getting really annoyed.

We did the usual things: monitored WAN stability, turned on ICA keepalives
and upped TcpMaxDataRetransmissions so that sessions might last out any
transient comms problems and disconnections were detected promptly enough
for auto-reconnection to work most of the time. But the disconnections
remained, even though auto-reconnection made things a lot less
aggravating for the users. The disconnections were happening almost at
random, except that the busier users got disconnected more. But even idle
sessions could get disconnected.

Went through the servers with a fine tooth comb, fixing up everything that
was even slightly out. Word from the network guys was that except for a very
occasional dropout, which disconnected a lot of sessions at once, the WAN
links were fine.

So what was it?

We set up a network trace with Ethereal between a couple of the most badly
affected ICA clients and a dedicated server. I used dumpel to trawl the
server event logs for events 683 and 682 (disconnection/reconnections) so
that we could accurately determine when disconnections were happening. Since
these 2 client machines were getting 8-12 disconnections a day between them,
we didn't have long to wait.
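
The log trawl described above could be sketched roughly as follows. This is
not dumpel itself and does not reproduce its output format: it assumes the
events have already been exported to tab-separated text with a hypothetical
column layout (date, time, event ID, user), and the sample rows and user
names are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Event IDs named in the post: 683 = disconnection, 682 = reconnection.
DISCONNECT, RECONNECT = 683, 682


def parse_events(lines):
    """Yield (timestamp, event_id, user) for session events.

    Assumed (hypothetical) layout: date <TAB> time <TAB> event id <TAB> user.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        try:
            stamp = datetime.strptime(fields[0] + " " + fields[1],
                                      "%Y-%m-%d %H:%M:%S")
            event_id = int(fields[2])
        except ValueError:
            continue  # skip lines whose date or event id does not parse
        if event_id in (DISCONNECT, RECONNECT):
            yield stamp, event_id, fields[3]


def disconnects_per_user(events):
    """Count disconnection events per user."""
    return Counter(user for _, event_id, user in events
                   if event_id == DISCONNECT)


# Made-up sample rows, only to show the shape of the input.
sample = [
    "2004-05-18\t09:14:02\t683\talice",
    "2004-05-18\t09:14:40\t682\talice",
    "2004-05-18\t10:02:11\t528\tbob",
    "2004-05-18\t11:30:05\t683\talice",
    "2004-05-18\t11:45:00\t683\tbob",
]
counts = disconnects_per_user(parse_events(sample))
print(counts)
```

The timestamps of the 683 events are what you would then line up against the
packet trace to find the traffic just before each disconnection.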

The results were a surprise. There were a lot of re-transmissions, mostly on
the ICA client side, and most of the TCP session disconnections were
actually from the client end. It looked like the server was going offline
(from a comms perspective) for up to 30-45 seconds, prompting a client
disconnect. When we increased the TCP retransmission count at the client end
(Win98), disconnections still happened, despite the TCP session timeout being
extended to over 2 minutes. Packets and retransmissions were just getting
lost.
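
To see why a couple of extra retransmissions stretches the timeout so far,
here is a rough model. It assumes the retransmission timeout simply doubles
after every unanswered attempt and ignores any upper cap or RTT-based
adjustment a real stack applies; the 1-second initial timeout is an
illustrative assumption, not a value from the trace.

```python
def give_up_after(initial_rto, max_retransmissions):
    """Seconds from the first send until the sender abandons the segment.

    Simplified model: the original send plus each retransmission waits
    twice as long as the attempt before it.
    """
    waits = [initial_rto * 2 ** attempt
             for attempt in range(max_retransmissions + 1)]
    return sum(waits)


# Assumed initial timeout of 1 second, for illustration only.
for retries in (5, 6, 7):
    print(retries, "retransmissions ->", give_up_after(1.0, retries), "s")
```

Under these assumptions five retransmissions give up after about a minute
and six take it just past two minutes, which is why a longer timeout alone
cannot help if the retransmissions themselves never arrive.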

It really looked like a LAN problem (problem in computer room). The servers
had gigabit cards, so we tried dropping everything to 100 Mb, even replaced
the gigabit card with 10/100 cards and bypassed the gigabit switch with the
server plugged into a 10/100 switch. Since the computer room had mistakenly
been cabled to Cat 5 we even bypassed the existing cabling with cat 6 cables
direct to the switch.

No improvement. So we couldn't blame the NICs, the cabling or the gigabit
network. But it sure looked like ICA packets were dropping down a black hole
at times.

I happened to spot a Microsoft technote on a PMTU detection fault in Win2K
SP2 that looked just about perfect. If you look at the IP flags on an ICA
protocol packet, you'll find the "Don't fragment" bit is set. Considering
that an ADSL link often uses a smaller MTU than Ethernet, it looked like we
might have found our problem. When we examined the network trace for large
packets (> MTU (1440 bytes)), every single large packet was being retransmitted.

When you did a "ping -f -l 1441" from the server to an ICA client all
packets were dropped, and "ping -f -l 1440" had about a 25-50% drop rate.
Smaller packets were okay. Whoopee! And all you have to do is put in a
registry entry to force a small packet size and things will be fixed. Nope!
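
The ping test above can be automated as a search for the largest payload
that still gets through with "Don't fragment" set. The sketch below is only
the search logic: probe is a hypothetical callable you would back with a
real "ping -f -l <size>", and fake_probe is a stand-in that uses the
1440-byte figure from the test above. On Windows, -l sets the ICMP data
size, so the IP packet on the wire is 28 bytes larger than the number given.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def packet_size(payload):
    """IP packet size produced by an ICMP echo with this many data bytes."""
    return payload + ICMP_HEADER + IP_HEADER


def largest_passing_payload(probe, low=0, high=1472):
    """Binary search for the largest payload for which probe(payload) is True.

    Assumes every size up to some limit passes and every larger size fails.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def fake_probe(payload):
    """Stand-in for a real ping; pretends 1440 data bytes is the limit."""
    return payload <= 1440


best = largest_passing_payload(fake_probe)
print(best, packet_size(best))
```

A real probe would need to send several pings per size and pick a loss
threshold, since a size right at the limit may only fail some of the time,
as the 25-50% drop rate at 1440 shows.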

So where was our black hole?

To absolutely exclude the LAN components, we set up a system with 3 NICs
(one for remote access, 2 for monitoring). We set up 2 lots of simultaneous
packet monitoring, between the WAN router and core switch (input side), and
the switch and the server. That way we had 2 packet traces on both sides of
the switch that were accurately synchronised by time offset (both Ethereal
sessions on same system, one on each monitoring NIC).

The results were pretty discouraging because both traces looked identical.
Kind of suggested that our problem wasn't on the LAN.
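
Comparing two traces like this by eye is tedious. A minimal sketch of doing
it mechanically, assuming each trace has been exported to a list of
(source IP, destination IP, IP ID) tuples, is below. That export format and
the addresses are hypothetical, chosen only for illustration.

```python
from collections import Counter


def missing_downstream(upstream, downstream):
    """Return packets seen on the router side but not on the server side.

    Each packet is a (src_ip, dst_ip, ip_id) tuple. Duplicates are matched
    one for one, so a retransmission seen once upstream and once downstream
    is not reported.
    """
    remaining = Counter(downstream)
    missing = []
    for pkt in upstream:
        if remaining[pkt] > 0:
            remaining[pkt] -= 1
        else:
            missing.append(pkt)
    return missing


# Made-up sample: the packet with IP ID 103 never crosses the switch.
router_side = [("192.0.2.10", "10.0.0.5", 101),
               ("192.0.2.10", "10.0.0.5", 102),
               ("192.0.2.10", "10.0.0.5", 103)]
server_side = [("192.0.2.10", "10.0.0.5", 101),
               ("192.0.2.10", "10.0.0.5", 102)]
lost = missing_downstream(router_side, server_side)
print(lost)
```

An empty result corresponds to the "both traces looked identical" outcome
above, meaning the switch itself is not dropping anything.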

But one of our network guys was finally convinced that it was a network
issue, so he persisted in going through the disconnection traces packet by
packet. About 50 packets upstream from the disconnection, he found something
that shouldn't have been there. We were looking at a packet trace where we
were using a TCP/IP address filter, looking at packets between a single
client and server. What he found was that the destination MAC address
of packets going to the server was occasionally changing, just before a
whole bunch of retransmissions and disconnection from the client end.

The router was actually sending packets with the IP address of the router to
their PIX firewall (default gateway), not the Metaframe server, and more
importantly as well as a packet getting redirected to the wrong place, all
subsequent retransmissions of the lost packet were also getting sent to the
PIX. This was happening in the midst of normal traffic and ACKs, all with
the right MAC address and IP address. Since all the client retransmissions
weren't being acknowledged, the client eventually just gave up.
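
This kind of anomaly is easy to miss when filtering on IP address alone. A
sketch of a check that would flag it, assuming the trace has been exported
to (frame number, destination IP, destination MAC) tuples, follows. The
export format and the addresses in the sample are hypothetical.

```python
from collections import Counter, defaultdict


def odd_mac_packets(packets):
    """Return packets whose destination MAC is not the usual one.

    "Usual" means the most frequently seen destination MAC for that
    destination IP within the trace.
    """
    packets = list(packets)
    macs_by_ip = defaultdict(Counter)
    for _, dst_ip, dst_mac in packets:
        macs_by_ip[dst_ip][dst_mac] += 1
    usual = {ip: seen.most_common(1)[0][0]
             for ip, seen in macs_by_ip.items()}
    return [(frame, ip, mac) for frame, ip, mac in packets
            if mac != usual[ip]]


# Made-up sample: frame 4 goes to the right IP but the wrong MAC.
SERVER_MAC = "00:00:5e:00:53:01"
OTHER_MAC = "00:00:5e:00:53:02"
trace = [(1, "10.0.0.5", SERVER_MAC),
         (2, "10.0.0.5", SERVER_MAC),
         (3, "10.0.0.5", SERVER_MAC),
         (4, "10.0.0.5", OTHER_MAC),
         (5, "10.0.0.5", SERVER_MAC)]
suspects = odd_mac_packets(trace)
print(suspects)
```

Running a check like this over the whole capture, rather than reading 50
packets back from each disconnection, would surface every misdirected frame
at once.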

I guess I didn't mention that the router in question was a new model Cisco
router. One of the performance enhancements that Cisco have is CEF (Cisco
Express Forwarding), which optimises packet retransmissions etc. by resending
identical packets out of a buffer rather than handling slower retransmission
from the WAN. If the same packet was being regenerated, it could explain why
the re-transmissions were also going to the same, wrong MAC address. Didn't
explain why the router was getting confused, but at least explained why it
was being consistent.

When we disabled CEF, the disconnections went away. Cisco will be getting a
full bug report and we've got a happy customer. Just don't ask how many
man-hours it took to find this feature :-(

Regards,

Rick

Ulrich Mack
Volante Systems
18 Heussler Terrace, Milton 4064
Queensland, Australia
tel +61 7 32467704
rmack@xxxxxxxxxxxxxx


----------------------------------------

********************************************************
This Week's Sponsor - Tarantella Secure Global Desktop
Tarantella Secure Global Desktop Terminal Server Edition
Free Terminal Service Edition software with 2 years maintenance.
http://www.tarantella.com/ttba
**********************************************************
Useful Thin Client Computing Links are available at:
http://thin.net/links.cfm
***********************************************************
For Archives, to Unsubscribe, Subscribe or
set Digest or Vacation mode use the below link:
http://thin.net/citrixlist.cfm


