[THIN] Re: PS4 Locks up

  • From: "Matthew Shrewsbury" <MShrewsbury@xxxxxxxxxxxxxxx>
  • To: <thin@xxxxxxxxxxxxx>
  • Date: Mon, 27 Mar 2006 08:16:42 -0500

Thanks for all the input...most helpful!!

 

Matthew Shrewsbury, MCSE+Internet MCSE 2000 CCA Server+

Network Manager

-----Original Message-----
From: thin-bounce@xxxxxxxxxxxxx [mailto:thin-bounce@xxxxxxxxxxxxx] On
Behalf Of Rick Mack
Sent: Friday, March 24, 2006 7:03 PM
To: thin@xxxxxxxxxxxxx
Subject: RE: [THIN] PS4 Locks up

 

Hi Matthew,

 

Server lockups can be incredibly frustrating to sort out. 

 

Basically it's possible to see several different types of hangs or
lockups.

 

The first may be due to software on the server stressing things to the
limit.

Examples that I can think of are:

 

apps with severe memory leaks which cause the server to page itself to
death

cpu hogs that take out all your cpu resources

applications that exhaust some system resource like file handles

heavy registry updates

 

Then there are server hardware problems like flaky memory.

 

Corrupt user profiles have been a major cause of hangs on Server 2003 at
times, but you're running 2000 server so that's kind of unlikely. Using
the latest version of UPHClean isn't a bad idea though.

 

You didn't state whether you're running 2000 SP4, but if you are, I'd
suggest you look at hotfixes 324446, 816134, 817446, 821255, 823747,
823272 and 829485.

 

However, software issues aside, the most common cause of hangs is
back-end servers.

 

By this I mean that TS systems are incredibly dependent on timely
responses from back-end servers (file/print, domain controllers). If the
network I/O request queues fill up, the TS systems will hang, either
momentarily or completely, depending on the amount of pending I/O.

 

MaxMpxCt and MaxWorkItems tuning helps a lot and can make the difference
between a server that hangs and one that just goes slow when the
back-end gets sluggish.
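
To see what a server currently has set before tuning anything, here is a
minimal Python sketch. It assumes the two values sit under the
LanmanServer Parameters registry key, and it only reads them; it does not
recommend or write any numbers. The helper names read_smb_tuning and
describe are made up for this example.

```python
import sys

# Assumed location of the SMB server tuning values mentioned above.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
VALUE_NAMES = ("MaxMpxCt", "MaxWorkItems")


def read_smb_tuning(names=VALUE_NAMES):
    """Return {name: value or None}; None means the value is not set explicitly."""
    if sys.platform != "win32":
        # Nothing to read off Windows; report everything as unset.
        return {name: None for name in names}
    import winreg  # Windows-only standard library module

    found = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        for name in names:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
                found[name] = value
            except FileNotFoundError:
                found[name] = None
    return found


def describe(settings):
    """Turn the settings dict into one readable line per value."""
    lines = []
    for name, value in settings.items():
        if value is None:
            lines.append(f"{name}: not set (system default applies)")
        else:
            lines.append(f"{name}: {value}")
    return lines


if __name__ == "__main__":
    for line in describe(read_smb_tuning()):
        print(line)
```

A value reported as "not set" just means the registry entry is absent, so
whatever default the OS uses is in effect.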

 

The best I can probably do for you is to give you an example.

 

Had a situation recently where a TS server was just going super slow at
times and would hang for 2-3 minutes at a time. It was properly tuned
and looked okay from a performance monitoring viewpoint. Current
commands were a bit high but not excessive. In terms of when the hangs
were happening, they didn't happen all the time but started mid-morning
and kept happening til late afternoon.

 

I was fairly certain early on that the file/print server was the
problem, but that's where the fun started. The file server had so many
things wrong with it that we barely knew where to start. Cleaned up a
lot of crap (MP3s mostly) to free up some disk space, defragged and
chkdsked the volumes, moved files around to spread the I/O, fixed the
antivirus settings. Things got a lot better but we were still seeing the
hangs.

 

Set up perfmon to look at just about everything and saw an interesting
relationship between network i/o and server work queues. The network I/O
baseline was fairly high all the time, but would drop down to zero for
2-3 minutes. At the same time, the server work queue count was climbing
linearly up to 20-30 indicating that the CPUs were super busy (but cpu
time didn't peak at the same time). After 2-3 minutes the work queues
would drop to zero and the network I/O would resume. Memory utilization
was okay, cache hits were generally better than 95%, very little disk
I/O, cpu utilization was ok etc. 
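
The pattern described above (network I/O flat at zero for a couple of
minutes while the work queue climbs) can be picked out of logged counter
samples automatically. A minimal sketch, assuming the samples have
already been exported as (network bytes/sec, queue length) pairs at a
fixed interval. The function name find_stalls, the 15 second interval and
the thresholds are all assumptions for illustration.

```python
def find_stalls(samples, interval_s=15, min_stall_s=120, idle_bytes=1.0):
    """Find runs where network I/O is idle while the work queue grows.

    samples: list of (net_bytes_per_sec, queue_length) in time order.
    Returns a list of (start_index, end_index) pairs, inclusive.
    """
    stalls = []
    start = None
    # A busy sentinel sample at the end closes any run that is still open.
    padded = list(samples) + [(float("inf"), 0)]
    for i, (net, _queue) in enumerate(padded):
        if net <= idle_bytes:
            if start is None:
                start = i  # first idle sample of a new run
            continue
        if start is not None:
            end = i - 1
            long_enough = (end - start + 1) * interval_s >= min_stall_s
            queue_rose = samples[end][1] > samples[start][1]
            if long_enough and queue_rose:
                stalls.append((start, end))
            start = None
    return stalls
```

Short dips are ignored on purpose; only idle runs that last at least
min_stall_s and end with a longer queue than they started with are
reported.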

 

So something was making the server so busy that it wasn't responding to
anything. Poked around until I realised that there was something very
peculiar about the network i/o throughput I was seeing with task
manager. We're used to peaks and troughs in activity, but there was a
constant baseline activity and it never fell to zero. So what was going
on?

 

Installed ethereal and started looking at what was happening. I found
that the baseline activity was due to 2 workstations on the network that
were hammering the server. When we looked at the packet capture it was
really interesting. The packets were MTU sized SMB packets mostly filled
with nulls, so we were looking at some sort of malformed SMB request.
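
For anyone wanting to check a capture for the same thing, here is a
minimal Python sketch of the test: count, per sender, the packets that
are close to MTU size and almost entirely null bytes. It assumes the
capture has already been reduced to (source address, payload bytes)
pairs by whatever tool you use. The function names, thresholds and
addresses are made up for the example.

```python
from collections import Counter


def null_fraction(payload: bytes) -> float:
    """Fraction of the payload that is zero bytes."""
    if not payload:
        return 0.0
    return payload.count(0) / len(payload)


def suspicious_senders(packets, mtu=1500, size_slack=100,
                       null_threshold=0.9, min_packets=100):
    """Return {source: count} for senders of many large, mostly-null packets.

    packets: iterable of (source_address, payload_bytes).
    """
    counts = Counter()
    for src, payload in packets:
        near_mtu = len(payload) >= mtu - size_slack
        if near_mtu and null_fraction(payload) >= null_threshold:
            counts[src] += 1
    return {src: n for src, n in counts.items() if n >= min_packets}


if __name__ == "__main__":
    # Hypothetical data: one noisy sender, one normal sender.
    noisy = [("10.0.0.5", b"\xffSMB" + bytes(1456))] * 150
    normal = [("10.0.0.9", b"x" * 1460)] * 150
    print(suspicious_senders(noisy + normal))
```

The min_packets cutoff is there so that an occasional zero-padded packet
from a healthy machine doesn't get flagged.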

 

To cut a long story short, the 2 workstations were infected with a virus
which wasn't being activated until the user logged on. Once the user
logged off or the workstation was turned off everything started working
as designed. If the relevant users didn't come in that day or arrived
late or left early, the hang times would change. Cleaned up the virus
and the hangs disappeared.

 

Had a similar scenario where the culprits were 2 workstations where the
users had set the antivirus package to check network drives. 

 

Other causes can be backups left running during production time or
basically anything that slows down server responsiveness. If you've got
a lot of group policies, then group policy processing can also cause
server hangs on logon if your domain controllers aren't performing well.
If the server has just crashed and a lot of users are logging back on,
it can just lock up.

 

The last scenario I can think of at the moment is where there's a
NIC/switch port speed mismatch or autonegotiation problem. However this
is generally easy to diagnose because if you copy a file to and from the
server, you can see a huge difference in the copy speeds.
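
The copy test can be timed rather than eyeballed. A minimal Python
sketch, assuming you have a local scratch directory and a directory on
the server (for example a mapped drive) that you can write to. The
function names and the 50 MB test size are assumptions, and the copy back
may be flattered by client-side caching, so treat the numbers as a rough
guide only.

```python
import os
import shutil
import time


def timed_copy(src, dst):
    """Copy src to dst and return the throughput in MB/s."""
    size = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / max(elapsed, 1e-9)


def compare_directions(local_dir, remote_dir, size_mb=50):
    """Copy a test file to the server and back; return (up, down, ratio)."""
    local_src = os.path.join(local_dir, "copytest.bin")
    remote_copy = os.path.join(remote_dir, "copytest.bin")
    local_back = os.path.join(local_dir, "copytest_back.bin")
    with open(local_src, "wb") as f:
        f.write(os.urandom(size_mb * 1024 * 1024))
    try:
        up = timed_copy(local_src, remote_copy)     # client -> server
        down = timed_copy(remote_copy, local_back)  # server -> client
    finally:
        # Clean up the test files whether or not the copies succeeded.
        for path in (local_src, remote_copy, local_back):
            if os.path.exists(path):
                os.remove(path)
    return up, down, max(up, down) / min(up, down)
```

A ratio close to 1 means both directions run at about the same speed; a
large ratio is the kind of lopsided result that points at a speed or
duplex problem.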

 

Hopefully there's something relevant in this rambling ;-)

 

regards,

 

Rick

 

Ulrich Mack 
Volante Systems 

________________________________

From: thin-bounce@xxxxxxxxxxxxx on behalf of Matthew Shrewsbury
Sent: Sat 25/03/2006 8:29
To: thin@xxxxxxxxxxxxx
Subject: [THIN] PS4 Locks up

As you may have noticed I have been posting a lot as of late. For some
strange reason, out of the blue, I've been having a lot of issues. One of
my two PS4 servers keeps locking up, although both have locked up, just
not at the same time.

 

1) I've tried updating all firmware and drivers.

2) I've installed PSE400W2KR01.msp on the one server it would install on
(but it locked up about 7 hours later).

3) I've searched event logs but don't see anything obvious.

 

Is there any good method of repairing a Citrix server? I'm just not
finding anything that points me to a problem. Sunday I'm going to come
in and try my best to find the problem. I plan to start with low level
hardware diagnostics, then proceed to virus scans, boot from PE and look
for root kits, and run some network sniffing tools.

 

If everything comes up clean how should I try and repair my server?
Should I try and repair PS4 through Add/Remove Programs? I'm running an
older version of User Profile Hive Cleanup...could this cause lock ups?

 

Matthew Shrewsbury, MCSE+Internet MCSE 2000 CCA Server+

Network Manager

 

