[THIN] Re: PS4 Locks up

  • From: "Rick Mack" <Rick.Mack@xxxxxxxxxxxxxx>
  • To: <thin@xxxxxxxxxxxxx>
  • Date: Sat, 25 Mar 2006 10:03:19 +1000

Hi Matthew,
 
Server lockups can be incredibly frustrating to sort out. 
 
Basically it's possible to see several different types of hangs or lockups.
 
The first may be due to software on the server stressing things to the limit.
Examples that I can think of are:
 
apps with severe memory leaks which cause the server to page itself to death
CPU hogs that take out all your CPU resources
applications that exhaust some system resource like file handles
heavy registry updates
 
Then there are server hardware problems like flakey memory.
 
Corrupt user profiles have been a major cause of hangs on Server 2003 at times, 
but you're running 2000 Server so that's kind of unlikely. Using the latest 
version of UPHClean isn't a bad idea though.
 
You didn't state whether you're running 2000 SP4, but if you are, I'd suggest 
you look at hotfixes 324446, 816134, 817446, 821255, 823747, 823272 and 829485.
 
However, software issues apart, the most common cause of hangs is back-end 
servers.
 
By this I mean that TS systems are incredibly dependent on timely response from 
back-end servers (file/print, domain controllers). If the network I/O request 
queues fill up, the TS systems will hang, either momentarily or just stop 
depending on the amount of pending I/O.
 
MaxMpxCt and MaxWorkItems tuning helps a lot and can make the difference 
between a server that hangs and one that just goes slow when the back-end gets 
sluggish.
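
Before tuning, it helps to know what is currently set. The sketch below reads the two values, on the assumption that they live under the LanmanServer Parameters registry key (the path is my assumption, not something stated above). On anything other than Windows it simply reports nothing set.

```python
import sys

# Assumed location of the SMB server tuning values (not given in the post).
SERVER_PARAMS = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
VALUE_NAMES = ("MaxMpxCt", "MaxWorkItems")


def read_tuning_values():
    """Return {value name: current setting}; None means not set or not readable."""
    results = {name: None for name in VALUE_NAMES}
    if sys.platform != "win32":
        return results  # the registry only exists on Windows

    import winreg  # standard library, Windows only

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVER_PARAMS) as key:
            for name in VALUE_NAMES:
                try:
                    value, _value_type = winreg.QueryValueEx(key, name)
                    results[name] = value
                except OSError:
                    pass  # value absent, so the system default applies
    except OSError:
        pass  # key missing or access denied
    return results


if __name__ == "__main__":
    for name, value in read_tuning_values().items():
        print(f"{name}: {value if value is not None else 'not set (default)'}")
```

A value reported as not set only means the built-in default is in use; it says nothing about whether that default is adequate for your load.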
 
The best I can probably do for you is to give you an example.
 
Had a situation recently where a TS server was just going super slow at times 
and would hang for 2-3 minutes at a time. It was properly tuned and looked okay 
from a performance monitoring viewpoint. Current commands were a bit high but 
not excessive. In terms of when the hangs were happening, they didn't happen 
all the time but started mid-morning and kept happening til late afternoon.
 
I was fairly certain early on that the file/print server was the problem, but 
that's where the fun started. The file server had so many things wrong with it 
that we barely knew where to start. Cleaned up a lot of crap (MP3s mostly) to 
free up some disk space, defragged and chkdsked the volumes, moved files around 
to spread the I/O, fixed the antivirus settings. Things got a lot better but we 
were still seeing the hangs.
 
Set up perfmon to look at just about everything and saw an interesting 
relationship between network I/O and server work queues. The network I/O 
baseline was fairly high all the time, but would drop down to zero for 2-3 
minutes. At the same time, the server work queue count was climbing linearly up 
to 20-30, indicating that the CPUs were super busy (but CPU time didn't peak at 
the same time). After 2-3 minutes the work queues would drop to zero and the 
network I/O would resume. Memory utilization was okay, cache hits were 
generally better than 95%, there was very little disk I/O, CPU utilization was 
OK, etc. 
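
If you export the counters to a log, the pattern described here (network I/O near zero while the work queue stays backed up) can be picked out mechanically. This is only a rough sketch: the `Sample` record, the field names and all the thresholds are made up for illustration and would need adjusting to your own baseline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float          # seconds since the start of the log
    net_bytes: float  # network bytes per second
    queue_len: float  # server work queue length


def find_stalls(samples, net_floor=1000.0, queue_min=5.0, min_seconds=60.0):
    """Return (start, end) times where network I/O is near zero while the
    work queue is backed up, for at least min_seconds."""
    stalls = []
    start = None
    last = None
    for s in samples:
        stalled = s.net_bytes <= net_floor and s.queue_len >= queue_min
        if stalled:
            if start is None:
                start = s.t
            last = s.t
        elif start is not None:
            if last - start >= min_seconds:
                stalls.append((start, last))
            start = None
    # Close off a stall that runs to the end of the log.
    if start is not None and last - start >= min_seconds:
        stalls.append((start, last))
    return stalls
```

Feeding it samples taken every 15 seconds, a 2-3 minute stall shows up as a single (start, end) pair, which makes it easy to line the stalls up against logon times or scheduled jobs.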
 
So something was making the server so busy that it wasn't responding to 
anything. Poked around until I realised that there was something very peculiar 
about the network I/O throughput I was seeing with Task Manager. We're used to 
peaks and troughs in activity, but there was a constant baseline activity and 
it never fell to zero. So what was going on?
 
Installed Ethereal and started looking at what was happening. I found that the 
baseline activity was due to 2 workstations on the network that were hammering 
the server. When we looked at the packet capture it was really interesting. The 
packets were MTU-sized SMB packets mostly filled with nulls, so we were looking 
at some sort of malformed SMB request.
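
For what it's worth, once the payloads are exported from a capture, spotting this sort of traffic is easy to script. The sketch below assumes you already have (source address, payload bytes) pairs from whatever capture tool you use; the size, null-fraction and packet-count thresholds are guesses for illustration, not measured values.

```python
from collections import Counter


def null_fraction(payload: bytes) -> float:
    """Fraction of the payload that is zero bytes."""
    if not payload:
        return 0.0
    return payload.count(0) / len(payload)


def suspicious_senders(packets, min_size=1400, min_null_fraction=0.9,
                       min_packets=100):
    """packets: iterable of (source_address, payload_bytes).

    Return {source_address: count} for hosts that sent at least min_packets
    large payloads consisting mostly of nulls."""
    counts = Counter()
    for src, payload in packets:
        if len(payload) >= min_size and null_fraction(payload) >= min_null_fraction:
            counts[src] += 1
    return {src: n for src, n in counts.items() if n >= min_packets}
```

A host that turns up here is only a candidate for a closer look. Plenty of legitimate transfers (sparse files, zeroed disk images) also contain long runs of nulls.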
 
To cut a long story short, the 2 workstations were infected with a virus which 
wasn't being activated until the user logged on. Once the user logged off or 
the workstation was turned off everything started working as designed. If the 
relevant users didn't come in that day or arrived late or left early, the hang 
times would change. Cleaned up the virus and the hangs disappeared.
 
Had a similar scenario where the culprits were 2 workstations where the users 
had set the antivirus package to check network drives. 
 
Other causes can be backups left running during production time or basically 
anything that slows down server responsiveness. If you've got a lot of group 
policies, then group policy processing can also cause server hangs on logon if 
your domain controllers aren't performing well. If the server has just crashed 
and a lot of users are logging back on, it can just lock up.
 
The last scenario I can think of at the moment is where there's a NIC/switch 
port speed mismatch or autonegotiation problem. However, this is generally easy 
to diagnose because if you copy a file to and from the server, you can see a 
huge difference in the copy speeds.
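
The copy test can be scripted if you want numbers rather than a stopwatch. A minimal sketch: `timed_copy` times one copy, and `looks_mismatched` flags a lopsided result. The 5x ratio is an arbitrary threshold I picked for illustration, and file caching on either end can skew the timings, so use a file large enough to swamp the cache.

```python
import os
import shutil
import time


def timed_copy(src, dst):
    """Copy src to dst and return the throughput in bytes per second."""
    size = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return size / elapsed if elapsed > 0 else float("inf")


def looks_mismatched(upload_bps, download_bps, ratio_threshold=5.0):
    """True when one direction is at least ratio_threshold times slower."""
    slow, fast = sorted((upload_bps, download_bps))
    if slow <= 0:
        return True  # one direction moved no data at all
    return fast / slow >= ratio_threshold
```

Run `timed_copy` once from a local file to a share on the server and once back again, then pass both results to `looks_mismatched`.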
 
Hopefully there's something relevant in this rambling ;-)
 
regards,
 
Rick
 
Ulrich Mack 
Volante Systems 


________________________________

From: thin-bounce@xxxxxxxxxxxxx on behalf of Matthew Shrewsbury
Sent: Sat 25/03/2006 8:29
To: thin@xxxxxxxxxxxxx
Subject: [THIN] PS4 Locks up



As you may have noticed, I have been posting a lot as of late. For some strange 
reason, out of the blue, I've been having a lot of issues. One of my two PS4 
servers keeps locking up, although both have locked up, just not at the same time.

 

1) I've tried updating all firmware and drivers.

2) I've installed PSE400W2KR01.msp on the one server it would install on (but 
it locked up about 7 hours later).

3) I've searched event logs but don't see anything obvious.

 

Is there any good method of repairing a Citrix server? I'm just not finding 
anything that points me to a problem. Sunday I'm going to come in and try my 
best to find the problem. I plan to start with low level hardware diagnostics, 
then proceed to virus scans/ boot from PE and look for root kits. Run some 
network sniffing tools. 

 

If everything comes up clean, how should I try and repair my server? Should I 
try and repair PS4 through Add/Remove Programs? I'm running an older version of 
User Profile Hive Cleanup...could this cause lock ups?

 

Matthew Shrewsbury, MCSE+Internet MCSE 2000 CCA Server+

Network Manager

 


