Hi People,

Had a "fascinating" time on a customer site yesterday that I thought
might be worth reporting.

I've known for a long time that back-end or network performance issues
can severely impact TS server performance, and in worst-case
scenarios can result in complete server lockups.

What made things interesting is that this was a previously stable site:
4 TS systems (2003/MF XP FR3) had been running without problems. On the
weekend, the customer ran Windows Update on all their servers, and on
Monday morning the TS systems started hanging once they had been
running for 1-2 hours.

The customer was "fixing" things by rebooting the systems, and finally
called us on Wednesday afternoon. Another customer called a couple of
hours later with hang problems after the latest security hotfixes, so my
immediate reaction was: bloody Microsoft and their lack of regression
testing on security hotfixes.

I spent yesterday afternoon reverting the systems to a pre-hotfix image,
and the hangs continued. Then I found the culprit. The customer's main
file/print server, an IBM NAS system, had lost a disk and was running in
degraded mode. The best file transfer rates I could get in and out of the
file/print server were 1.5-2 MB per second. In fact, while I was copying
a GB of data from the file/print server to one of the TS systems, the TS
system locked up hard. The admin's home drive locked up, and any I/O that
required an SMB transaction just hung.
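For anyone wanting to repeat this kind of check, here is a minimal Python sketch that times a single large copy from the file server to a TS box and reports MB per second. The function name and any paths are hypothetical (this is not a Microsoft or Citrix tool); it simply times shutil.copyfile and divides the file size by the elapsed time. For scale: at 1.5-2 MB per second, a 1 GB (1024 MB) copy takes roughly 512 to 683 seconds, i.e. about 8.5 to 11.5 minutes.

```python
import os
import shutil
import time


def copy_throughput_mb_per_s(src: str, dst: str) -> float:
    """Copy src to dst and return the observed rate in MB/s (1 MB = 2**20 bytes).

    Hypothetical helper: src would typically be a UNC path on the file
    server and dst a path on a local disk of the TS system.
    """
    size = os.path.getsize(src)
    start = time.perf_counter()
    # Blocks until the copy finishes; if the redirector is hung, so is this.
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small or cached files.
    return (size / 2**20) / max(elapsed, 1e-9)
```

Point it at a large file on the share (something like \\fileserver\share\big.bin, a made-up name) with a local destination. Repeat runs can be inflated by caching, so use a file that hasn't been read recently. If a healthy LAN only gives you low single-digit MB/s, look at the back end before blaming the TS boxes.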

However, I was logged on via an ICA session and had the Computer
Management MMC snap-in open. It still worked, and as a test I was able to
run a local disk defrag etc. But anything that used the Workstation
service (redirector) was dead, including the IMA service etc. I tried
unsuccessfully to restart the Workstation service, so I ended up
rebooting the server.

What made this all interesting is that I was able to poke around
inside a "hung" server. I'd always "known" that this sort of hang was
caused by SMB performance issues, but this was the first time I'd been
able to confirm it absolutely. The SMB redirector error handling hasn't
improved from NT 4 to Windows 2000 to 2003. It still sucks ;-).

Anyway, the fix was to get the file/print server back up to speed by
repairing the RAID array.