[THIN] Re: TSCALE or Appsense

  • From: Michael Pardee <pardeemp.list@xxxxxxxxx>
  • To: <thin@xxxxxxxxxxxxx>
  • Date: Thu, 11 Jan 2007 06:46:29 -0500

Reading the section on file serving, it's like you worked here for the last
couple of years.  We redirect Favorites, Desktop, My Documents, and App Data,
and we have gone through everything you mention below and made all of the
same tuning tweaks, etc.  We even had the chkdsk incident where we lost all
of the ACLs, had to grant Everyone change (maybe full) access to keep
production running, and then spent hours and hours fixing it.  Compression
was also a huge performance killer, as admins would use it to address free
space issues.  The problem escalated as the disks got too full and the
defragmenting process became useless.  Adding/expanding LUNs, uncompressing
disks, and defragging had a huge effect on performance.  Backups that took
26 hours to complete for a single LUN now take 15 hours.  Since we use
Windows clustering, I believe there is even a bit more tuning that had to be
done, and it took us many, many calls with Microsoft to get it all ironed out.

The hyperthreading comment is something that I will take back for
investigation, as is applying the security via policy instead of at the file
level to address exactly what we saw with chkdsk.  We are currently
redesigning our back-end file serving to use DFS and Windows 2003 x64.  I'm
anxious to see how it works out.

All great suggestions Rick.  I have openings in Pittsburgh and Phoenix, just
tell me when to have your office ready. ;p



From: Rick Mack <ulrich.mack@xxxxxxxxx>
Reply-To: <thin@xxxxxxxxxxxxx>
Date: Thu, 11 Jan 2007 21:20:40 +1000
To: <thin@xxxxxxxxxxxxx>
Subject: [THIN] Re: TSCALE or Appsense

 
Hi Angela,
 

> Rick, you misunderstood me on the pagefile point.  I was thinking of
> rebooting the servers nightly to refresh the server resources just in case
> the memory is not being freed up fully once the applications close.  We
> don't have the clearpagefileonexit option enabled (I made that mistake
> once).  We currently reboot our servers weekly.  Was interested in seeing if
> it's worth doing it nightly. On this point, do people also reboot their Web
> Interface servers or dedicated Zone Data Collectors or simply the servers
> that farm the apps?
 
Nightly reboots don't hurt if you can fit them in. At least things will
generally be as good as they can be the next day. Unless of course the
servers don't reboot ;-)
 
(1) Is your page file contiguous (initial and maximum size the same)? Check
with PageDefrag (Sysinternals).

Pagefile is contiguous.  4096 MB min and max.  1-2 servers run 2003
Enterprise and have more than 4 GB RAM.  These servers have 6 GB pagefiles.
I know Citrix won't really use more than 4 GB RAM.  Is it best I reduce the
pagefile to 4 GB?
 
You're presumably using the /PAE switch in boot.ini to use memory above 4 GB
(or the servers have hot swap memory or hardware enforced DEP) but the big
problem on 32 bit systems is that as you add more memory you also add to the
kernel memory overhead and you will eventually hit the wall. IBM published a
study about 18 months ago that showed quite nicely that memory over 6-8 GB
actually resulted in less scalability on a server running a lot of
processes. So you end up tweaking the MAXMEM switch to reduce the amount of
RAM seen by the system until things are optimal, with the rest of the memory
wasted :-( 
 
I guess that's where X64 comes in.
 
But back to your question, I really believe that you gain nothing in having
a huge page file (or page file aggregate). On 4 GB systems the working page
file shouldn't be over 3.6 GB max, with a /MAXMEM switch on an alternate
boot.ini entry to reduce memory to 3 GB so we can do a full memory dump if
necessary.
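 
If it helps, here's a sketch of one way to set up that alternate entry with
bootcfg rather than editing boot.ini by hand (the entry IDs and the 3072 MB
figure are placeholders, so check "bootcfg /query" first):
 
    rem Sketch only: copy the existing boot entry, then cap the copy at 3 GB of
    rem RAM so a full memory dump will fit in the page file.
    bootcfg /copy /d "Windows Server 2003 (3 GB, full dump)" /id 1
    bootcfg /addsw /mm 3072 /id 2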
 
I've avoided having more than one pagefile on a physical drive because the
use of 2 page files could cause unnecessary disk thrashing. So what I
used to do was put the pagefile on a partition other than the system
partition to at least reduce directory overhead on the system disk. Of
course that meant you couldn't ever do a full crash dump.
 
However it looks like you can have your cake and eat it too according to
Microsoft technote 197379. If it's correct, you can have 2 page files and
actually use only the one on the least used partition. So you could
presumably do your paging on the least used partition, and have a maximum
sized page file on the system disk to handle a memory dump.
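 
If you want to try that layout, the page files are defined in the PagingFiles
multi-string value; a sketch (the drive letters and 4096 MB sizes are only
examples, and it needs a reboot to take effect):
 
    rem Sketch only: working page file on D:, plus one on the system drive for
    rem dumps. \0 separates the two entries in the REG_MULTI_SZ value.
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" ^
        /v PagingFiles /t REG_MULTI_SZ ^
        /d "d:\pagefile.sys 4096 4096\0c:\pagefile.sys 4096 4096" /f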
 
(2) What have you done by way of minimal system tuning, optimise memory for
applications etc?
I've got a tuning policy template that will help you do this without any
hacking

The farm was initially set up by Citrix so they used some of their own ADMs.
There are a lot of customisations but I'm not sure if they are performance
related.  Were you interested in any particular settings?
 
There are a bunch of tweaks, but if Citrix did the job then I'd suspect
everything that matters will be there. Maybe ;-)
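 
Just to give you a concrete example of the kind of setting I mean (it may or
may not already be in the ADMs Citrix used), the "optimise memory usage for
programs" tweak boils down to a single registry value:
 
    rem Sketch only: favour application working sets over the file system cache,
    rem i.e. the "Memory usage: Programs" option in System Properties. Reboot to apply.
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" ^
        /v LargeSystemCache /t REG_DWORD /d 0 /f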

(3) Do you defrag your system disks on a regular basis?
- running a scheduled batch job to "defrag c:" at 3 AM every day is dead
simple

No.  We don't defrag our disks at all.  Is this something that will make a
noticeable performance difference?  I had a look at a few servers (ie ran
defrag manually) and it said they don't need defragging, so I'll assume this
is OK for now, but I may schedule this if it's best practice.  Do you defrag
your servers daily/weekly/monthly?
 
Defragging does help and it's a good test of your file system structure. All
you have to do is run a scheduled "defrag c:" at 3 AM every day and it'll
ensure the disk will generally perform as well as it can.
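 
The scheduled job really is a one-liner; something along these lines (the
task name is arbitrary, and check "schtasks /create /?" for the exact time
format on your build):
 
    rem Sketch only: daily defrag of the system drive at 3 AM, running as SYSTEM.
    schtasks /create /tn "Nightly defrag" /tr "defrag.exe c:" /sc daily /st 03:00:00 /ru System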

(4) How big are your user's profiles?
- if they  get too big (over 6-8 MB) the extra system overhead from logins
and logouts will really hurt.

Profiles are between 1-2 MB.  We redirect the Application Data, Desktop,
My Pictures and My Documents paths to the users' TS home drive to keep
profiles small.
 
That's just fine. However be aware that depending on the application being
used, redirecting Application Data can sometimes create a huge performance
hit.

(5) have you tuned your back-end servers (file/print, domain controllers) to
increase the network i/o queue size [ie maxmpxct/maxworkitems]
- I've got a tuning policy template for this that you can have.

I haven't tuned the File/Print server or the DC.  Would be interested in
seeing your template.
 
I want to really stress this.
 
Your terminal server farm peak performance is totally dependent on the
performance of your file server.
 
If it's too busy servicing requests, your whole farm will suffer. The
default network i/o request queue size available to a terminal server is way
too small and it has to be increased by raising the lanmanserver maxmpxct and
maxworkitems values on the file server. If you don't, once the number of
pending i/o requests fills the queue, everything will stop and you will see
momentary or even quite lengthy hangs. If the file server is busy enough it
can hang your whole farm. Really.
 
Increasing lanmanworkstation maxcmds on the file server's clients (the
terminal servers) doesn't by itself increase the request queue size, though
I do set it to match the maxmpxct/maxworkitems values.
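 
To give you an idea of the kind of change involved before I send the
template (the numbers below are placeholders for illustration, not
necessarily what my template uses, and the Server service needs a restart
afterwards), on the file server it comes down to something like:
 
    rem Sketch only: raise the number of concurrent client requests the file
    rem server will accept. Tune the figures to your environment.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters" ^
        /v MaxMpxCt /t REG_DWORD /d 2048 /f
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters" ^
        /v MaxWorkItems /t REG_DWORD /d 8192 /f
 
    rem And on the terminal servers (the clients), set MaxCmds to match.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters" ^
        /v MaxCmds /t REG_DWORD /d 2048 /f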
 
Domain controllers are file servers too: they host SYSVOL, group policies,
etc., and group policy processing generates a huge number of small i/o
requests. In a large TS environment, you can see TS server hangs on user
login if the domain controllers aren't tweaked as well.
 
But I'd like to make a few more comments about file servers. As I've
stressed, they are the heart of your TS environment, particularly if you
have a significant amount of folder redirection. Every folder that's
redirected increases the amount of network I/O operations. This isn't about
the data throughput capability of your NICs etc, it's about the ability of
the file server to service i/o requests, get data off disks and send the
data where it's needed.
 
Tuning your file server is the most profound thing you can do to improve
farm performance.
 
(a) Tune the network i/o parameters (maxmpxct/maxworkitems etc) I'll email
you the back-end server tuning template.
(b) defrag your file server data volumes. Get a good defragging product (eg
winternals defrag commander) and use it regularly. Either that or use a unix
system as your file server ;-)
(c) run chkdsk across the data volumes at regular intervals. I've seen farms
grind to a halt because of corrupted security descriptors on a data volume.
(d) don't let your data volumes get more than 80% full
(e) don't run backups during prime time and avoid a too aggressive virus
checker (if I checked the file when I wrote it to disk, why check it again
when I read it, or vice versa?).
(f) Have a good hard think about not using hyperthreading on the cpus on
your fileserver.
 
What would you think if I offered you an add-on for your car that could make
it go 10-25% faster most of the time at no extra cost? And the only catch
was that if you went up a steep enough hill your wheels would fall off.
 
When hyperthreaded CPUs get too busy servicing multiple serial i/o streams
then they start thrashing the shared cache and things get very slow very
quickly.
 
I use hyperthreading on my TS systems, but not on a file server that's going
to be super busy. Because if it hits the wall so do your TS systems. And
it's no fun having a whole farm hang.
 
(g) Don't migrate your file server to VMWare because it's a great way to
make sure your farm goes slower.
 
Special Note: If (c) goes wrong when you run a "chkdsk /f" to fix security
descriptors, you can lose all the security ACLs on your data volume. While
you're desperately looking for your backup and giving everyone access to
everything to buy time to fix things, you might consider adding the ACLs to
folders using group policy. You get self-repairing ACLs and the big plus
is that the ACLs are set in concrete and documented.
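 
On (c), the regular check itself doesn't have to be the risky part. A
read-only pass (no /f) will report problems without touching anything, so
you only reach for chkdsk /f, and your backups, once it actually flags
something. A sketch of the sort of scheduled check I mean (drive letter, log
path and task name are placeholders):
 
    rem Sketch only: weekly read-only consistency check of the data volume.
    rem No /f switch, so nothing on the volume is modified; just review the log.
    schtasks /create /tn "Weekly chkdsk D" /sc weekly /d SUN /st 02:00:00 /ru System ^
        /tr "cmd /c chkdsk d: > c:\chkdsk_d.log"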
 
(6) Is your network/switch port configuration set up properly? If you
take a large file, does it take the same time to copy to and to copy from
another system? Use %systemroot%\Driver Cache\i386\drivers.cab.
- that's an easy one for your network people if the speed isn't the same in
both directions.

According to the networks team they are set up OK.  Copy speed is OK also.
 
Good. You sometimes see some really bizarre performance problems that boil
down to a misconfigured switch port/NIC configuration.
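 
If you ever want to re-check it quickly, a tiny batch file that times the
same large file in both directions will show up a duplex mismatch straight
away (\\fileserver\share is a placeholder, and any file of a few tens of MB
will do as the source):
 
    @echo off
    rem Sketch only: time a push and a pull of the same big file; the two copies
    rem should take roughly the same time if the switch port/NIC are set up right.
    set SRC=%systemroot%\Driver Cache\i386\drivers.cab
    echo Push start: %time%
    copy /y "%SRC%" "\\fileserver\share\test.cab" >nul
    echo Push end:   %time%
    echo Pull start: %time%
    copy /y "\\fileserver\share\test.cab" "%temp%\test.cab" >nul
    echo Pull end:   %time%
    del "\\fileserver\share\test.cab" "%temp%\test.cab"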

(7) My memory is a bit lazy at the moment. Did you mention that the main
apps are browser based? What are you running?

The majority of our published applications are browser based.  Some do use
Java. 
 
That means bloat but what the heck, so does everything else. :-(

I have installed a Smart Array Write cache on one server as a test to see if
it makes a difference before I upgrade all my servers.
 
It will definitely help.

regards,
 
Rick
 

