[haiku-appserver] Re: accelerating app_server

  • From: "Axel Dörfler" <axeld@xxxxxxxxxxxxxxxx>
  • To: haiku-appserver@xxxxxxxxxxxxx
  • Date: Mon, 23 Jul 2007 02:38:41 +0200 CEST

Stephan Assmus <superstippi@xxxxxx> wrote:
> One thing I noticed in my performance comparisons is that our
> client->server communication seems to take too much time. In some
> cases it takes us more than double the amount of time just to figure
> out that we don't need to do anything (disregarding drawing commands
> outside of the current clipping region). Our drawing implementation
> itself is absolutely fast enough, and so is the clipping. But the
> communication overhead is quite large. I have looked at our
> LinkSender implementation, but it looks fine to me. Our
> BLooper::check_lock() also seems to take too much time. I don't know
> why, it looks fast. (check_lock() is called in every drawing
> function.)
> I have a test where I draw 100 individual points using StrokeLine()
> and measure the time between two Sync()s. Running the program on
> ZETA produces these results:
> drawing outside clipping region: 93 µsecs
> with actual drawing: 213 µsecs
> increase: 120 µsecs
> running in the app_server test environment:
> drawing outside clipping region: 205 µsecs
> with actual drawing: 382 µsecs
> increase: 177 µsecs
> ... the increase is just 57 µsecs more for the test environment, and
> that is for drawing into a bitmap and making sure a BView is
> invalidated eventually for every single dot. So the actual drawing
> is not the problem.

Have you tried comparing the two when running in a BDirectWindow?
Anyway, it's nice to compare them this way; at least missing Haiku
kernel optimizations won't skew the results :-)

> On the client side, we are looking at these numbers:
> Dano: 15 µsecs
> test environment: 45 µsecs
> ... to fire off the 100 StrokeLine commands to the server. 20 µsecs
> of our number are just the check_lock() implementation (using
> find_thread(NULL)).

It looks like the BeOS BLooper::check_lock() implementation uses the
fCachedStack member - just like the MultiLocker implementation does.
AFAICT this shouldn't really result in a speedup on x86 machines,
though, only on PPC...
Have you tried calling their version vs. our version directly in a loop
a few hundred thousand times?

> The rest of the additional delay seems to be just our communication
> overhead.

There, it would be interesting to see how much the client writes to the
server over the port; maybe it actually uses shared memory for the
drawing commands.

> So the question is, I guess, does anybody have any ideas on how to
> cut down
> on those times?

It's definitely helpful to pin down the performance hogs more
specifically, i.e. which function exactly takes longer than it should,
and why. In the case above, if find_thread(NULL) actually takes more
time than whatever Dano does here, we should use that hack, too :-)

