[haiku-appserver] Re: drawing thread

  • From: "Rudolf" <drivers.be-hold@xxxxxxxxxxxx>
  • To: haiku-appserver@xxxxxxxxxxxxx
  • Date: Thu, 21 Oct 2004 10:44:18 +0200 CEST

Hi Adi,

>       We use ViewDriver because our instruction clipping code (when in an 
> update you are allowed to draw only in a specified region) is not in 
> place and we use BView::ConstrainClippingRegion(&reg) for that.
Nice!  I didn't even know that such an option existed.. I guess it's 
because I never really wrote an app yet :-/

>       When Gabe finises the clipping code(read DisplayDrive it's done) 
> we'll 
> be able to use the full acceleration an accelerant can give us.
So, you can use DD and AccelerantDriver. Does AccelerantDriver already 
have the engine stuff in now? Or will it do software execution only at 
the beginning? (just curious)

> > -->I am assuming here that these buffers reside on the 
> > Graphicscard.
>       In R1, no.

> > The sentence also implies that both the source and destination of 
> > the 
> > drawing actions are residing on the graphicscards RAM, or acc would 
> > not 
> > be possible (ATM at least).
>       HW cursor is possible. The rest, no. It's no problem, we'll use 
> MMX, 
> SSE. That is until you provide us with drivers that can use pixel 
> shaders. :-))))
OK, so you are indeed confirming here you know you won't be using the 
ACC engine in the driver. You are using MMX and SSE which in my book is 
software drawing, so no acceleration from the cards engine. It's indeed 
the best you can do for now, so sounds OK to me :)

> > OK, here's my 'warning':
> > While you are correct in saying you can draw 'parallel' in those 
> > regions inside the buffer, there may be important performance 
> > penalties 
> > if you don't serialize the access after all. The 'burst mode' of 
> > writing across the PCI/AGP bus depends on serialized access (at 
> > least 
> > beyond those say 32bytes blocks, within these blocks it doesn't 
> > matter). Burst mode (fully automatically generated by the system's 
> > hardware) works in PCI mode, and in AGP mode. In AGP mode bursts 
> > are 
> > the ones being accelerated with the 'fastwrites' feature.
>       How do we serialize access?

Locking, semaphores. Only one thread draws in the graphicsRAM at a 
time. This ensures that you have the best chance that its memory is 
accessed in a more or less serial way. Of course, in the end it depends 
on what every thread is doing. If you are updating some piece of 
background window because it just became foreground (some other window 
moved away), you will be doing it serially as well: every line of 
screen is being updated (filled) from left to right (or vice versa). 
You only make jumps in memory if you reach the end of such an updated 
part of a line: the jump will be "bytes_per_row". After this serial 
access again happens.
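
A minimal sketch of the access pattern I mean, assuming a plain mutex as 
the lock. All names (FrameBuffer, Rect, FillRectSerialized) are 
hypothetical and a std::vector stands in for the mapped graphicsRAM; it 
only shows the ordering of the writes, not real driver code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a mapped frame buffer. A real one would be
// handed out by the accelerant, not allocated as a std::vector.
struct FrameBuffer {
    std::vector<uint8_t> bits;
    std::size_t bytes_per_row = 0;
    std::size_t bytes_per_pixel = 0;
    std::mutex lock;  // only one thread draws in the buffer at a time
};

// Pixel coordinates; right and bottom are exclusive.
struct Rect { std::size_t left, top, right, bottom; };

// Fill a rectangle line by line, left to right, while holding the lock.
void FillRectSerialized(FrameBuffer& fb, const Rect& r, uint8_t value)
{
    std::lock_guard<std::mutex> guard(fb.lock);

    assert(r.left <= r.right && r.top <= r.bottom);
    assert(r.right * fb.bytes_per_pixel <= fb.bytes_per_row);
    assert(r.bottom * fb.bytes_per_row <= fb.bits.size());

    const std::size_t spanBytes = (r.right - r.left) * fb.bytes_per_pixel;
    uint8_t* row = fb.bits.data() + r.top * fb.bytes_per_row
                   + r.left * fb.bytes_per_pixel;

    for (std::size_t y = r.top; y < r.bottom; ++y) {
        std::memset(row, value, spanBytes);  // sequential writes in the line
        row += fb.bytes_per_row;             // the only jump: bytes_per_row
    }
}
```

Whether the bus really bursts on such writes is up to the hardware; the 
sketch only keeps the writes in address order with one writer at a time.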

So this kind of doublebuffering would be fast (bursts, FW) and sounds 
like a good plan (the double-buffering 'source' bitmaps are in main 
mem), provided locking is done. But you could of course just benchmark 
both options for yourself and see if I am correct ;-)

If you _don't_ use doublebuffering, then suddenly you can't 'guarantee' 
every thread will do serialized access optimally: now the app's 
behaviour will determine this. Video will of course be 'serialized' in 
the app, but if some figure is to be drawn onscreen (a transparent box 
or whatever), then this can't be done in a serial fashion (because the 
content of the box is not touched).

>       If we're doing a blit(copy line by line) from main mem to on-screen 
> video mem, won't the HW generated burst step into action?
Yep. OK, you could call this HW acceleration also. In my book however, 
I would not call it that way. I guess I still have to read your 
app_server guys' book of defines yet.. :-/ (or vice versa of course ;-)

I wouldn't even know what to call it per se, but something like 'bus 
acceleration' as a name for it would be much better here..

Anyway: Indeed, burst and FW will be optimal as stated above 8-)
It's a good idea to have (at least) 32bit word alignment in place for 
the source bitmaps in main mem BTW.
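
As a small illustration of the alignment point, here is one way to pad 
the row length of a source bitmap to a 32bit boundary. The function name 
is hypothetical and this is just the usual round-up-to-power-of-two 
trick, not something taken from app_server:

```cpp
#include <cassert>
#include <cstddef>

// Round a row length up to the next multiple of `alignment` bytes.
// `alignment` must be a power of two; 4 gives 32bit word alignment.
inline std::size_t AlignedBytesPerRow(std::size_t widthPixels,
                                      std::size_t bytesPerPixel,
                                      std::size_t alignment = 4)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::size_t raw = widthPixels * bytesPerPixel;
    return (raw + alignment - 1) & ~(alignment - 1);
}
```

For example a 101 pixel wide 8bit bitmap would get 104 bytes per row, 
while a 640 pixel wide 32bit bitmap stays at 2560. The buffer's start 
address needs the same alignment for this to help.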

> > If indeed both the source and destination of the drawing action 
> > reside 
> > in graphicsRAM, this performance issue only exists if you use non-
> > accelerated drawing actions, because the CPU would need to read the 
> > data across the PCI/AGP bus, and then write it back, while the 
> > acceleration engine keeps it local on the graphicscard RAM.
>       I don't think we'll be using videoMem that way, and I'm speaking 
> about 
> R2 here. R3, who knows, if we can use those pixel shader units... :-) 

But, we have to talk about defines again. What do you mean by pixel 
shader units? They have nothing to do with it (I can tell you that 
without knowing exactly what they are... :)
The only thing we need to get 2D acc from main mem is me instructing 
the engine to fetch the source from main mem instead of from local 
graphicsmem (on Matrox one single flag, plus addressing of course!). The 
engine instructions themselves are identical (in theory) to what they 
are now.

This is a step I am going to investigate at some point, hopefully in 
the not (too) distant future. The main problem here is to get cache 
coherency working OK I expect. There's a second step also (GART and 
aperture), but that only improves speed if you are going to fetch 
multiple bitmaps 'simultaneously' (so that's one reason why normally 
this stuff is only used for 3D acc: the second reason is stability). If 
you want to use 2D acc later on, you should consider real AGP transfers 
and so acc itself (by letting engine fetch from main mem) as being an 
option. You should take into account that only FW is used, and in some 
cases, even only standard PCI. (of course we'll have to see what PCIe 
brings us as well..)

If you mean using real 3D for the desktop already (by mentioning pixel 
shader units), then of course, AGP transfers are much less an option or 
speed would probably get unworkably low. Although I guess it would 
still be a good plan to let it be usable without 3D acc and acc engine 
fetches from main mem (software mode via MESA without 3D acc drivers).

(Quake2 on my laptop runs at 3.9fps already with MESA6.1, so in openGL 
mode (compared to those 92FPS in internal software rendering mode I 
once talked about))

> After the talks we've had, I thought good and wanted to ask your 
> opinion 
> on this:
>       Do you think it's good to draw(with the CPU) in vidMem off-screen 
> surfaces(I mean double buffering in videoMem)? I say no because of 
> the 
> PCI bus; writing _and_ reading is expensive. 
I say no too, because of just this reason.

>I think, a better solution 
> is to do triple buffering. Have an off-screen surface in mainMem into 
> which a window will draw. When drawing it's done, blit this in an 
> _off-screen surface in video memory_, validate a flag it's OK to use 
> that surface, and when a portion of that window is needed we use the 
> 2D(/3D) engine to blit on-screen.
>       What do you think?
Agreed. Although I did not mention this setup, it did cross my mind :)
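
To check we mean the same thing, a rough sketch of that flow. 
Everything here is hypothetical: std::vector stands in for both main mem 
and card RAM, memcpy stands in for the engine blit, and the window is 
treated as one linear run of bytes at the same offsets as the screen:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct WindowSurfaces {
    std::vector<uint8_t> mainMem;       // buffer 1: window draws here (CPU)
    std::vector<uint8_t> vidOffscreen;  // buffer 2: copy in card RAM
    bool vidValid = false;              // true once buffer 2 matches buffer 1

    explicit WindowSurfaces(std::size_t bytes)
        : mainMem(bytes, 0), vidOffscreen(bytes, 0) {}
};

// The window draws in main mem only; the card copy is stale from now on.
void Draw(WindowSurfaces& w, uint8_t colour)
{
    w.vidValid = false;
    std::memset(w.mainMem.data(), colour, w.mainMem.size());
}

// Drawing is done: one serial write pass across the bus, then set the flag.
void Upload(WindowSurfaces& w)
{
    std::memcpy(w.vidOffscreen.data(), w.mainMem.data(), w.mainMem.size());
    w.vidValid = true;
}

// Part of the window is needed on-screen: card-local copy (the engine blit
// in a real driver). Returns false if the off-screen surface is not valid.
bool Expose(const WindowSurfaces& w, std::vector<uint8_t>& screen,
            std::size_t offset, std::size_t length)
{
    if (!w.vidValid)
        return false;
    assert(offset + length <= w.vidOffscreen.size());
    assert(offset + length <= screen.size());
    std::memcpy(screen.data() + offset, w.vidOffscreen.data() + offset, length);
    return true;
}
```

The point of the flag is that the bus is only crossed once per finished 
drawing pass, and never read from; everything after that stays on the 
card.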

Of course, in the end there may be more to consider. Like, you are 
running 3D apps in a window at the same time. Card memory you use for 
triplebuffering the desktop is no longer available for the 3D app, 
lowering its speed (because it needs to fetch stuff from main mem more 
frequently: assuming the bus stays a bottleneck).

BTW: Talking about defines again: I find it interesting to see you talk 
about blits when you actually mean copying. In my book the word 'blit' 
is reserved for acc engine mem copying (so acceleration). I mean, I 
don't want 'my book' to be necessarily right, but this _is_ a potential 
problem: we could misunderstand each other easily by not having these 
terms defined clearly...


Hey, I still miss something else that's interesting as well I think. I 
understand what you mean by double/triple buffering, and I understand 
why you want to do that. But still, I am wondering about another of 
BeOS's shortcomings: updating the screen in such a way that you 
won't see distortions if you drag a window (for instance). I mean: 
tearing. Some people talk about double buffering in this context: 
having a copy of the entire screen in cardRAM that is switched between 
during retraces. This of course requires both buffers to be updated 
with everything, and you can talk and think a lot about strategies to 
get this done with minimal overhead.
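
One possible strategy, purely as a sketch: draw into the hidden buffer, 
switch during retrace, then copy only the changed span into the new 
hidden buffer so both stay in sync. The names are hypothetical, the 
screen is simplified to a linear run of bytes, and on real hardware the 
switch would be a change of the display start address:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct FlippingScreen {
    std::vector<uint8_t> buffers[2];  // both would live in cardRAM
    int front = 0;                    // index of the buffer being displayed
    std::size_t dirtyBegin = 0;       // span changed in the back buffer
    std::size_t dirtyEnd = 0;         // since the last flip (end exclusive)

    explicit FlippingScreen(std::size_t bytes)
    {
        buffers[0].assign(bytes, 0);
        buffers[1].assign(bytes, 0);
    }
    std::vector<uint8_t>& Back() { return buffers[1 - front]; }
    const std::vector<uint8_t>& Front() const { return buffers[front]; }
};

// Draw into the hidden buffer only and remember what was touched.
void DrawSpan(FlippingScreen& s, std::size_t begin, std::size_t end,
              uint8_t value)
{
    assert(begin <= end && end <= s.Back().size());
    std::memset(s.Back().data() + begin, value, end - begin);
    if (s.dirtyBegin == s.dirtyEnd) {
        s.dirtyBegin = begin;
        s.dirtyEnd = end;
    } else {
        s.dirtyBegin = std::min(s.dirtyBegin, begin);
        s.dirtyEnd = std::max(s.dirtyEnd, end);
    }
}

// Call during retrace: show the finished buffer, then bring the new back
// buffer up to date with a card-local copy of the dirty span.
void FlipAtRetrace(FlippingScreen& s)
{
    s.front = 1 - s.front;
    std::memcpy(s.Back().data() + s.dirtyBegin,
                s.Front().data() + s.dirtyBegin,
                s.dirtyEnd - s.dirtyBegin);
    s.dirtyBegin = s.dirtyEnd = 0;
}
```

Only the switch itself has to fall inside the retrace here; the catch-up 
copy can run afterwards, as long as it finishes before new drawing 
starts in the back buffer.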

But the goal it serves would be a perfect, undistorted screen output at 
all times, which would be nice to have as well at some point.. (just 
updating the single screenbuffer during retrace is much too expensive I 
guess, as this leaves very little time to update the buffer: the 
retrace is a relatively short piece of time compared to 'full-time 
acc', as is used now.)


