[waug] Obstacles to Browser and system performance

From: incerti auctoris <incertia@xxxxxxxxxxx>
To: <waug@xxxxxxxxxxxxx>
Date: Mon, 17 Dec 2007 17:23:39 +0000
In reference to questioning between Robin (me), Ian, Moss, and Derek at
the last meeting:

(Please correct any mistakes/misquotes I may have made!)


Browsers - Why do slow PCs have a fast Browser,
           while fast RO machines have a slow one?

           .- .------------------------------.
 Text      :  | http etc low-level protocols |
 Browsers -:  |------------------------------|
 only      :  | IMG        | HTML, CSS,      |
           :  | decode     | javascript, etc |
           '- |------------------------------|
              | DOM                          |
              |------------------------------|
 Fig. 1       | Render (Ordering then        |
              | displaying little boxes)     |
              '------------------------------'


BROWSER COMPONENTS

Fig. 1 shows the basic components of a www Browser. The bottom two
parts are not used by text-only browsers (mentioned as a fast option).

I know there is no problem with the low-level protocols, and that they
are now built into RO and easier to configure than MsWindows (as if
anything could be more obscure).

Anything more than simple HTML without the CSS/javascript requires the
abstraction of a DOM (Document Object Model), prior to rendering, and
possibly afterward for interaction.

The final display is mildly complicated by the need to sense active
regions so that what is clicked/selected can be related to hyperlinks
and/or CSS/js rollovers.
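As a rough illustration of what "sensing active regions" involves, here is a minimal hit-testing sketch. The `Region` structure and `hit_test` function are my own hypothetical names, not taken from any existing browser; it assumes active regions are plain rectangles and that later entries in the list are painted on top.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """A rectangular active area on the page (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int
    target: str  # e.g. the URL a hyperlink points to

    def contains(self, px: int, py: int) -> bool:
        # Half-open ranges, so adjacent regions do not both claim an edge pixel.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def hit_test(regions: List[Region], px: int, py: int) -> Optional[str]:
    """Return the target of the topmost region under the pointer, if any."""
    # Assumption: later regions are drawn on top, so search from the end.
    for region in reversed(regions):
        if region.contains(px, py):
            return region.target
    return None
```

A linear search like this is enough for a page with a few hundred links; a spatial index would only matter for much larger pages.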

The rendering splits into two parts:

 1. Working out, back and forth, the widths of the "little boxes", then
    sorting them into order. Boxes range from bottom-level primitives
    (individual characters and graphic images), through containers
    (DIV and similar), to the top-level primitive (the BODY element).

 2. Once we know where all the "little boxes" are, painting them
    on the screen.

Point 1 is fairly involved, particularly when calculating nested TABLEs,
but it is essentially a recursive mathematical operation, and shouldn't
take long.
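To show what I mean by "recursive" in point 1, here is a minimal sketch. It is my own simplification and not how any real browser lays out a page: it assumes every primitive already knows its width and height, places children left to right, and wraps to a new line when the available width is exceeded. The `Box` class and `layout` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """One "little box": a primitive with a known size, or a container."""
    width: int = 0
    height: int = 0
    children: List["Box"] = field(default_factory=list)
    x: int = 0  # position relative to the parent box
    y: int = 0


def layout(box: Box, available_width: int) -> None:
    """Size a box recursively and position its children inside it."""
    if not box.children:
        return  # primitive: its width and height are already known
    cursor_x = cursor_y = line_height = used_width = 0
    for child in box.children:
        layout(child, available_width)  # recurse first so sizes bubble up
        if cursor_x > 0 and cursor_x + child.width > available_width:
            cursor_x = 0  # child does not fit: start a new line
            cursor_y += line_height
            line_height = 0
        child.x, child.y = cursor_x, cursor_y
        cursor_x += child.width
        used_width = max(used_width, cursor_x)
        line_height = max(line_height, child.height)
    box.width = used_width
    box.height = cursor_y + line_height
```

Nested TABLEs need more than one pass because column widths depend on cell contents and vice versa, which this sketch leaves out entirely.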

Point 2 seems to work very well in other existing applications, eg DTP,
often with display requirements exceeding the complexity of common web
page layouts.

When split from point 1, I don't see what stops point 2 being
implemented efficiently.

I don't think there is much difficulty decoding IMG formats into
(cached) sprites, from eg GIF, PNG, JPEG, BMP, and to a lesser extent,
SVG. While this takes appreciable time on earlier machines, recent
technology seems to handle it at imperceptible speed.
I think there could be a speed saving, nonetheless, by using a hardware
solution for this in, eg, a GPU or other sub-computing component, like
an FEP (Front-End Processor) or similar (although it would be
"middle/side-end processing").
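The "(cached) sprites" idea can be sketched as follows. This is only an illustration under my own assumptions: `decode_image` is a stand-in that does no real GIF/PNG/JPEG parsing, and `sprite_for` is a hypothetical name. The point is simply that each distinct image is decoded once and then reused.

```python
from functools import lru_cache


def decode_image(data: bytes) -> tuple:
    """Stand-in decoder (hypothetical): a real one would parse the format."""
    return tuple(data)  # pretend the raw bytes are already pixel values


@lru_cache(maxsize=128)
def sprite_for(data: bytes) -> tuple:
    """Return the decoded sprite, decoding only on the first request."""
    return decode_image(data)
```

Calling `sprite_for` twice with the same bytes runs the decoder only once; `sprite_for.cache_info()` reports the hits and misses.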


TECHNICAL OPINION

The above covers my opening question. Below are Ian's and Moss's initial
replies before we ran out of time:

Issues raised were:

 * Image Translation
 * Memory Paging
 * Bus Feed
 * No Native FPU


MEMORY PAGING (Moss & Ian)

(We got the farthest on this point, and didn't have time to define the
other points.)

I asked, particularly in relation to Browser performance, how system
efficiency could be raised by running an all-static-RAM system, and what
obstacles there are to further optimisations.

This relates to what was formerly controlled by the MEMC (Memory
Controller), and is now presumably integrated into the ARM chip or stuck
on the side with glue logic(?)

Moss believed the most serious limitation to be the restriction to
accessing 26b (4Mb) pages of RAM at a time, which required the address
offset to be looked up in the page memory, by the MEMC or equivalent
(hereafter referred to as "MMU" (Memory Management Unit) for simplicity).

The cycles needed for a memory access would normally be:
 1. Page lookup (SRAM)
 2. Main DRAM Read/Write
 3. Pause for DRAM refresh cycle
Screen display scan cycles are presumably handled by the GPU now(?)

Using SRAM (Static RAM) in place of DRAM (Dynamic RAM) for the main
store would knock out (3), but the CPU would still be left twiddling
its THUMBs (sorry) for one of the two cycles remaining.
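The counting above can be put as a toy model. It assumes each of the three steps costs exactly one cycle, which is my own simplification (real timings differ, and a refresh does not stall every access); the function names are hypothetical.

```python
def access_cycles(page_lookup: bool = True, dram_refresh: bool = True) -> int:
    """Cycles per memory access, assuming one cycle per step (a simplification)."""
    cycles = 1  # the main store read/write itself
    if page_lookup:
        cycles += 1  # step 1: page lookup
    if dram_refresh:
        cycles += 1  # step 3: pause for DRAM refresh
    return cycles


def idle_fraction(cycles: int) -> float:
    """Share of the access during which the CPU is waiting rather than working."""
    return (cycles - 1) / cycles
```

With these assumptions, `access_cycles(dram_refresh=False)` models the all-SRAM case, and removing the page lookup as well would bring the wait down to nothing.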

The points I wanted to ask about this were:

If the MMU can't simply be replaced by an Adder because the contents of
its page memory are defined by hidden circuitry, couldn't either the
hidden circuitry be duplicated and read into the adder, or a consistent
"fake" number be generated for each real lookup? If the fakes were
consistent (ie generated through logic from the lookup value), shouldn't
that work just as well?
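To make the question concrete, here is a sketch comparing a page-table lookup with a plain adder. It is not a description of how the MEMC actually works: the page size, the table layout and all the function names are assumptions of mine for illustration only.

```python
from typing import Dict, Optional

PAGE_SIZE = 4096  # hypothetical page size in bytes, chosen only for illustration


def translate_with_table(address: int, page_table: Dict[int, int]) -> int:
    """Look up the physical page for a logical address, keeping the offset."""
    page, offset = divmod(address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


def translate_with_adder(address: int, base: int) -> int:
    """The Adder alternative: every address is shifted by the same amount."""
    return address + base


def adder_equivalent(page_table: Dict[int, int]) -> Optional[int]:
    """Return the constant an adder would need, or None if no single one works."""
    shifts = {physical - logical for logical, physical in page_table.items()}
    return shifts.pop() * PAGE_SIZE if len(shifts) == 1 else None
```

In this model the adder gives the same answers as the table only when every page is shifted by the same amount; once pages are scattered, `adder_equivalent` returns `None` and some form of lookup is unavoidable.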

And how are the page values calculated, anyhow?

Alternatively, if we made do with just 4Mb main store (for the sake of
argument), we wouldn't need any page switching, and this could be
switched off, or with just 8Mb main store, it would be easy to guess
(normal=0, other=1)?

Further, if we use a 32b ARM, wouldn't the page length also be increased
to 32b (128Mb)?
-Then we could run "frugally" with 128Mb (or 256Mb with "guess logic")?
-I know the IOP/XScale used in the Iyonix are all 32b; have the
ARM7/9/11/Cortex followed suit? (See also last point, below.)
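As a check on the sizes quoted above, the reach of an n-bit address can be worked out directly, assuming each address selects one byte (function names are my own). Counted this way, 26 bits reach 64Mb and 32 bits reach 4096Mb, while 4Mb and 128Mb correspond to 22 and 27 bits, so the figures I noted down may refer to something other than the raw address width (corrections welcome).

```python
def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with an address of the given width (one byte per address)."""
    return 2 ** address_bits


def as_megabytes(byte_count: int) -> float:
    """Convert a byte count to megabytes of 1024 * 1024 bytes."""
    return byte_count / (1024 * 1024)


for bits in (22, 26, 27, 32):
    print(bits, "bits ->", as_megabytes(addressable_bytes(bits)), "Mb")
```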

ARM has produced a chip with an ARM9, possibly an onboard GPU, and 128Mb
RAM (probably SRAM). Could this help?

The IC code is "339S0030 ARM", manufactured by Samsung.
This may be Samsung's S3C6400 or S3C2460 based on the ARM1176 core.
The IC also has a Samsung part number K4X1G153PC-XGC3, which indicates a
1Gbit memory device, ie 128Mb.
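The conversion from the part number's 1Gbit to 128Mb is easy to verify, assuming 1Gbit means 1024 megabits and 8 bits per byte (the function name is my own):

```python
def gbit_to_mbyte(gbit: int) -> int:
    """Convert gigabits to megabytes, taking 1 Gbit as 1024 Mbit."""
    return gbit * 1024 // 8


print(gbit_to_mbyte(1))
```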

The above data is from various sources, including some manually read off
photographs, but I think the 128Mb is an interesting coincidence, and I
further wonder if the 16kb ARM Instruction/Data caches are redundant
with this on-chip SRAM?


IMAGE TRANSLATION (Moss)

I don't understand what this is: Does it concern decoding of
GIF/JPEG/PNGs, general shifting around of large blocks of data (perhaps
related to Bus Feed), or some other hardware or OS limitation?


BUS FEED (Ian)

This could mean several things also:
Does it relate to the current main system bus used (PCI on Iyonix,
probably A9 too)?
Is it more about the speed limitations of DRAM as opposed to SRAM?
Is there some other hardware bottleneck concerned with the interface
between processor and RAM/Cards, the PCI bus driver, the weird
transmission-line phenomena of signals on the PCI bus having to "bounce"
off the end of the line before they can be handled (surely fixable with
a higher speed PCI version(?)), or is there some other hardware/OS
limitation?

Does the A9, not being internally expandable, have to remain enslaved to
a PCI bus, or is using a proprietary one more trouble than it's worth?


NO NATIVE FPU (Moss)

I thought there was an FPU (Floating Point Unit) available as a
co-processor?
What do you mean by "native"?
Is the existing FPU not built into ARM7/9/etc?
Does it not execute code fast enough to keep up?
Is it not a RISC/ARM processor?
What is the performance problem caused by not having it (whatever
it is)?


 - * -

ARM VERSIONS

This is a separate-but-related point that I didn't get around to
raising:

Are there any barriers to running RO on an ARM9, as opposed to on the
ARM7 (A9) or IOP/XScale (Iyonix)?

This came up when I asked R-Comp at Guildford why smartphones use the
ARM9 series, while our desktop machines seem to be stuck at the ARM7
level. (!?)

If so, how are they proposed to be overcome on the planned Cortex
multicore package, or will the Cortex be using four older cores for
smaller size?


Yours Sincerely,
Robin Hodson (Mr.)
Tel. 07811 550086
http://freedom.is/better/blog
