[interfacekit] Re: Flat Message Format
- From: Ingo Weinhold <bonefish@xxxxxxxxxxxxxxx>
- To: haiku-commits@xxxxxxxxxxxxxxxx
- Date: Sun, 24 Jul 2005 20:09:28 +0200
I've cross-posted my reply to the IK team list which is probably the better
place, if this thread is going to be longer.
On 2005-07-23 at 18:16:28 [+0200], mmlr@xxxxxxxx wrote:
[...]
> I though about the "flat data storage" and came to the following
> conclusion:
> To read things from the message it will work quite well, as we just
> access the flat buffer at some locations. We could have a helper class
> that manages the offsets into the buffer.
> Also for flattening / unflattening this would be absolutely ideal, as we
> essentialy (besides adding the message header) would not have to do
> anything.
Yep.
> But the problem lies much more in writing to this flat buffer. If you
> wanted to add an item to a message field you would have to grow the
> buffer, move the following items and copy the data in. If we have a
> helper class that manages the offsets into the buffer we would also
> have to update these offsets for all following items. If you think
> about Archive() again we would grow the buffer very often and the time
> it takes to grow would increase with the message size. Also adding data
> probably happens more often than flattening / unflattening.
This mainly depends on how you store the data exactly. E.g. if you waste a
bit of memory by reallocating in a clever way, you'll have to move data
less often.
> Also a problem comes to mind if you look at these fancy message flags.
> If we really wanted to care about these we would have to "rewrite" the
> flat buffer whenever we go from 1 -> multiple and the other way around
> or when we go from "mini data" that holds the size / count as one byte
> to "big data" that holds size and count as 4 bytes. This does
> complicate the task to find the offset of a given item a bit and could
> lead to having to grow the overall buffer again.
IMHO, we should drop those. Memory has become much cheaper since the day
BMessages were invented. Wasting a few bytes per field really doesn't
matter that much. If we later find it does, we can still add a special
handling.
> As a possible solution to the "grow time" I thought about a "chunked"
> buffer so that if you wanted to grow item 3 of 100 you would only grow
> the chunk that holds item 3. But with this we would obviously have to
> "collect" the chunks again for flatten. And also if we set the chunk
> size relativly small we will end up with about the same as we have now
> (a buffer per item or a buffer per field).
>
> The task here is to find a good balance between both sides. I'll try to
> optimize for the most usual cases here. I guess in the next version
> I'll end up with a chunked flat buffer.
The big disadvantage is, that you don't have a single flat buffer anymore.
While chunks should be quite OK for writev_port(), they're not so nice when
it comes to using shared memory for sending larger messages.
Let me propose something for a single flat buffer format. It's merely a
brain-storming right now, but, I think, it could be a basis to address the
requirements we have for the common cases, while still being acceptable for
less common cases.
I believe, the most common use case is sending of simple messages: The
sender adds a few message fields and sends the message, the receiver
receives it and extracts the relevant data via the Find*() methods. In
fewer cases the receiver manipulates the message (adding/removing fields,
replacing data) and forwards it.
Archiving/unarchiving should be similar to the simple sending/receiving
case, maybe with the exception that the amount of data may be significantly
greater (especially when object hierarchies are archived).
A rarer use (at least IMO) have BMessages as random-access data containers.
They are known to perform significantly worse than containers (at least
regarding write operations) that are optimized for those cases (e.g. STL
containers) and are probably only used when performance doesn't matter, for
instance as container for settings.
My conclusion is, that we should concentrate on AddData() and FindData()
performance, and make compromises regarding Replace/RemoveData(), though
rather on the side of memory usage than on performance.
Here's the format I'd propose:
==============
|message header|
|==============|
| field table |
| ----------- |
|field 1 offset|
|field 2 offset|
| ... |
|field n offset|
| [space] |
|==============|
| field i1 |
|==============|
| [space] |
|==============|
| field i2 |
|==============|
| [space] |
|==============|
| ... |
|==============|
| field in |
|==============|
| [space] |
==============
The field (offset) table refers to the fields in order of their addition.
Its capacity is increased when needed (*2 policy for instance) initially
starting with say 10 slots, so that it will rarely ever be resized for
normal messaging. The fields including their data form contiguous chunks --
internal pointers being field relative offsets -- so that they can be
moved en bloc without the need for relocation. Their order doesn't need to
match that in the field table; the top level allocator is free to move them
around as necessary: Adding a field value to field i1 may increase its size
beyond the space available where it is located, so that the allocator may
decide to simply move it to the end, where is enough space (or can be made
by resizing the whole buffer). Or the allocator could move field i2 out of
the way. Actually even the field table could be treated like a field in
this respect, as long as the message header contains an offset to it.
The internal structure of a non-fixed-size field looks quite similar:
==============
| field header |
|==============|
| entry table |
| ----------- |
|entry 1 offset|
|entry 2 offset|
| ... |
|entry n offset|
| space |
|==============|
| entry i1 |
|==============|
| [space] |
|==============|
| entry i2 |
|==============|
| [space] |
|==============|
| ... |
|==============|
| [space] |
|==============|
| entry in |
==============
Each entry (data) starts with the data length. For fixed-size fields the
entry table and the entries are replaced by a simple data array.
I'd probably grant no free space for a newly added field's entry table (nor
space after the field). Only when a second entry is added, I'd start being
more generous. Initially there shouldn't be any spacing between the
entries. Remove/ReplaceData() could cause fragmentation though. To keep it
in check, the allocator can do compacting when a certain threshold is
reached.
I guess, it wouldn't hurt to also track a bit of statistical data per field
to aid the allocator. E.g. an often edited field should probably be moved
to the end of the message. And two fields to which data is added in turn
could be granted a greater capacity growth rate. Though, I suppose one
should check, how well things work in practice, first.
I didn't mention the field name hash table yet. It could either be added to
the data buffer itself or kept in the BMessage but stored separately. It
doesn't take much space, but it could as well be created when needed -- for
messages with only 4 or 5 fields, it probably doesn't speed up anything.
> > But maybe we could have a support class that flattens data
> > chunks to a BDataIO in BMessage format or a memory buffer
>
> This, while being a nice addition, would obviously kill the whole thing
> about the flat buffer. If you mixed a user given buffer with the
> internal storage without copying it, it would not be possible to have
> one flat buffer anymore. On the other hand this kind of message would
> probably be used for storage only and not to send around, which would
> make a flatten / unflatten optimization unnecessary.
While something like that would indeed be nice, please let's focus on the
basic part and postpone this till R2.
CU, Ingo
Other related posts:
- » [interfacekit] Re: Flat Message Format