[haiku-development] Re: Proposal and questions: Support of BFS attributes

  • From: "Alexander G. M. Smith" <agmsmith@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Tue, 23 Oct 2012 10:44:39 -0400 EDT

Siarzhuk Zharski <zharik@xxxxxx> wrote on Mon, 22 Oct 2012 23:01:19 +0200:
> For the last week I have been implementing BFS attributes support in the 
> gnutar and libarchive bsdtar applications. Below are my suggestions and 
> possible solutions for this problem.

Great!  It will be nice to have an archive format that can store
attributes and be larger than the 2GB limit of Zip.

> It was decided to use the tar format extensions introduced in POSIX.1-2001. 
> Those extensions are presented as a list of strings in a special (so-called 
> PaxHeader) header before the standard header of every file. Every 
> string that corresponds to one BFS attribute has the following format and 
> should be UTF-8 encoded:
> 
> <length> <key>=<type> <data><CR>

http://www.mkssoftware.com/docs/man4/pax.4.asp

There's a 255 character limit on file names; what's the BeOS/Haiku
limit?  Ah, it's 255 too, so no problem there.  I assume paths to a file
can be stored somehow without having to cram the whole path name into
255 characters (since GNU tar seems to work with this standard).
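
Something like this is how I understand the long names get stored - a
sketch of building a single pax record (assuming the usual
"<length> <keyword>=<value>\n" layout, where the decimal length counts
the whole record including itself); the attribute records you describe
would use the same layout:

// Sketch only - double-check the length rule against what gnutar
// actually writes.
#include <cstdio>
#include <string>

static std::string
MakePaxRecord(const std::string& keyword, const std::string& value)
{
	std::string payload = " " + keyword + "=" + value + "\n";

	// The length prefix counts its own digits too, so adjust until stable.
	size_t digits = std::to_string(payload.size()).size();
	while (std::to_string(payload.size() + digits).size() != digits)
		digits = std::to_string(payload.size() + digits).size();

	return std::to_string(payload.size() + digits) + payload;
}

int main()
{
	// A path too long for the classic header fields simply becomes a
	// 'path' record in the PaxHeader block.
	std::printf("%s", MakePaxRecord("path",
		"some/quite/deep/directory/tree/with a long file name.txt").c_str());
	return 0;
}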

11 octal digits for the file length, so the maximum is a 33 bit number,
or 8GB.  Though I see there's a binary hack workaround described in
http://en.wikipedia.org/wiki/Tar_(file_format) so it can handle
BeOS/Haiku files with 64 bit sizes.
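
For reference, that workaround looks simple enough to generate
directly.  A rough sketch of the size field encoding, assuming the GNU
base-256 scheme from that Wikipedia page (high bit set on the first
byte, then the value stored big-endian in the rest of the 12 byte
field):

#include <cstdint>
#include <cstdio>
#include <cstring>

static void
EncodeTarSize(uint64_t size, char field[12])
{
	if (size < (1ULL << 33)) {
		// Classic format: 11 octal digits plus a terminating NUL.
		std::snprintf(field, 12, "%011llo", (unsigned long long)size);
	} else {
		// GNU base-256 extension: flag byte 0x80, then big-endian binary.
		std::memset(field, 0, 12);
		field[0] = (char)0x80;
		uint64_t value = size;
		for (int i = 11; i >= 4; i--) {
			field[i] = (char)(value & 0xFF);
			value >>= 8;
		}
	}
}

Decoding just reverses it: if the high bit of the first byte is set,
read the rest of the field as a big-endian binary number instead of
octal text.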

But doesn't each extended header have to fit into one 512 byte block?
That kind of makes it not too useful for bigger things like icon
attributes or some of the longer e-mail attributes (like CC address
lists).  In BeOS, an attribute can be as big as a file.  Though I see
you later mention a separate attributes file to store the bigger ones.

> 1) At the moment the attribute type is stored as a raw uint32 value. You 
> should already know that type constants in Haiku are defined as 
> mnemonic 4-byte combinations: 'CSTR', 'LONG', 'RECT', which are within 
> the ASCII character range. Can I hope that this tradition will not be 
> broken in the future? ;-)

Treat it as an error if it isn't a 4 character code?  Or use escape
sequences?  Perhaps quoted printable like in e-mail, so unprintable
characters become "=xy" (an equals sign followed by a two character hex
code for the byte).  That would leave normal codes human readable as
printable characters.
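
Something along these lines is what I have in mind (a sketch only,
assuming the e-mail style of escaping where '=' starts a two digit hex
code):

#include <cctype>
#include <cstdint>
#include <cstdio>
#include <string>

// Encode a 4 byte attribute type code: printable bytes stay as they
// are, anything else (and '=' itself) becomes an "=XY" hex escape.
static std::string
EncodeTypeCode(uint32_t type)
{
	std::string out;
	for (int shift = 24; shift >= 0; shift -= 8) {
		unsigned char c = (type >> shift) & 0xFF;
		if (std::isprint(c) && c != '=')
			out += (char)c;
		else {
			char escaped[4];
			std::snprintf(escaped, sizeof(escaped), "=%02X", c);
			out += escaped;
		}
	}
	return out;
}

int main()
{
	std::printf("%s\n", EncodeTypeCode('CSTR').c_str());      // prints CSTR
	std::printf("%s\n", EncodeTypeCode(0x0012344C).c_str());  // prints =00=124L
	return 0;
}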

> 2) Endianness - [...] But many Haiku attributes are binary by their
> nature and can become invalid if simply unpacked on a system with a 
> different byte order. [...]

Always store the archived bytes in network byte order
(http://en.wikipedia.org/wiki/Endianness#Endianness_in_networking
and RFC1700).  Then swap as needed when restoring the file.  Though
you'll need a small database of the attribute types to figure out
which ones need swapping (and be open to adding new types when someone
invents one).  Keep in mind that some attributes can be considered to
be arrays of values.  For example, an attribute with 12 bytes of data
and a type of 'LONG' (which means a 4 byte integer) can be treated as
an array of three integers.  I'm not sure what you'd do with a BMessage
type of attribute.  Don't convert numbers to text and back, just save
the binary, since text doesn't round-trip floating point numbers
accurately.
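
A rough sketch of the restore side for a 'LONG' attribute, assuming the
archive always holds big-endian data (on Haiku the B_BENDIAN_TO_HOST_*
macros from ByteOrder.h do the swapping; the fallback below is only so
the sketch stands alone):

#include <cstdint>
#include <cstring>

#ifdef __HAIKU__
#	include <ByteOrder.h>
#else
// Portable stand-in: rebuild the value from its big-endian bytes.
static uint32_t
B_BENDIAN_TO_HOST_INT32(uint32_t value)
{
	const uint8_t* b = (const uint8_t*)&value;
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
		| ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
#endif

// A 'LONG' attribute may be an array of 4 byte integers; swap each one
// from network byte order to the host's order, in place.
static void
SwapLongAttribute(void* data, size_t size)
{
	for (size_t offset = 0; offset + 4 <= size; offset += 4) {
		uint32_t value;
		std::memcpy(&value, (uint8_t*)data + offset, 4);
		value = B_BENDIAN_TO_HOST_INT32(value);
		std::memcpy((uint8_t*)data + offset, &value, 4);
	}
}

The same loop works for the other fixed-width numeric types with the
matching swap width; string and raw data attributes would be left
alone.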

> 3) The size of attributes. It is obvious that bloating megabytes with 
> conversion to HEX is bad practice. On the other hand, trashing the 
> archive with extra "BeOS attributes" special files, like the Mac version 
> of libarchive does, is not nice either. I think some limit should be 
> defined to decide whether the attribute is HEX-stored in the PaxHeader 
> or as a special binary file. For example 8 or 16 kilobytes. Attributes 
> exceeding this limit should be handled as special "attributes" files.

It's already limited to 512 bytes, less with overhead.  So past that
limit, use separate file(s) for the big attributes.  I'd recommend a
separate file for each big attribute rather than one big one for a
bunch of attributes, to keep things human readable.
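
In code, the split could look something like this (a sketch using the
BeOS/Haiku fs_attr.h C API, with an 8 KB threshold picked arbitrarily
from your examples; actually writing the tar entries is left out):

#include <dirent.h>
#include <fcntl.h>
#include <fs_attr.h>
#include <stdio.h>
#include <unistd.h>

static const off_t kInlineLimit = 8 * 1024;	// assumed threshold

static void
ClassifyAttributes(const char* path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return;

	DIR* dir = fs_fopen_attr_dir(fd);
	if (dir != NULL) {
		struct dirent* entry;
		while ((entry = fs_read_attr_dir(dir)) != NULL) {
			attr_info info;
			if (fs_stat_attr(fd, entry->d_name, &info) < 0)
				continue;

			if (info.size <= kInlineLimit)
				printf("inline in PaxHeader: %s (%lld bytes)\n",
					entry->d_name, (long long)info.size);
			else
				printf("separate archive entry: %s (%lld bytes)\n",
					entry->d_name, (long long)info.size);
		}
		fs_close_attr_dir(dir);
	}
	close(fd);
}

int main(int argc, char** argv)
{
	if (argc > 1)
		ClassifyAttributes(argv[1]);
	return 0;
}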

> PS: And, yes, I have already checked that the "bin->HEX->gz/bz2/zip" way 
> has a better compression ratio than the "bin->base64->gz/bz2/zip" one. ;-) 
> So, assuming the typical use of tar files as a container for stream 
> compression, I do things the simpler way.

Both are annoying in that you can't read the original attribute text
(for string attributes) when looking at the archive file (useful for
debugging).  How about using that quoted printable encoding rather
than Hex or Base64?  How well does quoted printable work for
compression?  I suspect it will turn out to be almost as good as hex.
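
That should be easy to measure.  Something along these lines would do
(a sketch assuming zlib is available):

#include <zlib.h>
#include <string>
#include <vector>

// Compress a string at maximum level and report the output size, so the
// two encodings can be compared on the same attribute data.
static uLong
CompressedSize(const std::string& text)
{
	uLongf destLen = compressBound(text.size());
	std::vector<Bytef> buffer(destLen);
	if (compress2(buffer.data(), &destLen, (const Bytef*)text.data(),
			text.size(), Z_BEST_COMPRESSION) != Z_OK)
		return 0;
	return destLen;
}

Feed a decent sample of real attributes (e-mail headers, icons and so
on) through both encoders and compare the CompressedSize() results.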

- Alex

