Siarzhuk Zharski <zharik@xxxxxx> wrote on Mon, 22 Oct 2012 23:01:19 +0200:
> last week I'm implementing BFS attributes support in gnutar and
> libarchive bsdtar applications. Below are my suggestions and possible
> solutions for this problem.

Great! It will be nice to have an archive format that can store
attributes and be larger than the 2GB limit of Zip.

> It was decided to use tar format extensions introduced in POSIX.1-2001.
> Those extensions are presented as list of strings in special (so known
> PaxHeader) header before the standard header of every file. Every
> string, that correspond to one BFS attribute has following format and
> should be UTF-8 encoded:
>
> <length> <key>=<type> <data><CR>

http://www.mkssoftware.com/docs/man4/pax.4.asp

255 character limit on file names, what's the BeOS/Haiku limit? Ah,
it's 255 too, so no problem there. I assume paths to a file can be
stored somehow without having to cram the whole path name into 255
characters (since GNU tar seems to work with this standard).

11 octal digits for the file length, so the maximum is a 33 bit number,
or 8GB. Though I see there's a binary hack workaround described in
http://en.wikipedia.org/wiki/Tar_(file_format), so it can do
BeOS/Haiku files with 64 bit sizes.

But doesn't each extended header have to fit into one 512 byte block?
Which kind of makes it not too useful for bigger things like icon
attributes or some of the longer e-mail attributes (like CC address
lists). In BeOS, an attribute can be as big as a file. Though I see you
later mention a separate attributes file to store the bigger ones.

> 1) At the moment the attribute type is stored as raw uint32 value. You
> should already know that type constants in Haiku are defined as
> mnemonical 4-byte combinations: 'CSTR', 'LONG', 'RECT' that are inside
> of ASCII characters range. Can I hope that this tradition will be not
> broken in the future? ;-)

Treat it as an error if it isn't a 4 character code? Or use escape
sequences?
Perhaps quoted printable like in e-mail, so unprintable characters
become "=xy" (equals sign followed by a two character hex code for the
byte). That would leave normal codes human-readable as printable
characters.

> 2) Endianness - [...] But many of Haiku attributes are binary by theirs
> nature and can become invalid in case of easy expanding on the system with
> different byte order. [...]

Always store the archived bytes in network byte order
(http://en.wikipedia.org/wiki/Endianness#Endianness_in_networking and
RFC 1700). Then swap as needed when restoring the file. Though you'll
need a small database of the attribute types to figure out which ones
need swapping (and be open to adding new types when someone invents
one).

Keep in mind that some attributes can be considered to be arrays of
values. For example, an attribute with 12 bytes of data and a type of
'LONG' (which means a 4 byte integer) can be treated as an array of
three integers. I'm not sure what you'd do with a BMESSAGE type of
attribute.

Don't convert numbers to text and back, just save the binary, since
text doesn't work accurately for floating point numbers.

> 3) The size of attributes. It is obvious that bloating megabytes with
> conversion to HEX is bad practice. In opposite, trashing archive with
> extra "BeOS attributes" special files like MAC version of libarchive do
> is not nice too. I think some limit should be defined either the
> attribute can be HEX-stored in PaxHeader or as special binary file. For
> example 8 or 16 kilobytes. Attributes exceeding this limit should be
> handled as special "attributes" file.

If the extended header really is limited to 512 bytes (less with
overhead), then past that limit use separate file(s) for the big
attributes. I'd recommend a separate file for each big attribute rather
than one big one for a bunch of attributes, to keep things
human-readable.

> PS: And, yes, I have already checked that "bin->HEX->gz/bz2/zip" way
> has better compression ratio than "bin->base64->gz/bz2/zip" one.
> ;-) So
> assuming typical using of tar files as container for stream compressing
> I do the things in simpler way.

Both are annoying in that you can't read the original attribute text
(for string attributes) when looking at the archive file, which is
useful for debugging. How about using that quoted printable encoding
rather than Hex or Base64? How well does quoted printable work for
compression? I suspect it will turn out to be almost as good as hex.

- Alex
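
PS: Here's a rough Python sketch of two of the ideas above, just to make
them concrete. It is not code from gnutar or libarchive. All the names
(encode_type_code, decode_type_code, SWAP_WIDTHS, to_network_order) are
made up for illustration, and I'm assuming the type code is written out
most significant byte first, and that space and "=" get escaped along
with the unprintable bytes so they can't be confused with the record
separators.

```python
import struct
import sys


def encode_type_code(code: int) -> str:
    """Render a uint32 type code as text, escaping odd bytes as =XY."""
    out = []
    for byte in code.to_bytes(4, "big"):
        # Printable ASCII passes through; space, '=' and everything
        # outside 0x21..0x7E are written as '=' plus two hex digits.
        if 0x21 <= byte <= 0x7E and byte != ord("="):
            out.append(chr(byte))
        else:
            out.append(f"={byte:02X}")
    return "".join(out)


def decode_type_code(text: str) -> int:
    """Inverse of encode_type_code; rejects anything that isn't 4 bytes."""
    raw = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            raw.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(text[i]))
            i += 1
    if len(raw) != 4:
        raise ValueError("type code must decode to exactly 4 bytes")
    return int.from_bytes(raw, "big")


# Hypothetical "small database": type name -> element width in bytes.
# Only 'LONG' is listed here; other types would be added as needed.
SWAP_WIDTHS = {"LONG": 4}


def swap_elements(type_name: str, data: bytes) -> bytes:
    """Reverse the bytes of each element; unknown types pass through."""
    width = SWAP_WIDTHS.get(type_name)
    if width is None:
        return data
    if len(data) % width:
        raise ValueError("size is not a multiple of the element width")
    return b"".join(data[i:i + width][::-1]
                    for i in range(0, len(data), width))


def to_network_order(type_name: str, data: bytes,
                     host_order: str = sys.byteorder) -> bytes:
    """Convert attribute data from host order to big-endian for archiving."""
    return data if host_order == "big" else swap_elements(type_name, data)


if __name__ == "__main__":
    print(encode_type_code(0x43535452))  # the 'CSTR' code stays readable
    three_longs = struct.pack("<3i", 1, 2, 3)  # 12 bytes, three integers
    print(to_network_order("LONG", three_longs, "little").hex())
```

Restoring would run the same swap in the other direction on a
little-endian machine, and a type that isn't in the table is left alone
as plain bytes.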