On 2006-06-09 at 13:13:35 [+0200], Axel Dörfler <axeld@xxxxxxxxxxxxxxxx> wrote:
> Ingo Weinhold <bonefish@xxxxxxxxxxxxxxx> wrote:
> > since BeOS seems to have built-in support for recognizing files as text
> > files, we want to have the same. I'm about to implement that, missing is
> > basically the algorithm deciding whether (or with what probability) a
> > buffer of bytes actually contains text.
> >
> > A simple but maybe a bit ignorant approach would be to check whether the
> > buffer contains valid UTF-8 characters only (or more than, say, 95%). But
> > maybe someone has better ideas...
>
> I would add special rule semantics for this, ie. a "text" rule and an
> "ascii" rule where the former would accept UTF-8 and the latter plain
> ASCII only, maybe even with a method to specify the minimal congruence.

I don't quite understand what you mean. I would simply take ascmagic.c,
adjust it (to C++, parameters/return types of the identification function,
strip things I don't need) and return the type it finds.

> If you have a look at BSD's "file", the text magic happens in
> ascmagic.c - it looks very reasonable to me, and could even identify
> the charset for StyledEdit (at least in a basic way that should be
> enough for the Western world).

I intend to incorporate that code into the text sniffer add-on directly,
stripping as much of the character set stuff as possible. But we can
certainly provide a library function (e.g. in libtextencoding) that guesses
the character encoding/set of a given buffer.

CU, Ingo
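
For illustration only, the simple heuristic from the quoted text (accept a buffer if more than roughly 95% of it is valid UTF-8) could be sketched as below. This is not the ascmagic.c logic and not code from the text sniffer add-on. The function names utf8_text_fraction and looks_like_text, the default threshold, and the decision to count ASCII control bytes other than tab, LF and CR as non-text are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from ascmagic.c): returns the fraction of bytes
// that belong to well-formed UTF-8 sequences. ASCII control bytes other than
// tab, LF and CR are counted as non-text.
double
utf8_text_fraction(const uint8_t* data, size_t length)
{
	assert(data != nullptr || length == 0);
	if (length == 0)
		return 0.0;

	size_t good = 0;
	size_t i = 0;
	while (i < length) {
		uint8_t c = data[i];
		size_t extra;		// continuation bytes expected after the lead byte
		uint32_t minValue;	// smallest code point allowed (rejects overlong forms)
		uint32_t cp;		// decoded code point

		if (c < 0x80) {
			if ((c >= 0x20 && c != 0x7f) || c == '\t' || c == '\n' || c == '\r')
				good++;
			i++;
			continue;
		} else if ((c & 0xe0) == 0xc0) {
			extra = 1; minValue = 0x80; cp = c & 0x1f;
		} else if ((c & 0xf0) == 0xe0) {
			extra = 2; minValue = 0x800; cp = c & 0x0f;
		} else if ((c & 0xf8) == 0xf0) {
			extra = 3; minValue = 0x10000; cp = c & 0x07;
		} else {
			// stray continuation byte or invalid lead byte
			i++;
			continue;
		}

		// The whole sequence must fit into the buffer.
		bool valid = i + extra < length;
		for (size_t k = 1; valid && k <= extra; k++) {
			uint8_t cont = data[i + k];
			if ((cont & 0xc0) != 0x80)
				valid = false;
			else
				cp = (cp << 6) | (cont & 0x3f);
		}

		// Reject overlong encodings, surrogates and out-of-range values.
		if (valid && cp >= minValue && cp <= 0x10ffff
			&& !(cp >= 0xd800 && cp <= 0xdfff)) {
			good += extra + 1;
			i += extra + 1;
		} else
			i++;
	}

	return (double)good / (double)length;
}

// Hypothetical decision function: the 0.95 default mirrors the "say, 95%"
// figure from the quoted text and is not a tuned value.
bool
looks_like_text(const uint8_t* data, size_t length, double threshold = 0.95)
{
	return length > 0 && utf8_text_fraction(data, length) >= threshold;
}
```

A sniffer would typically run such a check on only the first few kilobytes of a file, so a multi-byte sequence cut off at the end of the buffer is counted as non-text here; with a 95% threshold that costs at most a few bytes.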