[openbeos] Re: Identifying Text Files

  • From: Ingo Weinhold <bonefish@xxxxxxxxxxxxxxx>
  • To: openbeos@xxxxxxxxxxxxx
  • Date: Fri, 09 Jun 2006 21:14:47 +0200

On 2006-06-09 at 13:13:35 [+0200], Axel Dörfler <axeld@xxxxxxxxxxxxxxxx> 
wrote:
> Ingo Weinhold <bonefish@xxxxxxxxxxxxxxx> wrote:
> > since BeOS seems to have built-in support for recognizing files as text
> > files, we want to have the same. I'm about to implement that; what is
> > missing is basically the algorithm deciding whether (or with what
> > probability) a buffer of bytes actually contains text.
> > 
> > A simple but maybe a bit ignorant approach would be to check whether the
> > buffer contains valid UTF-8 characters only (or more than, say, 95%). But
> > maybe someone has better ideas...
> 
> I would add special rule semantics for this, i.e. a "text" rule and an
> "ascii" rule where the former would accept UTF-8 and the latter plain
> ASCII only, maybe even with a method to specify the minimal congruence.

I don't quite understand what you mean. I would simply take ascmagic.c, 
adapt it (convert it to C++, change the parameters and return type of the 
identification function, strip the things I don't need), and return the 
type it finds.

> If you have a look at BSD's "file", the text magic happens in
> ascmagic.c - it looks very reasonable to me, and could even identify
> the charset for StyledEdit (at least in a basic way that should be
> enough for the Western world).

I intend to incorporate that code into the text sniffer add-on directly, 
stripping as much of the character set stuff as possible. But we can 
certainly provide a library function (e.g. in libtextencoding) that guesses 
the character encoding/set of a given buffer.
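
For comparison, the simpler heuristic quoted at the top (accept a buffer if 
at least roughly 95% of it is valid UTF-8) could be sketched as below. This 
is only an illustration under assumptions, not the ascmagic.c algorithm and 
not an existing API: the function names, the byte-based (rather than 
character-based) ratio, and the default threshold of 0.95 are all made up 
for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the length (1-4) of the valid UTF-8 sequence
// starting at data[0], or 0 if no valid sequence starts there.
static size_t
utf8_sequence_length(const uint8_t* data, size_t size)
{
	if (size == 0)
		return 0;

	uint8_t lead = data[0];
	if (lead < 0x80)
		return 1;	// plain ASCII byte

	size_t length;
	uint32_t minValue;	// smallest code point allowed for this length
	uint32_t value;
	if ((lead & 0xE0) == 0xC0) {
		length = 2; minValue = 0x80; value = lead & 0x1F;
	} else if ((lead & 0xF0) == 0xE0) {
		length = 3; minValue = 0x800; value = lead & 0x0F;
	} else if ((lead & 0xF8) == 0xF0) {
		length = 4; minValue = 0x10000; value = lead & 0x07;
	} else
		return 0;	// stray continuation byte or invalid lead byte

	if (length > size)
		return 0;	// sequence is cut off by the end of the buffer

	for (size_t i = 1; i < length; i++) {
		if ((data[i] & 0xC0) != 0x80)
			return 0;	// not a continuation byte
		value = (value << 6) | (data[i] & 0x3F);
	}

	// Reject overlong encodings, surrogates, and values beyond U+10FFFF.
	if (value < minValue || value > 0x10FFFF
		|| (value >= 0xD800 && value <= 0xDFFF))
		return 0;

	return length;
}

// Fraction of the buffer's bytes that belong to valid UTF-8 sequences.
static double
valid_utf8_fraction(const uint8_t* data, size_t size)
{
	if (size == 0)
		return 0.0;

	size_t validBytes = 0;
	size_t i = 0;
	while (i < size) {
		size_t length = utf8_sequence_length(data + i, size - i);
		if (length > 0) {
			validBytes += length;
			i += length;
		} else
			i++;	// skip the offending byte and resynchronize
	}
	return (double)validBytes / size;
}

// Assumed threshold: 0.95, following the "say, 95%" figure quoted above.
static bool
looks_like_utf8_text(const uint8_t* data, size_t size,
	double threshold = 0.95)
{
	return size > 0 && valid_utf8_fraction(data, size) >= threshold;
}
```

Note that control characters such as NUL are valid UTF-8, so a check like 
this alone would still accept many binary buffers; a real sniffer would have 
to reject those bytes separately.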

CU, Ingo
