[openbeos] Re: queries on missing attributes and Query Language Specification

From: "Alexander G. M. Smith" <agmsmith@xxxxxxxxxx>
To: openbeos@xxxxxxxxxxxxx
Date: Tue, 21 Dec 2004 08:30:43 -0500 EST
François Revol wrote on Tue, 21 Dec 2004 12:48:00 +0100 CET:
> I recall some ppl saying =="*" and =="**" produced different outputs, 
> you might want to try both. (why I don't know, well "*X" means ends 
> with X, "X*" means starts with X, maybe "*" only means ends with 
> nothing to some fs, while "**" means contains '', which is semantically 
> different ?)

Somebody goofed in their query implementation, that's why.

> Btw, your code is a bit... hairy I'd say :P
> Will have to write my own parser for googlefs :-(
> Would really be useful to have that available as a module.
> /me adds a TODO item nr 46763276.

Yup.  Leaving out the hairy bits, it's a recursive descent parser.  Easy to
implement your own one.  Basically each decomposition step becomes a function
that tests the current fragment of query expression against the step's
requirements (possibly calling other parse functions for testing parts of the
fragment to see if they are an expression etc.) and either does its job
or returns false.

- Alex

P.S. In case people haven't seen it before, here's my description of the
BFS query language:

The query language consists of comparison expressions joined by "&&"
(logical and) and "||" (logical or) operations, with round parenthesis "()"
to control grouping.  && has grouping precedence over || if you don't
override it with parenthesis (in general, the same operator precidence is
used as in the C programming language).  You can also put a "!" (logical
negation) before the expression to reverse its meaning.  The comparison
expression has three parts:

AttributeName Relation Value

The AttributeName is just the name of an attribute, which is case sensitive,
and for speed should match an index.  The Relation is one of "==" (equal),
"=" (equal), "!=" (not equal), "<" (less than), ">" (greater than), ">="
(greater than or equal), "<=" (less than or equal) which have the usual
meanings.  The value is a string which gets converted to a number if the
attribute is numeric (dates are big numbers: number of seconds since January
1, 1970), or gets converted to a regular expression if the attribute is a
string.  Note that hex values (0xabcd1234) can be specified even for
floating point numbers.  The same attribute on different files can have a
different data type, making comparisons more interesting.  If the attribute
isn't present, the file won't be listed.  Regular expressions use "*" to
stand for any number (including zero) of any character, "\" to escape the
next character, "?" to match any single character, [ad-f] to match a single
character specified by the actual characters or a range, [!...] to match a
single character not of the specified ones.  The pattern matching works on
Unicode characters, so that if you store the file name as UTF-8 multibyte
characters, you need to be careful about what counts as a single character
for "?" and [a-z] patterns.  For example a "?" will match "事", which is some
Japanese character (shows up as the 3 bytes E4, BA, 8B in UTF-8 format).

UnaryOperator AttributeName

As a new (January 10 2003) feature in AGMSRAMFileSystem, I'm adding a couple
of unary operators <> and >< to find all files that have or don't have an
attribute.  Use "<>AttributeName" for files with the attribute and
"><AttributeName" to find files that don't have it.  Note how >< looks like
an X in shape, kind of implying crossing out of something, so I use it to
find files that do not have the attribute.  Rather than using new symbols, <
and > are reused so that the list of reserved symbols doesn't change.

After a bit of experimentation (finding out you can put quotes around
attribute names etc), I've come up with this description of the language.
The order of the expansions is important, if the first alternative fits
(such as QuotedString vs UnquotedString in the definition of String), use it
before trying the remaining alternatives.

<String> ::= <WhiteSpace> ( <QuotedString> | <UnquotedString> )

<WhiteSpace> ::= { <Space-like characters> }
Meaning zero or more space or other space-like characters (tabs, newlines,
and others - see the standard isspace() function).  I'm not going to insert
<WhiteSpace> all over the place in this language description, but it is
there in places you'd expect.  In practice, the query will be converted to
tokens before being parsed, at that point white space is removed and the
parser doesn't have to worry about it.

<QuotedString> ::= "\"" { <NonQuote> | "\\\"" } ( "\"" | "\0" )
Meaning a quote character followed by zero or more of (non-quote character
or backslash quote (\") which puts a quote character in the string) followed
by a quote or end of string.  The string will contain the text between the
quotes (including white space), with the backslash quote combination for
entering one quote into the string.  Other backslash combinations aren't
special; they appear in the string as the backslash and the following
character.  You can leave off the trailing quote if the QuotedString is the
last thing in the query.

<UnquotedString> ::= [ <NonSpecial character> ]
Meaning one or more of any character except (, ), !, |, &, <, >, =, white
space.  The string will be all characters up to the next character that
might have special meaning.  Yes, quotes just are ordinary characters; you
can use name==*"* to find all files with a quote mark in their name.  No,
it's not smart enough to look ahead, so a "=" will end the string even if
the next character doesn't make a "==" out of it.  You may want to ignore
that BFS simplification, but it doesn't matter since most queries use quoted
strings.

<Expression> ::= <SimplerExpression> | <SimplerExpression> "||" <Expression>
Meaning that the || operator has lowest precedence.  Also implies that a
sequence of || operations will be evaluated left to right, though query
optimization may change that order.

<SimplerExpression> ::= <Term> | <Term> "&&" <SimplerExpression>
Meaning that the && operator has the second lowest precedence and associates
left to right.

<Term> ::= <Comparison> | "(" <Expression> ( ")" | "\0" ) | "!" <Term>
Meaning that !  (the not operator) binds tightly, so !a<b&&c<d is equivalent
to (!(a<b))&&(c<d).  To save on complexity, the parser can remove !
operations and reverse the logic of the items being notted, so that
statement would be equivalent to (a>=b)&&(c<d).  Use De Morgan's Law to
apply not to logic operations.  Parenthesis let you group subexpressions.
You can leave off the trailing ")" for expressions that end at the end of
the query string.

<Comparison> ::=
<String> <RelationalOperator> <String> | <UnaryOperator> <String>
Meaning that the basic comparison operation is done between a named
attribute (the first string) and some sort of value (the second string).
The second string will be converted to the appropriate data type to match
the attribute (if it is an integer attribute, the value string will be
converted to an integer, same for floating point attributes, date
attributes, etc).  If the conversion fails, it will use some default as the
value (usually zero).  In the unary version, the string is the name of an
attribute.

<RelationalOperator> ::= "<" | "<=" | "==" | "=" | "!=" | ">=" | ">"
As a special case for string comparisons, if the value is a pattern
(containing * [ ?  or \) and the comparison is == or = or != then pattern
matching will be done rather than the usual relational comparison.  BFS
overdoes this and gives no results if you use less or greater than with a
pattern.  If you are using multiple keyword indices, then the test succeeds
if any one of the keywords succeeds.

<UnaryOperator> ::= "<>" | "><"
These are used for seeing if an attribute exists or if it does not exist.
References:
- [openbeos] Re: queries on missing attributes
  - From: François Revol
[openbeos] Re: queries on missing attributes and Query Language Specification

Other related posts: