[dokuwiki] Re: idea for improving the index

  • From: Uwe Koloska <dokuwiki@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Fri, 13 Jun 2008 19:47:19 +0200

On Friday, 13 June 2008, Christopher Smith wrote:
> On 13 Jun 2008, at 00:01, Uwe Koloska wrote:
> > To find the URLs I use this regular expression:
> >  "/[a-z]+:\/\/[^|\s]*/"
> >
> > <snip>
> >
> > So, what do you think about this?
> > - is this the right thing to do?
>
> I don't think it is the right thing to do.

You mean this simple regexp, don't you?
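
For reference, here is a minimal Python sketch of what the quoted pattern does to a DokuWiki-style link. The sample text and the function names are made up for illustration, and the PHP delimiters of the original pattern are dropped; it is not meant as the actual indexer code.

```python
import re

# The pattern from the quoted message, without the PHP "/" delimiters:
# a lowercase scheme, "://", then everything up to the next "|" or whitespace.
URL_RE = re.compile(r"[a-z]+://[^|\s]*")


def find_urls(wikitext):
    """Return every substring that the simple pattern treats as a URL."""
    return URL_RE.findall(wikitext)


def strip_urls(wikitext):
    """Remove the matched URLs, keeping the descriptive part of each link."""
    return URL_RE.sub("", wikitext)


# Hypothetical sample text, not taken from a real wiki page.
sample = "See [[http://www.example.com/page?x=1#top|Example page]] for details."
print(find_urls(sample))
print(strip_urls(sample))
```

Note one limitation of the simple pattern: for a link without a description, such as `[[http://example.com]]`, the match runs on to the next whitespace and therefore swallows the closing `]]`.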

> It might be worthwhile to eliminate the "http", "www" & "com" from
> urls before sending the raw wiki text to the indexer,

Not to forget the query and the anchor.  A full URL can look as
complicated as this
(see http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax):

foo://username:password@xxxxxxxxxxx:8042/over/there/index.dtb;type=animal?name=ferret#nose
  \ /   \________________/\_________/ \__/\_________/ \___/ \_/ \_________/ \_________/ \__/
   |           |               |        |     |         |     |       |            |     |
scheme     userinfo         hostname  port  path  filename extension parameter(s) query fragment
   |    \_______________________________/
   |                authority
   |   ________________________
  / \ /                        \
  urn:example:animal:ferret:nose

And for the hostname, there are also multi-part suffixes like ".co.uk".
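
To make the parts concrete, here is a small Python sketch that splits such a URL with the standard library. The host "host.example" is a placeholder I chose for the example, and `url_components` is a hypothetical helper name; note that `urlsplit` leaves the ";type=animal" parameter inside the path.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Split a URL into the parts named in the diagram (as far as urlsplit does)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
        "path": parts.path,          # still contains ";type=animal"
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Placeholder host: "host.example" is an assumption made for this example.
url = ("foo://username:password@host.example:8042"
       "/over/there/index.dtb;type=animal?name=ferret#nose")
components = url_components(url)
for name, value in components.items():
    print(name, "=", value)

# A hostname with a multi-part suffix: the last label alone is not "the TLD part".
labels = "www.example.co.uk".split(".")
print(labels)
```

So even after a clean split, deciding which hostname labels are noise ("www", "co", "uk") and which one is the useful search term still needs a list of some kind.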

> but I don't think it's sensible to strip out the main part of the
> domain name - which is a useful search term.

On the other hand, most of the time the most important information from a
URL is duplicated in the descriptive part of the link.

> However, the list of items to 
> eliminate is likely to be complex, e.g. should it include "org", "co",
> "uk" or "de"?  Which implies a user configurable list.
>
> Given the added complexity, it might be more sensible to handle this
> in a plugin - attached to an event which allows filtering of raw wiki 
> text before handing it to the indexer ... or perhaps a more complex
> set of events to allow for replacement search/indexing mechanisms.

This may really be a good idea.  Then there can be plugins of different
complexity.  But for this I think we need a mechanism to determine the order
of the plugins (or is this already possible with the current event
processing?), so that we can first delete all unwanted content, then parse
the raw text into tokens, and then filter some tokens out (or expand them).

Just my 2 Eurocent.
Uwe
-- 
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
