On Friday, 13 June 2008, Christopher Smith wrote:
> On 13 Jun 2008, at 00:01, Uwe Koloska wrote:
> > To find the URLs I use this regular expression:
> >     "/[a-z]+:\/\/[^|\s]*/"
> >
> > <snip>
> >
> > So, what do you think about this?
> > - is this the right thing to do?
>
> I don't think it is the right thing to do.

You mean this simple regexp, don't you?

> It might be worthwhile to eliminate the "http", "www" & "com" from
> urls before sending the raw wiki text to the indexer,

... not to forget the query and the anchor. A full URL can be as
complicated as this (see
http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax):

  foo://username:password@xxxxxxxxxxx:8042/over/there/index.dtb;type=animal?name=ferret#nose
  \_/   \_______________/ \_________/ \__/ \________/ \___/ \_/ \_________/ \_________/ \__/
   |            |              |       |       |        |    |       |           |       |
  scheme    userinfo       hostname   port    path  filename | parameter(s)   query  fragment
        \________________________________/               extension
                    authority

and there are also URNs like

  urn:example:animal:ferret:nose

and, for the hostname, constructions like ".co.uk".

> but I don't
> think it's sensible to strip out the main part of the domain name -
> which is a useful search term.

On the other hand, most of the time the most important information from
a URL is duplicated in the descriptive part of the link.

> However, the list of items to
> eliminate is likely to be complex, e.g. should it include "org", "co",
> "uk" or "de"?

Which implies a user-configurable list.

> Given the added complexity, it might be more sensible to handle this
> in a plugin - attached to an event which allows filtering of raw wiki
> text before handing it to the indexer ... or perhaps a more complex
> set of events to allow for replacement search/indexing mechanisms.

This may really be a good idea: then there can be plugins of different
complexity.
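For illustration, here is a rough sketch of what such a filter could do
before indexing (in Python, since DokuWiki itself is PHP this is only
an illustration; the regexp is the one quoted above, while the helper
names and the drop-list are invented and would of course have to be
user configurable):

```python
import re

# The regexp quoted above, translated from PCRE delimiters to Python syntax.
URL_RE = re.compile(r"[a-z]+://[^|\s]*")

# Hypothetical, user-configurable list of labels to drop ("com", "co.uk", ...).
DROP_LABELS = {"com", "org", "net", "co", "uk", "de"}

def url_tokens(url):
    """Reduce one matched URL to searchable tokens: keep the main part of
    the domain name, drop scheme, "www" and drop-list labels, and split
    the path/query/fragment into words."""
    rest = url.split("://", 1)[1]
    authority, _, tail = rest.partition("/")          # host[:port] vs. the rest
    host = authority.rsplit("@", 1)[-1].split(":")[0] # strip userinfo and port
    labels = [l for l in host.split(".")
              if l and l != "www" and l not in DROP_LABELS]
    path_words = [w for w in re.split(r"[/;?#&=.\-_]+", tail) if w]
    return labels + path_words

def index_terms(raw_wikitext):
    """Collect index terms from all URLs found in the raw wiki text."""
    terms = []
    for url in URL_RE.findall(raw_wikitext):
        terms.extend(url_tokens(url))
    return terms
```

So from "http://www.example.co.uk/over/there?name=ferret" only
"example", "over", "there", "name" and "ferret" would reach the
indexer.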
But for this I think we need a mechanism to determine the order of the
plugins (or is this already possible with the current event
processing?), so that we can first delete all unwanted content, then
parse the raw text into tokens, and then filter some tokens out (or
expand them).

Just my 2 Eurocent.

Uwe
--
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist