[dokuwiki] Re: Search Result

  • From: "Jacob Steenhagen" <jacob@xxxxxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Tue, 26 Feb 2008 21:09:14 -0500

On Tue, Feb 26, 2008 at 6:15 PM, Todd Augsburger <todd@xxxxxxxxxxxxxxxx>
wrote:

> [snip]
> What about a compromise/alternative? It seems to me that the biggest
> objection to raw searching is that it doesn't look "familiar" to the
> user--they see the markup. But what if we just stripped the wiki markup
> from the raw text--wouldn't that be fast and do-able, and look better to
> the user?
>
> For instance, in the 3 places mentioned, there's a call to rawWiki() which
> could be replaced with a function to strip wiki tags. (Or rawWiki() itself
> could be modified to accept another parameter.) Simplistically, something
> like
>   html_entity_decode(preg_replace('/[^\w\.!?"\']+/', ' ', rawWiki($id)), ENT_QUOTES)
> could be used, but there's probably something far better.
>

That's pretty close to my main suggestion in
//www.freelists.org/archives/dokuwiki/02-2008/msg00130.html

The main problem I see is when to run the text through the entity stripping
function. If something already exists to do that in the core code, it would,
of course, be preferable to reuse that existing code. If not, something
would have to be written from scratch.

If you run the wiki text through the entity stripper after ft_snippet() has
decided on what the snippet will be, but before it highlights the search
terms, you end up with a snippet that is shorter than originally intended
(the intended length being, if possible, 50 bytes in either direction,
ideally 100; assuming I'm reading the code right).
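
To make the shrinkage concrete, here is a rough sketch in Python (not
DokuWiki's PHP, and not the real ft_snippet()). The names strip_markup() and
choose_snippet() are made up for illustration, the regex is a simplified
port of the one Todd suggested, and the 30-character context is an arbitrary
value picked to keep the example small.

```python
import re

# Simplified port of the suggested regex: every run of characters that is
# not a word character or basic punctuation collapses into a single space.
NON_TEXT = re.compile(r"[^\w.!?\"']+")


def strip_markup(text):
    """Hypothetical markup stripper (an assumption, not a DokuWiki function)."""
    return NON_TEXT.sub(" ", text).strip()


def choose_snippet(text, term, context):
    """Hypothetical stand-in for the snippet selection step: return the
    first hit plus `context` characters on either side."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    return text[max(0, pos - context): pos + len(term) + context]


raw = (
    "====== Syntax ======\n\n"
    "DokuWiki supports **bold**, //italic// and __underlined__ text. "
    "See [[wiki:syntax|the syntax page]] for ''monospaced'' examples."
)

# Choose the window on the raw text first, strip the markup afterwards.
snippet_raw = choose_snippet(raw, "italic", 30)
stripped = strip_markup(snippet_raw)

print(len(snippet_raw), repr(snippet_raw))
print(len(stripped), repr(stripped))
```

The second printed length is smaller than the first: the window was sized
while the markup was still in it, so whatever the stripper removes comes
straight out of the snippet.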

If you run it right after $text gets filled from rawWiki(), the entities are
stripped before ft_snippet() decides exactly what text goes in the snippet,
so the snippet can be the ideal length. But you then have to parse the
entire page (which on a page like wiki:syntax is quite large). At this point,
you've essentially created the plain text renderer (though a full-fledged
plain text renderer would probably preserve as much of the formatting as is
possible in plain text [e.g., inserting line breaks as appropriate]).
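
The trade-off between the two orderings can be sketched like this, again in
Python rather than DokuWiki's PHP. All function names here are hypothetical,
the page content is invented filler, and "characters fed to the stripper" is
only a crude stand-in for the real parsing cost.

```python
import re

# Simplified port of the regex suggested earlier in the thread.
NON_TEXT = re.compile(r"[^\w.!?\"']+")


def strip_markup(text):
    """Hypothetical markup stripper (an assumption, not a DokuWiki function)."""
    return NON_TEXT.sub(" ", text).strip()


def choose_snippet(text, term, context):
    """Hypothetical snippet selection: first hit plus `context` per side."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    return text[max(0, pos - context): pos + len(term) + context]


def snippet_strip_first(raw, term, context):
    """Strip the whole page, then choose the snippet.
    Returns (snippet, number of characters fed to the stripper)."""
    plain = strip_markup(raw)
    return choose_snippet(plain, term, context), len(raw)


def snippet_strip_last(raw, term, context):
    """Choose the snippet on raw text, then strip only that window.
    Returns (snippet, number of characters fed to the stripper)."""
    window = choose_snippet(raw, term, context)
    return strip_markup(window), len(window)


# Invented page: markup-heavy filler around a single hit.
filler = "Some **bold** and //italic// text with [[a:link|a link]]. "
page = filler * 50 + "The needle is here. " + filler * 50

first, first_work = snippet_strip_first(page, "needle", 50)
last, last_work = snippet_strip_last(page, "needle", 50)

print("strip first:", len(first), "chars in snippet,", first_work, "chars stripped")
print("strip last: ", len(last), "chars in snippet,", last_work, "chars stripped")
```

Stripping first gives a snippet of the full intended width but pushes the
whole page through the stripper; stripping last touches only the window and
returns a shorter snippet.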

So running the entity-stripping function after the snippet has been
chosen would definitely be the fastest. And given the speed penalty of
doing it the other way, I'd imagine the shortened snippet is a fair
trade.

I really think I've made an argument for both methods I've talked about in
this thread. It seems the main difference is how much caching is/can be
done. But like you alluded to in your message, Todd, if our search hits 36
pages and only 6 of them have a good plain text rendered cache and the
snippet engine has to wait for 30 pages to rebuild their cache before it can
build its snippets, that's a bad thing.

-- 
http://jacob.steenhagen.us
