On Tue, Feb 26, 2008 at 6:15 PM, Todd Augsburger <todd@xxxxxxxxxxxxxxxx> wrote:
> [snip]
> What about a compromise/alternative? It seems to me that the biggest
> objection to raw searching is that it doesn't look "familiar" to the
> user--they see the markup. But what if we just stripped the wiki markup from
> the raw text--wouldn't that be fast and do-able, and look better to the
> user?
>
> For instance, in the 3 places mentioned, there's a call to rawWiki() which
> could be replaced with a function to strip wiki tags. (Or rawWiki() itself
> could be modified to accept another parameter.) Simplistically, something
> like html_entity_decode(preg_replace('/[^\w\.!?"\']+/',' ',rawWiki($id)),ENT_QUOTES)
> could be used, but there's probably something far better.
>

That's pretty close to my main suggestion in
//www.freelists.org/archives/dokuwiki/02-2008/msg00130.html

The main problem I see is when to run the text through the entity stripping
function. If something already exists to do that in the core code, it would,
of course, be preferable to reuse it. If not, something would have to be
written from scratch.

If you run the wiki text through the entity stripper after ft_snippet() has
decided on what the snippet will be, but before it highlights the search
terms, you end up with a snippet that is shorter than originally intended
(the intended context being, if possible, 50 bytes in either direction,
ideally 100; assuming I'm reading the code right).

If you instead run it right after $text gets filled from rawWiki(), the
entities are stripped before ft_snippet() decides exactly what text goes into
the snippet, so the snippet can be the ideal length. The cost is that you have
to parse the entire page (which on a page like wiki:syntax is quite large). At
this point, you've essentially created the plain text renderer (though a
full-fledged plain text renderer would probably preserve as much of the
formatting as is possible in plain text [e.g., insert line breaks as
appropriate]).

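To make the two orderings concrete, here is a minimal PHP sketch. It is not DokuWiki code: strip_markup() and cut_snippet() are hypothetical helpers, cut_snippet() is only a simplified stand-in for what ft_snippet() does (context around the first match, no highlighting), and the 20-byte context and sample text are made up to keep the example small. The stripper simply reuses the simplistic expression Todd quoted.

```php
<?php
// Hypothetical helper: crude markup stripper based on the quoted expression.
function strip_markup(string $raw): string
{
    // Replace every run of characters other than word characters and . ! ? " '
    // with a single space. html_entity_decode() is kept from the quoted
    // expression; since '&' and ';' have already been replaced by spaces,
    // it has nothing left to decode here.
    return html_entity_decode(preg_replace('/[^\w\.!?"\']+/', ' ', $raw), ENT_QUOTES);
}

// Hypothetical stand-in for the snippet selection step (NOT ft_snippet()):
// returns up to $context bytes on either side of the first match of $term.
function cut_snippet(string $text, string $term, int $context = 50): string
{
    $pos = stripos($text, $term);
    if ($pos === false) {
        return '';
    }
    $start  = max(0, $pos - $context);
    $length = ($pos - $start) + strlen($term) + $context;
    return substr($text, $start, $length);
}

// Ordering 1: choose the snippet from the raw text, then strip it.
// Only the small window is processed, but removing markup shrinks it.
function snippet_then_strip(string $raw, string $term, int $context = 50): string
{
    return trim(strip_markup(cut_snippet($raw, $term, $context)));
}

// Ordering 2: strip the whole page first, then choose the snippet.
// The snippet keeps its intended length, but the entire page is processed.
function strip_then_snippet(string $raw, string $term, int $context = 50): string
{
    return trim(cut_snippet(strip_markup($raw), $term, $context));
}

// Made-up sample page text, for illustration only.
$raw = "====== Syntax ======\n\nDokuWiki supports **bold**, //italic// and "
     . "[[wiki:syntax|links]] in page text. Use search to find a page quickly.";

$a = snippet_then_strip($raw, 'italic', 20);
$b = strip_then_snippet($raw, 'italic', 20);

printf("snippet, then strip: %d bytes: %s\n", strlen($a), $a);
printf("strip, then snippet: %d bytes: %s\n", strlen($b), $b);
```

With this sample, the first ordering yields the shorter snippet, because the markup characters inside the chosen window are dropped after the window has already been sized.
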
So running the entity-stripping function after the snippet has been chosen
would definitely be the fastest. And I'd imagine that, given the speed penalty
of doing it the other way, the shortened snippet is probably a fair trade.

I really think I've made an argument for both methods I've talked about in
this thread. It seems the main difference is how much caching is/can be done.
But like you alluded to in your message, Todd, if our search hits 36 pages and
only 6 of them have a good plain text rendered cache, and the snippet engine
has to wait for the other 30 pages to rebuild their cache before it can build
its snippets, that's a bad thing.

--
http://jacob.steenhagen.us