[haiku-doc] Important XML parsing changes needed

  • From: "Humdinger" <humdingerb@xxxxxxxxxxxxxx>
  • To: "Documentation Haiku ML" <haiku-doc@xxxxxxxxxxxxx>
  • Date: Sun, 15 Nov 2009 07:55:37 +0100

Hi there,

The XML parsing of the translation site ensures maximum consistency 
between languages. Unfortunately, it sometimes conflicts with the 
different needs of those languages. For example, you may want to add an 
<i> to stress something of importance, or you'd like to use an <abbr> 
that would be out of place in English (sorry, I had better examples, 
but forgot...).

Anyway, it makes sense to relax the XML parsing a bit more. Instead of 
a long list of what not to parse, let's have a really short list of the 
essential tags that need parsing. Besides whatever automatically results 
in its own block (h1-6, p, pre, div, table, ul, ol etc.), these are:

 * <span class="*"> - to make sure named objects are consistently 
formatted.
 * <a href> - here it's important that the actual URL is NOT parsed (as 
it is now), because we'd like to link to a localized version of the 
resource (at least for external URLs; links within the user guide could 
still be parsed, I suppose).

If the language managers see that people abuse these new-found 
formatting rights, we'll have to come up with a more restrictive 
positive-list after all.
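
To illustrate the short positive list, here is a minimal Python sketch of 
how such a check could look. It is not the translation site's actual code; 
the names EssentialTags, essential_signature and same_essential_markup are 
made up for this example. It assumes that only <span class> and the mere 
presence of <a href> are compared, and it does not treat links within the 
user guide differently.

```python
from collections import Counter
from html.parser import HTMLParser


class EssentialTags(HTMLParser):
    """Collect only the inline tags that must match in every language."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "class" in attrs:
            # Named objects: the class has to be identical everywhere.
            self.found.append(("span", attrs["class"]))
        elif tag == "a" and "href" in attrs:
            # Links: only their presence counts, the URL may be localized.
            self.found.append(("a", None))
        # Everything else (<i>, <abbr>, ...) is deliberately ignored.


def essential_signature(fragment):
    """Return a multiset of essential tags; word order may differ per language."""
    parser = EssentialTags()
    parser.feed(fragment)
    parser.close()
    return Counter(parser.found)


def same_essential_markup(original, translation):
    return essential_signature(original) == essential_signature(translation)
```

With this, a translation that adds an <i> and points a link to a localized 
URL still passes, while one that drops a <span class="..."> does not.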


Another area where parsing should be relaxed is entities, i.e. special 
characters, see http://www.htmlhelp.com/reference/html40/entities/. Those 
have to be added to a translation fairly often, esp. for "&", 
non-breaking spaces, and dashes.
It has been suggested to simply exclude all &xyz; entities from 
parsing, but that could result in people using e.g. "&uuml;" instead of 
"ü" when writing normally. That would make working with a text 
unnecessarily clumsy.

Therefore I suggest excluding only these entities (do other languages 
need more?):

&amp;     &#38;     &
&nbsp;    &#160;    non-breaking space
&copy;    &#169;    copyright-symbol
&reg;     &#174;    registered-symbol
&trade;   &#8482;   tm-symbol
&ndash;   &#8211;   n-size dash
&mdash;   &#8212;   m-size dash
&hellip;  &#8230;   ellipses...

Everything else should really be already in the original English text.
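
As a rough sketch of that rule (Python, with made-up names such as 
EXCLUDED and entities_to_compare, not the site's real code), the parser 
could filter out the entities listed above and keep checking all others. 
Hexadecimal forms like &#xA0; are not covered here and would still be 
checked.

```python
import re

# Named and decimal forms of the entities listed above.
EXCLUDED = ["amp", "#38", "nbsp", "#160", "copy", "#169", "reg", "#174",
            "trade", "#8482", "ndash", "#8211", "mdash", "#8212",
            "hellip", "#8230"]
EXCLUDED_RE = re.compile(r"&(?:%s);" % "|".join(EXCLUDED))

# Any named or numeric entity reference.
ANY_ENTITY_RE = re.compile(r"&#?\w+;")


def entities_to_compare(text):
    """Return the entities that should still be checked against the original."""
    return [e for e in ANY_ENTITY_RE.findall(text)
            if not EXCLUDED_RE.fullmatch(e)]
```

A translated string may then freely use "&nbsp;" or "&mdash;", while an 
"&uuml;" that is missing from the English text would still be flagged.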

It would be a good idea to have a script run over the generated 
exported pages to convert all special characters that are "hiding" to 
their respective HTML-encoding. Plus:
<i> to <em>
<b> to <strong>
<s> to <del>
<u> to <cite>
<tt> to <code>
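
Such a script could look roughly like the following Python sketch. It is 
only an illustration under a few assumptions: the function names 
replace_tags and encode_special_chars are invented here, the tag mapping 
is taken from the list above, "special characters" is read as "anything 
outside ASCII", and the regular expression is a shortcut that does not 
replace a real HTML parser.

```python
import re
from html.entities import codepoint2name

# Presentational tag -> replacement, as listed above.
TAG_MAP = {"i": "em", "b": "strong", "s": "del", "u": "cite", "tt": "code"}

# Opening or closing tag whose name is exactly one of the keys above;
# \b keeps <img>, <br>, <span>, <table> etc. from matching.
TAG_RE = re.compile(r"<(/?)(i|b|s|u|tt)\b([^>]*)>", re.IGNORECASE)


def replace_tags(html):
    """Swap presentational tags for the replacements, keeping attributes."""
    def swap(match):
        slash, name, rest = match.groups()
        return "<%s%s%s>" % (slash, TAG_MAP[name.lower()], rest)
    return TAG_RE.sub(swap, html)


def encode_special_chars(html):
    """Turn every non-ASCII character into a named or numeric entity."""
    out = []
    for ch in html:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)
```

Running both functions over an exported page would let translators keep 
typing "ü" directly while the published HTML contains "&uuml;".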


I hope Vincent is monitoring this mailing list and manages to make 
these adjustments. The <a href> parsing is especially annoying when you 
want to link to a localized wiki page, and the parsing of the & entity 
thwarts any attempt to avoid writing "drag&drop" etc.

Does anyone have more on this issue that I forgot?

Thanks
Regards,
Humdinger

-- 
--=-=--=-=--=-=--=-=--=-=--=-=--=-=--=-=--=-=--=-
Deutsche Haiku News @ http://www.haiku-gazette.de

