[yunqa.de] Re: TDIHtmlParser.StartPos returns wrong value if a copyright symbol ((c)) is included

  • From: Delphi Inspiration <delphi@xxxxxxxx>
  • To: yunqa@xxxxxxxxxxxxx
  • Date: Mon, 30 Apr 2012 22:08:01 +0200

On 30.04.2012 16:53, Edwin Yip wrote:

> Sorry, while composing a simple project intended to demonstrate the
> problem, I found that it's not a problem of DIHtmlParser, but a logic
> problem in my program - I use Scintilla, which uses "byte positions",
> and I wasn't aware of this and wrongly assumed it also uses "character
> positions"... On the other hand, this shows that DIHtmlParser is very
> stable :)

Thanks for clearing this up!

> BTW, if you don't mind my asking, do you happen to have a function in
> your "utility units" that can convert a "character index" into a Unicode
> string A to the corresponding "byte index" into a UTF-8 string B that
> holds the same text as string A? Thanks.

I would keep reading UTF-8 characters and count byte positions along the
way. Functions that decode a multi-byte UTF-8 sequence into a single
Unicode code point include:

DISQLite3, DISQLite3Api.pas:

{ Reads a single character from a UTF-8 sequence buffer p of length l
  and stores its Unicode Code Point to o. Returns the number of bytes
  consumed. Assumes that p is assigned and l > 0. }
function sqlite3_read_utf8(
  const p: Pointer;
  const l: Cardinal;
  out o: Cardinal): Cardinal;

DIConverters, DIConverters.pas:

{ Usage: http://yunqa.de/delphi/doku.php/products/converters/index }
function utf8_mbtowc(
  const conv: conv_t;
  var pwc: ucs4_t;
  const s: Pointer;
  const n: Integer): Integer;
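
The counting itself could look roughly like the sketch below. It does not
call either of the functions above; it only inspects the lead byte of each
UTF-8 sequence, which is enough to know how many bytes to skip. The name
Utf16IndexToUtf8Offset is hypothetical and not part of DISQLite3 or
DIConverters. Assumptions: both indexes are 0-based, the "character index"
counts UTF-16 code units as a Delphi UnicodeString does, and the buffer is
well-formed UTF-8.

```pascal
program CharIndexToByteIndexDemo;

{$APPTYPE CONSOLE}

{ Hypothetical helper, not part of DISQLite3 or DIConverters.
  Maps a 0-based UTF-16 code unit index to the 0-based byte offset of the
  same character in a UTF-8 buffer. Assumes well-formed UTF-8. }
function Utf16IndexToUtf8Offset(const Buf: array of Byte;
  const CharIndex: Integer): Integer;
var
  BytePos, CharPos, SeqLen, Units: Integer;
  Lead: Byte;
begin
  BytePos := 0;
  CharPos := 0;
  while (CharPos < CharIndex) and (BytePos < Length(Buf)) do
  begin
    Lead := Buf[BytePos];
    Units := 1;
    if Lead < $C0 then
      SeqLen := 1 // ASCII, or a stray continuation byte in malformed input
    else if Lead < $E0 then
      SeqLen := 2
    else if Lead < $F0 then
      SeqLen := 3
    else
    begin
      SeqLen := 4; // code point above U+FFFF:
      Units := 2;  // a surrogate pair in UTF-16
    end;
    { An index pointing at the second half of a surrogate pair maps to
      the start of the 4-byte sequence. }
    if CharPos + Units > CharIndex then
      Break;
    Inc(BytePos, SeqLen);
    Inc(CharPos, Units);
  end;
  { Clamp in case the buffer ends in a truncated sequence. }
  if BytePos > Length(Buf) then
    BytePos := Length(Buf);
  Result := BytePos;
end;

begin
  { 'a', U+00A9 (copyright sign, bytes C2 A9), 'b':
    character index 2 ('b') is byte offset 3. }
  WriteLn(Utf16IndexToUtf8Offset([$61, $C2, $A9, $62], 2));
end.
```

An index past the end of the text yields the buffer length. If the text is
scanned many times, it may be cheaper to remember the last character/byte
position pair and continue from there instead of starting from the
beginning on every call.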

