[quickjs-devel] Handling potentially non-UTF-8 Strings (w/ patch)

  • From: Muh Muhten <muh.muhten@xxxxxxxxx>
  • To: quickjs-devel@xxxxxxxxxxxxx
  • Date: Tue, 3 Mar 2020 23:43:02 -0500

Hello,

QuickJS provides several interfaces which take or return strings, but which on common systems can contain potentially arbitrary 0-terminated data: scriptArgs, std.getenv, every function that deals in filenames (std.open, os.stat &c.).

This data need not necessarily decode as UTF-8, and when it it does not:
- on read, JS_NewStringLen mangles the data, replacing sequences of non-decodable bytes with U+FFFD
- files and environment variables with names that are not valid UTF-8 cannot be interacted with at all!
- passing the name obtained from os.readdir to os.open will, in fact, open the wrong file...

I think in an ideal world these functions would (have variants which?) work with array buffers rather than strings, but this presents ergonomic issues, especially given the lack of an encode/decode-String API.

I propose adopting so-called "UTF-8B" semantics: encode bytes >=0x80 which cannot be decoded as a UTF-8 into unpaired low surrogates in the range [U+DC80,U+DD00), and map them back when encoding back to a C string. This is a lossless conversion which maps valid UTF-8 to valid UTF-16 and invalid UTF-8 to invalid UTF-16, and is precedented by e.g. Python's PEP 383 <https://www.python.org/dev/peps/pep-0383/>.

A patch which implements this mapping in JS_NewStringLen and JS_ToCStringLen2 is below.

Some nice features of this approach are that does not alter ECMAScript semantics (Strings are not forbidden from unpaired surrogates), only on interactions with the C API, and there is no impact on code which exclusively works with *valid* Unicode strings.

A caveat, however, is that in order to maintain the reversibility of the transformation, it is necessary to treat UTF-8-encoded surrogates (e.g. CESU-8 or WTF-8 style) the same as invalid, so that JS_NewStringLen is no longer surjective. My judgement is that this is no great loss, given that (a) this was awkward and non-normative to begin with (and perhaps merely accidental anyway), and (b) it can be worked around by slicing into astral plane characters in a pinch.

In any case, though, it would be valuable to also have C API functions that work with strings as uint16 *s as well.

---
 quickjs.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/quickjs.c b/quickjs.c
index 372acd3..bfd9f3d 100644
--- a/quickjs.c
+++ b/quickjs.c
@@ -3731,26 +3731,17 @@ JSValue JS_NewStringLen(JSContext *ctx, const char *buf, size_t buf_len)
             } else {
                 /* parse utf-8 sequence, return 0xFFFFFFFF for error */
                 c = unicode_from_utf8(p, p_end - p, &p_next);
-                if (c < 0x10000) {
+                if (c < 0xd800 || (c >= 0xe000 && c < 0x10000)) {
                     p = p_next;
-                } else if (c <= 0x10FFFF) {
+                } else if (c >= 0x10000 && c <= 0x10FFFF) {
                     p = p_next;
                     /* surrogate pair */
                     c -= 0x10000;
                     string_buffer_putc16(b, (c >> 10) + 0xd800);
                     c = (c & 0x3ff) + 0xdc00;
                 } else {
-                    /* invalid char */
-                    c = 0xfffd;
-                    /* skip the invalid chars */
- /* XXX: seems incorrect. Why not just use c = *p++; ? */
-                    while (p < p_end && (*p >= 0x80 && *p < 0xc0))
-                        p++;
-                    if (p < p_end) {
-                        p++;
-                        while (p < p_end && (*p >= 0x80 && *p < 0xc0))
-                            p++;
-                    }
+                    /* XXX: PEP 383-style non-decodable byte smuggling */
+                    c = 0xdc80 | *p++;
                 }
                 string_buffer_putc16(b, c);
             }
@@ -3875,6 +3866,9 @@ const char *JS_ToCStringLen2(JSContext *ctx, size_t *plen, JSValueConst val1, BO
             c = src[pos++];
             if (c < 0x80) {
                 *q++ = c;
+            } else if (c >= 0xdc80 && c < 0xdd00) {
+                /* XXX: PEP 383-style non-decodable byte smuggling */
+                *q++ = c & 0xff;
             } else {
                 if (c >= 0xd800 && c < 0xdc00) {
                     if (pos < len && !cesu8) {
--
From a76181dbb62d1c00301f5c5111198785b826de5d Mon Sep 17 00:00:00 2001
From: Muh Muhten <muh.muhten@xxxxxxxxx>
Date: Tue, 3 Mar 2020 21:40:44 -0500
Subject: [PATCH] make invalid UTF-8 round-trip through String

Modifies JS_NewStringLen's conversion from UTF-8 to UTF-16 to generate
unpaired low surrogates upon encountering non-decodable bytes, and
JS_ToCStringLen2's reverse conversion to decode that particular subset
of unpaired low surrogates into the corresponding bytes.

This is the so-called "UTF-8B" approach, and is precedented by e.g.
Python's PEP 383 <https://www.python.org/dev/peps/pep-0383/>.

This ensures that arbitrary byte data can be recovered even after being
stored in a JS String, which is important as many interfaces which may
represent arbitrary bytes (scriptArgs, std.getenv(), os.readdir(), &c.)
expose this data as a string, mangling values that do not decode as
valid UTF-8 in the process.

Unfortunately, it is as a result *not feasible* to maintain the ability
to create strings containing those unpaired surrogates by encoding them
(normatively illegally) in UTF-8 format--so-called "WTF-8"--as that
would entail having two byte streams encoding the same 16-bit data; it
would not be possible to recover the distinction between them.

My judgement is that the impact of this limitation is fairly low as it
affects only interactions with the C API, and it is comparatively less
common to manipulate potentially-malformed UTF-16, mostly involving
interfaces to legacy systems that assume Unicode is a 2-byte encoding.
Moreover, doing so is *already* cumbersome; although strings can be
extracted as CESU-8, the existing C interfaces are generally all UTF-8
based and do not present strings as sequences of 16-bit code units.
---
 quickjs.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/quickjs.c b/quickjs.c
index 372acd3..bfd9f3d 100644
--- a/quickjs.c
+++ b/quickjs.c
@@ -3731,26 +3731,17 @@ JSValue JS_NewStringLen(JSContext *ctx, const char 
*buf, size_t buf_len)
             } else {
                 /* parse utf-8 sequence, return 0xFFFFFFFF for error */
                 c = unicode_from_utf8(p, p_end - p, &p_next);
-                if (c < 0x10000) {
+                if (c < 0xd800 || (c >= 0xe000 && c < 0x10000)) {
                     p = p_next;
-                } else if (c <= 0x10FFFF) {
+                } else if (c >= 0x10000 && c <= 0x10FFFF) {
                     p = p_next;
                     /* surrogate pair */
                     c -= 0x10000;
                     string_buffer_putc16(b, (c >> 10) + 0xd800);
                     c = (c & 0x3ff) + 0xdc00;
                 } else {
-                    /* invalid char */
-                    c = 0xfffd;
-                    /* skip the invalid chars */
-                    /* XXX: seems incorrect. Why not just use c = *p++; ? */
-                    while (p < p_end && (*p >= 0x80 && *p < 0xc0))
-                        p++;
-                    if (p < p_end) {
-                        p++;
-                        while (p < p_end && (*p >= 0x80 && *p < 0xc0))
-                            p++;
-                    }
+                    /* XXX: PEP 383-style non-decodable byte smuggling */
+                    c = 0xdc80 | *p++;
                 }
                 string_buffer_putc16(b, c);
             }
@@ -3875,6 +3866,9 @@ const char *JS_ToCStringLen2(JSContext *ctx, size_t 
*plen, JSValueConst val1, BO
             c = src[pos++];
             if (c < 0x80) {
                 *q++ = c;
+            } else if (c >= 0xdc80 && c < 0xdd00) {
+                /* XXX: PEP 383-style non-decodable byte smuggling */
+                *q++ = c & 0xff;
             } else {
                 if (c >= 0xd800 && c < 0xdc00) {
                     if (pos < len && !cesu8) {
-- 
2.20.1 (Apple Git-117)

Other related posts:

  • » [quickjs-devel] Handling potentially non-UTF-8 Strings (w/ patch) - Muh Muhten