Hello,
QuickJS provides several interfaces which take or return strings, but
which on common systems can contain potentially arbitrary 0-terminated
data: scriptArgs, std.getenv, every function that deals in filenames
(std.open, os.stat &c.).
This data need not necessarily decode as UTF-8, and when it it does not:
- on read, JS_NewStringLen mangles the data, replacing sequences of
non-decodable bytes with U+FFFD
- files and environment variables with names that are not valid UTF-8
cannot be interacted with at all!
- passing the name obtained from os.readdir to os.open will, in fact,
open the wrong file...
I think in an ideal world these functions would (have variants which?)
work with array buffers rather than strings, but this presents ergonomic
issues, especially given the lack of an encode/decode-String API.
I propose adopting so-called "UTF-8B" semantics: encode bytes >=0x80
which cannot be decoded as a UTF-8 into unpaired low surrogates in the
range [U+DC80,U+DD00), and map them back when encoding back to a C
string. This is a lossless conversion which maps valid UTF-8 to valid
UTF-16 and invalid UTF-8 to invalid UTF-16, and is precedented by e.g.
Python's PEP 383 <https://www.python.org/dev/peps/pep-0383/>.
A patch which implements this mapping in JS_NewStringLen and
JS_ToCStringLen2 is below.
Some nice features of this approach are that does not alter ECMAScript
semantics (Strings are not forbidden from unpaired surrogates), only on
interactions with the C API, and there is no impact on code which
exclusively works with *valid* Unicode strings.
A caveat, however, is that in order to maintain the reversibility of the
transformation, it is necessary to treat UTF-8-encoded surrogates (e.g.
CESU-8 or WTF-8 style) the same as invalid, so that JS_NewStringLen is
no longer surjective. My judgement is that this is no great loss, given
that (a) this was awkward and non-normative to begin with (and perhaps
merely accidental anyway), and (b) it can be worked around by slicing
into astral plane characters in a pinch.
In any case, though, it would be valuable to also have C API functions
that work with strings as uint16 *s as well.
---
quickjs.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/quickjs.c b/quickjs.c
index 372acd3..bfd9f3d 100644
--- a/quickjs.c
+++ b/quickjs.c
@@ -3731,26 +3731,17 @@ JSValue JS_NewStringLen(JSContext *ctx, const
char *buf, size_t buf_len)
} else {
/* parse utf-8 sequence, return 0xFFFFFFFF for error */
c = unicode_from_utf8(p, p_end - p, &p_next);
- if (c < 0x10000) {
+ if (c < 0xd800 || (c >= 0xe000 && c < 0x10000)) {
p = p_next;
- } else if (c <= 0x10FFFF) {
+ } else if (c >= 0x10000 && c <= 0x10FFFF) {
p = p_next;
/* surrogate pair */
c -= 0x10000;
string_buffer_putc16(b, (c >> 10) + 0xd800);
c = (c & 0x3ff) + 0xdc00;
} else {
- /* invalid char */
- c = 0xfffd;
- /* skip the invalid chars */
- /* XXX: seems incorrect. Why not just use c = *p++;
? */
- while (p < p_end && (*p >= 0x80 && *p < 0xc0))
- p++;
- if (p < p_end) {
- p++;
- while (p < p_end && (*p >= 0x80 && *p < 0xc0))
- p++;
- }
+ /* XXX: PEP 383-style non-decodable byte smuggling */
+ c = 0xdc80 | *p++;
}
string_buffer_putc16(b, c);
}
@@ -3875,6 +3866,9 @@ const char *JS_ToCStringLen2(JSContext *ctx,
size_t *plen, JSValueConst val1, BO
c = src[pos++];
if (c < 0x80) {
*q++ = c;
+ } else if (c >= 0xdc80 && c < 0xdd00) {
+ /* XXX: PEP 383-style non-decodable byte smuggling */
+ *q++ = c & 0xff;
} else {
if (c >= 0xd800 && c < 0xdc00) {
if (pos < len && !cesu8) {
--
From a76181dbb62d1c00301f5c5111198785b826de5d Mon Sep 17 00:00:00 2001
From: Muh Muhten <muh.muhten@xxxxxxxxx>
Date: Tue, 3 Mar 2020 21:40:44 -0500
Subject: [PATCH] make invalid UTF-8 round-trip through String
Modifies JS_NewStringLen's conversion from UTF-8 to UTF-16 to generate
unpaired low surrogates upon encountering non-decodable bytes, and
JS_ToCStringLen2's reverse conversion to decode that particular subset
of unpaired low surrogates into the corresponding bytes.
This is the so-called "UTF-8B" approach, and is precedented by e.g.
Python's PEP 383 <https://www.python.org/dev/peps/pep-0383/>.
This ensures that arbitrary byte data can be recovered even after being
stored in a JS String, which is important as many interfaces which may
represent arbitrary bytes (scriptArgs, std.getenv(), os.readdir(), &c.)
expose this data as a string, mangling values that do not decode as
valid UTF-8 in the process.
Unfortunately, it is as a result *not feasible* to maintain the ability
to create strings containing those unpaired surrogates by encoding them
(normatively illegally) in UTF-8 format--so-called "WTF-8"--as that
would entail having two byte streams encoding the same 16-bit data; it
would not be possible to recover the distinction between them.
My judgement is that the impact of this limitation is fairly low as it
affects only interactions with the C API, and it is comparatively less
common to manipulate potentially-malformed UTF-16, mostly involving
interfaces to legacy systems that assume Unicode is a 2-byte encoding.
Moreover, doing so is *already* cumbersome; although strings can be
extracted as CESU-8, the existing C interfaces are generally all UTF-8
based and do not present strings as sequences of 16-bit code units.
---
quickjs.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/quickjs.c b/quickjs.c
index 372acd3..bfd9f3d 100644
--- a/quickjs.c
+++ b/quickjs.c
@@ -3731,26 +3731,17 @@ JSValue JS_NewStringLen(JSContext *ctx, const char
*buf, size_t buf_len)
} else {
/* parse utf-8 sequence, return 0xFFFFFFFF for error */
c = unicode_from_utf8(p, p_end - p, &p_next);
- if (c < 0x10000) {
+ if (c < 0xd800 || (c >= 0xe000 && c < 0x10000)) {
p = p_next;
- } else if (c <= 0x10FFFF) {
+ } else if (c >= 0x10000 && c <= 0x10FFFF) {
p = p_next;
/* surrogate pair */
c -= 0x10000;
string_buffer_putc16(b, (c >> 10) + 0xd800);
c = (c & 0x3ff) + 0xdc00;
} else {
- /* invalid char */
- c = 0xfffd;
- /* skip the invalid chars */
- /* XXX: seems incorrect. Why not just use c = *p++; ? */
- while (p < p_end && (*p >= 0x80 && *p < 0xc0))
- p++;
- if (p < p_end) {
- p++;
- while (p < p_end && (*p >= 0x80 && *p < 0xc0))
- p++;
- }
+ /* XXX: PEP 383-style non-decodable byte smuggling */
+ c = 0xdc80 | *p++;
}
string_buffer_putc16(b, c);
}
@@ -3875,6 +3866,9 @@ const char *JS_ToCStringLen2(JSContext *ctx, size_t
*plen, JSValueConst val1, BO
c = src[pos++];
if (c < 0x80) {
*q++ = c;
+ } else if (c >= 0xdc80 && c < 0xdd00) {
+ /* XXX: PEP 383-style non-decodable byte smuggling */
+ *q++ = c & 0xff;
} else {
if (c >= 0xd800 && c < 0xdc00) {
if (pos < len && !cesu8) {
--
2.20.1 (Apple Git-117)