Discussion: Strings
Feature Request (more of a discussion really)
Overview
When thinking of syntactic sugar for chars (#103) I started thinking of ways to implement strings. Also, @ballercat mentioned in this comment:
> I do have plans for at least static strings, so I would like to keep string tokenizing around.
I’m curious what those plans are. There are quite a few design choices to make regarding what kind of strings Walt wants to support, so maybe a discussion is good?
I’ll write down my thoughts below, perhaps they’re of use to start the discussion.
Impact
Extra-Large.
The consequences of settling on a convention for strings is huge, since it determines what is and isn’t idiomatic code for string handling in Walt. It also determines the most common way to do interop with JS strings, so this is not a decision to be made lightly, IMO.
Details
So the seed for these ideas came from #103, which proposed adding syntactic sugar for turning single characters into the equivalent `i32` value that calling `codePointAt(0)` would return. This would simplify interacting with string data passed to Walt code (since that data also has to be turned into numbers first).
Strings could be handled similarly: as `i32[]` arrays packing either full code point values, or two UTF-16 code units at a time. (I’m skipping the option of UTF-8 encoding for now, as that would require converting JavaScript’s UTF-16 code units to UTF-8 and back, which would probably make things prohibitively slow. Also, UTF-16 keeps the strings consistent with the strings in JavaScript, which will likely ease interop. However, see “Syntax” below.)
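For context on that trade-off: the difference between the two JS accessors only shows up for characters outside the Basic Multilingual Plane, where `charCodeAt` yields two surrogate code units while `codePointAt` yields a single full code point:

```javascript
// One astral character: U+1F4A9, which lies outside the Basic Multilingual Plane.
const astral = "\u{1F4A9}";

astral.length;          // 2 UTF-16 code units for one character
astral.charCodeAt(0);   // 0xD83D, the high surrogate
astral.charCodeAt(1);   // 0xDCA9, the low surrogate
astral.codePointAt(0);  // 0x1F4A9, the single full code point
```

So a `codePointAt`-based encoding stores one element per character, while a `charCodeAt`-based one stores one element (or half an element, if packed) per code unit.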
Of course, a plain array is probably a bit too simplistic, because anyone using strings will probably also want to know their length. A few simple options for handling that:
- encode them like null-terminated strings in C, with a zero at the end. Of course, there are plenty of issues with that approach, and we can spare some bytes for overhead
- use the first `i32` element of the array to represent the length, and the rest of the array as the string
- do both: length at the start, zero-termination at the end. Why not? The benefits of both worlds at minimal overhead cost.
I think the last option is probably best: if we encode strings as packed UTF-16, it’s only six bytes of overhead (a four-byte length plus a two-byte terminator). If each character is stored as a full `i32` element, it’s eight. Nobody will miss that.
Using a single `i32` for the string length would limit individual strings to length 0x7FFFFFFF, but seriously… that’s the equivalent of 2 GiB worth of characters, assuming one byte per character, so two or four times that size in WASM memory. One could reinterpret the length bits as unsigned to compensate for the negative values and get the full 0xFFFFFFFF range of the 32 bits, but if you need more than 2 billion characters, I doubt 4 billion will suffice either.
(one could also use the MSB as some kind of flag, meaning negative values for the length could be meaningful, but I’m struggling to think of what kind of flag one might need for strings exactly. Maybe a rope-structure where the string isn’t zero-terminated but ends with the pointer to the next piece of rope?)
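Both of those ideas are cheap to express in the JS glue code; a sketch (the helper names and the "rope" flag are made up for illustration):

```javascript
// Reinterpret a (possibly negative) signed i32 length as unsigned,
// recovering the full 0xFFFFFFFF range.
function unsignedLength(signed) {
  return signed >>> 0; // e.g. -1 -> 4294967295
}

// Or spend the MSB on a flag (here: a hypothetical "rope continues" marker)
// and keep the lower 31 bits as the actual length.
function readHeader(raw) {
  return {
    isRope: (raw & 0x80000000) !== 0,
    length: raw & 0x7fffffff,
  };
}
```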
Then the choice would be how to store individual chars:
- use `codePointAt` to generate full Unicode code points, store one per array element
- use `charCodeAt` to generate UTF-16 code units, store one per array element
- use `charCodeAt` to generate UTF-16 code units, store two per `i32` array element
Personally I’d probably want to use the two-UTF16-code-units-per-element option, since it halves the memory footprint of each string. In my experience, when it comes to performance, avoiding the memory wall outweighs a little bit of bitshifting and masking.
However, that would require special bitshifting code all over Walt wherever strings are manipulated, unless even more sugar is introduced, and then we’re getting further and further removed from the WASM metal. So instead I think convenience and ease of writing correct code is probably a better default than error-prone optimisations that are often unnecessary (but again, see “Syntax” below).
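For completeness, the bitshifting that packing would require is not much code on the JS side; a sketch, assuming little-endian packing with the low code unit first (helper names are illustrative):

```javascript
// Pack two UTF-16 code units into one i32, low unit in the low 16 bits.
function packPair(lo, hi) {
  return (lo | (hi << 16)) | 0;
}

// Recover the two code units from a packed i32.
function unpackPair(packed) {
  return [packed & 0xffff, (packed >>> 16) & 0xffff];
}
```

The catch described above is that equivalent shifts and masks would need to appear in Walt code wherever a string is indexed.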
That leaves the two other options. Of those, `codePointAt` seems more logical, since there is little benefit to using `charCodeAt` values at that point.
Extracting WASM strings would then require using the module’s ArrayBuffer to create a `Uint32Array`, and then generating strings from the section of memory representing a string via `String.fromCodePoint`. Sending would be the same process in reverse.
(It’s probably a good idea to create convenience functions to facilitate these conversion steps.)
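As a sketch of what such a convenience function might look like on the JS side (the name and the length-prefixed layout are the assumptions from above, not an existing Walt API):

```javascript
// Decode a length-prefixed string of full code points from module memory.
// `buffer` is the module's ArrayBuffer, `byteOffset` the string's address
// (assumed to be 4-byte aligned).
function decodeString(buffer, byteOffset) {
  const view = new Uint32Array(buffer, byteOffset);
  const length = view[0];                         // first element is the length
  const codePoints = view.subarray(1, 1 + length);
  return String.fromCodePoint(...codePoints);
}
```

(For very long strings the spread would hit the engine’s argument-count limit, so a real helper would build the result in chunks.)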
Generation
Similar to chars, strings should be easy to generate:
- initiate the array at the right size
- iterate over the string with `codePointAt` and assign the values to the array elements (just like the sugar for a single char)
That’s it.
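The two steps above can be sketched in JS (the helper name is made up; it also adds the length prefix and terminator discussed earlier):

```javascript
// Encode a JS string as a length-prefixed, zero-terminated array of
// full Unicode code points, ready to copy into module memory.
function encodeString(str) {
  const codePoints = Array.from(str, (c) => c.codePointAt(0)); // step 2: iterate
  const out = new Uint32Array(codePoints.length + 2);          // step 1: right size
  out[0] = codePoints.length; // length prefix
  out.set(codePoints, 1);     // the characters themselves
  return out;                 // last element is already 0: the terminator
}
```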
Syntax
So you might think “huh? what syntax? Just use single and double quotes, no?” but I want to backtrack a bit: I argued that for the common case, using `i32[]` arrays filled with full Unicode code points was probably the best idea, since it seems like it would be the least error-prone.
However, imagine you’re dealing with a piece of text that only has ASCII characters, and that difference in performance and overhead becomes quite significant. Perhaps you’d like a string literal to turn into one that only uses the 8 bits per character required. So then why not use something like:
```js
let utf8String: i32[] = 1;
// 64 chars packed four to an element, so 16 i32 elements
utf8String = utf8`This will turn into 8-bit chars packed as groups of four, y'all!`;

let utf16String: i32[] = 17;
// 51 utf16 chars packed two to an element, so 26 i32 elements
utf16String = utf16`Liberté, égalité, fraternité for all utf encodings!`;

let asciiString: i32[] = 43;
asciiString = c`For when you really need that null-terminated, 8-bit string`;
```
Basically, my idea is to use tagged templates to support different encodings, since I imagine that various contexts have better use for either UTF8, UTF16 or even UTF32 strings.
So the idea is to use single and double quotes for plain strings, whatever convention ends up being used for those, and tagged templates for these specific but likely use cases. This way we can keep everyone happy.
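A rough prototype of such a tag in today’s JS (the `utf8` name is the proposal above, not an existing feature, and this simply truncates char codes to 8 bits rather than doing real UTF-8 encoding, which would need a `TextEncoder`):

```javascript
// Illustrative `utf8` tag: packs 8-bit char codes four to an i32,
// little-endian within each element.
function utf8(strings, ...values) {
  const text = String.raw(strings, ...values);
  const bytes = Array.from(text, (c) => c.charCodeAt(0) & 0xff);
  const out = new Int32Array(Math.ceil(bytes.length / 4));
  bytes.forEach((b, i) => {
    out[i >> 2] |= b << ((i & 3) * 8);
  });
  return out;
}

const packed = utf8`abcd`; // one element: 0x64636261
```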
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 1
- Comments: 33 (11 by maintainers)
Top GitHub Comments
The compiler exports `stringDecoder` and `stringEncoder` utilities. They are used extensively in the self-hosted compiler specs. `stringDecoder` returns a generator. I did not actually run the code below, but it should work just fine.
It would be helpful, but I think the project would need to mature and stabilize more before adding this feature.
I’m going to be closing the issue, as static strings were implemented and tested as part of (#107). This is about as much as I’d like to implement as far as strings go at this point, from the compiler/language side of things.
Thank you very much for the excellent discussion and brainstorming.