
Discussion: Strings


Feature Request (more of a discussion really)

Overview

While thinking about syntactic sugar for chars (#103), I started considering ways to implement strings. Also, @ballercat mentioned in this comment:

I do have plans for at least static strings, so I would like to keep string tokenizing around.

I’m curious what those plans are. There are quite a few design choices to make regarding what kind of strings Walt wants to support, so maybe a discussion is good?

I’ll write down my thoughts below, perhaps they’re of use to start the discussion.

Impact

Extra-Large.

The consequences of settling on a convention for strings are huge, since it determines what is and isn’t idiomatic code for string handling in Walt. It also determines the most common way to do interop with JS strings, so this is not a decision to be made lightly, IMO.

Details

So the seed for these ideas came from #103, which proposed adding syntactic sugar for turning single characters into the equivalent i32 value that calling codePointAt(0) would return. This would simplify interacting with string data passed to Walt code (since that also would have to be turned into numerical data first).

Strings could be handled similarly: as i32[] arrays packing either full code point values, or two UTF16 code units at a time. (I’m skipping the option of UTF8 encoding for now, as that would require converting JavaScript’s UTF16 code units to UTF8 and back, which would probably make things prohibitively slow. Also, UTF16 keeps the strings consistent with the strings in JavaScript, which will likely ease interop. However, see “Syntax” below.)
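To make the two representations concrete, here’s a quick host-side illustration in plain JavaScript (purely a sketch, nothing Walt-specific):

// One i32 per code point, i.e. what codePointAt-style storage would hold:
const asCodePoints = Array.from('héllo 🌍', ch => ch.codePointAt(0));
// [104, 233, 108, 108, 111, 32, 127757]  -> 7 elements

// One entry per UTF16 code unit, i.e. what charCodeAt-style storage would hold:
const asCodeUnits = 'héllo 🌍'.split('').map(ch => ch.charCodeAt(0));
// [104, 233, 108, 108, 111, 32, 55356, 57101]  -> 8 entries, since the emoji is a surrogate pair

The packed variant would then squeeze two of those 16-bit code units into each i32 element.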

Of course, a plain array is probably a bit too simplistic, because anyone using strings will also want to know their length. A few simple options for handling that:

  • encode them like null-terminated strings in C, with a zero at the end. Of course, there are plenty of issues with that, and we can spare a few bytes of overhead
  • use the first i32 element of the array to represent the length, and the rest of the array for the string itself
  • do both: length at the start, zero-termination at the end. Why not? The benefits of both worlds at minimal overhead cost.

I think the last option is probably best: if we encode strings as packed UTF16, it’s only six bytes of overhead. If stored as full i32 elements for each character, it’s eight. Nobody will miss that.

Using a single i32 for string length would limit single strings to length 0x7FFFFFFF, but seriously… that’s the equivalent of 2 GiB worth of characters, assuming one byte per character. So that’s two or four times that size in WASM. One could do (length>>>1<<1)+(length&1) to compensate for the negative values and get the full 0xFFFFFFFF range of the 32 bits, but if you need more than 2 billion characters, I doubt 4 billion will suffice either.
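(On the JS side of the boundary, by the way, the usual way to reinterpret a signed i32 as unsigned is >>> 0; the shift-based expression above re-signs its intermediate result when written as plain JavaScript. A quick sketch:)

// Reading a signed i32 length back as an unsigned 32-bit value on the JS side.
const stored = -1 | 0;        // bit pattern 0xFFFFFFFF, read back from memory as a signed i32
const length = stored >>> 0;  // 4294967295, the full unsigned range
// Note: (stored >>> 1 << 1) + (stored & 1) gives -1 here, because << re-signs in JS;
// (stored >>> 1) * 2 + (stored & 1) avoids that and also yields 4294967295.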

(one could also use the MSB as some kind of flag, meaning negative values for the length could be meaningful, but I’m struggling to think of what kind of flag one might need for strings exactly. Maybe a rope-structure where the string isn’t zero-terminated but ends with the pointer to the next piece of rope?)

Then the choice would be how to store individual chars:

  • use codePointAt to generate full unicode code points
  • use charCodeAt to generate UTF16 characters, store one per array element
  • use charCodeAt to generate UTF16 characters, store two per i32 array element

Personally I’d probably want to use the two-UTF16-chars-per-element, since it’s half the memory overhead for characters. In my experience, when it comes to performance, avoiding the memory wall outweighs a little bit of bitshifting and masking.
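To make that trade-off concrete, the pack/unpack arithmetic would look roughly like this (host-side JavaScript sketch; Walt code would do the equivalent shifts and masks on i32 values):

// Pack two 16-bit code units into one i32: first unit in the low half, second in the high half.
const pack = (lo, hi) => ((hi & 0xFFFF) << 16) | (lo & 0xFFFF);

// And unpack them again.
const unpackLo = packed => packed & 0xFFFF;
const unpackHi = packed => (packed >>> 16) & 0xFFFF;

const word = pack('h'.charCodeAt(0), 'i'.charCodeAt(0));
String.fromCharCode(unpackLo(word), unpackHi(word)); // "hi"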

However, that would require special bitshifting code all over Walt whenever strings are manipulated, unless even more sugar is introduced, and then we’re getting further and further removed from the WASM metal. So I think convenience and ease of writing correct code is probably a better default than error-prone optimisations that are often unnecessary (but again, see “Syntax” below).

That leaves the other two options. Of those, codePointAt seems more logical, since there is little benefit to charCodeAt once we’re storing one value per element anyway.

Extracting WASM strings would then require using the ArrayBuffer of the module’s memory to create a Uint32Array, and then generating JS strings from the section of memory representing a string via String.fromCodePoint. Sending would be the same process in reverse.

(It’s probably a good idea to create convenience functions to facilitate these conversion steps.)
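Something along these lines, assuming the length-prefix-plus-terminator layout from above (the helper name is made up for this sketch, it’s not something the compiler ships):

// Read a string stored as [length, codePoint, codePoint, ..., 0] starting at byte
// offset ptr (which needs to be 4-byte aligned for the Uint32Array view).
function readWaltString(memory, ptr) {
  const words = new Uint32Array(memory.buffer, ptr);
  const length = words[0];
  let out = '';
  for (let i = 1; i <= length; i++) {
    out += String.fromCodePoint(words[i]);
  }
  return out;
}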

Generation

Similar to chars, strings should be easy to generate:

  • initialise the array at the right length
  • iterate over the string with codePointAt and assign the values to the array (just like the conversion of a single char in #103)

That’s it.
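A host-side sketch of those two steps, again with a made-up helper name and assuming the length-prefix-plus-terminator layout:

// Turn a JS string into the i32 layout [length, codePoint, codePoint, ..., 0].
function encodeWaltString(str) {
  const codePoints = Array.from(str, ch => ch.codePointAt(0)); // iterates by code point
  const words = new Int32Array(codePoints.length + 2);         // room for length prefix + terminator
  words[0] = codePoints.length;
  words.set(codePoints, 1);                                    // the final element stays 0
  return words;
}

encodeWaltString('hi'); // Int32Array [2, 104, 105, 0]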

Syntax

So you might think “huh? what syntax? Just use single and double quotes, no?” but I want to backtrack a bit: I argued that for the common case, using i32[] arrays filled with full unicode code points was probably the best idea, since it seems like it would be the least error prone.

However, imagine you’re dealing with a piece of text that only has ASCII characters; there the difference in performance and overhead becomes quite significant. Perhaps you’d like a string literal to turn into one that only uses the 8 bits per character actually required. So then why not use something like:

let utf8String: i32[] = 1;
// 64 chars, packed four per element: 16 i32 elements
utf8String = utf8`This will turn into 8-bit chars packed as groups of four, y'all!`;

let utf16String: i32[] = 17;
// 51 UTF16 chars, packed two per element: 26 i32 elements
utf16String = utf16`Liberté, égalité, fraternité for all utf encodings!`;

let asciiString: i32[] = 43;
asciiString = c`For when you really need that null-terminated, 8-bit string`;

Basically, my idea is to use tagged templates to support different encodings, since I imagine that various contexts have better use for either UTF8, UTF16 or even UTF32 strings.

So the idea is to use single and double quotes for plain strings, whatever convention will end up being used for that, and tagged templates for these specific, but likely, use-cases. This way we can keep everyone happy.
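None of this exists yet, of course, but purely as a host-side prototype of what, say, the c tag could behave like (the name, packing order, and return type are all just assumptions for this sketch):

// Hypothetical `c` tag: ASCII bytes packed four per i32, null-terminated.
function c(strings, ...values) {
  const text = String.raw(strings, ...values);
  const bytes = Array.from(text, ch => ch.codePointAt(0) & 0xFF);
  bytes.push(0);                          // null terminator
  const words = new Int32Array(Math.ceil(bytes.length / 4));
  bytes.forEach((b, i) => {
    words[i >> 2] |= b << ((i & 3) * 8);  // little-endian packing, four bytes per element
  });
  return words;
}

c`hi!`; // one element, 0x00216968: 'h', 'i', '!', '\0'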

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 33 (11 by maintainers)

Top GitHub Comments

3 reactions
ballercat commented, Jul 28, 2018

The compiler exports stringDecoder and stringEncoder utilities. They’re used extensively in the self-hosted compiler specs. stringDecoder returns a generator.

I did not actually run this code below, but it should work just fine.

import compile, { stringDecoder } from 'walt-compiler';

// Curried helper: bind a DataView over the module's memory once, then decode
// whatever string the returned pointer refers to.
export const getText = view => ptr => {
  let text = "";
  const decoder = stringDecoder(view, ptr);
  let iterator = decoder.next();
  // The generator yields one character code per iteration until the string ends.
  while (!iterator.done) {
    text += String.fromCodePoint(iterator.value);
    iterator = decoder.next();
  }

  return text;
};

const source = `
   export const memory : Memory = { initial: 1 };
   export function hello() : i32 {
      return "Hello World!";
   }
`;

WebAssembly.instantiate(compile(source)).then(({ instance }) => {
    const view = new DataView(instance.exports.memory.buffer);
    const decodeText = getText(view);

    console.log(decodeText(instance.exports.hello())); // "Hello World!"
});

2 reactions
ballercat commented, Apr 4, 2018

You know, I was wondering: could there be a use-case for having something like hygienic macros or some other kind of pre-processor thing that is more modern and safer than what C has to offer, that would both make extending the compiler easy for you, let others build their own extensions, and keep the core code modular?

It would be helpful; I think the project needs to mature and stabilize more before a feature like that, though.

I’m going to be closing the issue, as static strings were implemented and tested as part of #107. This is about as much as I’d like to implement as far as strings go at this point, from the compiler/language side of things.

Thank you very much for the excellent discussion and brainstorming.
