Add package `tokenizers`

There has recently been some exciting progress on bringing Hugging Face’s Tokenizers lib to wasm/browsers:

I think it’d be great if the full Python bindings were available in the browser (including JupyterLite) via Pyodide, and it doesn’t seem like too much extra work.

@messense has been super helpful over on this issue and the instructions below now produce a .whl that seems to almost work in the browser (problem is explained below).

Build instructions:

Visit this branch and start a GitHub Codespace on it, then run these commands:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu
sudo pip install maturin==0.13.0b8
git clone --depth 1 --branch 3.1.14 https://github.com/emscripten-core/emsdk
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
cd ../bindings/python
RUSTUP_TOOLCHAIN=nightly maturin build --release -o dist --target wasm32-unknown-emscripten -i python3.10
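As a quick sanity check after the build, it may help to confirm that the wheel in dist/ really carries an Emscripten platform tag before trying to load it in Pyodide. The sketch below only parses the standard wheel filename layout (name-version[-build]-python-abi-platform.whl); the helper names and the example filename are hypothetical, not output from the build above.

```python
def wheel_tags(filename: str) -> dict:
    """Split a wheel filename into its python, abi and platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # name-version-python-abi-platform, with an optional build tag after version
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


def targets_emscripten(filename: str) -> bool:
    """Return True if the platform tag looks like an Emscripten/wasm32 build."""
    platform_tag = wheel_tags(filename)["platform"]
    return platform_tag.startswith("emscripten_") and platform_tag.endswith("wasm32")


# Hypothetical filename, used for illustration only.
example = "tokenizers-0.12.1-cp310-cp310-emscripten_3_1_14_wasm32.whl"
print(wheel_tags(example)["platform"], targets_emscripten(example))
```

If the platform tag turns out to be a native one (for example a manylinux tag), the --target flag was probably not picked up by the build.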

The problem

This produces a .whl file that seems to work, except for a threading issue as you can see in this demo (source repo). @ryanking13 mentioned in this comment that Pyodide has some problems with threads that come up at runtime, so perhaps that’s what’s happening here?

For quick reference, here’s the error that you’ll see in the browser console in the above-linked demo:

pyo3_runtime.PanicException: The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 6, kind: WouldBlock, message: "Resource temporarily unavailable" }) }

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

3 reactions
rth commented, Jul 1, 2022

Great that it works! I also find it quite impressive that just running maturin build works (without any of the Pyodide build setup), and that it is now tested against Pyodide in maturin's CI: https://github.com/PyO3/maturin/pull/974

As for integrating it into Pyodide’s meta.yaml build setup, emsdk and the Rust toolchain should already be available in the current Docker image. Maybe we would have to set sharedlibrary: true to indicate that the build doesn’t use setup.py, and then put your custom build setup under build->script as here. I'm not fully sure; someone would have to try it.

Also, @hoodmane was involved in most of this Rust-related work, so he would probably have ideas about the best way to support maturin.

2 reactions
rth commented, Jul 1, 2022

Could you try setting the TOKENIZERS_PARALLELISM=0 environment variable? Generally yes, threading would need to be disabled in tokenizers.
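In a browser there is no shell to export the variable from, so one way to try this suggestion is to set it from Python before tokenizers is imported. This is a minimal sketch, not a confirmed fix for the panic above: the helper name is hypothetical, and the import is guarded so the snippet also runs where the wheel is not installed.

```python
import os


def disable_tokenizers_parallelism() -> None:
    """Ask tokenizers not to use its thread pool (hypothetical helper)."""
    # Set this before the first `import tokenizers` so the value is already
    # in place whenever the library consults it.
    os.environ["TOKENIZERS_PARALLELISM"] = "0"


disable_tokenizers_parallelism()

try:
    import tokenizers  # the wheel built above, if it has been installed
except ImportError:
    tokenizers = None  # keeps the sketch runnable without the wheel

print("TOKENIZERS_PARALLELISM =", os.environ["TOKENIZERS_PARALLELISM"])
```

Whether this is enough to avoid the thread pool initialization in Pyodide would still need to be checked against the demo.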
