Add package `tokenizers`
There has recently been some exciting progress on bringing Hugging Face’s Tokenizers lib to wasm/browsers:
- @Narsil: https://github.com/huggingface/tokenizers/pull/1009
- @mbrunel: https://github.com/mithril-security/tokenizers-wasm
- Discussion:
I think it’d be great if the full Python bindings were available in the browser (including JupyterLite) via Pyodide, and it doesn’t seem like too much extra work.
@messense has been super helpful over on this issue, and the instructions below now produce a `.whl` that seems to almost work in the browser (the problem is explained below).
Build instructions:
Visit this branch and start a GitHub Codespace on it, then run these commands:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu
sudo pip install maturin==0.13.0b8
git clone --depth 1 --branch 3.1.14 https://github.com/emscripten-core/emsdk
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
cd ../bindings/python
RUSTUP_TOOLCHAIN=nightly maturin build --release -o dist --target wasm32-unknown-emscripten -i python3.10
The problem
This produces a `.whl` file that seems to work, except for a threading issue, as you can see in this demo (source repo). @ryanking13 mentioned in this comment that Pyodide has some problems with threads that come up at runtime, so perhaps that’s what’s happening here?
For quick reference, here’s the error that you’ll see in the browser console in the above-linked demo:
pyo3_runtime.PanicException: The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 6, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
Issue Analytics
- Created a year ago
- Comments: 5 (4 by maintainers)
Great that it works! Also, I find it quite impressive that just running `maturin build` works (without any of the pyodide build setup) and that it is now tested against Pyodide in their CI: https://github.com/PyO3/maturin/pull/974
About integrating it in pyodide’s meta.yaml build setup: emsdk and the Rust toolchain should already be available in the current docker image. Maybe we would have to set `sharedlibrary: true` to indicate that the build doesn’t use `setup.py`, and then have your custom build setup under `build->script`, as here. Not fully sure; someone would have to try it.
Also, @hoodmane was involved in most of this Rust-related work, so he would probably have ideas about the best way to support `maturin`.
Could you try setting the `TOKENIZERS_PARALLELISM=0` env variable? Generally yes, threading would need to be disabled in tokenizers.
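For anyone trying this suggestion from Python (for example in a Pyodide or JupyterLite cell), a minimal sketch follows. It assumes that setting the variable through `os.environ` before importing `tokenizers` is enough for the library to pick it up; the helper name `disable_tokenizers_parallelism` is hypothetical, and the thread does not confirm that this resolves the panic shown above.

```python
import os


def disable_tokenizers_parallelism() -> str:
    """Set TOKENIZERS_PARALLELISM=0 for the current process.

    Hypothetical helper for illustration; it only sets the environment
    variable suggested in the comment above and returns its value.
    """
    os.environ["TOKENIZERS_PARALLELISM"] = "0"
    return os.environ["TOKENIZERS_PARALLELISM"]


# Set the variable first, so it is already in place when the library loads.
value = disable_tokenizers_parallelism()

try:
    import tokenizers  # imported only after the variable is set
except ImportError:
    # The wheel is not installed in this environment; nothing else to do.
    tokenizers = None
```

Setting the variable before the import is the conservative ordering: it avoids depending on when the library first reads it.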