question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

invalid jinja templates for a bigger number of files

See original GitHub issue

When trying to open a combined JSON file for more than 1908 netcdf files with more than 5 variables (or when setting templates_count to 0 in MultiZarrToZarr.translate), fsspec raises a jinja error. For a script to reproduce this and the actual traceback:

script to generate, convert and combine netcdf files
import pathlib

import dask
import dask.diagnostics
import fsspec
import numpy as np
import ujson
import xarray as xr

from fsspec_reference_maker.combine import MultiZarrToZarr
from fsspec_reference_maker.hdf import SingleHdf5ToZarr


def gen_json(u, root):
    so = {"mode": "rb"}  # add storage options for other filesystem implementations
    with fsspec.open(u, **so) as inf:
        h5chunks = SingleHdf5ToZarr(inf, u, inline_threshold=300)

        with open(root / f"{pathlib.Path(u).name}.json", "wb") as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())


nc_root = pathlib.Path("nc")
nc_root.mkdir(exist_ok=True)


def create_file(n):
    ds = xr.Dataset(
        data_vars={
            "a": (("x", "y"), np.ones((100, 1))),
            "b": (("y", "z"), np.zeros((1, 200))),
        },
        coords={
            "y": [n],
        },
    ).chunk({"x": 10})
    ds.to_netcdf(nc_root / f"{n:04d}.nc")


with dask.diagnostics.ProgressBar():
    _ = dask.compute(*[dask.delayed(create_file)(i) for i in range(1909)])

paths = sorted(nc_root.iterdir())

# create a directory to store the datas
separate_root = pathlib.Path("separate")
separate_root.mkdir(exist_ok=True)

# create a directory to store the individual datas
with dask.diagnostics.ProgressBar():
    _ = dask.compute(*[dask.delayed(gen_json)(str(p), separate_root) for p in paths])


def combine_json(paths, outpath):
    mzz = MultiZarrToZarr(
        paths,
        remote_protocol="file",
        xarray_open_kwargs={
            "decode_cf": False,
            "mask_and_scale": False,
            "decode_times": False,
            "decode_timedelta": False,
            "use_cftime": False,
            "decode_coords": False,
        },
        xarray_concat_args={
            "combine_attrs": "override",
            "data_vars": "minimal",
            "coords": "minimal",
            "compat": "override",
            # "concat_dim": "time",
            "dim": "time",
        },
    )
    mzz.translate(outpath, template_count=1)


json_paths = sorted(separate_root.iterdir())

combined_root = pathlib.Path("combined")
combined_root.mkdir(exist_ok=True)

path_1908 = combined_root / "1908.json"
path_1909 = combined_root / "1909.json"

print("combining ... ", end='')
combine_json(json_paths[:1908], path_1908)
combine_json(json_paths, path_1909)
print("done")

mapper_1908 = fsspec.get_mapper(
    "reference://", fo=f"file://{path_1908.absolute()}", remote_protocol="file"
)
print("1908 files:", mapper_1908)
mapper_1909 = fsspec.get_mapper(
    "reference://", fo=f"file://{path_1909.absolute()}", remote_protocol="file"
)
print("1909 files:", mapper_1909)
traceback
Traceback (most recent call last):
  File ".../test.py", line 86, in <module>
    mapper_1909 = fsspec.get_mapper("reference://", fo=f"file://{path_1909.absolute()}", remote_protocol="file")
  File ".../fsspec/fsspec/mapping.py", line 228, in get_mapper
    fs, urlpath = url_to_fs(url, **kwargs)
  File ".../fsspec/fsspec/core.py", line 399, in url_to_fs
    fs = cls(**options)
  File ".../fsspec/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File ".../fsspec/fsspec/implementations/reference.py", line 142, in __init__
    self._process_references(text, template_overrides)
  File ".../fsspec/fsspec/implementations/reference.py", line 259, in _process_references
    self._process_references1(references, template_overrides=template_overrides)
  File ".../fsspec/fsspec/implementations/reference.py", line 300, in _process_references1
    u = _render_jinja(u)
  File ".../fsspec/fsspec/implementations/reference.py", line 283, in _render_jinja
    return jinja2.Template(u).render(**self.templates)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 1208, in __new__
    return env.from_string(source, template_class=cls)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 1092, in from_string
    return cls.from_code(self, self.compile(source), gs, None)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 757, in compile
    self.handle_exception(source=source_hint)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 925, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<unknown>", line 1, in template
jinja2.exceptions.TemplateSyntaxError: expected token 'end of print statement', got 'integer'

A colleague traced this to the generation of the template ids in https://github.com/intake/fsspec-reference-maker/blob/fda8a5318e307c23ef656257c0bc78262bcb9a6f/fsspec_reference_maker/combine.py#L131-L138 My conclusion is that jinja requires those ids to be valid python identifiers, but the 1909th template gets a id of 01 (there are also single digit ids earlier, but those are fine, apparently).

Side note: I don’t understand why it uses itertools.combinations(..., i) and not itertools.product(..., repeat=i), which would allow more values with fewer characters

To fix this here I’d either filter out invalid identifiers:

for tup in itertools.combinations(string.ascii_letters + string.digits, i): 
    if tup[0].isdigit():
        continue
    yield "".join(tup)

or always prefix the id with a character (this results in a bigger file size, though):

for tup in itertools.combinations(string.ascii_letters + string.digits, i): 
    yield "i" + "".join(tup)

but I may be missing a better way. In any case, both fix my issue.

If any of these sound good we can send in a PR.

cc @tinaok @HectorBaril

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:5 (5 by maintainers)

github_iconTop GitHub Comments

1reaction
martindurantcommented, Oct 15, 2021

Are they only used to reduce the filesize

Yes. In theory, if using jinja, you could make a more complex expression, and the spec allows for this, but none of the references we generate make use of anything ecept a simple replace.

0reactions
keewiscommented, Oct 16, 2021

There is also an argument to not making templates at all

I did notice that, but I didn’t know how important the templates really are, or if they influenced the open time. Are they only used to reduce the filesize?

Edit: and thanks for the tip, I will try compressing the resulting json file

Read more comments on GitHub >

github_iconTop Results From Across the Web

invalid jinja templates for a bigger number of files #86 - GitHub
When trying to open a combined JSON file for more than 1908 netcdf files with more than 5 variables (or when setting templates_count...
Read more >
Template Designer Documentation - Jinja
A Jinja template is simply a text file. Jinja can generate any text-based format (HTML, XML, CSV, LaTeX, etc.). A Jinja template doesn't...
Read more >
Multiple level template inheritance in Jinja2? - Stack Overflow
One of the best way to achieve multiple level of templating using jinja2 is to use 'include' let say you have 'base_layout.html' as...
Read more >
Primer on Jinja Templating - Real Python
When you want to create text files with programmatic content, Jinja can help you out. In this tutorial, you'll learn how to: Install...
Read more >
Built-in template tags and filters - Django documentation
This document describes Django's built-in template tags and filters. It is recommended that you use the automatic documentation, if available, ...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found