Inconsistent type of kwarg 'store' across write and update

The eager write functions appear to expect the 'store' argument to be a store object, whereas the update function appears to expect a factory (a Python callable). Can this be standardized one way or the other, please?

   import numpy as np
   import pandas as pd
   from functools import partial
   from storefact import get_store_from_url
   from tempfile import TemporaryDirectory
   from kartothek.io.eager import store_dataframes_as_dataset
   from kartothek.io.eager import update_dataset_from_dataframes

   df = pd.DataFrame(
       {
           "A": 1.,
           "B": pd.Timestamp("20130102"),
           "C": pd.Series(1, index=list(range(4)), dtype="float32"),
           "D": np.array([3] * 4, dtype="int32"),
           "E": pd.Categorical(["test", "train", "test", "train"]),
           "F": "foo",
       }
   )

   dataset_dir = TemporaryDirectory()
   store = get_store_from_url(f"hfs://{dataset_dir.name}")

   dm = store_dataframes_as_dataset(
       store,  # a store object works fine here
       "a_unique_dataset_identifier",
       df,
       metadata_version=4,
   )

   another_df = pd.DataFrame(
       {
           "A": 2.,
           "B": pd.Timestamp("20190604"),
           "C": pd.Series(2, index=list(range(4)), dtype="float32"),
           "D": np.array([6] * 4, dtype="int32"),
           "E": pd.Categorical(["test", "train", "test", "train"]),
           "F": "bar",
       }
   )

   store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")

   dm = update_dataset_from_dataframes(
       [another_df],
       store=store_factory,  # but this needs to be a callable
       dataset_uuid="a_unique_dataset_identifier",
   )
   dm
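
Until the two signatures are aligned, calling code can paper over the difference by normalizing whatever it has into a factory. The following is a minimal sketch, assuming a hypothetical helper named ensure_store_factory that is not part of kartothek. The DummyStore class is only a placeholder for a real store object.

```python
def ensure_store_factory(store):
    """Return a zero-argument callable that yields a store.

    Hypothetical helper for illustration; not part of kartothek's API.
    """
    if callable(store):
        # Already a factory: pass it through unchanged.
        return store
    # Wrap an existing store object so the result can always be called.
    return lambda: store


class DummyStore:
    """Placeholder standing in for a real store object."""


dummy_store = DummyStore()
factory_from_object = ensure_store_factory(dummy_store)
factory_from_callable = ensure_store_factory(DummyStore)
```

Note that the lambda wrapper only helps in a single process. It returns the same store instance on every call, so it would not be suitable for sending to other workers.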

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

1 reaction
xhochy commented, Jun 5, 2019

Thanks @fjetter. Out of curiosity and for the wider user-base, what’s the reasoning behind the choice of store factories?

Store objects encapsulate connections to a storage service. In the methods that have a distributed computing backend, we pass the function arguments via pickle to the other workers. While pickle can preserve the state of the attributes of an object, the connections it holds are no longer valid / cannot be transferred between processes. Thus we pass callables so that on each worker a new connection can be instantiated.
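
The explanation above can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: FakeStore and make_store are hypothetical names, and a threading.Lock stands in for a live connection. A real store may pickle without error and only fail later when its connection is used. Here the failure is made immediate so that the contrast with a factory is easy to see.

```python
import pickle
import threading
from functools import partial


class FakeStore:
    """Hypothetical stand-in for a store that holds a live connection."""

    def __init__(self, url):
        self.url = url
        # A lock stands in for a socket or session: it cannot be pickled.
        self._connection = threading.Lock()


def make_store(url):
    """Module-level factory, so pickle can reference it by name."""
    return FakeStore(url)


store = FakeStore("hfs:///tmp/data")
try:
    pickle.dumps(store)
    store_is_picklable = True
except TypeError:
    store_is_picklable = False

# The factory carries only a function reference and the URL string.
store_factory = partial(make_store, "hfs:///tmp/data")
restored_factory = pickle.loads(pickle.dumps(store_factory))

# Each "worker" calls the factory to open its own fresh connection.
worker_store = restored_factory()
```

The factory survives the round trip through pickle because it contains no connection state, only what is needed to create one.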

0 reactions
kagharpure commented, Jun 17, 2019

@lr4d - Agreed. After posting that comment, I realized that there’s already an issue (#44) about documenting store factories. Adding a Gotchas document a bit further down the line may be a good idea; it could have a section on store factories and the reasoning behind them (as well as pitfalls, best practices, etc.).
