
[Meta] Project refactoring


Note: this description has been updated to reflect changes requested in the comments.

The goal is to rework the script module to allow more flexibility and clearly separate concerns.

First, about the module name: script will be renamed to wikidict.

Overview

I would like to see the module split into 4 parts (each part will be independent from the others and can be replayed & extended easily). This will also make it easier to leverage parallel processing to speed up the whole process.

  1. Download the data (#466)
  2. Parse and store raw data (#469)
  3. Render templates and store results (#469)
  4. Output to the proper eBook reader format

I have in mind a SQLite database where raw data will be stored and updated when needed. Subsequent parts will then use only the data from the database. It should speed up regenerating a whole dictionary when we update a template.

Then, each and every part will have its own CLI:

$ python -m wikidict --download ...
$ python -m wikidict --parse ...
$ python -m wikidict --render ...
$ python -m wikidict --output ...

And the all-in-one operation would be:

$ python -m wikidict --run ...
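
As a rough illustration, the dispatch in wikidict/__main__.py could look like this. This is a minimal sketch: the flag names come from this proposal, but the step execution is a placeholder.

# wikidict/__main__.py — minimal dispatch sketch; the real steps would
# live in the four submodules described above.
import argparse
import sys

STEPS = ("download", "parse", "render", "output")


def main() -> int:
    parser = argparse.ArgumentParser(prog="wikidict")
    group = parser.add_mutually_exclusive_group(required=True)
    for step in STEPS + ("run",):
        group.add_argument(f"--{step}", action="store_true")
    parser.add_argument("locale", help="Wiktionary locale, e.g. fr")
    args = parser.parse_args()

    steps = STEPS if args.run else [s for s in STEPS if getattr(args, s)]
    for step in steps:
        print(f"[{args.locale}] running step: {step}")  # placeholder
    return 0


if __name__ == "__main__":
    sys.exit(main())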

Side note: we could use an entry point so that we only have to type wikidict instead of python -m wikidict.
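
For reference, with setuptools the entry point could be declared like this (a sketch; the main() function in wikidict/__main__.py is an assumption):

# setup.py — console_scripts entry point so that typing "wikidict" works.
from setuptools import setup, find_packages

setup(
    name="wikidict",
    packages=find_packages(),
    entry_points={
        "console_scripts": ["wikidict = wikidict.__main__:main"],
    },
)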

Splitting get.py

Here we are talking about parts 1 and 2.

Part 1 is already almost fine as-is; we just need to move the code into its own submodule. We could improve the CLI by allowing the Wiktionary dump date to be passed as an argument, instead of relying on an environment variable.

Part 2 is only a matter of parsing the big XML file and storing raw data into a SQLite database. I am thinking of using this schema:

table: Word
fields:
    - word: varchar(256)
    - code: text
index on: word

table: Render
fields:
    - word_id: int
    - nature: varchar(16)
    - text: text
foreign key: word_id (Word._rowid_)
  • The Word table will contain raw data from the Wiktionary.
  • The Render table will be used to store the transformed text for a given word (after it has been cleaned up and its templates processed). It will allow multiple texts for a given word (noun 1, noun 2, verb, adjective, …).

We will have one database per locale, located at data/$LOCALE/$WIKIDUMP_DATE.db.
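
Here is a sketch of that schema using the standard sqlite3 module. One assumption: an explicit id INTEGER PRIMARY KEY is added to Word, because in SQLite a foreign key cannot reference the implicit rowid directly; such a column is simply an alias for the rowid.

import sqlite3

# Example path following data/$LOCALE/$WIKIDUMP_DATE.db
conn = sqlite3.connect("data/fr/20201201.db")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS Word (
        id INTEGER PRIMARY KEY,  -- alias of the SQLite rowid
        word VARCHAR(256),
        code TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_word_word ON Word (word);

    CREATE TABLE IF NOT EXISTS Render (
        word_id INTEGER REFERENCES Word (id),
        nature VARCHAR(16),
        text TEXT
    );
    """
)
conn.commit()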

At the download step, if no database exists locally, it will be retrieved from GitHub releases, where databases will be saved alongside dictionaries. This is a cool thing IMO: everyone will have the same up-to-date local database. Of course, we will have options to skip that step if the local file already exists, or to force the download.

At the parse step, we will have to find a way to prevent parsing again if we run the command twice on the same Wiktionary dump. I was thinking of using the PRAGMA user_version, which would contain the Wiktionary dump date as an integer. It would be set only after the full parsing is done successfully.
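
For instance (a sketch, assuming the dump date is encoded as an integer like 20201201):

import sqlite3


def already_parsed(db_path: str, dump_date: int) -> bool:
    """Return True if this dump was already fully parsed."""
    with sqlite3.connect(db_path) as conn:
        (version,) = conn.execute("PRAGMA user_version").fetchone()
    return version == dump_date


def mark_parsed(db_path: str, dump_date: int) -> None:
    """Record the dump date; call only after a fully successful parse."""
    with sqlite3.connect(db_path) as conn:
        # PRAGMA does not support parameter binding; dump_date is an int,
        # so the f-string is safe here.
        conn.execute(f"PRAGMA user_version = {dump_date}")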

Splitting convert.py

Here we are talking about parts 3 and 4.

Part 3 will call clean() and process_templates() on the wikicode, and store the result in the Render table. This is the most time- and CPU-consuming part. It will be parallelized.
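
Something along these lines, as a sketch; the exact signatures of clean() and process_templates() are assumptions here, not the real ones:

# Part 3 sketch: render all words in parallel.
from multiprocessing import Pool


def render_word(row):
    word, code = row
    wikicode = clean(code)                     # assumed signature
    return word, process_templates(wikicode)   # assumed signature


def render_all(rows, processes: int = 4):
    # Each (word, rendered_text) pair would then be stored in Render.
    with Pool(processes) as pool:
        return pool.map(render_word, rows)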

Part 4 will rethink how we handle dictionary output, so that more formats can be added easily.

I was thinking of using a class with these methods (I have not thought it through; I am just proposing the idea):

from pathlib import Path


class BaseFormat:

    __slots__ = ("locale", "output_dir")

    def __init__(self, locale: str, output_dir: Path) -> None:
        self.locale = locale
        self.output_dir = output_dir

    def process(self, words) -> None:
        raise NotImplementedError()

    def save(self, wordlist, groups, variants) -> None:
        raise NotImplementedError()


class KoboFormat(BaseFormat):
    def process(self, words) -> None:
        # make_groups(), make_variants() and process_word() are
        # format-specific helpers, left out of this sketch.
        groups = self.make_groups(words)
        variants = self.make_variants(words)

        wordlist = [self.process_word(word) for word in words]

        self.save(wordlist, groups, variants)

    def save(self, wordlist, groups, variants) -> None:
        ...

That part is far from being finished, but once we have a fully working format, our code will use something like this to generate the dictionary files:

from multiprocessing import Pool

# Get all registered formats
formaters = get_formaters()

# Get all words from the database
words = get_words()

# And distribute the workload: one process per format.
# Note: run() must be defined at module level so that it can be pickled.
def run(cls):
    formater = cls(locale, output_dir)
    formater.process(words)

with Pool(len(formaters)) as pool:
    pool.map(run, formaters)
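
get_formaters() does not exist yet; one simple possibility (a sketch, not a final design) is to discover formats through subclassing:

# A possible get_formaters(): every output format subclasses BaseFormat,
# so new formats are picked up automatically.
def get_formaters():
    return BaseFormat.__subclasses__()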

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 26 (12 by maintainers)

Top GitHub Comments

1 reaction
lasconic commented, Dec 15, 2020

test_$LOCALE.py was easy 😉 see PR #478. Good luck with the test_N_*.py.

1 reaction
lasconic commented, Dec 14, 2020

It was faster on the second run with the profiler running… Less than 6 minutes https://gist.github.com/lasconic/063a2e71a4300c2815251d68832270d8
