extracting all etext numbers, titles, and authors

How can I extract all the etext numbers, titles, and authors from the cache?

In other words, how can I first create a list of all the etext identifiers, titles, or authors, and then loop through the list(s) to download the actual texts? Based on the documentation (and my ability to read the actual function calls), I don’t see a way to do something like this:

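from gutenberg.query import get_etexts
from gutenberg.cleanup import strip_headers
from gutenberg.acquire import load_etext

# Hypothetical: a '*' wildcard that would return every etext number
keys = get_etexts('title', '*')
for key in keys:
    text = strip_headers(load_etext(key)).strip()
    # convert the key to a string before building the file name
    print(text, str(key) + '.txt')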

Put yet another way, the system seems to be implemented such that one needs to know an exact etext number, title, or author value before they are able to successfully query the cache.

P.S. I guess I’d really like a list of all the valid etext numbers or authors rather than the titles.

Issue Analytics

Created 4 years ago
Comments: 6 (2 by maintainers)

Top GitHub Comments

ericleasemorgan commented, Apr 26, 2019

@hugovk, yes, thank you. This helps. It is not exactly what I was seeking, but it is a step in the right direction. I have downloaded and am running your code. It is outputting the JSON file, which I will loop through to fill an SQLite database, and from there do various types of additional indexing.
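The JSON-to-SQLite step described above could be sketched roughly as follows. The layout of the JSON file is not shown in this thread, so the record keys (`id`, `title`, `author`), the table name `etexts`, and the helper `load_catalog` are all assumptions made for illustration, not the actual output format of the code mentioned.

```python
import json
import sqlite3


def load_catalog(records, db_path=":memory:"):
    """Insert (id, title, author) records into a SQLite table; return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etexts ("
        "id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    # Assumed record shape: {"id": ..., "title": ..., "author": ...}
    rows = [(int(r["id"]), r.get("title"), r.get("author")) for r in records]
    # INSERT OR REPLACE keeps the load idempotent if an id appears twice
    conn.executemany(
        "INSERT OR REPLACE INTO etexts (id, title, author) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


# Placeholder records standing in for the contents of the real JSON file
sample = json.loads(
    '[{"id": 1, "title": "Example Title", "author": "Example, Author"},'
    ' {"id": 2, "title": "Another Title", "author": null}]'
)
conn = load_catalog(sample)
print(conn.execute("SELECT id, title FROM etexts ORDER BY id").fetchall())
```

With a real file, `records` would come from `json.load(open(path))`, and passing a file name as `db_path` would persist the table for later indexing.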

I appreciate the good work done by @c-w, and I thought about sequentially looping through the etext identifiers starting at #1 and continuing until getting an error, but I was wondering whether or not some identifiers were missing for one reason or another.
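If identifiers can be missing, stopping at the first error would cut the walk short. A minimal sketch of the alternative is to try every identifier up to a fixed bound and record the failures. Here `collect_valid_ids` and `fake_fetch` are hypothetical names; `fake_fetch` only stands in for a real loader such as `load_etext`, and the broad `except Exception` is a placeholder for whichever exception the real loader raises.

```python
def collect_valid_ids(fetch, start=1, stop=100):
    """Try every identifier in [start, stop]; return (found, missing) lists."""
    found, missing = [], []
    for etext_id in range(start, stop + 1):
        try:
            fetch(etext_id)
        except Exception:  # assumption: narrow this to the loader's real exception
            missing.append(etext_id)
        else:
            found.append(etext_id)
    return found, missing


# Hypothetical loader that simulates gaps at identifiers 3 and 5
def fake_fetch(etext_id):
    if etext_id in {3, 5}:
        raise ValueError(f"no text for {etext_id}")
    return f"text {etext_id}"


found, missing = collect_valid_ids(fake_fetch, start=1, stop=6)
print(found, missing)  # [1, 2, 4, 6] [3, 5]
```

The upper bound has to be chosen by the caller, since a gap cannot be told apart from the end of the catalogue by errors alone.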

Put another way, both of you (@c-w and @hugovk) saved me a lot of time. Thank you.

ericleasemorgan commented, Apr 29, 2019

Thank you for the prompt reply. Using the existing triple store, extracting the metadata, and then filling my own local SQL database seems robust and straightforward.