extracting all etext numbers, titles, and authors

How can I extract all the etext numbers, titles, and authors from the cache?

In other words, how can I first create a list of all the etext identifiers, titles, or authors, and then loop through the list(s) to download the actual texts? Based on the documentation (and my ability to read the actual function calls), I don’t see a way to do something like this:

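from gutenberg.query import get_etexts
from gutenberg.cleanup import strip_headers
from gutenberg.acquire import load_etext
# hypothetical: '*' as a wildcard that matches every title
keys = get_etexts('title', '*')
for key in keys:
  text = strip_headers(load_etext(key)).strip()
  # print the text together with the file name it could be saved under
  print(text, str(key) + '.txt')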

Put yet another way, the system seems to be implemented such that one needs to know an exact etext number, title, or author value before they are able to successfully query the cache.

P.S. I guess I’d really like a list of all the valid etext numbers or authors rather than the titles.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
ericleasemorgan commented, Apr 26, 2019

@hugovk, yes, thank you. This helps. It is not exactly what I was seeking, but it is a step in the right direction. I have downloaded and am running your code. It is outputting the JSON file, which I will loop through to fill an SQLite database, and from there do various types of additional indexing.

I appreciate the good work done by @c-w, and I thought about sequentially looping through the etext identifiers starting at #1 and continuing until getting an error, but I was wondering whether or not some identifiers were missing for one reason or another.

Put another way, both of you (@c-w and @hugovk) saved me a lot of time. Thank you.
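
For readers weighing the sequential-scan idea mentioned in this comment, here is a minimal sketch of a loop that tolerates gaps instead of stopping at the first error. Everything in it is an assumption for illustration: `scan_identifiers`, `fetch`, and `max_consecutive_misses` are hypothetical names, and the dictionary stands in for a real lookup, which would presumably wrap `load_etext`.

```python
from typing import Callable, Iterator, Optional


def scan_identifiers(
    fetch: Callable[[int], Optional[str]],
    start: int = 1,
    max_consecutive_misses: int = 50,
) -> Iterator[int]:
    """Yield every identifier for which fetch() returns something.

    fetch is a hypothetical callable: it returns the text for an
    identifier, or returns None / raises when the identifier is missing.
    """
    misses = 0
    current = start
    # Stop only after a long run of misses, so isolated gaps are skipped.
    while misses < max_consecutive_misses:
        try:
            found = fetch(current) is not None
        except Exception:  # broad on purpose: the real error type is unknown here
            found = False
        if found:
            misses = 0
            yield current
        else:
            misses += 1
        current += 1


# Stand-in data with gaps at 3 and 6 (not real Gutenberg content).
fake_store = {1: "a", 2: "b", 4: "d", 5: "e", 7: "g"}
found_ids = list(scan_identifiers(fake_store.get, max_consecutive_misses=3))
print(found_ids)
```

The miss threshold is a guess at how long a gap could plausibly be; a threshold that is too small ends the scan early, which is one reason a metadata listing is the more reliable source of identifiers.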

0 reactions
ericleasemorgan commented, Apr 29, 2019

Thank you for the prompt reply. Using the existing triple store, extracting the metadata, and then filling my own local SQL database seems robust and straightforward.
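
The last step described in this comment, filling a local SQL database with extracted metadata, might look roughly like the sketch below. It uses only Python's standard `sqlite3` module. The record layout (`id`, `title`, `author`), the table name `etexts`, and the helper `fill_database` are all assumptions; the actual fields depend on how the metadata was extracted, and the rows shown are placeholders, not real catalogue entries.

```python
import sqlite3

# Placeholder records; the real field names and values are assumptions.
records = [
    {"id": 1, "title": "Placeholder Title One", "author": "Placeholder Author A"},
    {"id": 2, "title": "Placeholder Title Two", "author": "Placeholder Author B"},
    {"id": 4, "title": "Placeholder Title Four", "author": "Placeholder Author A"},
]


def fill_database(connection: sqlite3.Connection, rows: list) -> None:
    """Create the (hypothetical) etexts table and insert one row per record."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS etexts "
        "(id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    # Named placeholders map directly onto the dictionary keys.
    connection.executemany(
        "INSERT OR REPLACE INTO etexts (id, title, author) "
        "VALUES (:id, :title, :author)",
        rows,
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # use a file path for a persistent database
fill_database(connection, records)

# The list of valid identifiers then becomes a simple query.
all_ids = [row[0] for row in connection.execute("SELECT id FROM etexts ORDER BY id")]
print(all_ids)
```

Once the table exists, listing every identifier or every author is a single `SELECT`, which is the kind of lookup the original question asked for.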
