extracting all etext numbers, titles, and authors
How can I extract all the etext numbers, titles, and authors from the cache?
In other words, how can I first create a list of all the etext identifiers, titles, or authors, and then loop through the list(s) to download the actual text? Based on the documentation (and my ability to read the actual function calls), I don’t see a way to do something like this:
from gutenberg.query import get_etexts
from gutenberg.cleanup import strip_headers
from gutenberg.acquire import load_etext
# hypothetical wildcard query; as far as I can tell this is not supported
keys = get_etexts('title', '*')
for key in keys:
    text = strip_headers(load_etext(key)).strip()
    print(text, str(key) + '.txt')
Put yet another way, the system seems to be implemented such that one needs to know an exact etext number, title, or author value before they are able to successfully query the cache.
P.S. I guess I’d really like a list of all the valid etext numbers or authors rather than the titles.
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (2 by maintainers)
Top GitHub Comments
@hugovk, yes, thank you. This helps. It is not exactly what I was seeking, but it is a step in the right direction. I have downloaded and am running your code. It is outputting the JSON file, which I will loop through to fill an SQLite database, and from there do various types of additional indexing.
I appreciate the good work done by @c-w, and I thought about sequentially looping through the etext identifiers starting at #1 and continuing until getting an error, but I was wondering whether or not some identifiers were missing for one reason or another.
Put another way, both of you (@c-w and @hugovk) saved me a lot of time. Thank you.
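For anyone who still wants to try the sequential approach mentioned above, here is a minimal sketch of a loop that records missing identifiers instead of stopping at the first error. The helper name `fetch_with_gaps` and the stand-in `fake_fetch` are hypothetical and only there for illustration; with the real library, one would presumably pass something like `lambda n: strip_headers(load_etext(n)).strip()` as the fetcher.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_with_gaps(
    ids: Iterable[int], fetch: Callable[[int], str]
) -> Tuple[Dict[int, str], List[int]]:
    """Try every identifier and collect failures rather than aborting."""
    texts: Dict[int, str] = {}
    missing: List[int] = []
    for etext_id in ids:
        try:
            texts[etext_id] = fetch(etext_id)
        except Exception:
            # Any failure is treated as "this identifier is unavailable".
            missing.append(etext_id)
    return texts, missing


def fake_fetch(etext_id: int) -> str:
    """Hypothetical stand-in for a real download; pretends 3 and 5 are gaps."""
    if etext_id in (3, 5):
        raise LookupError(f"no text for identifier {etext_id}")
    return f"text of {etext_id}"


texts, missing = fetch_with_gaps(range(1, 8), fake_fetch)
print("fetched:", sorted(texts))
print("missing:", missing)
```

Because failures are collected rather than raised, a gap in the numbering does not end the run early, which was the concern with stopping at the first error.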
Thank you for the prompt reply. Using the existing triple store, extracting the metadata, and then filling my own local SQL database seems robust and straightforward.
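A rough sketch of the "fill my own local SQL database" step, using only the standard library. The JSON layout here (etext identifier mapped to lists under `title` and `author`) is an assumption about what the metadata export looks like, and the table name, column names, and `load_into_sqlite` helper are hypothetical. The sample records are illustrative only.

```python
import json
import sqlite3

# Illustrative sample in the ASSUMED export shape: id -> lists of values.
sample_json = """
{
  "11": {"title": ["Alice's Adventures in Wonderland"], "author": ["Carroll, Lewis"]},
  "1342": {"title": ["Pride and Prejudice"], "author": ["Austen, Jane"]},
  "99999": {"title": [], "author": []}
}
"""


def load_into_sqlite(metadata: dict, conn: sqlite3.Connection) -> int:
    """Insert one row per etext identifier; multi-valued fields are joined."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etexts "
        "(id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    rows = [
        (
            int(etext_id),
            "; ".join(fields.get("title", [])),
            "; ".join(fields.get("author", [])),
        )
        for etext_id, fields in metadata.items()
    ]
    conn.executemany("INSERT OR REPLACE INTO etexts VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # swap for a file path to persist
inserted = load_into_sqlite(json.loads(sample_json), conn)
ids = [row[0] for row in conn.execute("SELECT id FROM etexts ORDER BY id")]
print(inserted, ids)
```

Once the table exists, the list of valid identifiers is just a `SELECT id FROM etexts`, and each one can then be passed to `load_etext` in a loop.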