Is newspaper.build method deterministic?
Whenever I call newspaper.build, I often get different results in the number of articles. If I’m lucky, I get A TON of articles, but sometimes I get very few or none at all.
I have been trying this with cnn and I get very different results from one minute to the next and I am not sure what’s wrong.
I tried this using newspaper as installed from pip, and I also set up a clone of this repository and installed all the prerequisites inside a virtualenv. Still the same results.
I am not sure what else I can describe.
All tests are passing (5 are skipped though).
This is what I am experiencing.
>>> import newspaper
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
...
http://cnn.com/2016/05/06/technology/panama-papers-search/index.html
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
...
http://cnn.com/2016/05/06/opinions/sadiq-khan-london-mayor-ahmed/index.html
http://money.cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html
http://money.cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html?section=money_topstories
http://money.cnn.com/2016/05/05/news/verizon-strikes-temporary-relocation/index.html?section=money_topstories
http://cnn.com/2016/05/06/europe/uk-london-mayoral-race-sadiq-khan/index.html
>>>
5 minutes later…
>>> import newspaper
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
...
http://cnn.com/videos/health/2016/05/06/teen-pageant-contestant-collapses-on-stage-pkg.kvly/video/playlists/cant-miss/
http://cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
...
>>> # nothing...
Issue Analytics
- State:
- Created 7 years ago
- Reactions:1
- Comments:12
Top GitHub Comments
@ijkilchenko @B0nzo93 after looking into build() a little bit, I think it’s related to caching…
http://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#article-caching
Can you try reproducing it with caching disabled?
e.g.
cnn = newspaper.build('http://cnn.com', memoize_articles=False)
I’m getting 768 articles from cnn every time I run this… It’s either a bug with how caching works or simply the default behavior, not sure which since I never used build() before.
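To illustrate why caching could produce shrinking or empty results, here is a toy model of URL memoization. It is only a sketch of the idea and not newspaper's actual implementation: the function `build_with_memo`, the `seen` set, and the sample paths are all hypothetical.

```python
def build_with_memo(found_urls, seen, memoize_articles=True):
    """Return the URLs from one crawl, skipping any returned by earlier calls.

    Hypothetical stand-in for a memoizing build; `seen` plays the role
    of the on-disk cache of already-returned article URLs.
    """
    if not memoize_articles:
        # Caching disabled: every crawl returns everything it found.
        return list(found_urls)
    fresh = [u for u in found_urls if u not in seen]
    seen.update(fresh)
    return fresh


seen = set()
crawl = ["/a", "/b", "/c"]  # pretend the site lists the same three articles

first = build_with_memo(crawl, seen)   # all three URLs are new
second = build_with_memo(crawl, seen)  # nothing new, so the list is empty
uncached = build_with_memo(crawl, seen, memoize_articles=False)  # all three again
```

Under this model, repeated calls against a site that changes slowly return only the handful of articles published since the previous call, which would match the small and varying counts in the session above.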
So I guess everything is actually working as expected now that I am aware of caching.
Here’s a little experiment. I compare whether calling build returns anything the second time that was not returned the first time. If it doesn’t, then the caching works as expected. I also do this experiment with caching turned off and see that some urls are the same between calling it the first and second time.
I guess I should have RTFM’ed first. Closing issue.
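A minimal sketch of the experiment described above, assuming only the `newspaper.build(url, memoize_articles=...)` call shown earlier in the thread. The helper names `article_urls` and `compare_builds`, and the injectable `build` parameter, are hypothetical and exist only for illustration.

```python
def article_urls(source):
    """Collect the article URLs of a built source into a set."""
    return {article.url for article in source.articles}


def compare_builds(url, memoize_articles=True, build=None):
    """Build the same site twice and summarize how the two URL sets overlap."""
    if build is None:
        # Imported lazily so the helpers can be exercised without the library.
        import newspaper
        build = newspaper.build
    first = article_urls(build(url, memoize_articles=memoize_articles))
    second = article_urls(build(url, memoize_articles=memoize_articles))
    return {
        "first": len(first),
        "second": len(second),
        "repeated": len(first & second),       # URLs returned by both calls
        "new_in_second": len(second - first),  # URLs only the second call returned
    }
```

Calling `compare_builds('http://cnn.com')` and then `compare_builds('http://cnn.com', memoize_articles=False)` gives the two runs to compare. If caching behaves as the thread suggests, `repeated` should be zero with caching on and non-zero with it off; the exact counts depend on what the site is serving at that moment.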