Is the newspaper.build method deterministic?

Whenever I call newspaper.build, I often get different results in the number of articles. If I’m lucky, I get A TON of articles, but sometimes I get very few or none at all.

I have been trying this with CNN, and I get very different results from one minute to the next. I am not sure what’s wrong.

I tried this using newspaper as installed from pip and I also set up this repository’s clone and downloaded all the prerequisites inside of virtualenv. Still same results.

I am not sure what else I can describe.

All tests are passing (5 are skipped though).

This is what I am experiencing.

>>> import newspaper
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
http://cnn.com/2016/05/06/technology/panama-papers-search/index.html
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
http://cnn.com/2016/05/06/opinions/sadiq-khan-london-mayor-ahmed/index.html
http://money.cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html
http://money.cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html?section=money_topstories
http://money.cnn.com/2016/05/05/news/verizon-strikes-temporary-relocation/index.html?section=money_topstories
http://cnn.com/2016/05/06/europe/uk-london-mayoral-race-sadiq-khan/index.html
>>> 

5 minutes later…

>>> import newspaper
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
http://cnn.com/videos/health/2016/05/06/teen-pageant-contestant-collapses-on-stage-pkg.kvly/video/playlists/cant-miss/
http://cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
>>> # nothing... 

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 12

Top GitHub Comments

yprez commented, May 10, 2016 (19 reactions)

@ijkilchenko @B0nzo93 after looking at build() a little bit, I think it’s related to caching…

http://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#article-caching

Can you try reproducing it with caching disabled?

e.g. cnn = newspaper.build('http://cnn.com', memoize_articles=False)

I’m getting 768 articles from cnn every time I run this… It’s either a bug with how caching works or simply the default behavior, not sure which since I never used build() before.
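To make the caching explanation concrete, here is a minimal toy model of the behavior described in the linked docs: a crawl that remembers the URLs it has already reported and skips them on later calls. This is only a sketch, not newspaper's actual code. The function `build_with_memo`, the `seen` set, and the example.com URLs are all made up for illustration.

```python
def build_with_memo(site_urls, seen, memoize=True):
    """Return the URLs a crawl would report (toy model, not newspaper's code).

    site_urls: URLs currently linked from the site (hypothetical input).
    seen: set of URLs reported by earlier calls; updated in place.
    """
    if not memoize:
        # With caching off, every URL on the site is reported every time.
        return list(site_urls)
    fresh = [url for url in site_urls if url not in seen]
    seen.update(fresh)
    return fresh


seen = set()
front_page = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
]

first = build_with_memo(front_page, seen)    # everything is new
second = build_with_memo(front_page, seen)   # nothing is new

# Later the site publishes one more article.
front_page_later = front_page + ["http://example.com/d"]
third = build_with_memo(front_page_later, seen)
uncached = build_with_memo(front_page_later, seen, memoize=False)

print(len(first), len(second), len(third), len(uncached))
```

Under this model, a call that comes back nearly empty just means the site has published little since the previous call, which would be consistent with the varying counts in the original report.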

ijkilchenko commented, May 10, 2016 (4 reactions)

So I guess everything is actually working as expected now that I am aware of caching.

Here’s a little experiment. I check whether the second call to build returns anything that the first call already returned. If it doesn’t, then caching works as expected. I also repeat the experiment with caching turned off and see that the two calls return many of the same URLs.

>>> import newspaper
>>> cnn1 = newspaper.build('http://cnn.com')
>>> urls1 = set([article.url for article in cnn1.articles])
>>> cnn2 = newspaper.build('http://cnn.com')
>>> urls2 = set([article.url for article in cnn2.articles])
>>> urls1.intersection(urls2)
set() # no urls are shared between calls when caching is on
>>> cnn1_fresh = newspaper.build('http://cnn.com', memoize_articles=False)
>>> urls1_fresh = set([article.url for article in cnn1_fresh.articles])
>>> cnn2_fresh = newspaper.build('http://cnn.com', memoize_articles=False)
>>> urls2_fresh = set([article.url for article in cnn2_fresh.articles])
>>> len(urls1_fresh.intersection(urls2_fresh))
1078 # the same urls are returned because caching is off
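The comparison in the experiment above could be wrapped in a small helper so it is easy to repeat. This is a rough sketch that works on plain collections of URL strings, so it does not touch newspaper at all. The name `compare_runs` and the sample URLs are hypothetical.

```python
def compare_runs(first_urls, second_urls):
    """Summarize how two collections of article URLs overlap."""
    first, second = set(first_urls), set(second_urls)
    return {
        "shared": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
    }


# Made-up URLs standing in for [article.url for article in source.articles].
run1 = ["http://example.com/a", "http://example.com/b"]
run2 = ["http://example.com/b", "http://example.com/c"]

summary = compare_runs(run1, run2)
print(summary)
```

With caching on, "shared" should stay at 0 between consecutive calls. With memoize_articles=False, it should be close to the size of each run.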

I guess I should have RTFM’ed first. Closing issue.