Is the newspaper.build method deterministic?

Whenever I call newspaper.build, I often get different results in the number of articles. If I’m lucky, I get A TON of articles, but sometimes I get very few or none at all.

I have been trying this with cnn and I get very different results from one minute to the next and I am not sure what’s wrong.

I tried this using newspaper as installed from pip and I also set up this repository’s clone and downloaded all the prerequisites inside of virtualenv. Still same results.

I am not sure what else I can describe.

All tests are passing (5 are skipped though).

This is what I am experiencing.

>>> import newspaper
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
http://cnn.com/2016/05/06/technology/panama-papers-search/index.html
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
http://cnn.com/2016/05/06/opinions/sadiq-khan-london-mayor-ahmed/index.html
http://money.cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html
http://money.cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html?section=money_topstories
http://money.cnn.com/2016/05/05/news/verizon-strikes-temporary-relocation/index.html?section=money_topstories
http://cnn.com/2016/05/06/europe/uk-london-mayoral-race-sadiq-khan/index.html
>>>

5 minutes later…

>>> import newspaper
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
http://cnn.com/videos/health/2016/05/06/teen-pageant-contestant-collapses-on-stage-pkg.kvly/video/playlists/cant-miss/
http://cnn.com/2016/05/06/news/economy/london-mayor-sadiq-khan/index.html
>>> p = newspaper.build('http://cnn.com')
>>> for article in p.articles:
...     print(article.url)
... 
>>> # nothing...

Top GitHub Comments

19 reactions

yprez commented, May 10, 2016

@ijkilchenko @B0nzo93 after looking into build() a little bit, I think it’s related to caching…

http://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#article-caching

Can you try reproducing it with caching disabled?

e.g. cnn = newspaper.build('http://cnn.com', memoize_articles=False)

I’m getting 768 articles from cnn every time I run this… It’s either a bug with how caching works or simply the default behavior, not sure which since I never used build() before.

4 reactions

ijkilchenko commented, May 10, 2016

So I guess everything is actually working as expected now that I am aware of caching.

Here’s a little experiment. I check whether the second call to build returns any URL that the first call already returned. If it doesn’t, then caching works as expected. I then repeat the experiment with caching turned off and see that URLs are shared between the first and second call.

>>> import newspaper
>>> cnn1 = newspaper.build('http://cnn.com')
>>> urls1 = set([article.url for article in cnn1.articles])
>>> cnn2 = newspaper.build('http://cnn.com')
>>> urls2 = set([article.url for article in cnn2.articles])
>>> urls1.intersection(urls2)
set() # no urls are shared between calls when caching is on
>>> cnn1_fresh = newspaper.build('http://cnn.com', memoize_articles=False)
>>> urls1_fresh = set([article.url for article in cnn1_fresh.articles])
>>> cnn2_fresh = newspaper.build('http://cnn.com', memoize_articles=False)
>>> urls2_fresh = set([article.url for article in cnn2_fresh.articles])
>>> len(urls1_fresh.intersection(urls2_fresh))
1078 # the same urls are returned because caching is off

I guess I should have RTFM’ed first. Closing issue.
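
For anyone landing here later, the behavior in the experiment above can be mimicked with a toy model. This is only a sketch of the idea (filter out URLs that an earlier call already returned), not newspaper's actual implementation; `fake_build` and `_seen_urls` are hypothetical names made up for illustration, and the URLs are placeholders.

```python
# Toy model of URL memoization; NOT newspaper's actual implementation.
# `fake_build` and `_seen_urls` are hypothetical names used for illustration.

_seen_urls = set()  # stands in for the cache of already-returned URLs


def fake_build(site_urls, memoize_articles=True):
    """Return the article URLs a build-like call would report."""
    if not memoize_articles:
        # Caching off: report everything currently on the site.
        return sorted(site_urls)
    # Caching on: report only URLs not returned by an earlier call.
    new_urls = sorted(set(site_urls) - _seen_urls)
    _seen_urls.update(new_urls)
    return new_urls


front_page = ["/a", "/b", "/c"]

first = fake_build(front_page)            # every URL is new
second = fake_build(front_page)           # nothing new since the first call
third = fake_build(front_page + ["/d"])   # only the newly published URL
fresh = fake_build(front_page, memoize_articles=False)  # full list again

print(first, second, third, fresh)
```

In this model, repeated calls with caching on return shrinking or empty lists even though the site has not changed, which matches what the question describes; passing memoize_articles=False gives the full list every time.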
