Duplicate content on certain site

I tried newspaper on this page: https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006

article.article_html returns duplicate content. If you search that page for “EMP”, you will find only one match (even in the HTML source), but in article_html it appears twice. Here is my code:

from newspaper import Article, Config as NewspaperConfig

url = "https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006"
conf = NewspaperConfig()
# keep_article_html=True keeps the extracted markup in article.article_html
article = Article(url, config=conf, keep_article_html=True, language='ko')
article.download()
article.parse()
print(article.article_html)  # "EMP" appears twice in this output
print(article.text)
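
The duplication can also be checked without reading the output by eye, by counting a marker string before and after extraction. The following is a minimal sketch, not part of newspaper: `marker_counts` and `is_duplicated` are hypothetical helpers, and the two sample strings are invented stand-ins for `article.html` (the downloaded page) and `article.article_html` (the extracted markup).

```python
def marker_counts(page_html, extracted_html, marker):
    """Return (count in the page source, count in the extracted HTML)."""
    return page_html.count(marker), extracted_html.count(marker)


def is_duplicated(page_html, extracted_html, marker):
    """True when the marker occurs more often after extraction than before."""
    in_page, in_extracted = marker_counts(page_html, extracted_html, marker)
    return in_extracted > in_page


# Invented stand-ins for article.html and article.article_html.
page_html = "<html><body><div id='body'>EMP story text</div></body></html>"
extracted_html = "<div>EMP story text</div><div>EMP story text</div>"

counts = marker_counts(page_html, extracted_html, "EMP")
duplicated = is_duplicated(page_html, extracted_html, "EMP")
print(counts, duplicated)
```

With a real article, the same call would be `is_duplicated(article.html, article.article_html, "EMP")` after `article.parse()`. Note that `str.count` is case-sensitive, so “EMP” and “emp” are counted separately.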

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

mercuree commented, Aug 16, 2019

Yes, you are right. What I mean is that the original webpage contains only one match of “emp”. But if you run my code above and print(article.article_html), you will find two matches in the output, which means the content is repeated.

First, you can try applying my fix: https://github.com/codelucas/newspaper/pull/456. If that does not help, then it is probably a long-standing bug. I fixed it in my fork, but I am not sure the fix is done properly.

JohnChu101 commented, Oct 8, 2020

It seems to be removing the spaces before and after inline elements. For example,

<p>We released <a href="https://www.google.com/" target="_blank">a new video</a> here. <a href="https://www.google.com/" target="_blank">Click here to watch it now</a>.</p>

We released a new video here. Click here to watch it now.

will be converted to

<p>We released <a href="https://www.google.com/" target="_blank">a new video</a> here. <a href="https://www.google.com/" target="_blank">Click here to watch it now</a>.</p>

We releaseda new videohere.Click here to watch it now.

The same thing happens with bold text.
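
The symptom above is what one would see if each text node were stripped before the pieces are joined. The sketch below only illustrates that effect with the standard library's `html.parser`; it is an assumption about the kind of mistake involved, not newspaper's actual code, and `TextChunks` and `text_chunks` are hypothetical names.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect every text node of an HTML fragment, untouched."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_chunks(html):
    """Return the list of text nodes found in the fragment."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks


SAMPLE = (
    '<p>We released <a href="https://www.google.com/" target="_blank">'
    'a new video</a> here. <a href="https://www.google.com/" '
    'target="_blank">Click here to watch it now</a>.</p>'
)

chunks = text_chunks(SAMPLE)
# Stripping each node drops the spaces that separate inline elements.
stripped = "".join(chunk.strip() for chunk in chunks)
# Joining the nodes as they are keeps the original spacing.
preserved = "".join(chunks)
print(stripped)
print(preserved)
```

The spaces around the links live at the edges of the neighbouring text nodes, so any per-node `strip()` removes them. Keeping the nodes intact and normalising whitespace only once, on the joined string, avoids the problem.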
