Duplicate content on certain site
I tried this site: https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006
and article.article_html returns duplicate content. If you search that page, you will find only one match for “EMP” (even in the HTML source), but in article_html it appears twice. Here is my code:
from newspaper import Article, Config as NewspaperConfig

url = "https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006"
conf = NewspaperConfig()
# keep_article_html=True makes parse() populate article.article_html
article = Article(url, config=conf, keep_article_html=True, language='ko')
article.download()
article.parse()
print(article.article_html)  # "EMP" shows up twice here
print(article.text)
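To confirm the duplication without reading the output by eye, a small check along these lines may help. It is only a sketch using the standard library: `TextChunks`, `repeated_chunks`, and the `min_length` threshold are hypothetical names introduced here, not part of newspaper, and the sample markup is made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collects each run of text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Normalize whitespace so identical text compares equal.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def repeated_chunks(html, min_length=20):
    """Return {text: count} for text runs that occur more than once."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    # Ignore very short runs, which repeat naturally (assumed threshold).
    counts = Counter(c for c in parser.chunks if len(c) >= min_length)
    return {chunk: n for chunk, n in counts.items() if n > 1}


# Made-up sample standing in for article.article_html.
sample = (
    "<div><p>An EMP attack could disable the grid.</p>"
    "<p>Other text that is unique here.</p>"
    "<p>An EMP attack could disable the grid.</p></div>"
)
print(repeated_chunks(sample))
```

With the snippet above, calling `repeated_chunks(article.article_html)` after `article.parse()` should list any text that was emitted more than once.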
Issue Analytics
- State:
- Created 4 years ago
- Comments:8 (1 by maintainers)
Top GitHub Comments
First, you can try applying my fix: https://github.com/codelucas/newspaper/pull/456. If that does not help, then it is probably a long-standing bug. I fixed it in my fork, but I am not sure the fix was done properly.
It seems to be removing the spaces before and after inline elements. For example,
<p>We released <a href="https://www.google.com/" target="_blank">a new video</a> here. <a href="https://www.google.com/" target="_blank">Click here to watch it now</a>.</p>
which reads as:
We released a new video here. Click here to watch it now.
will be converted to:
We releaseda new videohere.Click here to watch it now.
The same thing happens with bold text.
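The symptom in this comment is what you would see if each text fragment around an inline element were stripped before being joined. The sketch below only illustrates that mechanism with the standard library. It is not newspaper's actual code, and `TextPieces` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextPieces(HTMLParser):
    """Records every text fragment exactly as it appears between tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)


def extract_text(html, strip_pieces):
    """Join text fragments, optionally stripping each one first."""
    parser = TextPieces()
    parser.feed(html)
    parser.close()
    if strip_pieces:
        # Stripping per fragment drops the spaces that sit next to <a>, <b>, etc.
        return "".join(piece.strip() for piece in parser.pieces)
    return "".join(parser.pieces)


SAMPLE = (
    '<p>We released <a href="https://www.google.com/" target="_blank">a new video</a> here. '
    '<a href="https://www.google.com/" target="_blank">Click here to watch it now</a>.</p>'
)

print(extract_text(SAMPLE, strip_pieces=True))
# We releaseda new videohere.Click here to watch it now.
print(extract_text(SAMPLE, strip_pieces=False))
# We released a new video here. Click here to watch it now.
```

If the library's cleaning step behaves like the first call, the fix would be to preserve the leading and trailing whitespace of text next to inline elements. That is an assumption about the cause, not something this thread confirms.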