
Support references scraping link title with URL?

See original GitHub issue

I’m working on a tool where I want to scrape the common “Official Website” link from the “External links” section that appears in almost every company and organization article (for example). According to the documentation, the references function returns the links that appear in the “External links” section. My problem is that I cannot easily tell which link is the “Official Website”, as it returns one giant list of URLs.
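
For context, here is roughly what references returns today (illustrative output; the placeholder URLs stand in for the real list):

>>> page.references  # a flat list of URL strings, with no link text attached
['https://...', 'https://...', 'https://...', ...]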

Perhaps references could return a dictionary that contained both the URL and name of the link?

Something like this:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.references
{ ... "Official Website": "https://www.mcdonalds.com" ... }
>>> page.references["Official Website"]
"https://www.mcdonalds.com"

I’m submitting this issue here as this seems to be the most up-to-date and active Python MediaWiki wrapper. Thanks for your work on this. I’ll be looking to see if I can add this feature myself; however, I’m sure you are more familiar with the source code and may have a better solution to this problem. Thanks.

Edit: Hmm, after some digging it looks like the MediaWiki API just doesn’t support returning the link title. I hope I can solve this problem without having to run a regex or BeautifulSoup over page.html or something.
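
For what it’s worth, that fallback is only a few lines. A minimal sketch, assuming page.html holds the rendered article HTML and that MediaWiki tags outbound links with the external CSS class:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page.html, 'html.parser')
>>> # pair each outbound link's visible text with its URL
>>> [(a.get_text(), a.get('href')) for a in soup.select('a.external')]
[('Official website', 'https://www.mcdonalds.com'), ...]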

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
barrust commented, Sep 18, 2017

@vesche I decided that, even though this is not part of the MediaWiki API, I am already parsing the HTML and content for other functions, so this seemed like another good addition to the library. As of PR #34 you can use parse_section_links(section) to get all of the links from the “External links” section. It returns a list of (link text, URL) tuples for the desired section, in the order they appear in the HTML markup.

So to pull the external links you can use:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.parse_section_links('External Links')
[('Official Website', 'https://www.mcdonalds.com'), ...]
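
If you want the dictionary-style lookup from the original request, the tuple list converts directly (a sketch; the exact link text and capitalization depend on the article):

>>> links = dict(page.parse_section_links('External Links'))
>>> links['Official Website']
'https://www.mcdonalds.com'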

0 reactions
barrust commented, Sep 21, 2017

Also, I pushed this code to PyPI, so you can upgrade from 0.3.14 to 0.3.15.
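
Assuming the package is published on PyPI under the name pymediawiki, the upgrade is the usual one-liner:

$ pip install --upgrade pymediawiki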

Read more comments on GitHub >

Top Results From Across the Web

python - Scraping a webpage for link titles and URLs utilizing ...
I have a webpage of popular articles which I want to scrape for each quoted webpage’s hyperlink and the title of the ...

Scraping titles and links from a site using python
In the second function, I’ve just used BeautifulSoup to get the title and the URLs of the videos that you’re interested in.

How to scrape Amazon Product Information using Beautiful ...
How to scrape Amazon Product Information using Beautiful Soup · Some basic requirements: · Creating a User-Agent · Sending a request to a...

How To Scrape Websites Using Google Sheets - Ultimate Guide
Simply put, web scraping is a method of extracting website data like Page titles, Headings, Meta descriptions, Internal links, External links, ...

Crawl and Follow links with SCRAPY - YouTube
... scraping framework for Python, we can use it to follow links and crawl a website, in this case I am going to...
