
Support references scraping link title with URL?

See original GitHub issue

I’m working on a tool where I want to scrape the common “Official Website” link from the “External links” section that appears in almost every company and organization article (for example). According to the documentation, the references function returns the links that appear in the “External links” section. My problem is that I cannot easily tell which link is the “Official Website”, as it returns one giant list of URLs.
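
For context, here is roughly what references returns today (illustrative output; the placeholder URLs stand in for the real list):

>>> page.references  # a flat list of URL strings, with no link text attached
['https://...', 'https://...', 'https://...', ...]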

Perhaps references could return a dictionary that contained both the URL and name of the link?

Something like this:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.references
{ ... "Official Website": "https://www.mcdonalds.com" ... }
>>> page.references["Official Website"]
"https://www.mcdonalds.com"

I’m submitting this issue here as this seems to be the most up-to-date and active Python MediaWiki wrapper. Thanks for your work on this. I’ll be looking to see if I can add this feature myself; however, I’m sure you are more familiar with the source code and may have a better solution to this problem. Thanks.

Edit: Hmm, after some digging it looks like the MediaWiki API just doesn’t support returning the link title. I hope I can solve this problem without having to run a regex or BeautifulSoup over page.html or something.
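
For what it’s worth, that fallback is only a few lines. A minimal sketch, assuming page.html holds the rendered article HTML and that MediaWiki tags outbound links with the external CSS class:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page.html, 'html.parser')
>>> # pair each outbound link's visible text with its URL
>>> [(a.get_text(), a.get('href')) for a in soup.select('a.external')]
[('Official website', 'https://www.mcdonalds.com'), ...]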

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
barrust commented, Sep 18, 2017

@vesche I decided that, even though this is not part of the MediaWiki API, I am already parsing the HTML and content for other functions, so this seemed like another good addition to the library. As of PR #34 you can use parse_section_links(section) to get all of the links from the “External links” section. It returns a list of (link text, URL) tuples for the desired section, in the order they appear in the HTML markup.

So to pull the external links you can use:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.parse_section_links('External Links')
[('Official Website', 'https://www.mcdonalds.com'), ...]
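
If you want the dictionary-style lookup from the original request, the tuple list converts directly (a sketch; the exact link text and capitalization depend on the article):

>>> links = dict(page.parse_section_links('External Links'))
>>> links['Official Website']
'https://www.mcdonalds.com'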

0 reactions
barrust commented, Sep 21, 2017

Also, I pushed this code to PyPI, so you can upgrade from 0.3.14 to 0.3.15.
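
Assuming the package is published on PyPI under the name pymediawiki, the upgrade is the usual one-liner:

$ pip install --upgrade pymediawiki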

Read more comments on GitHub >

Top Results From Across the Web

python - Scraping a webpage for link titles and URLs utilizing ...
I have a webpage of popular articles which I want to scrape for each quoted webpage’s hyperlink and the title of the ...

Scraping titles and links from a site using python
In the second function, I’ve just used BeautifulSoup to get the title and the URLs of the videos that you’re interested in.

How to scrape Amazon Product Information using Beautiful ...
How to scrape Amazon Product Information using Beautiful Soup · Some basic requirements: · Creating a User-Agent · Sending a request to a...

How To Scrape Websites Using Google Sheets - Ultimate Guide
Simply put, web scraping is a method of extracting website data like Page titles, Headings, Meta descriptions, Internal links, External links, ...

Crawl and Follow links with SCRAPY - YouTube
... scraping framework for Python, we can use it to follow links and crawl a website, in this case I am going to...
