Support references scraping link title with URL?
I’m working on a tool where I want to scrape the common “Official Website” link from the “External links” section that appears in almost every company and organization article (for example). The references function does return the links that appear in the “External links” section, but my problem is that I cannot easily tell which of them is the “Official Website”, because it returns a giant list of URLs.
Perhaps references could return a dictionary that contained both the URL and name of the link?
Something like this:
>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.references
{ ... "Official Website": "https://www.mcdonalds.com" ... }
>>> page.references["Official Website"]
"https://www.mcdonalds.com"
I’m submitting this issue here as this seems to be the most up-to-date and actively maintained Python MediaWiki wrapper. Thanks for your work on this. I’ll be looking to see if I can add this feature myself; however, I’m sure you are more familiar with the source code and may have a better solution to this problem. Thanks.
Edit: Hmm, after some digging it looks like the MediaWiki API just doesn’t support returning the link title. I hope I can solve this problem without having to resort to regex or BeautifulSoup on page.html or something.
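For reference, a rough sketch of that fallback (running BeautifulSoup over page.html; the span id and list structure are assumptions about Wikipedia’s rendered markup, not part of this library):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page.html, "html.parser")
>>> # locate the "External links" heading, then grab the anchors in the list that follows it
>>> heading = soup.find("span", id="External_links")
>>> anchors = heading.find_next("ul").find_all("a", href=True)
>>> {a.get_text(): a["href"] for a in anchors}
{ ... 'Official Website': 'https://www.mcdonalds.com' ... }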
@vesche I decided that, even though this is not a part of the API, I am already parsing the HTML and content for other functions, so this seemed like another good addition to the API. As of PR #34 you can use parse_section_links(section) to get all the links from the “External links” section. It returns a list of tuples, in order based on the HTML markup, of the links and the text representing each link in the desired section. So to pull the external links you can use the approach sketched below.
Also, I pushed this code to PyPI, so you can upgrade from 0.3.14 to 0.3.15.
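For example, the upgrade is a one-liner (assuming the package is published on PyPI as pymediawiki):
pip install --upgrade pymediawiki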