Method `page.sections` returns HTML markup in some cases
Hello, I’m using this library to get textual descriptions for classes in the CUB 2011 dataset.
For each of the 200 bird classes in the CUB dataset, I get the corresponding Wikipedia page and look at its sections with the property page.sections.
In some cases I get HTML markup inside the section titles, for example:
from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)
output:
[u'Taxonomy and name', u'Subspecies<sup>[8]</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
Then, if I call the page.section(str) method with the string u'Subspecies<sup>[8]</sup>':
print(page.section(page.sections[1]))
output: None
The correct string to pass to the method page.section(str) is simply 'Subspecies'.
I actually managed to fix this issue by implementing this function:
def fixed_sections(page_content, verbose=False):
    import re
    sections = []
    # Match heading lines such as '\n== Title ==\n' in the plain-text content
    section_regexp = r'\n==* .* ==*\n'
    # Use the page_content argument (not the global page); findall returns a list, never None
    for obj in re.findall(section_regexp, page_content):
        # Strip the surrounding newlines, '=' markers and spaces
        obj = obj.lstrip('\n= ').rstrip(' =\n')
        sections.append(obj)
        if verbose:
            print("Found section: {}".format(obj))
    return sections
correct_sections = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))
With this code I get the correct output, i.e. the cleaned list of titles followed by the content of the section (a sub-section in this case):
[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.
This fix works for me, but it requires running a regular expression over the content of each page, so it may not be optimal.
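A possibly cheaper alternative, sketched here as an assumption rather than as the library's behaviour: instead of scanning the whole page.content, strip the markup from the short titles that page.sections already returns. The helper names `clean_section_title` and `clean_sections` are hypothetical, the sketch assumes Python 3, and it assumes the unwanted markup is limited to `<sup>` footnote markers and simple inline tags.

```python
import re
from html import unescape

# Footnote markers such as '<sup>[8]</sup>' are dropped together with their content.
_SUP_RE = re.compile(r'<sup\b[^>]*>.*?</sup>', re.IGNORECASE | re.DOTALL)
# Any other tag is removed, but the text it wraps is kept.
_TAG_RE = re.compile(r'<[^>]+>')


def clean_section_title(title):
    """Return a section title with HTML markup removed (hypothetical helper)."""
    title = _SUP_RE.sub('', title)
    title = _TAG_RE.sub('', title)
    # Decode entities such as '&amp;' and trim stray whitespace.
    return unescape(title).strip()


def clean_sections(titles):
    """Apply clean_section_title to every title in a list."""
    return [clean_section_title(t) for t in titles]


raw = ['Taxonomy and name', 'Subspecies<sup>[8]</sup>', 'Description']
cleaned = clean_sections(raw)
print(cleaned)  # ['Taxonomy and name', 'Subspecies', 'Description']
```

The cleaned title could then be passed to page.section(str), which, as noted above, accepts 'Subspecies'.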
Issue Analytics
- Created 6 years ago
- Comments: 6 (5 by maintainers)
This has been published in version 0.4.0; please let me know if you encounter further issues!
Thank you for your interest. I noticed something like this long ago but forgot to get back to it. As sections are only used on demand, I am not opposed to using a regex. If you want to submit a PR to fix the section title parsing, I would love to review it!
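To illustrate why on-demand use keeps the regex cost low, here is a minimal sketch. The `LazyPage` class is hypothetical and is not the library's implementation; it only assumes that the page text marks headings as `== Title ==` lines, as in the report above. The titles are parsed the first time `sections` is accessed and reused afterwards.

```python
import re

# Heading lines such as '== Title ==' or '=== Subtitle ==='.
_HEADING_RE = re.compile(r'^=+ (.*?) =+$', re.MULTILINE)


class LazyPage(object):
    """Hypothetical stand-in for a page object; not the library's class."""

    def __init__(self, content):
        self.content = content
        self._sections = None  # parsed only when first requested
        self.parse_count = 0   # counts how often the regex actually runs

    @property
    def sections(self):
        # Run the regex once, then serve the cached list.
        if self._sections is None:
            self.parse_count += 1
            self._sections = [t.strip() for t in _HEADING_RE.findall(self.content)]
        return self._sections


sample = "Intro text.\n\n== Taxonomy and name ==\nText.\n=== Subspecies ===\nMore text.\n"
page = LazyPage(sample)
first = page.sections
second = page.sections  # served from the cache, no second regex pass
print(first, page.parse_count)
```

With this layout, pages whose sections are never requested pay no parsing cost at all.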