
Method `page.sections` returns HTML markup in some cases


Hello, I’m using this library to get textual descriptions for classes in the CUB 2011 dataset.

For each of the 200 bird classes in the CUB dataset, I fetch the corresponding Wikipedia page and inspect its sections through the page.sections property. In some cases the section titles contain HTML markup, for example:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)

output: [u'Taxonomy and name', u'Subspecies<sup>[8]</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']

Then, if I use the page.section(str) method with the string u'Subspecies<sup>[8]</sup>':

print(page.section(page.sections[1]))

output: None

The correct string to pass to the page.section(str) method is simply 'Subspecies'.
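A minimal sketch of one way to recover the usable title from the value that page.sections returned, assuming the unwanted markup is limited to <sup>...</sup> footnote markers and other simple tags. The helper clean_section_title is hypothetical and is not part of the library:

```python
import re

# Footnote markers such as <sup>[8]</sup>, removed together with their content
_SUP_RE = re.compile(r"<sup\b[^>]*>.*?</sup>", re.IGNORECASE | re.DOTALL)
# Any remaining HTML tag; the text between the tags is kept
_TAG_RE = re.compile(r"<[^>]+>")


def clean_section_title(raw_title):
    """Hypothetical helper: strip HTML remnants from a section title."""
    without_footnotes = _SUP_RE.sub("", raw_title)
    return _TAG_RE.sub("", without_footnotes).strip()


print(clean_section_title("Subspecies<sup>[8]</sup>"))  # prints: Subspecies
```

The cleaned string could then be passed to page.section(...). This only post-processes the titles, so it does not address why the markup appears in the first place.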

I actually managed to fix this issue by implementing this function:

import re

def fixed_sections(page_content, verbose=False):
    """Return the section titles found in the plain-text page content."""
    sections = []
    section_regexp = r'\n==* .* ==*\n'  # a heading line such as '== Title =='
    # re.findall returns a list (possibly empty), never None
    for obj in re.findall(section_regexp, page_content):
        obj = obj.lstrip('\n= ').rstrip(' =\n')  # strip the '=' markers and whitespace
        sections.append(obj)
        if verbose:
            print("Found section: {}".format(obj))
    return sections

correct_sections = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))

With this code I get the correct output, i.e. the content of the section (sub-section in this case):

[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.

This fix works for me, but it requires running a regular expression over each page's content, so it may not be optimal.
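If the per-page regex cost turns out to matter, one option is to compile the pattern once and cache the parsed titles. The following is only a sketch under that assumption: cached_sections and the sample text are made up for illustration, and the pattern is a line-anchored variant of the one above rather than anything taken from the library.

```python
import re
from functools import lru_cache

# Compiled once; matches heading lines such as '== Title ==' or '=== Title ==='
_HEADING_RE = re.compile(r"^=+ (.*?) =+$", re.MULTILINE)


@lru_cache(maxsize=256)
def cached_sections(page_content):
    """Hypothetical helper: parse section titles, caching the result per content string."""
    return tuple(title.strip() for title in _HEADING_RE.findall(page_content))


# Made-up sample standing in for page.content
sample = "Intro text\n\n\n== Taxonomy and name ==\nBody\n\n\n=== Subspecies ===\nMore\n"
print(cached_sections(sample))  # first call runs the regex
print(cached_sections(sample))  # second call is served from the cache
```

A tuple is returned so that cached results cannot be mutated by callers. Whether the caching is worth it depends on how often the same page is parsed more than once.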

Issue Analytics

State:
Created 6 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction

barrust commented, Mar 9, 2018

This has been published in version 0.4.0; please let me know if you encounter further issues!

1 reaction

barrust commented, Feb 23, 2018

Thank you for your interest. I noticed something like this long ago but forgot to get back to it. As sections are only used on demand I am not opposed to using regex. If you want to submit a PR to fix the sections title parsing I would love to review it!
