Method `page.sections` returns HTML markup in some cases

Hello, I’m using this library to get textual descriptions for classes in the CUB 2011 dataset.

For each of the 200 bird classes in the CUB dataset, I fetch the corresponding Wikipedia page and read its section titles from the page.sections property. In some cases the titles contain HTML markup, for example:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)

output: [u'Taxonomy and name', u'Subspecies<sup>&#91;8&#93;</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']

Then, if I use the page.section(str) method with the string u'Subspecies<sup>&#91;8&#93;</sup>':

print(page.section(page.sections[1]))

output: None

The string that page.section(str) actually needs in order to find this section is simply 'Subspecies'.

I managed to work around the issue by implementing this function:

import re

def fixed_sections(page_content, verbose=False):
    # Headings in page.content sit on their own line, e.g. '== Title =='.
    section_regexp = re.compile(r'^=+ (.*?) =+$', re.MULTILINE)
    sections = []
    # findall returns an empty list when nothing matches, never None.
    for title in section_regexp.findall(page_content):
        title = title.strip()
        sections.append(title)
        if verbose:
            print("Found section: {}".format(title))
    return sections

correct_sections = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))

With this code I get the correct output, i.e. the content of the section (sub-section in this case):

[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.

This fix works for me, but it requires running a regular expression over the content of every page, so it may not be optimal.
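An alternative to re-parsing page.content is to clean the titles that page.sections already returns. The sketch below is only an illustration of that idea: clean_section_title is a hypothetical helper, not part of the library, and it assumes Python 3 (for the html module) and that the unwanted markup is limited to <sup> footnote markers, other simple tags, and HTML entities.

```python
import html
import re

# Hypothetical helper, not part of the mediawiki package.
_SUP_RE = re.compile(r'<sup\b[^>]*>.*?</sup>', re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r'<[^>]+>')


def clean_section_title(raw_title):
    """Strip footnote markers and leftover HTML from one section title."""
    without_refs = _SUP_RE.sub('', raw_title)     # drop '<sup>&#91;8&#93;</sup>'
    without_tags = _TAG_RE.sub('', without_refs)  # drop any remaining tags
    return html.unescape(without_tags).strip()    # decode entities such as '&amp;'


raw_sections = ['Taxonomy and name', 'Subspecies<sup>&#91;8&#93;</sup>', 'Description']
cleaned = [clean_section_title(title) for title in raw_sections]
print(cleaned)  # ['Taxonomy and name', 'Subspecies', 'Description']
```

The cleaned titles could then be passed to page.section(str) in place of the raw ones.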

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

barrust commented, Mar 9, 2018 (1 reaction)

This has been published in version 0.4.0; please let me know if you encounter further issues!

barrust commented, Feb 23, 2018 (1 reaction)

Thank you for your interest. I noticed something like this long ago but forgot to get back to it. As sections are only used on demand I am not opposed to using regex. If you want to submit a PR to fix the sections title parsing I would love to review it!
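The "only used on demand" point can be sketched as a lazily computed, cached property, so the regex runs at most once per page and only if the sections are actually requested. This is a minimal illustration, not the library's actual implementation: LazySections, its parse_count counter, and the sample text are all hypothetical.

```python
import re

# Matches heading lines such as '== Title ==' or '=== Subtitle ==='.
_HEADING_RE = re.compile(r'^=+ (.*?) =+$', re.MULTILINE)


class LazySections:
    """Hypothetical wrapper that parses headings only on first access."""

    def __init__(self, content):
        self._content = content
        self._sections = None  # not parsed yet
        self.parse_count = 0   # illustration only: counts regex passes

    @property
    def sections(self):
        if self._sections is None:
            self.parse_count += 1
            self._sections = [t.strip() for t in _HEADING_RE.findall(self._content)]
        return self._sections


sample = "Intro text.\n\n== Taxonomy and name ==\nText.\n\n=== Subspecies ===\nMore text.\n"
wrapper = LazySections(sample)
first = wrapper.sections
second = wrapper.sections  # served from the cache, no second regex pass
print(first, wrapper.parse_count)  # ['Taxonomy and name', 'Subspecies'] 1
```

With this pattern, pages whose sections are never read pay no parsing cost at all.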
