Easy access to citations for specific methods

Hello,

As a scientist, I would appreciate easy access to citation information for a specific estimator. A very good example is the citation function from R (https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/citation), which provides a citation for each package (and, even better, in BibTeX format!).

The citation information in sklearn is more or less available in the References sections of the docstrings, and I coded a short function that basically prints that section:

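import inspect
import re


def _get_docstring_sections(txt):
    """Return the sections of a numpy-formatted docstring as a dict."""

    if not isinstance(txt, str):
        return {}

    # remove docstring indentation (also handles the unindented first line)
    txt = inspect.cleandoc(txt)

    # split the docstring on section underlines (------)
    split_sec = re.split(r'\n-+[ \t]*\n', txt)

    # the title of each section is the last line of the preceding chunk
    secs = {}
    lines = split_sec[0].splitlines()
    if lines:
        title = lines[-1].strip()
        for i in range(1, len(split_sec)):
            lines = split_sec[i].splitlines()
            if i == len(split_sec) - 1 or not lines:
                # last (or empty) chunk: there is no following title to strip
                secs[title] = '\n'.join(lines)
            else:
                # the last line is the title of the next section
                secs[title] = '\n'.join(lines[:-1])
                title = lines[-1].strip()

    return secs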

def cite(obj=None, format=None):
    """Print citation information from an object with a numpy docstring.

    `format` is currently unused; it is reserved for other output
    formats such as bibtex.
    """

    txt = 'No references found for this object'

    if obj is not None:
        secs = _get_docstring_sections(obj.__doc__)
        if 'References' in secs:
            txt = secs['References']

    print(txt)

It seems to work rather well in practice and prints the references in rst format. In a perfect world it could optionally return BibTeX as well, but that seems quite hard without calling a reference web API.

I ran it on all estimators using the function sklearn.utils.testing.all_estimators():
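The rst text printed by cite can hold several numbered entries. As a minimal sketch of a first step toward per-reference output (for example, handing each entry to a BibTeX lookup), the hypothetical helper split_rst_references below splits such a block into (label, text) pairs. It assumes the entries follow the numpydoc ".. [1]" citation style; the helper name and the sample references are made up for illustration and are not part of scikit-learn.

```python
import re

# Matches the start of an rst citation entry such as ".. [1] " at the
# beginning of a line (numpydoc reference style; an assumption here).
_ENTRY_START = re.compile(r'^\.\. \[([^\]]+)\][ \t]*', re.MULTILINE)


def split_rst_references(text):
    """Split an rst References block into (label, text) pairs.

    Hypothetical helper, for illustration only.
    """
    parts = _ENTRY_START.split(text)
    # parts = [preamble, label1, body1, label2, body2, ...]
    entries = []
    for label, body in zip(parts[1::2], parts[2::2]):
        # collapse continuation lines and extra whitespace into one line
        entries.append((label, ' '.join(body.split())))
    return entries


# Made-up references, only to exercise the helper.
sample = """.. [1] A. Author, "First hypothetical paper",
   Some Journal, 2001.

.. [2] B. Author, "Second hypothetical paper", 2002.
"""

for label, entry in split_rst_references(sample):
    print('[{}] {}'.format(label, entry))
```

Text without any ".. [n]" marker yields an empty list, so a caller can fall back to printing the raw section.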

# in scikit-learn < 0.22 this lived in sklearn.utils.testing
from sklearn.utils import all_estimators

for name, cl in all_estimators():
    header = '{} : {}'.format(name, cl)
    print('\n\n' + header)
    print('=' * len(header))

    cite(cl)

It returns a lot of nice references for the methods, with some notable empty results that are actually due to documentation errors (for instance, References in sklearn.svm.SVR is not a proper section). It also works on any Python object that uses the numpy docstring format with a References section, so I find it quite practical.

Would you be interested in a PR adding this function, maybe in sklearn.utils?
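The empty results follow from how sections are detected: a heading only counts when a line of dashes comes right after it. Below is a small sketch of a check for that failure mode. references_status is a hypothetical helper, and the two toy docstrings are invented for illustration (they are not the actual sklearn.svm.SVR docstring).

```python
import inspect
import re


def references_status(doc):
    """Classify how a docstring declares its References section.

    Hypothetical helper: returns 'section' when a "References" heading is
    underlined with dashes (numpy style), 'malformed' when the word sits
    alone on a line without the underline, and 'missing' otherwise.
    """
    if not isinstance(doc, str):
        return 'missing'
    doc = inspect.cleandoc(doc)
    if re.search(r'^References[ \t]*\n-+[ \t]*$', doc, re.MULTILINE):
        return 'section'
    if re.search(r'^References:?[ \t]*$', doc, re.MULTILINE):
        return 'malformed'
    return 'missing'


def good():
    """Toy function.

    References
    ----------
    .. [1] Hypothetical reference.
    """


def bad():
    """Toy function.

    References:
    .. [1] Hypothetical reference.
    """


for func in (good, bad, len):
    print(func.__name__, '->', references_status(func.__doc__))
```

Running such a check over all estimators would give a list of docstrings to fix, which addresses the empty results at the documentation level.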

Issue Analytics

Created 4 years ago
Comments: 9 (7 by maintainers)

Top GitHub Comments

NicolasHug commented, Feb 29, 2020

Just found out about this https://github.com/duecredit/duecredit

rflamary commented, Feb 26, 2020

OK, I’m all for better documentation. I’m just saying that a lot of publications nowadays cite sklearn (one of the main papers) instead of the proper paper corresponding to the estimator, when they should cite both. This might be laziness, but that’s why a simple function that prints a list of references could help, assuming of course that they know about it.

When there are several references, the function returns them all, and the user can check the documentation to see which one corresponds to the parameters they are using.

Anyway, it was just an idea. I think I will implement it for my toolbox and see if I can find a way to add those awesome BibTeX entries that the R folks have. As implemented, it works on functions from numpy/scipy/sklearn and my toolbox, as long as they use the numpy docstring format.