getpapers has many fewer hits than EUPMC interface

From a correspondent:

I installed ContentMine on my Mac laptop and tried to mine content on my research topic, “Postdoc career outcome”. I was able to get 78 open-access full-text papers. See the “getpapers” output log below:

$ getpapers -q "postdoc career outcome" -o PDcareer -x

info: Searching using eupmc API

info: Found 78 open access results

Retrieving results [==============================] 100% (eta 0.0s)

info: Done collecting results

info: Saving result metadata

info: Full EUPMC result metadata written to eupmc_results.json

info: Individual EUPMC result metadata records written

info: Extracting fulltext HTML URL list (may not be available for all articles)

info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

info: Got XML URLs for 78 out of 78 results

info: Downloading fulltext XML files

Downloading files [=======================] 100% (78/78) [4.2s elapsed, eta 0.0]

info: All downloads succeeded!

I ran the same search through the “Europe PMC” web interface and got 297 results in total, of which 296 are full-text articles and 172 are open-access articles. See the screenshot below.

My questions are:

  1.   Why did “getpapers” retrieve far fewer papers than Europe PMC reports, 78 vs. 172 open-access (or 296 full-text)? Is it caused by limited coverage of the journal scrapers?

  2.   Not all the extracted papers are relevant to my research topic, so manual filtering may be needed. Is it possible to give “getpapers” a list of PMC IDs for paper extraction?

  3.   For my research topic, I really need to get each researcher's name, affiliation, contribution, and bibliometrics (citation count, h-index, journal impact factor) from journal papers. This cannot be done with the standard ContentMine tools, which extract information about sequences, genes, species, and word counts. How do I develop my own “ami2” plugins to extract the facts I'm interested in?

Thank you so much for developing this great open-source software! I’m looking forward to hearing from you soon.

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

blahah commented, Apr 12, 2017 (2 reactions)

We get our results directly from the EUPMC API, so this sounds like an API bug. @tarrow can you follow up with EUPMC?
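
One way to narrow down whether the gap comes from the API itself or from the query that is sent to it is to ask the Europe PMC REST search service for hit counts directly and compare them with the web interface. The following is a minimal sketch, not getpapers' actual code. It assumes the public search endpoint shown in `EUPMC_SEARCH`, the `hitCount` field in its JSON response, and the `OPEN_ACCESS:y` search filter. The function names are hypothetical, and whether getpapers sends exactly this query is not confirmed in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC REST search service.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, open_access_only=False):
    """Build a search URL (hypothetical helper, for illustration only)."""
    if open_access_only:
        # Assumption: OPEN_ACCESS:y restricts results to open-access articles.
        query = f"({query}) AND OPEN_ACCESS:y"
    # pageSize=1 keeps the response small; only the total count is needed.
    params = {"query": query, "format": "json", "pageSize": 1}
    return f"{EUPMC_SEARCH}?{urlencode(params)}"


def hit_count(response):
    """Read the total number of hits from a parsed JSON response."""
    return int(response["hitCount"])


def fetch_hit_count(query, open_access_only=False):
    """Send the request (needs network access) and return the hit count."""
    with urlopen(build_search_url(query, open_access_only)) as resp:
        return hit_count(json.load(resp))
```

Calling `fetch_hit_count("postdoc career outcome")` and `fetch_hit_count("postdoc career outcome", open_access_only=True)` would give two numbers to set against the 297 and 172 reported by the web interface. If they match the web interface but not getpapers, the difference is more likely in how the query is constructed than in the API.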

sedimentation-fault commented, Apr 27, 2017 (0 reactions)

It seems that Europe PMC (EUPMC) has listened to complaints about sudden API changes and has modified its procedures.

I have just stumbled upon the EUPMC SOAP Web Service Reference Guide. There, in the Introduction (p. 6 of the document, p. 7 of the PDF file), it says:

From January 2016 a new web service release procedure has been introduced. This allows two versions of the web service to be simultaneously available. This approach to release management will allow users to prepare for a new version, rather than having to immediately respond to a version change. The details of web service releases will be communicated to all known users. A mailing list of users is compiled from those that have supplied an email address in the ‘Email’ parameters of the various methods available.

You can thus

  • use two versions of the API at any given time, the new and the last stable one, and
  • add yourself to a mailing list and keep up with the API changes by passing your email address to one of the API methods.
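
As a rough illustration of the second point, the sketch below adds an email address to a search request. The quoted guide describes the SOAP service; this example assumes that the REST search endpoint accepts an `email` query parameter in the same spirit, which should be checked against the current Europe PMC documentation. The endpoint constant and the function name are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC REST search service.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def search_url_with_email(query, email):
    """Build a search URL that carries a contact email (hypothetical helper).

    Assumption: the service reads an `email` parameter and uses it to
    notify users about upcoming web service releases.
    """
    params = {"query": query, "format": "json", "email": email}
    return f"{EUPMC_SEARCH}?{urlencode(params)}"
```

For example, `search_url_with_email("postdoc career outcome", "me@example.org")` returns a URL whose query string includes the address, URL-encoded.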