Can I search ISSN, ISBN and DOI in a web-page, Not only URL?

I’m writing an indexer for org-roam and BibTeX to link org-roam notes with the web browser.

Some Org files have citation syntax like the one below.

:PROPERTIES:
:ID:       120cf393-9ec3-40b8-a486-d903036236f8
:ROAM_REFS: cite:Dong2018
:END:
#+TITLE: Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
#+CREATED: [2021-07-18 Sun 15:38]
#+filetags: :Literature:

- tags ::
- keywords ::
- author(s) :: Dong, Linhao and Xu, Shuang and Xu, Bo

The corresponding bib file entry looks like this.

@InProceedings{Dong2018,
  author     = {Dong, Linhao and Xu, Shuang and Xu, Bo},
  booktitle  = {2018 {IEEE} {International} {Conference} on {Acoustics}, {Speech} and {Signal} {Processing} ({ICASSP})},
  title      = {Speech-{Transformer}: {A} {No}-{Recurrence} {Sequence}-to-{Sequence} {Model} for {Speech} {Recognition}},
  year       = {2018},
  month      = apr,
  note       = {ZSCC: 0000311 ISSN: 2379-190X},
  pages      = {5884--5888},
  abstract   = {Recurrent sequence-to-sequence models using encoder-decoder architecture have made great progress in speech recognition task. However, they suffer from the drawback of slow training speed because the internal recurrence limits the training parallelization. In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model entirely relies on attention mechanisms to learn the positional dependencies, which can be trained faster with more efficiency. We also propose a 2D-Attention mechanism, which can jointly attend to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves competitive word error rate (WER) of 10.9\%, while the whole training process only takes 1.2 days on 1 GPU, significantly faster than the published results of recurrent sequence-to-sequence models.},
  doi        = {10.1109/ICASSP.2018.8462506},
  file       = {:Dong2018 - Speech Transformer_ a No Recurrence Sequence to Sequence Model for Speech Recognition.html:URL;:dong2018.pdf:PDF},
  issn       = {2379-190X},
  keywords   = {Hidden Markov models, Encoding, Training, Decoding, Speech recognition, Time-frequency analysis, Spectrogram, Speech Recognition, Sequence-to-Sequence, Attention, Transformer},
  shorttitle = {Speech-{Transformer}},
}

A BibTeX entry can have an ISBN, ISSN, DOI, or URL.

The indexer parses the BibTeX files first and links each URL to the ROAM_REFS and CUSTOM_ID of the Org file. I think this works quite well.

However, some entries are books that have only an ISBN. I think the Promnesia extension would need to scrape identifiers (ISBN, DOI) from the web page to link it to org-roam files. Book sites other than Amazon Kindle provide the ISBN in the Open Graph meta tags of their pages. But I don’t think it is a good idea, because it means the Promnesia extension needs its own identifier parsers, or the indexer has to do extra scraping.
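To make the indexing step concrete, here is a rough sketch of how identifier fields could be pulled out of a bib file and keyed by the `cite:` reference used in ROAM_REFS. It is not the indexer described above: `index_bibtex` and both regexes are hypothetical, and the regex approach assumes one field per line with a single pair of braces and no `@` inside field values. A real BibTeX parser would be more robust.

```python
import re

# Hypothetical patterns: entry header like "@InProceedings{Dong2018,"
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Single-line identifier fields such as "doi = {10.1109/...}"
FIELD_RE = re.compile(
    r"^\s*(doi|isbn|issn|url)\s*=\s*\{([^{}]*)\}",
    re.IGNORECASE | re.MULTILINE,
)


def index_bibtex(text):
    """Map 'cite:<key>' to the identifier fields found in that entry."""
    index = {}
    entries = list(ENTRY_RE.finditer(text))
    for i, entry in enumerate(entries):
        # An entry body runs until the next entry header (or end of file).
        end = entries[i + 1].start() if i + 1 < len(entries) else len(text)
        body = text[entry.end():end]
        fields = {name.lower(): value.strip()
                  for name, value in FIELD_RE.findall(body)}
        index[f"cite:{entry.group(2)}"] = fields
    return index
```

For the entry shown earlier, this would yield the `doi` and `issn` fields under the key `cite:Dong2018`; the ISSN mentioned inside `note` is ignored because only whole fields are matched.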

Can I add it to Promnesia to scrape identifiers in a web-page? Will it be a good idea?

Top GitHub Comments

sopoforic commented, Mar 18, 2022

It’s very quick to query all hyperlinks from the DOM. Not sure what it would take to scrape ISBN/DOI, but hopefully if it’s just a regex it should be pretty quick?

Depending on the site, these are very often in a <meta> tag. For example, this book has:

<meta content="9780191776267" property="book:isbn"/>
<meta content="10.1093/actrade/9780192840943.001.0001" name="dc.identifier"/>

An article from ACM similarly has:

<meta name="dc.Identifier" scheme="doi" content="10.1145/953051.801372">

This is typical for journal publishers’ sites. It’s less convenient if you’re looking at other pages, but e.g. Abebooks has <meta itemprop="isbn" content="9781435127739" /> and amazon has the ASIN scattered all over including stuff like <input type="hidden" id="ASIN" name="ASIN" value="0385015836"> which shouldn’t be hard to get at reliably.
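The `<meta>` examples above can be read without regexes over the whole page. The sketch below uses Python's standard `html.parser` and only looks at `<meta>` tags. The class and function names (`MetaIdentifierParser`, `extract_identifiers`) and the two key sets are hypothetical, built only from the tags quoted in this comment; other sites may use different attributes, and the Amazon ASIN `<input>` case is not handled.

```python
from html.parser import HTMLParser

# Attribute values (lowercased) seen in the examples above; an assumption,
# not an exhaustive list.
ISBN_KEYS = {"book:isbn", "isbn"}
DC_KEYS = {"dc.identifier"}


class MetaIdentifierParser(HTMLParser):
    """Collect (kind, value) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {name: (value or "") for name, value in attrs}
        content = a.get("content", "").strip()
        if not content:
            return
        keys = {a.get(n, "").lower() for n in ("property", "name", "itemprop")}
        if keys & ISBN_KEYS:
            self.found.append(("isbn", content))
        elif keys & DC_KEYS:
            # dc.identifier is generic: accept it as a DOI only when the tag
            # says so or the value starts with the DOI prefix "10.".
            if a.get("scheme", "").lower() == "doi" or content.startswith("10."):
                self.found.append(("doi", content))


def extract_identifiers(html):
    parser = MetaIdentifierParser()
    parser.feed(html)
    return parser.found
```

Fed the three `<meta>` tags quoted above, `extract_identifiers` would return one ISBN pair and two DOI pairs. In the extension itself the same idea would presumably be a DOM query on `meta` elements rather than Python.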

hwiorn commented, Mar 25, 2022

Let me know if you want any guidance, there might be some rough edges, especially with all the extension shenanigans.

And by the way you’ll be very welcome in https://memex.zulipchat.com/ – there are spaces there to discuss Promnesia in particular and you might get some input from other people as well (you can login with github – so won’t need to create a new account!)

Actually, I’m already in the memex chat, but I don’t have enough time to implement this right now because of work. The web-page metadata (meta tags) I mentioned is what @sopoforic described. I don’t think it is a good idea to parse the whole DOM and HTML with regexes, since such an extractor can easily bloat. But there will always be exceptions, so site-specific parsers (extractors) may sometimes be needed.

However, I do get orig_urls from hypothesis like urn:x-pdf:3719… that produce norm_urls like x-pdf%3A3719f…, so certainly the world wouldn’t end if we stored urn:isbn:0123456789 or doi:10.1234/5678 or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.

Using urn:isbn:0123456789 or urn:doi:10.1234/5678 is a good idea.
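As a small illustration of the URI idea, the sketch below turns an identifier into a string that could be stored next to ordinary URLs. `identifier_to_uri` is a hypothetical helper, not part of Promnesia, and the normalization choices are assumptions: hyphens and spaces are stripped from ISBNs, and DOIs are lowercased because DOI names are case-insensitive. It uses the `doi:` prefix from the earlier comment; `urn:doi:` would work the same way if that form is preferred.

```python
def identifier_to_uri(kind, value):
    """Return a non-URL URI for an ISBN, ISSN, or DOI (illustrative only)."""
    kind = kind.lower()
    if kind == "isbn":
        # Drop hyphens/spaces so differently formatted ISBNs compare equal.
        compact = value.replace("-", "").replace(" ", "").upper()
        return f"urn:isbn:{compact}"
    if kind == "issn":
        return f"urn:issn:{value.strip().upper()}"
    if kind == "doi":
        # Lowercase so the same DOI always yields the same key.
        return f"doi:{value.strip().lower()}"
    raise ValueError(f"unsupported identifier kind: {kind!r}")
```

If the indexer and the extension both emit URIs through the same function, a book page and its BibTeX entry should end up with the same key, provided the canonifier passes such non-URL URIs through unchanged.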