Can I search for ISSN, ISBN, and DOI in a web page, not only the URL?
I'm writing an indexer for org-roam and BibTeX to link org-roam notes to the web browser.
Some org files have citation syntax like the example below.
```org
:PROPERTIES:
:ID: 120cf393-9ec3-40b8-a486-d903036236f8
:ROAM_REFS: cite:Dong2018
:END:
#+TITLE: Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
#+CREATED: [2021-07-18 Sun 15:38]
#+filetags: :Literature:
- tags ::
- keywords ::
- author(s) :: Dong, Linhao and Xu, Shuang and Xu, Bo
```
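For context, pulling the citation key out of that property drawer could look roughly like this (a sketch, not the actual indexer; `roam_cite_keys` is a name made up for illustration):

```python
# Sketch: extract cite: keys from an org-roam file's ROAM_REFS property.
# The regexes cover the example above, not every corner of org syntax.
import re

def roam_cite_keys(org_text):
    match = re.search(r'^:ROAM_REFS:\s*(.+)$', org_text, re.MULTILINE)
    if not match:
        return []
    return re.findall(r'cite:(\S+)', match.group(1))  # -> ['Dong2018']
```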
The corresponding BibTeX file looks like this:
```bibtex
@InProceedings{Dong2018,
author = {Dong, Linhao and Xu, Shuang and Xu, Bo},
booktitle = {2018 {IEEE} {International} {Conference} on {Acoustics}, {Speech} and {Signal} {Processing} ({ICASSP})},
title = {Speech-{Transformer}: {A} {No}-{Recurrence} {Sequence}-to-{Sequence} {Model} for {Speech} {Recognition}},
year = {2018},
month = apr,
note = {ZSCC: 0000311 ISSN: 2379-190X},
pages = {5884--5888},
abstract = {Recurrent sequence-to-sequence models using encoder-decoder architecture have made great progress in speech recognition task. However, they suffer from the drawback of slow training speed because the internal recurrence limits the training parallelization. In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model entirely relies on attention mechanisms to learn the positional dependencies, which can be trained faster with more efficiency. We also propose a 2D-Attention mechanism, which can jointly attend to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves competitive word error rate (WER) of 10.9\%, while the whole training process only takes 1.2 days on 1 GPU, significantly faster than the published results of recurrent sequence-to-sequence models.},
doi = {10.1109/ICASSP.2018.8462506},
file = {:Dong2018 - Speech Transformer_ a No Recurrence Sequence to Sequence Model for Speech Recognition.html:URL;:dong2018.pdf:PDF},
issn = {2379-190X},
keywords = {Hidden Markov models, Encoding, Training, Decoding, Speech recognition, Time-frequency analysis, Spectrogram, Speech Recognition, Sequence-to-Sequence, Attention, Transformer},
shorttitle = {Speech-{Transformer}},
}
```
A BibTeX entry can contain an ISBN, ISSN, DOI, or URL. The indexer parses the BibTeX files first and links each URL to the ROAM_REFS and CUSTOM_ID properties of the corresponding org file. This works quite well.
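For illustration, here is a minimal sketch of that first pass, assuming the `bibtexparser` library (`build_url_index` is a made-up name, not Promnesia's API):

```python
# Sketch: map each BibTeX entry's URL/DOI to its citation key, so that
# cite:Dong2018 in ROAM_REFS can later be matched against visited URLs.
import bibtexparser

def build_url_index(bib_path):
    with open(bib_path) as f:
        db = bibtexparser.load(f)
    index = {}
    for entry in db.entries:
        key = entry['ID']  # e.g. "Dong2018"
        if 'url' in entry:
            index[entry['url']] = key
        if 'doi' in entry:
            # DOIs resolve via doi.org, so index that URL form as well
            index['https://doi.org/' + entry['doi']] = key
    return index
```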
However, some entries are books that have only an ISBN. I think the Promnesia extension would need to scrape identifiers (ISBN, DOI) from the web page to link it to org-roam files. Most book sites, except Amazon Kindle, provide the ISBN in the Open Graph meta tags of their pages. But I'm not sure this is a good idea: it means the Promnesia extension needs identifier parsers, or the indexer needs to do extra scraping.
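For what it's worth, reading an ISBN out of Open Graph metadata could look like this rough sketch (it assumes the page uses the `books:isbn` property, which many but not all book sites do; `extract_og_isbn` is a made-up name):

```python
# Sketch: pull an ISBN from <meta property="books:isbn" content="...">.
from bs4 import BeautifulSoup

def extract_og_isbn(html):
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('meta', property='books:isbn')
    if tag and tag.get('content'):
        return tag['content']
    return None
```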
Can I add it to Promnesia to scrape identifiers in a web-page? Will it be a good idea?
Top GitHub Comments
Depending on the site, these identifiers are very often in a `<meta>` tag; for example, book pages and ACM article pages both expose them that way, and this is typical for journal publishers' sites. It's less convenient if you're looking at other pages, but e.g. AbeBooks has

`<meta itemprop="isbn" content="9781435127739" />`

and Amazon has the ASIN scattered all over, including things like

`<input type="hidden" id="ASIN" name="ASIN" value="0385015836">`

which shouldn't be hard to get at reliably.

Actually, I'm already in the memex chat, but I don't have enough time to work on an implementation now because of work. The web-page metadata (tags) I mentioned is what @sopoforic described. I don't think it is a good idea to parse the whole DOM/HTML with regexes, since such an extractor can bloat easily. But there will always be exceptions, so specific parsers (extractors) may sometimes be needed.
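A rough sketch of what such per-site extractors could look like, based only on the two snippets quoted above (`extract_book_ids` is a made-up name; real pages will need more robust selectors):

```python
# Sketch: handle the AbeBooks-style microdata and Amazon-style hidden
# input shown above. Not Promnesia code; patterns vary per site.
from bs4 import BeautifulSoup

def extract_book_ids(html):
    soup = BeautifulSoup(html, 'html.parser')
    ids = {}
    # <meta itemprop="isbn" content="9781435127739" />
    meta = soup.find('meta', itemprop='isbn')
    if meta and meta.get('content'):
        ids['isbn'] = meta['content']
    # <input type="hidden" id="ASIN" name="ASIN" value="0385015836">
    asin = soup.find('input', id='ASIN')
    if asin and asin.get('value'):
        ids['asin'] = asin['value']
    return ids
```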
Using `urn:isbn:0123456789` or `urn:doi:10.1234/5678` is a good idea.
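Normalizing scraped identifiers into that URN form would make them easy to match against BibTeX-derived keys; a tiny sketch (the helper name and the hyphen-stripping rule are my own choices):

```python
# Sketch: normalize identifiers into urn:isbn:... / urn:doi:... form.
def to_urn(kind, value):
    value = value.strip()
    if kind == 'isbn':
        value = value.replace('-', '')  # ISBNs are often printed hyphenated
    return f'urn:{kind}:{value}'

to_urn('isbn', '978-1-4351-2773-9')  # -> 'urn:isbn:9781435127739'
to_urn('doi', '10.1234/5678')        # -> 'urn:doi:10.1234/5678'
```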