Preprocessing steps
Following the discussion on Discourse, we seem to be moving toward splitting CSL processing into discrete steps: rendering citations from CSL data on one side, and wrangling raw data into clean CSL structures on the other.
I think this is a reasonable conceptual distinction, it could be useful for dividing development labor, and it can speed up processors by letting them skip parsing when an item is already fully specified.
Related to my concerns in the Discourse thread, I think we should make clear that these steps must occur somewhere in the CSL workflow. Whether that’s in the calling application or in the processor (either itself or by calling an external script) is an open question.
I’ve started a list of these preprocessing steps below. It would be good if someone could go through the test suite for CSL (and maybe citeproc-js, and pandoc-citeproc) to identify others.
- title part parsing from string field
- name particle and suffix parsing from Given and Family parts
- name parsing from string
- for legacy reasons, citeproc-js parses `Family || Given`. That’s an easy enough extra parse beyond the above, and it’s a nice format for plain text extra
- locator parsing from string
- page parsing from string
- generation of `citation-label`, if a format calls for it? This one probably needs to be done by processors rather than by a calling application
- generating name initials/determining whether letters are initials or not
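To make the name-related steps above more concrete, here is a rough sketch of “name parsing from string”. The particle list and splitting rules are illustrative assumptions, not citeproc-js’s actual algorithm; it handles the legacy `Family || Given` form, `Family, Given`, and plain `Given Family`:

```python
# Hypothetical sketch: split a raw name string into CSL name-parts.
# PARTICLES and the splitting heuristics are assumptions for illustration.
PARTICLES = {"van", "von", "de", "der", "den", "la", "le", "di", "da"}

def parse_name(raw):
    raw = raw.strip()
    if "||" in raw:        # legacy citeproc-js form: Family || Given
        family, given = (p.strip() for p in raw.split("||", 1))
    elif "," in raw:       # "Family, Given"
        family, given = (p.strip() for p in raw.split(",", 1))
    else:                  # "Given Family": treat the last token as the family name
        parts = raw.split()
        given, family = " ".join(parts[:-1]), parts[-1]
    # Pull lowercase particles ("van", "de", ...) off the front of the family name.
    tokens = family.split()
    particles = []
    while tokens and tokens[0].lower() in PARTICLES:
        particles.append(tokens.pop(0))
    name = {"family": " ".join(tokens), "given": given}
    if particles:
        name["non-dropping-particle"] = " ".join(particles)
    return name
```

A real implementation would also need to handle suffixes, dropping particles, and `literal` names; this only shows why the parse is “easy enough” once the delimiter convention is fixed.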
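Similarly, “page parsing from string” could look like the following minimal sketch. The output keys (`page-first`, `page-last`) and the expansion heuristic are assumptions for illustration, not taken from any spec:

```python
import re

# Hypothetical sketch: split a page field like "112-9" or "112–139" into
# first/last pages, expanding abbreviated second numbers ("112-9" -> 119).
def parse_pages(pages):
    m = re.match(r"^\s*(\d+)\s*[-–—]+\s*(\d+)\s*$", pages)
    if m is None:
        # Single page, roman numerals, or anything unparseable: pass through.
        return {"page-first": pages.strip()}
    first, last = m.group(1), m.group(2)
    if len(last) < len(first):     # "112-9" abbreviates "112-119"
        last = first[:len(first) - len(last)] + last
    return {"page-first": first, "page-last": last}
```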
One other question is how a processor should distinguish clean from “messy” CSL data. We should identify what signals “clean” status for each field (e.g., a title having a ‘main’ part indicates it’s been parsed; for a name, perhaps the presence of all of the name-parts, even if empty?).
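One possible heuristic for that “clean” signal, as a sketch (the rules are a proposal, not an agreed convention; field names follow CSL-JSON):

```python
# Proposal sketch: detect whether a field has already been parsed.
NAME_PARTS = ("family", "given", "non-dropping-particle", "dropping-particle", "suffix")

def title_is_clean(title):
    # A parsed title is an object carrying a "main" part; a bare string is "messy".
    return isinstance(title, dict) and "main" in title

def name_is_clean(name):
    # Per the suggestion above: a name counts as parsed when every name-part
    # key is present, even if its value is an empty string.
    return isinstance(name, dict) and all(part in name for part in NAME_PARTS)
```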
Top GitHub Comments
WDYT about this?
Currently, the content is mostly a placeholder, aside from the beginnings of the JSON schema (I would, however, need to hook it up to the data schema for the output representation).
So the idea is that it’s just a simple repo aimed at publishing both human-readable (markdown -> html) and machine-readable (json) representations.
This would be the URL for the dates json file, for example:
https://citationstyles.org/data-parsing/json/dates.json
And the start of a titles html page (it would be best to include real examples, though):
https://citationstyles.org/data-parsing/titles.html
So separate pages and json files for each data type.
Ideally, we’d move some of the examples from the test suite here, so the suite is only focused on what to do with correctly structured data.
It would allow developers to submit PRs, of course.
Yes, if there is a collection of representative/useful input-output pairs for a task, then I can imagine a 3-tier approach:
An important advantage of using simple task-specific lists of input-output pairs in (3) is that the problem of somehow representing the unparsed input in (extended?) CSL-JSON would simply go away.