
Preprocessing steps


Following the discussion on Discourse, we seem to be moving toward separating the rendering of citations from CSL data, and the wrangling of data into clean CSL structures, into discrete steps.

I think this is a reasonable conceptual distinction; it could be useful for dividing development labor, and it could speed up processors by letting them skip parsing when an item is already fully specified.

Related to my concerns in the Discourse thread, I think we should make clear that these steps must occur somewhere in the CSL workflow. Whether that happens in the calling application or in the processor (either directly or by calling an external script) is open.

I’ve started a list of these preprocessing steps below. It would be good if someone could go through the CSL test suite (and maybe citeproc-js and pandoc-citeproc) to identify others.

  • title part parsing from string field
  • name particle and suffix parsing from Given and Family parts
  • name parsing from string
    • for legacy reasons, citeproc-js parses Family || Given. That’s an easy enough extra parse beyond the above, and it’s a nice format for plain text extra
  • locator parsing from string
  • page parsing from string
  • generation of citation-label if a format is called?
    • this one probably needs to be done by processors rather than a calling application
  • generating name initials/determining whether letters are initials or not
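To make the "name parsing from string" step above concrete, here is a minimal sketch; the function name and the fallback rules are illustrative assumptions, not anything the spec or any processor currently mandates:

```python
def parse_name(raw: str) -> dict:
    """Parse a plain-text name into CSL name parts.

    Follows the citeproc-js "Family || Given" convention mentioned
    above; falls back to "Family, Given", then to a literal name.
    The exact rules here are illustrative, not normative.
    """
    if "||" in raw:
        family, _, given = (p.strip() for p in raw.partition("||"))
        return {"family": family, "given": given}
    if "," in raw:
        family, _, given = (p.strip() for p in raw.partition(","))
        return {"family": family, "given": given}
    # No recognizable separator: treat as an institutional/literal name.
    return {"literal": raw.strip()}
```

A real implementation would also need to handle particles and suffixes (the second bullet above), which this sketch deliberately skips.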

One other question is how a processor should distinguish clean from “messy” CSL data. We should identify what signals “clean” status for each field (e.g., a title having a ‘main’ part indicates it’s been parsed; for a name, perhaps the presence of all of the name parts, even if empty?).
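As a strawman for that signal check, something like the following could work; the field shapes follow CSL-JSON, but the specific “clean” criteria are exactly the open question here, so treat them as assumptions:

```python
def looks_clean_title(title) -> bool:
    # A title parsed into parts would be a dict with a 'main' key;
    # a bare string signals unparsed input. (Assumed convention.)
    return isinstance(title, dict) and "main" in title


def looks_clean_name(name: dict) -> bool:
    # Presence of the expected name-part keys, even if empty, could
    # signal that parsing has already happened; a literal name is
    # clean by definition. (Assumed convention.)
    return {"family", "given"} <= name.keys() or "literal" in name
```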

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 22 (17 by maintainers)

Top GitHub Comments

bdarcus commented on Jul 21, 2020 (1 reaction)

WDYT about this?

Currently, the content is mostly a placeholder, aside from the beginnings of the JSON schema (I would, however, need to hook it up to the data schema for the output representation).

So the idea is that it’s just a simple repo aimed at publishing both human-readable (Markdown → HTML) and machine-readable (JSON) representations.

This would be the URL for the dates json file, for example:

https://citationstyles.org/data-parsing/json/dates.json

And a start of a titles HTML page (it would be best to include real examples, though):

https://citationstyles.org/data-parsing/titles.html

So: separate pages and JSON files for each data type.
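For illustration, such a dates.json might be nothing more than an array of input-output pairs. This fragment is only a guess at the shape (the `input`/`output` keys are assumptions), not the actual file; the `date-parts` structure is standard CSL-JSON:

```json
[
  {
    "input": "2019-04-04",
    "output": { "date-parts": [[2019, 4, 4]] }
  },
  {
    "input": "April 2019",
    "output": { "date-parts": [[2019, 4]] }
  }
]
```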

Ideally, we’d move some of the examples from the test suite here, so the suite focuses only on what to do with correctly structured data.

It would allow developers to submit PRs, of course.

andras-simonyi commented on Jul 21, 2020 (1 reaction)

> So what form would this take? A web page with examples of in-the-wild string data, and how it should be converted, at least logically, to CSL-JSON names, dates, and titles? It does strike me that some of this parsing might belong in the spec, and some not. But I’m just focused on the input-data angle.

Yes, if there is a collection of representative/useful input-output pairs for a task then I can imagine a 3-tier approach:

  1. A few examples could figure in the (text version of the) standard spec, hinting (hand-waving…) at the semantics of elements/fields, e.g., clarifying what a name suffix is;
  2. a larger number of useful examples, also dealing with corner cases etc., perhaps with discussions of the rationale behind them where it’s not transparent, could be available on a separate web page outside the spec;
  3. the full list of examples would be published in a machine-readable format, e.g., JSON, but this would simply be something like an array of input-output pairs, nothing like a full-fledged test in the current test suite. The only additional structure I can imagine is indicating which tier an example belongs to, but I’m not sure whether this is necessary.

It would be an important advantage of using simple, task-specific lists of input-output pairs in tier 3 that the problem of somehow representing the unparsed input in (extended?) CSL-JSON would simply go away.
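A processor’s test harness could then consume such a pair list directly. A minimal sketch, where the pair layout, the `check_parser` helper, and the toy parser are all hypothetical (real files would be loaded with `json.load` from, e.g., dates.json):

```python
def check_parser(parse, pairs):
    """Run a date/name/title parser against a list of
    {"input": ..., "output": ...} pairs and collect failures."""
    failures = []
    for pair in pairs:
        got = parse(pair["input"])
        if got != pair["output"]:
            failures.append((pair["input"], got, pair["output"]))
    return failures


# Inline pair list standing in for a downloaded JSON file:
pairs = [
    {"input": "2019-04-04", "output": {"date-parts": [[2019, 4, 4]]}},
]


def toy_date_parser(s):
    # Hypothetical parser for ISO-style dates only.
    y, m, d = (int(p) for p in s.split("-"))
    return {"date-parts": [[y, m, d]]}


print(check_parser(toy_date_parser, pairs))  # empty list when all pairs pass
```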
