
Preprocessing steps


Following the discussion on Discourse, we seem to be moving toward separating the rendering of citations from CSL data, and the wrangling of data into clean CSL structures, into discrete steps.

I think this is a reasonable conceptual distinction; it could be useful for dividing development labor, and it could speed up processors by letting them skip parsing when an item is already fully specified.

Related to my concerns in the Discourse thread, I think we should make clear that these steps must occur somewhere in the CSL workflow. Whether that happens in the calling application or in the processor (either directly or by calling an external script) is open.

I’ve started a list of these preprocessing steps below. It would be good if someone could go through the CSL test suite (and maybe citeproc-js and pandoc-citeproc) to identify others.

  • title part parsing from string field
  • name particle and suffix parsing from Given and Family parts
  • name parsing from string
    • for legacy reasons, citeproc-js parses Family || Given. That’s an easy enough extra parse beyond the above, and it’s a nice format for plain text extra
  • locator parsing from string
  • page parsing from string
  • generation of citation-label if a format is called?
    • this one probably needs to be done by processors rather than a calling application
  • generating name initials/determining whether letters are initials or not
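To make the "name parsing from string" step above concrete, here is a minimal sketch; the function name and the fallback rules are illustrative assumptions, not anything the spec or any processor currently mandates:

```python
def parse_name(raw: str) -> dict:
    """Parse a plain-text name into CSL name parts.

    Follows the citeproc-js "Family || Given" convention mentioned
    above; falls back to "Family, Given", then to a literal name.
    The exact rules here are illustrative, not normative.
    """
    if "||" in raw:
        family, _, given = (p.strip() for p in raw.partition("||"))
        return {"family": family, "given": given}
    if "," in raw:
        family, _, given = (p.strip() for p in raw.partition(","))
        return {"family": family, "given": given}
    # No recognizable separator: treat as an institutional/literal name.
    return {"literal": raw.strip()}
```

A real implementation would also need to handle particles and suffixes (the second bullet above), which this sketch deliberately skips.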

One other question is how a processor should distinguish clean from “messy” CSL data. We should identify what signals “clean” status for each field (e.g., a title having a ‘main’ part indicates it’s been parsed; for a name, perhaps the presence of all of the name parts, even if empty?).
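As a strawman for that signal check, something like the following could work; the field shapes follow CSL-JSON, but the specific “clean” criteria are exactly the open question here, so treat them as assumptions:

```python
def looks_clean_title(title) -> bool:
    # A title parsed into parts would be a dict with a 'main' key;
    # a bare string signals unparsed input. (Assumed convention.)
    return isinstance(title, dict) and "main" in title


def looks_clean_name(name: dict) -> bool:
    # Presence of the expected name-part keys, even if empty, could
    # signal that parsing has already happened; a literal name is
    # clean by definition. (Assumed convention.)
    return {"family", "given"} <= name.keys() or "literal" in name
```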

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 22 (17 by maintainers)

Top GitHub Comments

bdarcus commented on Jul 21, 2020 (1 reaction)

WDYT about this?

Currently, the content is mostly a placeholder, aside from the beginnings of the JSON schema (I would, however, need to hook it up to the data schema for the output representation).

So the idea is that it’s just a simple repo aimed at publishing both human-readable (Markdown → HTML) and machine-readable (JSON) representations.

This would be the URL for the dates json file, for example:

https://citationstyles.org/data-parsing/json/dates.json

And a start of a titles HTML page (it would be best to include real examples, though):

https://citationstyles.org/data-parsing/titles.html

So: separate pages and JSON files for each data type.
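For illustration, such a dates.json might be nothing more than an array of input-output pairs. This fragment is only a guess at the shape (the `input`/`output` keys are assumptions), not the actual file; the `date-parts` structure is standard CSL-JSON:

```json
[
  {
    "input": "2019-04-04",
    "output": { "date-parts": [[2019, 4, 4]] }
  },
  {
    "input": "April 2019",
    "output": { "date-parts": [[2019, 4]] }
  }
]
```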

Ideally, we’d move some of the examples from the test suite here, so the suite focuses only on what to do with correctly structured data.

It would allow developers to submit PRs, of course.

andras-simonyi commented on Jul 21, 2020 (1 reaction)

> So what form would this take? A web page with examples of in-the-wild string data, and how it should be converted, at least logically, to CSL-JSON names, dates, and titles? It does strike me that some of this parsing might belong in the spec, and some not. But I’m just focused on the input-data angle.

Yes, if there is a collection of representative/useful input-output pairs for a task then I can imagine a 3-tier approach:

  1. A few examples could figure in the (text version of the) standard spec, hinting (hand-waving…) at the semantics of elements/fields, e.g., clarifying what a name suffix is;
  2. a larger number of useful examples, also dealing with corner cases etc., perhaps with discussions of the rationale behind them where it’s not transparent, could be available on a separate web page outside the spec;
  3. the full list of examples would be published in a machine-readable format, e.g., JSON, but this would simply be something like an array of input-output pairs, nothing like a full-fledged test in the current test suite. The only additional structure I can imagine is indicating which tier an example belongs to, but I’m not sure whether this is necessary.

It would be an important advantage of using simple, task-specific lists of input-output pairs in tier 3 that the problem of somehow representing the unparsed input in (extended?) CSL-JSON would simply go away.
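A processor’s test harness could then consume such a pair list directly. A minimal sketch, where the pair layout, the `check_parser` helper, and the toy parser are all hypothetical (real files would be loaded with `json.load` from, e.g., dates.json):

```python
def check_parser(parse, pairs):
    """Run a date/name/title parser against a list of
    {"input": ..., "output": ...} pairs and collect failures."""
    failures = []
    for pair in pairs:
        got = parse(pair["input"])
        if got != pair["output"]:
            failures.append((pair["input"], got, pair["output"]))
    return failures


# Inline pair list standing in for a downloaded JSON file:
pairs = [
    {"input": "2019-04-04", "output": {"date-parts": [[2019, 4, 4]]}},
]


def toy_date_parser(s):
    # Hypothetical parser for ISO-style dates only.
    y, m, d = (int(p) for p in s.split("-"))
    return {"date-parts": [[y, m, d]]}


print(check_parser(toy_date_parser, pairs))  # empty list when all pairs pass
```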
