[proposal] Specifying input/output formats and (natural) languages

I am currently missing the ability to specify what types of files the SoftwareApplication consumes or produces. I think this is important software metadata. I would like to propose adding something like:

inputFormat - (Text) - Media type, typically MIME format of a file consumed as input (in whatever way) by the application
outputFormat - (Text) - Media type, typically MIME format of a file produced as output (in whatever way) by the application

Also, if the input concerns any kind of human text or speech, adding a language identifier is very desirable, for which I’d suggest something like:

inputLanguage (Text or schema:Language) - Supported natural language for input data
outputLanguage (Text or schema:Language) - Supported natural language for output data
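As a rough illustration of the four proposed properties, here is a minimal Python sketch that builds a codemeta-style JSON-LD record. The property names come from the proposal above and are not part of codemeta or schema.org, so a JSON-LD processor using the standard context would not recognize them. The tool name, the media types, and the use of BCP 47 language tags as plain Text values are assumptions made for the example.

```python
import json


def build_codemeta(name, input_formats, output_formats,
                   input_languages, output_languages):
    """Return a codemeta-style dict that uses the PROPOSED properties.

    inputFormat, outputFormat, inputLanguage and outputLanguage are
    hypothetical: they are the names suggested in this issue, not terms
    defined by codemeta or schema.org.
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareApplication",
        "name": name,
        # Proposed: media (MIME) types consumed and produced.
        "inputFormat": list(input_formats),
        "outputFormat": list(output_formats),
        # Proposed: natural languages, given here as BCP 47 tags (Text).
        "inputLanguage": list(input_languages),
        "outputLanguage": list(output_languages),
    }


# "example-tokenizer" is a made-up tool used only for illustration.
record = build_codemeta(
    name="example-tokenizer",
    input_formats=["text/plain"],
    output_formats=["application/json"],
    input_languages=["nl", "en"],
    output_languages=["nl", "en"],
)

print(json.dumps(record, indent=2))
```

Lists are used for every property because a tool may accept or produce more than one format or language.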

Context: I’m producing codemeta metadata for a lot of NLP tools.

Producing complex profiles of input and output is most probably well beyond the scope of the codemeta initiative and best left to things like OpenAPI/swagger, but I think some very simple basics should be in place.

What do you think?

Issue Analytics

Created 5 years ago
Comments: 6 (1 by maintainers)

Top GitHub Comments

proycon commented, May 24, 2022

Yes, you’re right that the specification is only really actionable if you know which of the inputs/outputs accepts/produces which files. This was indeed intended more as a high-level description, one that users can rely on to discover tools and to assess whether a tool is suitable for them based on its possible input/output data.
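To make the discovery use case concrete, here is a small Python sketch of filtering a catalog by accepted media type. The catalog entries, the tool names, and the inputFormat/outputFormat keys are all hypothetical; they follow the property names proposed in this issue rather than any existing codemeta term.

```python
def tools_accepting(catalog, media_type):
    """Return the names of tools that declare media_type as an input format.

    Tools without an inputFormat entry are skipped, since nothing can be
    said about what they accept.
    """
    return [
        tool["name"]
        for tool in catalog
        if media_type in tool.get("inputFormat", [])
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    {"name": "tokenizer-a",
     "inputFormat": ["text/plain"],
     "outputFormat": ["application/json"]},
    {"name": "ocr-b",
     "inputFormat": ["image/png", "image/tiff"],
     "outputFormat": ["text/plain"]},
    {"name": "legacy-c"},  # no format metadata at all
]

print(tools_accepting(catalog, "text/plain"))  # ['tokenizer-a']
```

This only supports coarse filtering: it says that a tool accepts a format somewhere, not which of its inputs does.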

And then most of the time, having the output format is not useful. For instance, I can select that the output is “CSV” and you would not know what to do with it, besides opening it.

Well, that’s already something: at least you can use it to decide what to open it with.

Yeah, add it to the profile as well if you want.

I’ll draft up something in our https://github.com/SoftwareUnderstanding/software_types repo

proycon commented, Jun 15, 2018

@cboettig Thanks for your reaction! I understand the need to stay as close to schema.org as possible, and you’re undoubtedly more at home in their conventions than I am. I really like your idea of having an attribute (inputData/outputData? consumesData/producesData?) take the full CreativeWork (or derivatives) types; that may be more elegant than what I suggested. In turn, I can then indeed just use the inLanguage and encodingFormat properties, so I’d gladly go along with that.
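A minimal Python sketch of that CreativeWork-based alternative might look as follows. inLanguage and encodingFormat are existing schema.org properties of CreativeWork; consumesData and producesData are only the candidate names floated in this comment and do not exist in schema.org or codemeta. The tool name and values are invented for the example.

```python
import json


def data_description(media_type, language):
    """Describe one kind of data as a schema.org CreativeWork."""
    return {
        "@type": "CreativeWork",
        "encodingFormat": media_type,  # existing schema.org property
        "inLanguage": language,        # existing schema.org property
    }


# "example-parser" is a made-up tool; consumesData/producesData are
# hypothetical property names taken from the discussion above.
tool = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareApplication",
    "name": "example-parser",
    "consumesData": [data_description("text/plain", "nl")],
    "producesData": [data_description("application/xml", "nl")],
}

print(json.dumps(tool, indent=2))
```

Compared with four flat properties, this keeps each format paired with its language, at the cost of a slightly more nested record.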

Alternatively, schema.org has availableLanguage (A language someone may use with or at the item, service or place), which could perhaps be stretched to mean what I suggested (but that still leaves the MIME type issue open). I also found that EntryPoint (see also #183) does have contentType and encodingType (for describing web API endpoints), which in a way is already more specific (and too specific for my use) than what I propose.

I wouldn’t want to go the entire way of describing the entire software API of course, but the notion of software consuming some kind of data and producing another (either one or more) is so central (also outside of NLP) that I think it wouldn’t be out of place.

For the moment, listing input or output MIME types in the keywords might make them more visible.

Yeah, but I’m more concerned about proper semantics than visibility. I developed a codemeta-based portal (example: https://webservices-lst.science.ru.nl, source: https://github.com/proycon/labirinto), so I can easily implement whatever proper solution is agreed on.
