Pass `Accept` header in `contrib.utils.download`

I’m copying a comment here that I made in the HEPData Zulip chat on 16th October 2020.

Regarding the issue (HEPData/hepdata#162) to mint DOIs for all local resource files attached to a submission, if we do eventually get around to addressing it, we would probably redirect the DOI to a landing page for the resource file, rather than to the resource file itself (e.g. the pyhf tarball). This would follow the DataCite Best Practices for DOI Landing Pages, e.g. “DOIs should resolve to a landing page, not directly to the content”, which I’m currently breaking for the two manually minted DOIs. In the issue (HEPData/hepdata#162) I mentioned the possibility of using DataCite Content Negotiation to redirect to the resource file itself, but the linked page now says “Custom content types are no longer supported since January 1st, 2020”. I thought maybe content negotiation could be used to return the .tar.gz file directly, but the intended purpose is to retrieve DOI metadata in different formats, not to provide the content itself. In anticipation of possible future changes, I’d recommend that you use the URL directly rather than the DOI in pyhf download scripts and documentation (e.g. revert #1109).

Issue Analytics

Created 2 years ago
Comments: 9 (6 by maintainers)

Top GitHub Comments

GraemeWatt commented, Oct 26, 2021

I’ve been investigating three options to directly return content (i.e. the pyhf tarball) from the DOI after we mint DOIs for local resource files with URLs directing to a landing page rather than the resource file itself (see HEPData/hepdata#162).

1. Following the suggestion of @mfenner, we could embed Schema.org metadata on the HEPData landing page for the resource file in JSON-LD format (see HEPData/hepdata#145) including a `contentUrl` property. One problem is that doing `curl -LH "Accept: application/vnd.schemaorg.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` or `curl -LH "Accept: application/ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` returns JSON-LD from DataCite (without `contentUrl`) using DataCite Content Negotiation before getting to the HEPData server. I think we would need to introduce a custom metadata content type like `curl -LH "Accept: application/vnd.hepdata.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` to return the JSON-LD from the HEPData landing page. The pyhf code would then parse the `contentUrl` and make the download in another request.
2. DataCite offers a media API where custom content types can be registered and then later retrieved via a public REST API, although content negotiation is no longer supported. However, it should be possible to retrieve the metadata via, for example, https://api.datacite.org/dois/10.17182/hepdata.89408.v1/r2 and then parse the media to find the registered URL of the content for a specific media type like `application/x-tar`. I tried to test the DataCite media API by registering a custom content type for one DOI, but it doesn’t seem to be working. I reported the problems I found to DataCite support, but I don’t think the media API option is worth pursuing further.
3. A 2019 blog article by @mfenner mentions an alternative option to “use content negotiation at the landing page for the resource that the DOI resolves to. DataCite content negotiation is forwarding all requests with unknown content types to the URL registered in the handle system.” This seems like the simplest option for the pyhf use case. The HEPData landing page for the resource file can check if the `Accept` request HTTP header matches the content type of the resource file and return the content directly if so, for example, `curl -LH "Accept: application/x-tar" https://doi.org/10.17182/hepdata.89408.v1/r2`. In the pyhf Python code, you’d just need to replace this line: https://github.com/scikit-hep/pyhf/blob/260315d2930b38258ad4c0718b0274c9eca2e6d4/src/pyhf/contrib/utils.py#L56 with:

        with requests.get(archive_url, headers={'Accept': 'application/x-tar'}) as response:

Some other suggestions for improvements to this code:

- Check the `response.status_code` and return an error message if not OK.
- Use `tarfile.is_tarfile` to check that `response.content` is actually a tarball and return an error message if not.
- Remove `mode="r|gz"` or replace it with `mode="r"` or `mode="r:*"` for reading with transparent compression, so that the code works also with uncompressed tarballs (see #1111 and #1519), where the media type is still `application/x-tar`.
- Maybe add an option to download a zipfile instead of a tarball (see #1519), then you’d need `headers={'Accept': 'application/zip'}` in the request and `zipfile.is_zipfile` to check the response content. You could use the Python `zipfile` module to unpack, but maybe easier to use `shutil.unpack_archive` for both tarballs and zipfiles.
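
The validation and unpacking suggestions above could be combined roughly as follows. This is a minimal sketch, not pyhf's actual code: `unpack_archive_bytes` is a hypothetical helper that takes the bytes from `response.content` and a target directory. It tries `tarfile.open` with `mode="r:*"` inside a `try` block instead of calling `tarfile.is_tarfile`, because `is_tarfile` only accepts file-like objects from Python 3.9 onwards.

```python
import io
import tarfile
import zipfile
from pathlib import Path


def unpack_archive_bytes(content: bytes, output_directory: str) -> str:
    """Validate downloaded bytes as a tarball or zipfile and extract them.

    Hypothetical helper for illustration only. Returns "tar" or "zip".
    """
    Path(output_directory).mkdir(parents=True, exist_ok=True)

    try:
        # mode="r:*" reads gzip, bz2, xz, or uncompressed tarballs transparently.
        with tarfile.open(fileobj=io.BytesIO(content), mode="r:*") as archive:
            # For untrusted archives, consider filter="data" where available
            # (Python 3.12+); it is omitted here to run on older versions.
            archive.extractall(output_directory)
        return "tar"
    except tarfile.ReadError:
        pass  # Not a tarball; fall through to the zipfile check.

    if zipfile.is_zipfile(io.BytesIO(content)):
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            archive.extractall(output_directory)
        return "zip"

    # E.g. an HTML landing page was returned instead of the archive.
    raise ValueError("Downloaded content is neither a tarball nor a zipfile.")
```

In the download code this might be called as `unpack_archive_bytes(response.content, output_directory)` after confirming that `response.status_code` equals `requests.codes.ok`.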

Making these changes should not break the functionality with the current situation (where https://doi.org/10.17182/hepdata.89408.v1/r2 returns the tarball directly). I’d therefore recommend you make them ASAP before the next pyhf release. After we redirect the DOI to the landing page, probably in the next few weeks, the DOI will return the HTML landing page instead of the tarball unless the request contains the Accept: application/x-tar header.
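
To guard against the landing page being returned once the redirect is in place, the client could pick the `Accept` header by archive type and inspect the `Content-Type` of the response. The sketch below is an assumption about how that might look; `accept_headers` and `is_landing_page` are hypothetical names, and only the two media types mentioned above are covered.

```python
# Media types taken from the discussion above.
ACCEPT_BY_ARCHIVE_TYPE = {
    "tar": "application/x-tar",
    "zip": "application/zip",
}


def accept_headers(archive_type: str = "tar") -> dict:
    """Build the request headers for the wanted archive type."""
    return {"Accept": ACCEPT_BY_ARCHIVE_TYPE[archive_type]}


def is_landing_page(content_type: str) -> bool:
    """Return True if a Content-Type header value indicates HTML."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in {"text/html", "application/xhtml+xml"}
```

With `requests`, these could be used as `requests.get(archive_url, headers=accept_headers("tar"))` followed by `is_landing_page(response.headers.get("Content-Type", ""))` to produce a clear error message when HTML comes back.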

danielskatz commented, Jun 16, 2021

Hey @mfenner - can you help here?

I think it should be possible to programmatically query the DOI and get the location of the underlying object, then fetch it.

Is this correct? Is there any code available that demonstrates this?
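
A minimal sketch of such a query, using the DataCite REST API endpoint quoted elsewhere in this thread. It assumes the response follows a JSON:API layout with the registered URL at `data.attributes.url`; the function names are hypothetical. Note that the registered URL may be a landing page rather than the file itself, in which case a second step (such as the `Accept` header approach) is still needed.

```python
import json
import urllib.request

DATACITE_API = "https://api.datacite.org/dois/"


def registered_url(metadata: dict) -> str:
    """Extract the URL registered for a DOI from DataCite metadata.

    Assumes the layout {"data": {"attributes": {"url": ...}}}.
    """
    return metadata["data"]["attributes"]["url"]


def fetch_doi_metadata(doi: str) -> dict:
    """Fetch the metadata record for a DOI such as 10.17182/hepdata.89408.v1/r2."""
    with urllib.request.urlopen(DATACITE_API + doi) as response:
        return json.load(response)
```

For example, `registered_url(fetch_doi_metadata("10.17182/hepdata.89408.v1/r2"))` would give the location the DOI currently points to, which could then be fetched in a separate request.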