Pass `Accept` header in `contrib.utils.download`

I’m copying a comment here that I made in the HEPData Zulip chat on 16th October 2020.

Regarding the issue (HEPData/hepdata#162) to mint DOIs for all local resource files attached to a submission: if we do eventually get around to addressing it, we would probably redirect the DOI to a landing page for the resource file, rather than to the resource file itself (e.g. the pyhf tarball). This would follow the DataCite Best Practices for DOI Landing Pages, e.g. “DOIs should resolve to a landing page, not directly to the content”, which I’m currently breaking for the two manually minted DOIs. In the issue (HEPData/hepdata#162) I mentioned the possibility of using DataCite Content Negotiation to redirect to the resource file itself, but the linked page now says “Custom content types are no longer supported since January 1st, 2020”. I thought maybe content negotiation could be used to return the .tar.gz file directly, but the intended purpose is to retrieve DOI metadata in different formats, not to provide the content itself. In anticipation of possible future changes, I’d recommend that you use the URL directly rather than the DOI in pyhf download scripts and documentation (e.g. revert #1109).

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
GraemeWatt commented, Oct 26, 2021

I’ve been investigating three options to directly return content (i.e. the pyhf tarball) from the DOI after we mint DOIs for local resource files with URLs directing to a landing page rather than the resource file itself (see HEPData/hepdata#162).

  1. Following the suggestion of @mfenner, we could embed Schema.org metadata on the HEPData landing page for the resource file in JSON-LD format (see HEPData/hepdata#145) including a contentUrl property. One problem is that doing curl -LH "Accept: application/vnd.schemaorg.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2 or curl -LH "Accept: application/ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2 returns JSON-LD from DataCite (without contentUrl) using DataCite Content Negotiation before getting to the HEPData server. I think we would need to introduce a custom metadata content type like curl -LH "Accept: application/vnd.hepdata.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2 to return the JSON-LD from the HEPData landing page. The pyhf code would then parse the contentUrl and make the download in another request.

  2. DataCite offers a media API where custom content types can be registered and then later retrieved via a public REST API, although content negotiation is no longer supported. However, it should be possible to retrieve the metadata via, for example, https://api.datacite.org/dois/10.17182/hepdata.89408.v1/r2 and then parse the media to find the registered URL of the content for a specific media type like application/x-tar. I tried to test the DataCite media API by registering a custom content type for one DOI, but it doesn’t seem to be working. I reported the problems I found to DataCite support, but I don’t think the media API option is worth pursuing further.

  3. A 2019 blog article by @mfenner mentions an alternative option to “use content negotiation at the landing page for the resource that the DOI resolves to. DataCite content negotiation is forwarding all requests with unknown content types to the URL registered in the handle system.” This seems like the simplest option for the pyhf use case. The HEPData landing page for the resource file can check if the Accept request HTTP header matches the content type of the resource file and return the content directly if so, for example, curl -LH "Accept: application/x-tar" https://doi.org/10.17182/hepdata.89408.v1/r2. In the pyhf Python code, you’d just need to replace this line: https://github.com/scikit-hep/pyhf/blob/260315d2930b38258ad4c0718b0274c9eca2e6d4/src/pyhf/contrib/utils.py#L56 with:

        with requests.get(archive_url, headers={'Accept': 'application/x-tar'}) as response:

Some other suggestions for improvements to this code:

  • Check the response.status_code and return an error message if not OK.
  • Use tarfile.is_tarfile to check that response.content is actually a tarball and return an error message if not.
  • Remove mode="r|gz" or replace it with mode="r" or mode="r:*" for reading with transparent compression, so that the code works also with uncompressed tarballs (see #1111 and #1519), where the media type is still application/x-tar.
  • Maybe add an option to download a zipfile instead of a tarball (see #1519), then you’d need headers={'Accept': 'application/zip'} in the request and zipfile.is_zipfile to check the response content. You could use the Python zipfile module to unpack, but maybe easier to use shutil.unpack_archive for both tarballs and zipfiles.
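The first three suggestions could be combined roughly as follows. This is only a sketch, not the actual pyhf implementation: the function names `extract_tarball` and `download` and the error messages are assumptions made for illustration. It opens the archive with `tarfile.open` and catches `tarfile.TarError`, which plays the same role as a `tarfile.is_tarfile` check while also working on in-memory bytes in older Python versions.

```python
import io
import tarfile
from pathlib import Path


def extract_tarball(content: bytes, output_directory: str) -> None:
    """Unpack tarball bytes; raise ValueError if they are not a tarball."""
    try:
        # mode="r:*" reads with transparent compression, so gzipped and
        # uncompressed tarballs are both accepted.
        archive = tarfile.open(fileobj=io.BytesIO(content), mode="r:*")
    except tarfile.TarError as err:
        # For example, an HTML landing page was returned instead of the archive.
        raise ValueError("downloaded content is not a tarball") from err
    with archive:
        Path(output_directory).mkdir(parents=True, exist_ok=True)
        archive.extractall(output_directory)


def download(archive_url: str, output_directory: str) -> None:
    """Request the archive with an explicit Accept header and unpack it."""
    # Third-party dependency, imported here so that extract_tarball can be
    # used (and tested) without requests installed.
    import requests

    headers = {"Accept": "application/x-tar"}
    with requests.get(archive_url, headers=headers) as response:
        if response.status_code != 200:
            raise RuntimeError(
                f"{archive_url} returned HTTP status {response.status_code}"
            )
        extract_tarball(response.content, output_directory)
```

Keeping the validation and unpacking in a separate function from the HTTP request means the tarball check can be exercised without network access.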

Making these changes should not break the functionality with the current situation (where https://doi.org/10.17182/hepdata.89408.v1/r2 returns the tarball directly). I’d therefore recommend you make them ASAP before the next pyhf release. After we redirect the DOI to the landing page, probably in the next few weeks, the DOI will return the HTML landing page instead of the tarball unless the request contains the Accept: application/x-tar header.
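For comparison, option 1 in the list above would need a second step on the client side: parsing the JSON-LD returned by the landing page and reading its `contentUrl` before making the download request. A minimal sketch of that parsing step is below. The helper name `find_content_url` and the shape of the example document are assumptions; the actual JSON-LD that HEPData would serve is not specified in this thread.

```python
import json
from typing import Optional


def find_content_url(json_ld_text: str) -> Optional[str]:
    """Return the first contentUrl string found in a JSON-LD document, or None."""

    def search(node):
        if isinstance(node, dict):
            value = node.get("contentUrl")
            if isinstance(value, str):
                return value
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            return None
        # Look inside nested objects such as a "distribution" entry.
        for child in children:
            found = search(child)
            if found is not None:
                return found
        return None

    return search(json.loads(json_ld_text))


# Hypothetical JSON-LD payload, for illustration only.
example = json.dumps(
    {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/resource.tar.gz",
        },
    }
)
print(find_content_url(example))
```

The extra request and parsing step is part of why option 3, which only needs an `Accept` header on the existing request, is described above as the simplest for the pyhf use case.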

1 reaction
danielskatz commented, Jun 16, 2021

Hey @mfenner - can you help here?

I think it should be possible to programmatically query the DOI and get the location of the underlying object, then fetch it.

Is this correct? Is there any code available that demonstrates this?
