Pass `Accept` header in `contrib.utils.download`
I’m copying a comment here that I made in the HEPData Zulip chat on 16th October 2020.
Regarding the issue (HEPData/hepdata#162) to mint DOIs for all local resource files attached to a submission, if we do eventually get around to addressing it, we would probably redirect the DOI to a landing page for the resource file, rather than to the resource file itself (e.g. the pyhf tarball). This would follow the DataCite Best Practices for DOI Landing Pages, e.g. “DOIs should resolve to a landing page, not directly to the content”, which I’m currently breaking for the two manually minted DOIs. In the issue (HEPData/hepdata#162) I mentioned the possibility of using DataCite Content Negotiation to redirect to the resource file itself, but the linked page now says “Custom content types are no longer supported since January 1st, 2020”. I thought maybe content negotiation could be used to return the .tar.gz file directly, but the intended purpose is to retrieve DOI metadata in different formats, not to provide the content itself. In anticipation of possible future changes, I’d recommend that you use the URL directly rather than the DOI in pyhf download scripts and documentation (e.g. revert #1109).
Issue Analytics
- State:
- Created 2 years ago
- Comments:9 (6 by maintainers)

I’ve been investigating three options to directly return content (i.e. the `pyhf` tarball) from the DOI after we mint DOIs for local resource files with URLs directing to a landing page rather than the resource file itself (see HEPData/hepdata#162).

1. Following the suggestion of @mfenner, we could embed Schema.org metadata on the HEPData landing page for the resource file in JSON-LD format (see HEPData/hepdata#145) including a `contentUrl` property. One problem is that doing `curl -LH "Accept: application/vnd.schemaorg.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` or `curl -LH "Accept: application/ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` returns JSON-LD from DataCite (without `contentUrl`) using DataCite Content Negotiation before getting to the HEPData server. I think we would need to introduce a custom metadata content type like `curl -LH "Accept: application/vnd.hepdata.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` to return the JSON-LD from the HEPData landing page. The `pyhf` code would then parse the `contentUrl` and make the download in another request.
2. DataCite offers a media API where custom content types can be registered and then later retrieved via a public REST API, although content negotiation is no longer supported. However, it should be possible to retrieve the metadata via, for example, https://api.datacite.org/dois/10.17182/hepdata.89408.v1/r2 and then parse the `media` to find the registered URL of the content for a specific media type like `application/x-tar`. I tried to test the DataCite media API by registering a custom content type for one DOI, but it doesn’t seem to be working. I reported the problems I found to DataCite support, but I don’t think the media API option is worth pursuing further.
3. A 2019 blog article by @mfenner mentions an alternative option to “use content negotiation at the landing page for the resource that the DOI resolves to. DataCite content negotiation is forwarding all requests with unknown content types to the URL registered in the handle system.” This seems like the simplest option for the `pyhf` use case. The HEPData landing page for the resource file can check if the `Accept` request HTTP header matches the content type of the resource file and return the content directly if so, for example, `curl -LH "Accept: application/x-tar" https://doi.org/10.17182/hepdata.89408.v1/r2`. In the `pyhf` Python code, you’d just need to replace this line: https://github.com/scikit-hep/pyhf/blob/260315d2930b38258ad4c0718b0274c9eca2e6d4/src/pyhf/contrib/utils.py#L56 with a version of the request that sends the `Accept: application/x-tar` header.

Some other suggestions for improvements to this code:

- Check `response.status_code` and return an error message if not OK.
- Use `tarfile.is_tarfile` to check that `response.content` is actually a tarball and return an error message if not.
- Remove `mode="r|gz"` or replace it with `mode="r"` or `mode="r:*"` for reading with transparent compression, so that the code works also with uncompressed tarballs (see #1111 and #1519), where the media type is still `application/x-tar`.
- To support zipfiles, use `headers={'Accept': 'application/zip'}` in the request and `zipfile.is_zipfile` to check the response content. You could use the Python `zipfile` module to unpack, but maybe easier to use `shutil.unpack_archive` for both tarballs and zipfiles.

Making these changes should not break the functionality with the current situation (where https://doi.org/10.17182/hepdata.89408.v1/r2 returns the tarball directly). I’d therefore recommend you make them ASAP before the next `pyhf` release. After we redirect the DOI to the landing page, probably in the next few weeks, the DOI will return the HTML landing page instead of the tarball unless the request contains the `Accept: application/x-tar` header.

Hey @mfenner - can you help here?
I think it should be possible to programmatically query the DOI and get the location of the underlying object, then fetch it.
Is this correct? Is there any code available that demonstrates this?
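
For anyone wanting to try the landing-page content negotiation option together with the suggested checks from this thread, here is a minimal sketch. It is not the actual `pyhf` implementation: the function names `download` and `extract_tarball` are hypothetical, the `headers={"Accept": "application/x-tar"}` argument is inferred from the `curl` example in the thread, and it assumes that the server behind the DOI honours that header. It also assumes Python 3.9 or later, where `tarfile.is_tarfile` accepts file-like objects.

```python
import io
import tarfile


def extract_tarball(content: bytes, output_directory: str) -> None:
    """Unpack tarball bytes, rejecting content that is not a tar archive."""
    buffer = io.BytesIO(content)
    # An HTML landing page returned instead of the archive is caught here.
    if not tarfile.is_tarfile(buffer):
        raise ValueError("The downloaded content is not a tarball.")
    buffer.seek(0)
    # mode="r:*" reads with transparent compression, so both compressed
    # (.tar.gz) and uncompressed (.tar) archives are handled.
    with tarfile.open(mode="r:*", fileobj=buffer) as archive:
        archive.extractall(output_directory)


def download(archive_url: str, output_directory: str) -> None:
    """Hypothetical helper: fetch a tarball via its URL or DOI and unpack it."""
    # Third-party dependency, imported here so that extract_tarball
    # stays usable without it.
    import requests

    # Assumption: the landing page returns the archive itself when the
    # Accept header matches the media type of the resource file.
    response = requests.get(archive_url, headers={"Accept": "application/x-tar"})
    if response.status_code != 200:
        raise RuntimeError(
            f"{archive_url} returned status code {response.status_code}"
        )
    extract_tarball(response.content, output_directory)
```

A call would look like `download("https://doi.org/10.17182/hepdata.89408.v1/r2", "workspace")`. Keeping the validation in a separate function means it can be exercised on in-memory bytes without touching the network.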