push: RAW file considered as text file (bad MD5)
Bug Report
Description
The RAW file has this md5sum:
fd0de1350b92b00d60afd53b015f6aea 214089_JAI.raw
But DVC calculates it as:
md5: 0b4d86bc06ee3260e8172b2196805382 size: 63232000 path: 214089_JAI.raw
This happens because DVC identifies the file as a text file and runs the dos2unix replacement on its contents before hashing: https://github.com/iterative/dvc/blob/1.11/dvc/utils/__init__.py#L39 -> https://github.com/iterative/dvc/blob/1.11/dvc/istextfile.py#L34
It still happens in version 2.4.3: https://github.com/iterative/dvc/blob/2.4.3/dvc/utils/__init__.py#L37 -> https://github.com/iterative/dvc/blob/2.4.3/dvc/istextfile.py#L22
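To illustrate the mismatch, here is a minimal Python sketch of the two ways of hashing the same bytes. It is not DVC's code: `md5_raw`, `md5_crlf_normalized`, and `looks_like_text` are hypothetical helpers, and the text check is a deliberately simplified stand-in (no NUL byte in the first 512 bytes) for the heuristic in the linked `istextfile.py`, which may differ.

```python
import hashlib


def md5_raw(data: bytes) -> str:
    """MD5 of the bytes exactly as stored, which is what md5sum reports."""
    return hashlib.md5(data).hexdigest()


def md5_crlf_normalized(data: bytes) -> str:
    """MD5 after a dos2unix-style replacement of CRLF with LF."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


def looks_like_text(data: bytes, block_size: int = 512) -> bool:
    """Simplified assumption: no NUL byte in the first block means text."""
    return b"\0" not in data[:block_size]


# Hypothetical binary payload: no NUL bytes, but it happens to contain
# the byte pair 0x0D 0x0A, which a text-mode hash rewrites to 0x0A.
payload = bytes([0x10, 0x0D, 0x0A, 0x7F, 0x0D, 0x0A, 0x42])

print("classified as text:", looks_like_text(payload))
print("raw md5:           ", md5_raw(payload))
print("normalized md5:    ", md5_crlf_normalized(payload))
```

Any binary file that passes the text check and contains a `0x0D 0x0A` pair ends up with a normalized digest that differs from the one `md5sum` prints, which matches the symptom reported here.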
When the file is uploaded through the gocloud.dev library, the upload fails the MD5 check, because the MD5 calculated by DVC and the real MD5 of the file are not the same: https://github.com/google/go-cloud/blob/v0.23.0/blob/blob.go#L328
Reproduce
- dvc init
- dvc remote modify --local our-proxy password 123123
- Copy 214089_JAI.raw to the directory
- dvc add 214089_JAI.raw
- dvc push
Expected
The file is expected to upload correctly. Instead, because the md5 of the file and the one sent by DVC do not match, the upload is canceled.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 1.11.16 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-65-generic-x86_64-with-glibc2.29
Supports: http, https, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Additional Information (if any): https://github.com/atekoa/dvc-rawfile

I use the DVC-calculated hash so that our proxy does not have to write the file out again, recalculate the hash, and then send the correct hash together with the file to the gocloud.dev library, which is ultimately responsible for uploading the file and verifying the md5. The gocloud.dev library requires me to send the md5 of the file so it can verify that it has written the data correctly, but I don’t have the correct md5 unless I generate it myself, right?
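If the proxy does end up generating the md5 itself, one possible approach is to hash the bytes as they stream through. The sketch below is in Python for illustration only and is not the proxy's actual code. `md5_of_stream` is a hypothetical helper, and whether the consumer expects the digest as raw bytes, hex, or base64 is an assumption to check against the upload API.

```python
import base64
import hashlib
import io


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the raw 16-byte MD5 digest of a binary stream.

    The stream is read in chunks, so the whole file never has to be
    held in memory, and the bytes are hashed unmodified.
    """
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.digest()


# Usage with an in-memory stream standing in for the uploaded file.
raw_digest = md5_of_stream(io.BytesIO(b"abc"))
print(raw_digest.hex())
print(base64.b64encode(raw_digest).decode("ascii"))
```

Because nothing is rewritten before hashing, the result matches what `md5sum` reports for the same bytes.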
Potentially related to: https://github.com/iterative/dvc/issues/4658