Reducing mgf file size
Hi Niels,
I was wondering why the output mgf files are so big compared to the raw files. If I convert the raw file from #4, which is 2.8 GB, I get an mgf output file of 13 GB. I have checked that most of the rows correspond to peaks with intensity equal to 0.0000000000, and removing them with grep
grep -v 0.0000000000 PD7505-GDTHP1-A_C2.mgf > PD7505-GDTHP1-A_C2_non_zero.mgf
leaves a 3.2 GB file.
There are still many peaks with very low intensities, like 0.0000000277, that I am guessing are really noise and thus make the file unnecessarily big. I tried running msconvert on it
msconvert PD7505-GDTHP1-A_C2_non_zero.mgf --filter "peakPicking true [2,3] zeroSamples removeExtra" --mgf -o msconvert_output
but the file size remains the same.
If I further trim the mgf file with awk to keep only intensities > 0.1, printing lines that start with a capital letter (to keep the spectrum headers) or whose second field is > 0.1, I get a 1.7 GB file.
awk '{ if ($2 > 0.1 || $1 ~ /^[A-Z]/) {print} }' msconvert_output/PD7505-GDTHP1-A_C2_non_zero.mgf > trimmed.mgf
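The grep and awk steps above could also be done in a single pass. Here is a minimal Python sketch, assuming that peak lines are whitespace-separated "m/z intensity" pairs and that every other line (BEGIN IONS, TITLE=..., and so on) should be kept. The function name `filter_mgf_lines` and the 0.1 default threshold are illustrative only and not part of any tool mentioned here. Because it compares intensities numerically, it does not depend on how the zeros are printed.

```python
from typing import Iterable, Iterator


def filter_mgf_lines(lines: Iterable[str], min_intensity: float = 0.1) -> Iterator[str]:
    """Yield MGF lines, dropping peak lines whose intensity is <= min_intensity.

    Assumption: a peak line starts with two numeric fields (m/z, intensity);
    anything else is treated as a header or delimiter and passed through.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Single-token headers such as "CHARGE=2+" and blank lines.
            yield line
            continue
        try:
            float(fields[0])
            intensity = float(fields[1])
        except ValueError:
            # Not a peak line (e.g. "BEGIN IONS" or "TITLE=scan 1"): keep it.
            yield line
            continue
        if intensity > min_intensity:
            yield line
```

Since the function is a generator, passing an open input file to it and writing the result with `writelines` on an open output file streams the data line by line, so the 13 GB file never has to fit in memory.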
While I achieved a lot of file size reduction, this is still much bigger than what one gets by running the msconvertGUI on Windows with the same raw file. If I run the program with filters:
| Filter | Parameter |
|---|---|
| peakPicking | vendor msLevel=1- |
| zeroSamples | removeExtra 1- |
I get a 297 MB file.
From ThermoRawFileParser’s README.md:
"It takes a thermo RAW file as input and outputs a metadata file and the MS2 spectra (centroided) in MGF format."
Could you please provide more information about how the tool performs the centroiding of the spectra, and how exactly to emulate the output one would get by running msconvert on Windows? Ideally, mgf files with sizes similar to those produced by standard msconvert calls should be easy to produce, either using ThermoRawFileParser alone or in combination with msconvert filtering.
Thank you very much for your help!
Cheers
Antonio

Hi Antonio,
I agree it is an important issue to get these types of tools working on Linux, but I think there is some confusion here. Both ThermoRawFileParser and RawQuant do various processing tasks and MGF conversion of raw files on Windows/Mac/Linux without issue. The problem being encountered with both our tool and ThermoRawFileParser is that your MS2 spectra are acquired in an ion trap and are in ‘profile’ format. In this situation, both of these tools will spit out all of the profile data (which contains many zeroes due to the nature of profile data). The reason you are getting smaller files with msconvert is that you have enabled peakPicking. This will centroid the data and reduce file size significantly. I imagine that if you processed your file through msconvert with no filter parameters specified, the result would look the same as it does for ThermoRawFileParser.
The problem here is not a file conversion issue; it is just that neither ThermoRawFileParser nor RawQuant has a centroiding algorithm built into its code. For Orbitrap spectra this doesn’t matter because centroid and profile data are both stored in the raw file by default, but for ion trap spectra only the specified type is stored.
I can only speak for our tool, but we have discussed adding a centroiding algorithm for ion trap data. However, it is near the bottom of our priority list because this is a very uncommon data type (most ion trap data is acquired in centroid mode).
Re: your command above - there is no need to filter low abundance ions if you have enabled peakPicking. Centroiding will eliminate a lot of the redundant data and noise in your file simply because of how it works.
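To illustrate why centroiding shrinks profile data so much, here is a deliberately naive Python sketch. It is not the algorithm used by msconvert or by the Thermo library; it simply assumes that zero-intensity points separate profile peaks, and it collapses each run of non-zero points into one (intensity-weighted m/z, summed intensity) pair. The name `centroid_profile` is hypothetical.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def centroid_profile(peaks: Sequence[Peak]) -> List[Peak]:
    """Collapse each run of consecutive non-zero points into one centroid.

    Simplifying assumption: zero-intensity points separate profile peaks.
    Real peak pickers also split overlapping peaks and handle noise.
    """
    centroids: List[Peak] = []
    run: List[Peak] = []

    def flush() -> None:
        # Emit one centroid for the current run, if any, and reset it.
        if run:
            total = sum(i for _, i in run)
            # Intensity-weighted mean m/z; summed intensity as a rough peak area.
            mz = sum(m * i for m, i in run) / total
            centroids.append((mz, total))
            run.clear()

    for mz, intensity in peaks:
        if intensity > 0.0:
            run.append((mz, intensity))
        else:
            flush()
    flush()  # Handle a run that reaches the end of the spectrum.
    return centroids
```

With this toy approach, a profile peak sampled at dozens of points plus its flanking zeros becomes a single line in the output, which is the kind of reduction seen between the profile and peak-picked files.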
Cheers, Chris
Hi Antonio, Chris,
Thanks again for your input, it’s very useful to me. Indeed, the library doesn’t do any centroiding on profile data itself; it only passes on the centroiding done by the instrument and provided by the Thermo library. I’m playing around with msconvert to see if their peak picking makes a difference in the file size. It’s not my intention to remake msconvert though 😉
I’ve encountered 3 types of RAW files during testing
From the Thermo documentation:
So this is what I can do:
Thoughts and suggestions are more than welcome.
Best regards,
Niels