Reducing mgf file size

Hi Niels

I was wondering why the output mgf files are so big compared to the raw files. If I convert the raw file from #4, which is 2.8 GB, I get a 13 GB mgf output file. I have checked that most of the rows correspond to peaks with intensity equal to 0.0000000000, and removing them with grep -v 0.0000000000 PD7505-GDTHP1-A_C2.mgf > PD7505-GDTHP1-A_C2_non_zero.mgf leaves a 3.2 GB file. There are still many peaks with very low intensities like 0.0000000277 that I am guessing really are noise and thus make the file unnecessarily big. I tried running msconvert on it:

msconvert PD7505-GDTHP1-A_C2_non_zero.mgf --filter "peakPicking true [2,3] zeroSamples removeExtra" --mgf -o msconvert_output

but the file size remains the same.

If I further trim the mgf file to keep intensities > 0.1 with awk by printing lines starting with a capital letter (to keep spectra headers) or where the second field is > 0.1, I get a 1.7 GB file.

awk '{ if ($2 > 0.1 || $1 ~ /^[A-Z]/) {print} }' msconvert_output/PD7505-GDTHP1-A_C2_non_zero.mgf > trimmed.mgf
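For readers who prefer a script to the awk one-liner, here is a minimal Python sketch of the same absolute-intensity filter. The function name and the 0.1 default are illustrative assumptions and not part of ThermoRawFileParser or msconvert. It treats a line as a peak only if its first two fields both parse as numbers, so header lines such as PEPMASS=... are kept whatever letter they start with.

```python
def filter_mgf_by_intensity(lines, min_intensity=0.1):
    """Yield MGF lines, dropping peak lines with intensity <= min_intensity.

    Hypothetical helper for illustration only; a line counts as a peak
    when its first two whitespace-separated fields parse as floats.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank lines and single-token headers (e.g. CHARGE=2+) pass through.
            yield line
            continue
        try:
            float(fields[0])              # m/z
            intensity = float(fields[1])  # intensity
        except ValueError:
            # Not numeric: a header such as "BEGIN IONS" or "TITLE=scan 1".
            yield line
            continue
        if intensity > min_intensity:
            yield line
```

To stream a large file without loading it into memory, pass an open file handle as `lines` and write each yielded line to the output file.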

While I achieved a lot of file size reduction, this is still much bigger than what one gets by running the msconvertGUI on Windows with the same raw file. If I run the program with filters:

Filter        Parameter
peakPicking   vendor msLevel=1-
zeroSamples   removeExtra 1-

I get a 297 MB file.

From ThermoRawFileParser’s README.md

It takes a thermo RAW file as input and outputs a metadata file and the MS2 spectra (centroided) in MGF format.

Could you please provide more information about how the tool performs the centroiding of the spectra and how to exactly emulate the output one would get by running msconvert on Windows? Ideally, mgf files with sizes similar to those produced by standard msconvert calls should be easy to produce, either using ThermoRawFileParser alone or in combination with msconvert filtering.

Thank you very much for your help!

Cheers

Antonio

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:10 (3 by maintainers)

Top GitHub Comments

2 reactions
chrishuges commented, Jun 13, 2018

Hi Antonio,

I agree it is an important issue to get these types of tools working on Linux, but I think there is some confusion here. Both ThermoRawFileParser and RawQuant do various processing tasks and MGF conversion of raw files on Windows/Mac/Linux without issue. The problem being encountered with both our tool and ThermoRawFileParser is that your MS2 spectra are acquired in an ion trap and are in ‘profile’ format. In this situation, both of these tools will spit out all of the profile data (that contains many zeroes due to the nature of profile data). The reason you are getting smaller files with msconvert is because you have enabled peakPicking…this will centroid the data and reduce file size significantly. I imagine if you processed your file through msconvert with no filter parameters specified the result would look the same as it does for ThermoRawFileParser.

The problem here is not a file conversion issue; it is just that neither ThermoRawFileParser nor RawQuant has a centroiding algorithm built into its code. For Orbitrap spectra this doesn’t matter because centroid and profile data are stored in the raw file by default, but for ion trap spectra only the specified type is stored.

I can only speak for our tool, but we have discussed adding a centroiding algorithm for ion trap data. However, it is near the bottom of our priority list because this is a very uncommon data type (most ion trap data is acquired in centroid mode).

Re: your command above - there is no need to filter low abundance ions if you have enabled peakPicking. Centroiding will eliminate a lot of the redundant data and noise in your file just due to the nature of the way it works.
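To illustrate why centroiding shrinks the file so much, here is a deliberately naive Python sketch. It is not the algorithm used by msconvert, the Thermo vendor library, or either tool discussed here, and the function names are assumptions. It collapses each run of consecutive non-zero profile points into a single peak at the intensity-weighted mean m/z, so the zero-intensity points and the many points per peak disappear.

```python
def _collapse(run_mz, run_int):
    """Reduce one run of profile points to (weighted mean m/z, summed intensity)."""
    total = sum(run_int)
    centroid = sum(m * i for m, i in zip(run_mz, run_int)) / total
    return centroid, total


def centroid_profile(mz, intensity):
    """Naive centroiding: one peak per run of consecutive non-zero points.

    Illustrative only. Overlapping peaks that never return to zero are
    merged into one, which a real peak picker would avoid.
    """
    peaks = []
    run_mz, run_int = [], []
    for m, i in zip(mz, intensity):
        if i > 0:
            run_mz.append(m)
            run_int.append(i)
        elif run_int:
            # A zero point closes the current run.
            peaks.append(_collapse(run_mz, run_int))
            run_mz, run_int = [], []
    if run_int:
        # Close a run that extends to the end of the scan.
        peaks.append(_collapse(run_mz, run_int))
    return peaks
```

In this sketch a scan with eight profile points covering two peaks is reduced to two lines in the MGF, which is the kind of reduction peakPicking provides.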

Cheers, Chris

1 reaction
nielshulstaert commented, Jun 6, 2018

Hi Antonio, Chris,

Thanks again for your input, it’s very useful to me. Indeed the library doesn’t do any centroiding on profile data, only the centroiding done by the instrument and provided by the Thermo library. I’m playing around with msconvert to see if their peak picking makes a difference in the file size. It’s not my intention to remake msconvert though 😉

I’ve encountered 3 types of RAW files during testing

  • The "normal" use case: MS2 scans are in centroid mode and with a centroid stream from the Thermo library.
  • The raw file from issue 4 (20070920_CL_Orbi2_XIC_Hela60_Ecoli10_Offgel_green_Frac4_02.RAW): MS2 scans are in centroid mode but without a centroid stream from the Thermo library. I look for the segmented data for each scan in that case. The MGF file size is acceptable.
  • And the one that Antonio has sent me: MS2 scans are in profile mode, so I assume a centroid stream is not available by default, and writing the segmented data to MGF results in a huge file. It’s basically converting the binary raw file to MGF 😉

From the Thermo documentation:

Segments are useful for high resolution SIM or SRM data, as it is possible to see the peak groups (centroid) or profiles within each mass window, using the common core plotting tools. An application could also choose to do peak integration based on “the data in a particular segment”.

So this is what I can do:

  • Add a filter flag for filtering out low intensity peaks with a value provided by the user. If no value is given, no filtering occurs. I think I would implement a filter that removes peaks that are lower than x percent of the highest peak in the scan. Setting a cutoff across all scans doesn’t make much sense IMHO.
  • Add a flag to include MS2 profile scans or not. Or I just don’t include these and output a message that no centroid MS2 scans were found. I had a quick chat with a mass spec operator and she told me that the MS2 profile mode can be used for MS2 level quant, so trying to centroid that data is probably defeating the purpose of the profile.
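The first proposal, a per-scan relative intensity cutoff, could look roughly like the Python sketch below. The function name and signature are hypothetical and this is not the ThermoRawFileParser implementation. The cutoff is expressed as a percentage of the highest peak in the scan, and passing None disables filtering.

```python
def filter_relative(peaks, percent=None):
    """Keep peaks at or above `percent` % of the scan's most intense peak.

    `peaks` is a list of (mz, intensity) tuples for one scan.
    Hypothetical helper; percent=None means no filtering.
    """
    if percent is None or not peaks:
        return list(peaks)
    base_peak = max(intensity for _, intensity in peaks)
    cutoff = base_peak * percent / 100.0
    return [(mz, intensity) for mz, intensity in peaks if intensity >= cutoff]
```

Because the cutoff is recomputed for every scan, a weak scan keeps its own strongest peaks instead of being wiped out by a threshold tuned for intense scans.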

Thoughts and suggestions are more than welcome.

Best regards,

Niels
