Why is cutadapt discarding some paired end reads?
See original GitHub issueI asked this same question in StackExchange Bioinformatics (link here), but I am posting it since I am not seeing any reply.
Sorry for the very long post.
So problem is that cutadapt
is removing a read (actually more than one) but I found something unexpected.
So, I am using cutadapt
to remove two primers from some reads.
I have 2 files for each sample, i.e. sample_R1.fastq and sample_R2.fastq since the reads are paired end.
I have primer_1 and primer_2 and both of them can be found in sample_R1.fastq and sample_R2.fastq. Usually, I have half sample_R1.fastq reads starting with primer_1 and half starting with primer_2. Same for sample_R2.fastq, i.e. I can find both primers.
To properly trim all the primers what I do is the following, through R:
setOne <- paste("-g", primer_1, "-a", primer_1)
setTwo <- paste("-G", primer_2, "-A", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2,
"-o", outOne, "-p", outTwo,
inputOne, inputTwo, "--minimum-length", 1))
The command works fine except for one thing. On the one hand cutadapt
should remove those sequences which, after the trimming, will have length < 1
, as specified by the command. However, some reads are removed even though they are longer than 1 and I would like to know why this happens, since I can’t find a specific reference mentioning that cutadapt
removes sequences using different parameters rather than the length.
If I am getting this straight, cutadapt
only removes the primer sequence from the beginning of the read, then removes the equivalent characters in the quality field of the read that was trimmed and, finally, it returns the trimmed read with the quality field adjusted to the new length. If the length is lower than the threshold, the read is discarded. If it is discarded in R1, it will be discarded in R2 and viceversa (since we are dealing with paired end reads and since I specified two inputs and two outputs in the command)
I give an example of read that is removed, this one from R1:
@some:sample:number 1:N:0:1
ACCGTCACGTCATCAGCAGCATCGACTCCTGTTTGATCCCCGTGCCTGTACTCAAGTCAACCAGTACACGCTTTTCAAGCTGCCTTCTGCCGTATTTCACACTCAGCGCATTTGCAATTGGTGTTCTTTGCACCACATATCTATCAGTTAAGCCACCGCTACTCCGCCTACCTCGTACTGACTTAATCGACCGCCTACGCACCCTTTAAACCCAATTCGAGACCGTCAATTCAATCATACTTTAGTAAGTG
+
BBBBBBBBBBBBGGGGHEGHFHA@EGGGHHHHHHHHHGHHHHHHHHGGGGGHGHGHGGGHHHHHGDGGGFHGGGGHGGGGGGHGHGFHHHHHHHHHHGGGHHHGHGHHHHHHHGFFGGGGGGGGGGGGGGGGGFFBDFDFFFFFFFBFFFFFFFFFHGHGGHHHHHEGGGGGHHHHGHHHHHGHHHHHHHHHHHHGGGHGHHGGGGEFGGHHHHHHHHHHHHHGHHHHHHHHHHHFHFGHHHHGGGGGHHG
this one from R2:
@some:sample:number 2:N:0:1
CTGCTGCTATCACTAGCTAAAGCTTTAACGTAGGGCAATTGACGAAGGTGAAATGCATAGATATGTGATGACGGTGTAGCGTTGGCGCTGGGGTTTAAAGGGTGCGTAGTGACTTCACAGAACACCAATTGGCAGCTTACTAAAGCATGAGTACAGACGAGGTAGGCGGAATTGCGGTCGATTAAGTCAGTTGTGAAATACGGCAGCTTAACTGTTGGCAAGCGTTGTCCGGCACTGGTTTTATTCGAGGT
+
AAAAA@BFFFFFGFGCGGFDDC/CDHF2FGFHHG//<<CCC1FGHH1<1>11C<FF/0BFFFFFFF/:DEEEFFHGG0AEG/B9B./-.@9GHGEFC>EE<FHEHEGHDHGGFHGGBGGGHGGDGH0CDDGFFHHCC--.:CGCFG??GGAEGGHFHHBFFFEFB./B..;/;:/999=AAA../BB/;//////9/EFFCBDAA..990ADD=99/;/BFGGGGGGGGGFHHHHHHGHDGHHFFDHHHHG
and these are the primers that I am removing:
- ACCGTCACGTCATCAGCAGCATCGAC
- CTGCTGCTATCACTAGCTAAAGCTTTA
matching the reads I provided. Clearly, if I trim them manually, the reads length is way greater than 1 so, why are these reads removed from the final cutadapt output file?
EDIT to the StackExchange Bioinformatics question:
I have the feeling that the command I posted above is not the one I should have used in the first place.
What I am doing now is to perform the trimming in two steps:
# here we remove primer_1 from R1 and primer_2 from R2
setOne <- paste("-g", primer_1)
setTwo <- paste("-G", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2,
"-o", outOne, "-p", outTwo,
inputOne, inputTwo, "--minimum-length", 1))
Then, using the output of the above command, I am simply switching the primers and then let cutadapt
remove them:
# here we remove primer_2 from R1 and primer_1 from R2
setOne <- paste("-g", primer_2)
setTwo <- paste("-G", primer_1)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2,
"-o", outOne, "-p", outTwo,
inputOne, inputTwo, "--minimum-length", 1))
If I follow such pipeline, I am able to keep the above mentioned read, i.e. it is not discarded anymore and I guess this is the right approach to solve my problem, i.e. dealing with two primers that can be both found in R1 and R2.
I just wanted to make sure I am getting this right, since I am quite new to these kind of analyses.
Also, I am wondering what happens when cutadapt
searches for a primer, like in my first command, the reverse complement of primer_1 (or primer_2) but it can’t find a match. Is cutadapt
discarding reads if a match is not found?
I guess it is: the read that is discarded, as mentioned above, in the example I provided contains primer_2 but, if I run cutadapt
in its first (and I believe wrong) version then primer_2 is not found in sample_R1.fastq. Indeed, cutadapt
is searching for primer_1 and reverseComplement(primer_2)
in sample_R1.fastq. But, cutadapt
can’t find any of them (the read in question contains primer_2 but belongs to sample_R1.fastq). At this point I have the feeling cutadapt
is removing the reads since it does not match anything. Is this what is going on?
But if this what is going on…why does it not happen the same way for the two other commands, i.e. the two-step analysis?
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Great, it appears you figured it out now 😃 (I cannot judge whether your script is correct, but it looks fine.)
The first library ever from which I needed to remove primers, which actually prompted me to start work on Cutadapt, was also of quite low quality. The 5’ adapters were found at the 3’ end, there were multiple consecutive adapter occurrences (primer dimers) and stuff like that. So I can relate a bit.
Thank you for getting back to me, and good luck with your project.
There we go. I finally got an idea about how the whole library was built and as far as I understood it should be the way you described it. The problem is that I am basically finding primers everywhere. All the sequences have their primer at the beginning, either forward or reverse. But, it is also possible to find a primer, either in its forward and/or reverseComplement version, inside some of the sequences.
What I did was to run several steps of cutatapt, in order to get rid of all of them since, as you mentioned, cutadapt removes only the best matching one. So I needed to run a first round to remove the expected primers from the beginning of the sequences. Then I removed primers in their original form from the inside of the sequences by doing the following:
Finally, I did the same but using the reverse complement of the sequences above, to get rid of reverse complement primers that I also found inside the sequences.
I guess these problems are a consequence of the fact that the amplification was done many years ago and probably the technique worked quite poorly.
Anyway, thank you for your suggestions!