question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Why is cutadapt discarding some paired end reads?

See original GitHub issue

I asked this same question in StackExchange Bioinformatics (link here), but I am posting it since I am not seeing any reply.

Sorry for the very long post.

So problem is that cutadapt is removing a read (actually more than one) but I found something unexpected.

So, I am using cutadapt to remove two primers from some reads.

I have 2 files for each sample, i.e. sample_R1.fastq and sample_R2.fastq since the reads are paired end.

I have primer_1 and primer_2 and both of them can be found in sample_R1.fastq and sample_R2.fastq. Usually, I have half sample_R1.fastq reads starting with primer_1 and half starting with primer_2. Same for sample_R2.fastq, i.e. I can find both primers.

To properly trim all the primers what I do is the following, through R:

setOne <- paste("-g", primer_1, "-a", primer_1)
setTwo <- paste("-G", primer_2, "-A", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

The command works fine except for one thing. On the one hand cutadapt should remove those sequences which, after the trimming, will have length < 1, as specified by the command. However, some reads are removed even though they are longer than 1 and I would like to know why this happens, since I can’t find a specific reference mentioning that cutadapt removes sequences using different parameters rather than the length.

If I am getting this straight, cutadapt only removes the primer sequence from the beginning of the read, then removes the equivalent characters in the quality field of the read that was trimmed and, finally, it returns the trimmed read with the quality field adjusted to the new length. If the length is lower than the threshold, the read is discarded. If it is discarded in R1, it will be discarded in R2 and viceversa (since we are dealing with paired end reads and since I specified two inputs and two outputs in the command)

I give an example of read that is removed, this one from R1:

@some:sample:number 1:N:0:1
ACCGTCACGTCATCAGCAGCATCGACTCCTGTTTGATCCCCGTGCCTGTACTCAAGTCAACCAGTACACGCTTTTCAAGCTGCCTTCTGCCGTATTTCACACTCAGCGCATTTGCAATTGGTGTTCTTTGCACCACATATCTATCAGTTAAGCCACCGCTACTCCGCCTACCTCGTACTGACTTAATCGACCGCCTACGCACCCTTTAAACCCAATTCGAGACCGTCAATTCAATCATACTTTAGTAAGTG
+
BBBBBBBBBBBBGGGGHEGHFHA@EGGGHHHHHHHHHGHHHHHHHHGGGGGHGHGHGGGHHHHHGDGGGFHGGGGHGGGGGGHGHGFHHHHHHHHHHGGGHHHGHGHHHHHHHGFFGGGGGGGGGGGGGGGGGFFBDFDFFFFFFFBFFFFFFFFFHGHGGHHHHHEGGGGGHHHHGHHHHHGHHHHHHHHHHHHGGGHGHHGGGGEFGGHHHHHHHHHHHHHGHHHHHHHHHHHFHFGHHHHGGGGGHHG

this one from R2:

@some:sample:number 2:N:0:1
CTGCTGCTATCACTAGCTAAAGCTTTAACGTAGGGCAATTGACGAAGGTGAAATGCATAGATATGTGATGACGGTGTAGCGTTGGCGCTGGGGTTTAAAGGGTGCGTAGTGACTTCACAGAACACCAATTGGCAGCTTACTAAAGCATGAGTACAGACGAGGTAGGCGGAATTGCGGTCGATTAAGTCAGTTGTGAAATACGGCAGCTTAACTGTTGGCAAGCGTTGTCCGGCACTGGTTTTATTCGAGGT
+
AAAAA@BFFFFFGFGCGGFDDC/CDHF2FGFHHG//<<CCC1FGHH1<1>11C<FF/0BFFFFFFF/:DEEEFFHGG0AEG/B9B./-.@9GHGEFC>EE<FHEHEGHDHGGFHGGBGGGHGGDGH0CDDGFFHHCC--.:CGCFG??GGAEGGHFHHBFFFEFB./B..;/;:/999=AAA../BB/;//////9/EFFCBDAA..990ADD=99/;/BFGGGGGGGGGFHHHHHHGHDGHHFFDHHHHG

and these are the primers that I am removing:

  • ACCGTCACGTCATCAGCAGCATCGAC
  • CTGCTGCTATCACTAGCTAAAGCTTTA

matching the reads I provided. Clearly, if I trim them manually, the reads length is way greater than 1 so, why are these reads removed from the final cutadapt output file?


EDIT to the StackExchange Bioinformatics question:

I have the feeling that the command I posted above is not the one I should have used in the first place.

What I am doing now is to perform the trimming in two steps:

# here we remove primer_1 from R1 and primer_2 from R2
setOne <- paste("-g", primer_1)
setTwo <- paste("-G", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

Then, using the output of the above command, I am simply switching the primers and then let cutadapt remove them:

# here we remove primer_2 from R1 and primer_1 from R2
setOne <- paste("-g", primer_2)
setTwo <- paste("-G", primer_1)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

If I follow such pipeline, I am able to keep the above mentioned read, i.e. it is not discarded anymore and I guess this is the right approach to solve my problem, i.e. dealing with two primers that can be both found in R1 and R2.

I just wanted to make sure I am getting this right, since I am quite new to these kind of analyses.

Also, I am wondering what happens when cutadapt searches for a primer, like in my first command, the reverse complement of primer_1 (or primer_2) but it can’t find a match. Is cutadapt discarding reads if a match is not found? I guess it is: the read that is discarded, as mentioned above, in the example I provided contains primer_2 but, if I run cutadapt in its first (and I believe wrong) version then primer_2 is not found in sample_R1.fastq. Indeed, cutadapt is searching for primer_1 and reverseComplement(primer_2) in sample_R1.fastq. But, cutadapt can’t find any of them (the read in question contains primer_2 but belongs to sample_R1.fastq). At this point I have the feeling cutadapt is removing the reads since it does not match anything. Is this what is going on?

But if this what is going on…why does it not happen the same way for the two other commands, i.e. the two-step analysis?

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:5 (3 by maintainers)

github_iconTop GitHub Comments

1reaction
marcelmcommented, Sep 24, 2021

Great, it appears you figured it out now 😃 (I cannot judge whether your script is correct, but it looks fine.)

The first library ever from which I needed to remove primers, which actually prompted me to start work on Cutadapt, was also of quite low quality. The 5’ adapters were found at the 3’ end, there were multiple consecutive adapter occurrences (primer dimers) and stuff like that. So I can relate a bit.

Thank you for getting back to me, and good luck with your project.

0reactions
gabrieletcommented, Sep 24, 2021

There we go. I finally got an idea about how the whole library was built and as far as I understood it should be the way you described it. The problem is that I am basically finding primers everywhere. All the sequences have their primer at the beginning, either forward or reverse. But, it is also possible to find a primer, either in its forward and/or reverseComplement version, inside some of the sequences.

What I did was to run several steps of cutatapt, in order to get rid of all of them since, as you mentioned, cutadapt removes only the best matching one. So I needed to run a first round to remove the expected primers from the beginning of the sequences. Then I removed primers in their original form from the inside of the sequences by doing the following:

# step two: remove remaining forward or reverse primers
for (smpl in c(1:nrow(infoSample))) {

	print(paste0("STEP TWO - processing sample ", smpl))

	# get samples name from step one
	sNameROne <- paste0(cutaOne, infoSample$sampleID[smpl], "_CD18_L000_R1_001.fastq")
	sNameRTwo <- paste0(cutaOne, infoSample$sampleID[smpl], "_CD18_L000_R2_001.fastq")

	# create filenames for the cutadapt-step-two outputs
	fwdCut <- paste0(cutaTwo, paste0(infoSample$sampleID[smpl], "_CD18_L000_R1_001.fastq"))
	revCut <- paste0(cutaTwo, paste0(infoSample$sampleID[smpl], "_CD18_L000_R2_001.fastq"))
	
	# remove remaining forward or reverse primers
	system2(cutadapt, args = c("-g", "ACGTAGTGTGCCAGCMGCCGCGGTAA", "-g", "TATGGTGTGCCAGCMGCCGCGGTAA", "-g", "ACACGGTGTGCCAGCMGCCGCGGTAA", "-g", "GTAGGTGTGCCAGCMGCCGCGGTAA", "-g", "CTGATAGTGTGCCAGCMGCCGCGGTAA", "-g", "CGATACGTGTGCCAGCMGCCGCGGTAA", "-g", "TGCTGTGTGCCAGCMGCCGCGGTAA", "-g", "CGCAGAGTGTGCCAGCMGCCGCGGTAA", "-g", "GACGCAGTGTGCCAGCMGCCGCGGTAA", "-g", "AGATGTGTGCCAGCMGCCGCGGTAA", "-g", "ACTACAGTGTGCCAGCMGCCGCGGTAA", "-g", "GATCTGTGTGCCAGCMGCCGCGGTAA", "-g", "ACGTACCGGACTACHVGGGTWTCTAAT", "-g", "TATGCCGGACTACHVGGGTWTCTAAT", "-g", "ACACGCCGGACTACHVGGGTWTCTAAT", "-g", "GTAGCCGGACTACHVGGGTWTCTAAT", "-g", "CTGATACCGGACTACHVGGGTWTCTAAT", "-g", "TGCTCCGGACTACHVGGGTWTCTAAT", "-g", "CGCAGACCGGACTACHVGGGTWTCTAAT", "-g", "CGATACCCGGACTACHVGGGTWTCTAAT", "-g", "GACGCACCGGACTACHVGGGTWTCTAAT", "-g", "AGATCCGGACTACHVGGGTWTCTAAT", "-g", "ACTACACCGGACTACHVGGGTWTCTAAT", "-g", "GATCTCCGGACTACHVGGGTWTCTAAT", "-G", "ACGTAGTGTGCCAGCMGCCGCGGTAA", "-G", "TATGGTGTGCCAGCMGCCGCGGTAA", "-G", "ACACGGTGTGCCAGCMGCCGCGGTAA", "-G", "GTAGGTGTGCCAGCMGCCGCGGTAA", "-G", "CTGATAGTGTGCCAGCMGCCGCGGTAA", "-G", "CGATACGTGTGCCAGCMGCCGCGGTAA", "-G", "TGCTGTGTGCCAGCMGCCGCGGTAA", "-G", "CGCAGAGTGTGCCAGCMGCCGCGGTAA", "-G", "GACGCAGTGTGCCAGCMGCCGCGGTAA", "-G", "AGATGTGTGCCAGCMGCCGCGGTAA", "-G", "ACTACAGTGTGCCAGCMGCCGCGGTAA", "-G", "GATCTGTGTGCCAGCMGCCGCGGTAA", "-G", "ACGTACCGGACTACHVGGGTWTCTAAT", "-G", "TATGCCGGACTACHVGGGTWTCTAAT", "-G", "ACACGCCGGACTACHVGGGTWTCTAAT", "-G", "GTAGCCGGACTACHVGGGTWTCTAAT", "-G", "CTGATACCGGACTACHVGGGTWTCTAAT", "-G", "TGCTCCGGACTACHVGGGTWTCTAAT", "-G", "CGCAGACCGGACTACHVGGGTWTCTAAT", "-G", "CGATACCCGGACTACHVGGGTWTCTAAT", "-G", "GACGCACCGGACTACHVGGGTWTCTAAT", "-G", "AGATCCGGACTACHVGGGTWTCTAAT", "-G", "ACTACACCGGACTACHVGGGTWTCTAAT", "-G", "GATCTCCGGACTACHVGGGTWTCTAAT", "-o", fwdCut, "-p", revCut, sNameROne, sNameRTwo, "--minimum-length", 200, "--quiet"))
}

Finally, I did the same but using the reverse complement of the sequences above, to get rid of reverse complement primers that I also found inside the sequences.

I guess these problems are a consequence of the fact that the amplification was done many years ago and probably the technique worked quite poorly.

Anyway, thank you for your suggestions!

Read more comments on GitHub >

github_iconTop Results From Across the Web

Why is cutadapt discarding some paired end reads?
If the length is lower than the threshold, the read is discarded. If it is discarded in R1, it will be discarded in...
Read more >
User guide — Cutadapt 4.2 documentation - Read the Docs
Cutadapt searches for the adapter in all reads and removes it when it finds it. Unless you use a filtering option, all reads...
Read more >
Discarded results produced by q2-cutadapt's trim-paired when ...
Essentially, we demux the mixed-orientation reads in two rounds. In the first round we treat R1 and R2 as forward and reverse reads, ......
Read more >
Trimming both ends of paired-end reads · Issue #237 - GitHub
The only thing that remains is your last requirement: You want the entire pair to be discarded if one of the reads in...
Read more >
bio-utils/cutadapt - BioLib
Default: discard reads --untrimmed-output FILE Write reads that do not contain any adapter to FILE. Default: output to same file as trimmed reads...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found