
Flexible read name modifications


A new command-line option that accepts a template for changing the read name would be useful.

The idea is to have a more general version of the -x and -y options, which can currently only add prefixes or suffixes to read names and only work well for single-end reads.

I’m using the following terms: The FASTQ header line consists of a read id and is optionally followed by a separator (whitespace) and a read comment. The two constituent reads of a paired-end read are R1 and R2.
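To make the terminology concrete, here is a minimal Python sketch of splitting a header line into read id, separator and read comment. The function name `split_header` is hypothetical and this is not Cutadapt's own parsing code; it only illustrates the definitions above, assuming the separator is the first space or tab and that the leading `@` has already been removed.

```python
from typing import Tuple


def split_header(header: str) -> Tuple[str, str, str]:
    """Split a FASTQ header (without the leading '@') into (id, sep, comment).

    Hypothetical helper: the first space or tab is treated as the separator.
    If there is no separator, sep and comment are empty strings.
    """
    for i, ch in enumerate(header):
        if ch in " \t":
            return header[:i], ch, header[i + 1:]
    return header, "", ""


read_id, sep, comment = split_header("read1 sample=A lane=3")
print(read_id)   # the read id
print(comment)   # the read comment
```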

In paired-end FASTQ files, the read ids of R1 and R2 need to be identical. Cutadapt enforces this when reading paired-end FASTQ files, except that it allows a single trailing “1” or “2” as the only difference between the read ids. This allows for read ids ending in /1 and /2 (some old formats are like this) or .1 and .2 (fastq-dump produces this).
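The matching rule described above could be sketched roughly as follows. This is an illustration of the rule as worded here, not Cutadapt's actual implementation, and the name `ids_match` is an assumption.

```python
def ids_match(id1: str, id2: str) -> bool:
    """Return True if two read ids are considered to belong to the same pair.

    Sketch of the rule described in the text: the ids must be identical,
    except that a single trailing "1" or "2" may differ.
    """
    if id1 == id2:
        return True
    if not id1 or len(id1) != len(id2):
        return False
    # Everything except the last character must agree, and the last
    # character of each id must be "1" or "2".
    return id1[:-1] == id2[:-1] and id1[-1] in "12" and id2[-1] in "12"


print(ids_match("read7/1", "read7/2"))
print(ids_match("SRR1.1", "SRR1.2"))
print(ids_match("readA", "readB"))
```

Any renaming scheme would have to produce read ids that still pass a check of this kind for both reads of a pair.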

Some requirements

If read names are modified by Cutadapt, the read ids should still match. (That is, Cutadapt itself would not complain when reading its own output.)
It should be possible to move information from R1 to R2. That is, when removing a UMI from R1, it should be possible to have it in both headers.

Suggested template variables

Single-end

{id} – The part of the header before the whitespace separator
{comment} – The part of the header after the whitespace separator
{length} – Read length
{removed_prefix} or {cut} – The bases removed with --cut
{name} – This is already supported by the -x/-y options and is the name of the last matching adapter, or the string no_adapter if there was no match. Even though it should probably have been called {adapter_name}, this should continue to work for compatibility.
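As a rough illustration of how such a template could be expanded for a single-end read, the sketch below uses Python's built-in `str.format`. The function `render_name` and its parameters are hypothetical and only mirror the variables proposed above; the header is split at the first space for simplicity.

```python
from typing import Optional


def render_name(
    template: str,
    header: str,
    sequence: str,
    cut: str = "",
    adapter_name: Optional[str] = None,
) -> str:
    """Expand a read-name template for one single-end read (sketch only)."""
    # Assumption: a single space separates the read id from the comment.
    read_id, _, comment = header.partition(" ")
    return template.format(
        id=read_id,
        comment=comment,
        length=len(sequence),
        cut=cut,
        removed_prefix=cut,  # proposed alias for {cut}
        # {name} falls back to "no_adapter" when nothing matched
        name=adapter_name if adapter_name is not None else "no_adapter",
    )


print(render_name("{id}_{cut} {comment} len={length}", "read1 sample=A", "ACGTACGT", cut="AACC"))
```

Moving a UMI removed from the start of the read into the read id would then be a matter of using a template such as `{id}_{cut} {comment}`.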

The following variables would have lower priority in my opinion and would not be implemented at first.

{sep} – The separator between id and comment (using a space should be fine, tabs are rare anyway)
{header} – The full original header ({id} {comment} should work just as well most of the time)
{match} – Complete information about the match. This would be all the information present in the Match object that is available internally anyway. The information includes: Start and end position within the adapter, start and end position within the sequence, number of matches, number of errors, name of the matched adapter.
{matched_sequence} – If there was an adapter match, this would be the matched part of the sequence. Alternatively, this could be spelled as {match.sequence}
{matched_adapter} – Matched part of the adapter. Could be spelled as {match.adapter_sequence}.
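The dotted spelling such as {match.sequence} would map naturally onto Python's format-string syntax, which supports attribute access inside replacement fields. The sketch below uses a stand-in `Match` dataclass with hypothetical field names; it is not Cutadapt's internal Match class.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Stand-in for the internal match object (field names are assumptions)."""
    adapter_name: str
    sequence: str          # matched part of the read
    adapter_sequence: str  # matched part of the adapter
    errors: int


match = Match(adapter_name="adapter3", sequence="AGATCGGA", adapter_sequence="AGATCGGAAG", errors=0)

# str.format resolves "match.sequence" as attribute access on the passed object.
new_name = "{id} seq={match.sequence} errors={match.errors}".format(id="read1", match=match)
print(new_name)
```

One consequence of this spelling is that every attribute of the object becomes reachable from the template, which may or may not be desirable compared with a fixed set of flat names like {matched_sequence}.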

Paired-end

For paired-end reads, it may be useful to allow specifying two different templates, one for R1 and one for R2, but that may not be necessary initially. The following assumes we have one option only.

The same template variables as for single-end reads would be allowed, and they would be interpreted relative to the read whose name is being set. That is, if {length} is used, this would be replaced with the length of R1 in the header of R1 and with the length of R2 in the header of R2.

Additionally, we want to allow transferring information from R1 or R2 into both headers. For this, there would be read-specific versions of the template variables, such as {length_r1} and {length_r2}, which would always contain the length of R1 and R2, respectively.

So there would be {cut_r1}, {cut_r2}, {comment_r1}, {comment_r2}, {match_r1}, {match_r2}, etc. Also:

{rn} – The read number (1 for R1 and 2 for R2).
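A minimal sketch of the paired-end behaviour described above, assuming a single template for both reads: plain variables are filled relative to the read being renamed, while the `_r1`/`_r2` variants and the read id are shared. The function `rename_pair` and its parameters are hypothetical.

```python
from typing import Tuple


def rename_pair(
    template: str,
    read_id: str,
    r1_seq: str,
    r2_seq: str,
    cut_r1: str = "",
    cut_r2: str = "",
) -> Tuple[str, str]:
    """Expand one template into the new names of R1 and R2 (sketch only)."""
    # Values that are the same in both headers.
    shared = dict(
        id=read_id,
        length_r1=len(r1_seq),
        length_r2=len(r2_seq),
        cut_r1=cut_r1,
        cut_r2=cut_r2,
    )
    # Read-relative values: {length}, {cut} and {rn} depend on the read.
    name_r1 = template.format(rn=1, length=len(r1_seq), cut=cut_r1, **shared)
    name_r2 = template.format(rn=2, length=len(r2_seq), cut=cut_r2, **shared)
    return name_r1, name_r2


name_r1, name_r2 = rename_pair(
    "{id}_{cut_r1} rn={rn} len={length}", "read1", "ACGT", "ACGTAC", cut_r1="AACC"
)
print(name_r1)
print(name_r2)
```

Because the UMI is inserted through {cut_r1} rather than {cut}, the part before the separator is the same in both names, so the read ids of the pair still match.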

Edited on 2020-04-22 to take comments into account.

Issue Analytics

Created 4 years ago
Comments: 11 (5 by maintainers)

Top GitHub Comments

marcelm commented, Apr 22, 2020

Thanks all, I’ve updated the issue description to reflect the suggestions from your comments.

marcelm commented, Feb 18, 2022

Closing this issue now as the remaining features have been added:

The {rn} placeholder was already added in Cutadapt 3.2.
I have added a {match_sequence} placeholder just now.

Please open a new issue for any remaining feature requests!
