Flexible read name modifications
A new command-line option would be useful that allows providing a template for changing the read name.
The idea is to have a more general version of the -x and -y options, which can currently only add prefixes or suffixes to read names and only work well for single-end reads.
I’m using the following terms: The FASTQ header line consists of a read id and is optionally followed by a separator (whitespace) and a read comment. The two constituent reads of a paired-end read are R1 and R2.
In paired-end FASTQ files, the read ids of R1 and R2 need to be identical. Cutadapt enforces this when reading paired-end FASTQ files, except that it allows a single trailing “1” or “2” as the only difference between the read ids. This allows for read ids ending in /1 and /2 (some old formats are like this) or .1 and .2 (fastq-dump produces this).
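The terminology and the matching rule above can be sketched in Python. This is only an illustration of the rule as described here, not Cutadapt's actual implementation; the function names split_header and ids_match are made up for this example.

```python
from typing import Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a FASTQ header (without the leading '@') into (id, comment).

    The comment is the empty string if the header has no whitespace separator.
    """
    parts = header.split(maxsplit=1)
    read_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return read_id, comment


def ids_match(id1: str, id2: str) -> bool:
    """Return True if the ids are identical or differ only in a single
    trailing '1' or '2' (hypothetical helper, simplified rule)."""
    if id1 == id2:
        return True
    if id1 and id2 and id1[-1] in "12" and id2[-1] in "12":
        return id1[:-1] == id2[:-1]
    return False
```

With this sketch, ids_match("read/1", "read/2") and ids_match("SRR1.1", "SRR1.2") are both true, while ids_match("readA", "readB") is false.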
Some requirements
- If read names are modified by Cutadapt, the read ids should still match. (That is, Cutadapt itself would not complain when reading its own output.)
- It should be possible to move information from R1 to R2. That is, when removing a UMI from R1, it should be possible to have it in both headers.
Suggested template variables
Single-end
- {id} – The part of the header before the whitespace separator
- {comment} – The part of the header after the whitespace separator
- {length} – Read length
- {removed_prefix} or {cut} – The bases removed with --cut
- {name} – This is already supported by the -x/-y options and is the name of the last matching adapter, or the string no_adapter if there was no match. Even though it should probably have been called {adapter_name}, this should continue to work for compatibility.
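To make the proposal concrete, here is a minimal sketch of how such a template could be filled in for a single-end read, using Python's built-in str.format_map. The function render_single_end and its parameters are hypothetical and only mirror the variables listed above; this is not how Cutadapt implements it.

```python
def render_single_end(template, header, sequence, cut="", adapter_name="no_adapter"):
    """Fill in a read-name template for one single-end read.

    header       -- original FASTQ header without the leading '@'
    sequence     -- read sequence after trimming
    cut          -- bases removed with --cut (empty if nothing was removed)
    adapter_name -- name of the last matching adapter, or 'no_adapter'
    """
    parts = header.split(maxsplit=1)
    fields = {
        "id": parts[0] if parts else "",
        "comment": parts[1] if len(parts) > 1 else "",
        "length": len(sequence),
        "cut": cut,
        "removed_prefix": cut,  # alternative spelling suggested above
        "name": adapter_name,
    }
    return template.format_map(fields)


new_name = render_single_end(
    "{id}_{cut} {comment} length={length}",
    header="read1 sample=A",
    sequence="ACGTACGT",
    cut="TTAG",
)
print(new_name)  # read1_TTAG sample=A length=8
```

An unknown variable in the template raises a KeyError here, which would be one simple way to report typos in a user-supplied template.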
The following variables would have lower priority in my opinion and would not be implemented at first.
- {sep} – The separator between id and comment (using a space should be fine, tabs are rare anyway)
- {header} – The full original header ({id} {comment} should work just as well most of the time)
- {match} – Complete information about the match. This would be all the information present in the Match object that is available internally anyway. The information includes: Start and end position within the adapter, start and end position within the sequence, number of matches, number of errors, name of the matched adapter.
- {matched_sequence} – If there was an adapter match, this would be the matched part of the sequence. Alternatively, this could be spelled as {match.sequence}
- {matched_adapter} – Matched part of the adapter. Could be spelled as {match.adapter_sequence}.
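The dotted spellings such as {match.sequence} would come almost for free with Python's str.format, which supports attribute access inside replacement fields. The sketch below assumes a simplified stand-in for the Match object; its field names are chosen for illustration and do not necessarily agree with Cutadapt's internal class.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for the internal match object."""
    adapter_name: str
    adapter_start: int
    adapter_stop: int
    seq_start: int
    seq_stop: int
    matches: int
    errors: int
    sequence: str          # matched part of the read
    adapter_sequence: str  # matched part of the adapter


def render_with_match(template, read_id, match):
    # str.format resolves "{match.sequence}" by attribute lookup on `match`
    return template.format(id=read_id, match=match)


m = Match("adapter1", 0, 4, 6, 10, 4, 0, "AGAT", "AGAT")
renamed = render_with_match(
    "{id} {match.adapter_name}:{match.sequence} errors={match.errors}", "read1", m
)
print(renamed)  # read1 adapter1:AGAT errors=0
```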
Paired-end
For paired-end reads, it may be useful to allow specifying two different templates, one for R1 and one for R2, but that may not be necessary initially. The following assumes we have one option only.
The same templates as for single-end reads would be allowed, and they would be interpreted relative to the read whose name is being set. That is, if {length} is used, this would be replaced with the length of R1 in the header of R1 and with the length of R2 in the header of R2.
Additionally, we want to allow transferring information from R1 or R2 into both headers. For this, there would be read-specific versions of the template variables, such as {length_r1} and {length_r2}, which would always contain the length of R1 and R2, respectively.
So there would be {cut_r1}, {cut_r2}, {comment_r1}, {comment_r2}, {match_r1}, {match_r2}, etc. Also:
- {rn} – The read number (1 for R1 and 2 for R2).
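As a rough sketch of the paired-end behaviour described above: one template is rendered twice, with the unsuffixed variables taken from the read being renamed and the _r1/_r2 variables always taken from the named read. The helpers read_fields and render_paired_end are hypothetical and only cover a few of the variables.

```python
def read_fields(header, sequence, cut=""):
    """Collect the per-read template variables for one read."""
    parts = header.split(maxsplit=1)
    return {
        "id": parts[0] if parts else "",
        "comment": parts[1] if len(parts) > 1 else "",
        "length": len(sequence),
        "cut": cut,
    }


def render_paired_end(template, r1, r2):
    """Render one template for both reads; returns (name_r1, name_r2)."""
    # Read-specific variables such as {cut_r1} are available in both headers
    shared = {}
    for suffix, fields in (("r1", r1), ("r2", r2)):
        for key, value in fields.items():
            shared[f"{key}_{suffix}"] = value
    names = []
    for rn, fields in ((1, r1), (2, r2)):
        # Unsuffixed variables refer to the read whose name is being set
        names.append(template.format_map({**shared, **fields, "rn": rn}))
    return tuple(names)


# A UMI was removed from R1 only, but should appear in both read ids
r1 = read_fields("read1 first", "ACGTACGT", cut="GGTTAC")
r2 = read_fields("read1 second", "TTGCA")
names = render_paired_end("{id}_{cut_r1} {comment} rn={rn} len={length}", r1, r2)
print(names[0])  # read1_GGTTAC first rn=1 len=8
print(names[1])  # read1_GGTTAC second rn=2 len=5
```

Because the UMI is placed in the id via {cut_r1} rather than {cut}, both read ids stay identical, which satisfies the first requirement listed earlier.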
Edited on 2020-04-22 to take comments into account.
Top GitHub Comments
Thanks all, I’ve updated the issue description to reflect the suggestions from your comments.
Closing this issue now as the remaining features have been added:
- {rn} placeholder was added already in Cutadapt 3.2
- {match_sequence} placeholder just now.
Please open a new issue for any remaining feature requests!