How to extract the UMI barcode from a FASTQ file (GSM3139597)?
Hi, I need to remove the degenerate sequence (N4) at the 5’ and 3’ ends of the reads. My target is GSM3139597. To extract the UMIs from the .fastq file, I installed UMI-tools. Now I have to extract the UMI barcodes from the GSM3139597 reads. Based on the instructions, for single-end reads I have to use the command below:
umi_tools extract --extract-method=string --bc-pattern=[PATTERN] -L extract.log [OPTIONS]
I need guidance on using GSM3139597 as input to that command. In other words, I don’t know how to apply it to my problem, or how to set its arguments for my GSM sample.
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (6 by maintainers)
Top GitHub Comments
I’m not sure I see any conflict. The linked UMI-tools documentation gives instructions for how to craft a regex to deal with any library. The instructions for the NEXTflex library are for that specific library and were written without any knowledge of UMI-tools.
You should process the reads to trim the 3’ adaptor sequence using your favourite adaptor trimmer. For example, `cutadapt` is used by the depositors of your GSE. (This is step one in the process outlined above.)
The second step of the process in the NEXTflex data analysis section is to trim the first and last 4 bases of the adaptor-clipped reads. In this protocol those bases are discarded. However, it looks like the depositors of the GSE you quote realised that, as these bases were random, they could be used as a UMI, and used UMI-tools to remove them from the ends of the read and put them on the read name. To do this it is necessary to tell UMI-tools that the “UMI” is the first and last 4 bases of the read.
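As a rough sketch of step one, the adaptor-trimming call could be assembled along these lines. The file names and the helper `cutadapt_command` are hypothetical, and the adaptor sequence is an assumption (a commonly cited small RNA 3’ adaptor) that should be checked against the kit manual and the GSE methods before use. The snippet only builds and prints the command; it does not run `cutadapt`.

```python
import shlex

# ASSUMPTION: commonly cited small RNA 3' adaptor; verify against the
# library kit manual and the GSE data-processing notes before using it.
ADAPTER_3P = "TGGAATTCTCGGGTGCCAAGG"


def cutadapt_command(fastq_in, fastq_out, adapter=ADAPTER_3P):
    """Assemble (but do not run) a basic cutadapt 3' adaptor-trimming call."""
    # -a gives the 3' adaptor, -o the output file; the input file comes last.
    return ["cutadapt", "-a", adapter, "-o", fastq_out, fastq_in]


if __name__ == "__main__":
    # Hypothetical file names for the downloaded and trimmed reads.
    print(shlex.join(cutadapt_command("GSM3139597.fastq.gz", "trimmed.fastq.gz")))
```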
In regex we specify any 4 bases as `.{4}`. In UMI-tools we specify bases to be captured by creating a named capture group, so a named capture group that captures 4 bases as the first part of a UMI would be `(?P<umi_1>.{4})`. We start our pattern with this to specify that we want to capture the first 4 bases of the read. If we want to be doubly sure this is anchored at the start of the read, we can add a `^` to the start: `^(?P<umi_1>.{4})`.
We then need a similar capture group at the end of the read to capture the last 4 bases of the read as the second part of the UMI: `(?P<umi_2>.{4})`. We can make doubly sure that it is anchored at the end of the read by adding a `$`: `(?P<umi_2>.{4})$`. All we need now is to connect the two together by any number of other bases to give: `^(?P<umi_1>.{4}).+(?P<umi_2>.{4})$`
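To illustrate what this pattern captures, here is a minimal Python sketch using the standard `re` module. It only mimics the effect on a toy read and is not UMI-tools code. The read sequence, the file names, and the helper names `split_umi` and `extract_command` are made up for illustration, and joining the two UMI halves in read order is an assumption. The second helper assembles, without running, a `umi_tools extract` call that uses `--extract-method=regex` in place of the `string` method from the question.

```python
import re
import shlex

# Pattern built up in the comment above: first 4 and last 4 bases are the UMI.
BC_PATTERN = r"^(?P<umi_1>.{4}).+(?P<umi_2>.{4})$"


def split_umi(read):
    """Return (umi, remaining_read), or None if the read does not match."""
    m = re.match(BC_PATTERN, read)
    if m is None:
        return None
    # ASSUMPTION: the two captured halves are joined in read order.
    umi = m.group("umi_1") + m.group("umi_2")
    # Everything between the two capture groups stays in the read.
    remaining = read[m.end("umi_1"):m.start("umi_2")]
    return umi, remaining


def extract_command(fastq_in, fastq_out, log="extract.log"):
    """Assemble (but do not run) a umi_tools extract call in regex mode."""
    return [
        "umi_tools", "extract",
        "--extract-method=regex",
        f"--bc-pattern={BC_PATTERN}",
        f"--stdin={fastq_in}",
        f"--stdout={fastq_out}",
        "-L", log,
    ]


if __name__ == "__main__":
    # Toy adaptor-trimmed read (hypothetical, not taken from GSM3139597).
    print(split_umi("ACGTTTTGGGCCCAATGCA"))
    # Hypothetical file names; shlex.join quotes the pattern for the shell.
    print(shlex.join(extract_command("trimmed.fastq.gz", "extracted.fastq.gz")))
```

Reads shorter than 9 bases cannot satisfy the pattern, since `.+` needs at least one base between the two 4-base groups.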
Interestingly, while the depositors of the GSE use UMI-tools to grab the first and last 4 bases of the read as a “UMI”, they don’t appear to actually use that UMI for anything in their downstream pipeline, unless they have a `umi_tools dedup` step that they have forgotten to include in their methods.
I’m closing this due to low activity. Please re-open if this issue still needs to be resolved.