How to extract the UMI barcode from a FASTQ file (GSM3139597)?

Hi, I need to remove the degenerate sequence (N4) at the 5’ and 3’ ends of the reads. My target is GSM3139597. To extract the UMI from the .fastq file, I installed UMI-tools. Now I have to extract the UMI barcode from GSM3139597. Based on the instructions, for single-end reads I have to use the command below:

umi_tools extract --extract-method=string --bc-pattern=[PATTERN] -L extract.log [OPTIONS]

I need guidance on using GSM3139597 as the input to that command. In other words, I don’t know how to apply the command to my problem, or how to prepare its arguments for my GSM.

Issue Analytics

Created 4 years ago
Comments:10 (6 by maintainers)

Top GitHub Comments

IanSudbery commented, Aug 6, 2019

I’m not sure I see any conflict. The post to the umi-tools documentation gives instructions for how to craft a regex to deal with any library. The instructions for the NEXTflex library are for that specific library and were written without any knowledge of UMI-tools.

You should process the reads to trim the 3’ adaptor sequence using your favourite adaptor trimmer. For example, cutadapt is used by the depositors of your GSE. (This is step one in the process outlined above.)

The second step of the process in the NEXTflex data analysis section is to trim the first and last 4 bases of the adaptor-clipped reads. In this protocol those bases are discarded. However, it looks like the depositors of the GSE you quote realised that, as these bases were random, they could be used as a UMI, and used UMI-tools to remove them from the ends of the read and put them on the read name. To do this it is necessary to tell UMI-tools that the “UMI” is the first and last 4 bases of the read.

In regex we specify any 4 bases as .{4}. In UMI-tools we specify bases to be captured by creating a named capture group, so a named capture group that captures 4 bases as the first part of a UMI would be (?P<umi_1>.{4}). We start our pattern with this to specify that we want to capture the first 4 bases of the read. To make doubly sure this is anchored at the start of the read, we can add a ^ in front: ^(?P<umi_1>.{4}). We then need a similar capture group at the end of the read to capture the last 4 bases as the second part of the UMI: (?P<umi_2>.{4}). We can make doubly sure that it is anchored at the end of the read by adding a $: (?P<umi_2>.{4})$. All we need now is to connect the two together with any number of other bases (.+) to give:

^(?P<umi_1>.{4}).+(?P<umi_2>.{4})$
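
To see what this pattern captures, here is a minimal sketch using Python’s built-in re module. It is only an illustration of the regex, not UMI-tools’ own code; the function name split_umi and the example read are made up for the demonstration.

```python
import re

# Pattern from the comment above: the first 4 and last 4 bases form the UMI.
PATTERN = re.compile(r"^(?P<umi_1>.{4}).+(?P<umi_2>.{4})$")


def split_umi(seq):
    """Return (umi, trimmed_read), or None if the read is too short to match.

    Hypothetical helper for illustration only; it mimics the split the
    pattern describes, not how UMI-tools writes its output.
    """
    m = PATTERN.match(seq)
    if m is None:
        return None
    # Join the two captured parts into one UMI string.
    umi = m.group("umi_1") + m.group("umi_2")
    # Everything between the two capture groups is the remaining read.
    trimmed = seq[m.end("umi_1"):m.start("umi_2")]
    return umi, trimmed


read = "ACGTTTGGAACCTTGGCATG"  # made-up adaptor-trimmed read, 20 bases
print(split_umi(read))         # ('ACGTCATG', 'TTGGAACCTTGG')
print(split_umi("ACGTACGT"))   # None: 8 bases leave nothing for .+ to match
```

Because the pattern requires at least one base between the two groups, reads shorter than 9 bases do not match. In UMI-tools itself, a pattern like this is passed to --bc-pattern with --extract-method=regex rather than the string method quoted in the question.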

Interestingly, while the depositors of the GSE use UMI-tools to grab the first and last 4 bases of the read as a “UMI”, they don’t appear to actually use that UMI for anything in their downstream pipeline, unless they have an umi_tools dedup step that they have forgotten to include in their methods.

TomSmithCGAT commented, Jul 7, 2020

I’m closing this due to low activity. Please re-open if this issue still needs to be resolved.
