How to extract the UMI barcode from a .FASTQ file (GSM3139597)?

Hi, I need to remove the degenerate sequence (N4) at the 5’ and 3’ ends of the reads. My target is GSM3139597. To extract the UMI from the .fastq file, I installed UMI-tools. Now I have to extract the UMI barcode from GSM3139597. Based on the instructions, for single-end reads I have to use the command below:

umi_tools extract --extract-method=string --bc-pattern=[PATTERN] -L extract.log [OPTIONS]

I need guidance on using GSM3139597 as input to that command. In other words, I don’t know how to apply that command to my problem, or how to set its arguments for my GSM.

IanSudbery commented, Aug 6, 2019

I’m not sure I see any conflict. The post to the UMI-tools documentation gives instructions for how to craft a regex to deal with any library. The instructions for the NEXTflex library are for that specific library and were written without any knowledge of UMI-tools.

You should process the reads to trim the 3’ adaptor sequence using your favourite adaptor trimmer. For example, cutadapt is used by the depositors of your GSE. (This is step one in the process outlined above.)

The second step of the process in the NEXTflex data analysis section is to trim the first and last 4 bases of the adaptor-clipped reads. In this protocol those bases are discarded. However, it looks like the depositors of the GSE you quote realised that, as these bases were random, they could be used as a UMI, and used UMI-tools to remove them from the ends of the read and put them on the read name. To do this it is necessary to tell UMI-tools that the “UMI” is the first and last 4 bases of the read.
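
As a rough illustration of that second step, here is a minimal Python sketch (not UMI-tools code) that moves the first and last 4 bases of an adaptor-clipped read onto the read name. The function name, the example read, and the `_` separator are assumptions made for illustration. The quality string is trimmed alongside the sequence so the two stay the same length.

```python
def split_random_ends(name, seq, qual, n=4):
    """Move the first and last n bases of an adaptor-clipped read to the read name.

    Hypothetical helper for illustration only; it is not part of UMI-tools.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be the same length")
    if len(seq) <= 2 * n:
        raise ValueError("read too short to hold both random ends and an insert")
    # The "UMI" is the 5' random bases followed by the 3' random bases.
    umi = seq[:n] + seq[-n:]
    # Trim sequence and qualities identically so the FASTQ record stays valid.
    return f"{name}_{umi}", seq[n:-n], qual[n:-n]


# Made-up read: 4 random bases, a 12-base insert, 4 random bases.
new_name, new_seq, new_qual = split_random_ends(
    "read1", "ACGT" + "TTGACCTAGGCA" + "GGCA", "I" * 20
)
print(new_name, new_seq, new_qual)
```

In the NEXTflex protocol as described above, the two 4-base ends would simply be thrown away; keeping them on the read name is what makes them usable as a UMI later.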

In regex, any 4 bases are written as .{4}. In UMI-tools, bases to be captured are specified with a named capture group, so a group that captures 4 bases as the first part of a UMI is (?P<umi_1>.{4}). We start our pattern with this to capture the first 4 bases of the read. To make doubly sure this is anchored at the start of the read, we can add a ^ in front: ^(?P<umi_1>.{4}). We then need a similar capture group at the end of the read to capture the last 4 bases as the second part of the UMI: (?P<umi_2>.{4}). We can make doubly sure that this is anchored at the end of the read by adding a $: (?P<umi_2>.{4})$. All we need now is to connect the two together with any number of other bases (.+) to give:

^(?P<umi_1>.{4}).+(?P<umi_2>.{4})$

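To check what a pattern of this shape captures before running it on real data, it can be tried out with Python's built-in `re` module. The sketch below assumes the pattern is the two anchored groups joined by `.+`, as described above. The helper names and the FASTQ file names are hypothetical. Note that named capture groups need `--extract-method=regex`, not the `--extract-method=string` used in the question.

```python
import re

# Two 4-base named groups anchored at the read ends, joined by one or more bases.
PATTERN = re.compile(r"^(?P<umi_1>.{4}).+(?P<umi_2>.{4})$")


def extract_umi(seq):
    """Return (umi, remaining_read), or None if the read is too short to match."""
    match = PATTERN.match(seq)
    if match is None:
        return None
    umi = match.group("umi_1") + match.group("umi_2")
    # Everything between the two capture groups is the read that is kept.
    remaining = seq[match.end("umi_1"):match.start("umi_2")]
    return umi, remaining


def umi_tools_args(fastq_in, fastq_out, log="extract.log"):
    """Build (but do not run) an argument list for umi_tools extract.

    The file names passed in are placeholders chosen for this example.
    """
    return [
        "umi_tools", "extract",
        "--extract-method=regex",
        f"--bc-pattern={PATTERN.pattern}",
        f"--stdin={fastq_in}",
        f"--stdout={fastq_out}",
        "-L", log,
    ]


print(extract_umi("ACGT" + "TTGACCTAGGCA" + "GGCA"))
print(extract_umi("ACGTGGCA"))  # 8 bases: nothing left between the two ends
print(" ".join(umi_tools_args("trimmed.fastq", "extracted.fastq")))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems with the `$`, `<`, `>` and parentheses in the pattern. When typing the command directly in a shell, the pattern would need to be wrapped in single quotes.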

Interestingly, while the depositors of the GSE use UMI-tools to grab the first and last 4 bases of the read as a “UMI”, they don’t appear to actually use that UMI for anything in their downstream pipeline, unless they have an umi_tools dedup step that they have forgotten to include in their methods.

TomSmithCGAT commented, Jul 7, 2020

I’m closing this due to low activity. Please re-open if this issue still needs to be resolved.
