Question about TF_distance_matrix.txt
Hi, dear developer, I have some questions about TF_distance_matrix.txt.
- According to the explanation below:
  "Distance matrix used to cluster the transcription factors in the bindetect_figures-dendrograms. This is based on the overlap of individual transcription factor binding sites."
  The distance value is calculated from the TFBS, so does it come from <outdir>/<TF>/beds/<TF>_all.bed? And is the distance tree on every page of bindetect_figures.pdf therefore the same?
- How is the distance calculated? Is it the Jaccard index?
- Can I get the distance values for a specific condition using <outdir>/<TF>/beds/<TF>_<condition>_bound.bed? I think it may offer additional information.
- And does
  "Cluster Motifs: Cluster motifs and create consensus motifs based on similarity"
  have any relationship with the distance value? I think clusterMotifs is based only on motif similarity, without considering the TFBS.
And thanks for this wonderful tool 👍 😃
Best wishes,
Guandong Shang
Top GitHub Comments
Hi Guandong Shang,
The reason why the distance is calculated by TFBS overlap is to get an idea of possible false-positive footprints within motif families. E.g. if a factor such as GATA4 is found to have a footprint, the distances will show that GATA2, GATA3, etc. are also very similar. We can therefore not be sure which of the proteins is actually causing the footprint.
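To make the idea of an overlap-based distance concrete, here is a minimal sketch in Python. The thread does not state which formula TOBIAS uses, so the Jaccard-style metric below is an assumption made purely for illustration, and the function names (`count_overlapping`, `overlap_distance`) and coordinates are hypothetical.

```python
def count_overlapping(sites_a, sites_b):
    """Count sites in sites_a that overlap at least one site in sites_b.

    Sites are (chrom, start, end) tuples with half-open coordinates, as in BED.
    """
    by_chrom = {}
    for chrom, start, end in sites_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    count = 0
    for chrom, start, end in sites_a:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < e and s < end for s, e in by_chrom.get(chrom, [])):
            count += 1
    return count


def overlap_distance(sites_a, sites_b):
    """Jaccard-style distance on sites (ASSUMED metric, not confirmed as TOBIAS's)."""
    if not sites_a or not sites_b:
        return 1.0
    # Take the smaller of the two directional counts so the result is symmetric.
    shared = min(count_overlapping(sites_a, sites_b),
                 count_overlapping(sites_b, sites_a))
    union = len(sites_a) + len(sites_b) - shared
    return 1.0 - shared / union


# Made-up coordinates: two GATA factors sharing most sites, one unrelated factor.
gata4 = [("chr1", 100, 110), ("chr1", 500, 510), ("chr2", 50, 60)]
gata2 = [("chr1", 102, 112), ("chr1", 498, 508), ("chr2", 300, 310)]
ctcf = [("chr3", 10, 29)]

print(overlap_distance(gata4, gata2))  # 0.5
print(overlap_distance(gata4, ctcf))   # 1.0
```

Factors from the same motif family end up with a small distance because their predicted sites largely coincide, which is the ambiguity described above.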
You could probably calculate a distance based on the cooperation of TFs - but this is a different problem from the one TOBIAS solves. You can create a TF x peaks matrix from the “TOBIAS BINDetect” output files, since they contain all the information about which peak each TFBS was found in. Sounds interesting, but it is not something that I am going to get into at this point 😃
Best Mette
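As a rough illustration of the TF x peaks matrix idea, the sketch below builds a binary matrix and counts how often two TFs share a peak. It assumes you have already extracted, for each TF, the set of peak identifiers in which its sites were found. That extraction step is left out because the exact column layout of the output files is not given in this thread. The function names and the example data are hypothetical.

```python
def tf_by_peak_matrix(tf_to_peaks):
    """Build a binary TF x peaks matrix from a dict of TF -> set of peak ids."""
    tfs = sorted(tf_to_peaks)
    peaks = sorted(set().union(*tf_to_peaks.values()))
    matrix = [[1 if peak in tf_to_peaks[tf] else 0 for peak in peaks]
              for tf in tfs]
    return tfs, peaks, matrix


def cooccurrence(tfs, matrix):
    """Count, for every TF pair, the number of peaks containing sites of both."""
    counts = {}
    for i in range(len(tfs)):
        for j in range(i + 1, len(tfs)):
            counts[(tfs[i], tfs[j])] = sum(
                a & b for a, b in zip(matrix[i], matrix[j])
            )
    return counts


# Hypothetical input: peak ids per TF, e.g. collected from the BINDetect output.
example = {
    "GATA4": {"peak_1", "peak_2"},
    "TBX5": {"peak_2", "peak_3"},
    "CTCF": {"peak_4"},
}

tfs, peaks, matrix = tf_by_peak_matrix(example)
pair_counts = cooccurrence(tfs, matrix)
print(pair_counts[("GATA4", "TBX5")])  # 1
```

Co-occurrence within a peak is only a crude proxy for cooperation. A real analysis would likely need to account for peak width and the number of sites per TF.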
I have just released TOBIAS 0.12.3, containing a utility script called cluster_sites_by_overlap.py, which creates the distance matrix and dendrogram for a subset of sites. As an example, to get the clustering of sites in the bound subsets, you can run it with:
cluster_sites_by_overlap.py --bedfiles BINDetect_output/*/beds/*Bcell_bound.bed
Because of the internal normalization in BINDetect, it is not possible to calculate the differential footprint scores on a subset of sites (as the scores would then be shifted with regard to the background scores). So this plot only gives you the dendrogram of overlapping sites. I hope it helps you nonetheless!
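If you want to inspect the bound subsets yourself rather than go through the script, a minimal Python sketch for loading them could look like the following. It assumes the `<outdir>/<TF>/beds/<TF>_<condition>_bound.bed` layout quoted in the question and reads only the first three BED columns (chrom, start, end). The helper names `read_bed_sites` and `load_condition_sites` are hypothetical and not part of TOBIAS.

```python
import glob
import os


def read_bed_sites(path):
    """Read (chrom, start, end) tuples from the first three columns of a BED file."""
    sites = []
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and the optional BED header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            sites.append((chrom, int(start), int(end)))
    return sites


def load_condition_sites(outdir, condition):
    """Collect bound sites per TF for one condition.

    ASSUMES the layout <outdir>/<TF>/beds/<TF>_<condition>_bound.bed.
    """
    pattern = os.path.join(outdir, "*", "beds", f"*_{condition}_bound.bed")
    sites_by_tf = {}
    for path in sorted(glob.glob(pattern)):
        # The TF name is taken from the directory two levels above the file.
        tf = os.path.basename(os.path.dirname(os.path.dirname(path)))
        sites_by_tf[tf] = read_bed_sites(path)
    return sites_by_tf
```

For example, `load_condition_sites("BINDetect_output", "Bcell")` would return a dict mapping each TF directory name to its list of bound sites, which you could then compare pairwise with whatever overlap measure you prefer.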