Question: Tissue Specificity R-package
20 months ago by
MRC Harwell Institute, Oxford, UK
YaGalbi20 wrote:

Hello everyone,

I have written a pipeline that takes count data and calculates tissue specificity. I am planning to make an R-package from this and go for publication. Before I begin packaging and writing drafts, I am doing my literature review and also looking to see if someone has created a package with this function already. Is anyone aware of another package/program that calculates tissue specificity? I certainly couldn't find one - maybe someone else has. Any pointers or advice is appreciated.

I will also be posting this on Biostars and Stack Overflow.

Thanks in advance,



ADD COMMENTlink modified 20 months ago by jaro.slamecka130 • written 20 months ago by YaGalbi20
Answer: Tissue Specificity R-package
20 months ago by
Mitchell Cancer Institute, Mobile AL, USA
jaro.slamecka130 wrote:

Take a look at CellNet from Dr. Patrick Cahan's lab. It uses gene regulatory networks to classify samples into around 14-16 tissue types. It works with human and mouse bulk and single-cell RNA-seq data, and it also includes tools for training new tissue types. CellNet classifies the samples and scores their similarity to real tissues. It also returns a list of transcription factors that researchers might want to modulate so that their samples score better. It uses salmon to quantify transcripts first, so it's pretty fast too. (latest RNA-seq protocol) (original microarray version)

That being said, the fact that there's already a tool out there doesn't at all mean yours won't offer something the other does not. It's always good to have more than one. So please go for it, I'll definitely be curious.

ADD COMMENTlink written 20 months ago by jaro.slamecka130

thank you for that Jaro - I'll be looking into it.

ADD REPLYlink written 20 months ago by YaGalbi20

As you clearly have some experience with CellNet - does it have clear advantages or disadvantages?

ADD REPLYlink written 20 months ago by YaGalbi20

I'd say the main advantage is that the authors curated lots of datasets derived by expression profiling of real tissues. So, as a biologist, if you're developing a new protocol to engineer cells and tissues (e.g. by differentiation of pluripotent stem cells), it can help you check how well you've done without having to get the tissues yourself, extract RNA and use it as a control in your expression profiling. The training data for each tissue deliberately comes from multiple samples and sources to account for perturbations, something that an individual lab would have to invest considerable resources to match. Another advantage is its ability to calculate candidate genes for the biologist to target in order to bring the engineered cells or tissues closer to normal tissues. The authors demonstrated that in another publication.

One disadvantage used to be the lack of support for single-cell RNA-seq data, but the newest version has added that (I haven't had a chance to test it). The only remaining disadvantage I can think of is that it only works with single-end data, so if you have paired-end data, only the reads from one mate of each pair are kept. It also trims the reads down to 40 bases before running salmon. So if you have 100PE data, you could argue that only 20% of that information (40 of the 200 sequenced bases per fragment) is included in the profiling. The authors say that this is for consistency across a greater number of RNA-seq datasets.
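
The 20% figure is just base-count arithmetic, and a rough sketch makes it easy to redo for other read lengths. It assumes, as described above, that one mate per pair is kept and trimmed to 40 bases. The function name and parameters are made up for illustration and are not part of CellNet. Note that it counts bases, which is only a crude proxy for information.

```python
def fraction_of_bases_used(read_length, paired, kept_bases=40):
    """Share of sequenced bases per fragment that survives trimming.

    Assumes only one mate is used and that it is trimmed to `kept_bases`
    (hypothetical helper, not a CellNet function).
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Bases sequenced per fragment: both mates for paired-end, one otherwise.
    sequenced = read_length * (2 if paired else 1)
    # A read shorter than the trim length cannot contribute more than itself.
    used = min(kept_bases, read_length)
    return used / sequenced


print(fraction_of_bases_used(100, paired=True))   # 2 x 100 bp
print(fraction_of_bases_used(50, paired=False))   # 1 x 50 bp
```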

I don't know how much of the training data is adapted from its original microarray version, so that's maybe another thing to consider. Another bioinformatic assay, PluriTest, which scores pluripotency, was also originally built around microarray data. If I'm not mistaken, they adapted it to RNA-seq by intersecting the microarray probes with the corresponding RNA-seq reads, which would also mean that part of the RNA-seq data is discarded before running the test. So if your approach can make full use of paired-end RNA-seq data, it could have an edge. Also, starting from the count matrix as you are proposing would make it easier to quickly analyze data from other labs, provided there are no major differences between the various pipelines that can be used to get the counts.

Either way, it would be great to directly compare your approach and CellNet. Hope this helps!

ADD REPLYlink written 19 months ago by jaro.slamecka130

Oh wow thank you Jaro...I'll go through this tomorrow...but just some thoughts on what I'm considering doing:


1) The basic function of the package would be to take count data and simply calculate specificity from that. For anyone who can load a package in R, it's a fast algorithm that takes about 30 sec once you have the normalised counts from each tissue. Bear in mind also that the algorithm works on counts and so has already been applied (by us here) to counts from other types of data i.e. ChIP-seq histone marks.

2) However, lots of labs will only have their own tissue data, will not know where to obtain more, and will not know how to use R. For this reason, I'm also considering including count data from >20 tissues I have processed myself.

3) This raises the issue of a lab processing data with their own pipeline and then comparing it with data from my pipeline. I'm considering packaging my pipeline into a Docker container so that, if they have access to a bioinformatician, new data can be added with confidence.
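
For readers wondering what "calculate specificity from normalised counts" can look like in practice, here is a minimal sketch using the tau index, one commonly used tissue-specificity measure. This is an assumption for illustration only: the thread does not say which algorithm the proposed package uses, and the function name and example values are hypothetical. It is written in Python rather than R purely to keep the example self-contained.

```python
def tau_index(expression):
    """Tau specificity of one gene across tissues.

    `expression` holds one normalised, non-negative value per tissue.
    Returns 0.0 for perfectly uniform expression, 1.0 when the gene is
    expressed in a single tissue, and None when it is not expressed at all.
    """
    values = [float(v) for v in expression]
    if len(values) < 2:
        raise ValueError("need values from at least two tissues")
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    peak = max(values)
    if peak == 0:
        return None  # tau is undefined for an unexpressed gene
    # Sum each tissue's shortfall relative to the peak, scaled to [0, 1].
    return sum(1 - v / peak for v in values) / (len(values) - 1)


# Hypothetical normalised values for three genes across four tissues.
genes = {
    "housekeeping": [5, 5, 5, 5],
    "single_tissue": [0, 0, 10, 0],
    "intermediate": [10, 5, 0, 0],
}
for name, values in genes.items():
    print(name, tau_index(values))
```

Because the calculation only needs one non-negative number per gene per tissue, the same function would apply to counts from other assays, which fits point 1 above. Whether values should be log-transformed first is a separate choice that this sketch leaves to the caller.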

ADD REPLYlink modified 19 months ago • written 19 months ago by YaGalbi20