Search
Question: Isoforms detection using bioconductor
0
14 months ago by
Berlin
seungjoon.kim0 wrote:

Dear all,

hi we have RNA-seq data with long reads and try to find transcript isoforms(AS) for every genes. I have now aligned BAM file and want to detect isoforms using IRanges, GenomicRanges and biomart. I have no idea how to detect and quantify the number of isoforms for each gene. Could you give me any advice about work flow or tips how to analyze my data?

modified 14 months ago by Michael Lawrence10k • written 14 months ago by seungjoon.kim0
0
14 months ago by
United States
Michael Lawrence10k wrote:

It's a complex topic but here are some tips:

One option is to quantify junctions and use those to make inferences about transcript structure and quantity. The summarizeJunctions() function in GenomicAlignments might help with this. There are also specialized packages like SGSeq.

External tools like Salmon and Kallisto use deconvolution approaches to quantifying isoforms. The tximport package loads the output of those tools into R/Bioconductor.

Thanks for the advice. So I tried to use SGSeq to analyze data and some questions and errors came out. I tried to follow the method in splice-variant-identification written by Leonard D Goldstein and how do you retrieve GRanges object of FBXO31 gene. Because it is possible to retrieve a certain chromosome but not the specific gene.

So I manually create GRanges object and proceed the protocol and in the last step using analyzeFeathres, it caused errors

> sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)
Error: sample_info is missing column(s) paired_end, read_length, frag_length, lib_size

Could you figure out how to figure out this error?

Hi,

analyzeFeatures needs some basic information on your data (e.g. whether reads are paired-end, the total number of aligned reads in the BAM file etc.) All required info can be obtained with

> si <- getBamInfo(si)

Providing a gene locus is optional, you can do your analysis genome-wide, but this may be computationally intensive, so I would test it on one gene first. Please check the package vignette or help page for more information.

Leonard

Thanks. I will try it.

In the beginning, I found this kind of error.

> bamPath <- "/data/murphy/home/XXX/XXX/sorted_ccs.bam"

> bamFile <- BamFile(bamPath)

> getBamInfo(bamFile)
Error: sample_info must be a data frame with columns sample_name, file_bam

In this case, how can I add sample_name?

Also, I used the same file previously and why other sample_info is missing? (e.g. paired_end, read_length, frag_length, lib_size)

The input for getBamInfo is a data.frame with columns sample_name and file_bam (see example below). Please read the vignette or help pages.

> si <- data.frame(sample_name = c("a", "b"), file_bam = c("file_a.bam", "file_b.bam"), stringsAsFactors = FALSE)
> si <- getBamInfo(si)

Thanks for the comment. it works well with analyzing isoforms

but still has problem when plotting the results.

When I followed this script:

par(mfrow = c(2,1), mar = c(2,2,1,2), pty = "m")
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4){
plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}

It gives this messages:

Error: subscript contains out-of-bounds indices

And the the plot goes out of bound so I cannot see the whole plot at all.

Could you help me to figure out of this problem?

Based on the error message it looks like you are working with less than 4 samples, but you try to make plots for columns 1-4 of the SGFeatureCounts object. Regarding out-of-bound plots, the vignette includes several examples on how to specify what genomic regions to include in the plots. Please have a look at the examples in the vignette.