Question

Isoforms detection using bioconductor

0

Entering edit mode

seungjoon.kim • 0

@seungjoonkim-13660

Last seen 6.7 years ago

Berlin

Dear all,

hi we have RNA-seq data with long reads and try to find transcript isoforms(AS) for every genes. I have now aligned BAM file and want to detect isoforms using IRanges, GenomicRanges and biomart. I have no idea how to detect and quantify the number of isoforms for each gene. Could you give me any advice about work flow or tips how to analyze my data?

Thank you in advance!!

bioconductor biomart granges rna-seq • 2.2k views

ADD COMMENT • link updated 6.7 years ago by Michael Lawrence ★ 11k • written 6.7 years ago by seungjoon.kim • 0

score 0 · Answer 1 · 2017-08-04

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

It's a complex topic but here are some tips:

One option is to quantify junctions and use those to make inferences about transcript structure and quantity. The summarizeJunctions() function in GenomicAlignments might help with this. There are also specialized packages like SGSeq.

External tools like Salmon and Kallisto use deconvolution approaches to quantifying isoforms. The tximport package loads the output of those tools into R/Bioconductor.

ADD COMMENT • link 6.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks for the advice. So I tried to use SGSeq to analyze data and some questions and errors came out. I tried to follow the method in splice-variant-identification written by Leonard D Goldstein and how do you retrieve GRanges object of FBXO31 gene. Because it is possible to retrieve a certain chromosome but not the specific gene.

So I manually create GRanges object and proceed the protocol and in the last step using analyzeFeathres, it caused errors

> sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)

Error: sample_info is missing column(s) paired_end, read_length, frag_length, lib_size

Could you figure out how to figure out this error?

ADD REPLY • link 6.7 years ago seungjoon.kim • 0

0

Entering edit mode

Hi,

analyzeFeatures needs some basic information on your data (e.g. whether reads are paired-end, the total number of aligned reads in the BAM file etc.) All required info can be obtained with

> si <- getBamInfo(si)

Providing a gene locus is optional, you can do your analysis genome-wide, but this may be computationally intensive, so I would test it on one gene first. Please check the package vignette or help page for more information.

Leonard

ADD REPLY • link 6.7 years ago Leonard Goldstein ▴ 170

0

Entering edit mode

Thanks. I will try it.

ADD REPLY • link 6.7 years ago seungjoon.kim • 0

0

Entering edit mode

In the beginning, I found this kind of error.

> bamPath <- "/data/murphy/home/XXX/XXX/sorted_ccs.bam"

> bamFile <- BamFile(bamPath)

> getBamInfo(bamFile)
Error: sample_info must be a data frame with columns sample_name, file_bam

In this case, how can I add sample_name?

Also, I used the same file previously and why other sample_info is missing? (e.g. paired_end, read_length, frag_length, lib_size)

Thanks in advance!!

ADD REPLY • link 6.7 years ago seungjoon.kim • 0

0

Entering edit mode

The input for getBamInfo is a data.frame with columns sample_name and file_bam (see example below). Please read the vignette or help pages.

> si <- data.frame(sample_name = c("a", "b"), file_bam = c("file_a.bam", "file_b.bam"), stringsAsFactors = FALSE)
> si <- getBamInfo(si)

ADD REPLY • link 6.7 years ago Leonard Goldstein ▴ 170

0

Entering edit mode

Thanks for the comment. it works well with analyzing isoforms

but still has problem when plotting the results.

When I followed this script:

par(mfrow = c(2,1), mar = c(2,2,1,2), pty = "m")
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4){
plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}

It gives this messages:

Error: subscript contains out-of-bounds indices

And the the plot goes out of bound so I cannot see the whole plot at all.

Could you help me to figure out of this problem?

ADD REPLY • link 6.7 years ago seungjoon.kim • 0

0

Entering edit mode

Based on the error message it looks like you are working with less than 4 samples, but you try to make plots for columns 1-4 of the SGFeatureCounts object. Regarding out-of-bound plots, the vignette includes several examples on how to specify what genomic regions to include in the plots. Please have a look at the examples in the vignette.

ADD REPLY • link 6.7 years ago Leonard Goldstein ▴ 170