Search
Question: RNA-seq analysis with R/BioC: alignment and read counts
0
gravatar for Fiona
2.4 years ago by
Fiona70
United Kingdom
Fiona70 wrote:

Hi everyone,

I'm analyzing my first paired-end RNA-seq data in R/BioConductor and I'm having some trouble. I've given detail below but please let me know if I need to give more information. I'd be really grateful for any help or pointers.

I downloaded the D. melanogaster reference genome (dm6, the most recent version) from the UCSC server, and used SAMtools and Picard to create a reference genome .fa file. I carried out my alignment to this reference file with R BioC 'QuasR' package, command qAlign, and this gave me .bam files for all my samples. I used:

proj <- qAlign(sampleFile,genomeFile,cacheDir=results)

with sampleFile and genomeFile providing the paths to the sample and reference .fa files, respectively. This step seemed to go fine.

Now I am trying to generate read counts with R BioC 'GenomicFeatures' package. Following the package notes and 'how to' guide, I used:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz",format="gff")

exbygene <- exonsBy(txdb, by="gene")

I get a warning message for makeTxDbFromGFF, stating that:

Warning messages:
1: In matchCircularity(seqlevels(gr), circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  FBtr0084079, FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083,
  FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have CDSs that
  cannot be mapped to an exon: FBtr0100857, FBtr0100861, FBtr0100863,
  FBtr0100866, FBtr0100868, FBtr0100870, FBtr0100883, FBtr0433498,
  FBtr0433502

The exonsBy command runs fine though, so I tried to continue with qCount in 'QuasR' with this code:

anycounts <- qCount(proj, txdb, reportLevel="gene", orientation="any")

and I get the following error:

Error in qCount(proj, txdb, reportLevel = "gene", orientation = "any") :
  sequence levels in 'query' not found in alignment files: 2L, 2R, 3L, 3R, 4, X, Y, Unmapped_Scaffold_8_D1580_D1567, 211000022278158, 211000022278279, 211000022278282, 211000022278298, 211000022278307, 211000022278309, 211000022278436, 211000022278449, 211000022278498, 211000022278522, 211000022278603, 211000022278604, 211000022278664, 211000022278724, 211000022278750, 211000022278760, 211000022278875, 211000022278877, 211000022278878, 211000022278879, 211000022278880, 211000022278985, 211000022279055, 211000022279108, 211000022279132, 211000022279134, 211000022279165, 211000022279188, 211000022279222, 211000022279264, 211000022279342, 211000022279392, 211000022279446, 211000022279528, 211000022279529, 211000022279531, 211000022279555, 211000022279681, 211000022279708, 211000022280133, 211000022280328, 211000022280341, 211000022280347, 211000022280481, 211000022280494, 211000022280645, 211000022280703, mitochondrion_genome, rDNA


As far as my beginner knowledge can tell, the reference I used to create .bam files is not matching up with the reference I am using to count reads against. Is this the case?

If so, I understand why that would be a problem, but I don't understand how the problem has occurred, nor how to fix it. As far as I can tell, I used the same reference for both of these steps (in the form of the .fa file for alignment and the .gff to make the gene model). Can anyone help me solve this one?

Thanks in advance for any help,

Fiona

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Scientific Linux release 6.6 (Carbon)

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ShortRead_1.26.0        GenomicAlignments_1.4.1 BiocParallel_1.2.3     
 [4] Rsamtools_1.20.4        Biostrings_2.36.1       XVector_0.8.0          
 [7] rtracklayer_1.28.4      edgeR_3.10.2            limma_3.24.10          
[10] DESeq_1.20.0            lattice_0.20-31         locfit_1.5-9.1         
[13] GenomicFeatures_1.20.1  AnnotationDbi_1.30.1    Biobase_2.28.0         
[16] QuasR_1.8.3             Rbowtie_1.8.0           GenomicRanges_1.20.5   
[19] GenomeInfoDb_1.4.0      IRanges_2.2.4           S4Vectors_0.6.0        
[22] BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.18.3 RColorBrewer_1.1-2   futile.logger_1.4.1
 [4] bitops_1.0-6         futile.options_1.0.0 tools_3.2.0         
 [7] zlibbioc_1.14.0      biomaRt_2.24.0       annotate_1.46.0     
[10] RSQLite_1.0.0        BSgenome_1.36.0      DBI_0.3.1           
[13] genefilter_1.50.0    hwriter_1.3.2        grid_3.2.0          
[16] XML_3.98-1.2         survival_2.38-2      latticeExtra_0.6-26
[19] geneplotter_1.46.0   lambda.r_1.1.7       splines_3.2.0       
[22] xtable_1.7-4         RCurl_1.95-4.6     
ADD COMMENTlink modified 2.4 years ago by Hervé Pagès ♦♦ 13k • written 2.4 years ago by Fiona70
0
gravatar for Sonali Arora
2.4 years ago by
Sonali Arora360
United States
Sonali Arora360 wrote:

Hi Fiona, 

My understanding from the information that you've given is, that the seqlevelsStyle of the GFF files is not UCSC and that is giving rise
to these issues. I suggest using the following steps 

a) read in the GFF file as a GRanges
b) read in one of the bam files as a GenomicAlignments. 
c) check if both of them have the same chromosome naming convention ?
d) convert everything to have same naming convention
e) then create the txdb object. 

I have shown some code below - but have used test files which you can 
replace with your own GFF and BAM files

Test GFF file

library(rtracklayer)
test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")


Test Bam file

library(pasillaBamSubset)
fl <- untreated3_chr4()   # filename as a character


Here is an example of how to read in a GFF file as a GRanges - 

library(rtracklayer)
gff_gr <- import(test_gff3, format="gff")
seqlevelsStyle(gff_gr)

Read in your BAM files into R with  - 

library(GenomicAlignments) 
bamData <- readGAlignmentsList(fl)
seqlevelsStyle(bamData) 

Checking step: 

if(seqlevelsStyle(galist1)==seqlevelsStyle(gff_gr))
{
    # create the txdb
    txdb <- makeTxdbFromGRanges(gff_gr) # should work!
} 


This is how you change the chromosome naming style to "UCSC"

library(GenomeInfoDb)
seqlevelsStyle(gff_gr) <- "UCSC"

then run 

txdb <- makeTxdbFromGRanges(gff_gr)


Hope that helps!
Sonali. 

ADD COMMENTlink written 2.4 years ago by Sonali Arora360

Thank you for the code and for the explanation! This makes sense, and the code fixes the problem.

ADD REPLYlink written 2.4 years ago by Fiona70
0
gravatar for Gordon Smyth
2.4 years ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

Hi Fiona,

I don't want to confuse you with more alternatives, but there is a complete start to finish case study of analysing D. melanogaster RNA-seq data in the last chapter of the limma User's Guide:

 http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

This example uses featureCounts() in the Rsubread package to count the reads. It may be worth considering.

ADD COMMENTlink written 2.4 years ago by Gordon Smyth32k

Thanks very much for the suggestion - I didn't realize limma could be used for RNAseq as well as microarray analysis, so this will definitely be worth me trying out.

ADD REPLYlink written 2.4 years ago by Fiona70
0
gravatar for Hervé Pagès
2.4 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Fiona,

I suppose you got dmel-all-r6.05.gff.gz from FlyBase. AFAIK it contains annotations based on reference genome BDGP Release 6 which is indeed the same assembly as dm6 from UCSC. However, as Sonali suggested, UCSC doesn't use the same chromosome naming convention as FlyBase. An easy way to address this is to grab the gene/transcript/exon models from UCSC. That is, just replace:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz", format="gff")

with

txdb <- makeTxDbFromUCSC(genome="dm6", tablename="refGene")

See ?makeTxDbFromUCSC for more information.

Hope this helps,

H.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Hervé Pagès ♦♦ 13k

Thanks for pointing this out - I didn't know about the makeTxDbFromUCSC function, and it fixes the problem.

ADD REPLYlink written 2.4 years ago by Fiona70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 111 users visited in the last hour