Question: RNA-seq analysis with R/BioC: alignment and read counts
gravatar for Fiona
4.0 years ago by
United Kingdom
Fiona70 wrote:

Hi everyone,

I'm analyzing my first paired-end RNA-seq data in R/BioConductor and I'm having some trouble. I've given detail below but please let me know if I need to give more information. I'd be really grateful for any help or pointers.

I downloaded the D. melanogaster reference genome (dm6, the most recent version) from the UCSC server, and used SAMtools and Picard to create a reference genome .fa file. I carried out my alignment to this reference file with R BioC 'QuasR' package, command qAlign, and this gave me .bam files for all my samples. I used:

proj <- qAlign(sampleFile,genomeFile,cacheDir=results)

with sampleFile and genomeFile providing the paths to the sample and reference .fa files, respectively. This step seemed to go fine.

Now I am trying to generate read counts with R BioC 'GenomicFeatures' package. Following the package notes and 'how to' guide, I used:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz",format="gff")

exbygene <- exonsBy(txdb, by="gene")

I get a warning message for makeTxDbFromGFF, stating that:

Warning messages:
1: In matchCircularity(seqlevels(gr), circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  FBtr0084079, FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083,
  FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have CDSs that
  cannot be mapped to an exon: FBtr0100857, FBtr0100861, FBtr0100863,
  FBtr0100866, FBtr0100868, FBtr0100870, FBtr0100883, FBtr0433498,

The exonsBy command runs fine though, so I tried to continue with qCount in 'QuasR' with this code:

anycounts <- qCount(proj, txdb, reportLevel="gene", orientation="any")

and I get the following error:

Error in qCount(proj, txdb, reportLevel = "gene", orientation = "any") :
  sequence levels in 'query' not found in alignment files: 2L, 2R, 3L, 3R, 4, X, Y, Unmapped_Scaffold_8_D1580_D1567, 211000022278158, 211000022278279, 211000022278282, 211000022278298, 211000022278307, 211000022278309, 211000022278436, 211000022278449, 211000022278498, 211000022278522, 211000022278603, 211000022278604, 211000022278664, 211000022278724, 211000022278750, 211000022278760, 211000022278875, 211000022278877, 211000022278878, 211000022278879, 211000022278880, 211000022278985, 211000022279055, 211000022279108, 211000022279132, 211000022279134, 211000022279165, 211000022279188, 211000022279222, 211000022279264, 211000022279342, 211000022279392, 211000022279446, 211000022279528, 211000022279529, 211000022279531, 211000022279555, 211000022279681, 211000022279708, 211000022280133, 211000022280328, 211000022280341, 211000022280347, 211000022280481, 211000022280494, 211000022280645, 211000022280703, mitochondrion_genome, rDNA

As far as my beginner knowledge can tell, the reference I used to create .bam files is not matching up with the reference I am using to count reads against. Is this the case?

If so, I understand why that would be a problem, but I don't understand how the problem has occurred, nor how to fix it. As far as I can tell, I used the same reference for both of these steps (in the form of the .fa file for alignment and the .gff to make the gene model). Can anyone help me solve this one?

Thanks in advance for any help,



R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Scientific Linux release 6.6 (Carbon)

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ShortRead_1.26.0        GenomicAlignments_1.4.1 BiocParallel_1.2.3     
 [4] Rsamtools_1.20.4        Biostrings_2.36.1       XVector_0.8.0          
 [7] rtracklayer_1.28.4      edgeR_3.10.2            limma_3.24.10          
[10] DESeq_1.20.0            lattice_0.20-31         locfit_1.5-9.1         
[13] GenomicFeatures_1.20.1  AnnotationDbi_1.30.1    Biobase_2.28.0         
[16] QuasR_1.8.3             Rbowtie_1.8.0           GenomicRanges_1.20.5   
[19] GenomeInfoDb_1.4.0      IRanges_2.2.4           S4Vectors_0.6.0        
[22] BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.18.3 RColorBrewer_1.1-2   futile.logger_1.4.1
 [4] bitops_1.0-6         futile.options_1.0.0 tools_3.2.0         
 [7] zlibbioc_1.14.0      biomaRt_2.24.0       annotate_1.46.0     
[10] RSQLite_1.0.0        BSgenome_1.36.0      DBI_0.3.1           
[13] genefilter_1.50.0    hwriter_1.3.2        grid_3.2.0          
[16] XML_3.98-1.2         survival_2.38-2      latticeExtra_0.6-26
[19] geneplotter_1.46.0   lambda.r_1.1.7       splines_3.2.0       
[22] xtable_1.7-4         RCurl_1.95-4.6     
ADD COMMENTlink modified 4.0 years ago by Hervé Pagès ♦♦ 14k • written 4.0 years ago by Fiona70
Answer: RNA-seq analysis with R/BioC: alignment and read counts
gravatar for Sonali Arora
4.0 years ago by
Sonali Arora380
United States
Sonali Arora380 wrote:

Hi Fiona, 

My understanding from the information that you've given is, that the seqlevelsStyle of the GFF files is not UCSC and that is giving rise
to these issues. I suggest using the following steps 

a) read in the GFF file as a GRanges
b) read in one of the bam files as a GenomicAlignments. 
c) check if both of them have the same chromosome naming convention ?
d) convert everything to have same naming convention
e) then create the txdb object. 

I have shown some code below - but have used test files which you can 
replace with your own GFF and BAM files

Test GFF file

test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")

Test Bam file

fl <- untreated3_chr4()   # filename as a character

Here is an example of how to read in a GFF file as a GRanges - 

gff_gr <- import(test_gff3, format="gff")

Read in your BAM files into R with  - 

bamData <- readGAlignmentsList(fl)

Checking step: 

    # create the txdb
    txdb <- makeTxdbFromGRanges(gff_gr) # should work!

This is how you change the chromosome naming style to "UCSC"

seqlevelsStyle(gff_gr) <- "UCSC"

then run 

txdb <- makeTxdbFromGRanges(gff_gr)

Hope that helps!

ADD COMMENTlink written 4.0 years ago by Sonali Arora380

Thank you for the code and for the explanation! This makes sense, and the code fixes the problem.

ADD REPLYlink written 4.0 years ago by Fiona70
Answer: RNA-seq analysis with R/BioC: alignment and read counts
gravatar for Gordon Smyth
4.0 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

Hi Fiona,

I don't want to confuse you with more alternatives, but there is a complete start to finish case study of analysing D. melanogaster RNA-seq data in the last chapter of the limma User's Guide:

This example uses featureCounts() in the Rsubread package to count the reads. It may be worth considering.

ADD COMMENTlink written 4.0 years ago by Gordon Smyth37k

Thanks very much for the suggestion - I didn't realize limma could be used for RNAseq as well as microarray analysis, so this will definitely be worth me trying out.

ADD REPLYlink written 4.0 years ago by Fiona70
Answer: RNA-seq analysis with R/BioC: alignment and read counts
gravatar for Hervé Pagès
4.0 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Fiona,

I suppose you got dmel-all-r6.05.gff.gz from FlyBase. AFAIK it contains annotations based on reference genome BDGP Release 6 which is indeed the same assembly as dm6 from UCSC. However, as Sonali suggested, UCSC doesn't use the same chromosome naming convention as FlyBase. An easy way to address this is to grab the gene/transcript/exon models from UCSC. That is, just replace:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz", format="gff")


txdb <- makeTxDbFromUCSC(genome="dm6", tablename="refGene")

See ?makeTxDbFromUCSC for more information.

Hope this helps,


ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Hervé Pagès ♦♦ 14k

Thanks for pointing this out - I didn't know about the makeTxDbFromUCSC function, and it fixes the problem.

ADD REPLYlink written 4.0 years ago by Fiona70
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 114 users visited in the last hour