Question: mapping FASTQ files to single gene (fluorescent reporter)
0
gravatar for A
6 months ago by
A40
A40 wrote:

Hi all, 

I have recently posted an Rsubread featurecounts related question, but I have another one, based on a quality control analysis I would like to do: 

I am attempting to identify a fluorescent reporter that is driven of a specific promoter in our cells of interest which we then sort (based on fluorescence) and sequence. The fluorescent reporter is ZsG. I cannot find a dedicated FASTA file for this on ensembl or anywhere else for that matter apart from the ebi. The ebi contains two sequences (presumably two isoforms of the protein) that I used as the FASTA file for the reference genome. The following code was used which succesful

ref <- "AAF03372.fa"
buildindex(basename="reference_index",reference=ref)

Reference index builds successfully with no errors... 

I then use the code as per the vignette for paired-end data:

reads1<-"lane1_001_.......R1.fastq.gz"
reads2<-"ane1_001_.......R2.fastq.gz"

align(index="reference_index",readfile1=reads1,readfile2=reads2,
output_file="alignResultsPE_1.BAM",phredOffset=33 )

This also run succesfully, however, I get 0 mapping! And am abit concerned because I have no reason to believe out of ~300 samples that cell sorting has not worked correctly.. Especially as some organs express alot of this reporter based on a bigger population of cells of interest, and the gating is stringent based on negative control samples... so I have no reason to doubt that these cells are not green!!.. 

Are there any tips that you could give for why this may not be working? could it be the sequence in the fasta file? or any parameters that i should be tweaking in the align function (I have increased the number of mismatches to 6 to see if that helped but it didn't!)

Total fragments : 16,782,941                                           ||

||              Mapped : 0 (0.0%)                                             ||

||     Uniquely mapped : 0                                                    ||

||       Multi-mapping : 0                                                    ||

||                                                                            ||

||            Unmapped : 16,782,941                                           ||

||                                                                            || etc.......

Is there anything I am clearly doing wrong? What are the potential problems that could cause this, is compiling a library of one gene sufficient for this type of analysis? should this be added on to a FASTA file from the mouse genome etc... Any advice or tips would be greatly appreciated! Many thanks!!

alignment rsubread align • 238 views
ADD COMMENTlink written 6 months ago by A40

As per the posting guide, you need to provide the session info so we can see what versions of software packages you use. Please also provide the full screen output of buildindex() and align() functions. How long are the two reference sequences?

ADD REPLYlink written 6 months ago by Wei Shi3.1k

Thank you!! please see below! Apologies this was not done initially: Please see information requested below including all outputs:

Session Info:

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.40.0            GenomicAlignments_1.18.0    SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
 [5] matrixStats_0.54.0          Biobase_2.42.0              Rsamtools_1.34.0            GenomicRanges_1.34.0       
 [9] GenomeInfoDb_1.18.1         Biostrings_2.50.1           XVector_0.22.0              IRanges_2.16.0             
[13] S4Vectors_0.20.1            BiocParallel_1.16.0         BiocGenerics_0.28.0         Rsubread_1.32.1            

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.1          Formula_1.2-3          assertthat_0.2.0      
 [5] BiocManager_1.30.4     latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.2.0
 [9] pillar_1.3.0           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-38       
[13] glue_1.3.0             digest_0.6.18          RColorBrewer_1.1-2     checkmate_1.8.5       
[17] colorspace_1.3-2       htmltools_0.3.6        Matrix_1.2-15          plyr_1.8.4            
[21] DESeq2_1.22.1          XML_3.98-1.16          pkgconfig_2.0.2        genefilter_1.64.0     
[25] zlibbioc_1.28.0        purrr_0.2.5            xtable_1.8-3           scales_1.0.0          
[29] htmlTable_1.12         tibble_1.4.2           annotate_1.60.0        ggplot2_3.1.0         
[33] nnet_7.3-12            lazyeval_0.2.1         survival_2.43-1        magrittr_1.5          
[37] crayon_1.3.4           memoise_1.1.0          hwriter_1.3.2          foreign_0.8-71        
[41] tools_3.5.1            data.table_1.11.8      stringr_1.3.1          locfit_1.5-9.1        
[45] munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.44.0   bindrcpp_0.2.2        
[49] compiler_3.5.1         rlang_0.3.0.1          grid_3.5.1             RCurl_1.95-4.11       
[53] rstudioapi_0.8         htmlwidgets_1.3        bitops_1.0-6           base64enc_0.1-3       
[57] gtable_0.2.0           DBI_1.0.0              R6_2.3.0               gridExtra_2.3         
[61] knitr_1.20             dplyr_0.7.8            bit_1.1-14             bindr_0.1.1           
[65] Hmisc_4.1-1            stringi_1.2.4          Rcpp_1.0.0             geneplotter_1.60.0    
[69] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5      

 

 

ADD REPLYlink written 6 months ago by A40

Submittign differenet replies because of word count: Build index output:

BuildIndex output:

Rsubread 1.32.1

//=========================== indexBuilder setting ===========================\

|| ||

|| Index name : reference_index ||

|| Index space : base-space ||

|| One block index : yes ||

|| Repeat threshold : 100 repeats ||

|| Distance to next subread : 1 ||

|| ||

|| Input files : 1 file in total ||

|| o AAF03372.fasta ||

|| ||

\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\

|| ||

|| Check the integrity of provided reference sequences ... ||

|| No format issues were found ||

|| Scan uninformative subreads in reference sequences ... ||

|| 8%, 0 mins elapsed, rate=0.0k bps/s, total=0m ||

|| 16%, 0 mins elapsed, rate=0.1k bps/s, total=0m ||

|| 24%, 0 mins elapsed, rate=0.1k bps/s, total=0m ||

|| 33%, 0 mins elapsed, rate=0.2k bps/s, total=0m ||

|| 41%, 0 mins elapsed, rate=0.2k bps/s, total=0m ||

|| 49%, 0 mins elapsed, rate=0.3k bps/s, total=0m ||

|| 58%, 0 mins elapsed, rate=0.3k bps/s, total=0m ||

|| 66%, 0 mins elapsed, rate=0.4k bps/s, total=0m ||

|| 74%, 0 mins elapsed, rate=0.4k bps/s, total=0m ||

|| 83%, 0 mins elapsed, rate=0.5k bps/s, total=0m ||

|| 91%, 0 mins elapsed, rate=0.5k bps/s, total=0m ||

|| 99%, 0 mins elapsed, rate=0.6k bps/s, total=0m ||

|| 107%, 0 mins elapsed, rate=0.6k bps/s, total=0m ||

|| Build the index... ||

|| 199%, 0 mins elapsed, rate=612.0k bps/s, total=0m ||

|| 207%, 0 mins elapsed, rate=318.8k bps/s, total=0m ||

|| 215%, 0 mins elapsed, rate=265.2k bps/s, total=0m ||

|| 224%, 0 mins elapsed, rate=275.4k bps/s, total=0m ||

|| 232%, 0 mins elapsed, rate=238.0k bps/s, total=0m ||

|| 240%, 0 mins elapsed, rate=246.5k bps/s, total=0m ||

|| 249%, 0 mins elapsed, rate=255.0k bps/s, total=0m ||

|| 257%, 0 mins elapsed, rate=158.1k bps/s, total=0m ||

|| 265%, 0 mins elapsed, rate=136.0k bps/s, total=0m ||

|| 274%, 0 mins elapsed, rate=120.2k bps/s, total=0m ||

|| 282%, 0 mins elapsed, rate=115.6k bps/s, total=0m ||

|| 290%, 0 mins elapsed, rate=119.0k bps/s, total=0m ||

|| 299%, 0 mins elapsed, rate=114.7k bps/s, total=0m ||

|| 307%, 0 mins elapsed, rate=117.9k bps/s, total=0m ||

|| Save current index block... ||

|| [ 0.0% finished ] ||

|| [ 10.0% finished ] ||

|| [ 20.0% finished ] ||

|| [ 30.0% finished ] ||

|| [ 40.0% finished ] ||

|| [ 50.0% finished ] ||

|| [ 60.0% finished ] ||

|| [ 70.0% finished ] ||

|| [ 80.0% finished ] ||

|| [ 90.0% finished ] ||

|| [ 100.0% finished ] ||

|| ||

|| Total running time: 0.1 minutes. ||

|| Index reference_index was successfully built! ||

|| ||

 

ADD REPLYlink written 6 months ago by A40

aligh() output:

|Function : Read alignment (RNA-Seq) || || Input file 1 : lane1_020_Ki_GTCTTGGC_L001_4198STDY7571463_R1.fastq.gz || || Input file 2 : lane1_020_Ki_GTCTTGGC_L001_4198STDY7571463_R2.fastq.gz || || Output file : alignResultsPE_1.BAM (BAM) || || Index name : reference_index || || || || ------------------------------------ || || || || Threads : 1 || || Phred offset : 33 || || # of extracted subreads : 10 || || Min read1 vote : 3 || || Min read2 vote : 1 || || Max fragment size : 600 || || Min fragment size : 50 || || Max mismatches : 3 || || Max indel length : 5 || || Report multi-mapping reads : yes || || Max alignments per multi-mapping read : 1 || || || \===================== http://subread.sourceforge.net/ ======================//

//================= Running (06-Dec-2018 07:53:10, pid=3681) =================\ || || || The input file contains base space reads. || || The range of Phred scores observed in the data is [0,37] || || Load the 1-th index block... || || 0% completed, 0.0 mins elapsed, rate=36.6k fragments per second || || 6% completed, 0.6 mins elapsed, rate=57.5k fragments per second || ...... 99% completed, 10 mins elapsed, rate=49.3k fragments per second || || || || Completed successfully. || || || \============================================================================//

//================================= Summary ==================================\ || || || Total fragments : 30,661,522 || || Mapped : 0 (0.0%) || || Uniquely mapped : 0 || || Multi-mapping : 0 || || || || Unmapped : 30,661,522 || || || || Correctly paired : 0 || || Not mapped in pairs : 0 || || Only one end mapped : 0 || || Multi-chromosomes : 0 || || Different strands : 0 || || Not in PE distance : 0 || || Abnormal order : 0 || || || || Indels : 0 || || || || Running time : 10.4 minutes ||

ADD REPLYlink modified 6 months ago • written 6 months ago by A40

apologies, align() output pasted correctly but posted as above, hope thats ok to read?

>ENA|AAF03373|AAF03373.1 Zoanthus sp. fluorescent protein FP538
ATGGCTCATTCAAAGCACGGTCTAAAAGAAGAAATGACAATGAAATACCACATGGAAGGG
TGCGTCAACGGACATAAATTTGTGATCACGGGCGAAGGCATTGGATATCCGTTCAAAGGG
AAACAGACTATTAATCTGTGTGTGATCGAAGGGGGACCATTGCCATTTTCCGAAGACATA
TTGTCAGCTGGCTTTAAGTACGGAGACAGGATTTTCACTGAATATCCTCAAGACATAGTA
GACTATTTCAAGAACTCGTGTCCTGCTGGATATACATGGGGCAGGTCTTTTCTCTTTGAG
GATGGAGCAGTCTGCATATGCAATGTAGATATAACAGTGAGTGTCAAAGAAAACTGCATT
TATCATAAGAGCATATTTAATGGAATGAATTTTCCTGCTGATGGACCTGTGATGAAAAAG
ATGACAACTAACTGGGAAGCATCCTGCGAGAAGATCATGCCAGTACCTAAGCAGGGGATA
CTGAAAGGGGATGTCTCCATGTACCTCCTTCTGAAGGATGGTGGGCGTTACCGGTGCCAG
TTCGACACAGTTTACAAAGCAAAGTCTGTGCCAAGTAAGATGCCGGAGTGGCACTTCATC
CAGCATAAGCTCCTCCGTGAAGACCGCAGCGATGCTAAGAATCAGAAGTGGCAGCTGACA
GAGCATGCTATTGCATTCCCTTCTGCCTTGGCCTGA

sequences as above, that is one of the sequences.. other sequences similar in length and I have also tried the complete CDS... Thank you!

ADD REPLYlink written 6 months ago by A40

You commands look fine. I have no idea why you got no mapped reads at all. You may try to Blast some reads to all nucleotide databases to see if you could get any hits to other species:

https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome

 

 

ADD REPLYlink written 6 months ago by Wei Shi3.1k

Ok thank you! Yes I am trying to manually blast the sequence to see if its there!

ADD REPLYlink written 6 months ago by A40

Hi Wei, 

Resurrecting this post! I figured out the problem and it was simply the sequence of the FASTA file. It was incorrect and the sequence in the cloning vector was slightly different to the reporter gene on the embl website.

I am trying to annotate this and generate counts using featurecounts, but I am haing trouble generating a gtf file, however I can;t see the problem. I have looked at other forum posts and I feel like the information entered is correct but also in the right. The entry in the GTF file is shown below (It is a one line GTF file as I just want to generate counts for this feature).. All spaces are separated by an exon (below has changed when pasting). 

ZsGreen1  (696 bp) unknown CDS 1 696 . + . gene_id ”ZsGreen1  (696 bp)”; transcript_id “ZsG1”;

When running feature counts:

fc <- Rsubread::featureCounts(fls, isGTFAnnotationFile = TRUE, isPairedEnd = F, allowMultiOverlap = TRUE, requireBothEndsMapped = F, countMultiMappingReads = FALSE, annot.ext = "ZsG.gtf")

Features : 1                                                            ||
||    Meta-features : 1                                                       ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file alignResultsPE.BAM...                                     ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 20060707                                             ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.30 minutes     

and when looking at fc$stats: unassigned_noFeatures= 

10         Unassigned_NoFeatures: 4347

I really dont know whats wrong with the GTF file that is causing the feature to be missed during counting and any help would be much appreciated!!

fc$annotation

 

 GeneID                Chr Start End Strand
1 ZsGreen1  (696 bp) ZsGreen1  (696 bp)     1 696      +
  Length
1    696

Not posting sessioninfo() because of word count, however if required I will happily provide!

Thank you so much!

ADD REPLYlink written 6 months ago by A40

UPDATE

Fixed! Top line of the FASTA file was ZsGreen1       (696 bp), 

I thought 696bp was part of the title, but changing the first line to ZsGreen1 alone solved the problem!!

 

ADD REPLYlink written 6 months ago by A40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 278 users visited in the last hour