help with converting MSTRG tags with isoformswictanalyzer
0
0
Entering edit mode
sumitra.20 • 0
@3b59063e
Last seen 9 months ago
Denmark

Hi everyone,

Im using the hisat2-stringtie pipeline for my rice transcriptome (msu7) analysis and then isoformswictanalyzer to fix all the MSTRG tags. Im facing an error that i have never encountered before and i can't seem to understand where im going wrong. Using the merged.gtf file from stringtie gives me a an error on 'non-conformable arguments error' and when i use the reference gtf file i get a different error. Not sure if i should be using the reference .gtf file or the stringtie merged.gtf file, either ways im getting errors. I'm still a beginner and cant seem to fix this problem. Any help will be appreciated.

    ### Import StringTie Expression data (using the stringtie merged.gtf file)
    stringTieQuant <- importIsoformExpression(parentDir ="/stringtie/GH_ballgown_msu7/",readLength=150)

    myDesign = data.frame(sampleID = c("GH1BM_R1","GH1BM_R2","GH1BM_R3","GH1NC_R1","GH1NC_R2","GH1NC_R2"),
                          condition = c("BM","BM","BM","NC","NC","NC"))

    switchAnalyzeRlist <- importRdata(
        isoformCountMatrix   = stringTieQuant$counts,
        isoformRepExpression = stringTieQuant$abundance,
        designMatrix         = myDesign,
        isoformExonAnnoation = "/stringtie/msu7_merged.gtf",
    )   

OUTPUT:
Step 1 of 10: Checking data...
Step 2 of 10: Obtaining annotation...
importing GTF (this may take a while)...
6695 ( 7.65%) isoforms were removed since they were not expressed in any samples.
Step 3 of 10: Fixing StringTie gene annoation problems...
13604 isoforms were assigned the ref_gene_id and gene_name of their associated gene_id.
This was only done when the parent gene_id were associated with a single ref_gene_id/gene_name.
3155 isoforms were assigned the ref_gene_id and gene_name of the most similar
annotated isoform (defined via overlap in genomic exon coordinates).
This was only done if the overlap met the requriements
indicated by the three fixStringTieViaOverlap* arguments.
We were unable to assign 270 isoforms (located within annotated genes) to a known ref_gene_id/gene_name.
These were removed to enable analysis of the rest of the isoform from within the merged genes.
1430 gene_ids which were associated with multiple ref_gene_id/gene_names
were split into mutliple genes via their ref_gene_id/gene_names.
28749 genes_id were assigned their original gene_id instead of the StringTie gene_id.
This was only done when it could be done unambiguous.
Step 4 of 10: Calculating expression estimates from count data...
Skipped as user supplied expression via the "isoformRepExpression" argument...
Step 5 of 10: Testing for unwanted effects...
**Error in H %*% t(dat) : non-conformable arguments**

    #using reference .gtf file

    switchAnalyzeRlist <- importRdata(
        isoformCountMatrix   = stringTieQuant$counts,
        isoformRepExpression = stringTieQuant$abundance,
        designMatrix         = myDesign,
        isoformExonAnnoation = "/index_msu7/msu7_all.gtf",
    ) 

Error:
Step 1 of 3: Identifying which algorithm was used...
    The quantification algorithm used was: StringTie
    Found 6 quantification file(s) of interest
Step 2 of 3: Reading data...
reading in files with read_tsv
1 2 3 4 5 6 
Step 3 of 3: Normalizing abundance values (not counts) via edgeR...
Done

Step 1 of 10: Checking data...
Step 2 of 10: Obtaining annotation...
    importing GTF (this may take a while)...
Error in importRdata(isoformCountMatrix = stringTieQuant$counts, isoformRepExpression = stringTieQuant$abundance,  : 
  The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925). 
Either isforoms found in the annotation are not quantifed or vise versa. 
Specifically:
 93946 isoforms were quantified.
 66336 isoforms are annotated.
 Only 66296 overlap.
 27650 isoforms quantifed had no corresponding annoation

This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
 1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
 2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
     Examples from expression matrix are : LOC_Os03g62030.6, LOC_Os10g22070.1, MSTRG.24945.3 
     Examples of annoation are : LOC_Os09g32190.1, LOC_Os04g51210.1, LOC_Os06g21470.2 
     Examples of isoforms which were only found im the quantification are  : MSTRG.24656.1, MSTRG.10746.1, MSTRG.14504.1 

If there is a large overlap but still far from complete there are 3 possibilites:
 1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
 2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
 3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
 4) It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.


> > sessionInfo() R version 4.3.1 (2023-06-16) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS
> 
> Matrix products: default BLAS/LAPACK:
> /home/combio7/anaconda3/envs/tophat/lib/libopenblasp-r0.3.23.so; 
> LAPACK version 3.11.0
> 
> locale:  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               
> [3] LC_TIME=ms_MY.UTF-8        LC_COLLATE=en_US.UTF-8      [5]
> LC_MONETARY=ms_MY.UTF-8    LC_MESSAGES=en_US.UTF-8     [7]
> LC_PAPER=ms_MY.UTF-8       LC_NAME=C                   [9]
> LC_ADDRESS=C               LC_TELEPHONE=C             [11]
> LC_MEASUREMENT=ms_MY.UTF-8 LC_IDENTIFICATION=C       
> 
> time zone: Asia/Kuala_Lumpur tzcode source: system (glibc)
> 
> attached base packages: [1] stats4    stats     graphics  grDevices
> utils     datasets  methods   [8] base     
> 
> other attached packages:  [1] IsoformSwitchAnalyzeR_2.0.1
> pfamAnalyzeR_1.0.1           [3] dplyr_1.1.2                
> stringr_1.5.0                [5] readr_2.1.4                
> ggplot2_3.4.2                [7] sva_3.49.0                 
> genefilter_1.83.1            [9] mgcv_1.9-0                 
> nlme_3.1-163                [11] satuRn_1.8.0               
> DEXSeq_1.46.0               [13] RColorBrewer_1.1-3         
> AnnotationDbi_1.63.2        [15] DESeq2_1.40.2              
> SummarizedExperiment_1.31.1 [17] GenomicRanges_1.53.1       
> GenomeInfoDb_1.37.2         [19] IRanges_2.35.2             
> S4Vectors_0.39.1            [21] MatrixGenerics_1.13.1      
> matrixStats_1.0.0           [23] Biobase_2.61.0             
> BiocGenerics_0.47.0         [25] BiocParallel_1.35.3        
> limma_3.57.7               
> 
> loaded via a namespace (and not attached):   [1] jsonlite_1.8.7       
> tximport_1.28.0                 [3] magrittr_2.0.3               
> GenomicFeatures_1.53.1          [5] BiocIO_1.11.0                
> zlibbioc_1.47.0                 [7] vctrs_0.6.3                  
> locfdr_1.1-8                    [9] memoise_2.0.1                
> Rsamtools_2.17.0               [11] RCurl_1.98-1.12              
> htmltools_0.5.6                [13] S4Arrays_1.1.5               
> progress_1.2.2                 [15] AnnotationHub_3.8.0          
> lambda.r_1.2.4                 [17] curl_5.0.1                   
> SparseArray_1.1.11             [19] plyr_1.8.8                   
> futile.options_1.0.1           [21] cachem_1.0.8                 
> GenomicAlignments_1.37.0       [23] mime_0.12                    
> lifecycle_1.0.3                [25] pkgconfig_2.0.3              
> Matrix_1.6-0                   [27] R6_2.5.1                     
> fastmap_1.1.1                  [29] GenomeInfoDbData_1.2.10      
> shiny_1.7.4.1                  [31] digest_0.6.33                
> colorspace_2.1-0               [33] tximeta_1.18.1               
> geneplotter_1.78.0             [35] RSQLite_2.3.1                
> hwriter_1.3.2.1                [37] filelock_1.0.2               
> fansi_1.0.4                    [39] httr_1.4.6                   
> abind_1.4-5                    [41] compiler_4.3.1               
> bit64_4.0.5                    [43] withr_2.5.0                  
> DBI_1.1.3                      [45] biomaRt_2.57.1               
> rappdirs_0.3.3                 [47] DelayedArray_0.27.10         
> rjson_0.2.21                   [49] tools_4.3.1                  
> interactiveDisplayBase_1.38.0  [51] httpuv_1.6.11                
> glue_1.6.2                     [53] VennDiagram_1.7.3            
> restfulr_0.0.15                [55] promises_1.2.1               
> grid_4.3.1                     [57] reshape2_1.4.4               
> generics_0.1.3                 [59] gtable_0.3.3                 
> BSgenome_1.69.0                [61] tzdb_0.4.0                   
> tidyr_1.3.0                    [63] ensembldb_2.25.0             
> hms_1.1.3                      [65] xml2_1.3.5                   
> utf8_1.2.3                     [67] XVector_0.41.1               
> BiocVersion_3.17.1             [69] pillar_1.9.0                 
> vroom_1.6.3                    [71] later_1.3.1                  
> splines_4.3.1                  [73] BiocFileCache_2.9.1          
> lattice_0.21-8                 [75] survival_3.5-5               
> rtracklayer_1.61.0             [77] bit_4.0.5                    
> annotate_1.79.0                [79] tidyselect_1.2.0             
> locfit_1.5-9.8                 [81] Biostrings_2.69.2            
> pbapply_1.7-2                  [83] gridExtra_2.3                
> ProtGenerics_1.33.1            [85] edgeR_3.43.8                 
> futile.logger_1.4.3            [87] statmod_1.5.0                
> stringi_1.7.12                 [89] lazyeval_0.2.2               
> yaml_2.3.7                     [91] boot_1.3-28.1                
> codetools_0.2-19               [93] tibble_3.2.1                 
> BiocManager_1.30.22            [95] cli_3.6.1                    
> xtable_1.8-4                   [97] munsell_0.5.0                
> Rcpp_1.0.11                    [99] dbplyr_2.3.3                 
> png_0.1-8                     [101] XML_3.99-0.14                
> parallel_4.3.1                [103] ellipsis_0.3.2               
> blob_1.2.4                    [105] prettyunits_1.1.1            
> AnnotationFilter_1.25.0       [107] bitops_1.0-7                 
> scales_1.2.1                  [109] purrr_1.0.2                  
> crayon_1.5.2                  [111] rlang_1.1.1                  
> KEGGREST_1.41.0               [113] formatR_1.14
stringtie IsoformSwitchAnalyzeR pfamAnalyzeR gtf • 530 views
ADD COMMENT

Login before adding your answer.

Traffic: 766 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6