Question

DEXSeq errors "Error in scan( line ... did not have 3 elements" and "Error in FUN(X[[i]], ...) : subscript out of bounds" (DEXSeqDataSetFromHTSeq)

0

Entering edit mode

Matthew Thornton ▴ 360

@matthew-thornton-5564

Last seen 4 days ago

USA, Los Angeles, USC

Hello!

I am trying to run DEXseq on some paired-end data and I am getting some errors that I don't normally get.

I used the DEXSeq python scripts to prepare the data and I am using "Homo_sapiens.GRCh38.106.gtf" which I converted with 'python3 /home/met/.local/R_Libraries/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.106.gtf Homo_sapiens.GRCh38.106.chr.gff'

I sorted the STAR output sam files by name

#!/bin/bash

cd /dataVol/data/Vasudevan/13Apr22/paired2/sams/ &&

FILES=*.sam

for f in $FILES
do
        describer=$(echo ${f} | sed 's/.sam//')
        echo $describer
        # sort
        samtools sort -O sam -@ 32 -n -T temp -o "sorted_"${describer}.sam ${describer}.sam
done

I then used the DEXseq scripts to count the data

#!/bin/bash

cd /dataVol/data/Vasudevan/17Apr22/DEX/SAMS/ &&

FILES=*.sam

for f in $FILES
do
        echo "Processing $f file..."
        describer=$(echo ${f} | sed 's/.sam//')
        echo $describer

        python3 /home/met/.local/R_Libraries/DEXSeq/python_scripts/dexseq_count.py -p yes -s reverse -r name Homo_sapiens.GRCh38.106.chr.gff ${describer}.sam ${describer}.txt

done

So then when I use the DEXSeq script I made from the Vignette.

#!/usr/bin/Rscript --vanilla --slave

## Change to data directory
setwd("/dataVol/data/Vasudevan/17Apr22/DEX/SAMS/")

today <- Sys.Date()
dt <- format(today, format="%d%b%y")

library(DEXSeq)

countFiles = list.files(getwd(), pattern="txt$", full.names=TRUE)
basename(countFiles)

flattenedFile = list.files(getwd(), pattern="gff$", full.names=TRUE)
basename(flattenedFile)

sampleTable = data.frame(row.names = c("control1", "control2", "control3", "control4", "test1", "test2", "test3", "test4"), condition = c("control", "control", "control", "control", "test", "test", "test", "test"), LibType = c("paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end"))

sampleTable

dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design = ~ sample + exon + condition:exon, flattenedfile=flattenedFile)

dxd = estimateSizeFactors(dxd)

dxd = estimateDispersions(dxd)

try({
tiff(file=paste(dt, "_Dispersion_estimates_DEXSeq.tiff", sep=""), width=8*300, height=8*300, res=300)
plotDispEsts(dxd)
dev.off()
})

# Save an image
save.image(file=paste(dt,"_1.RData", sep=""))

dxd = testForDEU(dxd)

dxd = estimateExonFoldChanges(dxd, fitExpToVar="condition")

dxr1 = DEXSeqResults(dxd)

dxr1

table(dxr1$padj < 0.05)

table(tapply(dxr1$padj < 0.05, dxr1$groupID, any))

try({
tiff(file=paste(dt, "_MA_plot_DEXSeq.tiff", sep=""), width=8*300, height=8*300, res=300)
plotMA(dxr1, cex=0.8)
dev.off()
})

library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

DEXSeqHTML(dxr1, FDR=0.001, color=c("#FF000080", "#0000FF80"), mart=ensembl, filter="ensembl_gene_id", attributes=c("external_gene_name"))

# Save an image
save.image(file=paste(dt,"_2.RData", sep=""))

# clean all
rm(list=ls(all=TRUE))

quit("yes")

I get the first Error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 688116 did not have 3 elements
Calls: DEXSeqDataSetFromHTSeq -> lapply -> lapply -> FUN -> read.table -> scan
Execution halted

When I open up the files with vim, The line 688116 is close to the bottom and has some data for goodness of HTSeq count. There were four of them,.

_ambiguous      36642
_ambiguous_readpair_position    0
_empty  1704024
_lowaqual       0
_notaligned     0

I deleted them. This fixed the error.

The second error.

The scripts runs until

[1] "sorted_SE7840_SA145170_S1_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[2] "sorted_SE7840_SA145171_S2_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[3] "sorted_SE7840_SA145172_S3_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[4] "sorted_SE7840_SA145173_S4_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[5] "sorted_SE7840_SA145174_S5_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[6] "sorted_SE7840_SA145175_S6_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[7] "sorted_SE7840_SA145176_S7_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[8] "sorted_SE7840_SA145177_S8_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[1] "Homo_sapiens.GRCh38.106.chr.gff"
         condition    LibType
control1   control paired-end
control2   control paired-end
control3   control paired-end
control4   control paired-end
test1         test paired-end
test2         test paired-end
test3         test paired-end
test4         test paired-end
Error in FUN(X[[i]], ...) : subscript out of bounds
Calls: DEXSeqDataSetFromHTSeq -> sapply -> sapply -> lapply
Execution halted

I searched for "DEXSeq and subscript out of bounds and I found some answers where I should see if the "the number of lines containing "exonic_part" in the third column of the flattened gtf file correspond to the number of lines from the count files?" from here 'DEXSeq: Error in FUN(X[[i]], ...) : subscript out of bounds'

When I look at the gff file with vim. There are 745398 rows and the end of the file looks like this.

Y       dexseq_prepare_annotation.py    exonic_part     26622512        26622608        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "007"
Y       dexseq_prepare_annotation.py    exonic_part     26623586        26623666        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "008"
Y       dexseq_prepare_annotation.py    exonic_part     26627943        26628022        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "009"
Y       dexseq_prepare_annotation.py    exonic_part     26628271        26628437        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "010"
Y       dexseq_prepare_annotation.py    exonic_part     26630647        26630749        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "011"
Y       dexseq_prepare_annotation.py    exonic_part     26633345        26633431        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "012"
Y       dexseq_prepare_annotation.py    exonic_part     26634523        26634652        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "013"
Y       dexseq_prepare_annotation.py    aggregate_gene  26626520        26627159        .       -       .       gene_id "ENSG00000231514"
Y       dexseq_prepare_annotation.py    exonic_part      26626520       26627159        .       -       .       gene_id "ENSG00000231514"; transcripts "ENST00000435741"; exonic_part_number "001"
Y       dexseq_prepare_annotation.py    aggregate_gene  56855244        56855488        .       +       .       gene_id "ENSG00000235857"
Y       dexseq_prepare_annotation.py    exonic_part     56855244        56855488        .       +       .       gene_id "ENSG00000235857"; transcripts "ENST00000431853"; exonic_part

Then the end of the count files. 68115 rows.

"ENSG00000289711+ENSG00000084072":"037" 1
"ENSG00000289711+ENSG00000084072":"038" 2
"ENSG00000289711+ENSG00000084072":"039" 9
"ENSG00000289711+ENSG00000084072":"040" 9
"ENSG00000289712":"001" 0
"ENSG00000289712":"002" 0
"ENSG00000289713":"001" 0
"ENSG00000289714":"001" 0
"ENSG00000289718":"001" 0
"ENSG00000289718":"002" 0
"ENSG00000289718":"003" 0
"ENSG00000289718":"004" 0
"ENSG00000289718":"005" 0

There is quite a big jump in the GFF file, from 26626520 to 56855244. Also, it isn't clear how I should be checking. as there are quite a few less rows in my count file. Could it be that the GFF file needs to be sorted by name? Is this what is fragging me?

Could this be a memory error?

Any advice or assistance is greatly appreciated. Thank you!

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblas_zenp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.2

DEXSeq • 3.1k views

ADD COMMENT • link updated 14 months ago by EChambers • 0 • written 2.1 years ago by Matthew Thornton ▴ 360

0

Entering edit mode

I tried it over again with the sam file directly from the STAR alignment. Same problem. I will sort by coordinate and try it again. I am also going to use a computer with 256Gb of memory and see if that works. Thank you

ADD REPLY • link 2.1 years ago Matthew Thornton ▴ 360

0

Entering edit mode

Ran it on the USC CARC SLURM cluster with 512Gb of RAM and same error.

Error in FUN(X[[i]], ...) : subscript out of bounds
Calls: DEXSeqDataSetFromHTSeq -> sapply -> sapply -> lapply
Execution halted

ADD REPLY • link 2.1 years ago Matthew Thornton ▴ 360

0

Entering edit mode

I tried a different alignment and sorting by name and coordinate. Nada. Is no one using this anymore? What are people using? If I have to I'll go back to Ensembl 100. If it still barfs on me, i'm hosed.

ADD REPLY • link 2.1 years ago Matthew Thornton ▴ 360

0

Entering edit mode

Hi Matthew. I Will try to reproduce the error message. Could you point me to the link where you are getting the raw gtf file?

ADD REPLY • link 2.1 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

HI Alejandro! Thank you!! Absolutely. I am getting it from Ensembl, here, http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz I really appreciate your help!!

ADD REPLY • link 2.1 years ago Matthew Thornton ▴ 360

0

Entering edit mode

Hi Matthew. I Will try to reproduce the error message. Could you point me to the link where you are getting the raw gtf file?

ADD REPLY • link 2.1 years ago Alejandro Reyes ★ 1.9k

score 5 · Accepted Answer · 2022-04-20

5

Entering edit mode

Arthur ▴ 50

@93355568

Last seen 2.1 years ago

Netherlands

I ran into exactly the same problem, firstly removing the tail lines due to getting a number of column error and then getting a out of bounds error once those were removed. I manage to solve it for now, hopefully this will be useful.

Luckily, I had some files from a previous analysis so I could reverse engineer what went wrong. When looking into the exon count files generated by the dexseq_count.py, I noticed that there were some extra quotes, which were not there in the old files:

Old files:

ENSMUSG00000000001.4:001 2317

ENSMUSG00000000001.4:002 325

ENSMUSG00000000001.4:003 254

New files (from dexseq_count.py v 1.40.0):

"ENSMUSG00000000001.4":"001" 4745

"ENSMUSG00000000001.4":"002" 677

"ENSMUSG00000000001.4":"003" 522

"ENSMUSG00000000001.4":"004" 363

After removing the quotes (sed 's/\"//g' $file > $fileName"_clean.txt) directly from the dexseq_count.py output files (so keeping those final lines for ambiguous and other counts), DEXSeqDataSetFromHTSeq() was able to recognize the files and I'm not seeing any of those error anymore.

I hope it helps!

ADD COMMENT • link 2.1 years ago Arthur ▴ 50

1

Entering edit mode

That's the ticket! Thank you very much!!

ADD REPLY • link 2.1 years ago Matthew Thornton ▴ 360

0

Entering edit mode

Whoa! Cool. I'll try that. I did see the quotes. Thank you so much!

ADD REPLY • link 2.1 years ago Matthew Thornton ▴ 360

0

Entering edit mode

wow, it works. you're my GOD. btw: I wonder why this issue remains for 8 months, amazing.