DEXSeq errors "Error in scan( line ... did not have 3 elements" and "Error in FUN(X[[i]], ...) : subscript out of bounds" (DEXSeqDataSetFromHTSeq)
1
0
Entering edit mode
@matthew-thornton-5564
Last seen 4 weeks ago
USA, Los Angeles, USC

Hello!

I am trying to run DEXseq on some paired-end data and I am getting some errors that I don't normally get.

I used the DEXSeq python scripts to prepare the data and I am using "Homo_sapiens.GRCh38.106.gtf" which I converted with 'python3 /home/met/.local/R_Libraries/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.106.gtf Homo_sapiens.GRCh38.106.chr.gff'

I sorted the STAR output sam files by name

#!/bin/bash

cd /dataVol/data/Vasudevan/13Apr22/paired2/sams/ &&

FILES=*.sam

for f in $FILES
do
        describer=$(echo ${f} | sed 's/.sam//')
        echo $describer
        # sort
        samtools sort -O sam -@ 32 -n -T temp -o "sorted_"${describer}.sam ${describer}.sam
done

I then used the DEXseq scripts to count the data

#!/bin/bash

cd /dataVol/data/Vasudevan/17Apr22/DEX/SAMS/ &&

FILES=*.sam

for f in $FILES
do
        echo "Processing $f file..."
        describer=$(echo ${f} | sed 's/.sam//')
        echo $describer

        python3 /home/met/.local/R_Libraries/DEXSeq/python_scripts/dexseq_count.py -p yes -s reverse -r name Homo_sapiens.GRCh38.106.chr.gff ${describer}.sam ${describer}.txt

done

So then when I use the DEXSeq script I made from the Vignette.

#!/usr/bin/Rscript --vanilla --slave

## Change to data directory
setwd("/dataVol/data/Vasudevan/17Apr22/DEX/SAMS/")

today <- Sys.Date()
dt <- format(today, format="%d%b%y")

library(DEXSeq)

countFiles = list.files(getwd(), pattern="txt$", full.names=TRUE)
basename(countFiles)

flattenedFile = list.files(getwd(), pattern="gff$", full.names=TRUE)
basename(flattenedFile)

sampleTable = data.frame(row.names = c("control1", "control2", "control3", "control4", "test1", "test2", "test3", "test4"), condition = c("control", "control", "control", "control", "test", "test", "test", "test"), LibType = c("paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end"))

sampleTable

dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design = ~ sample + exon + condition:exon, flattenedfile=flattenedFile)

dxd = estimateSizeFactors(dxd)

dxd = estimateDispersions(dxd)

try({
tiff(file=paste(dt, "_Dispersion_estimates_DEXSeq.tiff", sep=""), width=8*300, height=8*300, res=300)
plotDispEsts(dxd)
dev.off()
})

# Save an image
save.image(file=paste(dt,"_1.RData", sep=""))

dxd = testForDEU(dxd)

dxd = estimateExonFoldChanges(dxd, fitExpToVar="condition")

dxr1 = DEXSeqResults(dxd)

dxr1

table(dxr1$padj < 0.05)

table(tapply(dxr1$padj < 0.05, dxr1$groupID, any))

try({
tiff(file=paste(dt, "_MA_plot_DEXSeq.tiff", sep=""), width=8*300, height=8*300, res=300)
plotMA(dxr1, cex=0.8)
dev.off()
})

library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

DEXSeqHTML(dxr1, FDR=0.001, color=c("#FF000080", "#0000FF80"), mart=ensembl, filter="ensembl_gene_id", attributes=c("external_gene_name"))

# Save an image
save.image(file=paste(dt,"_2.RData", sep=""))

# clean all
rm(list=ls(all=TRUE))

quit("yes")

I get the first Error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 688116 did not have 3 elements
Calls: DEXSeqDataSetFromHTSeq -> lapply -> lapply -> FUN -> read.table -> scan
Execution halted

When I open up the files with vim, The line 688116 is close to the bottom and has some data for goodness of HTSeq count. There were four of them,.

_ambiguous      36642
_ambiguous_readpair_position    0
_empty  1704024
_lowaqual       0
_notaligned     0

I deleted them. This fixed the error.

The second error.

The scripts runs until

[1] "sorted_SE7840_SA145170_S1_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[2] "sorted_SE7840_SA145171_S2_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[3] "sorted_SE7840_SA145172_S3_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[4] "sorted_SE7840_SA145173_S4_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[5] "sorted_SE7840_SA145174_S5_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[6] "sorted_SE7840_SA145175_S6_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[7] "sorted_SE7840_SA145176_S7_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[8] "sorted_SE7840_SA145177_S8_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[1] "Homo_sapiens.GRCh38.106.chr.gff"
         condition    LibType
control1   control paired-end
control2   control paired-end
control3   control paired-end
control4   control paired-end
test1         test paired-end
test2         test paired-end
test3         test paired-end
test4         test paired-end
Error in FUN(X[[i]], ...) : subscript out of bounds
Calls: DEXSeqDataSetFromHTSeq -> sapply -> sapply -> lapply
Execution halted

I searched for "DEXSeq and subscript out of bounds and I found some answers where I should see if the "the number of lines containing "exonic_part" in the third column of the flattened gtf file correspond to the number of lines from the count files?" from here 'DEXSeq: Error in FUN(X[[i]], ...) : subscript out of bounds'

When I look at the gff file with vim. There are 745398 rows and the end of the file looks like this.

Y       dexseq_prepare_annotation.py    exonic_part     26622512        26622608        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "007"
Y       dexseq_prepare_annotation.py    exonic_part     26623586        26623666        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "008"
Y       dexseq_prepare_annotation.py    exonic_part     26627943        26628022        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "009"
Y       dexseq_prepare_annotation.py    exonic_part     26628271        26628437        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "010"
Y       dexseq_prepare_annotation.py    exonic_part     26630647        26630749        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "011"
Y       dexseq_prepare_annotation.py    exonic_part     26633345        26633431        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "012"
Y       dexseq_prepare_annotation.py    exonic_part     26634523        26634652        .       -       .       gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "013"
Y       dexseq_prepare_annotation.py    aggregate_gene  26626520        26627159        .       -       .       gene_id "ENSG00000231514"
Y       dexseq_prepare_annotation.py    exonic_part      26626520       26627159        .       -       .       gene_id "ENSG00000231514"; transcripts "ENST00000435741"; exonic_part_number "001"
Y       dexseq_prepare_annotation.py    aggregate_gene  56855244        56855488        .       +       .       gene_id "ENSG00000235857"
Y       dexseq_prepare_annotation.py    exonic_part     56855244        56855488        .       +       .       gene_id "ENSG00000235857"; transcripts "ENST00000431853"; exonic_part

Then the end of the count files. 68115 rows.

"ENSG00000289711+ENSG00000084072":"037" 1
"ENSG00000289711+ENSG00000084072":"038" 2
"ENSG00000289711+ENSG00000084072":"039" 9
"ENSG00000289711+ENSG00000084072":"040" 9
"ENSG00000289712":"001" 0
"ENSG00000289712":"002" 0
"ENSG00000289713":"001" 0
"ENSG00000289714":"001" 0
"ENSG00000289718":"001" 0
"ENSG00000289718":"002" 0
"ENSG00000289718":"003" 0
"ENSG00000289718":"004" 0
"ENSG00000289718":"005" 0

There is quite a big jump in the GFF file, from 26626520 to 56855244. Also, it isn't clear how I should be checking. as there are quite a few less rows in my count file. Could it be that the GFF file needs to be sorted by name? Is this what is fragging me?

Could this be a memory error?

Any advice or assistance is greatly appreciated. Thank you!

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblas_zenp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.2
DEXSeq • 659 views
ADD COMMENT
0
Entering edit mode

I tried it over again with the sam file directly from the STAR alignment. Same problem. I will sort by coordinate and try it again. I am also going to use a computer with 256Gb of memory and see if that works. Thank you

ADD REPLY
0
Entering edit mode

Ran it on the USC CARC SLURM cluster with 512Gb of RAM and same error.

Error in FUN(X[[i]], ...) : subscript out of bounds
Calls: DEXSeqDataSetFromHTSeq -> sapply -> sapply -> lapply
Execution halted
ADD REPLY
0
Entering edit mode

I tried a different alignment and sorting by name and coordinate. Nada. Is no one using this anymore? What are people using? If I have to I'll go back to Ensembl 100. If it still barfs on me, i'm hosed.

ADD REPLY
0
Entering edit mode

Hi Matthew. I Will try to reproduce the error message. Could you point me to the link where you are getting the raw gtf file?

ADD REPLY
0
Entering edit mode

HI Alejandro! Thank you!! Absolutely. I am getting it from Ensembl, here, http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz I really appreciate your help!!

ADD REPLY
0
Entering edit mode

Hi Matthew. I Will try to reproduce the error message. Could you point me to the link where you are getting the raw gtf file?

ADD REPLY
3
Entering edit mode
Arthur ▴ 30
@93355568
Last seen 5 months ago
Netherlands

I ran into exactly the same problem, firstly removing the tail lines due to getting a number of column error and then getting a out of bounds error once those were removed. I manage to solve it for now, hopefully this will be useful.

Luckily, I had some files from a previous analysis so I could reverse engineer what went wrong. When looking into the exon count files generated by the dexseq_count.py, I noticed that there were some extra quotes, which were not there in the old files:

Old files:

ENSMUSG00000000001.4:001 2317

ENSMUSG00000000001.4:002 325

ENSMUSG00000000001.4:003 254

New files (from dexseq_count.py v 1.40.0):

"ENSMUSG00000000001.4":"001" 4745

"ENSMUSG00000000001.4":"002" 677

"ENSMUSG00000000001.4":"003" 522

"ENSMUSG00000000001.4":"004" 363

After removing the quotes (sed 's/\"//g' $file > $fileName"_clean.txt) directly from the dexseq_count.py output files (so keeping those final lines for ambiguous and other counts), DEXSeqDataSetFromHTSeq() was able to recognize the files and I'm not seeing any of those error anymore.

I hope it helps!

ADD COMMENT
1
Entering edit mode

That's the ticket! Thank you very much!!

ADD REPLY
0
Entering edit mode

Whoa! Cool. I'll try that. I did see the quotes. Thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 454 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6