Atena package - error (new user)
1
0
Entering edit mode
lalanis1 • 0
@828cea77
Last seen 5 weeks ago
United States

Hi.

I am very new to bioinformatics/programming. Running the code for the first time ever using atena package (following their manual). I got the following error. What am I doing wrong? I have attached my full working screen's screen shot for reference.

Thanks a lot.

Lalani

''''r In addition: Warning message: In gzfile(file, "rb") : cannot open compressed file '', probable reason 'Invalid argument'

#problematic code
#TE
bamfiles <- list.files(system.file("extdata", package="atena"), pattern="*.bam", full.names=TRUE)
annotaTEs()
rmsk <- annotaTEs(genome = "hg38", parsefun = rmskbasicparser)
TE_annot <- readRDS(file = system.file("extdata", "Top100TEs.rds", package="atena"))

# results of running the following in an R session
> #TE
> bamfiles <- list.files(system.file("extdata", package="atena"), pattern="*.bam", full.names=TRUE)
> annotaTEs()
snapshotDate(): 2022-04-25
GRanges object with 5633664 ranges and 4 metadata columns:
seqnames        ranges strand |   swScore     repName      repClass
<Rle>     <IRanges>  <Rle> | <integer> <character>   <character>
[1]                chr1   10001-10468      + |       463   (TAACCC)n Simple_repeat
[2]                chr1   15798-15849      + |        18   (TGCTCC)n Simple_repeat
[3]                chr1   16713-16744      + |        18      (TGG)n Simple_repeat
[4]                chr1   18907-19048      + |       239         L2a          LINE
[5]                chr1   19972-20405      + |       994          L3          LINE
...                 ...           ...    ... .       ...         ...           ...
[5633660] chrX_KV766199v1_alt 179150-179234      - |       255    MIR1_Amn          SINE
[5633661] chrX_KV766199v1_alt 184474-184785      - |      2039       AluJb          SINE
[5633662] chrX_KV766199v1_alt 186964-187271      - |       386      MLT1G3           LTR
[5633663] chrX_KV766199v1_alt 187486-187569      - |       270      MLT1G3           LTR
[5633664] chrX_KV766199v1_alt 187597-187822      - |      1301       L1MA8          LINE
repFamily
<character>
[1] Simple_repeat
[2] Simple_repeat
[3] Simple_repeat
[4]            L2
[5]           CR1
...           ...
[5633660]           MIR
[5633661]           Alu
[5633662]     ERVL-MaLR
[5633663]     ERVL-MaLR
[5633664]            L1
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
> rmsk <- annotaTEs(genome = "hg38", parsefun = rmskbasicparser)
snapshotDate(): 2022-04-25
> TE_annot <- readRDS(file = system.file("extdata", "Top100TEs.rds", package="atena"))
Error in gzfile(file, "rb") : cannot open the connection
In gzfile(file, "rb") :
cannot open compressed file '', probable reason 'Invalid argument'

RNAseq R Atena • 950 views
1
Entering edit mode
@25bc498b
Last seen 7 hours ago
Spain

Hi Lalani,

The error message appears because the file Top100TEs.rds cannot be found in the extdata directory of the atena package. This is because the file Top100TEs.rds is not present in the atena package, instead the file used in the vignettes and examples from the manual is named Top28TEs.rds. This file contains only annotations for the top 28 expressed TEs in the Drosophila melanogaster RNA-seq dataset (bam files) used in the vignette of the package. The other annotations included in the package are Top50genes.rds, which contains the top 50 expressed genes in this dataset. To retrieve the TE annotations from this set of 28 TEs from Drosophila melanogaster you can use the following code:

library(atena)
library(GenomicRanges)
TE_annot <- readRDS(file = system.file("extdata", "Top28TEs.rds",
package="atena"))
TE_annot
GRanges object with 28 ranges and 9 metadata columns:
seqnames            ranges strand |      source     feature_type     score     phase     repName
<Rle>         <IRanges>  <Rle> |    <factor>         <factor> <numeric> <integer> <character>
ROO_LTR_174       2L   5827762-5828190      - | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_196       2L   6431563-6431991      - | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_201       2L   6444684-6445112      + | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_226       2L   7350010-7350438      + | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_255       2L   8461159-8461588      + | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
...      ...               ...    ... .         ...              ...       ...       ...         ...
ROO_LTR_11385       3L   5034213-5034641      - | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_619       2L 18060605-18061019      + | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_10524       2R   9871095-9871523      + | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_10666       2R 12797226-12797654      - | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
ROO_LTR_10731       2R 14467348-14467776      + | rtracklayer sequence_feature        NA      <NA>     ROO_LTR
repClass   repFamily          name            ID
<character> <character>   <character>   <character>
ROO_LTR_174         LTR         Pao   ROO_LTR_174   ROO_LTR_174
ROO_LTR_196         LTR         Pao   ROO_LTR_196   ROO_LTR_196
ROO_LTR_201         LTR         Pao   ROO_LTR_201   ROO_LTR_201
ROO_LTR_226         LTR         Pao   ROO_LTR_226   ROO_LTR_226
ROO_LTR_255         LTR         Pao   ROO_LTR_255   ROO_LTR_255
...         ...         ...           ...           ...
ROO_LTR_11385         LTR         Pao ROO_LTR_11385 ROO_LTR_11385
ROO_LTR_619         LTR         Pao   ROO_LTR_619   ROO_LTR_619
ROO_LTR_10524         LTR         Pao ROO_LTR_10524 ROO_LTR_10524
ROO_LTR_10666         LTR         Pao ROO_LTR_10666 ROO_LTR_10666
ROO_LTR_10731         LTR         Pao ROO_LTR_10731 ROO_LTR_10731
-------
seqinfo: 1539 sequences from an unspecified genome; no seqlengths


Or you can retrieve the full TE RepeatMasker annotation (for dm6 version) with:

rmsk <- annotaTEs(genome = "dm6", parsefun = rmskbasicparser)


Hope this helps!

0
Entering edit mode

Editted Hi.

Thank you so much for your help. I am working with human genome. I was able to get the hg38 from repeatmasker using

annotaTEs(genome = "hg38", parsefun = rmskidentity)

However, I am still getting the same error

TE_annot <- readRDS(file = system.file("extdata", "Top100TEs.rds", package="atena")) Error in gzfile(file, "rb") : cannot open the connection In addition: Warning message: In gzfile(file, "rb") : cannot open compressed file '', probable reason 'Invalid argument'

I get it that my desired .rds is not present in atena package. Then what should be done to read top 100 TEs in human genome using atena?

Thanks.

1
Entering edit mode

Hi,

To quantify TE expression with the atena package you do not need an object with annotations for the top "x" TEs (what would correspond to the TE_annot object in the vignette). Maybe the vignette from atena may lead to confusion when we use annotations for only the top 28 expressed TEs for that specific dataset:

TE_annot <- readRDS(file = system.file("extdata", "Top28TEs.rds",
package="atena"))


we only use a subset of the annotations in the vignette to minimize the computational cost and memory requirements for building the vignette. These 28 TEs were selected after quantifying all TEs in RepeatMasker annotations and, then, selecting those top 28 expressed TEs.

To quantify TE expression with the atena package, you need the TE annotations (you do not need the annotations of the top expressed TEs), which can be retrieved as you did using the command:

TE_annot_rmsk <- annotaTEs(genome = "hg38", parsefun = rmskidentity)


Then, you have to build a parameter object (using the previously obtained TE annotations) and, finally, quantify TE expression with the qtex() function. This will result in a RangedSummarizedExperiment object with the expression levels of all TEs, which you can, then, use to identify the most expressed TEs in your samples (you can use it to identify the top 100 TEs in your samples, for example).

In summary, you cannot use atena to directly read the top 100 TEs in your dataset, you first need to quantify expression of all TEs and then identify the top expressed TEs.

0
Entering edit mode

Thank you so much for your response. I will try to build parameter object (although I don't know what it is and how to do it at all but will work on it for sure). Hope I can do this.

I really appreciate your help and your patience with a person trying to use R and coding for the first time ever.

Best regards, Lalani

0
Entering edit mode

Hi.

I tried to build parameter object. I am having difficulty in doing so. Will it be possible for you to share an example of some other organism (if not human) so I can identify what parts of codes are required to be changed. This way, I will be able to relate it better and work to build parameter object for human genome.

Thanks.

This is what i was able to do using git hub code of yours. Not sure if it is ok at all. Please guide.

###what should be path here for read.table? Do i have to download rmsk_hg38? if no, how to go about it####

/genomics/users/bea/HERVs/drosophila_atena/Dmelanogaster_rmsk_dm6.txt",
"/D:/Users/Public/Documents/Lalani – R/STAR bam files renamed PG only/GRCh38_Ensembl_rmsk_TE.gtf.gz")
colnames(chr) <- c("UCSC","Ensembl")
ann <- merge(ann, chr, by.y = "UCSC", by.x = "genoName")

repClass <- c(“Simple_repeat”, “SINE”, "RC", "LINE", "LTR", "DNA", "Unknown", "Other", "RNA")
ann <- ann[ann$repClass %in% repClass,] # Creating a GRanges object with the annotations suppressPackageStartupMessages(library(GenomicRanges)) ann_gr <- GRanges(seqnames = Rle(ann$Ensembl),
ranges = IRanges(start = ann$genoStart, end = ann$genoEnd),
strand = Rle(ann$strand), mcols = DataFrame(ann[,11:13], name = paste(ann$repName, 1:nrow(ann), sep = "_")))
names(ann_gr) <- mcols(ann_gr)$mcols.name colnames(mcols(ann_gr)) <- gsub(colnames(mcols(ann_gr)), pattern = "mcols.", replacement = "") rtracklayer::export.gff2(ann_gr, con = "/D:/Users/Public/Documents/Lalani – R/STAR bam files renamed PG only/GRCh38_Ensembl_rmsk_TE.gtf.gz")  #Finding which TEs are the most expressed. Also finding which reads from infected map to these most expressed TEs: library(Matrix) library(Rsamtools) library(GenomicAlignments) library(rtracklayer) #' @importFrom S4Vectors nLnode nRnode isSorted from to Hits .appendHits <- function(hits1, hits2) { stopifnot(nRnode(hits1) == nRnode(hits2)) stopifnot(isSorted(from(hits1)) == isSorted(from(hits2))) hits <- c(Hits(from=from(hits1), to=to(hits1), nLnode=nLnode(hits1)+nLnode(hits2), nRnode=nRnode(hits1), sort.by.query=isSorted(from(hits1))), Hits(from=from(hits2)+nLnode(hits1), to=to(hits2), nLnode=nLnode(hits1)+nLnode(hits2), nRnode=nRnode(hits2), sort.by.query=isSorted(from(hits2)))) hits } ###what should be import path here?#### te_hg38 <- import( "/genomics/users/bea/HERVs/drosophila_atena/Dmelanogaster_rmsk_dm6.gtf") names(te_hg38) <- mcols(te_hg38)$ID

bampath <- "/D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only"
bamfiles <- c("/D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only, “pattern="C*.bam", full.names=TRUE",
"/ D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only, pattern="I*.bam", full.names=TRUE ")

bfl <- BamFileList(bamfiles, yieldSize = 10000000, asMates = FALSE)
sbflags <- scanBamFlag(isUnmappedQuery=FALSE,
isDuplicate=FALSE,
isNotPassingQualityControls=FALSE)
param <- ScanBamParam(flag=sbflags)

i <- 2
bf <- bfl[[i]]
ov <- Hits(nLnode=0, nRnode=length(te_hg38), sort.by.query=TRUE)
l <- 0
open(bf)
list(param=param),
list(use.names=TRUE))))) {
l <- l+1
print(l)
ov <- .appendHits(ov, thisov)
}
close(bf)

.buildOvAlignmentsMatrix <- function(ov, arids, rids, fidx) {
oamat <- Matrix(FALSE, nrow=length(rids), ncol=length(fidx))
mt1 <- match(arids[queryHits(ov)], rids)
mt2 <- match(subjectHits(ov), fidx)
oamat[cbind(mt1, mt2)] <- TRUE

oamat
}

saveRDS(ov, file = "/D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only_ovAllTEs.rds")
tx_idx <- sort(unique(subjectHits(ov)))
file = "/ D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only_topreads.txt",
quote = FALSE, row.names = FALSE, col.names = FALSE)

ovalnmat_topreads <- ovalnmat[order(rowSums(ovalnmat), decreasing = TRUE)[1:175],]


#Finding which genes are the most expressed:

suppressPackageStartupMessages(library(GenomicAlignments))
suppressPackageStartupMessages(library(rtracklayer))

# BAM files
bamfiles <- c("/D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only, “pattern="C*.bam", full.names=TRUE", "/ D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only, pattern="I*.bam", full.names=TRUE ")

bfl <- BamFileList(bamfiles, yieldSize = 100000, asMates = FALSE)
sbflags <- scanBamFlag(isUnmappedQuery=FALSE,
isDuplicate=FALSE,
isNotPassingQualityControls=FALSE)
param <- ScanBamParam(flag=sbflags)

# Annotations
gene_hg38 <- import(
"/D:/Users/Public/Documents/Lalani – R/STAR bam files renamed PG only/GRCh38_Ensembl_rmsk_TE.gtf.gz")
gene_hg38 <- gene_hg38[!(gene_hg38$gene_biotype == "transposable_element")] gene_hg38 <- gene_hg38[gene_hg38$type == "gene"]
names(gene_hg38) <- gene_hg38\$gene_id

# Counting
overlap_all <- summarizeOverlaps(gene_hg38, bfl, singleEnd=TRUE,
param = param, ignore.strand = FALSE)
saveRDS(overlap_all, file = "/ D:/Users/Public/Documents/Lalani - R/STAR bam files renamed PG only/gene_counts.rds")

1
Entering edit mode

Hi Lalani,

Looking at the code from your last comment I think you are using as a guide the code found in inst/scripts/create-data_OhtaniSaito13.Rmd, is that correct? If so, I am afraid that this code is not intended to serve as a guide or pipeline to carry out analyses with the atena package, instead, this file create-data_OhtaniSaito13.Rmd is only meant as a documentation on how the files used for the examples and the vignette (found in inst/extdata) were created.

To use the atena package to quantify TE expression, the code is much simple:

First, you can retrieve TE annotations using the annotaTEs() function available in the atena package. For instance if you want to quantify TE expression in .bam files from human aligned to hg38, you can:

library(atena)
rmsk <- annotaTEs(genome = "hg38", parsefun = rmskbasicparser)


where the genome= argument can be used to specify which genome version you want. For more information on this function you can check the manual.

If instead you have already downloaded TE annotations and want to use a specific file you can read that file (for example using the rtracklayer package) and create a GRanges or GRangesList object.

Second, you build the parameter object for one of the quantification methods available. For example, for the Telescope method:

# vector with bam files names
bamfiles <- list.files("path_to_your_directory_with_bam_files", pattern="*.bam", full.names=TRUE)

# building parameter object
tspar <- TelescopeParam(bfl=bamfiles, teFeatures= rmsk)


Here, in TelescopeParam() you need to define some arguments (singleEnd, ignoreStrand, etc.) depending on the type of .bam files you are using (if they are paired-end, stranded, etc.), you can read the manual to know more.

Finally, you quantify TE expression using qtex():

tsq <- qtex(tspar)


This will output a RangedSummarizedExperiment with TE expression levels, you can access them by:

assay(tsq)


and then you can use these expression levels to identify the top expressed TEs, or do differential expression analysis, etc.

0
Entering edit mode

Thanks a lot for explaining it in detail.

I have run it now. Hope it goes well. Will update here if anything goes wrong.

Thanks again.

Regards, Lalani

0
Entering edit mode

Hi.

We ran the package with some trials and errors. TE transcripts and ERVmap ran fine. However, we are getting the below error for telescope. We were wondering where should we make changes in the code to attend the recommended suggestions if you could kindly assist in this regard.

Many thanks.

0
Entering edit mode

Hi,

could you show us the result of calling:

traceback()


and

sessionInfo()


right after you get the error?

thanks!

0
Entering edit mode

Hi

Sorry for the delay. Here are the requested results.

Thanks.

0
Entering edit mode

It seems the problem might be related to something in the BAM file that is not properly handled by the software. Would it be possible for you to identify the BAM file that causes the error, by running the software on just that BAM file and sharing that BAM file privately? (you can reach me by email at robert.castelo@upf.edu).

0
Entering edit mode

Sure. Will try to rule out and get back to you.

It is strange though as same bam files have worked for other two (ERVmap and TEtranscripts). Would that mean that the results for those two might not be accurate as well?

0
Entering edit mode

Not necessarily, in our own benchmarking (not yet published) the R implementation of the three methods reproduce very reliably their original implementation, so for us the question is whether the BAM file that causes this error has features that make it a cornercase outside our benchmarking, where we should correct something to address that problem, particularly in Telescope.

0
Entering edit mode

Thanks for your response. I have emailed you. I will need to check how I can share the bam file. Will get back to you on this.

Thanks again!

0
Entering edit mode

Hi Lalani,

We have fixed the bug that lead to the error. The new version of the package (with the bug fixed) will be available in 24-48h. To use it, you will need to re-install the package:

BiocManager::install("atena")


The error appeared due to the presence of not properly paired reads (e.g. with mates in different chromosomes). Thanks for bringing up this error, it has helped us to identify slight differences in the handling of this type of reads with respect to the algorithm used by the original Telescope method. We are still working on it to completely reproduce Telescope’s approach for not properly paired reads. However, this new version of the package provides a close implementation (in fact, since not properly paired reads make a very small fraction of the total number of reads, results should be very similar).

0
Entering edit mode

Thanks a lot, Calvo.

Will run my files with revised version soon!

Best regards, Lalani