Question: Convertion of indexed bam file to tsv file format
0
gravatar for malikbashirkhan2
3.9 years ago by
Pakistan
malikbashirkhan20 wrote:

I followed the following steps...

 

bowtie2-build Danio_rerio.Zv9.ncrna.fa zebra_virus

bowtie2 -p 4 -q --phred33 -x zebra_virus -1 file1.fastq -2 file2.fastq -S 0.sam

samtools view -bS 0.sam > 0.bam

samtools sort 0.bam 0_sorted

samtools index 0_sorted.bam

 

Now we have sorted indexed file. i want to convert this bam file into tsv file.

 

samtools python • 1.9k views
ADD COMMENTlink modified 3.9 years ago by Hervé Pagès ♦♦ 14k • written 3.9 years ago by malikbashirkhan20

sir my problem is that i have .BAI file and i wants to convert this into TSV file format.

for this we need a python script and i have that python script.

now plz sir help me how to convert this file into tsv.

ADD REPLYlink written 3.9 years ago by malikbashirkhan20
2

Like Aaron told you, this site is for R bioconductor support, not for analysis with Python or Bash. If you need help with your Python scripts or with Bash use other sites such as https://www.biostars.org/

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by b.nota330
Answer: Convertion of indexed bam file to tsv file format
0
gravatar for Aaron Lun
3.9 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

Depending on what fields you want to extract, you could do:

require(Rsamtools)
all.fields <- scanBam("0_sorted.bam")[[1]]

Running names(all.fields) should give you:

 [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"   "cigar"
 [9] "mrnm"   "mpos"   "isize"  "seq"    "qual"  

You can then put any number of these into a *.tsv file, e.g.:

write.table(file="out.tsv", data.frame(all.fields[c("qname", "flag", "rname", "pos")]),
    quote=FALSE, sep="\t", row.names=FALSE)

Be aware that some of the fields may need coercion to a simpler data type, e.g., as.character(all.fields$seq). Also, if the file is large, you should look at restricting the number of fields you extract with the what argument in ScanBamParam; or, limiting yourself to extracting only one chromosome at a time with the which argument.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Aaron Lun24k

Thanks for the detailed reply i will try it

ADD REPLYlink written 3.9 years ago by malikbashirkhan20

i tried but it gives syntax error

bash: syntax error near unexpected token `('

 

ADD REPLYlink written 3.9 years ago by malikbashirkhan20
1

This is the Bioconductor support site, and these are R commands. You should be running them at the R prompt, after installing Rsamtools. If you want a non-R answer, then this forum is not the right place to ask.

ADD REPLYlink written 3.9 years ago by Aaron Lun24k

sir i installed the Rsamtools Package again and i had run them at the R prompt.

At R prompt it gives this error.

bash: syntax error near unexpected token `('

ADD REPLYlink written 3.9 years ago by malikbashirkhan20

sir i tried these commands in another way it runs but it hangs my system for unlimited time then i restarts my system. i tried it multiple times but it hangs the system all the time.

plz sir help me what to do.

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by malikbashirkhan20

bioinformatics@bioinformatics-desktop:~/bowtie2/SRR630463$ R

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> require(Rsamtools)
Loading required package: Rsamtools
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist, unsplit

Creating a generic function for ‘nchar’ from package ‘base’ in package ‘S4Vectors’
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: XVector
Loading required package: Biostrings
> all.fields <- scanBam("0_sorted.bam")[[1]]

after this the system will become hang and it will become hang for unlimited time

ADD REPLYlink written 3.9 years ago by malikbashirkhan20

Your machine probably doesn't have enough memory to read the entire BAM file in at once. Try using that what or which arguments as I've described in my original answer, to restrict the amount of data it loads in at a time.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Aaron Lun24k

sir i followed your instruction and limited the parameters and perform the following steps

p2 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth"))

then run this command

res2 <- scanBam("0_sorted.bam", param=p2)

then i wrote this command

write.table(file="out.tsv",data.frame(c("rname","strand", "pos","qwidth")),quote=FALSE,sep="\t", row.names=FALSE)

but the out.tsv has no values..correct me if i did something wrong

 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by malikbashirkhan20

actually we want to tabulate the count of genes ..which parameter will do it ?

ADD REPLYlink written 3.8 years ago by malikbashirkhan20
1

AFAIK the "count of genes" is never in the BAM/SAM file, even after you've sorted/indexed the file. So you're on the wrong path if you expect to get this information by converting the file to tsv. More generally speaking opening a BAM/SAM file in Excel and perform analysis on it won't get you anywhere, trust me.

To get the read counts for each gene you can use the featureCounts() function from the Rsubread package as Aaron suggested, or the summarizeOverlaps() function from the GenomicAlignments package, or the HTSeq tool (Python package) from Simon Anders:

http://www-huber.embl.de/HTSeq/doc/overview.html

If you have further questions about read counting, please start a new thread and be clear and concise about what you're trying to do and also tell us what you've done so far.

Thanks,

H.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Hervé Pagès ♦♦ 14k

If you want to get read counts for each gene, I'd suggest using the featureCounts function in the Rsubread package. This function can be directly applied to the SAM/BAM files, so you don't need to do any conversion. It's also a tried-and-tested method that is preferable to manually writing your own code.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Aaron Lun24k

Read my original post, and you'll notice that you've made an error in your command.

ADD REPLYlink written 3.8 years ago by Aaron Lun24k

Sir this is the output of feature count.How to tabulate read count through it,,Sir actually there is no concerned Professor here to help us out so may be our questions are not explained properly .kindly let me know if you didnt get what i said.i need those read count in a table form which is further used in DSEQ

 [ reached getOption("max.print") -- omitted 9635 rows ]

$targets
[1] "X0_sorted.bam"

$stat
                       Status X0_sorted.bam
1                    Assigned             0
2        Unassigned_Ambiguity             0
3     Unassigned_MultiMapping             0
4       Unassigned_NoFeatures        531165
5         Unassigned_Unmapped      30268995
6   Unassigned_MappingQuality             0
7  Unassigned_FragementLength             0
8          Unassigned_Chimera             0
9        Unassigned_Secondary             0
10     Unassigned_Nonjunction             0
11       Unassigned_Duplicate             0

 

ADD REPLYlink written 3.8 years ago by malikbashirkhan20

You have zero reads 'Assigned'.

If you provide your code, maybe someone can see why you have zero reads assigned.

ADD REPLYlink written 3.8 years ago by b.nota330

i run the following commands

ann <- data.frame(

+ GeneID=c("gene1","gene1","gene2","gene2"),

+ Chr="chr_dummy",

+ Start=c(100,1000,3000,5000),

+ End=c(500,1800,4000,5500),

+ Strand=c("+","+","-","-"),

+ stringsAsFactors=FALSE)

> fc_SE <- featureCounts("0_sorted.bam",annot.ext=ann)

> fc_SE

ADD REPLYlink written 3.8 years ago by malikbashirkhan20

With this code you summarized the reads which are mapped to 4 genes named "gene1" (2 x) and "gene2" (2 x) which are located on chromosome "chr_dummy".

Why do you make this data.frame called ann?

You need to get the right gtf file from UCSC or ENSEMBL instead of your "dummy" ann data.frame.

ADD REPLYlink written 3.8 years ago by b.nota330

It looks like the vast majority of your reads are unmapped. You should check that you've aligned your reads to the correct genome build. If you haven't, then you should realign your reads - see the Rsubread documentation for more details. However, if the genome is nominally correct, then the only conclusion I can make is that you've got contaminating DNA in your sample. This is indicative of poor-quality data that cannot be rescued at the analysis stage - the best you could do is to map the foreign DNA, but that's not of interest.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Aaron Lun24k
Answer: Convertion of indexed bam file to tsv file format
0
gravatar for Hervé Pagès
3.9 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi,

Maybe I'm missing something but isn't SAM a TAB-delimited text format? So converting BAM to tsv should just be a matter of going back to SAM:

samtools view -h -o 0_sorted.sam 0_sorted.bam

Right? Or is there something else to it that you didn't tell us?

Cheers,

H.

PS: I found the answer to "How Can I Convert Bam To Sam" on Biostars: https://www.biostars.org/p/1701/

ADD COMMENTlink written 3.9 years ago by Hervé Pagès ♦♦ 14k
1

Just as a heads-up, the original poster has posted the same question repeatedly on both Biostars and Bioconductor and has been told what you have just told him several times, see:

  How to convert a bam file into tsv file format?

  https://www.biostars.org/p/160176/

  https://www.biostars.org/p/160012/

and also

  How to convert bia file to tsv file

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Gordon Smyth38k

sir we have tha sam file but we need tsv file so that we can open it using excel and perform analysis on it ..

ADD REPLYlink written 3.8 years ago by malikbashirkhan20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 235 users visited in the last hour