Question: makeOrgPackage in AnnotationForge "GID" problem
3 months ago, samanthapizi wrote:

I am trying to make a custom package for Acidovorax citrulli.

The two files I am using are:

> head(aac_info2)
      GID  LOCUSTAG                                      GENENAME  PROTEIN_ID
1 4666313 Aave_0001    chromosomal replication initiation protein YP_968393.1
2 4666219 Aave_0002               DNA polymerase III subunit beta YP_968394.1
3 4666217 Aave_0003                          DNA gyrase subunit B YP_968395.1
4 4666220 Aave_0004            putative transcriptional regulator YP_968396.1
5 4666222 Aave_0005 putative type I restriction enzyme, R subunit YP_968397.1
6 4666226 Aave_0007                   DNA polymerase subunit beta YP_968398.1


> head(go_aac2)
      GID         GO EVIDENCE
1 4666313 GO:0005737      IEA
2 4666219 GO:0005737      IEA
3 4666217 GO:0005694      IEA
4 4666220 GO:0005524      IEA
5 4666222 GO:0000166      IEA
6 4666226 GO:0016779      IEA


But I get this error: "The 1st column must always be the gene ID 'GID'"

> makeOrgPackage(gene_info=aac_info2,go=go_aac2,version="0.1",maintainer="chen",author="chen",ourputDir=".",tax_id="397945",genus="Acidovorax",species="citrulli",goTable="go")

Error in .makeOrgPackage(data, version = version, maintainer = maintainer, : The 1st column must always be the gene ID 'GID'

Actually, when I first tried this, I used the locus_tag as my GID, and it gave me the same error. Then I found the Gene IDs and put them in the first column, but I got the same thing. What is wrong?

My second question: my GO file has many more lines than my annotation file, because one gene can have multiple GO IDs. Can this work?



R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)

 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                  
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915    
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915   
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C                     
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C                
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C           

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] AnnotationForge_1.14.2 AnnotationDbi_1.34.4   IRanges_2.6.1         
[4] S4Vectors_0.10.2       Biobase_2.32.0         BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
[1] DBI_0.4-1     RSQLite_1.0.0 XML_3.98-1.4 


3 months ago, James W. MacDonald (United States) wrote:

The first argument for makeOrgPackage is the ellipsis argument (...). This means that any argument that doesn't exactly match one of the named arguments will be 'sucked up' by the ellipsis and processed as if it were a data.frame containing your data.

This matters because one of your arguments is ourputDir=".", which doesn't match any of the named arguments (you meant outputDir). Since it doesn't match a named argument exactly, R tries to process it as if it were a data.frame, and, well, you see the result. Fixing that typo should set things right.
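
To illustrate the mechanism, here is a minimal sketch using a hypothetical stand-in function, make_pkg(), which is not part of AnnotationForge. It only assumes what is described above: the ellipsis comes first, so a misspelled argument name is not rejected but collected as data.

```r
# Hypothetical stand-in for makeOrgPackage(): `...` is the first formal, so
# every argument after it must be matched by its exact name.
make_pkg <- function(..., outputDir = ".") {
  data <- list(...)   # everything that did not match a named argument
  names(data)         # report what was collected as "data"
}

gene_info <- data.frame(GID = 1:2, SYMBOL = c("a", "b"))

ok   <- make_pkg(gene_info = gene_info, outputDir = ".")
typo <- make_pkg(gene_info = gene_info, ourputDir = ".")

ok    # "gene_info"
typo  # "gene_info" "ourputDir": the misspelled argument is treated as data
```

In the second call the string "." ends up in the list of supposed data.frames, which is why the check on the first column fails.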

As to your second question, that's to be expected, and shouldn't pose a problem.
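
If you want to sanity-check a one-to-many GO table before building, something like the following base R sketch may help. The IDs and GO assignments are made up for illustration, and the checks are my own suggestions rather than anything makeOrgPackage is confirmed to require.

```r
# Toy tables (made-up IDs): one row per gene, several GO rows per gene.
gene_info <- data.frame(GID = c(1, 2, 3),
                        LOCUSTAG = c("g1", "g2", "g3"),
                        stringsAsFactors = FALSE)
go <- data.frame(GID = c(1, 1, 2, 3, 3, 3),
                 GO = c("GO:0005737", "GO:0005524", "GO:0005694",
                        "GO:0000166", "GO:0016779", "GO:0005524"),
                 EVIDENCE = "IEA",
                 stringsAsFactors = FALSE)

rows_per_gene <- table(go$GID)             # number of GO rows for each GID
orphans <- setdiff(go$GID, gene_info$GID)  # GIDs in go that are missing from gene_info
n_dup <- sum(duplicated(go))               # count of fully identical rows
```

Here the GO table has twice as many rows as the gene table, every GID in it also appears in gene_info, and no row is repeated verbatim.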


I fixed that typo, sorry for making silly mistakes. But now I get another error:

> makeOrgPackage(gene_info=aac_info2,go=go_aac2,version="0.1",maintainer="chen<>",author="chen<>",outputDir=".",tax_id="397945",genus="Acidovorax",species="citrulli",goTable="go")
Error in structure(res, levels = lv, names = nm, class = "factor") : 
  'names' attribute [16058] must be the same length as the vector [2]

This is what makes me wonder about the length of the go file. What caused this error? 

Reply written 3 months ago by samanthapizi (modified 3 months ago)

I don't know. It's not obvious from the error message. What do you get if you run traceback() right after you get the error?

Reply written 3 months ago by James W. MacDonald

> traceback()
6: structure(res, levels = lv, names = nm, class = "factor")
5: unlist(unname(lapply(data, "[", "GID")))
4: unique(unlist(unname(lapply(data, "[", "GID"))))
3: makeOrgDbFromDataFrames(data, tax_id, genus, species, dbFileName, 
2: .makeOrgPackage(data, version = version, maintainer = maintainer, 
       author = author, outputDir = outputDir, tax_id = tax_id, 
       genus = genus, species = species, goTable = goTable, verbose = verbose)
1: makeOrgPackage(gene_info = aac_info2, go = go_aac2, version = "0.1", 
       maintainer = "chen<>", author = "chen<>", 
       outputDir = ".", tax_id = "397945", genus = "Acidovorax", 
       species = "citrulli", goTable = "go")


Reply written 3 months ago by samanthapizi

The problem is that your GIDs are factors rather than numeric, which implies that you have either done something weird when you read those in, or you have some GIDs that R is somehow interpreting as character, which causes it to convert to factor.

In other words, if you read something into R, and you have a column that appears to contain text, R will by default (in versions before 4.0.0, such as the 3.3.0 used here) convert that column to a factor. As an example:

> df <- data.frame(first = c(1:5, "a"), second = 1:6)
> df
  first second
1     1      1
2     2      2
3     3      3
4     4      4
5     5      5
6     a      6
> df$first
[1] 1 2 3 4 5 a
Levels: 1 2 3 4 5 a
> df$second
[1] 1 2 3 4 5 6

So if your GID column is all numbers, R will read it in as numbers. But if there are some things in that column that look like strings, the column will be read in as a character vector and then converted to a factor. And this will blow up, giving the error you are seeing:

> df1 <- data.frame(GID = letters, LOCUS = letters)
> df2 <- data.frame(GID = c(letters,LETTERS), GO = 1:52)
> lst <- list(df1,df2)
> unique(unlist(unname(lapply(lst, "[", "GID"))))
Error in structure(res, levels = lv, names = nm, class = "factor") :
  'names' attribute [78] must be the same length as the vector [2]

You could read in using stringsAsFactors = FALSE, and that will work:

> df1 <- data.frame(GID = letters, LOCUS = letters, stringsAsFactors = FALSE)
> df2 <- data.frame(GID = c(letters,LETTERS), GO = 1:52, stringsAsFactors = FALSE)
> lst <- list(df1,df2)
> unique(unlist(unname(lapply(lst, "[", "GID"))))
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L"
[39] "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
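
The same idea applies when reading from a file. This is a sketch only: the tab-delimited text below is a made-up stand-in for your real file, and colClasses is an extra option I am suggesting, not something used earlier in this thread.

```r
# Made-up, tab-delimited stand-in for the real annotation file
txt <- "GID\tLOCUSTAG\n4666313\tAave_0001\n4666219\tAave_0002\n"

# stringsAsFactors = FALSE keeps text columns as character vectors
info <- read.delim(text = txt, stringsAsFactors = FALSE)
classes <- sapply(info, class)
classes   # GID is integer, LOCUSTAG is character

# colClasses can pin the type of GID regardless of what the file contains
info_chr <- read.delim(text = txt, stringsAsFactors = FALSE,
                       colClasses = c(GID = "character"))
class(info_chr$GID)   # "character"
```

With a real file you would pass its path as the first argument instead of text = txt.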

But all those GIDs should be numeric, so if I were you, I would track down the non-numeric looking things and figure out what's up wit dat.
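
One way to track them down, sketched on a made-up factor (the stray "Aave_0003" entry is only an assumption about what such a bad value might look like):

```r
# Made-up GID column that was read in as a factor because of one stray entry
gid <- factor(c("4666313", "4666219", "Aave_0003", "4666220"))

# Convert to character first; as.numeric() on a factor returns the internal
# level codes, not the original numbers.
gid_chr <- as.character(gid)

bad <- !grepl("^[0-9]+$", gid_chr)   # TRUE where the value is not all digits
which(bad)                           # 3
gid_chr[bad]                         # "Aave_0003"

# Once the offending rows are fixed or dropped, the conversion is safe
gid_num <- as.numeric(gid_chr[!bad])
```

Running the same grepl() check on the first column of each of your data.frames should show which rows are causing the column to be read as text.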

Reply written 3 months ago by James W. MacDonald

Ahhhh I see! I finally got it made. Thank you very much!

Reply written 3 months ago by samanthapizi