Bioconductor has extensive facilities for mapping between microarray probe, gene, pathway, gene ontology, homology and other annotations.
Bioconductor has built-in representations of GO, KEGG, vendor, and other annotations, and can easily access NCBI, Biomart, UCSC, and other sources.
The organism-wide, gene-centered packages (OrgDb packages) each contain gene-centered data for an organism. These packages are the primary place for storing data that can be directly associated with genes. Let's take a closer look at the organism package for human:
library(org.Hs.eg.db)
Once loaded, each OrgDb object can be accessed using the following four methods:
To list the kinds of things that can be retrieved, use the columns method.
columns(org.Hs.eg.db)
## [1] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [5] "ACCNUM" "ALIAS" "CHR" "CHRLOC"
## [9] "CHRLOCEND" "ENZYME" "MAP" "PATH"
## [13] "PMID" "REFSEQ" "SYMBOL" "UNIGENE"
## [17] "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME"
## [21] "UNIPROT" "GO" "EVIDENCE" "ONTOLOGY"
## [25] "GOALL" "EVIDENCEALL" "ONTOLOGYALL" "OMIM"
## [29] "UCSCKG"
To list the kinds of things that can be used as keys, use the keytypes method.
keytypes(org.Hs.eg.db)
## [1] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [5] "ACCNUM" "ALIAS" "ENZYME" "MAP"
## [9] "PATH" "PMID" "REFSEQ" "SYMBOL"
## [13] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [17] "GENENAME" "UNIPROT" "GO" "EVIDENCE"
## [21] "ONTOLOGY" "GOALL" "EVIDENCEALL" "ONTOLOGYALL"
## [25] "OMIM" "UCSCKG"
And to extract viable keys of a particular kind (keytype), we can use the keys method.
head(keys(org.Hs.eg.db, keytype="ENTREZID"))
## [1] "1" "2" "3" "9" "10" "11"
Since the keys method can tell us specific things that can be used as keys, here we will use it to extract a few ids to use for demonstrating the fourth method type.
ids = head(keys(org.Hs.eg.db, keytype="ENTREZID"))
Once you have some ids that you want to look up data for, the select method allows you to map these ids as long as you use the columns argument to indicate what you need to know and the keytype argument to specify what kind of keys they are.
select(org.Hs.eg.db, keys=ids, columns="SYMBOL", keytype="ENTREZID")
## ENTREZID SYMBOL
## 1 1 A1BG
## 2 2 A2M
## 3 3 A2MP1
## 4 9 NAT1
## 5 10 NAT2
## 6 11 NATP
And since the columns argument can take a vector of valid columns, you can look up multiple things at once.
select(org.Hs.eg.db, keys=ids, columns=c("GENENAME", "SYMBOL"), keytype="ENTREZID")
## ENTREZID GENENAME SYMBOL
## 1 1 alpha-1-B glycoprotein A1BG
## 2 2 alpha-2-macroglobulin A2M
## 3 3 alpha-2-macroglobulin pseudogene 1 A2MP1
## 4 9 N-acetyltransferase 1 (arylamine N-acetyltransferase) NAT1
## 5 10 N-acetyltransferase 2 (arylamine N-acetyltransferase) NAT2
## 6 11 N-acetyltransferase pseudogene NATP
But where would we normally get the “ids” that we would pass in to the keys argument? Usually these kinds of ids come from the result of a data analysis. Here I will load an experiment data package to provide us with an example data set to look at:
library("parathyroidSE")
data(exonicParts)
exonicParts[1:3]
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] X [99883667, 99884983] - | ENSG00000000003
## [2] X [99885756, 99885863] - | ENSG00000000003
## [3] X [99887482, 99887537] - | ENSG00000000003
## tx_name exonic_part
## <CharacterList> <integer>
## [1] ENST00000373020 1
## [2] ENST00000373020 2
## [3] ENST00000373020 3
## -------
## seqinfo: 580 sequences (1 circular) from an unspecified genome
Having loaded this data set, I can extract the Ensembl gene IDs contained in this object like this:
ids = unlist(mcols(exonicParts)$gene_id)
head(ids)
## [1] "ENSG00000000003" "ENSG00000000003" "ENSG00000000003" "ENSG00000000003"
## [5] "ENSG00000000003" "ENSG00000000003"
From here I can look up gene symbols for these ids by using the select method like this. Notice how I have to specify the correct (and different from before) keytype in order to extract these:
res <- select(org.Hs.eg.db, keys=ids, columns="SYMBOL", keytype="ENSEMBL")
head(res)
## ENSEMBL SYMBOL
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000003 TSPAN6
## 3 ENSG00000000003 TSPAN6
## 4 ENSG00000000003 TSPAN6
## 5 ENSG00000000003 TSPAN6
## 6 ENSG00000000003 TSPAN6
And if I am careful to make sure that no many to one relationships have changed the size of the result data.frame, then I can even put these gene symbols back into the initial object:
dim(res)[1] == length(exonicParts)
## [1] TRUE
newMcols <- cbind(mcols(exonicParts), res[,2,drop=FALSE])
mcols(exonicParts) <- newMcols
exonicParts[1:3]
## GRanges object with 3 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] X [99883667, 99884983] - | ENSG00000000003
## [2] X [99885756, 99885863] - | ENSG00000000003
## [3] X [99887482, 99887537] - | ENSG00000000003
## tx_name exonic_part SYMBOL
## <CharacterList> <integer> <character>
## [1] ENST00000373020 1 TSPAN6
## [2] ENST00000373020 2 TSPAN6
## [3] ENST00000373020 3 TSPAN6
## -------
## seqinfo: 580 sequences (1 circular) from an unspecified genome
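The size check above only confirms that the row counts agree. A more defensive pattern is to align the looked-up values to the input ids explicitly with match(). The sketch below uses base R and made-up toy data: the ids and the lookup table are hypothetical placeholders standing in for the Ensembl ids and a select() result, not real query output.

```r
# Toy stand-ins: 'ids' plays the role of the Ensembl ids pulled from the
# ranges; 'lookup' plays the role of a select() result (hypothetical values).
ids <- c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C", "ENSG_B")
lookup <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  SYMBOL  = c("GeneA", "GeneB", "GeneB-alt", "GeneC"),
  stringsAsFactors = FALSE
)

# Keep only the first row per key so every id maps to exactly one symbol.
lookup1 <- lookup[!duplicated(lookup$ENSEMBL), ]

# match() gives, for each id, the row of lookup1 that holds its key, so the
# result has one element per input id, in input order.
symbols <- lookup1$SYMBOL[match(ids, lookup1$ENSEMBL)]

stopifnot(length(symbols) == length(ids))
symbols
```

Because the result is built from the input ids rather than from the row order of the lookup table, it stays the right length even when a key has several matches.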
Of course we can look up many things other than just gene names and symbols. For example, we could also extract the GO ids associated with the first id like this:
id = ids[1]
res <- select(org.Hs.eg.db, keys=id, columns="GO", keytype="ENSEMBL")
head(res)
## ENSEMBL GO EVIDENCE ONTOLOGY
## 1 ENSG00000000003 GO:0004871 IMP MF
## 2 ENSG00000000003 GO:0005515 IPI MF
## 3 ENSG00000000003 GO:0007165 IMP BP
## 4 ENSG00000000003 GO:0016021 IEA CC
## 5 ENSG00000000003 GO:0039532 IMP BP
## 6 ENSG00000000003 GO:0043123 IMP BP
You may have noticed that the above request returns many rows for just one input id (and that a warning was issued about this). Some columns force select to return multiple values for each key that you pass in, because of the structure of the underlying data. This kind of data is sometimes said to have a many to one relationship, because many things can match each key. Since the return value of select is a data.frame, select() returns one row per match, and it issues a warning because this behavior might not be what you were expecting. If you request several many to one relationships at once, the returned rows multiply, as each row represents a unique combination of the values that you asked for. This is not recommended, as you can very quickly generate a data.frame that is both very large and not very useful. For best results, use select carefully, and avoid requesting more than one many to one value at a time.
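To see why the rows multiply, here is a small base R sketch. It does not call select() at all; the two toy tables and their values are hypothetical placeholders for two many to one columns of the same gene, and merge() stands in for the join that happens behind the scenes.

```r
# Hypothetical tables for one key: 3 GO-like values and 2 aliases.
go <- data.frame(ENTREZID = "1", GO = c("GO:x1", "GO:x2", "GO:x3"),
                 stringsAsFactors = FALSE)
alias <- data.frame(ENTREZID = "1", ALIAS = c("aliasA", "aliasB"),
                    stringsAsFactors = FALSE)

# Joining on the key yields every GO/ALIAS combination: 3 * 2 = 6 rows.
both <- merge(go, alias, by = "ENTREZID")
nrow(both)

# Looking at each relationship separately keeps the results small:
# 3 rows and 2 rows.
c(nrow(go), nrow(alias))
```

With a third many to one column of, say, 4 values, the combined table for this single key would grow to 24 rows, which is why it usually pays to query such columns one at a time.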
When making use of GO ids, you can also use the GO.db package to find the Terms associated with those GO ids. The GO.db package will load a GOdb object that can be used in a manner similar to what we just saw with our OrgDb object for org.Hs.eg.db. And we can use the same four methods that we just learned about (columns, keytypes, keys and select), to extract whatever data we need.
library("GO.db")
##
head(res$GO) ## shows what we are using as keys
## [1] "GO:0004871" "GO:0005515" "GO:0007165" "GO:0016021" "GO:0039532"
## [6] "GO:0043123"
head(select(GO.db, keys=res$GO, columns="TERM", keytype="GOID"))
## GOID
## 1 GO:0004871
## 2 GO:0005515
## 3 GO:0007165
## 4 GO:0016021
## 5 GO:0039532
## 6 GO:0043123
## TERM
## 1 signal transducer activity
## 2 protein binding
## 3 signal transduction
## 4 integral component of membrane
## 5 negative regulation of viral-induced cytoplasmic pattern recognition receptor signaling pathway
## 6 positive regulation of I-kappaB kinase/NF-kappaB signaling
Sometimes you might want something 'simpler' than the select method. This can happen if you are only interested in quickly getting a character vector of values that match a set of IDs. For this sort of use case we have also added the mapIds accessor. Let's suppose that you have some entrez gene ids and you just want to get the gene symbols for them. You could do it like this:
ids = head(keys(org.Hs.eg.db, keytype="ENTREZID"))
mapIds(org.Hs.eg.db, keys=ids, column='SYMBOL', keytype='ENTREZID')
## 1 2 3 9 10 11
## "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
The main difference between select() and mapIds() is that mapIds expects that you want a simple named vector as a return value, where the ids that you started with are the names of the values returned. Because it is not returning a data.frame, mapIds has an extra argument (multiVals) to indicate how you would like to handle the case where multiple things match your keys. By default, mapIds will just give you the first of the matching objects, like so:
mapIds(org.Hs.eg.db, keys=ids, column='ALIAS', keytype='ENTREZID')
## 1 2 3 9 10 11
## "A1B" "A2MD" "A2MP" "AAC1" "AAC2" "AACP"
But you can also ask it to return other kinds of vectors like a list, CharacterList etc.:
mapIds(org.Hs.eg.db, keys=ids, column='ALIAS', keytype='ENTREZID', multiVals='CharacterList')
## CharacterList of length 6
## [["1"]] A1B ABG GAB HYST2477 A1BG
## [["2"]] A2MD CPAMD5 FWP007 S863-7 A2M
## [["3"]] A2MP A2MP1
## [["9"]] AAC1 MNAT NAT-1 NATI NAT1
## [["10"]] AAC2 NAT-2 PNAT NAT2
## [["11"]] AACP NATP1 NATP
Or you can even define your own behavior and just pass in the function to multiVals:
last <- function(x){x[[length(x)]]}
mapIds(org.Hs.eg.db, keys=ids, column='ALIAS', keytype='ENTREZID', multiVals=last)
## 1 2 3 9 10 11
## "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
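The idea behind multiVals can be mimicked in base R, which may help when deciding what kind of function to pass in. This is only a sketch of the concept, not how mapIds is implemented: the toy table holds a subset of the aliases shown above, and the three reducers are illustrative choices.

```r
# Toy alias table for two keys (a subset of the values shown above).
alias <- data.frame(
  ENTREZID = c("1", "1", "1", "2", "2"),
  ALIAS    = c("A1B", "ABG", "A1BG", "A2MD", "A2M"),
  stringsAsFactors = FALSE
)

# Group the values by key; each list element holds all matches for one key.
byKey <- split(alias$ALIAS, alias$ENTREZID)

# Three ways to reduce each group to a single value per key.
firstVal <- vapply(byKey, function(x) x[[1]], character(1))
lastVal  <- vapply(byKey, function(x) x[[length(x)]], character(1))
pasted   <- vapply(byKey, paste, character(1), collapse = ";")

firstVal
lastVal
pasted
```

Any function that takes the vector of matches for one key and returns a single value fits this pattern; keeping byKey as it is corresponds to asking for a list.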
Exercise 1: Look at the help page for the different columns and keytypes values with: help("SYMBOL"). Now use this information and what we just described to look up the entrez gene and chromosome for the gene symbol “MSX2”.
Exercise 2: In the previous exercise we had to use gene symbols as keys. But in the past this kind of behavior has sometimes been inadvisable because some gene symbols are used as the official symbol for more than one gene. To learn if this is still happening take advantage of the fact that entrez gene ids are uniquely assigned, and extract all of the gene symbols and their associated entrez gene ids from the org.Hs.eg.db package. Then check the symbols for redundancy.
The following illustrates a typical R / Bioconductor session for a ChipDb package. It continues the differential expression workflow, taking a 'top table' of differentially expressed probesets and discovering the genes probed, and the Gene Ontology pathways to which they belong.
First let's consider some typical probe IDs. If you have done a microarray analysis before you have probably already run into IDs like this. They are typically manufacturer assigned and normally only relevant to a small number of chips. Below I am just going to demonstrate on 5 probe IDs from the Affymetrix HG-U95Av2 platform.
## Affymetrix HG-U95Av2 array IDs of interest; these might be
## obtained from
##
## tbl <- topTable(efit, coef=2)
## ids <- tbl[["ID"]]
##
## as part of a more extensive workflow.
ids <- c("39730_at", "1635_at", "1674_at", "40504_at", "40202_at")
Load libraries as sources of annotation:
library("hgu95av2.db")
And from here you can use the new ChipDb object in the same way that you learned to use an OrgDb object before. The only real change is that the ChipDb object will also have data about how platform specific probes match to specific genes. So for example:
columns(hgu95av2.db)
## [1] "PROBEID" "ENTREZID" "PFAM" "IPI"
## [5] "PROSITE" "ACCNUM" "ALIAS" "CHR"
## [9] "CHRLOC" "CHRLOCEND" "ENZYME" "MAP"
## [13] "PATH" "PMID" "REFSEQ" "SYMBOL"
## [17] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [21] "GENENAME" "UNIPROT" "GO" "EVIDENCE"
## [25] "ONTOLOGY" "GOALL" "EVIDENCEALL" "ONTOLOGYALL"
## [29] "OMIM" "UCSCKG"
keytypes(hgu95av2.db)
## [1] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [5] "ACCNUM" "ALIAS" "ENZYME" "MAP"
## [9] "PATH" "PMID" "REFSEQ" "SYMBOL"
## [13] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [17] "GENENAME" "UNIPROT" "GO" "EVIDENCE"
## [21] "ONTOLOGY" "GOALL" "EVIDENCEALL" "ONTOLOGYALL"
## [25] "PROBEID" "OMIM" "UCSCKG"
columns <- c("PFAM","SYMBOL")
select(hgu95av2.db, keys=ids, columns, keytype="PROBEID")
## PROBEID PFAM SYMBOL
## 1 39730_at PF00017 ABL1
## 2 39730_at PF00018 ABL1
## 3 39730_at PF07714 ABL1
## 4 39730_at PF08919 ABL1
## 5 1635_at PF00017 ABL1
## 6 1635_at PF00018 ABL1
## 7 1635_at PF07714 ABL1
## 8 1635_at PF08919 ABL1
## 9 1674_at PF00017 YES1
## 10 1674_at PF00018 YES1
## 11 1674_at PF07714 YES1
## 12 40504_at PF01731 PON2
## 13 40202_at <NA> KLF9
Exercise 3: Examine the gene symbols for both the hgu95av2.db and the org.Hs.eg.db packages. Which one has more gene symbols? Which one has more gene symbols that can be mapped to an entrez gene ID? Which object seems to contain more information?
The genome-centered TxDb packages support the same interface as the ChipDb and OrgDb packages.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## done for convenience
keys <- head(keys(txdb, keytype="GENEID"), n=2)
columns <- c("TXNAME", "TXSTART","TXSTRAND")
select(txdb, keys, columns, keytype="GENEID")
## GENEID TXNAME TXSTRAND TXSTART
## 1 1 uc002qsd.4 - 58858172
## 2 1 uc002qsf.2 - 58859832
## 3 10 uc003wyw.1 + 18248755
But in addition to supporting the standard set of methods (select, keytypes, keys and columns), the TxDb objects also support methods to retrieve the annotations as ranges. These accessors break down into two basic categories. The most basic will return annotations as GRanges objects. Some examples of these are: transcripts(), exons() and cds().
This for example will return all the transcripts as ranges:
transcripts(txdb)
## GRanges object with 82960 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 [ 11874, 14409] + | 1 uc001aaa.3
## [2] chr1 [ 11874, 14409] + | 2 uc010nxq.1
## [3] chr1 [ 11874, 14409] + | 3 uc010nxr.1
## [4] chr1 [ 69091, 70008] + | 4 uc001aal.1
## [5] chr1 [321084, 321115] + | 5 uc001aaq.2
## ... ... ... ... ... ... ...
## [82956] chrY [27605645, 27605678] - | 78803 uc004fwx.1
## [82957] chrY [27606394, 27606421] - | 78804 uc022cpc.1
## [82958] chrY [27607404, 27607432] - | 78805 uc004fwz.3
## [82959] chrY [27635919, 27635954] - | 78806 uc022cpd.1
## [82960] chrY [59358329, 59360854] - | 78807 uc011ncc.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
And this will return all the exons as ranges:
exons(txdb)
## GRanges object with 289969 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 [11874, 12227] + | 1
## [2] chr1 [12595, 12721] + | 2
## [3] chr1 [12613, 12721] + | 3
## [4] chr1 [12646, 12697] + | 4
## [5] chr1 [13221, 14409] + | 5
## ... ... ... ... ... ...
## [289965] chrY [27607404, 27607432] - | 277746
## [289966] chrY [27635919, 27635954] - | 277747
## [289967] chrY [59358329, 59359508] - | 277748
## [289968] chrY [59360007, 59360115] - | 277749
## [289969] chrY [59360501, 59360854] - | 277750
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
But these operations will also support the extraction of extra metadata. All extra data will be inserted into the metadata slot of the returned GRanges object. So for example you could spice up your call to transcripts by using the columns argument like this.
transcripts(txdb, columns = c("tx_id","tx_name","gene_id"))
## GRanges object with 82960 ranges and 3 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 [ 11874, 14409] + | 1 uc001aaa.3
## [2] chr1 [ 11874, 14409] + | 2 uc010nxq.1
## [3] chr1 [ 11874, 14409] + | 3 uc010nxr.1
## [4] chr1 [ 69091, 70008] + | 4 uc001aal.1
## [5] chr1 [321084, 321115] + | 5 uc001aaq.2
## ... ... ... ... ... ... ...
## [82956] chrY [27605645, 27605678] - | 78803 uc004fwx.1
## [82957] chrY [27606394, 27606421] - | 78804 uc022cpc.1
## [82958] chrY [27607404, 27607432] - | 78805 uc004fwz.3
## [82959] chrY [27635919, 27635954] - | 78806 uc022cpd.1
## [82960] chrY [59358329, 59360854] - | 78807 uc011ncc.1
## gene_id
## <CharacterList>
## [1] 100287102
## [2] 100287102
## [3] 100287102
## [4] 79501
## [5]
## ... ...
## [82956]
## [82957]
## [82958]
## [82959]
## [82960]
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The second kind of range accessor supported by TxDb objects returns GRangesList objects. Some examples of these are: transcriptsBy(), exonsBy() and cdsBy(). These accessors return a GRangesList object that contains the desired ranges split up by some important feature type, which is specified using the “by” argument. A typical case is to extract all the transcript ranges known for all the genes. You can do that like this:
transcriptsBy(txdb, by="gene")
## GRangesList object of length 23459:
## $1
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr19 [58858172, 58864865] - | 70455 uc002qsd.4
## [2] chr19 [58859832, 58874214] - | 70456 uc002qsf.2
##
## $10
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## [1] chr8 [18248755, 18258723] + | 31944 uc003wyw.1
##
## $100
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## [1] chr20 [43248163, 43280376] - | 72132 uc002xmj.3
##
## ...
## <23456 more elements>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
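If the grouping in a GRangesList is unfamiliar, it is close in spirit to what split() does to a plain data.frame. The base R sketch below is only an analogy: it uses the gene ids, transcript names and start positions of the first three transcripts shown above, stored in an ordinary data.frame rather than a GRanges object.

```r
# Toy transcript table built from the first three transcripts shown above.
tx <- data.frame(
  gene_id = c("1", "1", "10"),
  tx_name = c("uc002qsd.4", "uc002qsf.2", "uc003wyw.1"),
  start   = c(58858172, 58859832, 18248755),
  stringsAsFactors = FALSE
)

# split() groups the rows by gene, giving a named list with one data.frame
# per gene, much as transcriptsBy(txdb, by = "gene") gives one GRanges
# per gene.
txByGene <- split(tx, tx$gene_id)

names(txByGene)
vapply(txByGene, nrow, integer(1))
```

As with the GRangesList, the list names are the gene ids, and each element holds only the transcripts that belong to that gene.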
Exercise 4: Use the accessors for the TxDb.Hsapiens.UCSC.hg19.knownGene package to retrieve the gene id, transcript name and transcript chromosome for all the transcripts. Do this using both the select() method and also using the transcripts() method. What is the difference in the output?
Exercise 5: Load the TxDb.Athaliana.BioMart.plantsmart22 package. This package is not from UCSC and it is based on plantsmart. Now use select or one of the range based accessors to look at the gene ids from this TxDb object. How do they compare to what you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package?
What if you wanted to combine all the good stuff from the GO.db package with what you find in the appropriate TxDb and OrgDb packages for an organism? Then you would want to use an OrganismDb package. An example of an OrganismDb package is the Homo.sapiens package. Like the OrgDb, ChipDb and TxDb packages, it supports the use of select, keytypes, keys and columns.
library(Homo.sapiens)
keys <- head(keys(Homo.sapiens, keytype="ENTREZID"), n=2)
columns <- c("SYMBOL","TXNAME")
select(Homo.sapiens, keys, columns, keytype="ENTREZID")
## ENTREZID SYMBOL TXNAME
## 1 1 A1BG uc002qsd.4
## 2 1 A1BG uc002qsf.2
## 3 2 A2M uc001qvk.1
## 4 2 A2M uc009zgk.1
When an OrganismDb package knows about a relevant TxDb package, it can also support the ranged accessors introduced with the TxDb objects.
transcripts(Homo.sapiens, columns=c("TXNAME","SYMBOL"))
## GRanges object with 82960 ranges and 2 metadata columns:
## seqnames ranges strand | TXNAME
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] chr1 [ 11874, 14409] + | uc001aaa.3
## [2] chr1 [ 11874, 14409] + | uc010nxq.1
## [3] chr1 [ 11874, 14409] + | uc010nxr.1
## [4] chr1 [ 69091, 70008] + | uc001aal.1
## [5] chr1 [321084, 321115] + | uc001aaq.2
## ... ... ... ... ... ...
## [82956] chrY [27605645, 27605678] - | uc004fwx.1
## [82957] chrY [27606394, 27606421] - | uc022cpc.1
## [82958] chrY [27607404, 27607432] - | uc004fwz.3
## [82959] chrY [27635919, 27635954] - | uc022cpd.1
## [82960] chrY [59358329, 59360854] - | uc011ncc.1
## SYMBOL
## <CharacterList>
## [1] DDX11L1
## [2] DDX11L1
## [3] DDX11L1
## [4] OR4F5
## [5] NA
## ... ...
## [82956] NA
## [82957] NA
## [82958] NA
## [82959] NA
## [82960] NA
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
You might be surprised to learn that an OrganismDb package does not itself contain very much information. Instead, it “knows where to find it” by referencing other packages that themselves implement a select interface. So to create an OrganismDb package, you really only need to specify where the information needs to come from. Configuring an OrganismDb object is therefore pretty simple. You create a special list object that describes which IDs from each package are the same kind of IDs in the other packages to be included, along with the relevant package names. So in the following example, the “GOID” values from the GO.db package act as foreign keys for the “GO” values in the org.Hs.eg.db package and so on.
gd <- list(join1 = c(GO.db="GOID", org.Hs.eg.db="GO"),
join2 = c(org.Hs.eg.db="ENTREZID",
TxDb.Hsapiens.UCSC.hg19.knownGene="GENEID"))
makeOrganismPackage(pkgname = "Homo.sapiens",
graphData = gd,
organism = "Homo sapiens",
version = "1.0.0",
maintainer = "Package Maintainer<maintainer@somewhere.org>",
author = "Some Body",
destDir = ".",
license = "Artistic-2.0")
In this way, you can create a custom OrganismDb package for any organism of interest, provided that you also have access to the supporting packages. There is a vignette that covers this topic in more detail.
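The foreign key idea can be illustrated with ordinary data.frames. The sketch below is an analogy in base R rather than a description of the package internals: the three toy tables stand in for GO.db, org.Hs.eg.db and the TxDb package, the GO ids and terms are taken from the output shown earlier, and the gene id, symbol and transcript names are hypothetical placeholders.

```r
# Toy tables standing in for the three resources.
goTab <- data.frame(GOID = c("GO:0005515", "GO:0007165"),
                    TERM = c("protein binding", "signal transduction"),
                    stringsAsFactors = FALSE)
orgTab <- data.frame(ENTREZID = "g1",            # hypothetical gene id
                     GO = c("GO:0005515", "GO:0007165"),
                     SYMBOL = "geneOne",         # hypothetical symbol
                     stringsAsFactors = FALSE)
txTab <- data.frame(GENEID = "g1",
                    TXNAME = c("txA", "txB"),    # hypothetical transcripts
                    stringsAsFactors = FALSE)

# join1: "GOID" in the GO table pairs with "GO" in the gene table.
step1 <- merge(goTab, orgTab, by.x = "GOID", by.y = "GO")

# join2: "ENTREZID" in the gene table pairs with "GENEID" in the
# transcript table.
step2 <- merge(step1, txTab, by.x = "ENTREZID", by.y = "GENEID")

dim(step2)
step2[, c("SYMBOL", "TERM", "TXNAME")]
```

Each entry of the join list plays the role of one by.x/by.y pair here, which is why a query can ask for a GO term and a transcript name in the same call.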
Exercise 6: Use the Homo.sapiens object to look up the gene symbol, transcript start and chromosome using select(). Then do the same thing using transcripts. You might expect that this call to transcripts will look the same as it did for the TxDb object, but (temporarily) it will not.
Exercise 7: Look at the results from calling the columns method on the Homo.sapiens object and compare that to what happens when you call columns on the org.Hs.eg.db object and then look at a call to columns on the TxDb.Hsapiens.UCSC.hg19.knownGene object. What is the difference between TXSTART and CHRLOC? Which one do you think you should use for transcripts or other genomic information?
Let's look more closely at the keys method. We have already talked about how you can use it to do this:
library(Homo.sapiens)
keys <- head(keys(Homo.sapiens, keytype="ENTREZID"), n=2)
And then you can use it with select to look up other kinds of information. But what if you only know partial information about the keys you are looking up? In Bioconductor 2.13 and higher there are extra arguments for the keys method that you can make use of to find keys that match certain criteria. The most useful is probably the pattern argument. The pattern argument allows you to find out which keys match a certain pattern. So for example, you can look up entrez gene IDs that start with a “2” like this:
head(keys(Homo.sapiens, keytype="ENTREZID", pattern="^2"), n=6)
## [1] "2" "20" "21" "22" "23" "24"
Or you could look up gene symbols that start with “MS”:
head(keys(Homo.sapiens, keytype="SYMBOL", pattern="^MS"), n=6)
## [1] "MS4A1" "MS4A3" "MS4A2" "MSTN" "MSH6" "MS"
If your string matching is too specific, you could also try to use fuzzy matching by setting the fuzzy argument to TRUE:
head(keys(Homo.sapiens, keytype="SYMBOL", pattern="^MS", fuzzy=TRUE), n=6)
## [1] "MS4A1" "MS4A3" "MS4A2" "MSTN" "MSH6" "LIMS1"
And if you want to match on one key and actually return another, then you can use the column argument to indicate which key you want to search for the pattern on, while using the keytype to indicate which kind of key you want returned. So you could (for example) get back Ensembl IDs where the symbol starts with “MS”.
keys <- head(keys(Homo.sapiens, keytype="ENSEMBL", pattern="^MS", column="SYMBOL"), n=6)
keys
## [1] "ENSG00000156738" "ENSG00000149516" "ENSG00000149534" "ENSG00000138379"
## [5] "ENSG00000116062" "ENSG00000095002"
select(Homo.sapiens, keys, "SYMBOL", keytype="ENSEMBL")
## ENSEMBL SYMBOL
## 1 ENSG00000156738 MS4A1
## 2 ENSG00000149516 MS4A3
## 3 ENSG00000149534 MS4A2
## 4 ENSG00000138379 MSTN
## 5 ENSG00000116062 MSH6
## 6 ENSG00000095002 MSH2
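The split between "search on one column, return another" is easy to picture with a plain lookup table. The base R sketch below is a conceptual stand-in, not the keys() implementation: the first three rows are copied from the select() output above, and the last row is a hypothetical non-matching entry added for contrast.

```r
# Toy table pairing two key types.
tab <- data.frame(
  ENSEMBL = c("ENSG00000156738", "ENSG00000138379", "ENSG00000095002",
              "ENSG_hypothetical"),
  SYMBOL  = c("MS4A1", "MSTN", "MSH2", "LIMS1"),
  stringsAsFactors = FALSE
)

# Search on SYMBOL (the role of 'column') ...
hits <- grepl("^MS", tab$SYMBOL)

# ... but return ENSEMBL (the role of 'keytype').
returned <- tab$ENSEMBL[hits]
returned
```

The pattern is an ordinary regular expression here, so "^MS" anchors the match to the start of the symbol and the "LIMS1" row is left out.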
Exercise 8: Use the Homo.sapiens object with the keys method to look up the entrez gene IDs for all gene symbols that contain the letter “X”.
So far we have been discussing annotations that are fairly well established and that represent consensus findings from the scientific community. These kinds of annotations are usually curated at large institutions like NCBI or Ensembl, and for the most part everyone basically agrees about what they mean and how to use them.
But sometimes the annotations that you need are not as well established. Sometimes (for example) we just need to compare our results to the data from a recent large study such as the ENCODE project. The AnnotationHub package is designed to be useful for getting access to data like this. AnnotationHub allows you to get access to data from a range of different data repositories, with the caveat that the data objects in AnnotationHub have all been pre-processed into appropriate R objects for you.
To make use of AnnotationHub, you need to load the package and then create an AnnotationHub object. Notice that unlike the other packages, with AnnotationHub you have to create an AnnotationHub object when you first start up your R session.
library(AnnotationHub)
ah <- AnnotationHub()
You can think of the AnnotationHub object as being a little bit like a list. So you can see how many records are present by looking at the length like this:
length(ah)
## [1] 19268
And the show method for the AnnotationHub will also try to tell you about what it contains. It will preferentially show you the most common kinds of data inside of it, so these values will change as you carve away things that you are uninterested in.
ah
## AnnotationHub with 19268 records
## # snapshotDate(): 2015-03-26
## # $dataprovider: UCSC, Ensembl, BroadInstitute, NCBI, Haemcode, dbSNP, ...
## # $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Da...
## # $rdataclass: GRanges, FaFile, OrgDb, ChainFile, CollapsedVCF, Inparan...
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH169"]]'
##
## title
## AH169 | Meleagris_gallopavo.UMD2.69.cdna.all.fa
## AH170 | Meleagris_gallopavo.UMD2.69.dna.toplevel.fa
## AH171 | Meleagris_gallopavo.UMD2.69.dna_rm.toplevel.fa
## AH172 | Meleagris_gallopavo.UMD2.69.dna_sm.toplevel.fa
## AH173 | Meleagris_gallopavo.UMD2.69.ncrna.fa
## ... ...
## AH28851 | Tursiops_truncatus.turTru1.77.gtf
## AH28852 | Vicugna_pacos.vicPac1.77.gtf
## AH28853 | Xenopus_tropicalis.JGI_4.2.77.gtf
## AH28854 | Xiphophorus_maculatus.Xipmac4.4.2.77.gtf
## AH28855 | RNA-Sequencing and clinical data for 7706 tumor samples fro...
The AnnotationHub resource currently has a LOT of different data in it. So to really make use of that, you need to be able to find the data that is of specific interest to you. To get started, you can just use the '$' operator and then just tab complete like this.
ah$
And if you hit tab, that should give an output like this:
ah$ah_id ah$dataprovider ah$taxonomyid ah$description ah$rdataclass ah$sourcetype
ah$title ah$species ah$genome ah$tags ah$sourceurl
Each of the above contains a column of metadata values that can be used to filter the AnnotationHub data. You can look at the possible contents of any specific one by using unique.
unique(ah$sourcetype)
## [1] "FASTA" "BED" "UCSC track" "VCF"
## [5] "GTF" "Inparanoid" "NCBI/blast2GO" "TwoBit"
## [9] "Chain" "GRASP" "Zip" "CSV"
## [13] "BioPax" "BioPaxLevel2" "RData" "BigWig"
## [17] "tar.gz"
So from this you can make decisions about how you want to subset out the parts of the AnnotationHub that you want to explore in more detail. Let's suppose that you are interested in data that was initially a 'BED' file. You have already seen above that those would be the resources where the sourcetype is 'BED'. So you can use the subset method to throw away all the other kinds of data.
ahs <- subset(ah, ah$sourcetype=='BED')
length(ahs)
## [1] 7855
So now we have gone from almost 20 thousand entries to about 7900. That helps, but we probably need to look closer than that. So let's look again, this time at the dataprovider information:
table(ahs$dataprovider)
##
## BroadInstitute EncodeDCC Haemcode UCSC
## 3146 5 315 4389
From here let's say you are interested in the EncodeDCC data. You can get it like before:
ahs2 <- subset(ahs, ahs$dataprovider=='EncodeDCC')
ahs2
## AnnotationHub with 5 records
## # snapshotDate(): 2015-03-26
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH15221"]]'
##
## title
## AH15221 | broadPeaks.E001-H3K27me3.gappedPeak.gz
## AH15222 | broadPeaks.E001-H3K36me3.broadPeak.gz
## AH15223 | broadPeaks.E001-H3K36me3.gappedPeak.gz
## AH15224 | broadPeaks.E001-H3K4me1.broadPeak.gz
## AH15225 | broadPeaks.E001-H3K4me1.gappedPeak.gz
And you can also use the powerful query function to subset based on matching a string in the metadata. So let's further simplify to only those records where the data is a 'gappedPeak'.
ahs3 <- query(ahs2, 'gappedPeak')
ahs3
## AnnotationHub with 3 records
## # snapshotDate(): 2015-03-26
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH15221"]]'
##
## title
## AH15221 | broadPeaks.E001-H3K27me3.gappedPeak.gz
## AH15223 | broadPeaks.E001-H3K36me3.gappedPeak.gz
## AH15225 | broadPeaks.E001-H3K4me1.gappedPeak.gz
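The narrowing steps above follow a pattern that can be practiced on an ordinary data.frame of metadata. The base R sketch below is only an analogy for the hub: the table, its AH-style row names and its values are all made up for illustration, and grepl() on the title column stands in for a string query.

```r
# Hypothetical metadata table; row names play the role of hub ids.
meta <- data.frame(
  title = c("a.gappedPeak.gz", "a.broadPeak.gz", "b.fa", "c.gappedPeak.gz"),
  sourcetype = c("BED", "BED", "FASTA", "BED"),
  dataprovider = c("BroadInstitute", "BroadInstitute", "Ensembl", "UCSC"),
  row.names = c("AH1", "AH2", "AH3", "AH4"),
  stringsAsFactors = FALSE
)

# Step 1: keep the BED records.
s1 <- meta[meta$sourcetype == "BED", ]

# Step 2: filter the subset. The condition is computed from s1 itself, so
# the logical vector has one entry per row of s1.
s2 <- s1[s1$dataprovider == "BroadInstitute", ]

# Step 3: keep the records whose title mentions 'gappedPeak'.
s3 <- s2[grepl("gappedPeak", s2$title), ]

rownames(s3)
```

The detail worth noticing is in step 2: each condition is built from the object being filtered, not from the original table, so that the condition and the rows it selects always line up.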
From here you can hopefully see what you need. So the only thing remaining to do is to retrieve your record. Doing so works just like subsetting a list-like object. So you can either retrieve it by position:
res <- ahs3[[2]]
Or by name:
res <- ahs3[["AH15223"]]
res
## GRanges object with 56125 ranges and 8 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr6 [ 86552465, 86552967] * | Rank_1 98
## [2] chr21 [ 20161036, 20161230] * | Rank_2 95
## [3] chr10 [116966708, 116966953] * | Rank_3 95
## [4] chr6 [115576892, 115577227] * | Rank_4 91
## [5] chr6 [ 30583633, 30585005] * | Rank_5 91
## ... ... ... ... ... ... ...
## [56121] chr2 [23185234, 23189325] * | Rank_56121 2
## [56122] chr3 [45495081, 45498832] * | Rank_56122 2
## [56123] chr10 [35679309, 35683318] * | Rank_56123 2
## [56124] chr8 [35655815, 35657331] * | Rank_56124 2
## [56125] chrX [96423877, 96426822] * | Rank_56125 2
## itemRgb signalValue pValue qValue
## <character> <numeric> <numeric> <numeric>
## [1] <NA> 6.92034 13.73329 9.85752
## [2] <NA> 7.42345 13.63199 9.59692
## [3] <NA> 6.91558 13.35056 9.51450
## [4] <NA> 6.50700 12.98364 9.12909
## [5] <NA> 5.69253 12.93009 9.18319
## ... ... ... ... ...
## [56121] <NA> 1.76373 1.43508 0.24132
## [56122] <NA> 1.74488 1.42889 0.22499
## [56123] <NA> 1.75109 1.42736 0.23781
## [56124] <NA> 1.74756 1.41481 0.23177
## [56125] <NA> 1.72623 1.38609 0.22755
## thick blocks
## <IRanges> <IRangesList>
## [1] [ 86552467, 86552661] [3, 197]
## [2] [ 20161038, 20161227] [3, 192]
## [3] [116966758, 116966951] [51, 244]
## [4] [115576894, 115577083] [3, 192]
## [5] [ 30583754, 30584991] [122, 1359]
## ... ... ...
## [56121] [23185684, 23185863] [451, 630]
## [56122] [45496342, 45496521] [1262, 1441]
## [56123] [35680270, 35680449] [962, 1141]
## [56124] [35657036, 35657215] [1222, 1401]
## [56125] [96425116, 96425295] [1240, 1419]
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
And if you just want to explore the metadata entirely in R as a DataFrame, then you can also get all of the relevant metadata from the object with the mcols method like this:
meta <- mcols(ah)
We can also view and filter our AnnotationHub object interactively by simply calling the display function on it:
d <- display(ah)
We can then filter the AnnotationHub object for “Homo sapiens” by either using the Global search field on the top right corner of the page or the in-column search field for “Species”.
By default, 1000 entries are displayed per page. We can change this using the filter at the top of the page, or navigate through different pages using the page scrolling feature at the bottom of the page.
We can also select the rows of interest and send them back to the R session using the 'Send Rows' button; this sets a filter internally which filters the AnnotationHub object.
Exercise 9: Use the AnnotationHub to extract UCSC data that is from Homo sapiens and also specifically from the hg19 genome. What happens to the hub object as you filter data at each step?
Exercise 10: Now that you have narrowed things down to the hg19 annotations from the UCSC genome browser, let's get one of these annotations. Find the oreganno track and save it into a local variable.
Another valuable resource is the biomaRt package. The biomaRt package exposes a huge family of online annotation resources called marts. Here is a brief rundown of how to use it. For the first step, load the package and decide which “mart” you want to use, then use the useMart() function to create a mart object:
library("biomaRt")
head(listMarts())
## biomart version
## 1 ensembl ENSEMBL GENES 79 (SANGER UK)
## 2 snp ENSEMBL VARIATION 79 (SANGER UK)
## 3 regulation ENSEMBL REGULATION 79 (SANGER UK)
## 4 vega VEGA 59 (SANGER UK)
## 5 fungi_mart_26 ENSEMBL FUNGI 26 (EBI UK)
## 6 fungi_variations_26 ENSEMBL FUNGI VARIATION 26 (EBI UK)
ensembl <- useMart("ensembl")
ensembl
## Object of class 'Mart':
## Using the ensembl BioMart database
## Using the dataset
Next you need to decide on a dataset. This can also be specified in the mart object that is created when you call the useMart() function.
head(listDatasets(ensembl))
## dataset
## 1 oanatinus_gene_ensembl
## 2 cporcellus_gene_ensembl
## 3 gaculeatus_gene_ensembl
## 4 lafricana_gene_ensembl
## 5 itridecemlineatus_gene_ensembl
## 6 choffmanni_gene_ensembl
## description version
## 1 Ornithorhynchus anatinus genes (OANA5) OANA5
## 2 Cavia porcellus genes (cavPor3) cavPor3
## 3 Gasterosteus aculeatus genes (BROADS1) BROADS1
## 4 Loxodonta africana genes (loxAfr3) loxAfr3
## 5 Ictidomys tridecemlineatus genes (spetri2) spetri2
## 6 Choloepus hoffmanni genes (choHof1) choHof1
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
ensembl
## Object of class 'Mart':
## Using the ensembl BioMart database
## Using the hsapiens_gene_ensembl dataset
Next we need to think about filters and values. In the biomaRt package, filters are used together with values to restrict or choose what comes back. For example, you might choose the filter “affy_hg_u133_plus_2” to go with the values c("202763_at","209310_s_at","207500_at"). Together these two things request the records that match those probeset IDs on the platform named by the filter. There is an accessor for the kinds of filters that are available from a given mart/dataset:
head(listFilters(ensembl))
## name description
## 1 chromosome_name Chromosome name
## 2 start Gene Start (bp)
## 3 end Gene End (bp)
## 4 band_start Band Start
## 5 band_end Band End
## 6 marker_start Marker Start
Also, you need to know about attributes. Attributes here mean the things that you want returned. So if you want to know the gene symbol or something like that, you would list that as an attribute. There are accessors to list the kinds of attributes you can look up too:
head(listAttributes(ensembl))
## name description
## 1 ensembl_gene_id Ensembl Gene ID
## 2 ensembl_transcript_id Ensembl Transcript ID
## 3 ensembl_peptide_id Ensembl Protein ID
## 4 ensembl_exon_id Ensembl Exon ID
## 5 description Description
## 6 chromosome_name Chromosome Name
Once you are done exploring and know what you want to extract, you can call the getBM method to get your data like this:
affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'),
filters = 'affy_hg_u133_plus_2',
values = affyids, mart = ensembl)
## affy_hg_u133_plus_2 entrezgene
## 1 202763_at 836
## 2 207500_at 838
## 3 209310_s_at 837
Now what would you do if you didn't know what the possible values are for a given filter? Well you could just request all the possible values by not specifying the filter, and instead only specifying it as an attribute like this:
head(getBM(attributes='affy_hg_u133_plus_2', mart = ensembl))
## affy_hg_u133_plus_2
## 1 1553551_s_at
## 2 1553569_at
## 3 1553538_s_at
## 4 1553570_x_at
## 5 1553567_s_at
## 6 1553588_at
Of course if you find the standard biomaRt methods difficult to work with, you can now also use the standard select methods here.
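As a sketch of that alternative, the probe lookup shown earlier could be written with select(). This assumes network access to the Ensembl BioMart service and the same dataset as above. With a mart object, filter names play the role of keytypes and attribute names the role of columns. The hgnc_symbol attribute is chosen here only for illustration, and the available names can differ between Ensembl releases.

```r
library(biomaRt)

# Assumes network access; dataset name as used above
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

affyids <- c("202763_at", "209310_s_at", "207500_at")

# keytype must be one of keytypes(ensembl) (the mart's filters);
# columns must come from columns(ensembl) (the mart's attributes)
res <- select(ensembl,
              keys    = affyids,
              columns = c("affy_hg_u133_plus_2", "hgnc_symbol"),
              keytype = "affy_hg_u133_plus_2")
res
```

The call mirrors select() on an OrgDb object, so code written against the annotation packages needs little change to query a mart instead.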
Exercise 11: Pull down GO terms for entrez gene id “1” from human by using the ensembl “hsapiens_gene_ensembl” dataset.
Exercise 12: Now compare the GO terms you just pulled down to the same GO terms from the org.Hs.eg.db package (which you can now retrieve using select()). What differences do you notice? Why do you suspect that is?
There are many BSgenome packages in the repository too. These packages contain sequence data for sequenced organisms. You can load one of these packages just like this:
library(BSgenome.Hsapiens.UCSC.hg19)
ls("package:BSgenome.Hsapiens.UCSC.hg19")
Hsapiens
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
## # chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
## # chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
## # chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
## # chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)
The getSeq method is useful for extracting data from these packages. This method takes several arguments, but the important ones are the first two. The first argument specifies the BSgenome object to use and the second argument (names) specifies what data you want back out. So for example, if you call it and give a character vector that names the seqnames for the object, then you will get the sequences from those chromosomes as a DNAStringSet object.
seqNms <- seqnames(Hsapiens)
head(seqNms)
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
getSeq(Hsapiens, seqNms[1:2])
## A DNAStringSet instance of length 2
## width seq
## [1] 249250621 NNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## [2] 243199373 NNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Whereas if you give a GRanges object for the second argument, you can instead get a DNAStringSet that corresponds to those ranges.
rngs <- GRanges(seqnames = c('chr1', 'chr4'), strand=c('+','-'),
ranges = IRanges(start=c(100000,300000),
end=c(100023,300037)))
rngs
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [100000, 100023] +
## [2] chr4 [300000, 300037] -
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
res <- getSeq(Hsapiens, rngs)
res
## A DNAStringSet instance of length 2
## width seq
## [1] 24 CACTAAGCACACAGAGAATAATGT
## [2] 38 GCTGGTCCCTTACTTCCAGTAGAAAAGACGTGTTCAGG
This can be a very powerful way to quickly get sequences of interest. The BSgenome package also has further useful functions, for example for finding a pattern in a string set.
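As a minimal sketch of pattern matching on a string set, the example below uses vcountPattern() and vmatchPattern(). These functions are defined in the Biostrings package, which BSgenome builds on. The two sequences are the ones returned by getSeq() above, typed in by hand so that the genome package is not needed, and the pattern "AGA" is an arbitrary choice for illustration.

```r
library(Biostrings)

# The two sequences returned by getSeq() above, entered by hand
seqs <- DNAStringSet(c("CACTAAGCACACAGAGAATAATGT",
                       "GCTGGTCCCTTACTTCCAGTAGAAAAGACGTGTTCAGG"))

# Number of exact matches of the pattern in each sequence
vcountPattern("AGA", seqs)

# Match positions, stored per sequence; [[1]] gives the ranges
# of the matches found in the first sequence
hits <- vmatchPattern("AGA", seqs)
hits[[1]]
```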
Follow installation instructions to start using these packages. To install the annotations associated with the Affymetrix Human Genome U95 V 2.0, and with Gene Ontology, use
source("http://bioconductor.org/biocLite.R")
biocLite(c("hgu95av2.db", "GO.db"))
Package installation is required only once per R installation. View a full list of available software and annotation packages.
To use the AnnotationDbi and GO.db packages, evaluate the commands
library(AnnotationDbi)
library(GO.db)
These commands are required once in each R session.
Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like
help(package="GO.db")
?select
to obtain an overview of help on the GO.db package, and the select method. The AnnotationDbi package is used by most .db packages. View the vignettes in the AnnotationDbi package with
browseVignettes(package="AnnotationDbi")
to view vignettes (providing a more comprehensive introduction to package functionality) in the AnnotationDbi package. Use
help.start()
to open a web page containing comprehensive help resources.
The following guides the user through key annotation packages. Users interested in how to create custom chip packages should see the vignettes in the AnnotationForge package. There is additional information in the AnnotationDbi, OrganismDbi and GenomicFeatures packages for how to use some of the extra tools provided. You can also refer to the complete list of annotation packages.
AnnotationDbi package. This package will be automatically installed for you if you install another “.db” annotation package using biocLite(). It contains the code to allow annotation mapping objects to be made and manipulated as well as code to use the select methods etc.
AnnotationDbi package. These packages must be upgraded before you attempt to update your custom chip packages as they contain the source databases needed by the SQLForge code.
keys <- "MSX2"
columns <- c("ENTREZID", "CHR")
select(org.Hs.eg.db, keys, columns, keytype="SYMBOL")
## SYMBOL ENTREZID CHR
## 1 MSX2 4488 5
## 1st get all the gene symbols
orgSymbols <- keys(org.Hs.eg.db, keytype="SYMBOL")
## and then use that to get all gene symbols matched to all entrez gene IDs
egr <- select(org.Hs.eg.db, keys=orgSymbols, "ENTREZID", "SYMBOL")
length(egr$ENTREZID)
## [1] 56340
length(unique(egr$ENTREZID))
## [1] 56340
## VS:
length(egr$SYMBOL)
## [1] 56340
length(unique(egr$SYMBOL))
## [1] 56332
## So let's trap these symbols that are redundant and look more closely...
redund <- egr$SYMBOL
badSymbols <- redund[duplicated(redund)]
select(org.Hs.eg.db, badSymbols, "ENTREZID", "SYMBOL")
## SYMBOL ENTREZID
## 1 HBD 3045
## 2 HBD 100187828
## 3 RNR1 4549
## 4 RNR1 6052
## 5 RNR2 4550
## 6 RNR2 6053
## 7 SFPQ 6421
## 8 SFPQ 654780
## 9 TEC 7006
## 10 TEC 100124696
## 11 MEMO1 7795
## 12 MEMO1 51072
## 13 MMD2 221938
## 14 MMD2 100505381
## 15 LSAMP-AS1 100506708
## 16 LSAMP-AS1 101926903
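The duplicated() idiom used above flags only the second and later occurrences of each value, which is enough to name the repeated symbols. Here is a toy sketch in base R of how that works. The small data frame is a hand-made stand-in for the select() result (its name and contents are illustrative; the IDs are copied from outputs shown earlier).

```r
# Small stand-in for the symbol/ID table returned by select()
egr_toy <- data.frame(
  SYMBOL   = c("HBD", "HBD", "MSX2", "TEC", "TEC"),
  ENTREZID = c("3045", "100187828", "4488", "7006", "100124696"),
  stringsAsFactors = FALSE
)

# duplicated() marks repeats only, so each redundant symbol appears once here
bad <- unique(egr_toy$SYMBOL[duplicated(egr_toy$SYMBOL)])
bad

# %in% then recovers every row that carries one of those symbols
egr_toy[egr_toy$SYMBOL %in% bad, ]
```

Looking the redundant symbols up again with select(), as done above, serves the same purpose as the final %in% step: it brings back all IDs for each symbol, not just the repeats.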
Initially you might expect that hgu95av2.db will have less information in it. After all, it's an old Affymetrix platform that was developed before we even had a very complete human genome. So you might try something like this:
chipSymbols <- keys(hgu95av2.db, keytype="SYMBOL")
orgSymbols <- keys(org.Hs.eg.db, keytype="SYMBOL")
length(orgSymbols)
## [1] 56332
length(chipSymbols)
## [1] 56332
And you might feel confused and so you might try this:
dim(select(org.Hs.eg.db,orgSymbols, "ENTREZID", "SYMBOL"))
## [1] 56340 2
dim(select(hgu95av2.db,chipSymbols, "ENTREZID", "SYMBOL"))
## [1] 56340 2
And you might also have noticed this:
length(columns(org.Hs.eg.db)) < length(columns(hgu95av2.db))
## [1] TRUE
Well, the answer you have in front of you is actually correct. There actually is more information available in the hgu95av2.db object than in the org.Hs.eg.db object. This is because, even though the hgu95av2.db object technically can only have probes for some genes in the genome, it still (behind the scenes) retrieves data about gene names etc. from the org.Hs.eg.db package. So it effectively has access to all the data from the org package plus the probes for that platform and what those map to. That means there will be information about many gene symbols that don't actually match up to any probeset IDs. And that is what we see if we use gene symbols to look up the probe IDs.
head(select(hgu95av2.db,chipSymbols, "PROBEID", "SYMBOL"))
## SYMBOL PROBEID
## 1 A1BG <NA>
## 2 A2M <NA>
## 3 A2MP1 <NA>
## 4 NAT1 38187_at
## 5 NAT2 38912_at
## 6 NATP <NA>
So to retrieve this information using select you need to do it like this:
res1 <- select(TxDb.Hsapiens.UCSC.hg19.knownGene,
keys(TxDb.Hsapiens.UCSC.hg19.knownGene, keytype="TXID"),
columns=c("GENEID","TXNAME","TXCHROM"), keytype="TXID")
head(res1)
## TXID GENEID TXNAME TXCHROM
## 1 1 100287102 uc001aaa.3 chr1
## 2 2 100287102 uc010nxq.1 chr1
## 3 3 100287102 uc010nxr.1 chr1
## 4 4 79501 uc001aal.1 chr1
## 5 5 <NA> uc001aaq.2 chr1
## 6 6 <NA> uc001aar.2 chr1
And to do it using transcripts you do it like this:
res2 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene,
columns = c("gene_id","tx_name"))
head(res2)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id tx_name
## <Rle> <IRanges> <Rle> | <CharacterList> <character>
## [1] chr1 [ 11874, 14409] + | 100287102 uc001aaa.3
## [2] chr1 [ 11874, 14409] + | 100287102 uc010nxq.1
## [3] chr1 [ 11874, 14409] + | 100287102 uc010nxr.1
## [4] chr1 [ 69091, 70008] + | 79501 uc001aal.1
## [5] chr1 [321084, 321115] + | uc001aaq.2
## [6] chr1 [321146, 321207] + | uc001aar.2
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Notice that in the 2nd case we don't have to ask for the chromosome, as transcripts() returns a GRanges object, so the chromosome will automatically be returned as part of the object.
library(TxDb.Athaliana.BioMart.plantsmart22)
res <- transcripts(TxDb.Athaliana.BioMart.plantsmart22, columns = c("gene_id"))
You will notice that the gene IDs for this package are TAIR locus IDs and are NOT Entrez gene IDs like the ones you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package. It's important to always pay attention to the kind of gene ID that is being used by the TxDb you are looking at.
library(Homo.sapiens)
keys <- keys(Homo.sapiens, keytype="TXID")
res1 <- select(Homo.sapiens,
keys= keys,
columns=c("SYMBOL","TXSTART","TXCHROM"), keytype="TXID")
head(res1)
And to do it using transcripts you do it like this:
library(Homo.sapiens)
res2 <- transcripts(Homo.sapiens, columns="SYMBOL")
head(res2)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | SYMBOL
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] chr1 [ 11874, 14409] + | DDX11L1
## [2] chr1 [ 11874, 14409] + | DDX11L1
## [3] chr1 [ 11874, 14409] + | DDX11L1
## [4] chr1 [ 69091, 70008] + | OR4F5
## [5] chr1 [321084, 321115] + | NA
## [6] chr1 [321146, 321207] + | NA
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
columns(Homo.sapiens)
## [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION"
## [5] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [9] "ACCNUM" "ALIAS" "CHR" "CHRLOC"
## [13] "CHRLOCEND" "ENZYME" "MAP" "PATH"
## [17] "PMID" "REFSEQ" "SYMBOL" "UNIGENE"
## [21] "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME"
## [25] "UNIPROT" "GO" "EVIDENCE" "GOALL"
## [29] "EVIDENCEALL" "ONTOLOGYALL" "OMIM" "UCSCKG"
## [33] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND"
## [37] "CDSSTART" "CDSEND" "EXONID" "EXONNAME"
## [41] "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND"
## [45] "GENEID" "TXID" "EXONRANK" "TXNAME"
## [49] "TXTYPE" "TXCHROM" "TXSTRAND" "TXSTART"
## [53] "TXEND"
columns(org.Hs.eg.db)
## [1] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [5] "ACCNUM" "ALIAS" "CHR" "CHRLOC"
## [9] "CHRLOCEND" "ENZYME" "MAP" "PATH"
## [13] "PMID" "REFSEQ" "SYMBOL" "UNIGENE"
## [17] "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME"
## [21] "UNIPROT" "GO" "EVIDENCE" "ONTOLOGY"
## [25] "GOALL" "EVIDENCEALL" "ONTOLOGYALL" "OMIM"
## [29] "UCSCKG"
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)
## [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART"
## [6] "CDSEND" "EXONID" "EXONNAME" "EXONCHROM" "EXONSTRAND"
## [11] "EXONSTART" "EXONEND" "GENEID" "TXID" "EXONRANK"
## [16] "TXNAME" "TXTYPE" "TXCHROM" "TXSTRAND" "TXSTART"
## [21] "TXEND"
## You might also want to look at this:
transcripts(Homo.sapiens, columns=c("SYMBOL","CHRLOC"))
## GRanges object with 82960 ranges and 3 metadata columns:
## seqnames ranges strand | CHRLOC
## <Rle> <IRanges> <Rle> | <IntegerList>
## [1] chr1 [ 11874, 14409] + | 11874
## [2] chr1 [ 11874, 14409] + | 11874
## [3] chr1 [ 11874, 14409] + | 11874
## [4] chr1 [ 69091, 70008] + | 69091
## [5] chr1 [321084, 321115] + | NA
## ... ... ... ... ... ...
## [82956] chrY [27605645, 27605678] - | NA
## [82957] chrY [27606394, 27606421] - | NA
## [82958] chrY [27607404, 27607432] - | NA
## [82959] chrY [27635919, 27635954] - | NA
## [82960] chrY [59358329, 59360854] - | NA
## CHRLOCCHR SYMBOL
## <CharacterList> <CharacterList>
## [1] 1 DDX11L1
## [2] 1 DDX11L1
## [3] 1 DDX11L1
## [4] 1 OR4F5
## [5] NA NA
## ... ... ...
## [82956] NA NA
## [82957] NA NA
## [82958] NA NA
## [82959] NA NA
## [82960] NA NA
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The key difference is that TXSTART refers to the start of a transcript and originates in the TxDb object from the TxDb.Hsapiens.UCSC.hg19.knownGene package, while CHRLOC refers to the same thing but originates in the OrgDb object from the org.Hs.eg.db package. The point of origin is significant because the TxDb object represents a transcriptome from UCSC, while the OrgDb holds primarily gene-centric data that originates at NCBI. The upshot is that CHRLOC will not have as many regions represented as TXSTART, since there has to be an official gene for there to even be a record. The CHRLOC data is also locked in for org.Hs.eg.db as data for hg19, whereas you can swap in a different TxDb object to match the genome you are using (hg18, etc.). For these reasons, we strongly recommend using TXSTART instead of CHRLOC. However, CHRLOC still remains in the org packages for historical reasons.
To find the keys that match, make use of the pattern and column arguments.
library(Homo.sapiens)
xk = head(keys(Homo.sapiens, keytype="ENTREZID", pattern="X", column="SYMBOL"))
xk
## [1] "51" "179" "189" "239" "240" "241"
Using select verifies the results:
select(Homo.sapiens, xk, "SYMBOL", "ENTREZID")
## ENTREZID SYMBOL
## 1 51 ACOX1
## 2 179 AGMX2
## 3 189 AGXT
## 4 239 ALOX12
## 5 240 ALOX5
## 6 241 ALOX5AP
The first thing you need to do is look for things from UCSC:
ahs <- query(ah, "UCSC")
Then you can look for Genome values that match 'hg19' and a species that matches 'Homo sapiens'.
ahs <- subset(ahs, ahs$genome=='hg19')
length(ahs)
## [1] 5490
ahs <- subset(ahs, ahs$species=='Homo sapiens')
length(ahs)
## [1] 5490
You might notice that the last two filtering steps are redundant (in other words, doing either one of them is the same as doing both of them). If this were not the case, we might suspect that there was a problem with the metadata.
This pulls down the oreganno annotations, which are described on the UCSC site as follows: “This track displays literature-curated regulatory regions, transcription factor binding sites, and regulatory polymorphisms from ORegAnno (Open Regulatory Annotation). For more detailed information on a particular regulatory element, follow the link to ORegAnno from the details page.”
ahs <- query(ah, 'oreganno')
ahs
## AnnotationHub with 9 records
## # snapshotDate(): 2015-03-26
## # $dataprovider: Pazar, UCSC
## # $species: Homo sapiens, Saccharomyces cerevisiae, NA
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH5087"]]'
##
## title
## AH5087 | ORegAnno
## AH5213 | ORegAnno
## AH7053 | ORegAnno
## AH7061 | ORegAnno
## AH22286 | pazar_ORegAnno_20120522.csv
## AH22287 | pazar_ORegAnno_ENCODEprom_20120522.csv
## AH22288 | pazar_ORegAnno_Erythroid_20120522.csv
## AH22289 | pazar_ORegAnno_STAT1_ChIP_20120522.csv
## AH22290 | pazar_ORegAnno_STAT1_lit_20120522.csv
ahs[1]
## AnnotationHub with 1 record
## # snapshotDate(): 2015-03-26
## # names(): AH5087
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $title: ORegAnno
## # $description: GRanges object from UCSC track 'ORegAnno'
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: UCSC track
## # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/dat...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: oreganno, UCSC, track, Gene, Transcript, Annotation
## # retrieve record with 'object[["AH5087"]]'
oreg <- ahs[['AH5087']]
## retrieving 1 resources
oreg
## GRanges object with 23118 ranges and 2 metadata columns:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 [873499, 873849] + | OREG0012989
## [2] chr1 [886764, 887214] + | OREG0012990
## [3] chr1 [886938, 886958] + | OREG0007909
## [4] chr1 [919400, 919950] + | OREG0012991
## [5] chr1 [919695, 919715] + | OREG0007910
## ... ... ... ... ... ...
## [23114] chr7_gl000195_random [ 1, 851] + | OREG0026736
## [23115] chr7_gl000195_random [103427, 103447] + | OREG0012963
## [23116] chr7_gl000195_random [121139, 121159] + | OREG0012964
## [23117] chr17_gl000204_random [ 58370, 58955] + | OREG0026769
## [23118] chr17_gl000205_random [117492, 118442] + | OREG0026772
## score
## <numeric>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [23114] 0
## [23115] 0
## [23116] 0
## [23117] 0
## [23118] 0
## -------
## seqinfo: 93 sequences from hg19 genome
library("biomaRt")
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
ids=c("1")
getBM(attributes=c('go_id', 'entrezgene'),
filters = 'entrezgene',
values = ids, mart = ensembl)
## go_id entrezgene
## 1 GO:0003674 1
## 2 GO:0005576 1
## 3 GO:0005615 1
## 4 GO:0008150 1
## 5 GO:0070062 1
## 6 GO:0072562 1
## 7 GO:0005575 1
## 8 GO:0043226 1
## 9 GO:0005515 1
library(org.Hs.eg.db)
ids=c("1")
select(org.Hs.eg.db, keys=ids, columns="GO", keytype="ENTREZID")
## ENTREZID GO EVIDENCE ONTOLOGY
## 1 1 GO:0003674 ND MF
## 2 1 GO:0005576 IDA CC
## 3 1 GO:0005615 IDA CC
## 4 1 GO:0008150 ND BP
## 5 1 GO:0070062 IDA CC
## 6 1 GO:0072562 IDA CC
When this exercise was written, biomaRt returned a different number of GO terms than org.Hs.eg.db. This may not always be true in the future, as both of these resources are updated. It is expected, however, that the web service (which is updated continuously) will fall in and out of sync with the org.Hs.eg.db package (which is updated twice a year). This is an important difference, as each approach has its own advantages and disadvantages. The advantage of continuous updates is that you always have the very latest annotations, which change frequently for something like GO terms. The advantage of using a package is that the results are frozen to a release of Bioconductor, which can help you get the same answers a few years from now that you get today (reproducibility).
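The difference between the two results can be made explicit with set operations. The sketch below uses only base R, with the GO IDs copied by hand from the getBM() and select() outputs above (the variable names are illustrative). Rerunning the queries today may return different IDs.

```r
# GO IDs copied from the getBM() output above
go_biomart <- c("GO:0003674", "GO:0005576", "GO:0005615", "GO:0008150",
                "GO:0070062", "GO:0072562", "GO:0005575", "GO:0043226",
                "GO:0005515")

# GO IDs copied from the select() output above
go_orgdb <- c("GO:0003674", "GO:0005576", "GO:0005615", "GO:0008150",
              "GO:0070062", "GO:0072562")

# Terms reported by only one of the two resources
only_biomart <- setdiff(go_biomart, go_orgdb)
only_orgdb   <- setdiff(go_orgdb, go_biomart)

only_biomart
only_orgdb
```

For the outputs shown here, every term from org.Hs.eg.db also appears in the biomaRt result, and biomaRt reports three additional terms.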
sessionInfo()
## R version 3.2.0 Patched (2015-04-22 r68234)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## Running under: OS X 10.6.8 (Snow Leopard)
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TxDb.Athaliana.BioMart.plantsmart22_3.0.0
## [2] biomaRt_2.24.0
## [3] Homo.sapiens_1.1.2
## [4] OrganismDbi_1.10.0
## [5] hgu95av2.db_3.1.2
## [6] GO.db_3.1.2
## [7] parathyroidSE_1.6.0
## [8] ensemblVEP_1.8.0
## [9] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [10] BSgenome_1.36.0
## [11] rtracklayer_1.28.2
## [12] org.Mm.eg.db_3.1.2
## [13] org.Hs.eg.db_3.1.2
## [14] RSQLite_1.0.0
## [15] DBI_0.3.1
## [16] TxDb.Mmusculus.UCSC.mm10.ensGene_3.1.2
## [17] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [18] GenomicFeatures_1.20.0
## [19] AnnotationDbi_1.30.1
## [20] Biobase_2.28.0
## [21] AnnotationHub_2.0.1
## [22] VariantAnnotation_1.14.0
## [23] Rsamtools_1.20.1
## [24] Biostrings_2.36.0
## [25] XVector_0.8.0
## [26] GenomicRanges_1.20.3
## [27] GenomeInfoDb_1.4.0
## [28] IRanges_2.2.1
## [29] S4Vectors_0.6.0
## [30] BiocGenerics_0.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.5 BiocInstaller_1.18.1
## [3] formatR_1.2 futile.logger_1.4.1
## [5] bitops_1.0-6 futile.options_1.0.0
## [7] tools_3.2.0 zlibbioc_1.14.0
## [9] digest_0.6.8 evaluate_0.7
## [11] graph_1.46.0 shiny_0.11.1
## [13] stringr_0.6.2 httr_0.6.1
## [15] knitr_1.10 R6_2.0.1
## [17] RBGL_1.44.0 XML_3.98-1.1
## [19] BiocParallel_1.2.1 RJSONIO_1.3-0
## [21] lambda.r_1.1.7 htmltools_0.2.6
## [23] GenomicAlignments_1.4.1 mime_0.3
## [25] interactiveDisplayBase_1.6.0 xtable_1.7-4
## [27] httpuv_1.3.2 RCurl_1.95-4.6
## [29] markdown_0.7.7