Last updated: 2020-04-26

Checks: 7 0

Knit directory: Bgee/

This reproducible R Markdown analysis was created with workflowr (version 1.6.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200417) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version c287d01. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory

Untracked files:
    Untracked:  Drosophila_melanogaster_Bgee_14_1/
    Untracked:  analysis/.here
    Untracked:  genes_Drosophila_melanogaster.tsv
    Untracked:  release.tsv
    Untracked:  species_Bgee_14_1.tsv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/topanat.Rmd) and HTML (docs/topanat.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd c287d01 SFonsecaCosta 2020-04-26 Update
html ae29961 SFonsecaCosta 2020-04-22 Build site.
Rmd 9907294 SFonsecaCosta 2020-04-22 clean text
html 8d9e88f SFonsecaCosta 2020-04-20 Build site.
Rmd f18beb5 SFonsecaCosta 2020-04-20 Minor things
html 8d821e2 SFonsecaCosta 2020-04-20 Build site.
Rmd 9073f83 SFonsecaCosta 2020-04-20 add analysis

Load the packages

library(BgeeDB)
library(biomaRt)
library(ggplot2)
library(here)

The TopAnat allows you to make GO-like enrichment of anatomical terms, mapped to genes by expression patterns.

In this section we will use 2 different species, Drosophila melanogaster and Bos taurus to run the TopAnat analysis.

Initially we will use the output results from our previous analysis with Drosophila melanogaster to run TopAnat, after that we will run the TopAnat tool with data retrieved from biomart that match specific phenotype from a non-model organism.

Model organism

Create a Bgee object

In order to start the analysis in TopAnat you have to create a Bgee object, as demonstrated before in the download data section.

Note that in this example we do not specify any data type in the argument dataType, which means we will use all data types from Bgee.

targetDrosophila <- Bgee$new(species = "Drosophila_melanogaster")

NOTE: You did not specify any data type. The argument dataType will be set to c("rna_seq","affymetrix","est","in_situ") for the next steps.

Querying Bgee to get release information...

NOTE: the file describing Bgee species information for release 14_1 was found in the download directory /Users/sfonseca1/Bgee. Data will not be redownloaded.

API key built: 342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f

loadTopAnatData()

After that you should download the data for the reference species in order to allows you to perform the TopAnat analysis. The function loadTopAnatData() loads multiple files for the target species, as:

  • an organ relationships file

  • an organ names file

  • a gene to organs mapping file

drosophilaTopAnatData <- loadTopAnatData(targetDrosophila, confidence = "silver", stage = "UBERON:0000092")

Building URLs to retrieve organ relationships from Bgee.........
   URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_anat_entity_relations&display_type=tsv&species_list=7227&attr_list=SOURCE_ID&attr_list=TARGET_ID&api_key=342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f&source=BgeeDB_R_package&source_version=2.12.1)
   Submitting URL to Bgee webservice (can be long)
   Got results from Bgee webservice. Files are written in "/Users/sfonseca1/Bgee/Drosophila_melanogaster_Bgee_14_1"

Building URLs to retrieve organ names from Bgee.................
   URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_anat_entities&display_type=tsv&species_list=7227&attr_list=ID&attr_list=NAME&api_key=342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f&source=BgeeDB_R_package&source_version=2.12.1)
   Submitting URL to Bgee webservice (can be long)
   Got results from Bgee webservice. Files are written in "/Users/sfonseca1/Bgee/Drosophila_melanogaster_Bgee_14_1"

Building URLs to retrieve mapping of gene to organs from Bgee...
   URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_expression_calls&display_type=tsv&species_list=7227&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&api_key=342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f&source=BgeeDB_R_package&source_version=2.12.1&data_qual=SILVER&stage_id=UBERON:0000092)
   Submitting URL to Bgee webservice (can be long)
   Got results from Bgee webservice. Files are written in "/Users/sfonseca1/Bgee/Drosophila_melanogaster_Bgee_14_1"

Parsing the results.............................................

Adding BGEE:0 as unique root of all terms of the ontology.......

Done.

Note that the loadTopAnatData() function allows you to retrieve data from a specific developmental stage by using the stage argument and allow you to specify the confidence of the quality of expression calls by using the argument confidence. In this particular case we use all data and the developmental stage UBERON:0000092.

Background/foreground genes

The next step of the analysis relies on topGO package, for this a list of genes in the foreground and in the background are necessary.

Note that in this case the foreground genes will be the genes detected with high variability between the two samples from the experiment SRP002072 analysed in the processing data section.

Retrieve info from biomaRt

The initial step is to retrieve information (ensembl gene id) for the target species by using biomart.

ensemblDM <- useMart("ENSEMBL_MART_ENSEMBL", dataset="dmelanogaster_gene_ensembl", host="mar2016.archive.ensembl.org")

# background genes
universeDM <- getBM(attributes=c("ensembl_gene_id"),mart=ensemblDM)
Cache found
# foreground genes from the analysis
foregroundDM <- read.table(file = here("genes_Drosophila_melanogaster.tsv"), header=F, sep="\t")
foregroundDM <- unlist(foregroundDM, use.names = FALSE)

We will use this list of genes to create our geneList object.

geneListDM <- factor(as.integer(unique(universeDM$ensembl_gene_id) %in% foregroundDM))
names(geneListDM) <- unique(universeDM$ensembl_gene_id)
summary(geneListDM)
    0     1 
17533    26 

topGO object

Create the object with data loaded and the gene list with foreground genes.

myTopAnatObject_DM <-  topAnat(drosophilaTopAnatData, geneListDM)

Checking topAnatData object.............

Checking gene list......................

WARNING: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis. 17478 genes in background will be kept.

Building most specific Ontology terms...  (  461  Ontology terms found. )

Building DAG topology...................  (  1048  Ontology terms and  1964  relations. )

Annotating nodes (Can be long)..........  (  17478  genes annotated to the Ontology terms. )

Enrichment test

After creating the final object, the enrichment analysis is performed by using the statistical tests implemented in topGO package.

results_DM <- runTest(myTopAnatObject_DM, algorithm = 'classic', statistic = 'fisher')

             -- Classic Algorithm -- 

         the algorithm is scoring 181 nontrivial nodes
         parameters: 
             test statistic: fisher

Results

The results can be retrieved by specifying the desired cutoff using the cutoff argument. You can also export the table by ordering the desire column specifying that in ordering argument.

tableOver <- makeTable(drosophilaTopAnatData, myTopAnatObject_DM, results_DM)

Building the results table for the 406 significant terms at FDR threshold of 1...
Ordering results by pValue column in increasing order...
Done
## look 
head(tableOver,4)
                      organId                  organName annotated significant
UBERON:0000992 UBERON:0000992               female gonad     12885          26
UBERON:0003134 UBERON:0003134  female reproductive organ     12885          26
UBERON:0000474 UBERON:0000474 female reproductive system     13078          26
UBERON:0003100 UBERON:0003100            female organism     13078          26
               expected foldEnrichment       pValue        FDR
UBERON:0000992    19.17       1.356286 0.0003585320 0.04185368
UBERON:0003134    19.17       1.356286 0.0003585320 0.04185368
UBERON:0000474    19.45       1.336761 0.0005279216 0.04185368
UBERON:0003100    19.45       1.336761 0.0005279216 0.04185368

Non-Model organism

In this example we will use a non-model organism Bos taurus (cattle) to run the TopAnat analysis.

Create a Bgee object

In order to start the analysis in TopAnat you have to create a new Bgee object, as demonstrated before.

Note that in this example we do not specify any data type in the argument dataType, as in the previous example, which means that we will use all data types from Bgee once again.

targetBT <- Bgee$new(species = "Bos_taurus")

NOTE: You did not specify any data type. The argument dataType will be set to c("rna_seq","affymetrix","est","in_situ") for the next steps.

Querying Bgee to get release information...

NOTE: the file describing Bgee species information for release 14_1 was found in the download directory /Users/sfonseca1/Bgee. Data will not be redownloaded.

API key built: 342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f

loadTopAnatData()

After that you should download the data for the reference species in order to allows you to perform the TopAnat analysis. Note that in this case we will not specify any developmental stage or confidence in the calls.

bostaurusTopAnatData <- loadTopAnatData(targetBT)

Building URLs to retrieve organ relationships from Bgee.........
   URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_anat_entity_relations&display_type=tsv&species_list=9913&attr_list=SOURCE_ID&attr_list=TARGET_ID&api_key=342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f&source=BgeeDB_R_package&source_version=2.12.1)
   Submitting URL to Bgee webservice (can be long)
   Got results from Bgee webservice. Files are written in "/Users/sfonseca1/Bgee/Bos_taurus_Bgee_14_1"

Building URLs to retrieve organ names from Bgee.................
   URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_anat_entities&display_type=tsv&species_list=9913&attr_list=ID&attr_list=NAME&api_key=342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f&source=BgeeDB_R_package&source_version=2.12.1)
   Submitting URL to Bgee webservice (can be long)
   Got results from Bgee webservice. Files are written in "/Users/sfonseca1/Bgee/Bos_taurus_Bgee_14_1"

Building URLs to retrieve mapping of gene to organs from Bgee...
   URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_expression_calls&display_type=tsv&species_list=9913&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&api_key=342617eeb6e728567eeaf07855efc4fa274e6ad21dde329ecce7bd3668c4efa9f23c364808e45b7b772594a592dfe6855ff75027e7983a848ff43a25416ebe6f&source=BgeeDB_R_package&source_version=2.12.1&data_qual=SILVER)
   Submitting URL to Bgee webservice (can be long)
   Got results from Bgee webservice. Files are written in "/Users/sfonseca1/Bgee/Bos_taurus_Bgee_14_1"

Parsing the results.............................................

Adding BGEE:0 as unique root of all terms of the ontology.......

Done.

Background/foreground genes

In the next step we will perform the TopAnat analysis with topGO, for this a list of genes in the foreground and in the background are necessary.

In this example we will retrieve genes enriched for expression in muscle.

Retrieve info from biomaRt

The initial step is to retrieve information for the target species by using biomart.

ensembl_BT <- useMart("ENSEMBL_MART_ENSEMBL", dataset="btaurus_gene_ensembl", host="mar2016.archive.ensembl.org")

# background genes
universe_BT <- getBM(attributes=c("ensembl_gene_id","phenotype_description"),mart=ensembl_BT)
Cache found
# select phenotypes related to muscle
phenotypes <- grep("musc", unique(universe_BT$phenotype_description), value=T)

# Foreground genes are those with an annotated phenotype related to muscle.
myGenes <- unique(universe_BT$ensembl_gene_id[universe_BT$phenotype_description %in% phenotypes])

# list vector with 0 and 1  
geneList_BT <- factor(as.integer(unique(universe_BT$ensembl_gene_id) %in% myGenes))
names(geneList_BT) <- unique(universe_BT$ensembl_gene_id)
summary(geneList_BT)
    0     1 
24609     7 

Note that you have just 7 genes in the foreground that match ‘musc’ keyword when you extract uniques with getBM() function.

If we do a search in UniProtKB/Swiss-Prot entries with keyword ‘muscle’ we will found the following genes that can be part of our foreground:

phenotypes <- c("ENSBTAG00000008394", "ENSBTAG00000006030", "ENSBTAG00000015644", "ENSBTAG00000011424", "ENSBTAG00000010799", "ENSBTAG00000014614", "ENSBTAG00000011869", "ENSBTAG00000045757", "ENSBTAG00000026266", "ENSBTAG00000022158", "ENSBTAG00000006419", "ENSBTAG00000030425", "ENSBTAG00000009707", "ENSBTAG00000009749", "ENSBTAG00000033217", "ENSBTAG00000007196", "ENSBTAG00000021218", "ENSBTAG00000005333", "ENSBTAG00000006424", "ENSBTAG00000005353", "ENSBTAG00000021992", "ENSBTAG00000026972", "ENSBTAG00000005970", "ENSBTAG00000046332", "ENSBTAG00000018285", "ENSBTAG00000005714", "ENSBTAG00000016024", "ENSBTAG00000017992", "ENSBTAG00000014003", "ENSBTAG00000018204", "ENSBTAG00000019159")

We will use this list of genes to create our geneList object.

geneList_BT <- factor(as.integer(unique(universe_BT$ensembl_gene_id) %in% phenotypes))
names(geneList_BT) <- unique(universe_BT$ensembl_gene_id)
summary(geneList_BT)
    0     1 
24585    31 

topGO object

Create the object with data loaded and the gene list with foreground genes (phenotypes from the search in UniProtKB/Swiss-Prot).

myTopAnatObject_BT <-  topAnat(bostaurusTopAnatData, geneList_BT)

Checking topAnatData object.............

Checking gene list......................

WARNING: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis. 22645 genes in background will be kept.

Building most specific Ontology terms...  (  19  Ontology terms found. )

Building DAG topology...................  (  172  Ontology terms and  260  relations. )

Annotating nodes (Can be long)..........  (  22645  genes annotated to the Ontology terms. )

Enrichment test

After creating the final object, the enrichment analysis is performed by using the statistical tests implemented in topGO package. Note that in this example we will use weight in the argument algorithm.

results_BT <- runTest(myTopAnatObject_BT, algorithm = 'weight', statistic = 'fisher')

             -- Weight Algorithm -- 

         The algorithm is scoring 172 nontrivial nodes
         parameters: 
             test statistic: fisher : ratio

     Level 19:  1 nodes to be scored.

     Level 18:  1 nodes to be scored.

     Level 17:  1 nodes to be scored.

     Level 16:  2 nodes to be scored.

     Level 15:  4 nodes to be scored.

     Level 14:  7 nodes to be scored.

     Level 13:  14 nodes to be scored.

     Level 12:  16 nodes to be scored.

     Level 11:  12 nodes to be scored.

     Level 10:  17 nodes to be scored.

     Level 9:   17 nodes to be scored.

     Level 8:   22 nodes to be scored.

     Level 7:   18 nodes to be scored.

     Level 6:   12 nodes to be scored.

     Level 5:   9 nodes to be scored.

     Level 4:   10 nodes to be scored.

     Level 3:   4 nodes to be scored.

     Level 2:   4 nodes to be scored.

     Level 1:   1 nodes to be scored.

Results

Global results

The results can be retrieved by specifying the desired cutoff using the cutoff argument. You can also export the table by ordering the desire column specifying that in ordering argument.

tableOver <- makeTable(bostaurusTopAnatData, myTopAnatObject_BT, results_BT)

Building the results table for the 172 significant terms at FDR threshold of 1...
Ordering results by pValue column in increasing order...
Done
grph <- ggplot(tableOver, aes(x = reorder(organId, FDR), y = FDR))
grph <- grph + geom_bar(stat="identity", color='black',fill='gray')
grph <- grph + theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + xlab("Anatomic entities") + ylab("FDR") + ggtitle("Results - enrichment test")
grph

Version Author Date
8d821e2 SFonsecaCosta 2020-04-20

Retrieve significant genes

tableOver <- makeTable(bostaurusTopAnatData, myTopAnatObject_BT, results_BT, cutoff = 0.01, ordering = -8)

Building the results table for the 5 significant terms at FDR threshold of 0.01...
Ordering results by FDR column in decreasing order...
Done
targetUberon <- c("UBERON:0001134", "UBERON:0001111", "UBERON:0001401", "UBERON:0002000", "UBERON:0034908")

termStat(myTopAnatObject_BT, targetUberon) 
               Annotated Significant Expected
UBERON:0001134     16812          31    23.01
UBERON:0001111     14106          30    19.31
UBERON:0001401     11858          28    16.23
UBERON:0002000     14178          31    19.41
UBERON:0034908     14475          31    19.82

In order to use some target gene in the next section (Sparql endpoint), we should retrieve the ensembl IDs for one of the anatomical entities that are significant.

annotated <- genesInTerm(myTopAnatObject_BT, targetUberon)[["UBERON:0001134"]]
annotated[annotated %in% sigGenes(myTopAnatObject_BT)]
 [1] "ENSBTAG00000005333" "ENSBTAG00000005353" "ENSBTAG00000005714"
 [4] "ENSBTAG00000005970" "ENSBTAG00000006030" "ENSBTAG00000006419"
 [7] "ENSBTAG00000006424" "ENSBTAG00000007196" "ENSBTAG00000008394"
[10] "ENSBTAG00000009707" "ENSBTAG00000009749" "ENSBTAG00000010799"
[13] "ENSBTAG00000011424" "ENSBTAG00000011869" "ENSBTAG00000014003"
[16] "ENSBTAG00000014614" "ENSBTAG00000015644" "ENSBTAG00000016024"
[19] "ENSBTAG00000017992" "ENSBTAG00000018204" "ENSBTAG00000018285"
[22] "ENSBTAG00000019159" "ENSBTAG00000021218" "ENSBTAG00000021992"
[25] "ENSBTAG00000022158" "ENSBTAG00000026266" "ENSBTAG00000026972"
[28] "ENSBTAG00000030425" "ENSBTAG00000033217" "ENSBTAG00000045757"
[31] "ENSBTAG00000046332"

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] here_0.1             ggplot2_3.3.0        biomaRt_2.42.1      
 [4] BgeeDB_2.12.1        tidyr_1.0.2          topGO_2.38.1        
 [7] SparseM_1.78         GO.db_3.10.0         AnnotationDbi_1.48.0
[10] IRanges_2.20.2       S4Vectors_0.24.4     Biobase_2.46.0      
[13] graph_1.64.0         BiocGenerics_0.32.0  workflowr_1.6.1     

loaded via a namespace (and not attached):
 [1] httr_1.4.1           bit64_0.9-7          assertthat_0.2.1    
 [4] askpass_1.1          BiocFileCache_1.10.2 blob_1.2.1          
 [7] yaml_2.2.1           progress_1.2.2       pillar_1.4.3        
[10] RSQLite_2.2.0        backports_1.1.6      lattice_0.20-41     
[13] glue_1.4.0           digest_0.6.25        promises_1.1.0      
[16] colorspace_1.4-1     htmltools_0.4.0      httpuv_1.5.2        
[19] XML_3.99-0.3         pkgconfig_2.0.3      purrr_0.3.3         
[22] scales_1.1.0         whisker_0.4          later_1.0.0         
[25] git2r_0.26.1         tibble_3.0.0         openssl_1.4.1       
[28] farver_2.0.3         ellipsis_0.3.0       withr_2.1.2         
[31] cli_2.0.2            magrittr_1.5         crayon_1.3.4        
[34] memoise_1.1.0        evaluate_0.14        fs_1.4.1            
[37] fansi_0.4.1          tools_3.6.0          data.table_1.12.8   
[40] prettyunits_1.1.1    hms_0.5.3            lifecycle_0.2.0     
[43] matrixStats_0.56.0   stringr_1.4.0        munsell_0.5.0       
[46] compiler_3.6.0       rlang_0.4.5          grid_3.6.0          
[49] RCurl_1.98-1.1       rappdirs_0.3.1       labeling_0.3        
[52] bitops_1.0-6         rmarkdown_2.1        gtable_0.3.0        
[55] DBI_1.1.0            curl_4.3             R6_2.4.1            
[58] knitr_1.28           dplyr_0.8.5          bit_1.1-15.2        
[61] rprojroot_1.3-2      stringi_1.4.6        Rcpp_1.0.4.6        
[64] vctrs_0.2.4          dbplyr_1.4.2         tidyselect_1.0.0    
[67] xfun_0.13