Data Annotation Process for gDR Pipeline
Introduction
Before running the gDR pipeline, it is essential to annotate the data properly with drug and cell line information. This document outlines the process of data annotation and the requirements for the annotation files.
Annotation Files
gDR uses two types of annotation: drug annotation and cell line
annotation. These annotations are added to a data table before running
the pipeline. The scripts for adding data annotation are located in
R/add_annotation.R, which contains several primary functions:
annotate_dt_with_cell_line and annotate_dt_with_drug, plus
get_cell_line_annotation and get_drug_annotation for retrieving the
default annotation for the data. Additionally, annotate_se_with_drug,
annotate_mae_with_drug, annotate_se_with_cell_line, and
annotate_mae_with_cell_line are provided to annotate
SummarizedExperiment and MultiAssayExperiment objects. It is
recommended to run the cleanup_metadata function, which adds
annotations and performs some data cleaning.
Annotation File Locations
Both drug and cell line annotation files are stored in
gDRtestData/inst/annotation_data. There are two files:
- cell_lines.csv
- drugs.csv
Users can edit these files to add their own annotations. After
updating, it is required to reinstall gDRtestData to use the new
annotations.
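As a minimal sketch of where the edited files end up after reinstalling: R drops the inst/ prefix on installation, so the files would be expected under annotation_data in the installed package. The snippet below only uses base R's system.file(), which returns an empty string when the package or file is not found, so it runs whether or not gDRtestData is installed.

```r
# Look up the annotation files shipped with an installed gDRtestData.
# system.file() returns "" when the package or the file cannot be found.
annotation_files <- c("cell_lines.csv", "drugs.csv")
annotation_paths <- vapply(
  annotation_files,
  function(f) system.file("annotation_data", f, package = "gDRtestData"),
  character(1)
)

# Flag which files were located in the current library.
found <- nzchar(annotation_paths)
print(data.frame(
  file = annotation_files,
  path = annotation_paths,
  found = found,
  row.names = NULL
))
```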
Alternatively, users can use annotation files stored outside of this package. To do so, set two environment variables:
- GDR_CELLLINE_ANNOTATION: Represents the path to the cell line annotation CSV file.
- GDR_DRUG_ANNOTATION: Represents the path to the drug annotation CSV file.
Sys.setenv(GDR_CELLLINE_ANNOTATION = "some/path/to/cell_line_annotation.csv")
Sys.setenv(GDR_DRUG_ANNOTATION = "some/path/to/drug_annotation.csv")
NOTE: gDR annotation can be sourced from different locations. Paths set through these environment variables have the highest priority: when they are set, gDR uses them as the annotation source even if other sources, such as the internal annotation databases, are available.
To turn off the use of external paths for data annotation, set both environment variables to empty strings.
Sys.setenv(GDR_CELLLINE_ANNOTATION = "")
Sys.setenv(GDR_DRUG_ANNOTATION = "")
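The precedence described above can be illustrated with a small base R sketch. The helper resolve_annotation_path and the fallback path "packaged/drugs.csv" are hypothetical and exist only for illustration; this is not how gDR implements the lookup internally, only a picture of "a non-empty environment variable wins, an empty one is ignored".

```r
# Hypothetical helper (not part of gDR): return the path from an environment
# variable when it is set to a non-empty value, otherwise return a fallback.
resolve_annotation_path <- function(env_var, fallback) {
  path <- Sys.getenv(env_var, unset = "")
  if (nzchar(path)) path else fallback
}

# "packaged/drugs.csv" is a placeholder standing in for any other source.
Sys.setenv(GDR_DRUG_ANNOTATION = "some/path/to/drug_annotation.csv")
from_env <- resolve_annotation_path("GDR_DRUG_ANNOTATION", "packaged/drugs.csv")

# An empty value switches the external path off again.
Sys.setenv(GDR_DRUG_ANNOTATION = "")
from_fallback <- resolve_annotation_path("GDR_DRUG_ANNOTATION", "packaged/drugs.csv")
```

Treating an empty string the same as an unset variable keeps the "turn off" step a single Sys.setenv() call.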
Annotation Requirements
gDR has specific requirements for the annotation files to properly annotate the data.
Drug Annotation Requirements
The obligatory fields for drug annotation are:
- Gnumber: Represents the ID of the drug.
- DrugName: Represents the name of the drug.
- drug_moa: Represents the drug mechanism of action.
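For illustration, a drug annotation file with these three columns could be produced as follows. This is a sketch in base R; the drug IDs, names, and mechanisms are made-up placeholders, and the file is written to a temporary location rather than to any path gDR expects.

```r
# Build a minimal drug annotation table with the three obligatory columns.
# The drug IDs, names, and mechanisms below are made-up placeholders.
drug_annotation <- data.frame(
  Gnumber = c("D1", "D2", "D3"),
  DrugName = c("drug_one", "drug_two", "drug_three"),
  drug_moa = c("moa_A", "moa_B", NA),  # NA marks an unknown mechanism
  stringsAsFactors = FALSE
)

# Write the table to a temporary CSV and read it back to check the header.
drug_csv <- tempfile(fileext = ".csv")
write.csv(drug_annotation, drug_csv, row.names = FALSE)
reloaded <- read.csv(drug_csv, stringsAsFactors = FALSE)

# Any obligatory column that did not survive the round trip is listed here.
required_drug_cols <- c("Gnumber", "DrugName", "drug_moa")
missing_cols <- setdiff(required_drug_cols, names(reloaded))
```

A file like this could then be supplied to gDR through the GDR_DRUG_ANNOTATION environment variable.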
Cell Line Annotation Requirements
The obligatory fields for cell line annotation are:
- clid: Represents the cell line ID.
- CellLineName: Represents the name of the cell line.
- Tissue: Represents the primary tissue of the cell line.
- ReferenceDivisionTime: Represents the doubling time of the cell line in hours.
- parental_identifier: Represents the name of the parental cell line.
- subtype: Represents the subtype of the cell line.
If some information is not known for a cell line or drug, the corresponding field should be left empty or set to NA. Missing values are filled with the default value "unknown"; because the annotation functions take a fill parameter, the user can change this default.
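The column requirements and the fill behaviour can be sketched in base R as below. The helper fill_unknown is hypothetical and only mimics the idea of a fill value; it is not gDR's implementation, and it deliberately touches character columns only, which is an assumption of this sketch. The cell line values are made up.

```r
# Obligatory cell line annotation columns, as listed above.
required_cell_line_cols <- c(
  "clid", "CellLineName", "Tissue",
  "ReferenceDivisionTime", "parental_identifier", "subtype"
)

# Made-up example table with some unknown fields left as NA or empty.
cell_lines <- data.frame(
  clid = c("CL1", "CL2"),
  CellLineName = c("line_one", "line_two"),
  Tissue = c("lung", NA),
  ReferenceDivisionTime = c(24, NA),
  parental_identifier = c("line_one", ""),
  subtype = c(NA, "subtype_x"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA or empty strings in character columns.
fill_unknown <- function(df, fill = "unknown") {
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      blank <- is.na(df[[col]]) | !nzchar(df[[col]])
      df[[col]][blank] <- fill
    }
  }
  df
}

# Fail early if an obligatory column is absent, then fill the gaps.
stopifnot(all(required_cell_line_cols %in% names(cell_lines)))
filled <- fill_unknown(cell_lines)
```

Passing a different value, for example fill_unknown(cell_lines, fill = "not available"), shows how a user-chosen fill would replace the default.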
Annotating SummarizedExperiment and MultiAssayExperiment Objects
To annotate SummarizedExperiment and MultiAssayExperiment objects, use
the functions annotate_se_with_drug, annotate_mae_with_drug,
annotate_se_with_cell_line, and annotate_mae_with_cell_line. These
functions take the experiment objects and the corresponding annotation
tables as input and return the annotated objects.
# Example for SummarizedExperiment
se <- SummarizedExperiment::SummarizedExperiment(
  rowData = data.table::data.table(Gnumber = c("D1", "D2", "D3"))
)
drug_annotation <- get_drug_annotation(
  data.table::as.data.table(SummarizedExperiment::rowData(se))
)
annotated_se <- annotate_se_with_drug(se, drug_annotation)

# Example for MultiAssayExperiment
mae <- MultiAssayExperiment::MultiAssayExperiment(
  experiments = list(exp1 = SummarizedExperiment::SummarizedExperiment(
    rowData = data.table::data.table(clid = c("CL1", "CL2", "CL3"))
  ))
)
cell_line_annotation <- get_cell_line_annotation(
  data.table::as.data.table(
    SummarizedExperiment::rowData(
      MultiAssayExperiment::experiments(mae)[[1]]
    )
  )
)
annotated_mae <- annotate_mae_with_cell_line(mae, cell_line_annotation)
Additional Information for Genentech/Roche Users
For users within Genentech/Roche, we recommend utilizing our internal
annotation databases. We provide the gDRinternal package specifically
for internal users, which includes functions for managing internal
annotation data. If you are an internal user, you can install the
gDRinternal package, and gDRcore will automatically utilize this
package as a source of data annotation.
SessionInfo
sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.5 knitr_1.45
#> [4] rlang_1.1.6 xfun_0.42 stringi_1.8.7
#> [7] purrr_1.0.4 textshaping_0.3.7 jsonlite_2.0.0
#> [10] glue_1.8.0 htmltools_0.5.7 ragg_1.2.7
#> [13] sass_0.4.8 rmarkdown_2.25 evaluate_0.23
#> [16] jquerylib_0.1.4 fastmap_1.1.1 yaml_2.3.8
#> [19] lifecycle_1.0.4 memoise_2.0.1 bookdown_0.37
#> [22] BiocManager_1.30.22 stringr_1.5.1 compiler_4.3.0
#> [25] fs_1.6.3 systemfonts_1.0.5 digest_0.6.34
#> [28] R6_2.6.1 magrittr_2.0.3 bslib_0.6.1
#> [31] tools_4.3.0 pkgdown_2.0.7 cachem_1.0.8
#> [34] desc_1.4.3