To install rosace
start R and first install devtools by
typing:
and install rosace
by typing:
If you have cloned the git repository locally, navigate to the
rosace
folder and type:
Next load rosace
with:
To explain how to use rosace
we provide an example based
on the OCT1 drug cytotoxicity screen (https://www.biorxiv.org/content/10.1101/2023.06.06.543963v1.full).
The screen has three replicates. The count matrix is stored in “data/oct1.rda”.
First, load the count file into the Assay object. Each replicate in the experiment will form an Assay object and will share the same “key” (1SM73).
assay1 <- CreateAssayObject(counts = as.matrix(oct1_rep1[2:ncol(oct1_rep1)]),
var.names = oct1_rep1$hgvs,
key = key, rep = 1, type = type)
## Warning: filtering 77 variants that have more than 0.5 missing data
## Warning: filtering 187 variants that have less than 20 count
assay2 <- CreateAssayObject(counts = as.matrix(oct1_rep2[2:ncol(oct1_rep2)]),
var.names = oct1_rep2$hgvs,
key = key, rep = 2, type = type)
## Warning: filtering 57 variants that have more than 0.5 missing data
## Warning: filtering 113 variants that have less than 20 count
assay3 <- CreateAssayObject(counts = as.matrix(oct1_rep3[2:ncol(oct1_rep3)]),
var.names = oct1_rep3$hgvs,
key = key, rep = 3, type = type)
## Warning: filtering 115 variants that have more than 0.5 missing data
## Warning: filtering 238 variants that have less than 20 count
Next create Rosace object by adding three Assay objects together.
rosace <- CreateRosaceObject(object = assay1)
rosace <- AddAssayData(object = rosace, assay = assay2)
rosace <- AddAssayData(object = rosace, assay = assay3)
GetAssayName(rosace)
## [1] "1SM73_1" "1SM73_2" "1SM73_3"
‘CreateAssayObject’ calls function ‘FilterData’ to filter the variants with more than ‘na.rmax’% of NAs and less than ‘min.count’ total counts by default. But we might want to filter more variants later.
Then we will impute the NA data either by K-Nearest Neighbor method or fill the NA with 0.
rosace <- ImputeData(rosace, key = key, impute.method = "knn", na.rmax = 0.5)
# rosace <- ImputeData(rosace, key = key, impute.method = "zero")
With a complete count matrix in the Assay object, we will normalize the data by either a list of wild-type variants or by the total count at the time point.
rosace <- NormalizeData(rosace, key = key,
normalization.method = "wt",
wt.var.names = c("_wt"), wt.rm = TRUE)
# rosace <- NormalizeData(rosace, key = key, normalization.method = "total")
After the Assay objects are all normalized, we can integrate three replicates into an Assays object stored in the “combined.assay” slot of the Rosace object.
## [1] "1SM73"
Provide your own function for parsing “hgvs” into position, mutation, wildtype, and type of mutation.
## [1] "variants"
rosace@var.data <- rosace@var.data %>%
mutate(tmp = substr(variants, 4, nchar(variants) - 1),
position = as.numeric(gsub("[[:alpha:]]", "", tmp)),
wildtype = substr(tmp, 1, 1),
tmp = substr(tmp, 2, nchar(tmp)),
mutation = gsub("[[:digit:]]", "", tmp)) %>%
dplyr::select(-tmp)
func_map <- function(wt, mut) {
if (nchar(wt) == 0) {
return("NA")
}
if (wt == mut) {
return("synonymous")
} else if (mut == "del") {
return("deletion")
} else {
return("missense")
}
}
rosace@var.data <- rosace@var.data %>%
rowwise() %>%
mutate(type = func_map(wildtype, mutation)) %>%
ungroup()
head(rosace@var.data)
## # A tibble: 6 × 5
## variants position wildtype mutation type
## <chr> <dbl> <chr> <chr> <chr>
## 1 _wt NA "" "" NA
## 2 p.(A107A) 107 "A" "A" synonymous
## 3 p.(A107C) 107 "A" "C" missense
## 4 p.(A107D) 107 "A" "D" missense
## 5 p.(A107E) 107 "A" "E" missense
## 6 p.(A107F) 107 "A" "F" missense
There are three ways to run Rosace model: 1) Provide no position or control column. The model would treat each variant independently but with shrinkage on the variance component. 2) Provide position column only. In addition, the model would group variants together with the amino acid position information provided. 3) Provide both position and control column (recommended). In addition, the model would group all control variants together into one position label.
For the purpose of this vignette, we’ve reduced the number of variants to 100 with brute force before running Rosace, since running on the full dataset might take an hour.
# running on an Assay (one replicate)
rosace@assays$`1SM73_1`@norm.counts <- rosace@assays$`1SM73_1`@norm.counts[1:100, ]
rosace@assays$`1SM73_1`@norm.var.names <- rosace@assays$`1SM73_1`@norm.var.names[1:100]
rosace <- RunRosace(object = rosace,
name = "1SM73_1",
type = "Assay",
savedir = "../tests/results/stan/assay/",
pos.col = "position",
ctrl.col = "type",
ctrl.name = "synonymous",
install = FALSE)
# running on an AssaySet (all three replicates)
rosace@assay.sets$`1SM73`@raw.counts <- rosace@assay.sets$`1SM73`@raw.counts[1:100, ]
rosace@assay.sets$`1SM73`@combined.counts <- rosace@assay.sets$`1SM73`@combined.counts[1:100, ]
rosace@assay.sets$`1SM73`@var.names <- rosace@assay.sets$`1SM73`@var.names[1:100]
rosace <- RunRosace(object = rosace,
name = "1SM73",
type = "AssaySet",
savedir = "../tests/results/stan/assayset/",
pos.col = "position",
ctrl.col = "type",
ctrl.name = "synonymous",
install = FALSE)
The Rosace results can be retrieved using the
OutputScore
function. First, check the name of your Rosace
run:
Then extract the scores data, which includes the variant information,
functional score (mean
), standard deviation
(sd
), and the local false sign discovery rate
(lfsr
) associated with the score.
The functional score represents the slope of the linear regression applied to normalized counts across time points, which serves as a measure of cell growth. A positive score indicates the mutation has a a gain of function (GOF) effect, while a negative score indicates a loss of function (LOF) effect. The Rosace model learns a distribution of scores and reports the mean for each variant.
The lfsr
, or local false sign rate, offers a Bayesian
perspective for estimating uncertainty in the sign of the score. For a
negative mean score, lfsr
estimates the probability it
could truly be positive, and vice versa. The below section illustrates
how to use score and lfsr
to identify LOF and GOF
mutations.
To have a more comprehensive understanding and clearer view of the score distribution, we’ve made available a precomputed object based on the complete OCT1 dataset. This may help you draw parallel to your own data. libra
# obtain the data
data("oct1_rosace_scored")
# extract scores
scores.data <- OutputScore(oct1_rosace_scored, name = "1SM73_ROSACE", sig.test = 0.05)
head(scores.data)
## variants position wildtype mutation type mean sd
## 1 p.(A107A) 107 A A synonymous -0.2755549 0.2365942
## 2 p.(A107C) 107 A C missense 0.1134388 0.2255906
## 3 p.(A107D) 107 A D missense 0.3873864 0.2108217
## 4 p.(A107E) 107 A E missense -0.6415181 0.2534535
## 5 p.(A107F) 107 A F missense -0.1401362 0.2692807
## 6 p.(A107G) 107 A G missense -1.0571610 0.2417313
## lfsr lfsr.neg lfsr.pos test.neg test.pos label
## 1 1.220757e-01 1.220757e-01 0.87792430 FALSE FALSE Neutral
## 2 3.075340e-01 6.924660e-01 0.30753402 FALSE FALSE Neutral
## 3 3.306753e-02 9.669325e-01 0.03306753 FALSE FALSE Neutral
## 4 5.685139e-03 5.685139e-03 0.99431486 TRUE FALSE Neg
## 5 3.013891e-01 3.013891e-01 0.69861090 FALSE FALSE Neutral
## 6 6.119409e-06 6.119409e-06 0.99999388 TRUE FALSE Neg
First, we can visualize the distribution of scores across mutation
types with the scoreDensity
function.
scoreDensity(scores.data,
hist = FALSE,
savedir = "../tests/results/stan/assayset_full/plot/",
name = "DensityPlot_1SM73")
For other visualizations including visualizing position-wise distributions, refer to the Visualizing functional score results vignette.
A common goal in DMS experiments is to determine which mutations
cause a GOF or a LOF. One natural approach is to conduct hypothesis
testing to check whether a score is significantly different from 0.
lfsr
stands for local false sign rate, and is defined as
the minimum of the density on the left of 0 (lfsr.pos
) and
on the right of 0 (lfsr.neg
). The OutputScore
function automatically does a two-sided test based on the
lfsr
with threshold 0.05: if lfsr.neg
<
0.05/2, we are confident that the mass of the distribution is on the
left side of 0 and label the variant “Neg”, and vice versa.
For the OCT1 data, positive scores represent loss-of-function and negative scores represent gain-of-function. You can update the label to “LOF”, “Neutral”, and “GOF” (instead of “Neg”, “Neutral”, and “Pos”) by the following function.
scores.data <- scores.data %>%
mutate(label = case_when(
label == "Neg" ~ "GOF",
label == "Pos" ~ "LOF",
label == "Neutral" ~ "Neutral"
))
Keep in mind that setting a threshold is a somewhat counter-Bayesian approach. The threshold should be adaptable, depending on the nature of the experiment and distribution of the data. You may be more or less conservative with thresholding based on the context and prior knowledge.
##
## GOF LOF Neutral
## deletion 67 359 102
## missense 2965 3682 3614
## synonymous 100 26 390
In the growth screen, hypothesis testing for negative scores is much harder than that for the positive scores because random dropout of a variant is common, even for the synonymous mutations, especially when the mass of LOF variant score is shifting towards positive (faster growth rate than synonymous). As a result, it is expected to see more false discovery in the “GOF” bin in the OCT1 data. One might adjust the threshold to be more stringent when discovering “Neg” variants, and looser when discovering “Pos” variants.
Sometimes, we want to select the most “LOF” or “GOF” variants for
validation. Our most confident set of variants are those that have a
high absolute value of the score and small lfsr
close to
0.
For example, the 5 most “LOF” and “GOF” variants in the OCT1 data can be found by the function below.
## variants position wildtype mutation type mean sd
## 1 p.(D5M) 5 D M missense 3.327057 0.3870681
## 2 p.(G508del) 508 G del deletion 3.204218 0.3741755
## 3 p.(D5W) 5 D W missense 2.724841 0.3856855
## 4 p.(A506L) 506 A L missense 2.636134 0.3641000
## 5 p.(N156del) 156 N del deletion 2.608736 0.2494461
## lfsr lfsr.neg lfsr.pos test.neg test.pos label
## 1 4.143899e-18 1 4.143899e-18 FALSE TRUE LOF
## 2 5.478980e-18 1 5.478980e-18 FALSE TRUE LOF
## 3 8.034936e-13 1 8.034936e-13 FALSE TRUE LOF
## 4 2.241158e-13 1 2.241158e-13 FALSE TRUE LOF
## 5 6.725221e-26 1 6.725221e-26 FALSE TRUE LOF
## variants position wildtype mutation type mean sd
## 1 p.(A219V) 219 A V missense -3.617156 0.3442180
## 2 p.(L505C) 505 L C missense -3.040896 0.4894418
## 3 p.(H304M) 304 H M missense -2.904934 0.4207863
## 4 p.(C473S) 473 C S missense -2.777315 0.4695054
## 5 p.(A21E) 21 A E missense -2.719693 0.3956385
## lfsr lfsr.neg lfsr.pos test.neg test.pos label
## 1 3.954126e-26 3.954126e-26 1 TRUE FALSE GOF
## 2 2.599326e-10 2.599326e-10 1 TRUE FALSE GOF
## 3 2.535298e-12 2.535298e-12 1 TRUE FALSE GOF
## 4 1.655296e-09 1.655296e-09 1 TRUE FALSE GOF
## 5 3.117210e-12 3.117210e-12 1 TRUE FALSE GOF
When running with the position mode (default), Rosace infers position-specific mean estimate (“phi”) and position-specific variance estimate (“sigma2”). The hypothesis testing columns can be interpreted the same way as the ones in the variant-specific section.
scores.data.list <- OutputScore(oct1_rosace_scored, pos.info = TRUE, name = "1SM73_ROSACE", sig.test = 0.05)
scores.var <- scores.data.list$df_variant
scores.pos <- scores.data.list$df_position
head(scores.pos)
## # A tibble: 6 × 11
## pos phi_mean phi_sd sigma2_mean sigma2_sd lfsr.neg lfsr.pos lfsr test.neg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 2 -0.736 0.261 1.18 0.484 0.266 0.734 0.266 FALSE
## 2 3 -0.773 0.205 0.615 0.256 0.104 0.896 0.104 FALSE
## 3 4 0.297 0.231 0.959 0.382 0.622 0.378 0.378 FALSE
## 4 5 1.41 0.195 0.702 0.274 0.978 0.0225 0.0225 FALSE
## 5 6 -0.602 0.208 0.737 0.314 0.207 0.793 0.207 FALSE
## 6 7 0.364 0.252 1.23 0.464 0.616 0.384 0.384 FALSE
## # ℹ 2 more variables: test.pos <lgl>, label <chr>