Introduction to rosace

Jingyou Rao

Installation

To install rosace start R and first install devtools by typing:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

and install rosace by typing:

devtools::install_github("pimentellab/rosace")

If you have cloned the git repository locally, navigate to the rosace folder and type:

devtools::install(".") 

Next load rosace with:

library("rosace")

Example

Read count data

To explain how to use rosace we provide an example based on the OCT1 drug cytotoxicity screen (https://www.biorxiv.org/content/10.1101/2023.06.06.543963v1.full).

The screen has three replicates. The count matrix is stored in “data/oct1.rda”.

data("oct1")
key <- "1SM73"
type <- "growth"

Create Rosace object

First, load the count file into the Assay object. Each replicate in the experiment will form an Assay object and will share the same “key” (1SM73).

assay1 <- CreateAssayObject(counts = as.matrix(oct1_rep1[2:ncol(oct1_rep1)]),
                            var.names = oct1_rep1$hgvs,
                            key = key, rep = 1, type = type)
## Warning: filtering 77 variants that have more than 0.5 missing data
## Warning: filtering 187 variants that have less than 20 count
assay2 <- CreateAssayObject(counts = as.matrix(oct1_rep2[2:ncol(oct1_rep2)]),
                            var.names = oct1_rep2$hgvs,
                            key = key, rep = 2, type = type)
## Warning: filtering 57 variants that have more than 0.5 missing data
## Warning: filtering 113 variants that have less than 20 count
assay3 <- CreateAssayObject(counts = as.matrix(oct1_rep3[2:ncol(oct1_rep3)]),
                            var.names = oct1_rep3$hgvs,
                            key = key, rep = 3, type = type)
## Warning: filtering 115 variants that have more than 0.5 missing data
## Warning: filtering 238 variants that have less than 20 count

Next create Rosace object by adding three Assay objects together.

rosace <- CreateRosaceObject(object = assay1)
rosace <- AddAssayData(object = rosace, assay = assay2)
rosace <- AddAssayData(object = rosace, assay = assay3)
GetAssayName(rosace)
## [1] "1SM73_1" "1SM73_2" "1SM73_3"

Preprocessing

‘CreateAssayObject’ calls function ‘FilterData’ to filter the variants with more than ‘na.rmax’% of NAs and less than ‘min.count’ total counts by default. But we might want to filter more variants later.

rosace <- FilterData(rosace, key = key, na.rmax = 0.5, min.count = 20)

Then we will impute the NA data either by K-Nearest Neighbor method or fill the NA with 0.

rosace <- ImputeData(rosace, key = key, impute.method = "knn", na.rmax = 0.5)
# rosace <- ImputeData(rosace, key = key, impute.method = "zero")

With a complete count matrix in the Assay object, we will normalize the data by either a list of wild-type variants or by the total count at the time point.

rosace <- NormalizeData(rosace, key = key,
                        normalization.method = "wt", 
                        wt.var.names = c("_wt"), wt.rm = TRUE)
# rosace <- NormalizeData(rosace, key = key, normalization.method = "total")

After the Assay objects are all normalized, we can integrate three replicates into an Assays object stored in the “combined.assay” slot of the Rosace object.

rosace <- IntegrateData(object = rosace, key = key)
GetAssaySetName(rosace)
## [1] "1SM73"

Naive method: simple linear regression (optional)

rosace <- runSLR(rosace, name = key, type = "AssaySet")
rosace <- runSLR(rosace, name = paste(key, "1", sep = "_"), type = "Assay")
# rosace <- runSLR(rosace, name = paste(key, "2", sep = "_"), type = "Assay")
# rosace <- runSLR(rosace, name = paste(key, "3", sep = "_"), type = "Assay")

Process variants’ meta data (provide your own function)

Provide your own function for parsing “hgvs” into position, mutation, wildtype, and type of mutation.

library("dplyr")
colnames(rosace@var.data)
## [1] "variants"
rosace@var.data <- rosace@var.data %>%
  mutate(tmp = substr(variants, 4, nchar(variants) - 1),
         position = as.numeric(gsub("[[:alpha:]]", "", tmp)),
         wildtype = substr(tmp, 1, 1),
         tmp = substr(tmp, 2, nchar(tmp)),
         mutation = gsub("[[:digit:]]", "", tmp)) %>%
  dplyr::select(-tmp)

func_map <- function(wt, mut) {
  if (nchar(wt) == 0) {
    return("NA")
  }
  
  if (wt == mut) {
    return("synonymous")
  } else if (mut == "del") {
    return("deletion")
  } else {
    return("missense")
  }
}

rosace@var.data <- rosace@var.data %>%
  rowwise() %>%
  mutate(type = func_map(wildtype, mutation)) %>%
  ungroup()
head(rosace@var.data)
## # A tibble: 6 × 5
##   variants  position wildtype mutation type      
##   <chr>        <dbl> <chr>    <chr>    <chr>     
## 1 _wt             NA ""       ""       NA        
## 2 p.(A107A)      107 "A"      "A"      synonymous
## 3 p.(A107C)      107 "A"      "C"      missense  
## 4 p.(A107D)      107 "A"      "D"      missense  
## 5 p.(A107E)      107 "A"      "E"      missense  
## 6 p.(A107F)      107 "A"      "F"      missense

Run Rosace

There are three ways to run Rosace model: 1) Provide no position or control column. The model would treat each variant independently but with shrinkage on the variance component. 2) Provide position column only. In addition, the model would group variants together with the amino acid position information provided. 3) Provide both position and control column (recommended). In addition, the model would group all control variants together into one position label.

For the purpose of this vignette, we’ve reduced the number of variants to 100 with brute force before running Rosace, since running on the full dataset might take an hour.

# running on an Assay (one replicate)
rosace@assays$`1SM73_1`@norm.counts <- rosace@assays$`1SM73_1`@norm.counts[1:100, ]
rosace@assays$`1SM73_1`@norm.var.names <- rosace@assays$`1SM73_1`@norm.var.names[1:100]
rosace <- RunRosace(object = rosace,
                    name = "1SM73_1",
                    type = "Assay",
                    savedir = "../tests/results/stan/assay/",
                    pos.col = "position",
                    ctrl.col = "type",
                    ctrl.name = "synonymous",
                    install = FALSE)

# running on an AssaySet (all three replicates)
rosace@assay.sets$`1SM73`@raw.counts <- rosace@assay.sets$`1SM73`@raw.counts[1:100, ]
rosace@assay.sets$`1SM73`@combined.counts <- rosace@assay.sets$`1SM73`@combined.counts[1:100, ]
rosace@assay.sets$`1SM73`@var.names <- rosace@assay.sets$`1SM73`@var.names[1:100]
rosace <- RunRosace(object = rosace,
                    name = "1SM73",
                    type = "AssaySet",
                    savedir = "../tests/results/stan/assayset/",
                    pos.col = "position",
                    ctrl.col = "type",
                    ctrl.name = "synonymous",
                    install = FALSE)

The Rosace results can be retrieved using the OutputScore function. First, check the name of your Rosace run:

names(rosace@scores)

Then extract the scores data, which includes the variant information, functional score (mean), standard deviation (sd), and the local false sign discovery rate (lfsr) associated with the score.

scores.data <- OutputScore(rosace, pos.info = FALSE, name = "1SM73_ROSACE")
head(scores.data)

The functional score represents the slope of the linear regression applied to normalized counts across time points, which serves as a measure of cell growth. A positive score indicates the mutation has a a gain of function (GOF) effect, while a negative score indicates a loss of function (LOF) effect. The Rosace model learns a distribution of scores and reports the mean for each variant.

The lfsr, or local false sign rate, offers a Bayesian perspective for estimating uncertainty in the sign of the score. For a negative mean score, lfsr estimates the probability it could truly be positive, and vice versa. The below section illustrates how to use score and lfsr to identify LOF and GOF mutations.

Interpret Rosace Results

To have a more comprehensive understanding and clearer view of the score distribution, we’ve made available a precomputed object based on the complete OCT1 dataset. This may help you draw parallel to your own data. libra

# obtain the data 
data("oct1_rosace_scored")

# extract scores 
scores.data <- OutputScore(oct1_rosace_scored, name = "1SM73_ROSACE", sig.test = 0.05)
head(scores.data)
##    variants position wildtype mutation       type       mean        sd
## 1 p.(A107A)      107        A        A synonymous -0.2755549 0.2365942
## 2 p.(A107C)      107        A        C   missense  0.1134388 0.2255906
## 3 p.(A107D)      107        A        D   missense  0.3873864 0.2108217
## 4 p.(A107E)      107        A        E   missense -0.6415181 0.2534535
## 5 p.(A107F)      107        A        F   missense -0.1401362 0.2692807
## 6 p.(A107G)      107        A        G   missense -1.0571610 0.2417313
##           lfsr     lfsr.neg   lfsr.pos test.neg test.pos   label
## 1 1.220757e-01 1.220757e-01 0.87792430    FALSE    FALSE Neutral
## 2 3.075340e-01 6.924660e-01 0.30753402    FALSE    FALSE Neutral
## 3 3.306753e-02 9.669325e-01 0.03306753    FALSE    FALSE Neutral
## 4 5.685139e-03 5.685139e-03 0.99431486     TRUE    FALSE     Neg
## 5 3.013891e-01 3.013891e-01 0.69861090    FALSE    FALSE Neutral
## 6 6.119409e-06 6.119409e-06 0.99999388     TRUE    FALSE     Neg

Data visualization

First, we can visualize the distribution of scores across mutation types with the scoreDensity function.

scoreDensity(scores.data, 
             hist = FALSE,
             savedir = "../tests/results/stan/assayset_full/plot/", 
             name = "DensityPlot_1SM73")

For other visualizations including visualizing position-wise distributions, refer to the Visualizing functional score results vignette.

Hypothesis testing

A common goal in DMS experiments is to determine which mutations cause a GOF or a LOF. One natural approach is to conduct hypothesis testing to check whether a score is significantly different from 0. lfsr stands for local false sign rate, and is defined as the minimum of the density on the left of 0 (lfsr.pos) and on the right of 0 (lfsr.neg). The OutputScore function automatically does a two-sided test based on the lfsr with threshold 0.05: if lfsr.neg < 0.05/2, we are confident that the mass of the distribution is on the left side of 0 and label the variant “Neg”, and vice versa.

For the OCT1 data, positive scores represent loss-of-function and negative scores represent gain-of-function. You can update the label to “LOF”, “Neutral”, and “GOF” (instead of “Neg”, “Neutral”, and “Pos”) by the following function.

scores.data <- scores.data %>%
  mutate(label = case_when(
    label == "Neg" ~ "GOF",
    label == "Pos" ~ "LOF",
    label == "Neutral" ~ "Neutral"
  ))

Keep in mind that setting a threshold is a somewhat counter-Bayesian approach. The threshold should be adaptable, depending on the nature of the experiment and distribution of the data. You may be more or less conservative with thresholding based on the context and prior knowledge.

table(scores.data$type, scores.data$label)
##             
##               GOF  LOF Neutral
##   deletion     67  359     102
##   missense   2965 3682    3614
##   synonymous  100   26     390

In the growth screen, hypothesis testing for negative scores is much harder than that for the positive scores because random dropout of a variant is common, even for the synonymous mutations, especially when the mass of LOF variant score is shifting towards positive (faster growth rate than synonymous). As a result, it is expected to see more false discovery in the “GOF” bin in the OCT1 data. One might adjust the threshold to be more stringent when discovering “Neg” variants, and looser when discovering “Pos” variants.

Rank the variants

Sometimes, we want to select the most “LOF” or “GOF” variants for validation. Our most confident set of variants are those that have a high absolute value of the score and small lfsr close to 0.

For example, the 5 most “LOF” and “GOF” variants in the OCT1 data can be found by the function below.

# LOF
scores.data %>% arrange(desc(mean), lfsr.pos) %>% head(5)
##      variants position wildtype mutation     type     mean        sd
## 1     p.(D5M)        5        D        M missense 3.327057 0.3870681
## 2 p.(G508del)      508        G      del deletion 3.204218 0.3741755
## 3     p.(D5W)        5        D        W missense 2.724841 0.3856855
## 4   p.(A506L)      506        A        L missense 2.636134 0.3641000
## 5 p.(N156del)      156        N      del deletion 2.608736 0.2494461
##           lfsr lfsr.neg     lfsr.pos test.neg test.pos label
## 1 4.143899e-18        1 4.143899e-18    FALSE     TRUE   LOF
## 2 5.478980e-18        1 5.478980e-18    FALSE     TRUE   LOF
## 3 8.034936e-13        1 8.034936e-13    FALSE     TRUE   LOF
## 4 2.241158e-13        1 2.241158e-13    FALSE     TRUE   LOF
## 5 6.725221e-26        1 6.725221e-26    FALSE     TRUE   LOF
# GOF
scores.data %>% arrange(mean, lfsr.neg) %>% head(5)
##    variants position wildtype mutation     type      mean        sd
## 1 p.(A219V)      219        A        V missense -3.617156 0.3442180
## 2 p.(L505C)      505        L        C missense -3.040896 0.4894418
## 3 p.(H304M)      304        H        M missense -2.904934 0.4207863
## 4 p.(C473S)      473        C        S missense -2.777315 0.4695054
## 5  p.(A21E)       21        A        E missense -2.719693 0.3956385
##           lfsr     lfsr.neg lfsr.pos test.neg test.pos label
## 1 3.954126e-26 3.954126e-26        1     TRUE    FALSE   GOF
## 2 2.599326e-10 2.599326e-10        1     TRUE    FALSE   GOF
## 3 2.535298e-12 2.535298e-12        1     TRUE    FALSE   GOF
## 4 1.655296e-09 1.655296e-09        1     TRUE    FALSE   GOF
## 5 3.117210e-12 3.117210e-12        1     TRUE    FALSE   GOF

Position-level estiamte

When running with the position mode (default), Rosace infers position-specific mean estimate (“phi”) and position-specific variance estimate (“sigma2”). The hypothesis testing columns can be interpreted the same way as the ones in the variant-specific section.

scores.data.list <- OutputScore(oct1_rosace_scored, pos.info = TRUE, name = "1SM73_ROSACE", sig.test = 0.05)
scores.var <- scores.data.list$df_variant
scores.pos <- scores.data.list$df_position
head(scores.pos)
## # A tibble: 6 × 11
##     pos phi_mean phi_sd sigma2_mean sigma2_sd lfsr.neg lfsr.pos   lfsr test.neg
##   <dbl>    <dbl>  <dbl>       <dbl>     <dbl>    <dbl>    <dbl>  <dbl> <lgl>   
## 1     2   -0.736  0.261       1.18      0.484    0.266   0.734  0.266  FALSE   
## 2     3   -0.773  0.205       0.615     0.256    0.104   0.896  0.104  FALSE   
## 3     4    0.297  0.231       0.959     0.382    0.622   0.378  0.378  FALSE   
## 4     5    1.41   0.195       0.702     0.274    0.978   0.0225 0.0225 FALSE   
## 5     6   -0.602  0.208       0.737     0.314    0.207   0.793  0.207  FALSE   
## 6     7    0.364  0.252       1.23      0.464    0.616   0.384  0.384  FALSE   
## # ℹ 2 more variables: test.pos <lgl>, label <chr>