Notebook-scoped R libraries

Notebook-scoped R libraries enable you to create and modify custom R environments that are specific to a notebook session. When you install a notebook-scoped R library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.

Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
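
Because installation has to be repeated in every session, it can help to make the first notebook cell safe to re-run. The following is a minimal sketch, not a Databricks API: `ensure_installed()` and its `install_fn` parameter are hypothetical names introduced here for illustration. It installs only the packages that cannot be loaded from the current library paths.

```r
# Hypothetical helper (not part of Databricks or base R): install only the
# packages whose namespaces cannot be loaded in the current session.
ensure_installed <- function(pkgs, install_fn = utils::install.packages) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install_fn(to_install)
  }
  # Return the names that needed installation, without printing them.
  invisible(to_install)
}

# Base packages are always present, so this call installs nothing.
ensure_installed(c("stats", "utils"))
```

In a notebook, the first cell could call `ensure_installed("caesar")` so that re-running the cell in the same session does not reinstall the package.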

Notebook-scoped libraries are automatically available on workers for SparkR UDFs.

To install libraries for all notebooks attached to a cluster, use cluster-installed libraries. See Cluster libraries.

Install notebook-scoped libraries in R

You can use any familiar method of installing packages in R, such as install.packages(), the devtools APIs, or Bioconductor.

R packages are accessible to worker nodes as well as the driver node.

Manage notebook-scoped libraries in R

Install a package

library(devtools)

# Install caesar from CRAN into the notebook-scoped library.
install_version(
  package = "caesar",
  repos   = "http://cran.us.r-project.org"
)

Databricks recommends using a CRAN snapshot as the repository to guarantee reproducible results.
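
One way to follow this recommendation is to point the session's default repository at a date-stamped snapshot. This is a sketch under stated assumptions: the snapshot host and the date below are placeholders chosen for illustration, not values taken from the Databricks documentation.

```r
# Assumption: a date-stamped CRAN snapshot URL. Both the host and the date
# are placeholders; substitute the snapshot your team has standardized on.
snapshot_date <- "2023-06-01"
snapshot_repo <- paste0("https://packagemanager.posit.co/cran/", snapshot_date)

# Make the snapshot the default CRAN repository for this session.
options(repos = c(CRAN = snapshot_repo))

# Later calls such as install.packages("caesar") resolve against the snapshot.
getOption("repos")[["CRAN"]]
```

Pinning the repository to a fixed date means that every session resolves the same package versions, which is what makes results reproducible.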

To install a package from GitHub, use devtools::install_github():

devtools::install_github("klutometis/roxygen")

Remove an R package from a notebook environment

To remove a notebook-scoped library from a notebook, use the remove.packages() function.

remove.packages("caesar")
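
To confirm that a package is gone from disk, you can check the library paths directly. This is a small sketch in base R; `is_installed()` is a hypothetical helper name, not a built-in function.

```r
# Hypothetical helper: TRUE if the package exists on disk in any library path.
is_installed <- function(pkg) {
  length(find.package(pkg, lib.loc = .libPaths(), quiet = TRUE)) > 0
}

# stats ships with R, so it is always found.
is_installed("stats")
```

After remove.packages("caesar"), calling `is_installed("caesar")` should return FALSE. A namespace that was already loaded stays in memory until it is unloaded or the session restarts.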

Notebook-scoped R libraries with Spark UDFs

Notebook-scoped R libraries and SparkR

Notebook-scoped libraries are available on SparkR workers; load a library inside the UDF to use it. For example, you can run the following to generate a Caesar-encrypted message with a SparkR UDF:

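library(devtools)

# Install caesar from CRAN into the notebook-scoped library.
install_version(
  package = "caesar",
  repos   = "http://cran.us.r-project.org"
)

library(SparkR)
sparkR.session()

# Runs on a worker. The argument x is the list element and is unused here.
hello <- function(x) {
  library(caesar)
  caesar("hello world")
}

# Apply hello() once per element; the results come back as a list.
spark.lapply(c(1, 2), hello)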

Notebook-scoped R libraries and sparklyr

By default, in sparklyr::spark_apply(), the packages argument is set to TRUE. This copies libraries in the current libPaths to the workers, allowing you to import and use them on workers. For example, you can run the following to generate a caesar-encrypted message with sparklyr::spark_apply():

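library(devtools)

# Install caesar from CRAN into the notebook-scoped library.
install_version(
  package = "caesar",
  repos   = "http://cran.us.r-project.org"
)

library(sparklyr)
sc <- spark_connect(method = 'databricks')

# Runs on a worker; caesar is available because packages = TRUE by default.
apply_caes <- function(x) {
  library(caesar)
  caesar("hello world")
}

# Create a 5-row Spark DataFrame and apply the function to it.
sdf_len(sc, 5) %>%
  spark_apply(apply_caes)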

If you do not want libraries to be available on workers, set packages to FALSE.
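
To see which libraries the packages argument would have to copy, you can inspect the library paths of the current session. This sketch uses only base R. On Databricks the first entry is expected to be the notebook-scoped library, but treat that as an assumption to confirm on your own cluster.

```r
# Library paths searched by this R session, in priority order.
lib_paths <- .libPaths()
print(lib_paths)

# install.packages() installs into the first path by default.
default_lib <- lib_paths[1]

# Packages currently present in that library (first few only).
pkgs_in_default <- list.files(default_lib)
print(head(pkgs_in_default))
```

A large library takes longer to copy, so checking its contents can help you decide whether to leave packages at its default or set it to FALSE.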

Library isolation and hosted RStudio

RStudio creates a separate library path for each user; therefore users are isolated from each other. However, the library path is not available on workers. If you want to use a package inside SparkR workers in a job launched from RStudio, you need to install it using cluster libraries.

Alternatively, if you use sparklyr UDFs, packages installed inside RStudio are available to workers when using spark_apply(..., packages = TRUE).

Frequently asked questions (FAQ)

How do I install a package on just the driver for all R notebooks?

Explicitly set the installation directory to /databricks/spark/R/lib. For example, with install.packages(), run install.packages("pckg", lib="/databricks/spark/R/lib"). Packages installed in /databricks/spark/R/lib are shared across all notebooks on the cluster, but they are not accessible to SparkR workers. To share libraries across notebooks and also workers, use cluster libraries.

Are notebook-scoped libraries cached?

There is no caching implemented for notebook-scoped libraries on a cluster. If you install a package in a notebook, and another user installs the same package in another notebook on the same cluster, the package is downloaded, compiled, and installed again.