Notebook-scoped R libraries
Notebook-scoped R libraries enable you to create and modify custom R environments that are specific to a notebook session. When you install an R notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
Notebook-scoped libraries are automatically available on workers for SparkR UDFs.
To install libraries for all notebooks attached to a cluster, use cluster-installed libraries. See Cluster libraries.
Install notebook-scoped libraries in R
You can use any familiar method of installing packages in R, such as install.packages(), the devtools APIs, or Bioconductor.
R packages are accessible to worker nodes as well as the driver node.
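For example, here is a minimal sketch of installing with install.packages() and with Bioconductor (the devtools approach is shown in the next section); caesar and GenomicRanges are only illustrative package names:
# Install a notebook-scoped package from CRAN with base R
install.packages("caesar", repos = "http://cran.us.r-project.org")

# Install a Bioconductor package through BiocManager
install.packages("BiocManager", repos = "http://cran.us.r-project.org")
BiocManager::install("GenomicRanges")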
Manage notebook-scoped libraries in R
In this section:
- Install a package
- Remove an R package from a notebook environment
Install a package
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)
Databricks recommends using a CRAN snapshot as the repository to guarantee reproducible results.
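As a sketch of what this looks like, you can point repos at a dated snapshot so repeated runs resolve the same package versions; the snapshot URL below is illustrative only, so substitute the snapshot you actually want to pin to:
# Install from a dated CRAN snapshot for reproducible package versions
install.packages(
  "caesar",
  repos = "https://packagemanager.posit.co/cran/2023-06-01"  # illustrative snapshot URL
)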
You can also install a package from a source other than CRAN, for example directly from GitHub with devtools:
devtools::install_github("klutometis/roxygen")
Remove an R package from a notebook environment
To remove a notebook-scoped library from a notebook, use the remove.packages() command.
remove.packages("caesar")
Notebook-scoped R libraries with Spark UDFs
In this section:
- Notebook-scoped R libraries and SparkR
- Notebook-scoped R libraries and sparklyr
- Library isolation and hosted RStudio
Notebook-scoped R libraries and SparkR
Notebook-scoped libraries are available on SparkR workers; just import a library inside the UDF to use it. For example, you can run the following to generate a caesar-encrypted message with a SparkR UDF:
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)

library(SparkR)
sparkR.session()
hello <- function(x) {
  library(caesar)
  caesar("hello world")
}
spark.lapply(c(1, 2), hello)
Notebook-scoped R libraries and sparklyr
By default, in sparklyr::spark_apply(), the packages argument is set to TRUE. This copies libraries in the current libPaths to the workers, allowing you to import and use them on workers. For example, you can run the following to generate a caesar-encrypted message with sparklyr::spark_apply():
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)

library(sparklyr)
sc <- spark_connect(method = "databricks")
apply_caes <- function(x) {
  library(caesar)
  caesar("hello world")
}
sdf_len(sc, 5) %>%
  spark_apply(apply_caes)
If you do not want libraries to be available on workers, set packages to FALSE.
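Here is a minimal sketch, reusing the sc connection from the example above; with packages = FALSE the applied function can rely only on packages that are already installed on the workers (for example, as cluster libraries):
# Do not copy the notebook's libraries to the workers
sdf_len(sc, 5) %>%
  spark_apply(function(df) data.frame(rows = nrow(df)), packages = FALSE)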
Library isolation and hosted RStudio
RStudio creates a separate library path for each user; therefore users are isolated from each other. However, the library path is not available on workers. If you want to use a package inside SparkR workers in a job launched from RStudio, you need to install it using cluster libraries.
Alternatively, if you use sparklyr UDFs, packages installed inside RStudio are available to workers when using spark_apply(..., packages = TRUE).
Frequently asked questions (FAQ)
How do I install a package on just the driver for all R notebooks?
Explicitly set the installation directory to /databricks/spark/R/lib. For example, with install.packages(), run install.packages("pckg", lib="/databricks/spark/R/lib").
Packages installed in /databricks/spark/R/lib are shared across all notebooks on the cluster, but they are not accessible to SparkR workers. To share libraries across notebooks and also workers, use cluster libraries.
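As a sketch, assuming the /databricks/spark/R/lib path described above ("pckg" is a placeholder package name), you can install from one notebook and then load the package in any other notebook attached to the same cluster:
# Install into the shared, driver-only library path
install.packages("pckg", lib = "/databricks/spark/R/lib", repos = "http://cran.us.r-project.org")

# Load it explicitly from that path in any notebook on the cluster
library("pckg", lib.loc = "/databricks/spark/R/lib")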
Are notebook-scoped libraries cached?
There is no caching implemented for notebook-scoped libraries on a cluster. If you install a package in a notebook, and another user installs the same package in another notebook on the same cluster, the package is downloaded, compiled, and installed again.