Azure HPC Cache data ingest - msrsync method
This article gives detailed instructions for using the msrsync utility to copy data to an Azure Blob storage container for use with Azure HPC Cache.
To learn more about moving data to Blob storage for your Azure HPC Cache, read Move data to Azure Blob storage.
The msrsync tool can be used to move data to a back-end storage target for the Azure HPC Cache. This tool is designed to optimize bandwidth usage by running multiple parallel rsync processes. It is available from GitHub at https://github.com/jbd/msrsync.
msrsync breaks up the source directory into separate "buckets" and then runs individual rsync processes on each bucket.
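The bucket idea can be illustrated with a minimal sketch. This is not msrsync's code, and msrsync's own bucketing logic may differ; the helper name make_buckets and the demo directories are hypothetical. The sketch lists the files under a source directory and writes the list into files of at most N entries each.

```shell
#!/bin/sh
# Conceptual sketch only: this is NOT how msrsync is implemented.
# make_buckets is a hypothetical helper that lists the files under a source
# directory and splits that list into files of at most N entries each
# (bucket.aa, bucket.ab, ...).
make_buckets() {
    src=$1      # source directory
    n=$2        # maximum number of files per bucket
    out=$3      # existing directory that receives the bucket list files
    (cd "$src" && find . -type f) | split -l "$n" - "$out/bucket."
}

# Demonstration on a throwaway directory: five files, two files per bucket.
demo_src=$(mktemp -d)
demo_out=$(mktemp -d)
touch "$demo_src/a" "$demo_src/b" "$demo_src/c" "$demo_src/d" "$demo_src/e"
make_buckets "$demo_src" 2 "$demo_out"
ls "$demo_out"
```

Each list file could then be handed to a separate rsync process through rsync's --files-from option, which is the general pattern that msrsync automates. The sketch assumes file names without embedded newlines.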
Preliminary testing on a four-core VM showed the best efficiency with 64 processes. Use the msrsync option -p to set the number of processes to 64.
Note that msrsync can only copy between local volumes. Both the source and the destination must be accessible as local mounts on the workstation used to issue the command.
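Before starting a copy, it can help to confirm that both paths are reachable from the workstation. The following is a minimal sketch, assuming a POSIX shell; the helper name is_accessible_dir and the SRC and DST variables are hypothetical, and the default paths are the ones used in the example later in this article.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the path is an existing directory that
# the current user can read and traverse.
is_accessible_dir() {
    [ -d "$1" ] && [ -r "$1" ] && [ -x "$1" ]
}

# Assumed paths; override by setting SRC and DST in the environment.
SRC="${SRC:-/test/source-repository}"
DST="${DST:-/mnt/hpccache/repository}"

for p in "$SRC" "$DST"; do
    if is_accessible_dir "$p"; then
        # Show the filesystem that backs the path (for example, the NFS
        # mount of the cache).
        df -P "$p" | tail -n 1
    else
        echo "not accessible: $p" >&2
    fi
done
```

The df output shows which mounted filesystem each path belongs to, which is a quick way to check that the destination really points at the cache mount and not at a local directory of the same name.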
Follow these instructions to use msrsync to populate Azure Blob storage with Azure HPC Cache:
1. Install msrsync and its prerequisites (rsync and Python 2.6 or later).

2. Determine the total number of files and directories to be copied.

   For example, use the utility prime.py with the arguments prime.py --directory /path/to/some/directory (available by downloading https://github.com/Azure/Avere/blob/main/src/clientapps/dataingestor/prime.py).

   If not using prime.py, you can calculate the number of items with the GNU find tool as follows:

       find <path> -type f | wc -l    # (counts files)
       find <path> -type d | wc -l    # (counts directories)
       find <path> | wc -l            # (counts both)

3. Divide the number of items by 64 to determine the number of items per process. Use this number with the -f option to set the size of the buckets when you run the command.

4. Issue the msrsync command to copy files:

       msrsync -P --stats -p64 -f<ITEMS_DIV_64> --rsync "-ahv --inplace" <SOURCE_PATH> <DESTINATION_PATH>
For example, this command is designed to move 11,000 files in 64 processes from /test/source-repository to /mnt/hpccache/repository:

    msrsync -P --stats -p64 -f170 --rsync "-ahv --inplace" /test/source-repository/ /mnt/hpccache/repository
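The counting and division steps above can be sketched as a small shell script. The helper names count_items and items_per_process are hypothetical. Rounding the quotient up is an assumption made here so that the items fit in no more than 64 buckets; the example above uses a rounded-down value instead.

```shell
#!/bin/sh
# Hypothetical helper: count files and directories under a path, the same
# way as `find <path> | wc -l` (the count includes the top-level directory).
count_items() {
    find "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: divide the item count by the process count, rounding
# up (an assumption; the article only says "divide by 64").
items_per_process() {
    total=$1
    procs=$2
    echo $(( (total + procs - 1) / procs ))
}

# Example: 11,000 items spread over 64 processes.
items_per_process 11000 64
```

The printed value would be passed to the -f option. For a real source tree, the two helpers could be combined as items_per_process "$(count_items /path/to/source)" 64, where the path is a placeholder.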