Security
From: Developing big data solutions on Microsoft Azure HDInsight
It is vital to consider how you can maximize security for all the applications and services you build and use. This is particularly the case with distributed applications and services, such as big data solutions, that move data over public networks and store data outside the corporate network.
Typical areas of concern for security in these types of applications are:
- Securing the infrastructure
- Securing credentials in scripts and applications
- Securing data passing over the network
- Securing data in storage
Securing the infrastructure
HDInsight runs on a set of Azure virtual machines that are provisioned automatically when you create a cluster, and it uses an Azure SQL Database to store metadata for the cluster. The cluster is isolated and provides external access through a secure gateway node that exposes a single point of access and carries out user authentication. However, you must be aware of several points related to security of the overall infrastructure of your solutions:
- Ensure you properly protect the cluster by using passwords of appropriate complexity.
- Ensure you protect your Azure storage keys and keep them secret. A malicious user who obtains a storage key can directly access all of the cluster data held in blob storage. If you suspect that a key has been exposed, regenerate it immediately (a sketch follows this list).
- Protect credentials, connection strings, and other sensitive information when you need to use them in your scripts or application code. See Securing credentials in scripts and applications for more information.
- If you enable remote desktop access to the cluster, use a suitably strong password and configure the access to expire as soon as possible after you finish using it. Remote desktop users do not have administrative level permissions on the cluster, but it is still possible to access and modify the core Hadoop system, and read data and the contents of configuration files (which contain security information and settings) through a remote desktop connection.
- Consider whether protecting your clusters with a custom or third-party gatekeeper implementation, one that can authenticate multiple users with different credentials, would be appropriate for your scenario.
- Use local security policies and features, such as file permissions and execution rights, for tools or scripts that transmit, store, and process the data.
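As a minimal sketch of how you might respond to an exposed storage key, the following Windows PowerShell commands regenerate the primary key of a storage account. This assumes the classic Azure PowerShell cmdlets are installed and that you have selected the appropriate subscription; the account name is a hypothetical placeholder.

```powershell
# Hypothetical storage account name. Regenerating the primary key immediately
# invalidates the old key, so any cluster or application configured with the
# old value must be updated afterwards.
$keys = New-AzureStorageKey -KeyType "Primary" -StorageAccountName "mystorageaccount"

# The object returned contains the new primary key and the unchanged secondary key.
$keys.Primary
```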
Securing credentials in scripts and applications
Scripts, applications, tools, and other utilities will require access to credentials in order to load data, run jobs, or download the results from HDInsight. However, if you store credentials in plain text in your scripts or configuration files you leave the cluster itself, and the data in Azure storage, open to anyone who has access to these scripts or configuration files.
In production systems, and at any time when you are not just experimenting with HDInsight using test data, you should consider how you will protect credentials, connection strings, and other sensitive information in scripts and configuration files. Some solutions are:
- Prompt the user to enter the required credentials as the script or application executes. This is a common approach in interactive scenarios, but it is obviously not appropriate for automated solutions where the script or application may run in unattended mode on a schedule, or in response to a trigger event.
- Store the required credentials in encrypted form in the configuration file. This approach is typically used in .NET applications, where sections of the configuration file can be encrypted using the methods exposed by the .NET Framework. See Encrypting Configuration Information Using Protected Configuration for more information. You must ensure that only authorized users can execute the application by protecting it using local security policies and features, such as file permissions and execution rights.
- Store the required credentials in a text file, a repository, a database, or the Windows Registry in encrypted form using the Data Protection API (DPAPI). This approach is typically used in Windows PowerShell scripts; a minimal sketch follows the note below. You must ensure that only authorized users can execute the script by protecting it using local security policies and features, such as file permissions and execution rights.
Note
The article Working with Passwords, Secure Strings and Credentials in Windows PowerShell on the TechNet wiki includes some useful examples of the techniques you can use.
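As a minimal sketch of the DPAPI approach, the following Windows PowerShell commands encrypt a password under the current user's DPAPI key, persist it to a file, and later read it back for use in an unattended script. The file path and user name are hypothetical placeholders.

```powershell
# One-time setup (run interactively as the account that will execute the script).
# ConvertFrom-SecureString uses DPAPI when no explicit key is supplied, so the
# file can be decrypted only by the same user account on the same machine.
$securePassword = Read-Host "Enter the cluster password" -AsSecureString
$securePassword | ConvertFrom-SecureString | Out-File "D:\Secure\ClusterPassword.txt"

# In the unattended script: read the encrypted value back and build a credential.
$securePassword = Get-Content "D:\Secure\ClusterPassword.txt" | ConvertTo-SecureString
$credential = New-Object System.Management.Automation.PSCredential -ArgumentList "admin", $securePassword
```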
Securing data passing over the network
HDInsight uses several protocols for communication between the cluster nodes, and between the cluster and clients, including RPC, TCP/IP, and HTTP. Consider the following when deciding how to secure data that passes across the network:
- Use a secure protocol for all connections over the Internet to the cluster and to your Azure storage account. Consider using Secure Sockets Layer (SSL) for the connection to your storage account to protect the data on the wire (this is supported and recommended for Azure storage). Use SSL or Transport Layer Security (TLS), or another secure protocol, where appropriate when communicating with the cluster from client-side tools and utilities, keeping in mind that some tools may not support SSL or may require you to specifically configure them to use it. When accessing Azure storage from client-side tools and utilities, use the secure wasbs protocol; note that you must specify the full path to a file when you use wasbs. A short example follows this list.
- Consider if you need to encrypt data in storage and on the wire. This is not trivial, and may involve writing custom components to carry out the encryption. If you create custom components, use proven libraries of encryption algorithms to carry out the encryption process. Note that the encryption keys must be available within your custom components running in Azure, which may leave them vulnerable.
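As an illustration of the wasbs protocol mentioned above, the following Windows PowerShell sketch submits a Hive query that reads a file in blob storage over an SSL connection. It assumes the classic Azure PowerShell HDInsight cmdlets (Use-AzureHDInsightCluster and Invoke-Hive) are installed; the cluster, container, storage account, table, and file names are hypothetical placeholders.

```powershell
# Hypothetical cluster name. Select the cluster that will run the Hive query.
Use-AzureHDInsightCluster -Name "mycluster"

# The wasbs scheme causes the transfer from blob storage to use SSL.
# Note that the full path to the file must be specified with wasbs.
Invoke-Hive -Query @"
LOAD DATA INPATH 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/input/sales.csv'
INTO TABLE sales;
"@
```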
Securing data in storage
Consider the following when deciding how to secure data in storage:
- Do not store data that is not associated with your HDInsight processing jobs in the storage accounts linked to a cluster. HDInsight has full access to all of the containers in linked storage accounts because the account names and keys are stored in the cluster configuration. See Cluster and storage initialization for details of how you can isolate parts of your data by using separate storage accounts.
- If you use non-linked storage accounts in an HDInsight job by specifying the storage key for these accounts in the job files, the HDInsight job will have full access to all of the containers and blobs in that account. Ensure that these non-linked storage accounts do not contain data that must be kept private from HDInsight, and that the containers do not have public access permission. See Using an HDInsight Cluster with Alternate Storage Accounts and Metastores and Use Additional Storage Accounts with HDInsight Hive for more information.
- Consider if using Shared Access Signatures (SAS) to provide access to data in Azure storage would be an advantage in your scenario. A SAS can provide fine-grained, time-limited access to data for clients. For more details see Create and Use a Shared Access Signature; a sketch appears at the end of this list.
- Consider if you need to employ monitoring processes that can detect inappropriate access to the data, and can alert operators to possible security breaches. Ensure that you have a process in place to lock down access in this case, determine the scope of the security breach, and verify the validity and integrity of the data afterwards. Hadoop can log access to data, and Azure blob storage has a built-in monitoring capability; for more information see How To Monitor a Storage Account.
- Consider preprocessing or scrubbing the data to remove nonessential sensitive information before storing it in remote locations such as Azure storage. If you need to stage the data before processing, perhaps to remove personally identifiable information (a process sometimes referred to as de-identification), consider using separate storage (preferably on-premises) for the intermediate stage rather than the dedicated cluster storage. This provides isolation and additional protection against accidental or malicious access to the sensitive data.
- Consider encrypting sensitive data, sensitive parts of the data, or even whole folders and subfolders. This may include splitting data into separate files, such as dividing credit card information into different files that contain the card number and the related card-holder information. Azure blob storage does not have a built-in encryption feature, so you will need to encrypt the data using encryption libraries and custom code, or with third-party tools (a sketch appears at the end of this list).
- If you are handling sensitive data that must be encrypted, you will need to write custom serializer and deserializer classes and install these in the cluster for use as the SerDe parameter in Hive statements, or create custom map/reduce components that can manage the serialization and deserialization. See the Apache Hive Developer Guide for more information about creating a custom SerDe. However, keep in mind that the additional processing required for encryption imposes a trade-off between security and performance.
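As an illustration of the Shared Access Signature approach in the list above, the following Windows PowerShell sketch generates a read-only SAS token for a container that expires after four hours, and appends it to a blob URI that can be handed to a client. It assumes the classic Azure PowerShell storage cmdlets; the account name, key, container, and blob names are hypothetical placeholders.

```powershell
# Hypothetical storage account name and key.
$context = New-AzureStorageContext -StorageAccountName "mystorageaccount" `
                                   -StorageAccountKey "<storage-key>"

# Create a SAS token that grants read-only access to the container
# and expires four hours from now.
$sasToken = New-AzureStorageContainerSASToken -Name "mycontainer" `
                                              -Permission r `
                                              -ExpiryTime (Get-Date).AddHours(4) `
                                              -Context $context

# Append the token to a blob URI to give a client time-limited read access.
$blobUri = "https://mystorageaccount.blob.core.windows.net/mycontainer/data/results.csv" + $sasToken
```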
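Because blob storage has no built-in encryption feature, encryption of the kind described in the list above means running custom code before the data leaves your network. The following Windows PowerShell sketch uses the .NET Framework AES implementation to encrypt a local file prior to upload. The file paths are placeholders, and the hard-coded key and IV are for illustration only; generating and protecting the key securely (for example with DPAPI) is the critical part of a real solution.

```powershell
# Placeholder key and IV. Do NOT use predictable values like these in practice.
$key = [byte[]](1..32)   # 256-bit key
$iv  = [byte[]](1..16)   # 128-bit initialization vector

$aes = [System.Security.Cryptography.Aes]::Create()
$aes.Key = $key
$aes.IV  = $iv

# Encrypt the local file before uploading the .enc output to blob storage.
$plainBytes  = [System.IO.File]::ReadAllBytes("D:\Data\sales.csv")
$encryptor   = $aes.CreateEncryptor()
$cipherBytes = $encryptor.TransformFinalBlock($plainBytes, 0, $plainBytes.Length)
[System.IO.File]::WriteAllBytes("D:\Data\sales.csv.enc", $cipherBytes)
$aes.Clear()
```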