Huge volumes of data are generated daily in the healthcare, retail, logistics, and finance industries. Because these industries rely heavily on that data, the system needs a proper architecture built in from the start. A hybrid Hadoop model with proper security and networking capabilities offers an efficient framework for running jobs across multiple cluster nodes.
Data lake security ensures that a user can access only the specific files, or the specific data within a file, permitted by the organization's security policies. A defined set of policies, together with strict privacy regulations, helps the organization manage data effectively and achieve a better ROI. Hadoop Consulting Services provides the required security tools for a safe transition of data.
Hadoop clusters and their ecosystem are a well-regarded choice among vendors that provide data lake offerings, such as:
- Hortonworks Data Lake (HDP + HDFS) [Note: Hortonworks Data Platform is a Hadoop cluster ecosystem from Hortonworks]
- IBM Data Lake (HDP+HDF+DB2 Big SQL) [Note: IBM is a Hortonworks partner]
- AWS Data Lake (Amazon EMR + S3) [Note: Amazon EMR is a Hadoop cluster ecosystem from Amazon]
- Azure Data Lake (Azure HDInsight Hadoop cluster ecosystem + Data Lake Storage Gen2 / Azure Blob)
Underneath, the file system may vary from one data lake vendor to another. However, the common bridge between all of these players is the Hadoop cluster ecosystem, which handles data ingestion and data processing for the data lake.
In practice, a data lake is characterized by three key attributes:
- Collect everything. A data lake contains all types of data: raw source data retained over long periods as well as processed data.
- Dive in anywhere. A data lake allows users from various business units to refine, explore and enrich data on their terms.
- Flexible access. A data lake supports various data access patterns across a shared infrastructure: in-memory, interactive, batch, search, online, and other processing engines.
Result: A data lake provides maximum scale and insight with the lowest possible friction and cost.
Design Consideration for Security
The Interceptor is a suitable design pattern for implementing security in a data architecture. Every request for data access on the various clusters goes through a common channel that enforces the security checks.
This channel integrates with the identity management and SSO systems used in enterprises.
Once a user has been authenticated, their access rights are determined. Authorization defines the rights of user access to resources.
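The flow described above (one common channel, authentication first, authorization second) can be sketched in a few lines of Python. This is a minimal illustration only: `SecurityInterceptor`, `AccessDenied`, and `demo_backend` are hypothetical names, and the in-memory dictionaries stand in for a real identity provider and policy store.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when authentication or authorization fails."""


@dataclass
class SecurityInterceptor:
    # Hypothetical stand-ins for an identity provider and a policy store.
    credentials: dict = field(default_factory=dict)   # user -> password (toy only)
    permissions: dict = field(default_factory=dict)   # user -> {(resource, action)}

    def authenticate(self, user, password):
        if self.credentials.get(user) != password:
            raise AccessDenied(f"authentication failed for {user!r}")

    def authorize(self, user, resource, action):
        if (resource, action) not in self.permissions.get(user, set()):
            raise AccessDenied(f"{user!r} may not {action} {resource!r}")

    def handle(self, user, password, resource, action, backend):
        # Every request passes through here before it reaches any cluster.
        self.authenticate(user, password)
        self.authorize(user, resource, action)
        return backend(resource, action)


def demo_backend(resource, action):
    # Placeholder for the real cluster call.
    return f"{action} ok on {resource}"


interceptor = SecurityInterceptor(
    credentials={"alice": "s3cret"},
    permissions={"alice": {("/data/sales", "read")}},
)
print(interceptor.handle("alice", "s3cret", "/data/sales", "read", demo_backend))
```

Because the backend is only ever called from `handle`, a request that fails either check never reaches the cluster.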
Security in Hortonworks Data Platform (HDP)
As diverse types of enterprise data are brought together into a central repository, the inherent security risks can increase. Hortonworks recognizes the importance of security and governance for every business. It takes a holistic approach based on five core security features, which include:
- Authentication and perimeter security
- Data protection
Apache Knox is a common choice for perimeter security among vendors that provide data platforms.
Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other data access components, and around the entire perimeter of the cluster through Apache Knox.
The Apache Knox Gateway provides a single point of authentication and access for Apache Hadoop services in a cluster. It simplifies Hadoop security both for users who access cluster data and execute jobs and for operators who control access and manage the cluster. Knox can secure multiple Hadoop clusters, with the following advantages:
Simplified Access: It extends Hadoop’s REST/HTTP services by encapsulating Kerberos within the cluster.
Security Enhancement: It exposes Hadoop’s REST/HTTP services without revealing network details, and it provides SSL out of the box.
Centralized Control: It implements REST API security in a centralized manner, routing requests to multiple Hadoop clusters.
Integrated Enterprise: It supports LDAP, SAML, Active Directory, SSO, and other authentication systems.
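To make the single access point concrete, here is a small sketch of how a client might build a WebHDFS request that goes through a Knox gateway instead of contacting the Name Node directly. The host name, topology name, and credentials are made up for illustration; port 8443 and the `gateway` path are common Knox defaults but should be checked against the actual deployment. The sketch only builds the URL and header and does not send a request.

```python
import base64
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op,
                     port=8443, gateway_path="gateway"):
    """Build a WebHDFS URL that is routed through a Knox gateway."""
    path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op})
    return (f"https://{gateway_host}:{port}/{gateway_path}/{topology}"
            f"/webhdfs/v1/{path}?{query}")


def basic_auth_header(user, password):
    """HTTP Basic header; Knox can validate it against LDAP or Active Directory."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical gateway host and topology name.
url = knox_webhdfs_url("knox.example.com", "default", "/data/sales", "LISTSTATUS")
print(url)
# https://knox.example.com:8443/gateway/default/webhdfs/v1/data/sales?op=LISTSTATUS
```

The client needs only the gateway address and its enterprise credentials; internal host names and Kerberos details stay behind the gateway.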
Configuring Authorization in Hadoop
Ranger lets us create services for specific Hadoop resources (HDFS, HBase, Hive, etc.) and add access policies to those services.
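The relationship between a service and its access policies can be illustrated with a toy in-memory model. This is not the Ranger API: `Service`, `Policy`, and the path-matching rule are hypothetical simplifications that only show the idea of attaching policies to a service and denying by default.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    resource_path: str            # HDFS path covered by the policy (recursive)
    users: set
    accesses: set                 # e.g. {"read", "write"}

    def covers(self, resource):
        base = self.resource_path.rstrip("/")
        return resource == base or resource.startswith(base + "/")


@dataclass
class Service:
    name: str                     # one service per Hadoop component instance
    policies: list = field(default_factory=list)

    def is_allowed(self, user, resource, access):
        # Deny by default; allow only if some policy matches all three parts.
        return any(
            p.covers(resource) and user in p.users and access in p.accesses
            for p in self.policies
        )


hdfs = Service("cluster1_hdfs")   # hypothetical service name
hdfs.policies.append(Policy("/data/finance", {"alice"}, {"read"}))

print(hdfs.is_allowed("alice", "/data/finance/2024.csv", "read"))   # True
print(hdfs.is_allowed("alice", "/data/finance/2024.csv", "write"))  # False
print(hdfs.is_allowed("bob", "/data/finance/2024.csv", "read"))     # False
```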
Data Protection: Wire Encryption
A significant question for every customer is “How will we secure data in transit?” Wire encryption is the answer.
Encryption is applied to electronic information to ensure privacy and confidentiality. Wire encryption secures data as it moves into, out of, and within a Hadoop cluster over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC. Clients that communicate directly with the Hadoop cluster can be secured using RPC encryption or Data Transfer Protocol encryption.
RPC Encryption: Clients interact directly with the Hadoop cluster through RPC. A client uses RPC to connect to the Name Node (NN) and initiate file read and write operations. Hadoop RPC connections use Java’s Simple Authentication & Security Layer (SASL), which supports encryption.
Data Transfer Protocol: The NN gives the client the address of the first Data Node (DN) to read a block from or write a block to. The actual transfer between the client and a Data Node uses Data Transfer Protocol. Users who communicate with the Hadoop cluster through a browser or command-line tools can be protected as follows:
HTTPS Encryption: Users typically interact with Hadoop through a browser or a component CLI, while applications use REST APIs or Thrift. Encryption over the HTTP protocol is implemented with SSL, both across the Hadoop cluster and for individual components such as Ambari.
JDBC: HiveServer2 implements encryption through the Java SASL protocol’s quality of protection (QOP) setting. With this, data moving between HiveServer2 and a JDBC client can be encrypted. Additionally, communication between processes within the cluster can be secured by using HTTPS encryption during the MapReduce shuffle.
HTTPS Encryption during Shuffle: The movement of data between the Mappers and the Reducers over the HTTP protocol is called the shuffle. The Reducer initiates the connection to the Mapper to request data, so it acts as the SSL client.
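As a rough summary of the channels above, the sketch below collects the configuration properties commonly associated with each form of wire encryption and renders them in Hadoop's XML configuration format. The property names are standard Hadoop and Hive settings, but the grouping by file and the chosen values are illustrative assumptions; a real cluster also needs keystores, truststores, and Kerberos configured, which are omitted here.

```python
import xml.etree.ElementTree as ET

# Illustrative values only; verify against your distribution's documentation.
WIRE_ENCRYPTION = {
    "core-site.xml": {"hadoop.rpc.protection": "privacy"},       # RPC (SASL QOP)
    "hdfs-site.xml": {
        "dfs.encrypt.data.transfer": "true",                     # Data Transfer Protocol
        "dfs.http.policy": "HTTPS_ONLY",                         # HDFS web endpoints
    },
    "hive-site.xml": {"hive.server2.thrift.sasl.qop": "auth-conf"},  # JDBC
    "mapred-site.xml": {"mapreduce.shuffle.ssl.enabled": "true"},    # shuffle
}


def to_hadoop_xml(properties):
    """Render a dict of properties as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, props in WIRE_ENCRYPTION.items():
    print(filename)
    print(to_hadoop_xml(props))
```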
HDFS “Data at Rest” Encryption
Hortonworks, AWS, and Azure each provide security at the file system level. Hadoop can encrypt stored data in several ways.
The lowest level of encryption is volume encryption, which protects data after an accidental loss or physical theft of a disk volume. The entire volume is encrypted; this technique does not support fine-grained encryption of specific files or directories. Additionally, volume encryption does not protect against viruses or other attacks that occur while a system is running.
Application-level encryption (encryption within an application that runs on top of Hadoop) supports a higher level of granularity and prevents “rogue admin” access, but adds a layer of complexity to the application architecture. A third approach is HDFS data at rest encryption, which encrypts selected files and directories stored (“at rest”) in HDFS. This approach uses specially designated HDFS directories called “encryption zones.”
HDFS encryption involves several elements:
Encryption Key: A new level of permission-based access protection, in addition to standard HDFS permissions.
HDFS Encryption Zone: A special HDFS directory within which all data is encrypted upon write and decrypted upon read.
- Each encryption zone is associated with an encryption key that is specified when the zone is created.
- Each file in an encryption zone has a distinct encryption key, called the “data encryption key” (DEK).
- HDFS does not have access to DEKs. HDFS Data Nodes only see a stream of encrypted bytes. HDFS stores “encrypted data encryption keys” (EDEKs) as part of the file’s metadata on the Name Node.
- Clients decrypt an EDEK and use the relevant DEK to encrypt and decrypt data during write and read operations.
Ranger Key Management Service (Ranger KMS): An open-source key management service based on Hadoop’s Key Provider API.
The Ranger KMS has three basic responsibilities for HDFS encryption:
- Provide access to stored encryption zone keys.
- Create and manage encryption zone keys and make encrypted data keys that can be stored in Hadoop.
- Audit all access events in Ranger KMS.
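The interaction between zone keys, DEKs, and EDEKs is a form of envelope encryption, and it can be sketched with a toy model. Everything here is hypothetical: `ToyKMS` is not the Ranger KMS API, and `toy_cipher` is a deliberately insecure XOR keystream that stands in for a real cipher such as AES so that the example runs with the standard library alone.

```python
import hashlib
import secrets


def toy_cipher(key, data):
    """XOR data with a SHA-256-derived keystream. NOT secure; stands in for AES."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical stand-in for a key management service such as Ranger KMS."""

    def __init__(self):
        self._zone_keys = {}      # zone keys never leave the KMS
        self.audit_log = []       # every access event is recorded

    def create_zone_key(self, name):
        self._zone_keys[name] = secrets.token_bytes(32)
        self.audit_log.append(("create_key", name))

    def generate_edek(self, key_name):
        dek = secrets.token_bytes(32)
        self.audit_log.append(("generate_edek", key_name))
        return toy_cipher(self._zone_keys[key_name], dek)

    def decrypt_edek(self, key_name, edek):
        self.audit_log.append(("decrypt_edek", key_name))
        return toy_cipher(self._zone_keys[key_name], edek)


kms = ToyKMS()
kms.create_zone_key("finance_zone_key")

# Write path: the EDEK is kept with the file's metadata; the client asks the
# KMS to decrypt it and encrypts the data with the resulting DEK.
edek = kms.generate_edek("finance_zone_key")
dek = kms.decrypt_edek("finance_zone_key", edek)
ciphertext = toy_cipher(dek, b"account,balance\n42,1000\n")

# Read path: fetch the EDEK from metadata, decrypt it via the KMS, decrypt data.
dek_again = kms.decrypt_edek("finance_zone_key", edek)
plaintext = toy_cipher(dek_again, ciphertext)
print(plaintext == b"account,balance\n42,1000\n")  # True
```

In this model the storage layer only ever holds `edek` and `ciphertext`, which mirrors the point that HDFS itself never sees a plaintext DEK.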
Big data warehouses and data lakes are regular targets of attacks that attempt to breach security, so it is important to build a strong infrastructure. Data Lake Solutions provides Hadoop managed services that deliver the required cybersecurity and interoperability, creating a consistent database environment in the data lake.