The data lake serves as an alternative to the multiple information silos typical of enterprise environments. The data lake does not care where the data came from or how it will be used. It is indifferent to data quality and integrity. It is concerned only with providing a common repository from which to perform in-depth analytics; only at that point is any sort of structure imposed upon the data.
As the popularity of the data lake grows, so too does the number of vendors jumping into data lake waters, each bringing its own idea of what a data lake entails. While any data lake solution will have at its core a massive repository, some vendors also roll in an analytics component or two, which is exactly what Microsoft is planning to do. As the following figure shows, the Azure Data Lake platform comprises three primary services: Data Lake Store, Data Lake Analytics, and Azure HDInsight.
Data Lake Store provides the repository necessary to persist the influx of data, and Data Lake Analytics offers a mechanism for picking apart that data. Both components are now in public preview. Microsoft has also rolled HDInsight, a service that offers a wide range of Hadoop-based tools, into the Data Lake mix for additional analytic capabilities. To facilitate access between the storage and analytic layers, the Data Lake platform leverages Apache YARN (Yet Another Resource Negotiator) and WebHDFS-compatible REST APIs.
Azure Data Lake Store
Microsoft describes Data Lake Store as a “hyper-scale repository for big data analytic workloads,” a mouthful, to be sure, but descriptive nonetheless. The service will let you store data of any size, type, or ingestion speed, whether originating from social networks, relational databases, web-based applications, line-of-business (LOB) applications, mobile and desktop devices, or a variety of other sources. The repository provides unlimited storage without restricting file sizes or data volumes. An individual file can be petabytes in size, with no limit on how long you keep it there.
Data Lake Store uses a Hadoop file system to support compatibility with the Hadoop Distributed File System (HDFS), making the data store accessible to a wide range of Hadoop-friendly tools and services. Data Lake Store is already integrated with Data Lake Analytics and HDInsight, as well as Azure Data Factory; however, Microsoft also plans eventual integration with services such as Microsoft's Revolution R Enterprise, distributions from Hortonworks, Cloudera, and MapR, and Hadoop projects such as Spark, Storm, and HBase.
To protect the data, Microsoft makes redundant copies to ensure durability and promises enterprise-grade security based on Azure Active Directory (AAD). The AAD service manages identity and access for all stored data, providing multifactor authentication, role-based access control, conditional access, and numerous other features.
To use AAD to protect data, you must first create AAD security groups to facilitate role-based access control in the Azure Portal. Next, you assign the security groups to the Data Lake Store account, which controls access to the repository for portal and management operations. Finally, you assign the security groups to the access control lists (ACLs) associated with the repository's file system. Currently, you can assign access control permissions only at the repository level, but Microsoft plans to add folder- and file-level controls in a future release.
Data Lake Store supports POSIX-style permissions exposed through the WebHDFS-compatible REST APIs. The WebHDFS protocol makes it possible to support all HDFS operations, not only read and write, but also such operations as accessing block locations and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
Data Lake Store also implements a new file system, AzureDataLakeFilesystem (adl://), for directly accessing the repository. Applications and services capable of using the file system can realize additional flexibility and performance gains over WebHDFS. Systems not compatible with the new file system can continue to use the WebHDFS-compatible APIs.
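For illustration, a file in the repository can be addressed through the new file system with a URI along these lines (the account name and path here are hypothetical):

    adl://mydatalakestore.azuredatalakestore.net/clickstream/2016/01/events.log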
Azure Data Lake Analytics
Data Lake Analytics is one of Microsoft's newest cloud offerings, having appeared on the scene only within the last couple of months. According to Microsoft, the company built Data Lake Analytics from the ground up with scalability and performance in mind. The service provides a distributed infrastructure that can dynamically allocate or de-allocate resources so customers pay only for the services they use.
As with similar cloud platforms, Data Lake Analytics users can focus on the business logic, rather than on the logistics of how to implement systems and process large data sets. The service handles all the complex management tasks so customers can develop and execute their solutions without worrying about deploying or maintaining the infrastructure to support them.
Data Lake Analytics is also integrated with AAD, making it easier to manage users and permissions, and with Visual Studio, providing developers with a familiar environment for creating analytic solutions.
A solution built for the Data Lake Analytics service is made up of one or more jobs that define the business logic. A job can reference data within Data Lake Store or Azure Blob storage, impose a structure on that data, and process the data in various ways. When a job is submitted to Data Lake Analytics, the service accesses the source data, carries out the defined operations, and outputs the results to Data Lake Store or Blob storage.
Azure Data Lake provides several options for submitting jobs to Data Lake Analytics:
- Use Azure Data Lake Tools in Visual Studio to submit jobs directly.
- Use the Azure Portal to submit jobs via the Data Lake Analytics account.
- Use the Data Lake SDK job submission API to submit jobs programmatically.
- Use the job submission command available through the Azure PowerShell extensions to submit jobs programmatically.
The U-SQL difference
A job is actually a U-SQL script that instructs the service on how to process the data. U-SQL is a new language that Microsoft developed for writing scalable, distributed queries that analyze data. An important feature of the language is its ability to process unstructured data by applying schema-on-read logic, which imposes a structure on the data as you retrieve it from its source. The U-SQL language also lets you insert custom logic and user-defined functions into your scripts, and it provides fine-grained control over how to run a job at scale.
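To make the idea concrete, here is a minimal sketch of a U-SQL script that applies schema on read. The file paths, column names, and extractor choice are hypothetical; the sketch assumes a comma-delimited source file already sitting in Data Lake Store:

    // Impose a schema on the raw file as it is read (schema on read).
    @produce =
        EXTRACT Veggie string,
                Price decimal
        FROM "/input/produce.csv"
        USING Extractors.Csv();

    // Write the rowset back out to the repository.
    OUTPUT @produce
    TO "/output/produce.tsv"
    USING Outputters.Tsv();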
U-SQL evolved out of Microsoft's internal big data language SCOPE (Structured Computations Optimized for Parallel Execution), a SQL-like language that supports set-oriented record and column manipulation. U-SQL is a hybrid language that combines the declarative capabilities of SQL with the extensibility and programmability of C#. The language also incorporates big data processing concepts such as custom reducers and processors, as well as schema on read.
Not surprisingly, U-SQL has its own peculiarities. Keywords such as SELECT must be all uppercase, and the expression language within clauses such as SELECT and WHERE uses C# syntax. For example, a WHERE clause equality operator takes two equal signs, and a string value is enclosed in double quotes, as in WHERE Veggie == "tiger nut".
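Continuing the earlier sketch (the @produce rowset and its values are hypothetical), a filter over that data follows these same rules, with uppercase keywords framing C# operators and literals:

    // C# expression syntax inside WHERE: == for equality, && for
    // logical AND, strings in double quotes, m-suffixed decimals.
    @filtered =
        SELECT Veggie, Price
        FROM @produce
        WHERE Veggie == "tiger nut" && Price > 10.0m;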
In addition, the U-SQL type system is based on C# types, providing tight integration with the C# language. You can use any C# type in a U-SQL expression. However, you can use only a subset of C# types to define rowset columns or certain other schema objects. The usable types are referred to as built-in U-SQL types and can be classified as simple built-in types or complex built-in types. The simple built-in types include your basic numeric, string, and temporal types, along with a few others, and the complex ones include map and array types.
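As a rough illustration (the variable names and values here are hypothetical), scalar variables can be declared with both simple and complex built-in types:

    // Simple built-in types follow C#.
    DECLARE @cutoff DateTime = DateTime.Parse("2016-01-01");
    DECLARE @category string = "produce";

    // Complex built-in types cover arrays and maps.
    DECLARE @veggies SQL.ARRAY<string> = new SQL.ARRAY<string>{ "kale", "tiger nut" };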
You can also use C# to extend your U-SQL expressions. For example, you can add inline C# expressions to your script, which can be handy if you have a small set of C# methods you want to use to process scalar values. In addition, you can write user-defined functions, aggregators, and operators in C# assemblies, load the assemblies into the U-SQL metadata catalog, and then reference the assemblies within your U-SQL scripts.
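For instance, a hypothetical script might apply C# methods and expressions inline to shape column values; compiled C# assemblies registered in the metadata catalog can then be pulled in with REFERENCE ASSEMBLY for anything more elaborate:

    // Inline C#: method calls and the conditional (ternary)
    // operator work in expressions just as they do in C#.
    @tiers =
        SELECT Veggie.ToUpper() AS VeggieName,
               (Price > 10.0m ? "premium" : "standard") AS Tier
        FROM @produce;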
Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in a rowset format. If the source data is in files, U-SQL schematizes the data on extract. However, the source can also be U-SQL tables or tables in other data sources, such as Azure SQL Database, in which case the data does not need to be schematized. In addition, you can define a U-SQL job to transform the data before storing it in a file or U-SQL table. The language also supports data definition statements such as CREATE TABLE so you can define metadata artifacts.
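As a minimal sketch (the table and column names are hypothetical, and the exact index and distribution clauses may differ across U-SQL releases), a managed U-SQL table can be defined and populated like this:

    // U-SQL tables require a clustered index and a distribution scheme.
    CREATE TABLE IF NOT EXISTS dbo.Produce
    (
        Veggie string,
        Price decimal,
        INDEX idx_produce CLUSTERED (Veggie ASC) DISTRIBUTED BY HASH (Veggie)
    );

    // Persist a rowset into the table.
    INSERT INTO dbo.Produce
    SELECT Veggie, Price FROM @produce;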