Thursday, July 28, 2016

Understanding Stream Analytics

What is Stream Analytics?

Stream Analytics is an event processing engine that can ingest events in real-time, whether from one data stream or multiple streams. Events can come from sensors, applications, devices, operational systems, websites, and a variety of other sources. Just about anything that can generate event data is fair game.
Stream Analytics provides high-throughput, low-latency processing, while supporting real-time stream computation operations. With a Stream Analytics solution, organizations can gain immediate insights into real-time data as well as detect anomalies in the data, set up alerts to be triggered under specific conditions, and make the data available to other applications and services for presentation or further analysis. Stream Analytics can also incorporate historical or reference data into the real-time streams to further enrich the information and derive better analytics.
Stream Analytics is built on a pull-based communication model that utilizes adaptive caching with configured size limits and timeouts. The service also adheres to a client-anchor model that provides built-in recovery and checkpointing capabilities. In addition, the service can persist data to protect against node or downstream failure.
To implement a streaming pipeline, developers create one or more jobs that define a stream’s inputs and outputs. The jobs also incorporate SQL-like queries that determine how the data should be transformed. In addition, developers can adjust a number of job settings. For example, they can control when the job should start producing output, how to handle events that arrive out of order, and what to do when a partition lags behind others or contains no data. Once a job is implemented, administrators can view the job’s status via the Azure portal.
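As a minimal sketch, a job’s query simply selects fields from an input alias into an output alias. The alias and field names below are hypothetical; TIMESTAMP BY tells the engine to order events by a timestamp carried in the event payload rather than by arrival time, which is what the out-of-order settings act on:

    -- Route sensor readings from the job's input to its output, ordering
    -- events by the application timestamp embedded in each event.
    SELECT
        DeviceId,
        Temperature,
        EventTime
    INTO
        [sensor-output]
    FROM
        [sensor-input] TIMESTAMP BY EventTime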
Stream Analytics supports two input types, stream data and reference data, and two source types, Azure Event Hubs and Azure Blob storage. Event Hubs is a publish-subscribe data integration service that can consume large volumes of events from a wide range of sources. Blob storage is a data service for storing and retrieving binary large object (BLOB) files. The following table shows the types of data that Stream Analytics can handle and the supported sources and input formats for each.
Input type   Supported Sources          Supported Formats   Size Limits
Stream       Event Hubs, Blob storage   JSON, CSV, Avro     N/A
Reference    Blob storage               JSON, CSV           50 MB
A Stream Analytics job must include at least one stream input type. If Blob storage is used, the file must contain all events before they can be streamed to Stream Analytics. The file is also limited to a maximum size of 50 MB. In this sense, the stream is historical in nature, no matter how recently the file was created. Only Event Hubs can deliver real-time event streams.
Reference data is optional in a Stream Analytics job and can come only from Blob storage. Reference data can be useful for performing lookups or correlating data in multiple streams.
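For instance, a stream can be joined to a reference input much as you would join two tables in SQL. The following sketch assumes a hypothetical reference input named device-catalog backed by a blob of device metadata:

    -- Enrich streamed readings with static device metadata from Blob storage.
    SELECT
        s.DeviceId,
        s.Temperature,
        r.DeviceModel,
        r.InstallSite
    INTO
        [enriched-output]
    FROM
        [sensor-input] s TIMESTAMP BY EventTime
    JOIN
        [device-catalog] r
        ON s.DeviceId = r.DeviceId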
Once a job has the input it needs, the data can be transformed. To facilitate these transformations, Stream Analytics supports a declarative SQL-like language. The language includes a range of specialized functions and operators that let developers implement everything from simple filters to complex aggregations across correlated streams. The language’s SQL-like nature makes it relatively easy for developers to transform data without having to dig into the technical complexities of the underlying system.
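To give a sense of the language, the following hypothetical query aggregates events over non-overlapping 30-second windows; TumblingWindow is one of the language’s built-in windowing functions, and all other names are illustrative:

    -- Compute the average temperature per device every 30 seconds.
    SELECT
        DeviceId,
        AVG(Temperature) AS AvgTemperature,
        System.Timestamp AS WindowEnd
    INTO
        [aggregated-output]
    FROM
        [sensor-input] TIMESTAMP BY EventTime
    GROUP BY
        DeviceId,
        TumblingWindow(second, 30)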
The last piece of the job puzzle is the stream output. Stream Analytics can write the query results (the transformed data) to Azure SQL Database or Blob storage. SQL Database can be useful if the data is relational in nature or supports applications that require database hosting. Blob storage is a good choice for long-term archiving or later processing. A Stream Analytics job can also send data back to Event Hubs to support other streaming pipelines and applications.
According to Microsoft, Stream Analytics can scale to any volume of data, while still achieving high throughput and low latency. An organization can start with a system that supports only a few kilobytes of data per second and scale up to gigabytes per second as needed. Stream Analytics can also leverage the partitioning capabilities of Event Hubs. In addition, administrators can specify how much compute power to dedicate to each step of the pipeline in order to achieve the most efficient and cost-effective throughput.
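For example, a query that reads from a partitioned event hub can process each partition independently by adding PARTITION BY, letting the work spread across streaming units (names here are again illustrative):

    -- Count events per device, processing each event hub partition in parallel.
    SELECT
        DeviceId,
        COUNT(*) AS EventCount
    INTO
        [partitioned-output]
    FROM
        [sensor-input] TIMESTAMP BY EventTime PARTITION BY PartitionId
    GROUP BY
        PartitionId,
        DeviceId,
        TumblingWindow(minute, 1)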

The Azure real-time analytics stack

Stream Analytics was designed to work in conjunction with other Azure services. Data entering and leaving a Stream Analytics job must pass through those services. The following diagram provides a conceptual overview of how the Azure layers fit together and how data flows through them to provide a complete stream analytics solution.
[Diagram: the Azure real-time analytics stack]
The top layer shown in the figure represents the starting point: the data sources that generate the event data. The data can come from just about anywhere, whether a piece of equipment, a mobile device, a cloud service, an ATM, an aircraft, or an oil platform; any device, sensor, or operation that can transmit event data qualifies. The data source might connect directly to Event Hubs or Blob storage or go through a gateway that connects to Azure.
Event Hubs can ingest and integrate millions of events per second. The events can be in various formats and stream in at different velocities. Event Hubs persists the events for a configurable period of time, allowing the events to support multiple Stream Analytics jobs or other operations. Blob storage can also store event data and make it available to Stream Analytics for operations that rely on historical data. In addition, Blob storage can provide reference data to support operations such as correlating multiple event streams.
The next layer in the Azure stack is where the actual analytics occur. Stream Analytics provides built-in integration with Event Hubs to support seamless, real-time analytics and with Blob storage to facilitate access to historical event data and reference data.
In addition to Stream Analytics, Azure provides Machine Learning, a predictive analytics service for mining data and identifying trends and patterns across large data sets. After analyzing the data, Machine Learning can publish a model that can then be used to generate real-time predictions based on incoming event data in Stream Analytics.
Also at the analytics layer is Apache Storm on HDInsight, an engine similar to Stream Analytics. Unlike Stream Analytics, Storm runs on dedicated HDInsight clusters, supports a more diverse set of languages, and can ingest data from more services, but it requires more effort to set up and manage. Stream Analytics, by contrast, provides a built-in, multi-tenant environment and supports only its SQL-like language; it is more limited in scope but makes it easier for an organization to get started. This is, of course, just a basic overview of the differences between the two services, so be sure to check Microsoft resources for more information.
From the analytics layer, we move to what is primarily the storage layer, where data can be persisted for presentation or made available for further consumption. As noted earlier, Stream Analytics can send data to SQL Database or Blob storage. SQL Database is a managed database as a service (DBaaS) and can be a good choice when you require an interactive query response from the transformed data sets.
Stream Analytics can also persist data to Blob storage. From there, the data can again be processed as a series of events, used as part of an HDInsight solution, or made available to large-scale analytic operations such as machine learning. In addition, Stream Analytics can write data back to Event Hubs for consumption by other applications or services or to support additional Stream Analytics jobs.
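A single job can write to more than one of these sinks at once by including multiple SELECT ... INTO statements in its query, for example archiving everything to Blob storage while forwarding only alerts back to an event hub. The sketch below uses hypothetical aliases, fields, and threshold:

    -- Archive all events to Blob storage for later batch processing.
    SELECT *
    INTO [blob-archive]
    FROM [sensor-input] TIMESTAMP BY EventTime

    -- Send only out-of-range readings back to an event hub for other apps.
    SELECT DeviceId, Temperature, EventTime
    INTO [alert-hub]
    FROM [sensor-input] TIMESTAMP BY EventTime
    WHERE Temperature > 90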
The final layer in the analytics stack is presentation and consumption, which can include any number of tools and services. For example, Power Query in Excel includes a native connector for accessing data directly from Blob storage, providing a self-service, in-memory environment for working with the processed event data. Another option is Power BI, which offers a rich set of self-service visualizations. With Power BI, users can interact directly with the processed event data in SQL Database. In addition, a wide range of applications and services can consume data from Blob storage, SQL Database, or Event Hubs, providing almost unlimited options for presenting the processed event data or consuming it for further analysis.

Putting Stream Analytics to Work

Stream Analytics, in conjunction with Event Hubs, provides the structure necessary to perform real-time stream analytics on large sets of data. It is not meant to replace batch-oriented services, but rather to offer a way to handle the growing influx of event data resulting from the expected IoT onslaught. Most organizations will still have a need for traditional transactional databases and data warehouses for some time to come.
Within the world of real-time analytics, the potential uses for services such as Stream Analytics are plentiful. The following list describes some of the possibilities, along with examples of each:
  • Connected devices: Monitor and diagnose real-time data from connected devices such as vehicles, buildings, or machinery in order to generate alerts, respond to events, or optimize operations. Examples: plan and schedule maintenance; coordinate vehicle usage and respond to changing traffic conditions; scale or repair systems.
  • Business operations: Analyze real-time data in order to respond to dynamic environments and take immediate action. Examples: provide stock trade analytics and alerts; recalculate pricing based on changing trends; adjust inventory levels.
  • Fraud detection: Monitor financial transactions in real time to detect fraudulent activity (see the sample query after this list). Examples: correlate a credit card’s use across geographic locations; monitor the number of transactions on a single credit card.
  • Website analytics: Collect real-time metrics to gain immediate insight into a website’s usage patterns or application performance. Examples: perform clickstream analytics; test site layout and application features; determine an ad campaign’s impact; respond to degraded customer experience.
  • Customer dashboards: Provide real-time dashboards to customers so they can discover trends as they occur and be notified of events relevant to their operations. Examples: respond to a service, website, or application going down; view current user activity; view analytics based on data collected from devices or operations.
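As an illustration of the fraud-detection scenario, a transaction stream can be joined to itself to flag a card used in two different locations within a short interval. A sketch with hypothetical input and field names:

    -- Flag any card used in two different locations within 60 seconds.
    SELECT
        T1.CardId,
        T1.Location AS FirstLocation,
        T2.Location AS SecondLocation
    INTO
        [fraud-alerts]
    FROM
        [transactions] T1 TIMESTAMP BY TransactionTime
    JOIN
        [transactions] T2 TIMESTAMP BY TransactionTime
        ON T1.CardId = T2.CardId
        AND DATEDIFF(second, T1, T2) BETWEEN 1 AND 60
    WHERE
        T1.Location != T2.Location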
These scenarios are, of course, only a sampling of the ways an organization can reap the benefits of stream-processing analytics. Services such as Stream Analytics can also translate into significant savings, depending on the organization and type of implementation.
Currently, Microsoft prices Stream Analytics by the volume of processed data and by the number of streaming units used to process that data, at a per-hour rate. A streaming unit is a blended measure of compute capacity (CPU, memory, and throughput), with a maximum throughput of 1 MB/s. Stream Analytics imposes a default quota of 12 streaming units per region, and there are no start-up or termination fees. Customers pay only for what they use, based on the following pricing structure (a rough worked example follows the list):
  • Data volume: $0.001/GB
  • Streaming unit: $0.031/hour
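To put those rates in perspective, consider a hypothetical workload: a job running one streaming unit around the clock for a 30-day month would cost $0.031 × 24 × 30 = $22.32 for compute, and processing 100 GB of data in that month would add 100 × $0.001 = $0.10, for a total of roughly $22.42.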