The rise of Apache Spark and the broader shift from batch to real-time, streaming big data have disrupted the traditional data stack over the last two years. But one of the last hurdles to extracting value from Big Data lies on the analytics side: how quickly an organization can query and visualize Big Data, and translate those visualizations into business value, is still a point of contention. Although eighty-seven percent of enterprises believe Big Data analytics will redefine the competitive landscape of their industries within the next three years, some work remains to get there.
Most engineers who use legacy business intelligence tools find them inadequate for the performance load of Big Data. Others who write their own analytics with D3.js or similar tools wrestle with the backend challenge of fusing real-time data with other data stores. So let’s take a look at streaming architectures and what they mean for developers focused on big data analytics.
Big Data Streaming Analytics
Data Naturally Exists in Streams
All commerce, whether online or in person, takes place as a stream of events and transactions. In terms of data, the term “streaming” simply refers to how data moves from place A to place B. With the right technology, historical data can be streamed just as easily as real-time data. Nevertheless, when streaming data and streaming analytics are referenced in a business context, the discussion generally refers to streaming, real-time data.
A long time ago, businesses recorded their streams of events and transactions in a book—yes, an actual book—which kept a record of inventory and sales. Books eventually yielded to computers and databases, but practical limitations still constrained data processing to local operations. Later, data was packaged, written to disk, and shipped between locations for processing and analysis. Grouping the data stream into batches made it easier to store and transport.
Now, batching is no longer necessary. Systems and networks are faster and more reliable, and programming languages and databases have evolved to accommodate a more distributed, streaming data architecture. For example, physical retail stores used to close for a day each quarter to conduct inventory. Then they moved to batch analysis across locations, first weekly and then daily. Today, they keep a running inventory that is accurate through the most recent transaction. Similar examples exist in every industry.
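The running-inventory idea can be sketched in a few lines of Python. This is an illustrative toy, not any particular product's API: the batch version only produces totals at batch boundaries, so counts go stale between batches, while the streaming version yields an up-to-date total after every transaction.

```python
from typing import Iterable, List, Tuple

# Each event is (item, quantity_delta): sales are negative, restocks positive.
Event = Tuple[str, int]

def batch_inventory(events: Iterable[Event], batch_size: int) -> List[dict]:
    """Expose totals only once per batch -- stale between batch boundaries."""
    totals: dict = {}
    snapshots = []
    for i, (item, delta) in enumerate(events, start=1):
        totals[item] = totals.get(item, 0) + delta
        if i % batch_size == 0:  # visibility only at batch boundaries
            snapshots.append(dict(totals))
    return snapshots

def streaming_inventory(events: Iterable[Event]):
    """Yield an up-to-date total after every single event."""
    totals: dict = {}
    for item, delta in events:
        totals[item] = totals.get(item, 0) + delta
        yield dict(totals)

events = [("widget", 10), ("widget", -3), ("gadget", 5), ("widget", -2)]
print(batch_inventory(events, batch_size=2))   # periodic snapshots only
print(list(streaming_inventory(events))[-1])   # accurate through the latest event
```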
Batch Mode Slows Data Analysis and Visualization
Traditional, batch-oriented data warehouses pull data from multiple sources at regular intervals, bringing it to a central location and assembling it for analysis. This practice creates data management and security headaches that grow over time as the number of data sources and the size of each batch increase. It takes time to export batches from the data source and import them into the data warehouse. In large organizations, batching can conflict with backup operations. The whole cycle of batching, transporting, and analyzing data often takes so long that it becomes impossible for a complex business to know what happened yesterday, or even last week.
By contrast, with streaming, real-time data analysis, organizations know they are working with the most recent version of data because they stream the data on demand. By tapping into data sources only when they need the data, organizations eliminate the problems presented by storing and managing multiple versions of data. This simplifies data governance and security; working with streaming data means not having to track and secure multiple batches.
We live in an on-demand world. Working with data streams enables enterprises to harness the value of the data they work so hard to collect, and tap into it to build competitive advantage.
Eliminating the Barriers to Real-Time Analysis
Previously, building streaming-data analysis environments took months or even years. They required expensive, dedicated infrastructure, which suffered from a lack of interoperability. They needed specialized developers and data architects. And, they were rigid—unable to adapt to rapid changes in the database world, such as the rise of unstructured data.
In the past few years, we’ve witnessed a flurry of activity in the streaming data analysis space, with new software, hardware and networking technologies. Always-on, low latency, high-bandwidth networks cost less than ever before. Inexpensive and fast memory and storage allow for more efficient data analysis.
We’ve also seen the rise of easy-to-use, inexpensive, open-source streaming data platform components. The Weather Channel, Spotify, WebMD, and Alibaba.com have implemented Apache Storm, a Hadoop-compatible stream-processing system (open-sourced by Twitter) for rapid data transformation. Apache Spark, a fast, general engine for large-scale data processing, supports SQL, machine learning, and streaming-data analysis. Apache Kafka, an open-source message broker, has been widely adopted for ingesting and consuming streaming data. And Amazon Kinesis, a fully managed, cloud-based service for real-time processing of large, distributed data streams, can continuously capture high volumes of data from streaming sources.
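What brokers like Kafka and Kinesis have in common is a decoupled produce/consume pattern: writers publish events to a topic without knowing who will read them, and readers consume at their own pace. The sketch below shows only that pattern, using a stdlib thread-safe queue as a stand-in topic; it omits everything a real broker adds (partitioning, replication, durable offsets), and the event shape is invented for illustration.

```python
import queue
import threading

# A minimal stand-in for a message-broker topic. A thread-safe FIFO queue is
# enough to show the decoupled produce/consume pattern that Kafka popularized.
topic = queue.Queue()
SENTINEL = object()  # signals end-of-stream to the consumer

def producer(n_events: int) -> None:
    """Publish events without knowing anything about downstream consumers."""
    for i in range(n_events):
        topic.put({"event_id": i, "payload": f"click-{i}"})
    topic.put(SENTINEL)

def consumer(results: list) -> None:
    """Consume events at the reader's own pace."""
    while True:
        msg = topic.get()
        if msg is SENTINEL:
            break
        results.append(msg["event_id"])  # streaming analytics would happen here

results: list = []
t_prod = threading.Thread(target=producer, args=(5,))
t_cons = threading.Thread(target=consumer, args=(results,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(results)  # FIFO delivery: [0, 1, 2, 3, 4]
```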
Big Data Streaming Analytics Checklist for Developers
The number of streaming big data sources and the volume of streaming data continue to grow, so businesses can’t limit themselves to historical data for business insight. They require timely analysis of streaming data from the Internet of Things (IoT), social media, location, market, news and weather feeds, clickstreams, and live transactional data.
Examples of streaming-data analytics include:
- Optimizing mobile networks on the fly using network device logs and subscriber location data
- Decreasing the risk of hospital-originated infections by capturing and analyzing real-time data from monitors on newborn babies
- Alerting office equipment service technicians to respond to impending equipment failures
As the importance of big data streaming analytics grows, the focus becomes closing the “time-to-analytics” gap—how long it takes from the arrival of data to realized business value. I see three ways that developers should rethink how they embed analytics in their applications:
Simplicity of Use: Analytics must be accessible to business users. With streaming data, a responsive, well-designed visual interface will make the data more accessible to the “citizen data scientist.” Any developer bringing analytics into a business application must deal with the “consumerization of IT,” ensuring that business users get the same convenience and intuitive operation that mobile or web applications offer.
Speed and Performance First: With visualization comes the need to bring query results to the users in near real-time. Business users won’t tolerate spinning pinwheels while queries resolve. Today we’re seeing analytics pushed into the stream via Spark, which is emerging as the de facto approach for sub-second query response times across billions of rows of data, while leaving data in place.
Consolidating Multiple Data Sources: Businesses shouldn’t have to distinguish between “Big Data” and other forms of data. There’s just data, and it needs to be available within business applications for sub-second, interactive visualization. So developers who embed analytics in applications need to consolidate multiple and varied data sources so they can be queried as a single source.
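The consolidation point above can be made concrete with a small, self-contained sketch. This is not how any particular analytics product does it; it simply uses an in-memory SQLite database with an invented schema to show the idea: a historical table and a live-stream buffer are unified behind a single view, so application code issues one query and never cares which rows are historical and which are real-time.

```python
import sqlite3

# Two sources: a "historical" warehouse table and a "live" stream buffer.
# A view presents them as a single queryable source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (ts INTEGER, amount REAL);
    CREATE TABLE live_sales      (ts INTEGER, amount REAL);
    CREATE VIEW all_sales AS
        SELECT ts, amount FROM warehouse_sales
        UNION ALL
        SELECT ts, amount FROM live_sales;
""")
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)",
                 [(1, 100.0), (2, 250.0)])
conn.executemany("INSERT INTO live_sales VALUES (?, ?)",
                 [(3, 75.0)])

# One query spans both sources -- the caller sees "just data."
total = conn.execute("SELECT SUM(amount) FROM all_sales").fetchone()[0]
print(total)  # 425.0
```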
To see streaming data analytics in action, check out our interactive demos or download a free trial of Zoomdata.