Key Moments

Customizable Scalable Compute Intensive Stream Queries

Google TalksGoogle Talks
Education5 min read47 min video
Aug 22, 2012|148 views|4
Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

TL;DR

A new system allows custom parallelization of complex stream queries, enabling scientific applications like radio telescopes to process massive data volumes, but the optimal strategy depends heavily on the specific algorithm and hardware.

Key Insights

1

The data stream rate for a virtual radio telescope can reach an enormous 20 terabits per second, necessitating specialized cluster computing beyond standard number crunching.

2

Grid Stream Data Manager (GSDM) utilizes 'data flow distribution templates' (DFDTs) to define how to parallelize algorithms in a distributed environment, allowing customization of execution strategies.

3

The system supports user-defined functions (UDFs) written in C, where computations are specified as continuous queries over sliding windows of data, not SQL.

4

Two primary distribution strategies are 'window distribute' (generic, akin to round-robin) and 'window split' (computation-dependent, for algorithms like FFT radix), with a measured trade-off in performance based on window size and computation complexity.

5

The follow-up project, SIS-KU (Supercomputer Stream Query Processor), extends these concepts to highly parallel Blue Gene supercomputers, adapting to their unique architecture, slower processors, and fast inter-node communication.

6

Optimization can be automated via a template called 'OPT,' which profiles different distribution strategies for a given computation and selects the most efficient one, though this profiling can take up to 10 minutes.

The challenge of massive scientific data streams

Modern scientific applications, such as virtual radio telescopes and patient monitoring, generate vast amounts of data in continuous streams rather than static disk storage. A prime example discussed is a software radio telescope with a data stream rate of 20 terabits per second. Traditional database management systems are ill-equipped for this scale and continuous nature. Data stream management systems (DSMS) are introduced as a similar but distinct paradigm, processing queries that yield streams as results. These systems must handle enormous data volumes, often exceeding disk storage capacity, requiring real-time reduction and processing. Crucially, computations are performed on 'moving windows' of data, not entire datasets. Furthermore, these applications necessitate user-defined functions (UDFs) to handle complex, application-specific filtering and signal processing, moving beyond static query languages like SQL. Scalability is paramount, not just for data volume but also for accommodating computationally expensive UDFs.

Grid Stream Data Manager (GSDM) for parallel stream processing

The Grid Stream Data Manager (GSDM) is designed as a specialized DSMS for high-volume scientific data. It operates as a parallel, distributed system that can dynamically adjust its cluster size. At its core, GSDM allows continuous queries to be defined over data streams, supporting UDFs written in C for flexibility. A key innovation is the use of 'data flow distribution templates' (DFDTs). These templates define how a given algorithm, particularly UDFs, can be parallelized across multiple nodes. This means users can customize the parallelization strategy based on the algorithm's characteristics, which is critical for achieving performance. The system's execution engine runs in main memory to minimize latency. Queries are compiled into logical data flow graphs that dictate the distribution and computation across available nodes.

Customizable parallelization strategies: Window distribute vs. Window split

GSDM offers different DFDTs catering to various computational needs. 'Window distribute' is a generic template, similar to round-robin partitioning used in distributed databases. It distributes entire data windows across nodes for parallel computation. This approach is application-independent and performs well for simple operations where window size doesn't significantly impact algorithm speed. In contrast, 'window split' is a computation-dependent strategy, exemplified by its use with Fast Fourier Transforms (FFTs) employing the radix algorithm. This method splits the window itself into smaller parts before distribution, leading to a more efficient parallel FFT, especially for larger windows. Experimental results indicate a trade-off: 'window distribute' is better for small operations and small windows, while 'window split' excels with expensive computations like FFT, particularly as window sizes increase. The choice between them depends on the algorithm's complexity and the data window size.

The Partition-Compute-Combine (PCC) pattern

A common and highly useful pattern identified is Partition-Compute-Combine (PCC). This pattern involves first distributing data partitions, then performing parallel computation on these partitions, and finally combining the results. GSDM provides PCC as a parameterized constructor, which can be configured with specific distribution and computation functions. For instance, it can utilize 'window distribute' or 'window split' as its partitioning mechanism. This hierarchical structure allows for recursive application of templates, enabling the construction of complex execution patterns. While potentially using more nodes, this strategy can offer performance benefits by balancing distribution, computation, and merging phases, especially when dealing with complex algorithms and large datasets. The system measures these strategies to understand their impact on throughput, latency, and resource utilization.

Automated optimization with the OPT template

To further streamline the process of selecting the best execution strategy, GSDM introduces an 'OPT' template. This template enables automated optimization by profiling different distribution strategies for a given computation. It queries available templates (like PCC with various partitioning functions) and their parameters, runs them with sample data, collects performance metrics (e.g., execution time, resource usage), and stores this profiling data. The system then selects the configuration that yields the best performance. While this profiling process might take several minutes (e.g., 10 minutes), it allows for specialized, highly optimized query execution, particularly beneficial for computationally intensive tasks like those found in scientific data analysis.

SIS-KU: Scaling to massive Blue Gene supercomputers

The follow-up project, SIS-KU (Supercomputer Stream Query Processor), extends these principles to the extreme scale of IBM Blue Gene supercomputers. These machines, like the one in the Netherlands with 12,000 processors, are designed for massive parallelism but feature slower individual processors to manage energy consumption. SIS-KU adapts GSDM's customizable approach to this architecture, leveraging its extremely fast inter-node communication (gigabit network per node). The system can deploy stream processors across different types of nodes (computation, communication, I/O), handling input from distributed antenna arrays. It integrates with both regular Linux clusters (acting as front-end and back-end processors) and the Blue Gene itself, facilitating data reduction at multiple stages. A key challenge is adapting to the Blue Gene’s OS subset and its requirement for identical executables on all nodes, which is managed by interpreting application-specific code on each node.

Research challenges and opportunities in stream processing

Both GSDM and SIS-KU highlight ongoing research areas in stream processing. Key challenges include incorporating hardware specifications into optimization strategies, better utilizing specialized hardware nodes, and managing limitations imposed by operating systems on supercomputers. Opportunities lie in optimizing stream processing across multiple, disparate clusters, leveraging the abundance of cheap processors on supercomputers instead of threads, and developing advanced customization techniques. The adaptable nature of these systems is crucial for handling the complexity and scale of future scientific data applications, moving beyond generic solutions to fine-tuned, algorithm-aware parallel processing.

Comparison of Stream Processing Strategies for FFT

Data extracted from this episode

StrategyWindow SizePerformance CharacteristicNotes
Central ExecutionTypical (e.g., 256)Slower than distributedBaseline for comparison
Window Distribute (Round Robin)Small to Medium (e.g., 256)Good for small operations, less flexible for complex computationsEffective when window size doesn't heavily influence algorithm speed
Window Split (FFT Radix based)Large (e.g., 16K), smaller internal windowsBetter for expensive computations like FFT, scales well with larger windowsPartitioning and combination are more complex, may use more nodes.
Tree DistributionLarge contextCan be best for highly optimized FFT, but uses significantly more nodesTrade-off between node count and computational efficiency; may not be feasible with limited nodes.

Common Questions

A DSMS is a system designed to manage and query continuous streams of data, similar in concept to a traditional database management system but specifically adapted for the high-volume, real-time nature of data streams.

Topics

Mentioned in this video

More from GoogleTalksArchive

View all 48 summaries

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free