June 9, 2003

Aurora

Puntos clave

Los puntos clave no están disponibles para este artículo en este momento.

Resumen

The Aurora system 1 is an experimental data stream management system with a fully functional prototype. It includes both a graphical development environment, and a runtime system. We propose to demonstrate the Aurora system with its development environment and runtime system, with several example monitoring applications developed in consultation with defense, financial, and natural science communities. We will also demonstrate the effect of various system alternatives on various workloads. For example, we will show how different scheduling algorithms affect tuple latency and internal queue lengths. We will use some of our visualization tools to accomplish this. Data Stream Management Aurora is a data stream management system for monitoring applications. Streams are continuous data feeds from such sources as sensors, satellites and stock feeds. Monitoring applications track the data from numerous streams, filtering them for signs of abnormal activity and processing them for purposes of aggregation, reduction and correlation. The management requirements for monitoring applications differ profoundly from those satisfied by a traditional DBMS: o A traditional DBMS assumes a passive model where most data processing results from humans issuing transactions and queries. Data stream management requires a more active approach, monitoring data feeds from unpredictable external sources (e.g., sensors) and alerting humans when abnormal activity is detected. o A traditional DBMS manages data that is currently in its tables. Data stream management often requires processing data that is bounded by some finite window of values, and not over an unbounded past. o A traditional DBMS provides exact answers to exact queries, and is blind to real-time deadlines. Data stream management often must respond to real-time deadlines (e.g., military applications monitoring positions of enemy platforms) and therefore must often provide reasonable approximations to queries. o A traditional query processor optimizes all queries in the same way (typically focusing on response time). A stream data manager benefits from application specific optimization criteria (QoS). o A traditional DBMS assumes pull-based queries to be the norm. Push-based data processing is the norm for a data stream management system. A Brief Summary of Aurora Aurora has been designed to deal with very large numbers of data streams. Users build queries out of a small set of operators (a.k.a. boxes). The current implementation provides a user interface for tapping into pre-existing inputs and network flows and for wiring boxes together to produces answers at the outputs. While it is certainly possible to accept input as declarative queries, we feel that for a very large number of such queries, the process of common sub-expression elimination is too difficult. An example of an Aurora network is given in Screen Shot 1. A simple stream is a potentially infinite sequence of tuples that all have the same stream ID. An arc carries multiple simple streams. This is important so that simple streams can be added and deleted from the system without having to modify the basic network. A query, then, is a sub-network that ends at a single output and includes an arbitrary number of inputs. Boxes can connect to multiple downstream boxes. All such path splits carry identical tuples. Multiple streams can be merged since some box types accept more than one input (e.g., Join, Union). We do not allow any cycles in an operator network. Each output is supplied with a Quality of Service (QoS) specification. Currently, QoS is captured by three functions (1) a latency graph, (2) a value-based graph, and (3) a loss-tolerance graph. The latency graph indicates how utility drops as an answer is delayed. The value-based graph shows which values of the output space are most important. The loss-tolerance graph is a simple way to describe how averse the application is to approximate answers. Tuples arrive at the input and are queued for processing. A scheduler selects a box with waiting tuples and executes that box on one or more of the input tuples. The output tuples of a box are queued at the input of the next box in sequence. In this way, tuples make their way from the inputs to the outputs. If the system is overloaded, QoS is adversely affected. In this case, we invoke a load shedder to strategically eliminate Aurora supports persistent storage in two different ways. First, when box queues consume more storage than available RAM, the system will spill tuples that are less likely to be needed soon to secondary storage. Second, ad hoc queries can be connected to (and disconnected from) any arc for which a connection point has been defined. A connection point stores a historical portion of a stream that has flowed on the arc. For example, one could define a connection point as the last hour’s worth of data that has been seen on a given arc. Any ad hoc query that connects to a connection point has access to the full stored history as well as any additional data that flows past while the query is connected.

Me gusta

Guardar