Technology Innovation
The Chengdu Metro COCC Project is a network-wide comprehensive monitoring and operation platform led by Chengdu Metro Operation Co., Ltd. and built with the in-depth participation of BII Transportation Technology (Beijing) Co., Ltd. The platform monitors all major business indicators of metro operation, including traffic, passenger flow, equipment, and energy consumption.
The passenger indicators come from clearing data generated by the ACC system, which is sent to the COCC data interface server in real time in the form of files. The ACC system produces a large number of data files every day, and the largest individual files reach hundreds of megabytes. To meet the real-time monitoring requirements, the COCC system must not only receive these files but also compute over the data in real time to produce accurate prediction and early-warning results.
Faced with such a large volume of real-time processing work, a conventional stand-alone program can hardly guarantee the timeliness and accuracy of data processing. We therefore adopted the approach common in big data processing: a distributed architecture that performs real-time stream processing on the ACC passenger traffic data.
I. The distributed architecture ensures the stability of the system
As shown in the figure above, our system uses a distributed architecture. The Zookeeper cluster provides coordination and synchronization services for the other distributed components, Kafka and Storm. The Kafka cluster acts as the data buffer: before being processed by Storm, the data is cached in Kafka. The Storm cluster is a real-time data processing framework that processes data in a pipelined manner and gives developers powerful distributed processing capability without having to deal with the underlying distributed implementation details.
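To make the role of the Kafka buffer concrete, the sketch below (in Java, using the standard Kafka producer client) shows how the records of one ACC file could be pushed into a Kafka topic on the interface server. The broker addresses and the topic name acc-passenger-flow are illustrative assumptions, not the project's actual configuration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AccFileToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker list; the real COCC Kafka cluster addresses would go here.
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Path of one ACC clearing file received on the COCC data interface server.
        Path accFile = Paths.get(args[0]);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(accFile)) {
            // Each record is buffered in Kafka until the Storm topology consumes it,
            // decoupling file arrival from downstream processing speed.
            lines.forEach(line -> producer.send(new ProducerRecord<>("acc-passenger-flow", line)));
        }
    }
}
```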
When processing throughput drops, it can be improved roughly linearly simply by modifying the startup parameters and increasing the number of cluster nodes. In addition, because our data processing tasks run on the Storm cluster, if any node in the cluster goes down, its unprocessed data is automatically transferred to another node for subsequent processing, without data loss or delay.
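Scaling is parameter-driven in Storm: the number of worker processes and each component's parallelism hint are startup parameters, and a running topology can be resized with the storm rebalance command. The no-data-loss behaviour rests on Storm's tuple anchoring and acknowledgement mechanism: a tuple that is not acked before its timeout is replayed from the spout, so work lost with a failed node is redone elsewhere. Below is a minimal bolt sketch illustrating this (assuming Storm 2.x APIs; the comma-separated ACC record layout and the field names are assumptions, not the project's actual schema).

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Parses one raw ACC record. The anchored emit plus ack/fail lets Storm replay
// the tuple from the spout if a worker dies before processing completes.
public class ParseAccRecordBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // "value" is the record payload as emitted by the upstream Kafka spout.
            String raw = input.getStringByField("value");
            String[] fields = raw.split(",");   // comma-separated layout is an assumption
            collector.emit(input, new Values(fields[0], Long.parseLong(fields[1]))); // anchored emit
            collector.ack(input);               // mark the tuple as fully processed
        } catch (Exception e) {
            collector.fail(input);              // ask the spout to replay this tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("stationId", "passengerCount"));
    }
}
```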
II. The Storm stream processing framework ensures the data is processed in a timely manner
Storm processes data in a pipelined manner, also known as stream processing. As shown in the figure above, each topology is structured like a tree. The root is the source of the topology (the spout): it receives data from Kafka, performs some pre-processing, and then distributes the data to the downstream nodes (bolts). Each downstream node continues to process the data it receives from upstream and, at the end of the pipeline, the results are written into the database.
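A minimal sketch of such a topology tree is shown below, again assuming Storm 2.x with the storm-kafka-client module and reusing the ParseAccRecordBolt from the previous sketch; the component names, parallelism values, broker addresses, and the stubbed database writer are illustrative only.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class AccTopology {

    /** Leaf of the tree: persists processed records (database access is stubbed out here). */
    public static class DatabaseWriterBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;   // a real implementation would also open a DB connection here
        }

        @Override
        public void execute(Tuple input) {
            // An INSERT into the monitoring database would go here.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: emits nothing downstream
        }
    }

    public static void main(String[] args) throws Exception {
        // Root of the tree: a Kafka spout that reads the buffered ACC records.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka1:9092,kafka2:9092", "acc-passenger-flow").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("acc-spout", new KafkaSpout<>(spoutConfig), 2);

        // Downstream node: parse and pre-process raw records (bolt from the previous sketch).
        builder.setBolt("parse", new ParseAccRecordBolt(), 4)
               .shuffleGrouping("acc-spout");

        // Next downstream node: write per-station results to the database.
        builder.setBolt("db-writer", new DatabaseWriterBolt(), 2)
               .fieldsGrouping("parse", new Fields("stationId"));

        Config conf = new Config();
        conf.setNumWorkers(4);   // startup parameter: increase to scale out across more nodes

        StormSubmitter.submitTopology("acc-passenger-topology", conf, builder.createTopology());
    }
}
```

In this sketch the fieldsGrouping on stationId routes all records for the same station to the same db-writer task, which keeps any per-station aggregation local to one task before it is persisted.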