Data Analysis Platform

With the rapid development of Internet technology, Big Data and Cloud Computing have become two of the most prominent fields in computing. Gartner, Inc. defines Big Data as "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." Data analysis, in this context, is the process of applying appropriate statistical methods to large volumes of collected data in order to extract useful information and draw well-founded conclusions.

Since the founding of OMNILab, we have made extensive progress in understanding Cloud Computing and Big Data. The goal of our platform is to integrate data collection from different sources, storage, and computation services so that data can be analyzed effectively through a standardized pipeline. The platform is built around a core message queue, Apache Kafka, and consists of four modules: data collection, storage, computation, and services. The collection module gathers data through web crawlers, sensors, and other channels, covering both time-series data (server logs, device status, network traffic) and spatial data (IoT and GPS data). In practice, a ten-node Hadoop cluster runs on CentOS, deployed with the Cloudera CDH release and administered through Cloudera Manager; HDFS serves as the storage layer, with a current capacity of 80 TB. The computation module follows the Lambda Architecture, combining two primary procedures, real-time processing and batch processing: Apache Storm and Spark Streaming handle the real-time streams arriving from Kafka, while Apache Spark, a distributed and scalable analytics engine, serves as the batch processing framework. The service module comprises data services (NFS storage, MySQL, PostgreSQL) and application services (HomePage, GitLab, Wiki) for external public use.
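
To make the real-time path concrete, here is a minimal sketch of a Spark Streaming job that consumes records from Kafka and writes per-batch aggregates to HDFS, in the style of the speed layer described above. The broker address, topic name, consumer group, one-token log-level format, and output path are all illustrative assumptions, not the platform's actual configuration.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LogStreamJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OmniLabLogStream")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Kafka consumer settings; broker and topic names are hypothetical.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-analysis",
      "auto.offset.reset"  -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Array("server-logs"), kafkaParams))

    // Count log lines per level (assuming the level is the first token of
    // each line) in every micro-batch, and persist the counts to HDFS so
    // the batch layer can reconcile them later.
    stream.map(record => (record.value.split(" ")(0), 1L))
          .reduceByKey(_ + _)
          .saveAsTextFiles("hdfs:///data/logs/level-counts")

    ssc.start()
    ssc.awaitTermination()
  }
}
```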
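
On the batch side, the same aggregation can be recomputed periodically over the complete history stored in HDFS with an ordinary Spark job; the sketch below mirrors the streaming computation under the same hypothetical paths and log format. Running both paths against the same raw data lets the slower but complete batch results correct any loss in the real-time view, which is the core idea of the Lambda Architecture.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogBatchJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OmniLabLogBatch"))

    // Recompute the log-level counts over the full raw history in HDFS.
    sc.textFile("hdfs:///data/logs/raw/*")
      .map(line => (line.split(" ")(0), 1L)) // level assumed to be the first token
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///data/logs/level-counts-full")

    sc.stop()
  }
}
```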

Person/Organization: Haiyang Wang, Jianwen Wei, Yusu Zhao, Pengfei Zhang