An Efficient Spark-based Approach for Incremental Frequent Patterns Mining
XUN Yaling, SUN Jiaojiao, BI Huimin
Abstract:
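Processing large-scale, fast-growing datasets poses new challenges for frequent itemset mining (FIM). Although some existing methods scale well, they do not make full use of the results already computed on the original dataset, and they incur excessive communication overhead when processing distributed datasets. To address this problem, an efficient parallel incremental FIM algorithm, FCFPIM, is proposed on the Spark platform. FCFPIM uses a fully compressed frequent pattern tree (FCFP-Tree) structure to mine incremental frequent patterns effectively: when the data is updated, the original dataset does not need to be rescanned or re-mined, so the mining results from the original dataset are fully reused. An effective RDD transformation strategy is designed so that frequent patterns can be mined in parallel. In addition, to further improve parallel mining efficiency, a correlation-based grouping strategy is introduced to balance the load across the compute nodes of the cluster. Extensive experimental results show that FCFPIM scales well and handles large-scale dynamic datasets efficiently.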
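The core incremental idea in the abstract, updating results from the new transactions alone instead of rescanning the original dataset, can be illustrated with a minimal single-machine sketch. This is not the FCFPIM algorithm: it keeps plain support counts in a `Counter` rather than an FCFP-Tree, uses no Spark RDDs, and the names `count_itemsets` and `frequent`, the `max_size` limit, and the toy transactions are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size=2):
    """Count every itemset of up to max_size items in the given transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def frequent(counts, n_transactions, min_support):
    """Keep the itemsets whose count reaches min_support * n_transactions."""
    threshold = min_support * n_transactions
    return {s: c for s, c in counts.items() if c >= threshold}


# Hypothetical toy data: an original dataset and a later increment.
original = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
increment = [["a", "d"], ["b", "d"]]

# Counts from the original dataset are computed once and stored.
stored = count_itemsets(original)

# On update, only the increment is scanned; its counts are merged in.
updated = stored + count_itemsets(increment)

total = len(original) + len(increment)
result = frequent(updated, total, min_support=0.5)
```

Merging the stored counts with the increment's counts gives the same totals as recounting the combined dataset, which is what makes skipping the rescan safe. Storing a count for every itemset grows quickly with transaction width, which is one reason a compressed tree structure is attractive in practice.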
Keywords: frequent pattern mining; incremental data mining; Spark; parallel computing; load balancing
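Foundation: National Science Foundation for Young Scientists of China (61602335); Natural Science Foundation of Shanxi Province (201901D211302); Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20172017)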