15. Add pi to U 16. end for 17.end if
18.for each point pi∈U do 19. 计算 LDOF(pi) 20.end for
21.根据LDOF的值对所有点进行排序 22.拥有高LDOF值得前n个点作为离群点 B.实验结果
这一部分,我们比较离群点检测算法PLDOF和LDOF。
(1)医学诊断数据集:在现实的数据库中很难找到一个数据集来评估离群点检测算法,因为仅有很少的数据集能正确的认识到哪些对象是真正的行为异常。在这个试验中我们使用医学的数据集WDBC,这个数据集是为了乳房肿块诊断做原子特征提取的。这个数据集包括569个医学诊断记录,每个记录都有32个属性(ID,诊断信息,30个真实的输入特征)。这个诊断是二元的,要么是良性的要么是恶性的。我们认为良性的记录的正常数据。这个实验中,我们用所有的这357个良性诊断记录作为正常对象,然后加入5个恶性诊断记录作为离群点。通过改变邻近的大小k,把真实离群点占隐藏离群点的比率作为检测的精度。实验中,我们逐渐增加k的值,然后计算每个方法的计算精度。
图1 WDBC数据
5
为了进一步验证我们的方法,我们重复10次实验,每次离群点的数目不相
同。每次独立执行10步,然后计算平均检测精度,k的取值范围从10到50.从图1中我们可以看出精度随前n个点和邻域的大小k变化而变化。正如图1所示,我们的算法比LDOF算法精度高。如果考虑钱20个点,两个算法都能检测到所有的离群点。我们同时也改变邻域大小k来进行实验。可以看到,两个算法在k=30时精度达到了极限。从表1中我们可以看出虽然我们从数据集中剪去了57%的点,但是两个算法的精度还是差不多。由于剪去了57%的点,计算的消耗也大大降低,这也是基于LDOF剪枝方法值得肯定的一面。
表1 WDBC数据集的剪枝比率
(2)酵母数据集:在这个实验中,我们使用酵母数据集,这个数据集曾被用来查找蛋白质集中的地方。数据集包括1484个对象,每个对象有如下属性:名字和其他8个输入特征。我们认为在C-terminus中值为0的目标信号为正常数据,其中POX的值比异常的大2.实验中我们使用整个1474个记录作为正常对象,然后加入10个记录作为离群点。我执行与医学诊断数据集相似类型的实验可以观察到相同的趋势。实验结果如图2和表2.可以看出我们可以从原始数据集中剪去超过60%的点而不丢失离群点。
表2 酵母数据集的剪枝比率
6
图2 酵母数据
5.结论
本篇论文我们提出了一个有效的离群点检测方法。首先通过使用每个聚类的
半径来判断哪些点可能不是离群点,然后从数据集中移除这些点。由于降低了数据集的大小,时间复杂度得到了降低。同时使用基于距离的局部离群系数来衡量一个对象偏离其邻域的程度。通过剪除一些点使得我们的方法在检测离群点上比现在的方法精度要高。
参考文献
[1] F. Angiulli, S. Basta, and C. Pizzuti. Distance-based detection and prediction of outliers. IEEE Transactions on Knowledge and Data Engineering, 18:145–160, 2006. [2] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In PKDD ’02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pages 15–26, 2002.
[3] F. Angiulli and C. Pizzuti. Outlier mining in large highdimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17:203–215, 2005. [4] A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection
7
in large databases. pages 164–169,1996.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander.Lof: identifying density-based local outliers. SIGMOD Rec.,29(2):93–104, 2000.
[6] M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proceedings of 1stInternational Conference on Knowledge Discovery and DataMining (KDD-95), 1995.
[7] A. Ghoting, S. Parthasarathy, and M. Otey. Fast mining of distance-based outliers in high-dimensional datasets. Data Mining and Knowledge Discovery, 16(3):349–364, June 2008.
[8] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD Rec.,27(2):73–84, 1998.
[9] W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 293–298, 2001.
[10] E. M. Knorr and R. T. Ng. Algorithms for mining distancebased outliers in large datasets. In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 392–403, 1998.
[11] E. M. Knorr and R. T. Ng. Finding intensional knowledge of distance-based outliers. In VLDB ’99: Proceedings of the 25th International Conference on Very Large Data Bases, pages211–222, 1999.
[12] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3-4):237–253, 2000.
[13] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. pages 144–155, 1994.
[14] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos.Loci: fast outlier detection using the local correlation integral.In Data Engineering, 2003. Proceedings. 19th International Conference on Data Engineering, pages 315–326, March 2003. [15] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. pages 427–438, 2000.
[16] Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large
8
databases in any metric space. KDD 06, 20, 2006.
[17] B. V and L. T. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.
[18] K. Zhang, M. Hutter, and H. Jin. A new local distance-based outlier detection approach for scattered real-world data. In PAKDD ’09: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining,pages 813–822, 2009.
[19] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. SIGMOD Rec., 25(2):103–114, 1996.
9

