(1) The theory of Simhash is analyzed and discussed, and a Chinese text deduplication model based on Simhash is built. Given the long processing time on massive data, a new storage and retrieval scheme is introduced and analyzed.
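For readers unfamiliar with Simhash, the sketch below illustrates its core idea in Python: each token's hash casts a weighted vote on every bit position, and the sign of each vote determines the final fingerprint bit, so near-duplicate texts land at a small Hamming distance. The 64-bit width, the token weights, and the example tokens are illustrative assumptions, not the exact settings used in this paper.

import hashlib

def simhash(weighted_tokens, bits=64):
    # Weighted bit-vote over token hashes: a set bit votes +weight,
    # a clear bit votes -weight, at every bit position.
    votes = [0] * bits
    for token, weight in weighted_tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # Positive total vote -> fingerprint bit 1, otherwise 0.
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two near-duplicate token lists differ in one low-weight token,
# so their fingerprints should be only a few bits apart.
doc_a = [("中文", 2), ("文本", 3), ("去重", 3), ("模型", 1)]
doc_b = [("中文", 2), ("文本", 3), ("去重", 3), ("算法", 1)]
print(hamming_distance(simhash(doc_a), simhash(doc_b)))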
(2) The theory of Shingling is analyzed and discussed, and a Chinese text deduplication model based on Shingling is built. The selection of key parameters in the algorithm is discussed. To optimize storage, a scheme that stores hash values instead of raw shingles is put forward, and its space and time complexity are analyzed.
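As a minimal illustration of the Shingling approach, the Python sketch below extracts contiguous character k-grams, replaces them with fixed-width hashes to save storage (the scheme mentioned above), and compares two texts by Jaccard resemblance. The shingle length k=4 and the 32-bit hash width are assumptions for exposition; the choice of such parameters is exactly what Chapter 4 discusses.

import hashlib

def shingles(text, k=4):
    # All contiguous k-character substrings; character shingles suit
    # Chinese text, which has no whitespace word boundaries.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text, k=4):
    # Keep 32-bit hashes of the shingles rather than the substrings
    # themselves, trading exactness for much smaller storage.
    return {int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)
            for s in shingles(text, k)}

def jaccard(a, b):
    # Resemblance of two shingle sets: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "基于Simhash算法的中文文本去重模型"
doc_b = "基于Shingling算法的中文文本去重模型"
print(jaccard(hashed_shingles(doc_a), hashed_shingles(doc_b)))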
(3) The theory of the LSA model is analyzed and discussed, and a Chinese text deduplication model based on LSA is built. LSA is a spatial model defined over words, topics, and documents. Through singular value decomposition, the document-word matrix is mapped to a low-dimensional semantic space, which effectively reduces the complexity of the algorithm.
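To make the SVD step concrete, here is a minimal Python sketch on a toy word-document count matrix: a truncated SVD keeps the k largest singular values, and documents are then compared by cosine similarity in the resulting k-dimensional latent space. The toy matrix, the rank k=2, and the word labels are illustrative assumptions, not the corpus or dimensionality used in the experiments.

import numpy as np

# Toy word-document count matrix A (rows: words, columns: documents).
A = np.array([
    [2, 2, 0, 0],   # word "去重"
    [1, 1, 0, 0],   # word "文本"
    [0, 0, 3, 2],   # word "分词"
    [0, 1, 2, 3],   # word "算法"
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Each column of docs_k is a document expressed in the
# k-dimensional latent semantic space.
docs_k = np.diag(s[:k]) @ Vt[:k]

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Documents 0 and 1 share vocabulary, so their latent-space
# cosine similarity should be close to 1.
print(cosine(docs_k[:, 0], docs_k[:, 1]))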
(4) Experiments were conducted on the three models. The recall and precision of all three exceed 75%, and those of the LSA-based model exceed 90%. In terms of time complexity, the Simhash-based model with hash storage is the best, with the lowest processing time of the three. The Shingling-based model has extremely high time complexity and is therefore hard to apply in practice. Both the Simhash-based and LSA-based models perform well. Although the Simhash-based model is inferior to the LSA-based model in recall and precision, it offers lower average retrieval time and smaller storage space on massive data, making it the first choice for real projects.
At present, there is little published work on Chinese text deduplication in China. This paper combines Chinese word segmentation with Simhash, which Google originally used for web filtering, and proposes a Chinese text deduplication model. There are also several limitations: because the specimens involve private business information, little data was available for experiments, and experiments on massive data cannot be carried out until the whole project is released and tested.
Keywords: text deduplication; Simhash; hash storage; Shingling; LSA
Contents
1 Introduction
1.1 Background and Significance
1.2 Research Status at Home and Abroad
1.3 Research Framework of This Paper
2 Related Techniques
2.1 Chinese Word Segmentation
2.1.1 Dictionary-Based Chinese Word Segmentation
2.1.2 Statistics-Based Chinese Word Segmentation
2.1.3 Hybrid Dictionary-Statistics Algorithms
2.2 Field Weighting Methods
2.3 Text Similarity Measures
2.3.1 Geometric Distances
2.3.2 Non-Geometric Distances
2.4 Evaluation Metrics
2.4.1 Recall and Precision
2.4.2 F-Measure
2.5 Chapter Summary
3 Deduplication Model Based on the Simhash Algorithm
3.1 The Simhash Algorithm
3.2 Hash Storage
3.3 Overall Model Design
3.4 Chapter Summary
4 Deduplication Model Based on the Shingling Algorithm
4.1 The Shingling Algorithm
4.2 Overall Model Design
4.3 Chapter Summary
5 Deduplication Model Based on the LSA Algorithm
5.1 The LSA Model
5.2 Overall Model Design
5.3 Chapter Summary
6 Experimental Analysis
6.1 Implementation and Validation of the Simhash Deduplication Model
6.1.1 Determination of Optimal Parameters
6.1.2 Measurement of Running Time
6.2 Implementation and Validation of the Shingling Deduplication Model
6.2.1 Determination of Optimal Parameters
6.2.2 Measurement of Running Time
6.3 Implementation and Validation of the LSA Deduplication Model
6.3.1 Determination of Optimal Parameters
6.3.2 Measurement of Running Time
6.4 Analysis and Comparison of the Three Models
6.5 Chapter Summary
Conclusion
Acknowledgements
References
Appendix: Selected Experimental Results
1 Introduction
1.1 Background and Significance
With continuous progress in science and technology and ongoing economic development, the information age has entered a new stage: the era of "big data". A series of related problems has followed, for example: how to maintain ever-growing databases and remove the redundant data in them; how to extract useful information from seemingly disordered, patternless data; and how to efficiently identify similar web pages and filter spam. These problems have further driven the development of data mining, machine learning, and several other fields of computer science.