Аннотації

Оптимізація системи пошуку збігів за допомогою використання алгоритмів локально-чутливого хешування наборів текстових даних

Автор(и):

Білощицький А. О., Діхтяренко О. В.

Автор(и) (англ)

Biloshchytskyi, A. , Dikhtiarenko, O.

Дата публікації:

30.09.2014

Анотація (укр):

Запропоновано використання алгоритмів локально чутливого хешування як способу збільшення повноти вибірки у процесі перевірки текстових документів системою пошуку збігів. Розглянуто кілька відомих алгоритмів та зроблено теоретичну оцінку доцільності їх застосування. Описано принципи роботи кожного з методів та спосіб використання в рамках системи, що розробляється.

Анотація (рус):

Предложено использование алгоритмов локально-чувствительного хеширования для увеличения полноты выборки при проверке текстовых документов системой поиска совпадений. Рассмотрено несколько известных алгоритмов и сделана теоретическая оценка целесообразности их использования. Сделано описание принципов роботы каждого метода и способ использования в разрабатываемой системе.

Анотація (англ):

The general goal of this article is a review of algorithms of locally sensitive hashing (LSH) and optimization of duplicate matching system by these algorithms. The LSH algorithms differ from cryptographic hash functions, it allows us to get similar output values for similar input data. Briefly describes the features of each algorithm and evaluated the possibility of integration into the system. Locally sensitive hashing can be applied when we’re comparing the shingles from one document to another. The duplicate matching system works in two stages: indexing and search. The indexing is performed for each document only once, so it makes sense to move as much as possible operations at the stage of indexing. The index of document is a set of the hash values from the shingles and information about original position of the elements in a original text document. Therefore, the implementation of LSH is possible without special manufacturing costs. In the second stage (matching) hashes of the elements are compared to full compliance. If we use LSH, the comparing of hashes should be processed using Hamming metric or addition modulo 2 (XOR), so, theoretically, the performance should not be adversely affected. The disadvantage of using LSH is the possibility of false-positive findings, when using of cryptographic hash functions can give results with very high accuracy.

Література:

1. Jaccard, P. Distribution de la flore alpine dans le Bassin des Dranses et dans quelques regions voisines / Jaccard P. // Bull. Soc. Vaudoise sci. Natur. – 1901. – V. 37, Bd. 140. – S. 241–272.

2. Білощицький, А. О. Ефективність методів пошуку збігів у текстах / А.О. Білощицький, О.В. Діхтяренко // Управління розвитком складних систем. – 2013. – № 14.– С. 144 – 147.

3. Алферов, А. П. Основы криптографии / А. П. Алферов, А. Ю. Зубов, А. С. Кузьмин, А. В. Черемушкин // Гелиос АРВ. – М., 2001. – С. 347-348.

4. Левенштейн, В. И. . Двоичные коды с исправлением выпадений, вставок и замещений символов. / В. И. Левенштейн // Доклады Академий наук СССР, 1965. 163.4:845-848.

5. Jesse Kornblum Identifying almost identical ﬁles using context triggered piecewise hashing / Jesse Kornblum // Digital Investigation 3S (2006) S. 91–S97.

6. Karp, Richard M. Efficient randomized pattern-matching algorithms / Karp , Richard M., Rabin, M.O. // IBM Journal of Research and Development , vol.31, no.2, S. 249–260, March 1987 doi: 10.1147/rd.312.0249.

7. Томас Х. Кормен Алгоритмы: построение и анализ. – 2-е изд. / Томас Х. Кормен // М. : Вильямс, 2006. – С. 1296. – ISBN 0-07-013151-1.

8. Anand Rajaraman Mining of Massive Datasets / Anand Rajaraman, Jeffrey David Ullman // Cambridge University Press, 2011.

9. Hamming, Richard W. Error detecting and error correcting codes. / Hamming, Richard W. // Bell System technical journal 29.2 (1950): 147-160.

10. Charikar, Moses S. Similarity estimation techniques from rounding algorithms. / Charikar, Moses S. // Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. ACM, 2002.

11. Sadowski, Caitlin Simhash: Hash-based similarity detection. / Sadowski, Caitlin, Greg Levin. // Technical report, Google, 2007.

References:

1. Jaccard, P. (1912). The distribution of the flora in the alpine zone. 1. New phytologist, 11(2), 37-50.

2. Biloshchytskyi, A. & Dikhtiarenko, O. (2013). The effectiveness of methods for finding matches in texts. Management of complex systems, 14, pp. 144 – 147.

3. Alferov, A. P., Teeth, A. J., Kuzmin, A. C., & Cheremushkin, A. C. (2005). Foundations of cryptography: a textbook. M: Helios ARVs.

4. Levenshtein V. I. (1965) Binary codes with correction for deletions, insertions and substitutions of characters. Reports of USSR Academy of Sciences, 163.4: 845-848.

5. Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital investigation, 3, 91-97.

6. Karp, R. M., & Rabin, M. O. (1987). Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2), 249-260.

7. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms. Cambridge : MIT press, 2, 531-549.

8. Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press.

9. Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System technical journal, 29(2), 147-160.

10. Charikar, M. S. (2002, May). Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing (pp. 380-388). ACM.

11. Sadowski, C., & Levin, G. (2007). Simhash: Hash-based similarity detection. Technical report, Google.