IC Colloquium: The Power of Theory in the Practice of Hashing with Focus on Similarity Estimation

Event details
Date | 27.09.2018 |
Hour | 16:15 › 17:30 |
Location | |
Category | Conferences - Seminars |
By: Mikkel Thorup - University of Copenhagen
Video of his talk
Abstract:
Hash functions have become ubiquitous tools in modern data analysis, e.g., the construction of small randomized sketches of large data streams. We like to think of abstract hash functions, assigning independent uniformly random hash values to keys, but in practice, we have to choose a hash function that only has an element of randomness, e.g., 2-independence. While this works for sufficiently random input, the real world has structured data where such simple hash functions fail, calling for the need of more powerful hash functions. In this talk, we focus hashing for set similarity, which is an integral component in the analysis of Big Data. The basic idea is to use the same hash function to do coordinated sampling from different sets. Depending on the context, we want subsets sampled without replacement, or fixed-length vectors of samples that may be with replacement. The latter is used as input to support vector machines (SVMs) and locality sensitive hashing (LSH). The most efficient constructions require very powerful hash functions that are also needed for efficient size estimation. We discuss the interplay between the hash functions and the algorithms using them. Finally, we present experiments on both real and synthetic data where standard 2-independent hashing yield systematically poor similarity estimates, while the right theoretical choice is sharply concentrated, and faster than standard cryptographic hash functions with no proven guarantees.
Bio:
Mikkel Thorup (born 1965) has a D.Phil. from Oxford University from 1993. From 1993 to 1998 he was at the University of Copenhagen. From 1998 to 2013 he was at AT&T Labs-Research. Since 2013 he has been back as Professor at the University of Copenhagen and is now Head of Center for Basic Algorithms Research Copenhagen (BARC) supported by the VILLUM Foundation.
Mikkel Thorup is a Fellow of the ACM and of AT&T, and a Member of the Royal Danish Academy of Sciences and Letters. He is co-winner of the 2011 MAA Robbins Award and winner of the 2015 Villum Kann Rasmussen Award for Technical and Scientific Research, which is Denmark's biggest individual prize for research. His main work is in algorithms and data structures. One of his best-known early results is a linear-time algorithm for the single-source shortest paths problem in undirected graphs. Recently one of his main focusses has been on hash functions unifying theory and practice.
More information
Video of his talk
Abstract:
Hash functions have become ubiquitous tools in modern data analysis, e.g., the construction of small randomized sketches of large data streams. We like to think of abstract hash functions, assigning independent uniformly random hash values to keys, but in practice, we have to choose a hash function that only has an element of randomness, e.g., 2-independence. While this works for sufficiently random input, the real world has structured data where such simple hash functions fail, calling for the need of more powerful hash functions. In this talk, we focus hashing for set similarity, which is an integral component in the analysis of Big Data. The basic idea is to use the same hash function to do coordinated sampling from different sets. Depending on the context, we want subsets sampled without replacement, or fixed-length vectors of samples that may be with replacement. The latter is used as input to support vector machines (SVMs) and locality sensitive hashing (LSH). The most efficient constructions require very powerful hash functions that are also needed for efficient size estimation. We discuss the interplay between the hash functions and the algorithms using them. Finally, we present experiments on both real and synthetic data where standard 2-independent hashing yield systematically poor similarity estimates, while the right theoretical choice is sharply concentrated, and faster than standard cryptographic hash functions with no proven guarantees.
Bio:
Mikkel Thorup (born 1965) has a D.Phil. from Oxford University from 1993. From 1993 to 1998 he was at the University of Copenhagen. From 1998 to 2013 he was at AT&T Labs-Research. Since 2013 he has been back as Professor at the University of Copenhagen and is now Head of Center for Basic Algorithms Research Copenhagen (BARC) supported by the VILLUM Foundation.
Mikkel Thorup is a Fellow of the ACM and of AT&T, and a Member of the Royal Danish Academy of Sciences and Letters. He is co-winner of the 2011 MAA Robbins Award and winner of the 2015 Villum Kann Rasmussen Award for Technical and Scientific Research, which is Denmark's biggest individual prize for research. His main work is in algorithms and data structures. One of his best-known early results is a linear-time algorithm for the single-source shortest paths problem in undirected graphs. Recently one of his main focusses has been on hash functions unifying theory and practice.
More information
Practical information
- General public
- Free
- This event is internal
Contact
- Host : Michael Kapralov