Key Moments
Hashing Searching Sketching.
Want to know something specific about what's covered?
We've already dissected every moment. Ask and we will deliver (with timestamps).
Key Moments
Near photorealistic virtual humans cost $1 million each to create, while AI can diagnose better than doctors but its reasoning is unexplainable.
Key Insights
A new hashing technique using two bins instead of one, where items are placed in the smaller bin, can achieve a maximum load per bin of as low as two items, drastically improving space utilization to about 85% compared to traditional methods.
Cuckoo hashing, a related technique, also uses multiple locations per item but typically requires more moves to resolve collisions than the described multi-choice hashing method.
Locality-Sensitive Hashing (LSH) can find approximate nearest neighbors by hashing similar objects to the same buckets, with a probability bias for nearby points to fall into the same cell.
For k-d trees, the probability of finding the nearest neighbor drops exponentially with dimensions (e.g., e^-d/c), making them unreliable in high-dimensional spaces.
A novel LSH approach involves searching for random points in the neighborhood of a query point rather than the query point itself, improving success rates in sparse or high-dimensional data.
The proposed LSH method for nearest neighbor search can achieve space efficiency, storing only a sketch of the object (around 8 bytes) even for millions of data points.
Revolutionizing exact match hashing with multi-choice "balls and bins"
Traditional hashing, often analyzed using the "balls and bins" model where items (balls) are thrown into hash tables (bins), suffers from collisions and space inefficiency. A significant improvement is achieved by a variant where each item is hashed to not one, but two random bins, and then placed into the bin with the current lower load. This 'two-choice' hashing dramatically improves load balancing: while standard random hashing can result in a maximum load of O(log n) items per bin, the two-choice method reduces this significantly. Further enhancing this, by allowing items to be moved between chosen bins during insertion (a form of "cuckoo hashing" or multi-choice hashing), it's possible to maintain a maximum load of just two items per bin with high probability. This enables hash tables with approximately 85% space utilization, eliminating the need for linked lists to resolve collisions and allowing pre-allocation of space for two items per bucket. The analysis reveals that for a random graph of items and bins, a complete binary tree structure (indicating failure to find a lightly loaded bin) is unlikely if the edge-to-vertex ratio is below approximately 1.67, corresponding to this 85% occupancy.
The power of dual-bin selection and item migration
The core idea of placing an item into the less loaded of two randomly chosen bins leads to a much more uniform distribution. This is because the probability of a bin reaching a high load 'i' reduces from exponentially to doubly exponentially. The recursive relationship for the probability 'p_i' of a bin having load 'i' becomes roughly p_{i+1} ≈ p_i^2, leading to a rapid decrease in maximum load as 'i' increases. This technique, inspired by router memory bandwidth optimization, allows for a constant maximum load, significantly improving search times and space efficiency over traditional hashing methods. Insertions and deletions can be managed efficiently, with moves typically occurring in constant expectation or with a logarithmic number of moves with high probability.
Locality-sensitive hashing for approximate nearest neighbor search
In contrast to exact search, nearest neighbor search (NNS) aims to find items similar to a query, even if not identical. This is crucial for tasks like image or document similarity. High-dimensional spaces exacerbate this problem due to the 'curse of dimensionality,' making exact NNS computationally prohibitive. Locality-Sensitive Hashing (LSH) addresses this by designing hash functions where similar items are more likely to hash to the same bucket than dissimilar items. This is achieved by mapping objects (e.g., images, documents) into a high-dimensional vector space and then using hash functions that preserve proximity. A common LSH technique involves using random hyperplanes and periodic shifts to partition space into cubes; nearby points are more likely to fall into the same cube. While the probability of a nearest neighbor pair hashing to the same bucket might be low (e.g., n^(-1/c) for a 'c'-approximate neighbor), amplifying this by using multiple hash tables or a modified search strategy can yield effective NNS.
Enhancing LSH with neighborhood search and sketches
A novel approach proposed improves LSH by deviating from hashing the query point directly. Instead, it involves searching for random points within a small neighborhood (a ball of radius 'r') around the query point. By hashing and checking these perturbed points, the algorithm significantly increases the probability of finding a point within the same LSH bucket as the true nearest neighbor. The number of points to sample is related to the entropy of the nearest neighbor's cell. This 'neighborhood search' strategy can reduce the required number of hash tables or copies, leading to substantial space savings. Furthermore, LSH can operate on compressed representations called 'sketches' of the original objects, reducing memory footprint to as little as 8 bytes per object, making large-scale NNS feasible.
Limitations of k-d trees in high dimensions
Traditional data structures like k-d trees, while effective in low dimensions, struggle with high-dimensional data. K-d trees recursively partition space along different dimensions. However, as the number of dimensions increases, the probability of a query point and its nearest neighbor being separated by partitioning hyperplanes rises sharply. Consequently, the chance of successfully finding the nearest neighbor via a standard k-d tree walk drops exponentially with the dimension 'd' (e.g., e^(-d/c)). In practice, even with moderate dimensions (e.g., d=4), k-d tree searches can fail up to 90% of the time.
Adapting LSH principles to k-d trees
The 'neighborhood search' strategy developed for LSH can also be beneficial when applied to k-d trees. Instead of solely querying the k-d tree with the exact query point, the modified approach involves repeatedly querying the tree with random points sampled from the vicinity of the original query point. By performing multiple searches within the neighborhood, the probability of encountering the true nearest neighbor increases significantly, mitigating the exponential drop-off in success rate seen with standard high-dimensional k-d tree searches. This simple modification can dramatically improve the reliability of k-d tree searches without requiring a complete rewrite of the underlying data structure.
Sketching algorithms for hierarchical data
Beyond unordered sets, sketching techniques can be extended to more complex data structures like trees. While traditional set sketching often uses methods like MinHash to find the smallest hash value among all elements, a generalized approach allows for creating compact sketches of hierarchical data. This involves adapting the hashing and minimum-value selection process to account for the tree structure, enabling similarity comparisons and searches on hierarchical datasets through their compressed representations.
Mentioned in This Episode
●Software & Apps
●Companies
●Organizations
●Concepts
Common Questions
The curse of dimensionality refers to the problem where search becomes exponentially harder as the number of dimensions increases, leading to exponential search time or space requirements for finding exact nearest neighbors.
Topics
Mentioned in this video
More from GoogleTalksArchive
View all 48 summaries
58 minEverything is Miscellaneous
54 minStatistical Aspects of Data Mining (Stats 202) Day 7
45 minKey Phrase Indexing With Controlled Vocabularies
63 minMysteries of the Human Genome
Ask anything from this episode.
Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.
Get Started Free