| When | Sunday, October 27, 5:30 - 8:00 PM CDT |
| --- | --- |
| Where | TBD |
Organizing Committee
- Vincent Cohen-Addad (Google Research, France)
- Rajesh Jayaram (Google Research, NYC)
Information
The 65th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2024) is hosting a special workshop aimed at bridging the gap between theoretical computer science and real-world applications. This workshop will feature presentations from leading researchers in industry who will showcase compelling fundamental problems encountered in their work. By bringing together academic and industry experts, the workshop seeks to foster collaboration and inspire new research directions with the potential for significant impact. Join us at FOCS 2024 in Chicago, IL, USA from October 27-30 to be a part of this exciting exchange of ideas.
Schedule
This talk will introduce the problems in differentially private analytics and modeling that arise in digital advertising, survey recent results, and describe some open research challenges in the space.
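As a toy illustration of the basic primitive behind such systems, the following sketch (a standard Laplace mechanism, not an algorithm from the talk) privately releases a counting query over hypothetical ad-spend data:

```python
import numpy as np

def dp_count(values, threshold, epsilon, rng=None):
    """Release a differentially private count of how many values exceed
    `threshold`, using the standard Laplace mechanism.

    The true count has sensitivity 1 (adding or removing one user changes
    it by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(v > threshold for v in values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: privately estimate how many users spent > $10.
spends = [0.0, 12.5, 3.2, 41.0, 0.0, 18.9]
print(dp_count(spends, threshold=10.0, epsilon=1.0))
```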
Nearest neighbor search is a fundamental problem with applications in machine learning, information retrieval, computer vision, and recommendation systems. Given a collection P of n points in a metric space, the goal of the (Approximate) Nearest Neighbor problem is to construct a data structure that, given a query point q, quickly identifies a point from P that is (approximately) closest to q. In this work, we consider this task under a "diversity" constraint. Formally, the algorithm is also provided a parameter k in the preprocessing stage, and now, given q, the goal is to report a "diverse" set of k points approximately closest to the query. This problem has several applications, including search, recommendation systems, and ads.
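For concreteness, here is a minimal brute-force baseline (not the talk's algorithm) under one illustrative notion of diversity, greedy max-min selection from a pool of near neighbors; the pool size `candidate_factor * k` is an arbitrary choice for the sketch:

```python
import numpy as np

def diverse_knn_bruteforce(P, q, k, candidate_factor=4):
    """Illustrative baseline: retrieve the candidate_factor * k points
    closest to q, then greedily pick k of them that are far from each
    other (farthest-point / max-min diversity)."""
    P = np.asarray(P, dtype=float)
    dists = np.linalg.norm(P - q, axis=1)
    pool = np.argsort(dists)[: candidate_factor * k]
    chosen = [pool[0]]                      # start from the nearest neighbor
    while len(chosen) < k:
        # For each remaining candidate, its distance to the chosen set.
        rem = [i for i in pool if i not in chosen]
        gaps = [min(np.linalg.norm(P[i] - P[j]) for j in chosen) for i in rem]
        chosen.append(rem[int(np.argmax(gaps))])  # most diverse candidate
    return chosen

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 8))
print(diverse_knn_bruteforce(P, q=rng.normal(size=8), k=5))
```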
We present the first efficient algorithms for diversity-aware search that are based on the popular graph-based methods. For data sets with low intrinsic dimension, our data structures have query time that only depends on k and log ∆, where ∆ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Further, our experiments show that the search time of our algorithms is substantially lower than alternate methods that use standard non-diverse graph-based algorithms. Finally, our algorithm can be generalized to work under several (existing and new) notions of diversity that are all motivated by applications and of potential independent interest.
This is a joint work with Piyush Anand, Piotr Indyk, Ravishankar Krishnaswamy, Vikas Raykar, Kiran Shiragur, and Haike Xu.
Perhaps the biggest breakthrough in information retrieval in the 21st century has been the usage of embedding models. Given a dataset D, which could consist of images, sentences, videos, or other modalities, these models map each datapoint in D to a vector in a high-dimensional Euclidean space, such that similar datapoints (e.g., similar images) are mapped to nearby vectors under the Euclidean distance. This transforms a non-mathematical notion of similarity into a mathematical one, reducing information retrieval to Euclidean Nearest Neighbor Search, which has been studied extensively in theory for over three decades.
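A minimal sketch of this reduction, with a hypothetical `embed` function standing in for a real embedding model:

```python
import zlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a neural encoder); here a
    deterministic pseudo-random featurizer so the pipeline actually runs."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=64)

corpus = ["cat on a mat", "deep learning survey", "chicago travel guide"]
index = np.stack([embed(doc) for doc in corpus])   # one vector per datapoint

query = embed("tutorial on neural networks")
nearest = np.argmin(np.linalg.norm(index - query, axis=1))  # Euclidean NN
print(corpus[nearest])
```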
While this paradigm has been extremely successful, the representation of data by a single vector has limitations. In particular, it must compress all aspects of the data into a single "global" representation. Unfortunately, the actual similarity between two data-points can depend on a complex interaction between multiple "local" features of the data (e.g. words, subregions of an image, subsections of a document, etc.). Beginning with the landmark ColBERT paper (Khattab and Zaharia, SIGIR 2020), this has been addressed by training models to produce multiple embeddings per data-point. To measure the similarity between two sets of vectors A, B, one uses the so-called Chamfer Distance, which is the average distance between each point in A and its nearest point in B. Multi-vector models currently achieve SOTA on many retrieval benchmarks, and are now an extremely popular area of research.
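A direct implementation of this definition (the average nearest-neighbor distance from A into B; note it is asymmetric) looks like:

```python
import numpy as np

def chamfer(A, B):
    """Chamfer distance as defined above: the average, over points a in A,
    of the distance from a to its nearest point in B. Note the asymmetry:
    chamfer(A, B) != chamfer(B, A) in general."""
    # Pairwise distance matrix: O(|A| * |B| * d) time and memory.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return D.min(axis=1).mean()

rng = np.random.default_rng(1)
A, B = rng.normal(size=(128, 16)), rng.normal(size=(256, 16))
print(chamfer(A, B))
```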
However, the usage of the Chamfer Distance raises a host of totally unexplored algorithmic questions. Firstly, the Chamfer Distance is more expensive to compute than the Euclidean distance: computing it naively takes O(n^2 d) time instead of O(d) (where |A| = |B| = n and the points are d-dimensional). Is it possible to compute it faster, even approximately? Furthermore, to be useful for information retrieval, we need fast nearest neighbor search algorithms for Chamfer, but do such algorithms exist? In this talk, we investigate these algorithmic questions, and explore how techniques from TCS can be applied to expand the possibilities of practical information retrieval.
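In low dimensions, one classical way to beat the quadratic baseline is a k-d tree over B; this is not the near-linear high-dimensional algorithm from the paper below, just a standard speedup for intuition (assumes SciPy is available):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_kdtree(A, B):
    """Exact Chamfer distance using a k-d tree over B. Much faster than the
    O(n^2 d) baseline in low dimension; k-d trees degrade as d grows, which
    is why the high-dimensional case needs different techniques."""
    dists, _ = cKDTree(B).query(A)  # nearest-neighbor distance for each a in A
    return dists.mean()

rng = np.random.default_rng(2)
A, B = rng.normal(size=(5000, 3)), rng.normal(size=(5000, 3))
print(chamfer_kdtree(A, B))
```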
Based in part on the papers:
A Near-Linear Time Algorithm for the Chamfer Distance (NeurIPS 2023). Joint with Ainesh Bakshi, Piotr Indyk, Sandeep Silwal, and Erik Waingarten.
MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings (NeurIPS 2024). Joint with Laxman Dhulipala, Majid Hadian, Jason Lee, and Vahab Mirrokni.
Pick up Dinner (25 minutes)
Scaling large models efficiently for faster training and inference is a fundamental challenge. In this talk, we present a number of algorithmic challenges and potential solutions from theory to practice. First, we discuss data efficiency and model efficiency problems that can be formalized as a subset selection problem. For model efficiency, we present sequential attention for feature selection and sparsification [ICLR'23]. For data efficiency, we present a sensitivity sampling technique for improved quality and efficiency of the models [ICML'24]. Furthermore, we discuss the intrinsic quadratic complexity of attention models as well as token generation. We first discuss HyperAttention, a technique to develop linear-time attention algorithms under mild assumptions [ICLR'24]. We then present PolySketchFormer, a technique to bypass the hardness results for sub-quadratic attention by applying sketching of polynomial functions [ICML'24]. Finally, we show how to address the complexity of token generation via clustering techniques.
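To make the kernel-factorization idea concrete, here is a minimal sketch of degree-2 polynomial attention computed in time linear in the sequence length; the actual PolySketchFormer additionally sketches the feature map to avoid the d^2 blowup, which is omitted here:

```python
import numpy as np

def phi(X):
    """Degree-2 polynomial feature map: phi(x) = vec(x x^T), so that
    phi(q) . phi(k) == (q . k)**2 exactly."""
    return np.einsum('ni,nj->nij', X, X).reshape(X.shape[0], -1)

def poly_attention(Q, K, V):
    """Attention with (q.k)^2 weights in O(n d^2 d_v) time instead of
    O(n^2): factor through the feature map and reassociate the products
    so the n x n attention matrix is never formed."""
    Qf, Kf = phi(Q), phi(K)                  # (n, d^2)
    numer = Qf @ (Kf.T @ V)                  # sum_j (q_i . k_j)^2 v_j
    denom = Qf @ Kf.sum(axis=0)              # row-wise normalizers
    return numer / denom[:, None]

rng = np.random.default_rng(3)
n, d = 512, 16
Q, K, V = rng.normal(size=(3, n, d))
out = poly_attention(Q, K, V)

# Sanity check against the quadratic-time reference.
W = (Q @ K.T) ** 2
ref = (W @ V) / W.sum(axis=1, keepdims=True)
print(np.allclose(out, ref))                 # True
```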