Sublinear Algorithms for Big Datasets

Grigory Yaroslavtsev, Brown University, EEUU.

Summary:

Increasingly large datasets are being collected by governments, private companies and research institutions. This motivates increased interest in the design and analysis of algorithms for rigorous analysis of such data. In many cases, even linear space/time algorithms can be too slow. Therefore, there has been growing body of work on "sub-linear algorithms", that use space or time that are sub-linear in the input size. In this course we will cover such algorithms, which can be used for the analysis of distributions, graphs, data streams and high-dimensional real-valued data.

We focus on two types of sublinear algorithms: sub-linear time algorithms, and sketching/streaming algorithms. The former access only a small number of input elements and perform very limited computation; the answer is then inferred using approaches such as random sampling or property testing. The latter extract only a small amount of information about the input (a "sketch") and compute an answer based on that. Manifestation of this approach include data stream algorithms and randomized dimensionality reduction.

This course is partially based on the class taught by Piotr Indyk and Ronitt Rubinfeld at MIT.

Outline:

Lecture 1: Introduction. Testing properties of distributions: uniformity, closeness, monotone distributions.

Lecture 2: Sublinear-time algorithms for graphs I. Dense and sparse model. Testing bipartiteness, approximating max-cut, testing triangle-freeness, estimating average degree.

Lecture 3: Sublinear-time algorithms for graphs II. Testing connectedness, approximating the number of connected components, approximating weight of the minimum spanning tree. Approximating Vertex Cover and distributed algorithms.

Lecture 4: Sketching and streaming algorithms I. Estimating the number of distinct elements, L2-norm, Johnson-Lindenstrauss transform, Bloom filter, MinHash.

Lecture 5: Sketching and streaming algorithms II, testing properties of real-valued data. Heavy hitters, Count-Min. Testing monotonicity and distance approximation. Dimension reduction techniques in property testing.

Bibliography:

Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999)

Broder, Andrei Z. (1997), "On the resemblance and containment of documents", Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, IEEE, pp. 21–29, doi:10.1109/SEQUEN.1997.666900.

Johnson, William B.; Lindenstrauss, Joram (1984), "Extensions of Lipschitz mappings into a Hilbert space", Conference in Modern Analysis and Probability (New Haven, Conn., 1982), Contemporary Mathematics 26, Providence, RI: American Mathematical Society, pp. 189–206, doi:10.1090/conm/026/737400, MR 737400.

Funda Ergün, Sampath Kannan, Ravi Kumar, Ronitt Rubinfeld, Mahesh Viswanathan, Spot-Checkers. Journal of Computer System and Sciences 2000, STOC 1998.

Oded Goldreich, Dana Ron, Property testing in bounded degree graphs. Algorithmica 2002, STOC 1997.

Bernard Chazelle, Ronitt Rubinfeld, Luca Trevisan, Approximating the Minimum Spanning Tree Weight in Sublinear Time. SIAM Journal of Computing 2005, ICALP 2001.

Noga Alon, Michael Krivelevich, Testing k-colorability. SIAM J. Discrete Math. 15 (2002), 211-227.

Noga Alon, Testing subgraphs in large graphs. Random Structures and Algorithms 21 (2002), 359-370.

Noga Alon, Eldar Fischer, Michael Krivelevich, Mario Szegedy, Efficient testing of large graphs, Combinatorica 20 (2000), 451-476.

Oded Goldreich, Dan Ron, Approximating average parameters of graphs, Random Struct. Algorithms 32(4): 473-493 (2008)

Michal Parnas, Dan Ron, Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms, Theor. Comput. Sci., 381(1-3):183--196 (2007)

Huy N. Nguyen and Krzysztof Onak, Constant-Time Approximation Algorithms via Local Improvements, FOCS (2008)

Krzysztof Onak, Dana Ron, Michal Rosen, and Ronitt Rubinfeld, A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size, In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1123--1131 (2012)

Piotr Berman, Sofya Raskhodnikova, and Grigory Yaroslavtsev, Testing properties with respect to Lp distances, In submission to STOC 2014.

Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004: 29-38

 pie-paginas.jpg