Maintaining Coreset Accuracy in Big Data Streaming


Authors:

(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);

(2) David Saulpic, Université Paris Cité & CNRS;

(3) Chris Schwiegelshohn, Aarhus University.

Table of Links

Abstract and 1 Introduction

2 Preliminaries and Related Work

2.1 On Sampling Strategies

2.2 Other Coreset Strategies

2.3 Coresets for Database Applications

2.4 Quadtree Embeddings

3 Fast-Coresets

4 Reducing the Impact of the Spread

4.1 Computing a crude upper-bound

4.2 From Approximate Solution to Reduced Spread

5 Fast Compression in Practice

5.1 Goal and Scope of the Empirical Analysis

5.2 Experimental Setup

5.3 Evaluating Sampling Strategies

5.4 Streaming Setting and 5.5 Takeaways

6 Conclusion

7 Acknowledgements

8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2

8.2 Reduction of k-means to k-median

8.3 Estimation of the Optimal Cost in a Tree

8.4 Extensions to Algorithm 1

References

5.4 Streaming Setting

One of the most common use cases for big-data algorithms is the streaming setting, where one receives the input in batches and must maintain a compression that is representative of the full dataset. Although there is a wealth of sampling and coreset methods in the streaming paradigm, we require consistency across algorithms and therefore assume a black-box sampling procedure. Since the coreset property is preserved under composition, we utilize the merge-&-reduce strategy originally proposed by [11] and first applied to maintaining clustering coresets in a stream by [40]. The idea is to first partition the input into b blocks and then perform sampling and composition along them until a single compression is obtained.


Figure 5: Top: Coreset distortion on the k-means task in the streaming and non-streaming settings (a visualization of the data in Table 5). Bottom: Coreset construction runtimes in the streaming and non-streaming settings for the linear- and sub-linear-complexity coreset algorithms. Bars are ordered [Streaming, Non-Streaming].


Specifically, we start by obtaining a coreset on each block. Then, combining samples using a complete binary tree, we (1) recursively re-sample from the children until there is at least one coreset for each level[10] in the tree and then (2) concatenate these samples and obtain one final coreset from the composition. Since we are composing coresets from coresets, the errors stack and, in theory, we should require more samples to obtain similar accuracy.
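
To make the procedure concrete, here is a minimal Python sketch of the merge-&-reduce loop, using a uniform subsample as a stand-in for the black-box coreset routine (the function names and sizes are ours and purely illustrative; this eager variant keeps at most one coreset per level at any time):

```python
import numpy as np

def compress(points, m, rng):
    """Black-box coreset stand-in: uniformly subsample m points.
    Any of the sampling strategies evaluated above could be used instead."""
    if len(points) <= m:
        return points
    return points[rng.choice(len(points), size=m, replace=False)]

def merge_and_reduce(blocks, m, rng):
    """Stream over blocks, keeping at most one coreset per tree level.
    Two coresets at the same level are concatenated (merge) and
    re-compressed (reduce), promoting the result one level up."""
    levels = {}                                # level -> pending coreset
    for block in blocks:
        node, lvl = compress(block, m, rng), 0
        while lvl in levels:                   # sibling exists: merge & reduce
            node = compress(np.vstack([levels.pop(lvl), node]), m, rng)
            lvl += 1
        levels[lvl] = node
    # Final composition: concatenate the surviving per-level coresets
    # and compress once more into a single summary.
    return compress(np.vstack(list(levels.values())), m, rng)

rng = np.random.default_rng(0)
stream = np.array_split(rng.normal(size=(100_000, 2)), 16)   # 16 blocks
print(merge_and_reduce(stream, m=1_000, rng=rng).shape)      # (1000, 2)
```

The tree structure, and hence the way errors stack, is independent of which compressor is plugged into compress.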


Despite this, we see the opposite result for many of our sampling strategies. Surprisingly, Table 5 and Figure 5 show that the accelerated sampling methods all perform at least as well under composition on the artificial datasets and do not suffer significant drops in accuracy, variance, or runtime on the real datasets. Although inconsistent with the prevailing intuition, we must therefore conclude that the accelerated sampling methods are equally feasible in the streaming setting. We suspect that this is due to the non-uniformity imposed by the merge-&-reduce algorithm. To see this, consider uniform sampling on the c-outlier dataset during the final step of the composition, where we compose the samples corresponding to each layer of the tree. Assume first that our outlier points happened to fall in the first block. Then we have taken a sample of size m from this block alone and use it directly in the final composition. Thus, in this case the outliers are more likely to appear in our final sample than in the non-streaming setting. In the alternate case, where sampling the outliers is less likely, our expected error is already high and missing the outliers 'more' cannot worsen it.
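
As a back-of-the-envelope check on this argument (our own toy simulation with made-up sizes, not one of the paper's experiments), compare the expected number of outliers that survive a uniform sample of size m drawn from the whole dataset against one drawn from just the first block, as happens at the first reduce step of the stream:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, m = 80_000, 8, 500           # points, blocks, sample size (toy values)
outliers = set(range(5))           # 5 outliers, all landing in the first block

def kept(sample):                  # outliers surviving a sample of point ids
    return len(outliers & set(sample))

# Non-streaming: one uniform sample of size m from all n points;
# each outlier is kept with probability m/n.
flat = np.mean([kept(rng.choice(n, m, replace=False)) for _ in range(2000)])

# Streaming, first reduce step: a sample of size m from the first block of
# n/b points, so each outlier survives with probability m*b/n instead.
blocked = np.mean([kept(rng.choice(n // b, m, replace=False)) for _ in range(2000)])

print(f"outliers kept, non-streaming: {flat:.3f}")     # ~ 5*m/n   = 0.031
print(f"outliers kept, streaming:     {blocked:.3f}")  # ~ 5*m*b/n = 0.250
```

Later merge steps re-compress the block sample, but the b-fold boost in the outliers' survival probability at the first level is precisely the non-uniformity described above.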


This paper is available on arXiv under a CC BY 4.0 DEED license.


[10] If there are b = 8 blocks, then there would be coresets corresponding to the blocks [[1], [2], [3, 4], [5, 6, 7, 8]].