Authors:
(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);
(2) David Saulpic, Université Paris Cité & CNRS;
(3) Chris Schwiegelshohn, Aarhus University.
2 Preliminaries and Related Work
2.3 Coresets for Database Applications
4 Reducing the Impact of the Spread
4.1 Computing a crude upper-bound
4.2 From Approximate Solution to Reduced Spread
5 Fast Compression in Practice
5.1 Goal and Scope of the Empirical Analysis
5.3 Evaluating Sampling Strategies
5.4 Streaming Setting and 5.5 Takeaways
8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2
8.2 Reduction of k-means to k-median
8.3 Estimation of the Optimal Cost in a Tree
One of the most common use-cases for big-data algorithms is the streaming setting, where one receives input in batches and must maintain a compression that is representative of the dataset. Although there is a wealth of sampling and coreset methods in the streaming paradigm, we require consistency across algorithms and therefore assume a black-box sampling procedure. Since the coreset property is preserved under composition, we utilize the merge-&-reduce strategy originally proposed by [11] and first applied to maintaining clustering coresets in a stream by [40]. The idea is to first partition the input into b blocks and then perform sampling and composition along them until a single compression is obtained. Specifically, we start by obtaining a coreset on each block. Then, combining samples along a complete binary tree, we (1) recursively re-sample from the children until there is at least one coreset for each level[10] in the tree and then (2) concatenate these samples and obtain one final coreset from the composition. Since we are composing coresets of coresets, the errors stack and, in theory, we should require more samples to obtain comparable accuracy.
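To make the pipeline concrete, here is a minimal sketch of the strategy described above, assuming a power-of-two number of blocks and using plain uniform sampling as a stand-in for the black-box compression routine; the function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def sample(points, m, rng):
    # Stand-in for the black-box compression step; any coreset construction
    # with the same signature could be substituted here.
    idx = rng.choice(len(points), size=min(m, len(points)), replace=False)
    return points[idx]

def subtree_coreset(blocks, m, rng):
    # Coreset for a contiguous group of blocks, built by recursively
    # re-sampling from the two children of the corresponding tree node.
    if len(blocks) == 1:
        return sample(blocks[0], m, rng)
    half = len(blocks) // 2
    children = np.vstack([subtree_coreset(blocks[:half], m, rng),
                          subtree_coreset(blocks[half:], m, rng)])
    return sample(children, m, rng)

def merge_and_reduce(blocks, m, rng):
    # Keep at least one coreset per tree level (e.g. [[1], [2], [3, 4], [5, 6, 7, 8]]
    # when b = 8, as in footnote [10]), then concatenate them and compress once more.
    kept = [sample(blocks[0], m, rng), sample(blocks[1], m, rng)]
    lo, width = 2, 2
    while lo < len(blocks):
        kept.append(subtree_coreset(blocks[lo:lo + width], m, rng))
        lo, width = lo + width, width * 2
    return sample(np.vstack(kept), m, rng)

rng = np.random.default_rng(0)
blocks = np.array_split(rng.normal(size=(8_000, 2)), 8)  # b = 8 blocks (assumed a power of two)
print(merge_and_reduce(blocks, m=200, rng=rng).shape)    # -> (200, 2)
```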
Despite this, we see the opposite result for many of our sampling strategies. Surprisingly, Table 5 and Figure 5 show that the accelerated sampling methods all perform at least as well under composition on the artificial datasets and do not suffer significant degradation in accuracy, variance, or runtime on the real datasets. Although this is inconsistent with the prevailing intuition, we must therefore conclude that the accelerated sampling methods are equally feasible in the streaming setting. We suspect that this is due to the non-uniformity imposed by the merge-&-reduce algorithm. To see this, consider uniform sampling on the c-outlier dataset during the final step of the composition, where we are composing the samples corresponding to each layer of the tree. Assume first that our outlier points happened to fall in the first block. Then we have taken a sample of size m from this block and immediately use it in the final composition. Thus, in this case the outliers are more likely to be in our final sample than in the non-streaming setting. In the alternate case, where the outliers are less likely to survive, our expected error is already high and missing the outliers 'more' cannot worsen it.
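The following back-of-the-envelope sketch quantifies this intuition under a simplified model of our own (illustrative assumptions, not the paper's accounting): all c outliers sit in the first block, the block-1 coreset of size m is reused verbatim among the L = log2(b) + 1 per-level coresets, and the final coreset of size m is drawn uniformly from their concatenation. Under these assumptions, a given outlier's chance of reaching the final sample grows by a factor of b / (log2(b) + 1), i.e. roughly 2x for b = 8.

```python
# Illustrative marginal-inclusion calculation; the model and parameter values
# below are assumptions for exposition, not figures from the paper.
from math import log2

def p_direct(n, m):
    # Chance a fixed outlier lands in a uniform sample of size m from all n points.
    return m / n

def p_composed(n, m, b):
    # Chance the same outlier survives merge-&-reduce: first the block-1 coreset
    # (m out of n/b points), then the final re-sampling over L level coresets.
    L = int(log2(b)) + 1            # e.g. 4 level coresets when b = 8
    p_block = m / (n / b)           # survive the coreset of the outliers' block
    p_final = m / (L * m)           # survive the final composition step
    return p_block * p_final

n, m, b = 100_000, 400, 8           # hypothetical dataset size, sample size, block count
print(p_direct(n, m), p_composed(n, m, b),
      p_composed(n, m, b) / p_direct(n, m))   # ratio simplifies to b / (log2(b) + 1)
```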
This paper is available on arXiv under a CC BY 4.0 DEED license.
[10] If there are b = 8 blocks, then there would be coresets corresponding to the blocks [[1], [2], [3, 4], [5, 6, 7, 8]].