Fast Fourier Transform: Scaling Multi-Point Evaluation

Trigger warning: specialized mathematical topic

Special thanks to Karl Floersch for feedback

One of the more interesting algorithms in number theory is the Fast Fourier transform (FFT). FFTs are a key building block in many algorithms, including extremely fast multiplication of large numbers, multiplication of polynomials, and extremely fast generation and recovery of erasure codes. Erasure codes in particular are highly versatile; in addition to their basic use cases in fault-tolerant data storage and recovery, erasure codes also have more advanced use cases such as securing data availability in scalable blockchains and STARKs. This article will go into what fast Fourier transforms are, and how some of the simpler algorithms for computing them work.

Background

The original Fourier transform is a mathematical operation that is often described as converting data between the "frequency domain" and the "time domain". What this means more precisely is that if you have a piece of data, then running the algorithm would come up with a collection of sine waves with different frequencies and amplitudes that, if you added them together, would approximate the original data. Fourier transforms can be used for such wonderful things as expressing square orbits through epicycles and deriving a set of equations that can draw an elephant.

Ok fine, Fourier transforms also have really important applications in signal processing, quantum mechanics, and other areas, and help make significant parts of the global economy happen. But come on, elephants are cooler.

The kind of Fourier transform we'll be talking about in this post is a similar algorithm, except instead of being a continuous Fourier transform over real or complex numbers, it's a discrete Fourier transform over finite fields (see the "A Modular Math Interlude" section for a refresher on what finite fields are).
Instead of talking about converting between "frequency domain" and "time domain", here we'll talk about two different operations: multi-point polynomial evaluation (evaluating a degree $<N$ polynomial at $N$ different points) and its inverse, polynomial interpolation (given the evaluations of a degree $<N$ polynomial at $N$ different points, recovering the polynomial). For example, if we are operating in the prime field with modulus 5, then the polynomial $y=x^2+3$ (for convenience we can write the coefficients in increasing order: $[3,0,1]$) evaluated at the points $[0,1,2]$ gives the values $[3,4,2]$ (not $[3,4,7]$, because we're operating in a finite field where the numbers wrap around at 5), and we can actually take the evaluations $[3,4,2]$ and the coordinates they were evaluated at ($[0,1,2]$) to recover the original polynomial $[3,0,1]$.

There are algorithms for both multi-point evaluation and interpolation that can do either operation in $O(N^2)$ time. Multi-point evaluation is simple: just separately evaluate the polynomial at each point. Here's python code for doing that:

    def eval_poly_at(poly, x, modulus):
        # poly holds the coefficients in increasing order of degree
        y = 0
        power_of_x = 1
        for coefficient in poly:
            y += power_of_x * coefficient
            power_of_x *= x
        return y % modulus

The algorithm runs a loop going through every coefficient and does one thing for each coefficient, so it runs in $O(N)$ time. Multi-point evaluation involves doing this evaluation at $N$ different points, so the total run time is $O(N^2)$.

Lagrange interpolation is more complicated (search for "Lagrange interpolation" for a more detailed explanation). The key building block of the basic strategy is that for any domain $D$ and point $x$, we can construct a polynomial that returns 1 for $x$ and 0 for any value in $D$ other than $x$.
For example, if $D=[1,2,3,4]$ and $x=1$, the polynomial is:

$$y=\frac{(x-2)(x-3)(x-4)}{(1-2)(1-3)(1-4)}$$

You can mentally plug in 1, 2, 3 and 4 to the above expression and verify that it returns 1 for $x=1$ and 0 in the other three cases.

We can recover the polynomial that gives any desired set of outputs on the given domain by multiplying and adding these polynomials. If we call the above polynomial $P_1$, and the equivalent ones for $x=2$, $x=3$ and $x=4$, $P_2$, $P_3$ and $P_4$, then the polynomial that returns $[3,1,4,1]$ on the domain $[1,2,3,4]$ is simply $3\cdot P_1+P_2+4\cdot P_3+P_4$. Computing the $P_i$ polynomials takes $O(N^2)$ time (you first construct the polynomial that evaluates to 0 on the entire domain, which takes $O(N^2)$ time, then separately divide it by $(x-x_i)$ for each $x_i$), and computing the linear combination takes another $O(N^2)$ time, so it's $O(N^2)$ runtime total. Fast Fourier transforms let us make both multi-point evaluation and interpolation much faster.

Fast Fourier Transforms

There is a price you have to pay for using this much faster algorithm, which is that you cannot choose any arbitrary field and any arbitrary domain. With Lagrange interpolation, you could choose whatever x coordinates and y coordinates you wanted, and whatever field you wanted (you could even do it over plain old real numbers), and you could get a polynomial that passes through them. With an FFT, you have to use a finite field, and the domain must be a multiplicative subgroup of the field (that is, a list of powers of some "generator" value). For example, you could use the finite field of integers modulo 337, and for the domain use $[1,85,148,111,336,252,189,226]$ (that's the powers of 85 in the field, eg. $85^3 \bmod 337=111$; it stops at 226 because the next power of 85 cycles back to 1). Furthermore, the multiplicative subgroup must have size $2^n$ (there are ways to make it work for numbers of the form $2^m\cdot 3^n$ and possibly slightly higher prime powers, but then it gets much more complicated and inefficient).
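As a quick sanity check on the example domain, the sketch below builds the list of powers of a generator and confirms that the powers of 85 modulo 337 form a subgroup of size 8. The helper name `subgroup_domain` is an assumption made for illustration; it is not part of the article's code.

```python
def subgroup_domain(generator, modulus):
    """List the powers of `generator` mod `modulus` until they cycle back to 1.

    Hypothetical helper for illustration only; assumes `generator` is
    invertible mod `modulus`, so that its powers eventually return to 1.
    """
    domain = [1]
    value = generator % modulus
    while value != 1:
        domain.append(value)
        value = value * generator % modulus
    return domain


domain = subgroup_domain(85, 337)
print(domain)       # the powers of 85 modulo 337, starting from 85**0
print(len(domain))  # subgroup size; the FFT needs this to be a power of two
```

The loop stops exactly when the next power would be 1 again, which is the "cycles back to 1" behaviour described above.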
The finite field of integers modulo 59, for example, would not work, because there are only multiplicative subgroups of order 2, 29 and 58; 2 is too small to be interesting, and the factor 29 is far too large to be FFT-friendly. The symmetry that comes from multiplicative groups of size $2^n$ lets us create a recursive algorithm that quite cleverly calculates the results we need from a much smaller amount of work.

To understand the algorithm and why it has a low runtime, it's important to understand the general concept of recursion. A recursive algorithm is an algorithm that has two cases: a "base case" where the input to the algorithm is small enough that you can give the output directly, and the "recursive case" where the required computation consists of some "glue computation" plus one or more uses of the same algorithm on smaller inputs. For example, you might have seen recursive algorithms being used for sorting lists. If you have a list (eg. $[1,8,7,4,5,6,3,2,9]$), then you can sort it using the following procedure:

- If the input has one element, then it's already "sorted", so you can just return the input.
- If the input has more than one element, then separately sort the first half of the list and the second half of the list, and then merge the two sorted sub-lists (call them A and B) as follows. Maintain two counters, apos and bpos, both starting at zero, and maintain an output list, which starts empty. Until either apos or bpos is at the end of the corresponding list, check if A[apos] or B[bpos] is smaller. Whichever is smaller, add that value to the end of the output list, and increase that counter by 1. Once this is done, add the rest of whatever list has not been fully processed to the end of the output list, and return the output list.
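The sorting procedure above can be sketched in Python as follows. This is a minimal illustration rather than code from the article; the function name `merge_sort` is my own, and the counters are named `apos` and `bpos` to match the description.

```python
def merge_sort(values):
    """Sort a list with the recursive split-and-merge procedure."""
    # Base case: a list with zero or one element is already sorted.
    if len(values) <= 1:
        return list(values)
    # Recursive case: sort each half separately...
    mid = len(values) // 2
    a = merge_sort(values[:mid])
    b = merge_sort(values[mid:])
    # ...then "glue" the two sorted halves together in one linear pass.
    out = []
    apos = bpos = 0
    while apos < len(a) and bpos < len(b):
        if a[apos] <= b[bpos]:
            out.append(a[apos])
            apos += 1
        else:
            out.append(b[bpos])
            bpos += 1
    # One list is exhausted; append whatever remains of the other.
    out.extend(a[apos:])
    out.extend(b[bpos:])
    return out


print(merge_sort([1, 8, 7, 4, 5, 6, 3, 2, 9]))
```

The merge loop touches each element of the two sub-lists once, which is where the linear-time "glue" cost comes from.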
Note that the "glue" in the second procedure has runtime $O(N)$: if each of the two sub-lists has $N$ elements, then you need to run through every item in each list once, so it's $O(N)$ computation total. So the algorithm as a whole works by taking a problem of size $N$, and breaking it up into two problems of size $N/2$, plus $O(N)$ of "glue" execution. There is a theorem called the Master Theorem that lets us compute the total runtime of algorithms like this. It has many sub-cases, but in the case where you break up an execution of size $N$ into $k$ sub-cases of size $N/k$ with $O(N)$ glue (as is the case here), the result is that the execution takes time $O(N\cdot\log(N))$.

An FFT works in the same way. We take a problem of size $N$, break it up into two problems of size $N/2$, and do $O(N)$ glue work to combine the smaller solutions into a bigger solution, so we get $O(N\cdot\log(N))$ runtime total, much faster than $O(N^2)$. Here is how we do it. I'll describe first how to use an FFT for multi-point evaluation (ie. for some domain $D$ and polynomial $P$, calculate $P(x)$ for every $x$ in $D$), and it turns out that you can use the same algorithm for interpolation with a minor tweak.

Suppose that we have an FFT where the given domain is the powers of $x$ in some field, where $x^{2^k}=1$ (eg. in the case we introduced above, the domain is the powers of 85 modulo 337, and $85^{2^3}=1$). We have some polynomial, eg. $y=6x^7+2x^6+9x^5+5x^4+x^3+4x^2+x+3$ (we'll write it as $p=[3,1,4,1,5,9,2,6]$). We want to evaluate this polynomial at each point in the domain, ie. at each of the eight powers of 85. Here is what we do. First, we break up the polynomial into two parts, which we'll call evens and odds: $\text{evens}=[3,4,5,2]$ and $\text{odds}=[1,1,9,6]$ (or $\text{evens}=2x^3+5x^2+4x+3$ and $\text{odds}=6x^3+9x^2+x+1$; yes, this is just taking the even-degree coefficients and the odd-degree coefficients). Now, we note a mathematical observation: $p(x)=\text{evens}(x^2)+x\cdot\text{odds}(x^2)$ and $p(-x)=\text{evens}(x^2)-x\cdot\text{odds}(x^2)$ (think about this for yourself and make sure you understand it before going further).
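If you would rather check the observation numerically than on paper, here is a small sketch that tests both identities at every point of the example domain. The helper `eval_poly` and the constant names are assumptions for illustration only. It also prints the distinct squares of the domain elements.

```python
MODULUS = 337
p = [3, 1, 4, 1, 5, 9, 2, 6]            # coefficients in increasing order
domain = [pow(85, i, MODULUS) for i in range(8)]


def eval_poly(poly, x, modulus):
    """Naive evaluation of a polynomial at a single point."""
    return sum(c * pow(x, i, modulus) for i, c in enumerate(poly)) % modulus


evens, odds = p[::2], p[1::2]           # even-degree and odd-degree coefficients

for x in domain:
    x2 = x * x % MODULUS
    e = eval_poly(evens, x2, MODULUS)
    o = eval_poly(odds, x2, MODULUS)
    # p(x) = evens(x^2) + x * odds(x^2)
    assert eval_poly(p, x, MODULUS) == (e + x * o) % MODULUS
    # p(-x) = evens(x^2) - x * odds(x^2)
    assert eval_poly(p, -x % MODULUS, MODULUS) == (e - x * o) % MODULUS

# x and -x share the same square, so only half as many squares appear.
squares = sorted({x * x % MODULUS for x in domain})
print(squares)
```

Only four distinct squares come out of the eight domain elements, because each element and its negation square to the same value.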
Here we have a nice property: evens and odds are both polynomials half the size of $p$, and because $x$ and $-x$ always have the same square, the set of possible values of $x^2$ is only half the size of the original domain (squaring each element of $[1,85,148,111,336,252,189,226]$ gives just $[1,148,336,189]$). Hence we can recursively use an FFT to evaluate evens and odds at every point of this half-sized domain, and then use $p(x)=\text{evens}(x^2)+x\cdot\text{odds}(x^2)$ and $p(-x)=\text{evens}(x^2)-x\cdot\text{odds}(x^2)$ to recover the evaluations of $p$.

The "glue" is relatively easy (and $O(N)$ in runtime): we receive the evaluations of evens and odds as size-$N/2$ lists, so we simply do p[i] = evens_result[i] + domain[i]⋅odds_result[i] and p[N/2+i] = evens_result[i] - domain[i]⋅odds_result[i] for each index i.

Here's the full code:

    def fft(vals, modulus, domain):
        if len(vals) == 1:
            return vals
        L = fft(vals[::2], modulus, domain[::2])
        R = fft(vals[1::2], modulus, domain[::2])
        o = [0 for i in vals]
        for i, (x, y) in enumerate(zip(L, R)):
            y_times_root = y*domain[i]
            o[i] = (x+y_times_root) % modulus
            o[i+len(L)] = (x-y_times_root) % modulus
        return o

We can try running it:

    >>> fft([3,1,4,1,5,9,2,6], 337, [1, 85, 148, 111, 336, 252, 189, 226])
    [31, 70, 109, 74, 334, 181, 232, 4]

And we can check the result; evaluating the polynomial at the position 85, for example, actually does give the result 70. Note that this only works if the domain is "correct"; it needs to be of the form [x**i % modulus for i in range(n)] where $x^n=1$.

An inverse FFT is surprisingly simple:

    def inverse_fft(vals, modulus, domain):
        vals = fft(vals, modulus, domain)
        return [x * modular_inverse(len(vals), modulus) % modulus for x in [vals[0]] + vals[1:][::-1]]

Basically, run the FFT again, but reverse the result (except the first item stays in place) and divide every value by the length of the list.

    >>> domain = [1, 85, 148, 111, 336, 252, 189, 226]
    >>> def modular_inverse(x, n): return pow(x, n - 2, n)
    >>> values = fft([3,1,4,1,5,9,2,6], 337, domain)
    >>> values
    [31, 70, 109, 74, 334, 181, 232, 4]
    >>> inverse_fft(values, 337, domain)
    [3, 1, 4, 1, 5, 9, 2, 6]

Now, what can we use this for? Here's one fun use case: we can use FFTs to multiply numbers very quickly. Suppose we wanted to multiply 1253 by 1895. Here is what we would do.
First, we would convert the problem into one that turns out to be slightly easier: multiply the polynomials $[3,5,2,1]$ by $[5,9,8,1]$ (that's just the digits of the two numbers, 1253 and 1895, in increasing order), and then convert the answer back into a number by doing a single pass to carry over tens digits. We can multiply polynomials with FFTs quickly, because it turns out that if you convert a polynomial into evaluation form (ie. $f(x)$ for every $x$ in some domain $D$), then you can multiply two polynomials simply by multiplying their evaluations. So what we'll do is take the polynomials representing our two numbers in coefficient form, use FFTs to convert them to evaluation form, multiply them pointwise, and convert back (the inputs are padded with zeros to length 8, so that the domain has enough points to determine the degree-6 product):

    >>> p1 = [3,5,2,1,0,0,0,0]
    >>> p2 = [5,9,8,1,0,0,0,0]
    >>> x1 = fft(p1, 337, domain)
    >>> x1
    [11, 161, 256, 10, 336, 100, 83, 78]
    >>> x2 = fft(p2, 337, domain)
    >>> x2
    [23, 43, 170, 242, 3, 313, 161, 96]
    >>> x3 = [(v1 * v2) % 337 for v1, v2 in zip(x1, x2)]
    >>> x3
    [253, 183, 47, 61, 334, 296, 220, 74]
    >>> inverse_fft(x3, 337, domain)
    [15, 52, 79, 66, 30, 10, 1, 0]

This requires three FFTs (each $O(N\cdot\log(N))$ time) and one pointwise multiplication ($O(N)$ time), so it takes $O(N\cdot\log(N))$ time altogether (technically a little bit more than $O(N\cdot\log(N))$, because for very big numbers you would need to replace 337 with a bigger modulus and that would make multiplication harder, but close enough).
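The session above can be wrapped into a single function and compared against the direct double loop. The sketch below is one way to do that, not the article's own code: `poly_mul_fft` and `poly_mul_schoolbook` are hypothetical names, and the modular inverse assumes a prime modulus (Fermat's little theorem). The coefficients of the true product must also stay below the modulus, which holds here because the largest one is 79.

```python
def fft(vals, modulus, domain):
    # Recursive FFT: evaluate vals (coefficient form) at every point of domain.
    if len(vals) == 1:
        return vals
    L = fft(vals[::2], modulus, domain[::2])
    R = fft(vals[1::2], modulus, domain[::2])
    o = [0] * len(vals)
    for i, (x, y) in enumerate(zip(L, R)):
        y_times_root = y * domain[i]
        o[i] = (x + y_times_root) % modulus
        o[i + len(L)] = (x - y_times_root) % modulus
    return o


def inverse_fft(vals, modulus, domain):
    vals = fft(vals, modulus, domain)
    inv_len = pow(len(vals), modulus - 2, modulus)  # assumes modulus is prime
    return [x * inv_len % modulus for x in [vals[0]] + vals[1:][::-1]]


def poly_mul_fft(a, b, modulus, domain):
    """Multiply two polynomials by going through evaluation form."""
    n = len(domain)
    assert len(a) + len(b) - 1 <= n, "domain too small for the product"
    a = a + [0] * (n - len(a))  # pad with zeros up to the domain size
    b = b + [0] * (n - len(b))
    xa, xb = fft(a, modulus, domain), fft(b, modulus, domain)
    pointwise = [u * v % modulus for u, v in zip(xa, xb)]
    return inverse_fft(pointwise, modulus, domain)


def poly_mul_schoolbook(a, b, modulus):
    """Direct O(N^2) multiplication, used here only as a cross-check."""
    out = [0] * (len(a) + len(b) - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i + j] = (out[i + j] + u * v) % modulus
    return out


domain = [pow(85, i, 337) for i in range(8)]
fast = poly_mul_fft([3, 5, 2, 1], [5, 9, 8, 1], 337, domain)
slow = poly_mul_schoolbook([3, 5, 2, 1], [5, 9, 8, 1], 337)
print(fast)
print(slow)
```

The two results agree apart from the trailing zero that comes from padding to the domain size.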
The FFT-based approach is much faster than schoolbook multiplication, which takes $O(N^2)$ time:

         3  5  2  1
       ------------
    5 | 15 25 10  5
    9 |    27 45 18  9
    8 |       24 40 16  8
    1 |           3  5  2  1
       ---------------------
        15 52 79 66 30 10  1

So now we just take the result, and carry the tens digits over (this is a "walk through the list once and do one thing at each point" algorithm so it takes $O(N)$ time):

    [15, 52, 79, 66, 30, 10, 1, 0]
    [ 5, 53, 79, 66, 30, 10, 1, 0]
    [ 5,  3, 84, 66, 30, 10, 1, 0]
    [ 5,  3,  4, 74, 30, 10, 1, 0]
    [ 5,  3,  4,  4, 37, 10, 1, 0]
    [ 5,  3,  4,  4,  7, 13, 1, 0]
    [ 5,  3,  4,  4,  7,  3, 2, 0]

And if we read the digits of the final list from right to left, we get 2374435. Let's check the answer....

    >>> 1253 * 1895
    2374435

Yay! It worked. In practice, on such small inputs, the difference between $O(N\cdot\log(N))$ and $O(N^2)$ isn't that large, so schoolbook multiplication is faster than this FFT-based multiplication process just because the algorithm is simpler, but on large inputs it makes a really big difference.

But FFTs are useful not just for multiplying numbers; as mentioned above, polynomial multiplication and multi-point evaluation are crucially important operations in implementing erasure coding, which is a very important technique for building many kinds of redundant fault-tolerant systems. If you like fault tolerance and you like efficiency, FFTs are your friend.

FFTs and binary fields

Prime fields are not the only kind of finite field out there. Another kind of finite field (really a special case of the more general concept of an extension field, which is kind of like the finite-field equivalent of complex numbers) is the binary field. In a binary field, each element is expressed as a polynomial where all of the coefficients are 0 or 1, eg. $x^3+x+1$. Adding polynomials is done modulo 2, and subtraction is the same as addition (as $-1=1 \bmod 2$). We select some irreducible polynomial as a modulus (eg. $x^4+x+1$; $x^4+1$ would not work because $x^4+1$ can be factored into $(x^2+1)\cdot(x^2+1)$ so it's not "irreducible"); multiplication is done modulo that modulus. For example, in the binary field mod $x^4+x+1$, multiplying $x^2+1$ by $x^3+1$ would give $x^5+x^3+x^2+1$ if you just do the multiplication, but $x^5+x^3+x^2+1=(x^4+x+1)\cdot x+(x^3+x+1)$, so the result is the remainder $x^3+x+1$.

We can express this example as a multiplication table. First multiply $[1,0,0,1]$ (ie. $x^3+1$) by $[1,0,1]$ (ie. $x^2+1$):

        1 0 0 1
       --------
    1 | 1 0 0 1
    0 |   0 0 0 0
    1 |     1 0 0 1
       ------------
        1 0 1 1 0 1

The multiplication result contains an $x^5$ term so we can subtract $(x^4+x+1)\cdot x$:

        1 0 1 1 0 1
      -   1 1 0 0 1    [(x⁴ + x + 1) shifted right by one to reflect being multiplied by x]
       ------------
        1 1 0 1 0 0

And we get the result, $[1,1,0,1]$ (or $x^3+x+1$).

[Figure: Addition and multiplication tables for the binary field mod $x^4+x+1$. Field elements are expressed as integers converted from binary (eg. $x^3+x^2 \to 1100 \to 12$).]

Binary fields are interesting for two reasons. First of all, if you want to erasure-code binary data, then binary fields are really convenient because $N$ bytes of data can be directly encoded as a binary field element, and any binary field elements that you generate by performing computations on it will also be $N$ bytes long. You cannot do this with prime fields because prime fields' size is not exactly a power of two; for example, you could encode every 2 bytes as a number from 0...65536 in the prime field modulo 65537 (which is prime), but if you do an FFT on these values, then the output could contain 65536, which cannot be expressed in two bytes. Second, the fact that addition and subtraction become the same operation, and $1+1=0$, creates some "structure" which leads to some very interesting consequences. One particularly interesting, and useful, oddity of binary fields is the "freshman's dream" theorem: $(x+y)^2=x^2+y^2$ (and the same for exponents 4, 8, 16... basically any power of two).
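To make the arithmetic concrete, here is a minimal sketch of addition and multiplication in the binary field mod $x^4+x+1$. It stores each element as an integer whose bits are the polynomial's coefficients, with the highest degree first (so $x^3+x+1$ is `0b1011`). The names `bf_add`, `bf_mul`, `MODULUS_POLY` and `DEGREE` are illustrative assumptions, not the API of any library.

```python
MODULUS_POLY = 0b10011  # x^4 + x + 1
DEGREE = 4


def bf_add(a, b):
    """Addition (and subtraction) of binary polynomials is bitwise XOR."""
    return a ^ b


def bf_mul(a, b):
    """Carry-less multiplication, then reduction modulo x^4 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    # Cancel every term of degree >= 4, working from the top down.
    for shift in range(product.bit_length() - 1 - DEGREE, -1, -1):
        if product & (1 << (shift + DEGREE)):
            product ^= MODULUS_POLY << shift
    return product


# (x^2 + 1) * (x^3 + 1) in the field
print(bin(bf_mul(0b0101, 0b1001)))

# Freshman's dream: (x + y)^2 == x^2 + y^2 for every pair of elements
dream_holds = all(
    bf_mul(bf_add(x, y), bf_add(x, y)) == bf_add(bf_mul(x, x), bf_mul(y, y))
    for x in range(16)
    for y in range(16)
)
print(dream_holds)
```

The first print reproduces the worked example from the multiplication table, and the second checks the freshman's dream exhaustively over all 16 field elements.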
But if you want to use binary fields for erasure coding, and do so efficiently, then you need to be able to do Fast Fourier transforms over binary fields. But then there is a problem: in a binary field, there are no (nontrivial) multiplicative groups of order $2^n$. This is because the multiplicative group of a binary field with $2^n$ elements has order $2^n-1$, which is odd, so every multiplicative subgroup has odd order too. For example, in the binary field with modulus $x^4+x+1$, if you start calculating successive powers of $x+1$, you cycle back to 1 after 15 steps, not 16. The reason is that the total number of elements in the field is 16, but one of them is zero, and you're never going to reach zero by multiplying any nonzero value by itself in a field, so the powers of $x+1$ cycle through every element but zero, so the cycle length is 15, not 16. So what do we do?

The reason we needed the domain to have the "structure" of a multiplicative group with $2^n$ elements before is that we needed to reduce the size of the domain by a factor of two by squaring each number in it: the domain $[1,85,148,111,336,252,189,226]$ gets reduced to $[1,148,336,189]$ because 1 is the square of both 1 and 336, 148 is the square of both 85 and 252, and so forth. But what if in a binary field there's a different way to halve the size of a domain? It turns out that there is: given a domain $D$ containing $2^k$ values, including zero (technically the domain must be a subspace), we can construct a half-sized new domain $D'$ by taking $x\cdot(x+k)$ for $x$ in $D$ using some specific nonzero $k$ in $D$. Because the original domain is a subspace, since $k$ is in the domain, any $x$ in the domain has a corresponding $x+k$ also in the domain, and the function $f(x)=x\cdot(x+k)$ returns the same value for $x$ and $x+k$ so we get the same kind of two-to-one correspondence that squaring gives us.

So now, how do we do an FFT on top of this? We'll use the same trick, converting a problem with an $N$-sized polynomial and $N$-sized domain into two problems each with an $N/2$-sized polynomial and $N/2$-sized domain, but this time using different equations.
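The halving map is easy to check numerically before building the recursion on it. The sketch below works in the binary field mod $x^4+x+1$, with elements stored as integers, and takes the domain to be the eight-element subspace $\{0,\dots,7\}$ with $k=1$. Both of those choices, and the names `bf_mul` and `halve_domain`, are assumptions made purely for illustration.

```python
MODULUS_POLY = 0b10011  # x^4 + x + 1
DEGREE = 4


def bf_mul(a, b):
    """Multiply two elements of the binary field mod x^4 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    for shift in range(product.bit_length() - 1 - DEGREE, -1, -1):
        if product & (1 << (shift + DEGREE)):
            product ^= MODULUS_POLY << shift
    return product


def halve_domain(domain, k):
    """Map each x to x * (x + k) and keep the distinct results."""
    return sorted({bf_mul(x, x ^ k) for x in domain})


domain = list(range(8))  # the subspace spanned by the integers 1, 2 and 4
k = 1                    # any nonzero element of the domain would do
half = halve_domain(domain, k)
print(half)

# x and x + k always land on the same value, giving the two-to-one map.
pairs_collide = all(bf_mul(x, x ^ k) == bf_mul(x ^ k, x ^ k ^ k) for x in domain)
print(pairs_collide)
```

The reduced domain has half as many elements and is again closed under addition (XOR), so the same halving step can be applied to it in turn.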
We'll convert a polynomial $p$ into two polynomials evens and odds such that $p(x)=\text{evens}(x\cdot(k-x))+x\cdot\text{odds}(x\cdot(k-x))$. Note that for the evens and odds that we find, it will also be true that $p(x+k)=\text{evens}(x\cdot(k-x))+(x+k)\cdot\text{odds}(x\cdot(k-x))$ (remember that in a binary field $k-x$ and $x+k$ are the same element). So we can then recursively do an FFT to evens and odds on the reduced domain [x⋅(k−x) for x in D], and then we use these two formulas to get the answers for two "halves" of the domain, one offset by $k$ from the other.

Converting $p$ into evens and odds as described above turns out to itself be nontrivial. The "naive" algorithm for doing this is itself $O(N^2)$, but it turns out that in a binary field, we can use the fact that $(x^2-kx)^2=x^4-k^2\cdot x^2$, and more generally $(x^2-kx)^{2^i}=x^{2^{i+1}}-k^{2^i}\cdot x^{2^i}$, to create yet another recursive algorithm to do this in $O(N\cdot\log(N))$ time.

And if you want to do an inverse FFT, to do interpolation, then you need to run the steps in the algorithm in reverse order. You can find the complete code for doing this here: https://github.com/ethereum/research/tree/master/binary_fft, and a paper with details on more optimal algorithms here: http://www.math.clemson.edu/~sgao/papers/GM10.pdf

So what do we get from all of this complexity? Well, we can try running the implementation, which features both a "naive" $O(N^2)$ multi-point evaluation and the optimized FFT-based one, and time both.
Here are my results:

    >>> import binary_fft as b
    >>> import time, random
    >>> f = b.BinaryField(1033)
    >>> poly = [random.randrange(1024) for i in range(1024)]
    >>> a = time.time(); x1 = b._simple_ft(f, poly); time.time() - a
    0.5752472877502441
    >>> a = time.time(); x2 = b.fft(f, poly, list(range(1024))); time.time() - a
    0.03820443153381348

And as the size of the polynomial gets larger, the naive implementation (_simple_ft) gets slower much more quickly than the FFT:

    >>> f = b.BinaryField(2053)
    >>> poly = [random.randrange(2048) for i in range(2048)]
    >>> a = time.time(); x1 = b._simple_ft(f, poly); time.time() - a
    2.2243144512176514
    >>> a = time.time(); x2 = b.fft(f, poly, list(range(2048))); time.time() - a
    0.07896280288696289

And voila, we have an efficient, scalable way to multi-point evaluate and interpolate polynomials. If we want to use FFTs to recover erasure-coded data where we are missing some pieces, then algorithms for this also exist, though they are somewhat less efficient than just doing a single FFT. Enjoy!

This article was originally published as "Fast Fourier Transforms".