Implementing Data-Parallel Stroke-to-Fill Conversion on Modern GPUs

Table Of Links

GPU IMPLEMENTATION

We applied the techniques outlined in Section 7 to implement a data-parallel algorithm that can convert stroked 2D Bézier paths into flat geometry suitable for GPU rendering. We focused primarily on the global stroke-to-fill conversion problem in order to generate polygonal outlines for rasterization. Since the Euler spiral approximation can fit any source cubic Bézier and its parallel curves, our implementation can be used to render both filled and stroked primitives within a satisfactory error tolerance.

We aimed to satisfy the following the criteria:

(1) The implementation must be able to handle a large number of inputs at real-time rates.

(2) The stroke outlines must meet the strong correctness definition.

(3) The implementation must support the standard SVG stroke end cap (butt, square, round) and join styles (bevel, miter, round).

(4) The CPU must do no work beyond the basic preparation of the input path data for GPU consumption, in order to maximize the exploited parallelism.

One of our artifacts is a GPU compute shader that satisfies these criteria. We coupled this with a simple and efficient data layout for parallel processing. Notably, our implementation is fully dataparallel on input path segments and does not require any computationally expensive curve evaluation on the CPU.

We implemented our shader on general-purpose GPU compute primitives supported by all modern graphics APIs, particularly Metal, Vulkan, D3D12, and WebGPU [2]. As such, the ideas presented in this section are portable to various GPU platforms. The rest of this section describes the design of our GPU pipeline and input encoding scheme.

8.1 Pipeline Design

An SVG path consists of a sequence of instructions (or “verbs”). The lineto and curveto instructions denote an individual path segment consisting of either a straight line or a cubic Bézier. The moveto instruction is used to begin a new subpath and the closepath instruction can be used to create a closed contour (by inserting a line connecting the current endpoint to the start of the subpath). A subpath must form a closed contour if painted as a fill. When

painted as a stroke, each subpath must begin and end with a cap according to the desired cap style. In addition, each path segment in a stroked subpath must be connected to adjacent path segments of the same subpath with a join. For any given path, the cap and join styles do not vary across subpaths or individual path segments. The standard organizational primitive for a compute shader is the workgroup. A workgroup can be organized as a 1, 2, or 3- dimensional grid of individual threads that can cooperate using shared memory. Multiple workgroups can be dispatched as part of a larger global grid, also of 1, 2, or 3 dimensions.

Our compute shader is arranged as a 1-dimensional grid, parallelized over input path segments. We chose a workgroup size of 256 as it works well over a wide range of GPUs. However, this choice of workgroup size is not particularly important for our algorithm as it does not require any cross-thread communication. Each thread in the global grid is assigned to process a single path segment. We encode the individual path segments contiguously in a GPU storage buffer such that each element has 1:1 correspondence to a global thread ID in the dispatch.

This layout easily scales to hundreds of thousands of input path segments, which can be distributed across any number of paths and subpaths. We submit the kernel to the GPU as a single dispatch with as many workgroups as needed to handle the entire input.

Each thread outputs a polyline that fits a subset of the rendered subpath’s outer contour, represented by the path segment (see Figure 13). For a filled primitive, a thread only outputs the flattened approximation of the path segment.

For a stroke, a thread outputs the polylines approximating multiple curves:

a) the two parallel curves on both sides of the source segment, offset by the desired stroke half-width,

b) the joins that connect the path segment to the next adjacent segment, and c) evolute patches in regions of high curvature. We also store metadata in our input encoding that marks the position of a path segment within the containing subpath. We use this information to determine whether to output an end cap instead of a join.

We encode one additional segment designated as the stroke cap marker for every subpath. The thread assigned to the marker segment only outputs a start cap, which completes the stroke outline. The joins between adjacent segments need to form a continuous outline when combined with the segments. This requires that a thread have access to the tangent vector at the end of its assigned segment and the start tangent of the next adjacent segment.

We require that the input segments within a subpath are stored in order which simplifies the access to the adjacent segment down to a buffer read at the next array offset. All input segments first get converted to a cubic Bézier by degree elevation. Our flattening routine (see Algorithm 2) starts by computing the error estimate for geometric Hermite interpolation from the cubic Bézier to an Euler spiral segment. If the error exceeds the desired tolerance threshold, the cubic segment is iteratively subdivided in half and a new error estimate is computed. Once the error satisfies the tolerance, the Euler spiral segment is directly flattened

to a polyline by invoking a subroutine that is parameterized by the stroke half-width (see es_seg_flatten_offset in Algoritm 2). For a regular fill, this parameter is set to 0. For strokes, the sign of the parameter determines the position of the parallel curve relative to the source segment as well as the winding direction of the output lines. The adaptive subdivision in lowering cubic Béziers to Euler spiral segments is traditionally accomplished through recursion; in the case of subdivision there is a recursive call for each of the two subdivisions of the range.

As the execution model of compute shaders does not support recursion, we unroll recursion into an iterative loop. The cookbook technique for such unrolling is to store the stack explicitly as an array, which is potentially expensive (it increases pressure on registers). We exploit the specific nature of this recursion, encoding the entire state of the stack into two scalar values: the size of the range 𝑑𝑡, and the scaled start of the range 𝑡0_𝑢, so that the range is 𝑡0_𝑢 · 𝑑𝑡 ..(𝑡0_𝑢 + 1) · 𝑑𝑡. Initial values are 1 and 0 respectively. Pushing the call stack is represented by halving 𝑑𝑡 and doubling 𝑡0_𝑢, which preserves the start of the range but halves the size.

Accepting an approximation (a leaf in the call stack) is represented by incrementing 𝑡0_𝑢. At this point, the decision of whether to pop the stack or continue is dependent on whether 𝑡0_𝑢 is even or odd, respectively. If even, popping the stack is accomplished by doubling 𝑑𝑡 and halving 𝑡0_𝑢, and this is repeated until 𝑡0_𝑢 becomes odd. The repeated stack popping can be accomplished without iteration by using the “count trailing zeros” intrinsic to determine the number of stack pops 𝑛, then multiplying 𝑑𝑡 by 2 𝑛 , and shifting 𝑡0_𝑢 right by 𝑛 bits.

This procedure is sufficient to generate weakly correct stroke outlines. If the outline contains a cusp, we output an evolute patch (as shown in Figure 9) in order to achieve strong correctness. A selection of renderings produced by our algorithm that demonstrate both strong and weak correctness is shown in Figure 12. The output of the compute shader is a collection of line segments that gets stored in a GPU storage buffer. Every element in the output contains the viewport coordinates of the two endpoints of a single line segment.

Threads reserve storage buffer regions for their output by incrementing an atomic index stored in GPU device memory. A property of the data-parallel structure is that the line segments are unordered. Because the segments are processed in parallel, ordering cannot be guaranteed across path segments. We refer to this output data structure as line soup. An application may want to sort the line soup, depending on the requirements of the chosen rendering algorithm.

For instance, our own renderer employs tile-based scanline rasterization in a compute shader, with a pipeline architecture that resembles cudaraster [14] and MPVG [5]. In this design, the line soup elements are spatially sorted by subsequent compute stages into a grid of 16x16 pixel tiles. The final compute stage of our renderer (called fine rasterizer) operates at tile granularity and computes pixel coverage for filled polygons outlined by the line soup (Figure 14 illustrates the overall structure of our renderer, the details of which are beyond the scope of this paper).

We also implemented a variant that outputs circular arc segments instead of lines, in which the output entries store curvature in addition to the endpoints. For both primitive types, our renderer also outputs a 32-bit path ID used by downstream pipeline stages

to identify the path that a segment belongs to, in order to retrieve the correct paint (solid or gradient color) for rendering.

8.2 Input encoding

The input to our pipeline consists of a sequence of paths, each with an associated 2D affine transformation matrix and a style data structure containing information about the stroke style or fill rule. We encode paths, transforms, and styles as parallel input streams. Unlike a conventional structure-of-arrays arrangement, we permit a one-to-many mapping across elements between certain streams: each transform and style entry can apply to one or more path entries with an array index greater than or equal to transform or style index.

Paths get encoded as two parallel streams: a path tag stream containing an element for every segment in every subpath, and a path data stream of point coordinates. Each segment/verb contributes a variable number of points: 1 for moveto/lineto, 2 for quadratic Béziers, and 3 for cubic Béziers. This variable length stream encoding allows for a highly compact representation of SVG-type content in memory. A GPU thread assigned to a path segment must be able to reference the corresponding element in each stream in order to obtain the right curve coordinates, transform, and style information.

This is done by computing a set of stream offsets for every element in the path tag stream. We compute the offsets on the GPU in a separate compute shader that runs prior to the flattening stage. A path tag consists of 8 bits that store stream offset increments, such that the inclusive prefix sums of all increments for every element in the tag stream yield the stream offsets. We call this structure the tag monoid and have implemented the computation as a parallel prefix scan. Table 1 shows the individual fields of the 8-bit path tag. The offset increments for the transform and style streams can be stored

in a single bit that is set to 1 only if a path tag marks the beginning of a new transform and/or style entry. We also encode an additional path bit in the final segment of all paths, which allows us to access additional streams (such as fill color or gradient parameters) using a per-path offset in later rendering stages. The handling of the subpath bit allows for overlap in coordinates, so that except at subpath boundaries, the first coordinate pair of each segment overlaps with the last coordinate pair of the previous one.

During the CPU-side encoding, subpath boundaries are moved from the beginning of a subpath (moveto) to the end. The encoding process inserts an additional line segment into the tag and coordinate streams to close the subpath if the final point does not coincide with the start point. The subpath end bit is set to 1 for the final segment of every subpath. For strokes, the subpath end segment represents the stroke cap marker defined in Section 8.1. This special segment is assigned a coordinate count of 2, and

the corresponding two points in the path data stream encode the tangent vector of the subpath’s initial segment, used to correctly draw the start cap or the connecting join in a closed contour.

Authors:

This paper is available on arxiv under CC 4.0 license.