VCS #1 Tile-based Renderer Documentation

Lingheng Tao
Sep 26, 2024
8 min read

Updated: Oct 3, 2024

This is a blog / documentation for the Assignment 1 for CMU 15-473/673 Visual Computing System. The assignment is about implementing a sort-middle algorithm for a tile-based renderer.

The markdown format of this documentation can be accessed through here: https://hackmd.io/@Lockbrains/vcs1_tilebased_renderer. For a better reading experience you may want to read the blog at this link.

Data Handling

SIMD

In this project, I will pass the coordinates using the Intel SIMD (Single Instruction, Multiple Data) data type, specifically we will be using __m128. It is a data type specific to Intel's SIMD instruction set, particularly used in the Streaming SIMD Extensions (SSE) family of instructions. It represents a 128-bit-sized register that can store and operate on four 32 floating-point values simultaneously.

To handle __m128, we will need to use specific API to do basic arithmetic operations. Specifically in this assignment, we will need the following API's to work around.

#include <xmmintrin.h> 
int main () {

    // addition
    __m128 result_addition = _mm_add_ps(1.0f, 2.0f);
    __m128i result_addition_int = _mm_add_epi32(1,2);

    // subtraction
    __m128 result_subtraction = _mm_sub_ps(1.0f, 2.0f);
    __m128i result_subtraction_int = _mm_sub_epi32(1,2);

    // multiplication
    __m128 result_mul = _mm_mul_ps(1.0f, 2.0f);
    __m128i result_mul_int = _mm_mullo_epi32(1, 2);

    // division
    __m128 result_div = _mm_div_ps(1.0f, 2.0f);

    // set to zero
    __m128 zero = _mm_setzero_ps();

    // set to value
    __m128 one = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);

}

We can also handle comparisons and bit operations using SIMD instructions.

#include <xmmintrin.h> 

int main () {

    // Compare Equal
    __m128 result_equal = _mm_cmpeq_ps(1.0f, 1.0f);
    __m128i result_equal_int = _mm_cmpeq_si128(1,1);

    // Compare Less Than
    __m128 result_lessThan = _mm_cmplt_ps(1.0f, 2.0f);
    __m128i result_lessThan_int = _mm_cmplt_si128(1,2);

    // Compare Greater Than
    __m128 result_greaterThan = _mm_cmpgt_ps(1.0f, 2.0f);
    __m128i result_greaterThan_int = _mm_cmpgt_si128(1,2);

    // Bitwise And
    __m128 result_and = _mm_and_ps(1.0f, 2.0f);
    __m128i result_and_int = _mm_and_si128(1,2);

    // Bitwise Or
    __m128 result_or = _mm_or_ps(1.0f, 2.0f);
    __m128i result_or_int = _mm_or_si128(1,2);
}

N.4 Format

In this assignment, the vertex positions will be stored in a fixed-point representation with 4 bits of subpixel precision, and this is called N.4 format of floating point values - n bits for integer and 4 bits for decimal.

For example, a binary expression 1001.0110 can be understood as:

Integer part: 1001= 9.
Decimal part: 0110 = 6/16 = 0.375

So this number is 9.375.

N.4 format is widely used in GPU to reduce the cost of floating-point computation. Furthermore, one of the most important operations of N.4 format can be round. Using the definition, for a binary notation of a N.4 format floating-point number a, we can use bitwise shift to do floor and ceil operations.

Rasterization

Point-in-Triangle Test

When we want to rasterize the scene, we will make a call to the RasterizeTriangle() function. This function accepts a region of the output image and a triangle as input. It then identify all 2x2 pixel chunks within this screen region that may be at least partially covered by this input triangle.

During this process, we will need to determine if the triangle actually covers samples in this 2x2 pixel block - this is called the triangle coverage test. This part is done with the function TestQuadFragment() -- it accepts the coordinates of four screen points as input, corresponding to the pixels in this pixel block. Its output is a bitmask indicating which of these points lies within the triangle, as shown in the following figure.

Since a __m128 register can save upto 4 integers, we will use two registers for storing the x and y coordinates for the four samples in the quad 2x2 fragments (in N.4 fixed-point format as stated in the comments).

Edge Equations

To test if a pixel is covered by the triangle, we use the Edge Equation. Label the i-th vertex of the triangle as (xi, yi). For a query point q = (x, y), we first calculate

The Edge Equation the i-th edge is computed by

The necessary and sufficient conditions for the point q to pass the inside-triangle test is

i.e. the Edge Equations have to be samely signed. In case a point falls exactly on the edge (i.e. Ei = 0), we check if the edge is owned by the current triangle. We will be able to check this by accessing isOwnerEdge[i]. If isOwnerEdge[i] is true, it means this edge is owned by the triangle, so we consider all the points on this edge in the triangle.

Now, the masks for the edges should be clear.

Rasterization

Now that we have the function TestQuadFrag() which returns a bit mask indicating if the sample for each

fragment is covered, we will be able to rasterize the triangle. We will do this in the RasterizeTriangle()

method.

The way we rasterize a triangle can be summarized as find all the candidate pixels, and test each pixel

for coverage by using TestQuadFrag() we have already written. We will take as input

regionX0, regionY0: the top-left corner coordinate of current working tile.
regionW, regionH: the width and height of the current working tile in pixels.
tri: set up triangle equations, where we keep the coordinates of the corners of the triangle.
triSIMD: all values of tri in SIMD registers.
processQuadFragmentFunc: A function value – for every quad fragment that may generate coverage,
this method should call this function.

Find The Bounding Boxes

In order to accelerate the test, as well as not missing any possible candidates, we first find the bounding box of the triangle to decide which coordinates to look at. By using the bitwise operations for N.4 format floating numbers, which we have discussed in Section 2.2, we will easily be able to do this.

Traverse All Quad Fragments

Now we can start traversing all the possible fragments covered by the triangle by simply apply a for-loop. For each quad fragment we look at, we will save the coordinates of the pixels (bottom-left, bottom-right, top-left, top-right) into the registers. Then, we will call the TestQuadFragment() method to see if this triangle is accepted. We will use the trivial accept condition where all of the quads has to pass the test.

Sort-middle Tile-based Renderer

Now we will start the implementation of a sort-middle tiled parallel renderer. The tiled renderer will have

two phases:

Place the triangles into bins. It will build a data structure where there is one bin per screen tile, and each bin contains a list of triangles that potentially overlap the bin.
Process the bins in parallel, dynamically distributing the next unprocessed bin to the next available core.

The process can be shown as the Figure 2.

Figure 2: Sort-Middle Tiled Parallelization Scheme. The phases are briefly shown in the diagram.

Initialization

In tile-based rendering, we partition the screen into smaller tiles, which we call them bins. Each tile will be responsible for processing the triangles inside. Usually, in order to improve the performance, we will classify triangles by the tiles they covered. The process is called binning.

A intuitive way to keep track of the bins is to use a vector of tiles, where each tile is a vector of projected triangles, let’s call it bins. However, when binning is processed in a multi-thread context, multiple threads may handle triangles simultaneously, and place them into corresponding bins. If they write to the global bins directly, it may result in:

Race conditions. Multiple threads writing into the same data structure – it may cause a data disagreement or even a program crash.
Performance bottleneck. In order to avoid race conditions, we will need to apply locks to ensure the thread safety – but this is almost guaranteed to be slow, as it lowers the parallelism.

Thread Local Bins

To solve the above issues, we introduce the thread local bins, maintained by the data structure, which we call it the threadLocalBins. Now, each thread has its own independent data structure for maintaining bins. When a thread i is binning, it writes data into its own threadLocalBins[i]. By using the thread local bins, we also avoid using locks, which lowers the cost for synchronization. Once every thread finish its own binning, we will need to merge everyone’s local bins into the global bins, and hand over it to the ProcessBins() to handle them.

Steps

The following steps describe the things we need to get done in the BinTriangles().

Initialization. Assume the screen is partitioned into gridWidth × gridHeight tiles.
1. Global bins. It will be a vector of vectors of ProjectedTriangles, with size gridWidth × gridHeight.
2. Thread local bins. It will be a vector of vector of vectors of ProjectedTriangles, because it will keep track of all the local bins structures for all the threads. i.e. threadLocalBins[threadId] return the local bins for the threadId-th thread.
Thread local binning. Each thread will only handle a portion of the triangles in the scene. They will put their own triangles into the threadLocalBins[threadId] according to the triangles’ coverages. Notice that this also indicates that the majority of the bins in a threadLocalBins[threadId] will be empty, because a thread will only handle one or a few tiles. For each thread, the tiles out of their responsibility can be perceived as empty, as shown in Figure 3.
Merge into global bin. After all the threads finish binning, we will merge all the local bins in threadLocalBins into a global bins, where triangles are categorized by tiles. For each tile, we will merge the triangle sublists of all the threads. In these sublists, the triangles live in this tile.
Process the global bin. Finally we traverse the global bins, and handle each triangle in the tiles. Since each tile is handled independently, we will run ProcessBin() on multiple threads.

Figure 3: Thread local bins. A 4-thread renderer is handling the 12 bins in the diagram above. For each

of the thread, it has a thread local bins, which only handles a portion of the triangles in the scene.

Binning Triangles

The next thing we want to do is to process the bins in the global bins. For each tile, we will call a ProcessBin() to rasterize the fragments in the bin. Notice that in a tile-based renderer, parallelism happens among tiles, instead of fragments. Processing within a tile is carried out sequentially. Thus, we will call ShadeFragment() to shade a single quad fragment at a time, rather than call ShadeFragments() as was done on the full fragment buffer in the non-tiled implementation.

Frame Buffer Tiles

Instead of writing directly into the frame buffer, we also update the color and depth into the frame buffer. One advantage of a tiled renderer is that an entire frame-buffer tile will fit in cache. For maximum performance we will write results to tile-major frame buffer structure while processing the tile, and then copy the results back out to the renderer’s frame buffer at the end.

Adaptions

Since the rasterization happens at a granularity of quad fragments, the loop runs on triangles in the tile, and loops over each of the four fragments. We will call the RasterizeTriangle() we wrote in Section 3.2, with a region size specified to be the size of the sub-frame buffer for the current tile.

Here are some technical challenges that we may need to solve:

Since we directly shade the fragments (instead of shading the fragments in the fragment buffer), we will call ShadeFragment() instead of ShadeFragments(). Looking at the interface of ShadeFragment(), we will need to know which thread handles the triangle being processed. Otherwise, we will use the wrong vertex output buffer or index output buffer.
Since we write into tiled frame-buffer, we need to carefully treat the indexing.
Specifically, there is a triId parameter in the interface of ShadeFragment() – this is not the member variable in the ProjectedTriangle class. This is the index of the loop when in the BinTriangles() when we binned this triangle.

To solve these issues, we need to carry and propagate these information. To facilitate the processing, we will maintain a vector of BinnedTriangle structs, where the structure keeps the thread ID, the loop index, and the projected triangle.

Now we will be able to process the entire bins.

Finish

After the processing, we will need to write the results back to the frame buffer.

VCS #1 Tile-based Renderer Documentation

Data Handling