A common problem in real-time rendering is ordering draw/compute tasks (henceforth, tasks) in an optimal manner. The order of execution can greatly affect performance, for example by influencing the number of API calls. Much recent development of the OpenGL API has focused on reducing draw calls and “Approaching Zero Driver Overhead” [1]. Likewise, it is common practice to group tasks by GLSL program, FBO and textures.

This article isn’t about indirect rendering or OpenGL per se, but about a better way of grouping tasks in an attempt to reduce the total number of API calls.

Furthermore, with asynchronous compute and multiple parallel command queues on the horizon, we should not regard task dispatching as a serial queue but as a scheduling problem.

# Building the Transition Graph and Transition Functions

If we look at the problem of scheduling tasks while resolving dependencies, minimizing context state changes and doing so in an efficient way, the natural way to approach it is by considering tasks and transitions between tasks as a directed, weighted graph $G = (V, E)$.

For every task we create a vertex, and between every ordered pair of vertices $u, v \in V$ we add a directed edge $u \to v$ iff task $u$ doesn’t depend on task $v$. The edges are “*transition functions*” that represent legal transitions between tasks.

The transition function $\delta(u,v)$ is responsible for setting the GL context state: it resets states and objects set by $u$ and not requested by $v$, and, vice versa, sets states/objects requested by $v$ that differ from what (if anything) $u$ requested.

Generating the transition functions in a graceful way is a software engineering challenge. When generating a transition I employ a virtual context and dispatch the tasks normally; the virtual context keeps a record of all state and object changes, and from that data builds the required δ. This keeps the process transparent to the tasks.
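A minimal sketch of the virtual-context idea (all names here are illustrative, not part of any real API): each task is dispatched once against a recording context, and δ falls out as the diff of two recordings.

```python
class VirtualContext:
    """Records state/object binds instead of issuing real GL calls."""
    def __init__(self):
        self.state = {}  # e.g. {"program": 3, "vao": 7, "texture0": 12}

    def bind(self, slot, obj):
        self.state[slot] = obj


def transition(state_u, state_v):
    """Build delta(u, v): what to reset and what to set between two tasks."""
    # Reset anything u bound that v doesn't request at all.
    resets = {slot for slot in state_u if slot not in state_v}
    # Set anything v requests that differs from (or is missing in) u.
    sets = {slot: obj for slot, obj in state_v.items()
            if state_u.get(slot) != obj}
    return resets, sets


# Record each task once through its own virtual context...
ctx_u, ctx_v = VirtualContext(), VirtualContext()
ctx_u.bind("program", 3); ctx_u.bind("texture0", 12)
ctx_v.bind("program", 5); ctx_v.bind("vao", 7)

resets, sets = transition(ctx_u.state, ctx_v.state)
# resets == {"texture0"}; sets == {"program": 5, "vao": 7}
```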

Once δ is computed, a weight can be assigned to the edge. This involves a lot of heuristics, but generally we know that, in terms of performance, texture and GLSL program rebinds are expensive, buffer binds and state changes are cheap, while VAO and FBO binds are somewhere in between. Applying some kind of “cost” to each operation (e.g. 20 for a GLSL program switch, 1 for a state change, etc.) allows us to compute the final weight of each edge.
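Turning a transition into a weight can be sketched as a simple cost table; the values below are heuristic placeholders in the spirit of the examples above, not measured numbers.

```python
# Illustrative per-operation cost table (heuristic, not measured).
COSTS = {
    "program": 20,   # GLSL program switch: expensive
    "texture": 10,   # texture rebind: expensive
    "fbo": 5,        # FBO bind: in between
    "vao": 5,        # VAO bind: in between
    "buffer": 1,     # buffer bind: cheap
    "state": 1,      # plain state change: cheap
}


def edge_weight(delta_ops):
    """Sum the heuristic cost of every operation the transition performs."""
    return sum(COSTS[op] for op in delta_ops)


# A transition that switches programs and rebinds two textures:
w = edge_weight(["program", "texture", "texture"])
# w == 40
```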

# Ant Colony Optimization

As the transition graph alone clearly isn’t enough to enforce an execution order that respects dependencies, each vertex is also paired with a list of dependencies on other vertices. A transition $u\to v$ can’t be made until all of $v$‘s dependencies have been visited (executed). This variant of the asymmetric travelling salesman problem is known as the sequential ordering problem (SOP). Both optimization problems are closely related and are $\mathcal{NP}$-hard.

While finding the optimal solution is infeasible, good approximation methods exist. Many SOP approximations employ a variant of an Ant Colony Optimization (ACO) algorithm. ACO is a probabilistic algorithm that attempts to mimic real-life ant behavior: individual ants initially forage nearly at random, looking for food. Once an ant finds food, it returns to the nest, leaving a pheromone trail along the path. The more ants choose the same path, the stronger the trail becomes. Ants are attracted to the pheromone trail, increasing the chance that more ants will follow the path. The shorter the path, the more ants travel it, generating a positive feedback loop that usually results in all ants converging on a short path.

If the shortest path suddenly becomes infeasible, the same behavior seen initially can be observed again around the blocked section: ants scatter, breaking off and looking to complete the route through a different path.

It’s important to note that in the opposite scenario, where a new, shorter route suddenly becomes available after the ants have converged, the ants will not choose it, as the pheromone trail of the old path is already too strong, which limits exploration. This will be addressed later.

# Dynamic Sequential Ordering Optimization

Of course, to design a relevant system it is important to remember that $G$ is subject to changes up to every frame as our rendered world is probably not static.

In my system I’ve implemented a method similar to the ACO for SOP with a local optimizer described by Gambardella and Dorigo (2000) [2]. A dummy root node is added to the graph $G$, with edges to all vertices that have no dependencies and edges from all vertices that don’t serve as a prerequisite to any task. The algorithm works by sending virtual ants from the root node, where each ant attempts to complete a legal Hamiltonian cycle. After a solution is built, an optional local optimizer is applied.

At each step, each ant considers all feasible outgoing edges (edges that lead to unvisited vertices whose dependencies have all been fulfilled) and chooses one at random. The probability of choosing vertex $v$, coming from $u$, is

$$P(u \to v) = \frac{\tau(u,v) / w(u,v)}{\sum_{(u,x) \in F(u)} \tau(u,x) / w(u,x)}$$

where $\tau$ is the pheromone trail and $w$ is the edge weight. $F(u)$ is the set of all feasible edges originating from $u$.
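The selection step can be sketched as a roulette wheel where each feasible edge’s attraction is pheromone divided by weight; `deps` maps each vertex to the set of vertices it depends on (all names illustrative).

```python
import random


def choose_next(u, visited, deps, tau, w, vertices):
    """Pick the ant's next vertex, coming from u.

    Feasible vertices are unvisited and have all dependencies visited;
    each is drawn with probability proportional to tau / w.
    """
    feasible = [v for v in vertices
                if v not in visited and deps[v] <= visited]
    if not feasible:
        return None  # dead end: no legal continuation from u
    scores = [tau[(u, v)] / w[(u, v)] for v in feasible]
    return random.choices(feasible, weights=scores)[0]
```

With a single feasible candidate the draw is deterministic, which is also how a dependency chain forces the first steps of every ant.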

To increase exploration and avoid stagnation, ants consume part of the pheromone deposit (a reduction of 5–15%) on every edge they visit. If the resulting route is shorter than the best found to date, all edges of the newly generated route receive a pheromone deposit increase proportional to the inverse of the new path length.

Graph mutations: In case a vertex is deleted from $G$, the current best route remains viable simply by skipping the deleted vertex.

In case of vertex insertion, all the new edges are initialized with the initial pheromone deposit value. To facilitate fast convergence, the pheromone deposits on all geodesics of length 2 from the new vertex can be cleared, but that isn’t necessary thanks to the pheromone consumption mechanism.

I chose not to specifically handle edge modification as it rarely happens in a task transition graph and can be resolved by reinserting the affected vertices anyway.

# Conclusion

The algorithm described above selects a viable, usable execution order very quickly and converges to an efficient one within a few dozen runs. It is incremental and can be run in the background to keep the route up to date with very low overhead. When the graph mutates, the unchanged parts retain their local selection data and assist in converging to a performant route in a speedy manner.

Finally, for upcoming APIs that support multiple parallel command queues, it is simple to extend the mechanism to schedule multiple independent compute tasks.

# References

[1] *OpenGL SuperBible*, 7th edition, chapter 14.

[2] L. M. Gambardella and M. Dorigo, “An Ant Colony System Hybridized with a New Local Search for the Sequential Ordering Problem”, http://people.idsia.ch/~luca/has-sop2000.pdf