Controlling and Monitoring the Pipeline - In Depth - OpenGL SuperBible: Comprehensive Tutorial and Reference, Sixth Edition (2013)

OpenGL SuperBible: Comprehensive Tutorial and Reference, Sixth Edition (2013)

Part II: In Depth

Chapter 11. Controlling and Monitoring the Pipeline

What You’ll Learn in This Chapter

• How to ask OpenGL about the progress of your commands down the graphics pipeline

• How to measure the time taken for your commands to execute

• How to synchronize your application with OpenGL and how to synchronize multiple OpenGL contexts with each other

This chapter is about the OpenGL pipeline and how it executes your commands. As your application makes OpenGL function calls, work is placed in the OpenGL pipeline and makes its way down it one stage at a time. This takes time, and you can measure that. This allows you to tune your application’s complexity to match the performance of the graphics system and to measure and control latency, which is important for real-time applications. You’ll also learn how to synchronize your application’s execution to that of OpenGL commands you’ve issued and even how to synchronize multiple OpenGL contexts with each other.

Queries

Queries are a mechanism to ask OpenGL what’s happening in the graphics pipeline. There’s plenty of information that OpenGL can tell you; you just need to know what to ask — and how to ask the question.

Remember way back to your early days in school. The teacher wanted you to raise your hand before asking a question. This was almost like reserving your place in line for asking the question — the teacher didn’t know yet what your question was going to be, but she knew that you had something to ask. OpenGL is similar. Before we can ask a question, we have to reserve a spot so that OpenGL knows that the question is coming. Questions in OpenGL are represented by query objects, and much like any other object in OpenGL, query objects must be reserved, or generated. To do this, call glGenQueries(), passing it the number of queries you want to reserve and the address of a variable (or array) where you would like the names of the query objects to be placed:

void glGenQueries(GLsizei n,
GLuint *ids);

The function reserves some query objects for you and gives you their names so that you can refer to them later. You can generate as many query objects you need in one go:

GLuint one_query;
GLuint ten_queries[10];
glGenQueries(1, &one_query);
glGenQueries(10, ten_queries);

In this example, the first call to glGenQueries() generates a single query object and returns its name in the variable one_query. The second call to glGenQueries() generates 10 query objects and returns 10 names in the array ten_queries. In total, 11 query objects have been created, and OpenGL has reserved 11 unique names to represent them. It is very unlikely, but still possible that OpenGL will not be able to create a query for you, and in this case it returns zero as the name of the query. A well-written application always checks that glGenQueries() returns a non-zero value for the name of each requested query object. If there is a failure, OpenGL keeps track of the reason, and you can find that out by calling glGetError().

Each query object reserves a small but measurable amount of resources from OpenGL. These resources must be returned to OpenGL because if they are not, OpenGL may run out of space for queries and fail to generate more for the application later. To return the resources to OpenGL, callglDeleteQueries():

void glDeleteQueries(GLsizei n,
const GLuint *ids);

This works similarly to glGenQueries() — it takes the number of query objects to delete and the address of a variable or array holding their names:

glDeleteQueries(10, ten_queries);
glDeleteQueries(1, &one_query);

After the queries are deleted, they are essentially gone for good. The names of the queries can’t be used again unless they are given back to you by another call to glGenQueries().

Occlusion Queries

Once you’ve reserved your spot using glGenQueries(), you can ask a question. OpenGL doesn’t automatically keep track of the number of pixels it has drawn. It has to count, and it must be told when to start counting. To do this, use glBeginQuery(). The glBeginQuery() function takes two parameters: The first is the question you’d like to ask, and the second is the name of the query object that you reserved earlier:

glBeginQuery(GL_SAMPLES_PASSED, one_query);

GL_SAMPLES_PASSED represents the question you’re asking: “How many samples passed the depth test?” Here, OpenGL counts samples because you might be rendering to a multi-sampled display format, and in that case, there could be more than one sample per pixel. In the case of a normal, single-sampled format, there is one sample per pixel and therefore a one-to-one mapping of samples to pixels. Every time a sample makes it past the depth test (meaning that it hadn’t previously been discarded by the fragment shader), OpenGL counts one. It adds up all the samples from all the rendering it is doing and stores the answer in part of the space reserved for the query object. A query object that counts samples that might end up visible (because they passed the depth test) is known as an occlusion query.

Now OpenGL is counting samples, you can render as normal, and OpenGL keeps track of all the samples generated as a result. Anything that you render is counted toward the total — even samples that have no contribution to the final image due to blending or being covered by later samples, for example. When you want OpenGL to add up everything rendered since you told it to start counting, you tell it to stop by calling glEndQuery():

glEndQuery(GL_SAMPLES_PASSED);

This tells OpenGL to stop counting samples that have passed the depth test and made it through the fragment shader without being discarded. All the samples generated by all the drawing commands between the call to glBeginQuery() and glEndQuery() are added up.

Retrieving Query Results

Now that the pixels produced by your drawing commands have been counted, you need to retrieve them from OpenGL. This is accomplished by calling

glGetQueryObjectuiv(the_query, GL_QUERY_RESULT, &result);

Here, the_query is the name of the query object that’s being used to count samples, and result is the variable that you want OpenGL to write the result into (notice that we pass the address of the variable). This instructs OpenGL to place the count associated with the query object into your variable. If no pixels were produced as a result of the drawing commands between the last call to glBeginQuery() and glEndQuery() for the query object, the result will be zero. If anything actually made it to the end of the fragment shader without being discarded, the result will contain the number of samples that got that far. By rendering an object between a call to glBeginQuery() and glEndQuery() and then checking if the result is zero or not, you can determine whether the object is visible.

Because OpenGL operates as a pipeline, it may have many drawing commands queued up back-to-back waiting to be processed. It could be the case that not all of the drawing commands issued before the last call to glEndQuery() have finished producing pixels. In fact, some may not have even started to be executed. In that case, glGetQueryObjectuiv() causes OpenGL to wait until everything between glBeginQuery() and glEndQuery() has been rendered, and it is ready to return an accurate count. If you’re planning to use a query object as a performance optimization, this is certainly not what you want. All these short delays could add up and eventually slow down your application! The good news is that it’s possible to ask OpenGL if it’s finished rendering anything that might affect the result of the query and therefore has a result available for you. To do this, call

glGetQueryObjectuiv(the_query, GL_QUERY_RESULT_AVAILABLE, &result);

If the result of the query object is not immediately available and trying to retrieve it would cause your application to have to wait for OpenGL to finish what it is working on, the result becomes GL_FALSE. If OpenGL is ready and has your answer, the result becomes GL_TRUE. This tells you that retrieving the result from OpenGL will not cause any delays. Now you can do useful work while you wait for OpenGL to be ready to give you your pixel count, or you can make decisions based on whether the result is available to you. For example, if you would have skipped rendering something had the result been zero, you could choose to just go ahead and render it anyway rather than waiting for the result of the query.

Using the Results of a Query

Now that you have this information, what will you do with it? A very common use for occlusion queries is to optimize an application’s performance by avoiding unnecessary work. Consider an object that has a very detailed appearance. The object has many triangles and possibly a complex fragment shader with a lot of texture lookups and intensive math operations. Perhaps there are many vertex attributes and textures, and there’s a lot of work for the application to do just to get ready to draw the object. The object is very expensive to render. It’s also possible that the object may never end up being visible in the scene. Perhaps it’s covered by something else. Perhaps it’s off the screen altogether. It would be good to know this up front and just not draw it at all if it’s never going to be seen by the user anyway.

Occlusion queries are a good way to do this. Take your complex, expensive object and produce a much lower fidelity version of it. Usually, a simple bounding box will do. Start an occlusion query, render the bounding box, and then end the occlusion query and retrieve the result. If no part of the object’s bounding box produces any pixels, then the more detailed version of the object will not be visible, and it doesn’t need to be sent to OpenGL.

Of course, you probably don’t actually want the bounding box to be visible in the final scene. There are a number of ways you can make sure that OpenGL doesn’t actually draw the bounding box. The easiest way is probably to use glColorMask() to turn off writes to the color buffer by passing GL_FALSE for all parameters. You could also call glDrawBuffer() to set the current draw buffer to GL_NONE. Whichever method you choose, don’t forget to turn framebuffer writes back on again afterwards!

Listing 11.1 shows a simple example of how to use glGetQueryObjectuiv() to retrieve the result from a query object.


glBeginQuery(GL_SAMPLES_PASSED, the_query);
RenderSimplifiedObject(object);
glEndQuery(GL_SAMPLES_PASSED);
glGetQueryObjectuiv(the_query, GL_QUERY_RESULT, &the_result);
if (the_result != 0)
RenderRealObject(object);


Listing 11.1: Getting the result from a query object

RenderSimplifiedObject is a function that renders the low-fidelity version of the object, and RenderRealObject renders the object with all of its detail. Now, RenderRealObject only gets called if at least one pixel is produced by RenderSimplifiedObject. Remember that the call to glGetQueryObjectuiv causes your application to have to wait if the result of the query is not ready yet. This is likely if the rendering done by RenderSimplifiedObject is simple — which is the point of this example. If all you want to know is whether it’s safe to skip rendering something, you can find out if the query result is available and render the more complex object if the result is either unavailable (i.e., the object may be visible or hidden), or if the object result is available and nonzero (i.e., the object is certainly visible). Listing 11.2 demonstrates how you might determine whether a query object result is ready before you ask for the actual count, allowing you to make decisions based on both the availability and the value of a query result.


GLuint the_result = 0;

glBeginQuery(GL_SAMPLES_PASSED, the_query);
RenderSimplifiedObject(object);
glEndQuery(GL_SAMPLES_PASSED);

glGetQueryObjectuiv(the_query, GL_QUERY_RESULT_AVAILABLE, &the_result);

if (the_result != 0)
glGetQueryObjectuiv(the_query, GL_QUERY_RESULT, &the_result);
else
the_result = 1;

if (the_result != 0)
RenderRealObject(object);


Listing 11.2: Figuring out if occlusion query results are ready

In this new example, we determine whether the result is available and if so, retrieve it from OpenGL. If it’s not available, we put a count of one into the result so that the complex version of the object will be rendered.

It is possible to have multiple occlusion queries in the graphics pipeline at the same time so long as they don’t overlap. Using multiple query objects is another way for the application to avoid having to wait for OpenGL. OpenGL can only count and add up results into one query object at a time, but it can manage several query objects and perform many queries back-to-back. We can expand our example to render multiple objects with multiple occlusion queries. If we had an array of ten objects to render, each with a simplified representation, we might rewrite the example provided as follows in Listing 11.3.


int n;

for (n = 0; n < 10; n++)
{
glBeginQuery(GL_SAMPLES_PASSSED, ten_queries[n]);
RenderSimplifiedObject(&object[n]);
glEndQuery(GL_SAMPLES_PASSED);
}

for (n = 0; n < 10; n++)
{
glGetQueryObjectuiv(ten_queries[n], GL_QUERY_RESULT, &the_result);
if (the_result != 0)
RenderRealObject(&object[n]);
}


Listing 11.3: Simple, application-side conditional rendering

As discussed earlier, OpenGL is modeled as a pipeline and can have many things going on at the same time. If you draw something simple such as a bounding box, it’s likely that won’t have reached the end of the pipeline by the time you need the result of your query. This means that when you call glGetQueryObjectuiv(), your application may have to wait a while for OpenGL to finish working on your bounding box before it can give you the answer and you can act on it.

In our next example, we render ten bounding boxes before we ask for the result of the first query. This means that OpenGL’s pipeline can be filled, and it can have a lot of work to do and is therefore much more likely to have finished working on the first bounding box before we ask for the result of the first query. In short, the more time we give OpenGL to finish working on what we’ve asked it for, the more likely it is that it’ll have the result of your query and the less likely it is that your application will have to wait for results. Some complex applications take this to the extreme and use the results of queries from the previous frame to make decisions about the new frame.

Finally, putting both techniques together into a single example, we have the code shown in Listing 11.4.


int n;

for (n = 0; n < 10; n++)
{
glBeginQuery(GL_SAMPLES_PASSSED, ten_queries[n]);
RenderSimplifiedObject(&object[n]);
glEndQuery(GL_SAMPLES_PASSED);
}

for (n = 0; n < 10; n+)
{
glGetQueryObjectuiv(ten_queries[n],
GL_QUERY_RESULT_AVAILABLE,
&the_result);
if (the_result != 0)
glGetQueryObjectuiv(ten_queries[n],
GL_QUERY_RESULT,
&the_result);
else
the_result = 1;
if (the_result != 0)
RenderRealObject(&object[n]);
}


Listing 11.4: Rendering when query results aren’t available

Because the amount of work sent to OpenGL by RenderRealObject is much greater than by RenderSimplifiedObject, by the time we ask for the result of the second, third, fourth, and additional query objects, more and more work has been sent into the OpenGL pipeline, and it becomes more likely that our query results are ready. Within reason, the more complex our scene, and the more query objects we use, the more likely we are to see positive a performance impact.

Getting OpenGL to Make Decisions for You

The preceding examples show how you can ask OpenGL to count pixels and how to get the result back from OpenGL into your application so that it can make decisions about what to do next. However, in this application, we don’t really care about the actual value of the result. We’re only using it to decide whether to send more work to OpenGL or to make other changes to the way it might render things. The results have to be sent back from OpenGL to the application, perhaps over a CPU bus or even a network connection when you’re using a remote rendering system, just so the application can decide whether to send more commands to OpenGL. This causes latency and can hurt performance, sometimes outweighing any potential benefits to using the queries in the first place.

What would be much better is if we could send all the rendering commands to OpenGL and tell it to obey them only if the result of a query object says it should. This is called predication, and fortunately, it is possible through a technique called conditional rendering. Conditional rendering allows you to wrap up a sequence of OpenGL drawing commands and send them to OpenGL along with a query object and a message that says “ignore all of this if the result stored in the query object is zero.” To mark the start of this sequence of calls, use

glBeginConditionalRender(the_query, GL_QUERY_WAIT);

and to mark the end of the sequence, use

glEndConditionalRender();

Any drawing command, including functions like glDrawArrays(), glClearBufferfv(), and glDispatchCompute() that is called between glBeginConditionalRender() and glEndConditionalRender() is ignored if the result of the query object (the same value that you could have retrieved using glGetQueryObjectuiv()) is zero. This means that the actual result of the query doesn’t have to be sent back to your application. The graphics hardware can make the decision as to whether to render for you. Keep in mind, though, that state changes such as binding textures, turning blending on or off, and so on are still executed by OpenGL — only rendering commands are discarded. To modify the previous example to use conditional rendering, we could use the code in Listing 11.5.


// Ask OpenGL to count the samples rendered between the start
// and end of the occlusion query
glBeginQuery(GL_SAMPLES_PASSED, the_query);
RenderSimplifiedObject(object);
glEndQuery(GL_SAMPLES_PASSED);

// Only obey the next few commands if the occlusion query says something
// was rendered
glBeginConditionalRender(the_query, GL_QUERY_WAIT);
RenderRealObject(object);
glEndConditionalRender();


Listing 11.5: Basic conditional rendering example

The two functions, RenderSimplifiedObject and RenderRealObject, are functions within our hypothetical example application that render simplified (perhaps just the bounding box, for example) and more complex versions of the object, respectively. Notice now that we never callglGetQueryObjectuiv(), and we never read any information (such as the result of the query object) back from OpenGL.

The astute reader will have noticed the GL_QUERY_WAIT parameter passed to glBeginConditionalRender(). You may be wondering what that’s for — after all, the application doesn’t have to wait for results to be ready any more. As mentioned earlier, OpenGL operates as a pipeline, which means that it may not have finished dealing with RenderSimplifiedObject before your call to glBeginConditionalRender() or before the first drawing function called from RenderRealObject reaches the beginning of the pipeline. In this case, OpenGL can either wait for everything called fromRenderSimplifiedObject to reach the end of the pipeline before deciding whether to obey the commands sent by the application, or it can go ahead and start working on RenderRealObject if the results aren’t ready in time. To tell OpenGL not to wait and to just go ahead and start rendering if the results aren’t available, call

glBeginConditionalRender(the_query, GL_QUERY_NO_WAIT);

This tells OpenGL, “If the results of the query aren’t available yet, don’t wait for them; just go ahead and render anyway.” This is of greatest use when occlusion queries are being used to improve performance. Waiting for the results of occlusion queries can use up any time gained by using them in the first place. Thus, using the GL_QUERY_NO_WAIT flag essentially allows the occlusion query to be used as an optimization if the results are ready in time and to behave as if they aren’t used at all if the results aren’t ready. The use of GL_QUERY_NO_WAIT is similar to usingGL_QUERY_RESULT_AVAILABLE in the preceding examples. Don’t forget, though, if you use GL_QUERY_NO_WAIT, the actual geometry rendered is going to depend on whether the commands contributing to the query object have finished executing. This could depend on the performance of the machine your application is running on and can therefore vary from run to run. You should be sure that the result of your program is not dependent on the second set of geometry being rendered (unless this is what you want). If it is, your program might end up producing different output on a faster system than on a slower system.

Of course, it is also possible to use multiple query objects with conditional rendering, and so a final, combined example using all of the techniques in this section is given in Listing 11.6.


// Render simplified versions of 10 objects, each with its own occlusion
// query
int n;

for (n = 0; n < 10; n++)
{
glBeginQuery(GL_SAMPLES_PASSSED, ten_queries[n]);
RenderSimplifiedObject(&object[n]);
glEndQuery(GL_SAMPLES_PASSED);
}

// Render the more complex versions of the objects, skipping them
// if the occlusion query results are available and zero
for (n = 0; n < 10; n++)
{
glBeginConditionalRender(ten_queries[n], GL_QUERY_NO_WAIT);
RenderRealObject(&object[n]);
glEndConditionalRender();
}


Listing 11.6: A more complete conditional rendering example

In this example, simplified versions of ten objects are rendered first, each with its own occlusion query. Once the simplified versions of the objects have been rendered, the more complex versions of the objects are conditionally rendered based on the results of those occlusion queries. If the simplified versions of the objects are not visible, the more complex versions are skipped, potentially improving performance.

Advanced Occlusion Queries

The GL_SAMPLES_PASSED query target counts the exact number of samples that passed the depth test. Even if no significant rendering occurs, OpenGL must still effectively rasterize every primitive to determine the number of pixels it covers and how many of them pass the depth and stencil tests. Even worse, if your fragment shader does something to affect the result (such as using a discard statement or modifying the fragment’s depth value), then it must run your shader for every pixel as well. Sometimes, this really is what you want. However, very often, you will only care whether any sample passed the depth and stencil tests, or even whether any sample might have passed the depth and stencil tests.

To provide this kind of functionality, OpenGL provides two additional occlusion query targets. These are the GL_ANY_SAMPLES_PASSED and GL_ANY_SAMPLES_PASSED_CONSERVATIVE targets, and they are known as Boolean occlusion queries.

The first of these targets, GL_ANY_SAMPLES_PASSED, will produce a result of zero (or GL_FALSE) if no samples pass the depth and stencil tests, and one (the value of GL_TRUE) if any sample passes the depth test. In some circumstances, performance could be higher if the GL_ANY_SAMPLES_PASSEDquery target is used because OpenGL can stop counting samples as soon as any sample passes the depth and stencil tests. However, if no samples pass the depth and stencil tests, it is unlikely to provide any benefit.

The second Boolean occlusion query target, GL_ANY_SAMPLES_PASSED_CONSERVATIVE, is even more approximate.

In particular, it will count as soon as a sample might pass the depth and stencil tests. Many implementations of OpenGL implement some form of hierarchical depth testing, where the nearest and furthest depth values for a particular region of the screen are stored, and then as primitives are rasterized, the depth values for large blocks of them are tested against this hierarchical information to determine whether to continue to rasterize the interior of the region. A conservative occlusion query may simply count the number of these large regions and not run your shader at all, even if it discards fragments or modifies the final depth value.

Timer Queries

One further query type that you can use to judge how long rendering is taking is the timer query. Timer queries are used by passing the GL_TIME_ELAPSED query type as the target parameter of glBeginQuery() and glEndQuery(). When you call glGetQueryObjectuiv() to get the result from the query object, the value is the number of nanoseconds that elapsed between when OpenGL executes your calls to glBeginQuery() and glEndQuery(). This is actually the amount of time it took OpenGL to process all the commands between the glBeginQuery() and glEndQuery()commands. You can use this, for example, to figure out what the most expensive part of your scene is. Consider the code shown in Listing 11.7.


// Declare our variables
GLuint queries[3]; // Three query objects that we'll use
GLuint world_time; // Time taken to draw the world
GLuint objects_time; // Time taken to draw objects in the world
GLuint HUD_time; // Time to draw the HUD and other UI elements

// Create three query objects
glGenQueries(3, queries);

// Start the first query
glBeginQuery(GL_TIME_ELAPSED, queries[0]);

// Render the world
RenderWorld();

// Stop the first query and start the second...
// Note: we're not reading the value from the query yet
glEndQuery(GL_TIME_ELAPSED);
glBeginQuery(GL_TIME_ELAPSED, queries[1]);

// Render the objects in the world
RenderObjects();

// Stop the second query and start the third
glEndQuery(GL_TIME_ELAPSED);
glBeginQuery(GL_TIME_ELAPSED, queries[2]);

// Render the HUD
RenderHUD();

// Stop the last query
glEndQuery(GL_TIME_ELAPSED);

// Now, we can retrieve the results from the three queries.
glGetQueryObjectuiv(queries[0], GL_QUERY_RESULT, &world_time);
glGetQueryObjectuiv(queries[1], GL_QUERY_RESULT, &objects_time);
glGetQueryObjectuiv(queries[2], GL_QUERY_RESULT, &HUD_time);

// Done. world_time, objects_time, and hud_time contain the values we want.
// Clean up after ourselves.
glDeleteQueries(3, queries);


Listing 11.7: Timing operations using timer queries

After this code is executed, world_time, objects_time, and HUD_time will contain the number of nanoseconds it took to render the world, all the objects in the world, and the heads-up display (HUD), respectively. You can use this to determine what fraction of the graphics hardware’s time is taken up rendering each of the elements of your scene. This is useful for profiling your code during development — you can figure out what the most expensive parts of your application are, and so know from this where to spend optimization effort. You can also use it during runtime to alter the behavior of your application to try to get the best possible performance out of the graphics subsystem. For example, you could increase or reduce the number of objects in the scene depending on the relative value of objects_time. You could also dynamically switch between more or less complex shaders for elements of the scene based on the power of the graphics hardware. If you just want to know how much time passes, according to OpenGL, between two actions that your program takes, you can use glQueryCounter(), whose prototype is

void glQueryCounter(GLuint id, GLenum target);

You need to set id to GL_TIMESTAMP and target to the name of a query object that you’ve created earlier. This function puts the query straight into the OpenGL pipeline, and when that query reaches the end of the pipeline, OpenGL records its view of the current time into the query object. The time zero is not really defined — it just indicates some unspecified time in the past. To use this effectively, your application needs to take deltas between multiple time stamps. To implement the previous example using glQueryCounter(), we could write code as shown in Listing 11.8.


// Declare our variables
GLuint queries[4]; // Now we need four query objects
GLuint start_time; // The start time of the application
GLuint world_time; // Time taken to draw the world
GLuint objects_time; // Time taken to draw objects in the world
GLuint HUD_time; // Time to draw the HUD and other UI elements

// Create four query objects
glGenQueries(4, queries);

// Get the start time
glQueryCounter(GL_TIMESTAMP, queries[0]);

// Render the world
RenderWorld();

// Get the time after RenderWorld is done
glQueryCounter(GL_TIMESTAMP, queries[1]);

// Render the objects in the world
RenderObjects();

// Get the time after RenderObjects is done
glQueryCounter(GL_TIMESTAMP, queries[2]);

// Render the HUD
RenderHUD();

// Get the time after everything is done
glQueryCounter(GL_TIMESTAMP, queries[3]);

// Get the result from the three queries, and subtract them to find deltas
glGetQueryObjectuiv(queries[0], GL_QUERY_RESULT, &start_time);
glGetQueryObjectuiv(queries[1], GL_QUERY_RESULT, &world_time);
glGetQueryObjectuiv(queries[2], GL_QUERY_RESULT, &objects_time);
glGetQueryObjectuiv(queries[3], GL_QUERY_RESULT, &HUD_time);
HUD_time -= objects_time;
objects_time -= world_time;
world_time -= start_time;

// Done. world_time, objects_time, and hud_time contain the values we want.
// Clean up after ourselves.
glDeleteQueries(4, queries);


Listing 11.8: Timing operations using glQueryCounter()

As you can see, the code in this example is not that much different from that in Listing 11.7 shown earlier. You need to create four query objects instead of three, and you need to subtract out the results at the end to find time deltas. However, you don’t need to call glBeginQuery() andglEndQuery() in pairs, which means that there are fewer calls to OpenGL, in total. The results of the two samples aren’t quite equivalent. When you issue a GL_TIMESTAMP query, the time is written when the query reaches the end of the OpenGL pipeline. However, when you issue aGL_TIME_ELAPSED query, internally OpenGL will take a timestamp when glBeginQuery() reaches the start of the pipeline and again when glEndQuery() reaches the end of the pipeline, and then subtract the two. Clearly, the results won’t be quite the same. So long as you are consistent in which method you use, your results should still be meaningful, however.

One important thing to note about the results of timer queries is that, as they are measured in nanoseconds, their values can get very large in a small amount of time. A single, unsigned 32-bit value can count to a little over 4 seconds’ worth of nanoseconds. If you expect to time operations that take longer than this (hopefully over the course of many frames!), you might want to consider retrieving the full 64-bit results that query objects keep internally. To do this, call

void glGetQueryObjectui64v(GLuint id,
GLenum pname,
GLuint64 * params);

Just as with glGetQueryObjectuiv(), id is the name of the query object whose value you want to retrieve, and pname can be GL_QUERY_RESULT or GL_QUERY_RESULT_AVAILABLE to retrieve the result of the query or just an indication of whether it’s available or not.

Finally, although not technically a query, you can get an instantaneous, synchronous timestamp from OpenGL by calling

GLint64 t;
void glGetInteger64v(GL_TIMESTAMP, &t);

After this code has executed, t will contain the current time as OpenGL sees it. If you take this timestamp and then immediately launch a timestamp query, you can retrieve the result of the timestamp query and subtract t from it, and the result will be the amount of time that it took the query to reach the end of the pipeline. This is known as the latency of the pipeline and is approximately equal to the amount of time that will pass between your application issuing a command and OpenGL fully executing it.

Transform Feedback Queries

If you use transform feedback with a vertex shader but no geometry shader, the output from the vertex shader is recorded, and the number of vertices stored into the transform feedback is the same as the number of vertices sent to OpenGL unless the available space in any of the transform feedback buffers is exhausted. However, if a geometry shader is present, that shader may create or discard vertices, and so the number of vertices written to the transform feedback buffer may be different from the number of vertices sent to OpenGL. Also, if tessellation is active, the amount of geometry produced will depend on the tessellation factors produced by the tessellation control shader. OpenGL can keep track of the number of vertices written to the transform feedback buffers through query objects. Your application can then use this information to draw the resulting data or to know how much to read back from the transform feedback buffer, should it want to keep the data.

Query objects were introduced earlier in this chapter in the context of occlusion queries. It was stated that there are many questions that can be asked of OpenGL. Both the number of primitives generated and the number of primitives actually written to the transform feedback buffers are available as queries.

As before, to generate a query object, call

GLuint one_query;
glGenQueries(1, &one_query);

or to generate a number of query objects, call

GLuint ten_queries[10];
glGenQueries(10, ten_queries);

Now that you have created your query objects, you can ask OpenGL to start counting primitives as it produces them by beginning a GL_PRIMITIVES_GENERATED or GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query by beginning the query of the appropriate type. To start either query, call

glBeginQuery(GL_PRIMITIVES_GENERATED, one_query);

or

glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, one_query);

After a call to glBeginQuery() with either GL_PRIMITIVES_GENERATED or GL_TRANSFORM_FEEDBACK_PRIMTIVES_WRITTEN, OpenGL keeps track of how many primitives were produced by the front end, or how many were actually written into the transform feedback buffers until the query is ended using

glEndQuery(GL_PRIMITIVES_GENERATED);

or

glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);

The results of the query can be read by calling glGetQueryObjectuiv() with the GL_QUERY_RESULT parameter and the name of the query object. As with other OpenGL queries, the result might not be available immediately because of the pipelined nature of OpenGL. To find out if the results are available, call glGetQueryObjectuiv() with the GL_QUERY_RESULT_AVAILABLE parameter. See “Retrieving Query Results” earlier in this chapter for more information about query objects.

There are a couple of subtle differences between the GL_PRIMITIVES_GENERATED and GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN queries. The first is that the GL_PRIMITIVES_GENERATED query counts the number of primitives emitted by the front end, but theGL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query only counts primitives that were successfully written into the transform feedback buffers. The primitive count generated by the front end may be more or less than the number of primitives sent to OpenGL, depending on what it does. Normally, the results of these two queries would be the same, but if not enough space is available in the transform feedback buffers, GL_PRIMITIVES_GENERATED will keep counting, but GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN will stop.

You can check whether all of the primitives produced by your application were captured into the transform feedback buffer by running one of each query simultaneously and comparing the results. If they are equal, then all the primitives were successfully written. If they differ, the buffers you used for transform feedback were probably too small.

The second difference is that GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN is only meaningful when transform feedback is active. That is why it has TRANSFORM_FEEDBACK in its name but GL_PRIMITIVES_GENERATED does not. If you run a GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query when transform feedback is not active, the result will be zero. However, the GL_PRIMITIVES_GENERATED query can be used at any time and will produce a meaningful count of the number of primitives produced by OpenGL. You can use this to find out how many vertices your geometry shader produced or discarded.

Indexed Queries

If you are only using a single stream for storing vertices in transform feedback, then calling glBeginQuery() and glEndQuery() with the GL_PRIMITIVES_GENERATED or GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN targets works just fine. However, if you have a geometry shader in your pipeline, then that shader could produce primitives on up to four output streams. In that case, OpenGL provides indexed query targets that you can use to count how much data is produced on each stream. The glBeginQuery() and glEndQuery() functions associate queries with the first stream — the one with index zero. To begin and end a query on a different stream, you can call glBeginQueryIndexed() and glEndQueryIndexed(), whose prototypes are

void glBeginQueryIndexed(GLenum target,
GLuint index,
GLuint id);

void glEndQueryIndexed(GLenum target,
GLuint index);

These two functions behave just like their non-indexed counterparts, and the target and id parameters have the same meaning. In fact, calling glBeginQuery() is equivalent to calling glBeginQueryIndexed() with index set to zero. The same is true for glEndQuery() andglEndQueryIndexed(). When target is GL_PRIMITIVES_GENERATED, the query will count the primitives produced by the geometry shader on the stream whose index is given in index. Likewise, when target is GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, the query will count the number of primitives actually written into the buffers associated with the output stream of the geometry shader whose index is given in index. If no geometry shader is present, you can still use these functions, but only stream zero will actually count anything.

You can actually use the indexed query functions with any query target (such as GL_SAMPLES_PASSED or GL_TIME_ELAPSED), but the only value for index that is valid for those targets is zero.

Using the Results of a Primitive Query

Now you have the results of the front end stored in a buffer. You also determined how much data is in that buffer by using a query object. Now it’s time to use those results in further rendering. Remember that the results of the front end are placed into a buffer using transform feedback. The only thing making the buffer a transform feedback buffer is that it’s bound to one of the GL_TRANSFORM_FEEDBACK_BUFFER binding points. However, buffers in OpenGL are generic chunks of data and can be used for other purposes.

Generally, after running a rendering pass that produces data into a transform feedback buffer, you bind the buffer object to the GL_ARRAY_BUFFER binding point so that it can be used as a vertex buffer. If you are using a geometry shader that might produce an unknown amount of data, you need to use a GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query to figure out how many vertices to render on the second pass. Listing 11.9 shows an example of what such code might look like.


// We have two buffers, buffer1 and buffer2. First, we'll bind buffer1 as the
// source of data for the draw operation (GL_ARRAY_BUFFER), and buffer2 as
// the destination for transform feedback (GL_TRANSFORM_FEEDBACK_BUFFER).
glBindBuffer(GL_ARRAY_BUFFER, buffer1);
glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFFER, buffer2);

// Now, we need to start a query to count how many vertices get written to
// the transform feedback buffer
glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, q);

// Ok, start transform feedback...
glBeginTransformFeedback(GL_POINTS);

// Draw something to get data into the transform feedback buffer
DrawSomePoints();

// Done with transform feedback
glEndTransformFeedback();

// End the query and get the result back
glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
glGetQueryObjectuiv(q, GL_QUERY_RESULT, &vertices_to_render);

// Now we bind buffer2 (which has just been used as a transform
// feedback buffer) as a vertex buffer and render some more points
// from it.
glBindBuffer(GL_ARRAY_BUFFER, buffer2);
glDrawArrays(GL_POINTS, 0, vertices_to_render);


Listing 11.9: Drawing data written to a transform feedback buffer

Whenever you retrieve the results of a query from OpenGL, it has to finish what it’s doing so that it can provide an accurate count. This is true for transform feedback queries just as it is for any other type of query. When you execute the code shown in Listing 11.9, as soon as you callglGetQueryObjectuiv(), the OpenGL pipeline will drain and the graphics processor will idle. All this just so the vertex count can make a round trip from the GPU to your application and back again. To get around this, OpenGL provides two things. First, is the transform feedback object, which represents the state of the transform feedback stage. Up until now, you have been using the default transform feedback object. However, you can create your own by calling glGenTransformFeedbacks() followed by glBindTransformFeedback():

void glGenTransformFeedbacks(GLsizei n,
GLuint * ids);

void glBindTransformFeedback(GLenum target,
GLuint id);

For glGenTransformFeedbacks(), n is the number of object names to reserve and ids is a pointer to an array into which the new names will be written. Once you have a new name, you bind it using glBindTransformFeedback(), whose first parameter, target, must beGL_TRANSFORM_FEEDBACK and whose second parameter, id, is the name of the transform feedback object to bind. You can delete transform feedback objects using glDeleteTransformFeedbacks(), and you can determine whether a given value is the name of a transform feedback object by calling glIsTransformFeedback():

void glDeleteTransformFeedbacks(GLsizei n,
const GLuint * ids);

GLboolean glIsTransformFeedback(GLuint id);

Once a transform feedback object is bound, all state related to transform feedback is kept in that object, and this includes the transform feedback buffer bindings and the counts used to keep track of how much data has been written to each transform feedback stream. This is effectively the same data that would be returned in a transform feedback query, and we can use it to automatically draw the number of vertices captured using transform feedback. This is the second part of functionality that OpenGL provides for this purpose, and it consists of four functions:

void glDrawTransformFeedback(GLenum mode,
GLuint id);

void glDrawTransformFeedbackInstanced(GLenum mode,
GLuint id,
GLsizei primcount);

void glDrawTransformFeedbackStream(GLenum mode,
GLuint id,
GLuint stream);

void glDrawTransformFeedbackStreamInstanced(GLenum mode,
GLuint id,
GLuint stream,
GLsizei primcount);

For all four functions, mode is one of the primitive modes that can be used with other drawing functions such as glDrawArrays() and glDrawElements(), and id is the name of a transform feedback object that contains the counts.

• Calling glDrawTransformFeedback() is equivalent to calling glDrawArrays(), except that the number of vertices to process is taken from the first stream of the transform feedback object named in id.

• Calling glDrawTransformFeedbackInstanced() is equivalent to glDrawArraysInstanced(), with the vertex count again sourced from the first stream of the transform feedback object named in id and with the instance count specified in primcount.

• Calling glDrawTransformFeedbackStream() is equivalent to calling glDrawTransformFeedback(), except that the stream given in stream is used as the source of the count.

• Calling glDrawTransformFeedbackStreamInstanced() is equivalent to calling glDrawTransformFeedbackInstanced(), except that the stream given in stream is used as the source of the count.

When you use one of the functions that take a stream index, data must be recorded into the transform feedback buffers associated with streams other than zero using a geometry shader, as discussed in “Multiple Streams of Storage” back in Chapter 8.

Synchronization in OpenGL

In an advanced application, OpenGL’s order of operation and the pipeline nature of the system may be important. Examples of such applications are those with multiple contexts and multiple threads, or those sharing data between OpenGL and other APIs such as OpenCL. In some cases, it may be necessary to determine whether commands sent to OpenGL have finished yet and whether the results of those commands are ready. In this section, we discuss various methods of synchronizing various parts of the OpenGL pipeline.

Draining the Pipeline

OpenGL includes two commands to force it to start working on commands or to finish working on commands that have been issued so far. These are

glFlush();

and

glFinish();

There are subtle differences between the two. The first, glFlush(), ensures that any commands issued so far are at least placed into the start of the OpenGL pipeline and that they will eventually be executed. The problem is that glFlush() doesn’t tell you anything about the execution status of the commands issued — only that they will eventually be executed. glFinish(), on the other hand actually ensures that all commands issued have been fully executed and that the OpenGL pipeline is empty. While glFinish() does ensure that all of your OpenGL commands have beenprocessed, it will empty the OpenGL pipeline, causing a bubble and reducing performance, sometimes drastically. In general, it is recommended that you don’t call glFinish() for any reason.

Synchronization and Fences

Sometimes it may be necessary to know whether OpenGL has finished executing commands up to some point without forcing to empty the pipeline. This is especially useful when you are sharing data between two contexts or between OpenGL and OpenCL, for example. This type of synchronization is managed by what are known as sync objects. Like any other OpenGL object, they must be created before they are used and destroyed when they are no longer needed. Sync objects have two possible states: signaled and unsignaled. They start out in the unsignaled state, and when some particular event occurs, they move to the signaled state. The event that triggers their transition from unsignaled to signaled depends on their type. The type of sync object we are interested in is called a fence sync, and one can be created by calling

GLsync glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

The first parameter is a token specifying the event we’re going to wait for. In this case, GL_SYNC_GPU_COMMANDS_COMPLETE says that we want the GPU to have processed all commands in the pipeline before setting the state of the sync object to signaled. The second parameter is a flags field and is zero here because no flags are relevant for this type of sync object. The glFenceSync() function returns a new GLsync object. As soon as the fence sync is created, it enters (in the unsignaled state) the OpenGL pipeline and is processed along with all the other commands without stalling OpenGL or consuming significant resources. When it reaches the end of the pipeline, it is “executed” like any other command, and this sets its state to signaled. Because of the in-order nature of OpenGL, this tells us that any OpenGL commands issued before the call to glFenceSync() have completed, even though commands issued after the glFenceSync() may not have reached the end of the pipeline yet.

Once the sync object has been created (and has therefore entered the OpenGL pipeline), we can query its state to find out if it’s reached the end of the pipeline yet, and we can ask OpenGL to wait for it to become signaled before returning to the application. To determine whether the sync object has become signaled yet, call

glGetSynciv(sync, GL_SYNC_STATUS, sizeof(GLint), NULL, &result);

When glGetSynciv() returns, result (which is a GLint) will contain GL_SIGNALED if the sync object was in the signaled state and GL_UNSIGNALED otherwise. This allows the application to poll the state of the sync object and use this information to potentially do some useful work while the GPU is busy with previous commands. For example, consider the code in Listing 11.10.


GLint result = GL_UNSIGNALED;
glGetSynciv(sync, GL_SYNC_STATUS, sizeof(GLint), NULL, &result);
while (result != GL_SIGNALED)
{
DoSomeUsefulWork();
glGetSynciv(sync, GL_SYNC_STATUS, sizeof(GLint), NULL, &result);
}


Listing 11.10: Working while waiting for a sync object

This code loops, doing a small amount of useful work on each iteration until the sync object becomes signaled. If the application were to create a sync object at the start of each frame, the application could wait for the sync object from two frames ago and do a variable amount of work depending on how long it takes the GPU to process the commands for that frame. This allows an application to balance the amount of work done by the CPU (such as the number of sound effects to mix together or the number of iterations of a physics simulation to run, for example) with the speed of the GPU.

To actually cause OpenGL to wait for a sync object to become signaled (and therefore for the commands in the pipeline before the sync to complete), there are two functions that you can use:

glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT, timeout);

or

glWaitSync(sync, 0, GL_TIMEOUT_IGNORED);

The first parameter to both functions is the name of the sync object that was returned by glFenceSync(). The second and third parameters to the two functions have the same names but must be set differently.

For glClientWaitSync(), the second parameter is a bitfield specifying additional behavior of the function. The GL_SYNC_FLUSH_COMMANDS_BIT tells glClientWaitSync() to ensure that the sync object has entered the OpenGL pipeline before beginning to wait for it to become signaled.

Without this bit, there is a possibility that OpenGL could watch for a sync object that hasn’t been sent down the pipeline yet, and the application could end up waiting forever and hang. It’s a good idea to set this bit unless you have a really good reason not to. The third parameter is a timeout value in nanoseconds to wait. If the sync object doesn’t become signaled within this time, glClientWaitSync() returns a status code to indicate so. glClientWaitSync() won’t return until either the sync object becomes signaled or a timeout occurs.

There are four possible status codes that might be returned by glClientWaitSync(). They are summarized in Table 11.1.

Image

Table 11.1. Possible Return Values for glClientWaitSync()

There are a couple of things to note about the timeout value. First, while the unit of measurement is nanoseconds, there is no accuracy requirement in OpenGL. If you specify that you want to wait for one nanosecond, OpenGL could round this up to the next millisecond or more. Second, if you specify a timeout value of zero, glClientWaitSync() will return GL_ALREADY_SIGNALED if the sync object was in a signaled state at the time of the call and GL_TIMEOUT_EXPIRED otherwise. It will never return GL_CONDITION_SATISFIED.

For glWaitSync(), the behavior is slightly different. The application won’t actually wait for the sync object to become signaled, only the GPU will. Therefore, glWaitSync() will return to the application immediately. This makes the second and third parameters somewhat irrelevant. Because the application doesn’t wait for the function to return, there is no danger of your application hanging, and so the GL_SYNC_FLUSH_COMMANDS_BIT is not needed and would actually cause an error if specified. Also, the timeout will actually be implementation dependent, and so the special timeout value GL_TIMEOUT_IGNORED is specified to make this clear. If you’re interested, you can find out what the timeout value used by your implementation is by calling glGetInteger64v() with the GL_MAX_SERVER_WAIT_TIMEOUT parameter.

You might be wondering, “What is the point of asking the GPU to wait for a sync object to reach the end of the pipeline?” After all, the sync object will become signaled when it reaches the end of the pipeline, and so if you wait for it to reach the end of the pipeline, it will of course be signaled. Therefore, won’t glWaitSync() just do nothing? This would be true if we only considered simple applications that only use a single OpenGL context and that don’t use other APIs. However, the power of sync objects is harnessed when using multiple OpenGL contexts. Sync objects can be shared between OpenGL contexts and between compatible APIs such as OpenCL. That is, a sync object created by a call to glFenceSync() on one context can be waited for by a call to glWaitSync() (or glClientWaitSync()) on another context.

Consider this. You can ask one OpenGL context to hold off rendering something until another context has finished doing something. This allows synchronization between two contexts. You can have an application with two threads and two contexts (or more, if you want). If you create a sync object in each context, and then in each context you wait for the sync objects from the other contexts using either glClientWaitSync() or glWaitSync(), you know that when all of the functions have returned, all of those contexts are synchronized with each other. Together with thread synchronization primitives provided by your OS (such as semaphores), you can keep rendering to multiple windows in sync.

An example of this type of usage is when a buffer is shared between two contexts. The first context is writing to the buffer using transform feedback, while the second context wants to draw the results of the transform feedback. The first context would draw using transform feedback mode. After calling glEndTransformFeedback(), it immediately calls glFenceSync(). Now, the application makes the second context current and calls glWaitSync() to wait for the sync object to become signaled. It can then issue more commands to OpenGL (on the new context), and those are queued up by the drivers, ready to execute. Only when the GPU has finished recording data into the transform feedback buffers with the first context does it start to work on the commands using that data in the second context.

There are also extensions and other functionality in APIs like OpenCL that allow asynchronous writes to buffers. You can use glWaitSync() to ask a GPU to wait until the data in a buffer is valid by creating a sync object on the context that generates the data and then waiting for that sync object to become signaled on the context that’s going to consume the data.

Sync objects only ever go from the unsignaled to the signaled state. There is no mechanism to put a sync object back into the unsignaled state, even manually. This is because a manual flip of a sync object can cause race conditions and possibly hang the application. Consider the situation where a sync object is created, reaches the end of the pipeline and becomes signaled, and then the application set it back to unsignaled. If another thread tried to wait for that sync object but didn’t start waiting until after the application had already set the sync object back to the unsignaled state, it would wait forever. Each sync object therefore represents a one-shot event, and every time a synchronization is required, a new sync object must be created by calling glFenceSync(). Although it is always important to clean up after yourself by deleting objects when you’re done with them, this is particularly important with sync objects because you might be creating many new ones every frame. To delete a sync object, call

glDeleteSync(sync);

This deletes the sync object. This may not occur immediately; any thread that is watching for the sync object to become signaled will still wait for its respective timeouts, and the object will actually be deleted once nobody’s watching it any more. Thus, it is perfectly legal to call glWaitSync()followed by glDeleteSync() even though the sync object is still in the OpenGL pipeline.

Summary

This chapter discussed how to monitor the execution of your commands in the pipeline and get some feedback about their progress down it. You saw how to measure the time taken for your commands to complete, and have the tools necessary to measure the latency of the graphics pipeline. This, in turn, allows you to alter your application’s complexity to suit the system it’s running on and the performance targets you’ve set for it. We will use these tools for real-world performance tuning exercises in Chapter 13, “Debugging and Performance Optimization.” You also saw how it is possible to synchronize the execution of your application to the OpenGL context, and how to synchronize execution of multiple OpenGL contexts.