"Don't rush to implement your commander's orders. Wait until he changes his mind."
- Soviet Army saying
How to cut your D3D call cost by a tiny immeasurable fraction
Wednesday, August 4, 2010 | Permalink

One difference between D3D and OpenGL is that the former uses an object-oriented API. All API calls are virtual rather than being plain C calls like in OpenGL. The main advantage of this is of course flexibility. The runtime can easily provide many different implementations and hand you back any one depending on your device creation call parameters. The obvious example of that would be the debug and retail runtimes. I suppose the D3DCREATE_PUREDEVICE flag in DX9 also handed you a different implementation than the standard functions. It's of course faster to have a D3D runtime function that's trimmed down than to have a single function that looks at IsDebug and IsPure booleans. The disadvantage of having virtual functions is that dispatching virtual function calls comes with a bit of overhead.
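
To make the flexibility point concrete, here's a minimal sketch of a creation call handing back different implementations. IDevice, RetailDevice, DebugDevice and CreateDevice are hypothetical stand-ins, not real D3D names, and this only illustrates the idea; it is not how the D3D runtime is actually built.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a runtime interface; not a real D3D type.
struct IDevice {
    virtual ~IDevice() = default;
    virtual std::string Draw(unsigned indexCount) = 0;
};

// Lean implementation: no validation at all.
struct RetailDevice final : IDevice {
    std::string Draw(unsigned) override { return "drawn"; }
};

// Checked implementation: validates its arguments first.
struct DebugDevice final : IDevice {
    std::string Draw(unsigned indexCount) override {
        if (indexCount == 0) return "error: zero indices";
        return "drawn (validated)";
    }
};

// Hypothetical creation call: the flag is inspected once, here,
// instead of inside every Draw call.
std::unique_ptr<IDevice> CreateDevice(bool debug) {
    if (debug) return std::unique_ptr<IDevice>(new DebugDevice);
    return std::unique_ptr<IDevice>(new RetailDevice);
}
```

The caller only ever sees an IDevice pointer, so neither implementation needs an IsDebug branch on its hot path.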

One thing to note though is that once you've looked up the actual address for a virtual function, the actual function call is no different than calling a non-virtual function. In fact, the only thing different from a plain C function or static member function is that you pass the this pointer as well. Consider the following D3D11 call:

virtual void STDMETHODCALLTYPE DrawIndexed(UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation);

We can declare the equivalent C-style function pointer type like this:
typedef void (STDMETHODCALLTYPE *DrawIndexedFunc)(ID3D11DeviceContext *ctx, UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation);

Then we can create a function pointer like so:

DrawIndexedFunc MyDrawIndexed;

And the call to ID3D11DeviceContext::DrawIndexed(...) can be done with MyDrawIndexed(context, ...) provided that MyDrawIndexed has been loaded with the correct function pointer. Very straightforward.

So how do we find the function pointer? Virtual functions are looked up through a v-table, which is essentially a list of function pointers for all the virtual functions in the class. When an object of a class with virtual functions is created, a pointer to a static v-table is stored in the object. The C++ standard doesn't require any particular memory layout, or even require a v-table at all to solve the virtual function dispatch problem, so this code will be highly unportable. But if you're coding DirectX you're only building for Windows anyway and chances are you're using the MSVC compiler. In that case the v-table pointer will be the very first member of the class. Other compilers may do it differently. From what I gather it's common for Unix compilers to put the v-table pointer at the end of the class instead.

Given an ID3D11DeviceContext pointer, let's call it "ctx", the first thing we need to do is grab its v-table:

void **v_table = *(void ***) ctx;

This somewhat cryptic code basically just grabs the first 4 bytes (8 bytes on x64) out of the memory ctx points to, which will be the v-table pointer. Now we just need to know which entry in the table represents DrawIndexed. The hard way is to look in the D3D headers. DrawIndexed is the 6th member declared in ID3D11DeviceContext, but that interface also inherits from ID3D11DeviceChild, which has 4 virtual functions, and which in turn inherits from IUnknown, which has 3. That puts 7 functions ahead of it, so it's the 13th function and should be at index 12. So we can find the pointer like this:

DrawIndexedFunc MyDrawIndexed = (DrawIndexedFunc) (v_table[12]);
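
Putting the steps so far together, here's a small sketch that runs without D3D. IFakeContext and FakeContext are hypothetical stand-ins for the real interface, SKETCH_CALL is a made-up macro playing the role of STDMETHODCALLTYPE, and the index 2 applies to this toy interface only. It assumes the v-table pointer is the first thing in the object, which holds for MSVC and for current GCC and Clang on the usual desktop targets, but which the C++ standard does not promise.

```cpp
#include <cstring>

// Made-up macro standing in for STDMETHODCALLTYPE. On 32-bit MSVC, member
// functions default to a convention that passes "this" in a register, so the
// methods must be declared __stdcall for the C-style call below to match.
#if defined(_MSC_VER) && defined(_M_IX86)
#define SKETCH_CALL __stdcall
#else
#define SKETCH_CALL
#endif

// Hypothetical COM-like interface: three virtual functions, no virtual
// destructor, so Draw sits at v-table index 2.
struct IFakeContext {
    virtual int SKETCH_CALL AddRef() = 0;                         // index 0
    virtual int SKETCH_CALL Release() = 0;                        // index 1
    virtual int SKETCH_CALL Draw(int indexCount, int start) = 0;  // index 2
};

struct FakeContext final : IFakeContext {
    int refs = 1;
    int SKETCH_CALL AddRef() override { return ++refs; }
    int SKETCH_CALL Release() override { return --refs; }
    int SKETCH_CALL Draw(int indexCount, int start) override { return indexCount + start; }
};

// C-style equivalent of Draw: the this pointer becomes an explicit first argument.
typedef int (SKETCH_CALL *DrawFunc)(IFakeContext *self, int indexCount, int start);

DrawFunc LookupDraw(IFakeContext *ctx) {
    void **v_table;
    // Same read as *(void ***) ctx, done with memcpy to stay clear of aliasing issues.
    std::memcpy(&v_table, ctx, sizeof(v_table));
    return reinterpret_cast<DrawFunc>(v_table[2]);
}
```

With a FakeContext c, LookupDraw(&c)(&c, 30, 5) ends up in FakeContext::Draw just like c.Draw(30, 5) would.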

The easy way to figure this out is to just set a breakpoint at a regular DrawIndexed call and switch to disassembly view to see what code the compiler generated. It could for instance look like this:

mov eax, dword ptr [esi]
mov ecx, dword ptr [eax]
mov edx, dword ptr [ecx+30h]
push 0
push 0
push 1Eh
push eax
call edx
mov eax, dword ptr [esi]

Here esi points to the class which holds "ctx". So first it grabs the ctx pointer, then grabs the v-table from it, and on the third line looks up the function address at offset 0x30 in the v-table. 0x30 / sizeof(void *) is 12, so there's your index. The following three lines push the arguments to the function on the stack in reverse order, followed by the this pointer. The this pointer, which in this case is "ctx", was fetched to eax on the first line.
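
The index arithmetic can be written down as a few compile-time checks. This is just a sketch: the helper names are made up, and the slot counts are the ones from the header walk above. Note that on x64, where pointers are 8 bytes, the same index shows up as offset 0x60 in the disassembly.

```cpp
#include <cstddef>

// Slots that come before ID3D11DeviceContext's own methods, per the header walk.
constexpr std::size_t kIUnknownSlots = 3;
constexpr std::size_t kDeviceChildSlots = 4;

// Zero-based v-table index of the n-th method (1-based) declared in the interface.
constexpr std::size_t VTableIndex(std::size_t nthDeclared) {
    return kIUnknownSlots + kDeviceChildSlots + (nthDeclared - 1);
}

// Byte offset seen in the disassembly for a given index and pointer size.
constexpr std::size_t VTableOffset(std::size_t index, std::size_t pointerSize) {
    return index * pointerSize;
}

static_assert(VTableIndex(6) == 12, "DrawIndexed is the 13th slot");
static_assert(VTableOffset(12, 4) == 0x30, "32-bit build");
static_assert(VTableOffset(12, 8) == 0x60, "64-bit build");
```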

Now what happens if we make this call through MyDrawIndexed? Well, this:

mov ecx, dword ptr [esi]
push 0
push 0
push 1Eh
push ecx
call dword ptr [MyDrawIndexed]

That's two instructions fewer. Woot!
Also note that the first call was daisy chaining the fetches. Two of those indirections were removed. It should be noted however that for this to work, the MyDrawIndexed variable must either be a static member variable or a global variable. In other words, its address should be resolvable at compile time. If you only have one device context this should be no problem. If you are using multiple contexts, for instance for threaded rendering, you may not want to rely on both contexts having the same function pointers in their v-tables, although this is likely to be true if they were created with the same parameters. You could in that case simply store the function pointer next to the device context in whatever encapsulating class you have, like my "Context" class I referred to earlier. While this is not as optimal, it still cuts down some work:

mov ecx,dword ptr [esi]
mov edx,dword ptr [esi+4]
push 0
push 0
push 1Eh
push ecx
call edx

This is one instruction longer, although still one shorter than the initial code. The most important thing though is that this code still only has one level of indirection, whereas the original one has three.
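
Here's a rough sketch of what storing the function pointer next to the context could look like. IStubContext, StubContext, CachedContext and the SKETCH_CALL macro are all hypothetical names for illustration, the index 1 belongs to this toy interface only, and it again assumes the v-table pointer is the first thing in the object.

```cpp
#include <cstring>

// Made-up macro standing in for STDMETHODCALLTYPE (see the 32-bit MSVC note).
#if defined(_MSC_VER) && defined(_M_IX86)
#define SKETCH_CALL __stdcall
#else
#define SKETCH_CALL
#endif

// Hypothetical stand-in interface; DrawIndexed is at v-table index 1 here.
struct IStubContext {
    virtual void SKETCH_CALL Begin() = 0;                     // index 0
    virtual int SKETCH_CALL DrawIndexed(int indexCount) = 0;  // index 1
};

struct StubContext final : IStubContext {
    int drawn = 0;
    void SKETCH_CALL Begin() override { drawn = 0; }
    // Accumulates and returns the running total so calls can be observed.
    int SKETCH_CALL DrawIndexed(int indexCount) override { return drawn += indexCount; }
};

typedef int (SKETCH_CALL *DrawIndexedFunc)(IStubContext *self, int indexCount);

// Hypothetical wrapper: each instance caches the pointer looked up from its
// own context, so two contexts never have to share a v-table.
struct CachedContext {
    IStubContext *ctx;
    DrawIndexedFunc drawIndexed;

    explicit CachedContext(IStubContext *c) : ctx(c), drawIndexed(Lookup(c)) {}

    static DrawIndexedFunc Lookup(IStubContext *c) {
        void **v_table;
        std::memcpy(&v_table, c, sizeof(v_table));  // v-table pointer assumed first
        return reinterpret_cast<DrawIndexedFunc>(v_table[1]);
    }

    // One load for ctx, one for the cached pointer, then the call.
    int DrawIndexed(int indexCount) { return drawIndexed(ctx, indexCount); }
};
```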

I should also mention that C++ has some fancy syntax for pointers to C++ member functions. The underlying mechanism for how those work is somewhat different from how standard C function pointers work. However, with a static or global member function pointer the actual code generated is the same as with a regular function pointer. If you put it next to "ctx" though it will generate one more instruction, making it the same length as the original virtual call. It's still only one level of indirection though, so it's still better. The member function pointer itself appears to have a sizeof() of 16. I don't know what the unused bytes are for; only the last 8 are actually used in the call. The advantage of using C++ member function pointers though is that you can assign to them by name instead of figuring out the v-table index, so it creates somewhat prettier code.

typedef void (STDMETHODCALLTYPE ID3D11DeviceContext::*DrawIndexedFunc)(UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation);

DrawIndexedFunc MyDrawIndexed = &ID3D11DeviceContext::DrawIndexed;

And then the fancy calling syntax:
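
A minimal sketch of that syntax, which is the standard C++ `->*` operator. IMiniContext, MiniContext and CallThroughMember are hypothetical stand-ins so the example builds without the D3D headers.

```cpp
// Hypothetical stand-in interface; not a real D3D type.
struct IMiniContext {
    virtual int DrawIndexed(int indexCount, int startIndex) = 0;
};

struct MiniContext final : IMiniContext {
    int DrawIndexed(int indexCount, int startIndex) override { return indexCount - startIndex; }
};

// Pointer-to-member type, assigned by name rather than by v-table index.
typedef int (IMiniContext::*DrawIndexedMember)(int indexCount, int startIndex);

int CallThroughMember(IMiniContext *ctx, int indexCount, int startIndex) {
    DrawIndexedMember MyDrawIndexed = &IMiniContext::DrawIndexed;
    // The parentheses around ctx->*MyDrawIndexed are required, because the
    // function-call operator binds tighter than ->*.
    return (ctx->*MyDrawIndexed)(indexCount, startIndex);
}
```

Even though the pointer is taken from the abstract interface, the call still lands in the override of whatever object ctx points to.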


So what does all this messing around actually gain you? Performance-wise probably somewhere between infinitesimal and nothing. The number of cycles spent inside the DrawIndexed call probably far outweighs any slight gain in calling it. In fact, if you set a breakpoint and step inside the function you will find that you're stepping over quite a large number of instructions before you return. You'll also notice that DrawIndexed in turn calls a few other virtual functions under the hood. If anything, you gain insight into the underlying mechanisms of virtual function calls. Plus, of course, messing with v-tables is a lot of fun.




Marek Olsak
Thursday, August 5, 2010

The following article explains how to turn most virtual function calls into inline ones and still have the same level of abstraction. It's an example showing that virtual functions may have nearly no impact on performance when used wisely.

See here:

Cyril Crassin
Thursday, August 5, 2010

I love this kind of hack :-D
Thanks for the trick, Humus! That can also be applied in many other APIs relying on virtual calls.

Thursday, August 5, 2010

Beware though, this is obviously not language-compliant (you're looking under the hood). And I can also imagine that the indirection that the vtable provides could be hijacked by the runtime (think modifying the vtable at runtime). Even if it's not now, it might happen in the future. (Unlikely, but still).

Oh, and BTW, the GL functions may be static C functions, but the first thing they do on Windows is call into an indirection table that's part of the ICD model (for the Windows-exposed GL functions at least; things are different for extension methods).
All in all, I would not recommend the method!

Thursday, August 5, 2010

You don't have to save the function pointer, there's syntax for getting access to it when calling it, which is done at compile time. Which might also save you a little more by using a constant function pointer value instead of reading it from a variable.

It also prevents you from going into ASM and the flaws on ASM (non-portability and readability), and works with the compiler instead of subverting it.

pointer->Class::FnCall(); //See link below

Thursday, August 5, 2010

Barbie, the v-table is located in read-only memory, in the code segment I think. It never changes at runtime. In fact, if you attempt to alter it, the code will crash. Of course, the D3D runtime could avoid using any form of standard C++ calls to create the object and simply create it as a struct, create a v-table dynamically, fill in whatever pointers it wants and send that back to you. In that case you shouldn't crash when attempting to write to it though, unless they explicitly tagged the memory page as read-only after filling it in.

Reavenk, that's a nice trick for when you want a specific implementation, but it relies on the function pointer being resolvable at compile-time. That's impossible for D3D. If you attempt this on a device pointer you'll get a link error. That's because ID3D11DeviceContext is an abstract interface. It only has pure virtual functions. It should also be noted that if D3D really provided a base implementation in the ID3D11DeviceContext class and only overrode specific functions in deriving classes, using your trick would make it call the wrong function. The actual target function you need cannot be resolved at compile-time if the class type is variable, so a function pointer that you fill in at runtime is most certainly needed.
As for ASM, I don't know quite what you mean. I didn't write any assembly code for this, it's all plain C++. I used the disassembly view though for verifying that I'm getting the results I'm expecting. Even if you live in the C++ world you have to peek into the machine code once in a while to get an understanding of what the generated code looks like.
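
Here's a minimal sketch of the point about qualified calls, using hypothetical Base and Derived classes rather than anything from D3D. If Base::Name were pure virtual with no definition, as the D3D interface methods are, the qualified call would fail to link instead.

```cpp
#include <string>

// Hypothetical classes for illustration only.
struct Base {
    virtual ~Base() = default;
    virtual std::string Name() const { return "Base"; }
};

struct Derived final : Base {
    std::string Name() const override { return "Derived"; }
};

// Ordinary virtual call: resolved at run time from the object's dynamic type.
std::string VirtualCall(const Base *p) { return p->Name(); }

// Qualified call: bound at compile time to Base::Name, bypassing the v-table,
// so an override in the derived class is silently skipped.
std::string QualifiedCall(const Base *p) { return p->Base::Name(); }
```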

Sean Barrett
Thursday, August 5, 2010

Last time I checked, Direct3D API objects were COM objects, not C++ objects. They have a well-defined systematic way of being accessed from C which must be consistent and forever obeyed. This happens to be exactly the C++ vtable layout, of course, but there's no sense in which this is cheating or delving into proprietary aspects; if somebody decided to change how the C++ compiler laid things out, they would stop being C++ objects but would remain COM objects, I think. (Of course everything would break in practice, but at a minimum you've got binary compatibility guaranteed on the old versions.)

I didn't know that the GL went through a jump table as Barbie suggests. I always assumed the ICD model had the vendors supplying their own opengl32.dll to avoid this, but perhaps not.

Also, the OpenGL functions appear to lack a 'this' pointer, but they implicitly have one--each thread has a render context, so instead of explicitly being passed 'this' they have the implicit 'rc' that they have to get from thread-local storage. I seem to recall hearing that the OpenGL Windows devs cajoled the appropriate OS folk into giving them a hardcoded tls slot directly off of FS so they can get to it in one instruction.

Thursday, August 5, 2010

I'm guilty of skimming the top and only reading the bottom, so I assumed ASM was being used. And I jumped the gun and completely forgot about requesting the DirectX object. So I've come back to apologize twice.

Friday, August 6, 2010

I think this is just a huge waste of time.

Even if there were a difference on a modern superscalar out-of-order CPU (which I doubt), the time spent in the DirectX runtime is so large that the time saving will not be measurable.

And if you are calling DirectX functions at such a frequency that it does you are fucked anyway.
