"Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end."
- Henry Spencer
Shader programming tips #1
Thursday, January 29, 2009 | Permalink

DX9 generation hardware was largely vector based. The DX10 generation hardware, on the other hand, is generally scalar based. This is true for both ATI and Nvidia cards. The Nvidia chips are fully scalar, and while the ATI chips still have explicit parallelism, the 5 scalars within an instruction slot don't need to perform the same operation or operate on the same registers. This is important to remember and should affect how you write shader code. Take for instance this simple diffuse lighting computation:

float3 lightVec = normalize(In.lightVec);
float3 normal = normalize(In.normal);
float diffuse = saturate(dot(lightVec, normal));

A normalize is essentially a DP3-RSQ-MUL sequence. DP3 and MUL are 3-way vector instructions and RSQ is scalar. The shader above will thus be 3 x DP3 + 2 x MUL + 2 x RSQ for a total of 17 scalar operations.
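To make the instruction breakdown concrete, here is a minimal CPU-side C++ sketch of normalize written out as its DP3, RSQ and MUL steps. The Float3 type and the helper names are assumptions made for illustration only, and real hardware instruction scheduling will of course differ.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the HLSL float3 type.
struct Float3 { float x, y, z; };

// DP3: a 3-way dot product, counted as 3 scalar operations.
float dp3(const Float3& a, const Float3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// RSQ: reciprocal square root, counted as 1 scalar operation.
float rsq(float v)
{
    assert(v > 0.0f); // a zero-length vector cannot be normalized
    return 1.0f / std::sqrt(v);
}

// MUL: vector times scalar, counted as 3 scalar operations.
Float3 mul(const Float3& v, float s)
{
    return { v.x * s, v.y * s, v.z * s };
}

// normalize = DP3 + RSQ + MUL = 3 + 1 + 3 = 7 scalar operations.
Float3 normalize(const Float3& v)
{
    return mul(v, rsq(dp3(v, v)));
}

// Two normalizes plus the final DP3 of the diffuse term: 7 + 7 + 3.
constexpr int kDiffuseScalarOps = 2 * (3 + 1 + 3) + 3;
static_assert(kDiffuseScalarOps == 17, "matches the count given in the text");
```

Calling normalize on (3, 0, 4) gives approximately (0.6, 0, 0.8), since the squared length is 25 and the reciprocal square root is 0.2.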
Now instead of multiplying the RSQ values into the vectors, why don't we just multiply those scalars into the final scalar instead? Then we would get this shader:

float lightVecRSQ = rsqrt(dot(In.lightVec, In.lightVec));
float normalRSQ = rsqrt(dot(In.normal, In.normal));
float diffuse = saturate(dot(In.lightVec, In.normal) * lightVecRSQ * normalRSQ);

This replaces two vector multiplications with two scalar multiplications, saving us 4 scalar operations. The math savvy may also recognize that rsqrt(x) * rsqrt(y) = rsqrt(x * y). So we can simplify it to:

float lightVecSQ = dot(In.lightVec, In.lightVec);
float normalSQ = dot(In.normal, In.normal);
float diffuse = saturate(dot(In.lightVec, In.normal) * rsqrt(lightVecSQ * normalSQ));

We are now down to 12 operations instead of 17. Checking things out in GPU Shader Analyzer showed that the final instruction count is 5 in both cases, but the latter shader leaves more empty scalars which you can fill with other useful work.
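For anyone who wants to convince themselves that the three formulations agree, here is a small CPU-side C++ sketch that evaluates all of them. The Float3 type and the function names are assumptions for illustration; the operation counts in the comments are the ones from the text (17, then 17 - 4 = 13, then 12).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the HLSL float3 type.
struct Float3 { float x, y, z; };

float dot(const Float3& a, const Float3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

float rsq(float v)
{
    assert(v > 0.0f); // inputs are assumed to be non-zero vectors
    return 1.0f / std::sqrt(v);
}

float saturate(float v)
{
    return std::min(std::max(v, 0.0f), 1.0f);
}

// Normalize both vectors, then dot: 3 x DP3 + 2 x MUL + 2 x RSQ = 17 scalar ops.
float diffuseNormalized(const Float3& l, const Float3& n)
{
    const float lr = rsq(dot(l, l));
    const float nr = rsq(dot(n, n));
    const Float3 ln = { l.x * lr, l.y * lr, l.z * lr };
    const Float3 nn = { n.x * nr, n.y * nr, n.z * nr };
    return saturate(dot(ln, nn));
}

// Multiply the two RSQ scalars into the final scalar: 13 scalar ops.
float diffuseScalarized(const Float3& l, const Float3& n)
{
    return saturate(dot(l, n) * rsq(dot(l, l)) * rsq(dot(n, n)));
}

// Use rsqrt(x) * rsqrt(y) = rsqrt(x * y): 12 scalar ops.
float diffuseFused(const Float3& l, const Float3& n)
{
    return saturate(dot(l, n) * rsq(dot(l, l) * dot(n, n)));
}
```

With lightVec = (1, 2, 2) and normal = (0, 3, 4) all three return roughly 14/15, differing only by floating point rounding.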

It should be mentioned that while this gives the best benefit to modern DX10 cards, it was always good to do this kind of scalarization. It often helps older cards too. For instance on the R300-R580 generation it often meant more instructions could fit into the scalar pipe (they were vec3+scalar) instead of utilizing the vector pipe.

[ 1 comments | Last comment by sqrt[-1] (2009-01-31 14:32:40) ]

Custom alpha to coverage
Sunday, January 25, 2009 | Permalink

In DX10.1 you can write a custom sample mask to an SV_Coverage output. This nice little feature hasn't exactly received a lot of media coverage (haha!). Basically it's a uint where every bit tells whether the output will be written to the corresponding sample in the multisample render target. For instance if you set it to 0x3 the output will be written to samples 0 and 1, and the rest of the samples are left unmodified.
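The bit-to-sample mapping can be illustrated with a few lines of CPU-side C++. This is only a sketch of how the mask is interpreted; samplesWritten is a hypothetical helper, not part of any API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the indices of the samples a given coverage mask would write to.
// Bit i of the mask corresponds to sample i of the render target.
std::vector<int> samplesWritten(std::uint32_t coverageMask, int sampleCount)
{
    assert(sampleCount >= 1 && sampleCount <= 32);
    std::vector<int> samples;
    for (int i = 0; i < sampleCount; ++i)
    {
        if (coverageMask & (1u << i))
            samples.push_back(i);
    }
    return samples;
}
```

For a 4x multisampled target, samplesWritten(0x3, 4) yields samples 0 and 1, matching the example above.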

What can you use it for? The most obvious thing is to create a custom alpha-to-coverage. Alpha-to-coverage simply converts the output alpha into a sample mask. If you can provide a better sample mask than the hardware, you'll get better quality. And quite frankly, the hardware implementations of alpha-to-coverage haven't exactly impressed us with their quality. You can often see very obvious and repetitive dither patterns.

So I made a simple test with a pseudo-random value based on screen-space position. The left image is the standard alpha-to-coverage on an HD 3870x2, and on the right my custom alpha-to-coverage.
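The post does not show the shader used for the test, so the following is only one possible way to do it, written as a CPU-side C++ sketch. The hash function, the dithering scheme and all names here are assumptions: alpha is scaled by the sample count, a per-pixel pseudo-random offset decides whether to round up or down, and the resulting bits are rotated so that different pixels pick different samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical integer hash of the screen-space position.
std::uint32_t hashPixel(std::uint32_t x, std::uint32_t y)
{
    std::uint32_t h = (x * 0x9E3779B1u) ^ (y * 0x85EBCA77u);
    h ^= h >> 15;
    h *= 0x2C1B3C6Du;
    h ^= h >> 12;
    return h;
}

// Number of set bits, i.e. how many samples a mask covers.
int countSamples(std::uint32_t mask)
{
    int count = 0;
    for (; mask != 0; mask >>= 1)
        count += static_cast<int>(mask & 1u);
    return count;
}

// Converts alpha into a sample mask of the kind written to SV_Coverage.
std::uint32_t alphaToCoverageMask(float alpha, std::uint32_t x, std::uint32_t y, int sampleCount)
{
    assert(sampleCount >= 1 && sampleCount <= 16);
    const std::uint32_t h = hashPixel(x, y);

    // Per-pixel dither value in [0, 1) taken from the top 16 hash bits.
    const float dither = static_cast<float>(h >> 16) * (1.0f / 65536.0f);

    // Number of samples to cover; the dither picks between rounding down and up.
    const float scaled = std::min(std::max(alpha, 0.0f), 1.0f) * static_cast<float>(sampleCount);
    const int covered = std::min(static_cast<int>(scaled + dither), sampleCount);

    // Set that many low bits, then rotate within sampleCount bits so the
    // chosen samples vary from pixel to pixel.
    const std::uint32_t fullMask = (1u << sampleCount) - 1u;
    const std::uint32_t lowBits = (1u << covered) - 1u;
    const int r = static_cast<int>(h & 0xFFu) % sampleCount;
    return ((lowBits << r) | (lowBits >> (sampleCount - r))) & fullMask;
}
```

With 4 samples, an alpha of 0.3 covers either 1 or 2 samples depending on the pixel, while alpha 0 and alpha 1 always give an empty and a full mask respectively.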

[ 4 comments | Last comment by Dr Black Adder (2011-10-14 01:08:17) ]

Drawbacks of modern technology
Saturday, January 24, 2009 | Permalink

CRT monitors have been out of fashion for a while now, and while they took a lot more desk space than the modern flat screens they had one important advantage: you could put stuff on top of them.

So I just got myself a nice 5.1 system for my computer:

The sound is great and all, but the problem is I have no place to put my center speaker. My monitor has a frame that's less than 5 cm (2") thick, and that's on a 30" monitor. I'd need about double that for the speaker to stand stably. I can hardly be the first one to have this problem, but oddly enough, after spending the whole day going from one electronics shop to another and even going to Ikea and the like, I found no product for mounting anything on top of an LCD screen. Not even Google seems to come up with anything useful. Back in the CRT days there were products for turning the top of a monitor into a shelf, even though often the screen itself was good enough for things like a center speaker. For now I've simply put some screws through the holes and just let them hang in front of the monitor, which should at least keep it from sliding backwards. Not a particularly pretty or safe solution, but at least it's standing there now.

[ 6 comments | Last comment by Humus (2009-01-29 19:25:41) ]

Wednesday, January 14, 2009 | Permalink

When you look at a highly tessellated model it's generally understood that it will be vertex processing heavy. Not quite as widely understood is the fact that increasing polygon count also adds to the fragment shading cost, even if the number of pixels covered on the screen remains the same. This is because fragments are processed in quads. So whenever a polygon edge cuts through a 2x2 pixel area, that quad will be processed twice, once for each of the polygons covering it. If several polygons cut through it, it may be processed multiple times. If the fragment shader is complex, it could easily become the bottleneck instead of the vertex shader.

The rasterizer may also not be able to rasterize very thin triangles very efficiently. Since only pixels that have their pixel centers covered (or any of the sample locations in case of multisampling) are shaded, the quads that need processing may not be adjacent. This will in general cause the rasterizer to require additional cycles. Some rasterizers may also rasterize at fixed patterns, for instance a 4x4 square for a 16 pipe card, which further reduces the performance of thin triangles. In addition you also get overhead because of less optimal memory accesses than if everything would be fully covered and written to at once. Adding multisampling into the mix further adds to the cost of polygon edges.
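The quad overhead is easy to demonstrate with a toy CPU rasterizer. The C++ sketch below is a simplified model, not how any particular GPU works: it tests pixel centers against a triangle, counts the covered pixels, and charges 4 shaded pixels for every 2x2 quad with at least one covered pixel. All names are made up for illustration, and there is no fill rule, so a pixel center lying exactly on a shared edge would be counted for both triangles.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

struct ShadingCost
{
    int coveredPixels; // pixels whose center lies inside the triangle
    int shadedPixels;  // 4 per touched 2x2 quad, uncovered pixels included
};

// Sign tells on which side of the edge a->b the point p lies.
float edge(const Vec2& a, const Vec2& b, const Vec2& p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Accepts both windings: p is inside when all three edge tests agree in sign.
bool covers(const Vec2& a, const Vec2& b, const Vec2& c, const Vec2& p)
{
    const float w0 = edge(a, b, p);
    const float w1 = edge(b, c, p);
    const float w2 = edge(c, a, p);
    return (w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0);
}

// Walks a width x height pixel grid one 2x2 quad at a time.
ShadingCost rasterize(const Vec2& a, const Vec2& b, const Vec2& c, int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    ShadingCost cost = { 0, 0 };
    for (int qy = 0; qy < height; qy += 2)
    {
        for (int qx = 0; qx < width; qx += 2)
        {
            int coveredInQuad = 0;
            for (int i = 0; i < 4; ++i)
            {
                const Vec2 center = { qx + (i & 1) + 0.5f, qy + (i >> 1) + 0.5f };
                if (covers(a, b, c, center))
                    ++coveredInQuad;
            }
            if (coveredInQuad > 0)
            {
                cost.coveredPixels += coveredInQuad;
                cost.shadedPixels += 4; // the whole quad is processed
            }
        }
    }
    return cost;
}
```

Splitting an 8x4 pixel rectangle along its diagonal into two triangles covers 32 pixels in total, but in this model 48 pixels get shaded, because every quad the diagonal passes through is processed once per triangle.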

The other day I was looking at a particularly problematic scene. I noticed that a rounded object in the scene was triangulated pretty much as a fan, which created many long and thin triangles, which was hardly optimal for rasterization. While this wasn't the main problem of the scene it made me think of how bad such a topology could be. So I created a small test case to measure the performance of three different layouts of a circle. I used a non-trivial (but not extreme) fragment shader.

The most intuitive way to triangulate a circle would be to create a fan from the center. It's also a very bad way to do it. Another less intuitive but also very bad way to do it is to create a triangle strip. A good way to triangulate it is to start off with an equilateral triangle in the center and then recursively add new triangles along the edge. I don't know if this scheme has a particular name, but I call it "max area" here as it's a greedy algorithm that in every step adds the triangle that would grab the largest possible area out of the remaining parts on the circle. Intuitively I'd consider this close to optimal in general, but I'm sure there are examples where you could beat such a strategy with another division scheme. In any case, the three contenders look like this:
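For reference, the three layouts can be sketched as index lists over n vertices placed evenly around the circle. This C++ code is an interpretation of the description above rather than the code used for the test: the function names are made up, and the "max area" variant assumes that splitting each arc at its middle vertex is what grabs the largest area in every step.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Tri { int a, b, c; };

// Fan: every triangle shares vertex 0.
std::vector<Tri> fanCircle(int n)
{
    assert(n >= 3);
    std::vector<Tri> tris;
    for (int i = 1; i + 1 < n; ++i)
        tris.push_back({ 0, i, i + 1 });
    return tris;
}

// Strip: zigzag across the circle in the order 0, 1, n-1, 2, n-2, ...
std::vector<Tri> stripCircle(int n)
{
    assert(n >= 3);
    std::vector<int> order;
    order.push_back(0);
    for (int lo = 1, hi = n - 1; lo <= hi; )
    {
        order.push_back(lo++);
        if (lo <= hi)
            order.push_back(hi--);
    }
    std::vector<Tri> tris;
    for (int i = 0; i + 2 < n; ++i)
        tris.push_back({ order[i], order[i + 1], order[i + 2] });
    return tris;
}

// Adds the triangle over the chord first..last, then recurses into both halves.
void splitArc(int first, int last, int n, std::vector<Tri>& out)
{
    if (last - first < 2)
        return;
    const int mid = first + (last - first) / 2;
    out.push_back({ first % n, mid % n, last % n });
    splitArc(first, mid, n, out);
    splitArc(mid, last, n, out);
}

// Max area: a (near) equilateral center triangle, then recursive arc splits.
std::vector<Tri> maxAreaCircle(int n)
{
    assert(n >= 3);
    const int i1 = n / 3;
    const int i2 = 2 * n / 3;
    std::vector<Tri> tris;
    tris.push_back({ 0, i1, i2 });
    splitArc(0, i1, n, tris);
    splitArc(i1, i2, n, tris);
    splitArc(i2, n, n, tris);
    return tris;
}

// Sum of triangle areas on the unit circle, used as a sanity check.
double totalArea(const std::vector<Tri>& tris, int n)
{
    const double step = 2.0 * 3.14159265358979323846 / n;
    double area = 0.0;
    for (const Tri& t : tris)
    {
        const double ax = std::cos(t.a * step), ay = std::sin(t.a * step);
        const double bx = std::cos(t.b * step), by = std::sin(t.b * step);
        const double cx = std::cos(t.c * step), cy = std::sin(t.c * step);
        area += 0.5 * std::fabs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax));
    }
    return area;
}
```

All three produce n - 2 triangles covering the same area; only the shape of the triangles differs, and that is what the rasterizer cares about.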

And their performance looks like this. The number along the x-axis is the vertex count around the circle and the y-axis is frames per second.

Adding multisampling into the mix further adds to the burden with the first two methods, while the max area division is still mostly unaffected by the added polygons all the way across the chart.

[ 20 comments | Last comment by Hobgoblin (2009-08-08 04:53:00) ]

Sunday, January 11, 2009 | Permalink

So I left a comment over at Wolfgang Engel's place. And this is what the site's anti-bot system asks me to type:

If you're not a programmer you may not see what's so cool about it, but out of the set of random characters I found it a bit amazing that I got a C++ keyword.

Speaking of const, it's one of my favorite keywords. However, I often see code where it's not used at all, or not nearly as much as it should be. I tend to use it as often as possible and I'd encourage everyone to do the same. It's good for you.
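As a small illustration of what using const "as often as possible" can look like, here is a C++ sketch with a const member function, a const reference parameter and const locals. The Vec3 type and the function are made up for the example.

```cpp
#include <cassert>
#include <vector>

struct Vec3
{
    float x, y, z;

    // const member function: callable on const objects, cannot modify members.
    float lengthSquared() const
    {
        return x * x + y * y + z * z;
    }
};

// const reference parameter: no copy is made and the caller's data stays untouched.
float longestLengthSquared(const std::vector<Vec3>& points)
{
    assert(!points.empty());
    float longest = 0.0f;
    for (const Vec3& p : points)
    {
        const float lengthSq = p.lengthSquared(); // never reassigned, so const
        if (lengthSq > longest)
            longest = lengthSq;
    }
    return longest;
}
```

Any attempt to modify points or p inside the function is rejected by the compiler, which is exactly the kind of mistake const is good at catching early.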

[ 1 comments | Last comment by yosh64 (2009-01-14 08:55:29) ]

Wednesday, January 7, 2009 | Permalink

Over the holidays I began some work on a new framework. I'm building it around DX11; however, I'm still making it reasonably platform independent and I plan to put OpenGL 3.0 and Linux support in there at some point too. So far I haven't done much, but at least I have a window and DX11 device up and running. Yay!

There are a few changes I'm making to the overall structure of the code. One is of course to separate the rendering context from the device in order to be able to take advantage of DX11 deferred contexts, AKA command buffers. Another change I'm making is to try to resolve more stuff at link time rather than at runtime. In previous frameworks I've had base classes with virtual functions that subclasses implemented. For instance D3D10Renderer and OpenGLRenderer inheriting from Renderer. However, since I'm always using one or the other and never both in the same app, it adds unnecessary overhead. Not that this overhead was ever a problem, but it just feels better to do it right. So instead I'll create a common interface, and just add the right implementation to each project.
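A rough single-file C++ sketch of the link-time idea follows. It is not the framework's actual code: the member functions, the file names in the comments and the USE_OPENGL_BACKEND macro are all hypothetical, and the preprocessor switch only stands in for the choice of which implementation file gets added to a project.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Renderer.h (hypothetical): the common interface, shared by every project.
// Note that there are no virtual functions and no subclasses.
class Renderer
{
public:
    void setViewport(int width, int height);
    int viewportWidth() const { return m_width; }
    std::string backendName() const;

private:
    int m_width = 0;
    int m_height = 0;
};

// Backend-independent code can live in a shared source file.
void Renderer::setViewport(int width, int height)
{
    assert(width > 0 && height > 0);
    m_width = width;
    m_height = height;
}

// Exactly one of the following would be compiled into a given project.
#if defined(USE_OPENGL_BACKEND)
// OpenGLRenderer.cpp (hypothetical)
std::string Renderer::backendName() const { return "OpenGL"; }
#else
// D3D11Renderer.cpp (hypothetical)
std::string Renderer::backendName() const { return "D3D11"; }
#endif

// The class has no vtable, so calls are resolved by the linker, not at runtime.
static_assert(!std::is_polymorphic<Renderer>::value, "no virtual dispatch");
```

Application code just uses Renderer directly; which backend it gets depends on which implementation was linked in.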

Another change is that I'll use a better coding style. As a self-taught coder I've been using many odd coding practices throughout the years. Slowly I've adapted to common industry conventions, but Framework3 was still based on a lot of old coding style preferences which I never bothered to change, because I'd rather be consistent than mix conventions. With the clean break with Framework4 I'll now be using "m_" on member variables and "{" will be on its own line etc.

[ 11 comments | Last comment by acid (2012-12-18 15:36:19) ]

More site updates
Monday, January 5, 2009 | Permalink

Another thing I recently noticed is that I've been sort of anonymous for I don't know how long. I know I stated my name and occupation on this site in the past and I never really intended to hide who I am. I suppose when I rewrote the whole website a couple of years ago those bits got lost. In any case I've fixed that now and changed the "Contact" page into an "About Humus" page, which also holds the contact info in addition to a few words about who I am.


RSS feed
Monday, January 5, 2009 | Permalink

By popular request I've finally added an RSS feed to this website. For some reason Firefox refused to load my permalinks, but using start=X like on the front page worked fine. Not sure what's up with that, but I suppose start=X links are OK since the feed should always be up to date with the front page anyway. I've tested it in Firefox only though, so I don't know how it'll work with any other RSS readers. Let me know if there are any issues with it.

[ 4 comments | Last comment by Humus (2009-01-15 21:00:56) ]
