DPPS (or why don't they ever get SSE right?)
Monday, March 16, 2009
In my work on creating a new framework I've come to my vector class, and I decided to make use of SSE. I figure SSE3 is mainstream now, so that's what I'm going to use as the baseline, with optional SSE4 support in case I ever need the extra performance, enabled with a USE_SSE4 #define.
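As a rough illustration of that setup, here is a minimal sketch of a four-component dot product that picks its code path at compile time. Only USE_SSE4 comes from the post; the helper name dot4 is hypothetical, and the __SSE4_1__ and __SSE3__ checks assume a GCC- or Clang-style compiler that defines those macros when the matching -msse4.1 or -msse3 flag is given. Without either, the sketch falls back to plain scalar code.

```c
#if defined(USE_SSE4) && defined(__SSE4_1__)
  #include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */
#elif defined(__SSE3__)
  #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */
#endif

/* Hypothetical helper: dot product of two float[4] arrays. */
static float dot4(const float a[4], const float b[4])
{
#if defined(USE_SSE4) && defined(__SSE4_1__)
    /* 0xF1: multiply all four lanes, write the sum to lane 0. */
    __m128 r = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(r);
#elif defined(__SSE3__)
    /* SSE3 baseline: one multiply followed by two horizontal adds. */
    __m128 m = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    m = _mm_hadd_ps(m, m);   /* (x+y, z+w, x+y, z+w) */
    m = _mm_hadd_ps(m, m);   /* every lane now holds x+y+z+w */
    return _mm_cvtss_f32(m);
#else
    /* Scalar fallback so the sketch builds without any SSE flags. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

All three paths compute the same value, so the #define only changes which instructions the compiler emits.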
Now, SSE is an instruction set that was large to begin with and has grown a lot with every new revision:
SSE: 70 instructions
SSE2: 144 instructions
SSE3: 13 instructions
SSSE3: 32 instructions
SSE4: 54 instructions
SSE4a: 4 instructions
SSE5: 170 instructions (not in any CPUs on the market yet)
Why all these instructions? Well, perhaps because they can't seem to get things right from the start, so new instructions are needed to overcome old limitations. There are loads of very specialized instructions, while arguably very generic and useful instructions have long been missing. A dot product instruction should've been in the first SSE revision, or at the very least a horizontal add. We finally got that in SSE3. Yay! Only 6 years after 3DNow had that feature. As the name suggests, 3DNow was very useful for anything related to 3D math already in its first revision, despite its limited instruction set of only 21 instructions (although to be fair, it shared registers with MMX and thus didn't need new instructions for things that MMX instructions could already do).
So why this rant? Well, DPPS is an instruction that would at first make you think Intel finally got something really right about SSE. Maybe they had listened to a game developer for once. We finally have a dot product instruction. Yay! To their credit, it's more flexible than I ever expected such an instruction to be. But it disturbs me that instead of making it perfect they had to screw up one detail, which drastically reduces the usefulness of this instruction. The instruction comes with an immediate 8-bit operand, which is a 4-bit read mask and a 4-bit write mask. The read mask is done right. It selects what components to use in the dot product, so you can easily make a three or two component dot product, or even use XZ, for instance, for computing distance in the XZ plane. The write mask, on the other hand, is not really a write mask. Instead of simply selecting what components you want to write the result to, you select what components get the result, and the rest are set to zero. Why oh why, Intel? Why would I want to write zero to the remaining components? Couldn't you have let me preserve the values in the register instead? If I wanted them as zero I could have first cleared the register and then done the DPPS. Had the DPPS instruction had a real write mask, we could've implemented a matrix-vector multiplication in four instructions. Now I have to write to different registers and then merge them with OR operations, which in addition to wasting precious registers also adds up to 7 instructions in total instead of 4, which ironically is the same number of instructions you needed in SSE3 to do the same thing. Aaargghh!!!!
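To make the mask behaviour concrete, here is a small scalar model of DPPS in C, followed by the seven-operation matrix-vector multiply described above (four dot products merged with three ORs). It is only a sketch of the semantics, not real SSE code: the type vec4 and the functions dpps_model, orps_model and mat_vec_model are hypothetical names made up for this illustration, and the matrix is assumed to be stored as four row vectors.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float f[4]; } vec4;   /* f[0] = x, f[1] = y, f[2] = z, f[3] = w */

/* Scalar model of DPPS: the high nibble of imm8 is the read mask,
   the low nibble selects which lanes receive the sum. */
static vec4 dpps_model(vec4 dst, vec4 src, unsigned imm8)
{
    float p[4];
    vec4 out;
    for (int i = 0; i < 4; ++i)        /* read mask: unselected products count as 0 */
        p[i] = (imm8 & (0x10u << i)) ? dst.f[i] * src.f[i] : 0.0f;
    float sum = (p[0] + p[1]) + (p[2] + p[3]);
    for (int i = 0; i < 4; ++i)        /* unselected lanes are zeroed, not preserved */
        out.f[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    return out;
}

/* Bitwise OR of two vectors, modelling ORPS. */
static vec4 orps_model(vec4 a, vec4 b)
{
    vec4 out;
    for (int i = 0; i < 4; ++i) {
        uint32_t ua, ub;
        memcpy(&ua, &a.f[i], sizeof ua);
        memcpy(&ub, &b.f[i], sizeof ub);
        ua |= ub;
        memcpy(&out.f[i], &ua, sizeof ua);
    }
    return out;
}

/* Matrix-vector multiply: 4 dot products into separate temporaries + 3 ORs. */
static vec4 mat_vec_model(const vec4 rows[4], vec4 v)
{
    vec4 x = dpps_model(v, rows[0], 0xF1);   /* sum lands in x, rest zero */
    vec4 y = dpps_model(v, rows[1], 0xF2);   /* sum lands in y, rest zero */
    vec4 z = dpps_model(v, rows[2], 0xF4);   /* sum lands in z, rest zero */
    vec4 w = dpps_model(v, rows[3], 0xF8);   /* sum lands in w, rest zero */
    return orps_model(orps_model(x, y), orps_model(z, w));
}
```

In this model a mask of 0x51 reads only x and z and writes the sum to x, which is the XZ-plane case mentioned above. The OR merge works because every lane a DPPS did not select is zero.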
Tuesday, April 7, 2009
SSE instructions take two operands, so a write mask would be useless anyway; you would need a separate destination register for it. However, the bigger reason for this design is that long dependency chains are a problem.
If we had a three-operand DPPS with a write mask, the matrix-vector multiply code could look like this:
xmm0 - vector
xmm4-7 - matrix
xmm1 - result
dpps xmm1.x, xmm0, xmm4 // 0 (11)
dpps xmm1.y, xmm0, xmm5 // 11 (11)
dpps xmm1.z, xmm0, xmm6 // 22 (11)
dpps xmm1.w, xmm0, xmm7 // 33 (11)
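The numbers on the right are the starting cycle and, in parentheses, the latency.
DPPS has an 11-cycle latency on Core2, and each of these DPs depends on the previous one (a write mask that preserves the other components makes xmm1 an input as well as the output), so the whole routine takes 4*11 = 44 cycles.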
Now the same with the real code:
movaps xmm1, xmm0 // 0 (1)
movaps xmm2, xmm0 // 0 (1)
movaps xmm3, xmm0 // 0 (1)
dpps xmm0, xmm4, 0xF1 // 1 (11) -> x
dpps xmm1, xmm5, 0xF2 // 4 (11) -> y
dpps xmm2, xmm6, 0xF4 // 7 (11) -> z
dpps xmm3, xmm7, 0xF8 // 10 (11) -> w
orps xmm0, xmm1 // 15 (1)
orps xmm2, xmm3 // 21 (1)
orps xmm0, xmm2 // 22 (1), result ends up in xmm0
The DPs are independent here and can be issued every 3 cycles. The code takes 23 cycles to execute even though it is 6 instructions longer.
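The two cycle counts can be reproduced with a few lines of arithmetic. The sketch below is a toy model, not a measurement: it assumes the figures quoted in this comment (11-cycle DPPS latency, one DPPS issued every 3 cycles, 1-cycle MOVAPS and ORPS), and the function names are hypothetical.

```c
enum {
    DPPS_LATENCY   = 11,  /* cycles until a DPPS result is ready */
    DPPS_INTERVAL  = 3,   /* cycles between issuing independent DPPS */
    MOVAPS_LATENCY = 1,
    ORPS_LATENCY   = 1
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Hypothetical three-operand DPPS with a preserving write mask:
   every DPPS waits for the one before it. */
static int chained_cycles(void)
{
    int ready = 0;
    for (int i = 0; i < 4; ++i)
        ready += DPPS_LATENCY;           /* starts at 0, 11, 22, 33 */
    return ready;
}

/* Real DPPS: four independent dot products merged with three ORPS. */
static int independent_cycles(void)
{
    int copies_done = MOVAPS_LATENCY;    /* the three movaps all start at cycle 0 */
    int dp_done[4];
    for (int i = 0; i < 4; ++i) {
        int start = copies_done + i * DPPS_INTERVAL;   /* 1, 4, 7, 10 */
        dp_done[i] = start + DPPS_LATENCY;             /* 12, 15, 18, 21 */
    }
    int or01 = max_int(dp_done[0], dp_done[1]) + ORPS_LATENCY;   /* starts at 15 */
    int or23 = max_int(dp_done[2], dp_done[3]) + ORPS_LATENCY;   /* starts at 21 */
    return max_int(or01, or23) + ORPS_LATENCY;                   /* starts at 22 */
}
```

Under those assumptions chained_cycles() gives 44 and independent_cycles() gives 23, matching the annotated listings.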
None of the above has actually been tested; it is based on the Core2 cycle tables from http://www.agner.org/optimize/