"What's important is not to win, but to beat Finland."
- Swedish proverb
Rohit Garg
Thursday, March 19, 2009

Since when has Intel designed good ISAs?

Humus
Wednesday, March 18, 2009

I'll probably do two different classes, one "real" float3, and one that's more like float3_as_float4.
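A minimal sketch of what such a pair of classes might look like, assuming an x86 target with SSE. The member names, the `to_float3` and `lane` helpers, and the choice to zero the padding lane are illustrative assumptions, not the actual implementation being described.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// "Real" float3: exactly three floats (12 bytes), for storage-sensitive data.
struct float3 {
    float x, y, z;
};

// float3 kept in a full 128-bit SSE register; the fourth lane is padding.
// Hypothetical layout, for illustration only.
struct float3_as_float4 {
    __m128 v;

    // _mm_set_ps takes lanes from highest to lowest, so x lands in lane 0.
    float3_as_float4(float x, float y, float z) : v(_mm_set_ps(0.0f, z, y, x)) {}
    explicit float3_as_float4(__m128 m) : v(m) {}
    explicit float3_as_float4(const float3 &f) : v(_mm_set_ps(0.0f, f.z, f.y, f.x)) {}

    // Read one of the three meaningful lanes (0 = x, 1 = y, 2 = z).
    float lane(int i) const {
        assert(i >= 0 && i < 3);
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, v);
        return tmp[i];
    }

    // Copy the three meaningful lanes back into the compact type.
    float3 to_float3() const {
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, v);
        return float3{tmp[0], tmp[1], tmp[2]};
    }
};

// Component-wise add: one SSE instruction for all lanes.
inline float3_as_float4 operator+(float3_as_float4 a, float3_as_float4 b) {
    return float3_as_float4(_mm_add_ps(a.v, b.v));
}

// Scale by a scalar broadcast to all lanes.
inline float3_as_float4 operator*(float3_as_float4 a, float s) {
    return float3_as_float4(_mm_mul_ps(a.v, _mm_set1_ps(s)));
}

static_assert(sizeof(float3) == 12, "compact type is three floats");
static_assert(sizeof(float3_as_float4) == 16, "SIMD type fills one SSE register");
```

The trade-off in this sketch is the one raised in the thread: the padded type spends four extra bytes per vector to get aligned, register-friendly math, while the compact type stays at 12 bytes for arrays where storage matters.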

Groovounet
Wednesday, March 18, 2009

How do you expect to manage vec3? When storage space matters, it's quite unlikely you'd use vec4 instead of vec3...

Humus
Tuesday, March 17, 2009

I haven't run any benchmarks so far, but I of course intend to verify that there's a performance benefit. From my past attempts with SSE I know it can certainly be worth it. So far I've only verified that my code works and that the compiler generates reasonable code, and from what I can see it does pretty much what I expect. I was initially afraid there would be overhead from unnecessary loads and stores, but from what I can tell it in fact generates very good code. You do have to enable pretty much every optimization, though, especially inlining of small functions, link-time code generation, and using SSE instead of the FPU for general floating-point math.

I'm of course using intrinsics rather than assembly. Assembly is only an option if you have quite a long sequence of instructions. Besides, inline assembly is no longer supported for 64-bit code, so I don't want to use it unless I have to. It's also nice to see that GCC supports the same intrinsics, so my code worked on Linux with very minor changes.

Groovounet
Tuesday, March 17, 2009

An instruction called rdtsc lets you count the number of cycles taken by a sequence of instructions. It needs to be used carefully, but it works quite well.
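As a rough illustration of timing with rdtsc, here is a minimal sketch using the `__rdtsc()` compiler intrinsic, assuming an x86/x86-64 target. The helper name `min_tsc_ticks` and the idea of keeping the minimum over several runs are assumptions made for this example. Some of the care needed: rdtsc is not a serializing instruction, so surrounding instructions can be reordered around it, and on many newer CPUs the counter ticks at a fixed reference rate rather than the actual core clock.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>     // __rdtsc on MSVC
#else
#include <x86intrin.h>  // __rdtsc on GCC/Clang
#endif

// Hypothetical helper: runs `fn` `runs` times and returns the smallest
// timestamp-counter delta observed. Taking the minimum filters out runs
// disturbed by interrupts or cold caches.
template <typename Fn>
std::uint64_t min_tsc_ticks(Fn fn, int runs) {
    assert(runs > 0);
    std::uint64_t best = ~std::uint64_t(0);
    for (int i = 0; i < runs; ++i) {
        const std::uint64_t start = __rdtsc();
        fn();
        const std::uint64_t stop = __rdtsc();
        const std::uint64_t delta = stop - start;
        if (delta < best) best = delta;
    }
    return best;
}
```

The measured code should do enough work, and write its result somewhere the compiler cannot discard, for the reading to mean anything; very short sequences are dominated by the overhead of rdtsc itself.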

A compiler can vectorize code; that's what Visual C++ does, and it's quite efficient, but in some cases far from what a human can do. If C++ code isn't written with SIMD in mind, the compiler won't get much out of it because it will be hard to vectorize.

The scenario of vector and matrix classes is tricky. For some operations, like the matrix product or the vector-matrix product, you can gain a lot (60 cycles and 36 cycles on a Q6600), but when it comes to basic things like initialization or addition... the compiler does well!

It's a mistake to write everything in asm. It's fun, but it breaks the compiler's ability to optimize the code, especially cross-function optimizations.

It's all about testing.

Finally, don't bother too much! A co-worker and I had some fun optimizing the same code. We ended up finding that the code he wrote was faster on his computer (Athlon X2 6000+) and the code I wrote was faster on my Q6600. Now with the Core i7 I'm sure we could rewrite the code to reach better performance.

From CPU to CPU the number of cycles for each instruction changes, so what you took for an optimization could turn out slower on other CPUs because of variations in instruction cycle counts. Even the order of instructions affects efficiency because of instruction latencies.

Groovounet
Tuesday, March 17, 2009

What's actually worse is that on some CPUs (Q6600 and co., if I remember correctly) the number of cycles needed for dpps is so high that you'd rather use the SSE2 instructions. I think it's fixed now with the Q9300 and co., but I never tested it.

I think it was much the same with the horizontal add on the P3: not efficient enough at introduction.

I wish Intel were interested in 3D; then they would know that the dot product is a common operation.
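For reference, a three-component dot product built only from the older multiply, shuffle and add instructions might look like the sketch below. The function names and the self-check are assumptions for illustration, and no claim is made about how its cycle count compares with the dedicated dot-product instruction on any particular CPU.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Dot product of the x, y, z lanes of two registers; the w lanes are ignored.
inline float dot3_sse(__m128 a, __m128 b) {
    __m128 m = _mm_mul_ps(a, b);                               // (ax*bx, ay*by, az*bz, aw*bw)
    __m128 y = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast lane 1
    __m128 z = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2));  // broadcast lane 2
    __m128 sum = _mm_add_ss(_mm_add_ss(m, y), z);              // lane 0 = x + y + z products
    return _mm_cvtss_f32(sum);                                 // extract lane 0
}

// Plain scalar version, used only as a reference for the check below.
inline float dot3_scalar(const float a[3], const float b[3]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Small self-check on values that are exact in floating point.
inline void check_dot3() {
    const float a[3] = {1.0f, 2.0f, 3.0f};
    const float b[3] = {4.0f, 5.0f, 6.0f};
    // The w lanes hold junk (100) to show that they do not affect the result.
    __m128 va = _mm_set_ps(100.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(100.0f, b[2], b[1], b[0]);
    assert(dot3_sse(va, vb) == dot3_scalar(a, b));
}
```

On CPUs with SSE4.1, the same result can be requested from a single instruction via `_mm_dp_ps(a, b, 0x71)` from `<smmintrin.h>`; which variant is faster depends on the CPU, so it has to be measured.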

Greg
Tuesday, March 17, 2009

How much speed does SSE bring to a Vector3f? And can't a compiler vectorize the code itself?

And also, how do you effectively measure the performance gain?

Aras Pranckevicius
Tuesday, March 17, 2009

We had a very similar thought at work the other day. If you take a look at PS2 VUs, or PPC AltiVec, or ARM VFP, they are all reasonable. And then you have SSE with its umpteen revisions, and it still hasn't got things right.
