#905 closed enhancement (fixed)
[PATCH] Improvement to bones' skinning
Reported by: | gruby | Owned by: | |
---|---|---|---|
Priority: | Nice to Have | Milestone: | Alpha 8 |
Component: | Core engine | Keywords: | patch, graphics, performance |
Cc: | Patch: |
Description
Bones' skinning can be improved by SSE for example.
Attachments (1)
Change History (15)
comment:1 by , 13 years ago
Keywords: | review graphics performance added |
---|---|
Summary: | Improvement to bones' skinning → [PATCH] Improvement to bones' skinning |
comment:2 by , 13 years ago
Milestone: | Backlog → Alpha 7 |
---|
comment:3 by , 13 years ago
How were you measuring the performance effect, and in what environment?
I tried the patch on Linux with GCC 4.4.5, x86_64, Pentium Dual Core 2.16GHz, in Release mode, on the "Combat demo (huge)" map (which is easier to measure since the units don't walk around), using the in-game profiler to get the "skinning bones" value:
Number of calls: 655/frame
With non-SSE function: 8.5msec/frame
With SSE function: 8.0msec/frame
which is about 6% improvement, which is very different from 140%. The "We are Legion" map seems to give a similar change.
comment:4 by , 13 years ago
I've just tried it on Visual Studio 2008 sp1, AMD Phenom II X4 955 3.20GHz, in Release mode too, on the "Combat demo (huge)" map.
Number of calls: 760/frame With non-SSE function: 9.7msec/frame With SSE function: 6.9msec/frame
which is about 40% improvement. That are my tests.
follow-up: 11 comment:5 by , 13 years ago
Hmm, is there any change if VS is configured to use SSE automatically? ("/arch:SSE" in "Enable Enhanced Instruction Set" in "Code Generation" in the project properties)
comment:7 by , 13 years ago
It is already - ARCH_X86_X64
means "x86(/ia32) or x86_64(/amd64)" (slightly confusingly). (Assuming that's what you were referring to.)
comment:9 by , 13 years ago
Keywords: | patch added |
---|
comment:10 by , 13 years ago
Milestone: | Alpha 7 → Alpha 8 |
---|
follow-up: 13 comment:11 by , 13 years ago
Replying to Philip:
Hmm, is there any change if VS is configured to use SSE automatically?
I finally (after far too long, sorry :-( ) got round to testing this, to be sure the patch makes a worthwhile difference, and it does.
On Win7 and VS2010 (32-bit game build) and Core i5-2500K, counting msecs/frame for "skinning bones" on "Combat demo (huge)":
- Original code: 5.2 msec
- Original plus "/arch:SSE2": 7.2 msec (!)
- Original plus "/arch:SSE": 5.2 msec
- Original plus "/fp:fast": 5.3 msec
- Original plus "/fp:fast /arch:SSE2": 6.1 msec
- Original plus "/fp:fast /arch:SSE": 4.8 msec
- With patch: 2.7 msec
- With patch plus optimisations: 1.8 msec
So the compiler flags are pretty random and don't help much anyway, and the patch seems to work well.
I added a few more optimisations that seem to help without adding much more complexity:
- got rid of the uses of
Position[j]
etc (the inlined function call toVertexArrayIterator::operator[]
is expensive, seemingly because aliasing forces lots of memory reads) - rearranged the vertex arrays so it can
_mm_store_ps
directly into the arrays, instead of going throughtmpVector
- made the input and output data be 16-byte aligned, to avoid some unaligned loads/stores
comment:12 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:13 by , 13 years ago
Replying to Philip:
made the input and output data be 16-byte aligned, to avoid some unaligned loads/stores
To justify this: I tested changing the _mm_{load,store}_ps
into _mm_{loadu,storeu}_ps
after the committed patch (i.e. the data is actually 16-byte aligned in both cases).
On the Core i5-2500K (VS2010 32-bit etc), it makes no difference.
On the Pentium Dual Core (GCC 64-bit etc), it increases the skinning time in the previous scenario from 7.1ms to 7.5ms.
That makes sense since Nehalem and Sandy Bridge apparently handle unaligned SSE about as fast as aligned, so it only makes a difference on older CPUs.
comment:14 by , 8 years ago
Keywords: | review removed |
---|
I have added alternative version of SkinPointsAndNormals which uses SSE. On my computer speed-up was approx. 140% (it is good to test it when there is a lot of units, like in We Are Legion Demo Map)