Opened 13 years ago

Closed 13 years ago

Last modified 8 years ago

#905 closed enhancement (fixed)

[PATCH] Improvement to bones' skinning

Reported by: gruby Owned by:
Priority: Nice to Have Milestone: Alpha 8
Component: Core engine Keywords: patch, graphics, performance
Cc: Patch:

Description

Bones' skinning can be improved by SSE for example.

Attachments (1)

905.patch (6.1 KB ) - added by gruby 13 years ago.
Bones' skinning with SSE (it modifies few files)

Download all attachments as: .zip

Change History (15)

comment:1 by gruby, 13 years ago

Keywords: review graphics performance added
Summary: Improvement to bones' skinning[PATCH] Improvement to bones' skinning

I have added alternative version of SkinPointsAndNormals which uses SSE. On my computer speed-up was approx. 140% (it is good to test it when there is a lot of units, like in We Are Legion Demo Map)

by gruby, 13 years ago

Attachment: 905.patch added

Bones' skinning with SSE (it modifies few files)

comment:2 by historic_bruno, 13 years ago

Milestone: BacklogAlpha 7

comment:3 by Philip Taylor, 13 years ago

How were you measuring the performance effect, and in what environment?

I tried the patch on Linux with GCC 4.4.5, x86_64, Pentium Dual Core 2.16GHz, in Release mode, on the "Combat demo (huge)" map (which is easier to measure since the units don't walk around), using the in-game profiler to get the "skinning bones" value:

Number of calls: 655/frame

With non-SSE function: 8.5msec/frame

With SSE function: 8.0msec/frame

which is about 6% improvement, which is very different from 140%. The "We are Legion" map seems to give a similar change.

comment:4 by gruby, 13 years ago

I've just tried it on Visual Studio 2008 sp1, AMD Phenom II X4 955 3.20GHz, in Release mode too, on the "Combat demo (huge)" map.

Number of calls: 760/frame With non-SSE function: 9.7msec/frame With SSE function: 6.9msec/frame

which is about 40% improvement. That are my tests.

comment:5 by Philip Taylor, 13 years ago

Hmm, is there any change if VS is configured to use SSE automatically? ("/arch:SSE" in "Enable Enhanced Instruction Set" in "Code Generation" in the project properties)

comment:6 by fabio, 13 years ago

Could it be also enabled on 32 bit?

comment:7 by Philip Taylor, 13 years ago

It is already - ARCH_X86_X64 means "x86(/ia32) or x86_64(/amd64)" (slightly confusingly). (Assuming that's what you were referring to.)

comment:8 by fabio, 13 years ago

Yes, thanks for the clarification :)

comment:9 by Kieran P, 13 years ago

Keywords: patch added

comment:10 by Kieran P, 13 years ago

Milestone: Alpha 7Alpha 8

in reply to:  5 ; comment:11 by Philip Taylor, 13 years ago

Replying to Philip:

Hmm, is there any change if VS is configured to use SSE automatically?

I finally (after far too long, sorry :-( ) got round to testing this, to be sure the patch makes a worthwhile difference, and it does.

On Win7 and VS2010 (32-bit game build) and Core i5-2500K, counting msecs/frame for "skinning bones" on "Combat demo (huge)":

  • Original code: 5.2 msec
  • Original plus "/arch:SSE2": 7.2 msec (!)
  • Original plus "/arch:SSE": 5.2 msec
  • Original plus "/fp:fast": 5.3 msec
  • Original plus "/fp:fast /arch:SSE2": 6.1 msec
  • Original plus "/fp:fast /arch:SSE": 4.8 msec
  • With patch: 2.7 msec
  • With patch plus optimisations: 1.8 msec

So the compiler flags are pretty random and don't help much anyway, and the patch seems to work well.

I added a few more optimisations that seem to help without adding much more complexity:

  • got rid of the uses of Position[j] etc (the inlined function call to VertexArrayIterator::operator[] is expensive, seemingly because aliasing forces lots of memory reads)
  • rearranged the vertex arrays so it can _mm_store_ps directly into the arrays, instead of going through tmpVector
  • made the input and output data be 16-byte aligned, to avoid some unaligned loads/stores

comment:12 by philip, 13 years ago

Resolution: fixed
Status: newclosed

(In [10499]) Optimise vertex skinning code with SSE, based on patch by gruby. Fixes #905.

in reply to:  11 comment:13 by Philip Taylor, 13 years ago

Replying to Philip:

made the input and output data be 16-byte aligned, to avoid some unaligned loads/stores

To justify this: I tested changing the _mm_{load,store}_ps into _mm_{loadu,storeu}_ps after the committed patch (i.e. the data is actually 16-byte aligned in both cases).

On the Core i5-2500K (VS2010 32-bit etc), it makes no difference.

On the Pentium Dual Core (GCC 64-bit etc), it increases the skinning time in the previous scenario from 7.1ms to 7.5ms.

That makes sense since Nehalem and Sandy Bridge apparently handle unaligned SSE about as fast as aligned, so it only makes a difference on older CPUs.

comment:14 by sanderd17, 8 years ago

Keywords: review removed
Note: See TracTickets for help on using tickets.