- Admin
- #1
I have started examining an implementation of Quake's InvSqrt() function that I have again. I started where I left off the last time and I found some strange things that Visual Studio 2008 Professional does.
I have* the following InvSqrt() function:
float InvSqrt ( float x )
{
*** union bitConverter {
*** *** int intPart
*** *** float floatPart
*** } bitC
*** /* Do Taylor Series Expansion */
*** bitC.floatPart = x
*** bitC.intPart = 0x5f375a86 - (bitC.intPart >> 1)
*** /* Do First Newton Approximation */
*** bitC.floatPart = ( 3.0f - x * bitC.floatPart * bitC.floatPart ) * 0.5 * bitC.floatPart
*** /* Do Second Newton Approximation and Return */
*** return ( 3.0f - x * bitC.floatPart * bitC.floatPart ) * 0.5 * bitC.floatPart
}
This is part of a small program that compares its output to the output from the native functions. It calls InvSqrt() in the context of this code:
y = InvSqrt(x)
z = 1.0l / sqrt(x)
Here is what Visual Studio 2008 Professional is generating under release mode with the default flags. I have annotated it with comments clarifying what it does:
00401035* fld******** qword ptr [__real@4008000000000000 (4021A8h)] // This is 3.0f
*** {
*** ***
*** *** y = InvSqrt(x)
0040103B* mov******** edx,5F375A86h
00401040* fld******** qword ptr [__real@3fe0000000000000 (4021A0h)] // This is 0.5f
00401046* fld******** dword ptr [esp+34h] //This is x
0040104A* fst******** dword ptr [esp+30h]
0040104E* mov******** ecx,dword ptr [esp+30h]
00401052* sar******** ecx,1
00401054* sub******** edx,ecx
00401056* mov******** dword ptr [esp+30h],edx
0040105A* fld******** dword ptr [esp+30h]* // This is bitC.floatpart
0040105E* fld******** st(0)* // This is bitC.floatpart
00401060* fmul******* st,st(2) // bitC.floatpart * x
00401062* fmul******* st,st(1) // (bitC.floatpart * x) * bitC.floatpart
00401064* fsubr****** st,st(4) // 3.0f - (bitC.floatpart * x) * bitC.floatpart
00401066* fmul******* st,st(3) // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5
00401068* fmulp****** st(1),st // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5 * bitC.floatpart
0040106A* fstp******* dword ptr [esp+30h] // Send it back to bitC.floatpart
0040106E* fld******** dword ptr [esp+30h] // To bring it back from bitC.floatpart???
00401072* fld******** st(0) // This is bitC.floatpart
00401074* fmul******* st,st(2) // bitC.floatpart * x
00401076* fmul******* st,st(1) // (bitC.floatpart * x) * bitC.floatpart
00401078* fsubp****** st(4),st // 3.0f - (bitC.floatpart * x) * bitC.floatpart
0040107A* fxch******* st(3) // Lets shuffle the registers
0040107C* fmulp****** st(2),st // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5
0040107E* fxch******* st(1) // Lets shuffle the registers
00401080* fmulp****** st(2),st // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5 * bitC.floatpart
00401082* fxch******* st(1) // Lets shuffle the registers
00401084* fstp******* dword ptr [esp+38h] // Send back the answer
Before I continue, I would like to point out what I see wrong here. The first is the following:
0040106A* fstp******* dword ptr [esp+30h] // Send it back to bitC.floatpart
0040106E* fld******** dword ptr [esp+30h] // To bring it back from bitC.floatpart???
It makes no sense to pop a value back to system memory only to push it back onto the stack. I examined my source code and apparently this is Visual Studio's intrepretation of "bitC.floatPart =" even though it is absolutely not necessary.
The best case scenerio is that we have a 6 cycle stall while the value goes to the L1 cache and comes back from the L1 cache. The worst case scenerio is that we have a ~180 cycle stall while the value goes back to the system memory and comes back from the system memory and all of this is because of two assembly statements that Visual Studio should have deleted.
The second thing I see wrong here is that 0.5 can be algebraically extracted from the expressions, squared and mutiplied with what is left after it is computed to give the final result, eliminating a multiplication. A general idea that I have encountered is that code should be made to be readable, because optimizations that can be done that obfuscate the code would be done by the compiler. Here is one place where an optimization that would obfuscate the code (i.e. making it difficult to see the two newton steps) is not being done by the compiler.
The third thing I see wrong here is the following:
00401046* fld******** dword ptr [esp+34h] //This is x
0040104A* fst******** dword ptr [esp+30h]
0040104E* mov******** ecx,dword ptr [esp+30h]
This pushes x onto the stack, sends it to a place in system memory and then calls it back to a general purpose register. Whatever place it is being sent is wasted, as it is possible to load into ecx directly from the location where it is loaded onto the stack:
00401046* fld******** dword ptr [esp+34h] //This is x
0040104E* mov******** ecx,dword ptr [esp+34h]
The same thing that applied to problem one about wasting cycles also applies here.
Now, there are 28 instructions generated by Visual Studio under release mode. Two of them are useless, one can be algebrically eliminated and a third can be modified to allow another to be removed. All that is necessary is 24 instructions.
Something that someone else might consider to be wrong is the fact that x is being left on the stack, but that is because z = 1.0l / sqrt(x) comes next and the function has been inlined.
Anyway, I compiled my code with Visual Studio 2008 Professional under release mode with the default settings, so I assumed that the fact that /fprecise is set by default to be the reason for these problems, so I changed it to /fp:fast and recompiled. Here is what Visual Studio 2008 Professional generated. I partially annotated it for clarity:
00401035* fld******** dword ptr [__real@40400000 (4021A8h)]
*** {
*** ***
*** *** y = InvSqrt(x)
0040103B* mov******** edx,5F375A86h
00401040* fld******** qword ptr [__real@3fe0000000000000 (4021A0h)]
*** *** z = 1.0l / sqrt(x)
*** *** printf("InvSqrt(%f) = %f (y) with custom function\nInvSqrt(%f) = %f (z) with built-in functions\nz - y = %f\nThe relative error is %f\n\n",
*** *** *** x, y, x, z,*** z - y,
*** *** *** 1.0l - y / z
*** *** *** )
00401046* sub******** esp,30h
00401049* fld******** dword ptr [esp+6Ch]
0040104D* fst******** dword ptr [esp+68h]
00401051* mov******** ecx,dword ptr [esp+68h]
00401055* sar******** ecx,1
00401057* sub******** edx,ecx
00401059* mov******** dword ptr [esp+68h],edx
0040105D* fld******** dword ptr [esp+68h]
00401061* fld******** st(0)
00401063* fmul******* st,st(1)
00401065* fmul******* st,st(2)
00401067* fsubr****** st,st(4)
00401069* fmulp****** st(1),st
0040106B* fmul******* st,st(2) // Unnecessary multiplication by 0.5
0040106D* fld******** st(0) // Load new bitC.floatpart
0040106F* fmul******* st,st(1) // bitC.floatpart * bitC.floatpart
00401071* fmul******* st,st(2) // bitC.floatpart * bitC.floatpart * x
00401073* fsubp****** st(4),st // 3.0f - bitC.floatpart * bitC.floatpart * x
00401075* fmulp****** st(3),st // (3.0f - bitC.floatpart * bitC.floatpart * x) * bitC.floatpart
00401077* fxch******* st(2) // Lets shuffle the registers
00401079* fmulp****** st(1),st // (3.0f - bitC.floatpart * bitC.floatpart * x) * bitC.floatpart * 0.5
There are 24 instructions here, but the number of multiplications is the same. Two of the fxch change instructions were removed because Visual Studio decided to leave x at st(1) and the result at st. I am not showing it here, but it then proceeds with fld st(1) when I would have used fstp. I assume this is to keep the result in a register so that it is ready for later printing, which is something that I could not determine myself without writing assembly specifically to display the result, which that I would never want to write.
Anyway, it seems that problems one and three were resolved. However, problem two is still an issue. I should not have to sacrafice the clarity that having two separate Newton steps in C code brings to get properly optimized assembly.
Does anyone know why Visual Studio 2008 Professional fails to do this optimization, even with /fp:fast set?
More...
View All Our Microsoft Related Feeds
I have* the following InvSqrt() function:
float InvSqrt ( float x )
{
*** union bitConverter {
*** *** int intPart
*** *** float floatPart
*** } bitC
*** /* Do Taylor Series Expansion */
*** bitC.floatPart = x
*** bitC.intPart = 0x5f375a86 - (bitC.intPart >> 1)
*** /* Do First Newton Approximation */
*** bitC.floatPart = ( 3.0f - x * bitC.floatPart * bitC.floatPart ) * 0.5 * bitC.floatPart
*** /* Do Second Newton Approximation and Return */
*** return ( 3.0f - x * bitC.floatPart * bitC.floatPart ) * 0.5 * bitC.floatPart
}
This is part of a small program that compares its output to the output from the native functions. It calls InvSqrt() in the context of this code:
y = InvSqrt(x)
z = 1.0l / sqrt(x)
Here is what Visual Studio 2008 Professional is generating under release mode with the default flags. I have annotated it with comments clarifying what it does:
00401035* fld******** qword ptr [__real@4008000000000000 (4021A8h)] // This is 3.0f
*** {
*** ***
*** *** y = InvSqrt(x)
0040103B* mov******** edx,5F375A86h
00401040* fld******** qword ptr [__real@3fe0000000000000 (4021A0h)] // This is 0.5f
00401046* fld******** dword ptr [esp+34h] //This is x
0040104A* fst******** dword ptr [esp+30h]
0040104E* mov******** ecx,dword ptr [esp+30h]
00401052* sar******** ecx,1
00401054* sub******** edx,ecx
00401056* mov******** dword ptr [esp+30h],edx
0040105A* fld******** dword ptr [esp+30h]* // This is bitC.floatpart
0040105E* fld******** st(0)* // This is bitC.floatpart
00401060* fmul******* st,st(2) // bitC.floatpart * x
00401062* fmul******* st,st(1) // (bitC.floatpart * x) * bitC.floatpart
00401064* fsubr****** st,st(4) // 3.0f - (bitC.floatpart * x) * bitC.floatpart
00401066* fmul******* st,st(3) // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5
00401068* fmulp****** st(1),st // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5 * bitC.floatpart
0040106A* fstp******* dword ptr [esp+30h] // Send it back to bitC.floatpart
0040106E* fld******** dword ptr [esp+30h] // To bring it back from bitC.floatpart???
00401072* fld******** st(0) // This is bitC.floatpart
00401074* fmul******* st,st(2) // bitC.floatpart * x
00401076* fmul******* st,st(1) // (bitC.floatpart * x) * bitC.floatpart
00401078* fsubp****** st(4),st // 3.0f - (bitC.floatpart * x) * bitC.floatpart
0040107A* fxch******* st(3) // Lets shuffle the registers
0040107C* fmulp****** st(2),st // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5
0040107E* fxch******* st(1) // Lets shuffle the registers
00401080* fmulp****** st(2),st // (3.0f - (bitC.floatpart * x) * bitC.floatpart) * 0.5 * bitC.floatpart
00401082* fxch******* st(1) // Lets shuffle the registers
00401084* fstp******* dword ptr [esp+38h] // Send back the answer
Before I continue, I would like to point out what I see wrong here. The first is the following:
0040106A* fstp******* dword ptr [esp+30h] // Send it back to bitC.floatpart
0040106E* fld******** dword ptr [esp+30h] // To bring it back from bitC.floatpart???
It makes no sense to pop a value back to system memory only to push it back onto the stack. I examined my source code and apparently this is Visual Studio's intrepretation of "bitC.floatPart =" even though it is absolutely not necessary.
The best case scenerio is that we have a 6 cycle stall while the value goes to the L1 cache and comes back from the L1 cache. The worst case scenerio is that we have a ~180 cycle stall while the value goes back to the system memory and comes back from the system memory and all of this is because of two assembly statements that Visual Studio should have deleted.
The second thing I see wrong here is that 0.5 can be algebraically extracted from the expressions, squared and mutiplied with what is left after it is computed to give the final result, eliminating a multiplication. A general idea that I have encountered is that code should be made to be readable, because optimizations that can be done that obfuscate the code would be done by the compiler. Here is one place where an optimization that would obfuscate the code (i.e. making it difficult to see the two newton steps) is not being done by the compiler.
The third thing I see wrong here is the following:
00401046* fld******** dword ptr [esp+34h] //This is x
0040104A* fst******** dword ptr [esp+30h]
0040104E* mov******** ecx,dword ptr [esp+30h]
This pushes x onto the stack, sends it to a place in system memory and then calls it back to a general purpose register. Whatever place it is being sent is wasted, as it is possible to load into ecx directly from the location where it is loaded onto the stack:
00401046* fld******** dword ptr [esp+34h] //This is x
0040104E* mov******** ecx,dword ptr [esp+34h]
The same thing that applied to problem one about wasting cycles also applies here.
Now, there are 28 instructions generated by Visual Studio under release mode. Two of them are useless, one can be algebrically eliminated and a third can be modified to allow another to be removed. All that is necessary is 24 instructions.
Something that someone else might consider to be wrong is the fact that x is being left on the stack, but that is because z = 1.0l / sqrt(x) comes next and the function has been inlined.
Anyway, I compiled my code with Visual Studio 2008 Professional under release mode with the default settings, so I assumed that the fact that /fprecise is set by default to be the reason for these problems, so I changed it to /fp:fast and recompiled. Here is what Visual Studio 2008 Professional generated. I partially annotated it for clarity:
00401035* fld******** dword ptr [__real@40400000 (4021A8h)]
*** {
*** ***
*** *** y = InvSqrt(x)
0040103B* mov******** edx,5F375A86h
00401040* fld******** qword ptr [__real@3fe0000000000000 (4021A0h)]
*** *** z = 1.0l / sqrt(x)
*** *** printf("InvSqrt(%f) = %f (y) with custom function\nInvSqrt(%f) = %f (z) with built-in functions\nz - y = %f\nThe relative error is %f\n\n",
*** *** *** x, y, x, z,*** z - y,
*** *** *** 1.0l - y / z
*** *** *** )
00401046* sub******** esp,30h
00401049* fld******** dword ptr [esp+6Ch]
0040104D* fst******** dword ptr [esp+68h]
00401051* mov******** ecx,dword ptr [esp+68h]
00401055* sar******** ecx,1
00401057* sub******** edx,ecx
00401059* mov******** dword ptr [esp+68h],edx
0040105D* fld******** dword ptr [esp+68h]
00401061* fld******** st(0)
00401063* fmul******* st,st(1)
00401065* fmul******* st,st(2)
00401067* fsubr****** st,st(4)
00401069* fmulp****** st(1),st
0040106B* fmul******* st,st(2) // Unnecessary multiplication by 0.5
0040106D* fld******** st(0) // Load new bitC.floatpart
0040106F* fmul******* st,st(1) // bitC.floatpart * bitC.floatpart
00401071* fmul******* st,st(2) // bitC.floatpart * bitC.floatpart * x
00401073* fsubp****** st(4),st // 3.0f - bitC.floatpart * bitC.floatpart * x
00401075* fmulp****** st(3),st // (3.0f - bitC.floatpart * bitC.floatpart * x) * bitC.floatpart
00401077* fxch******* st(2) // Lets shuffle the registers
00401079* fmulp****** st(1),st // (3.0f - bitC.floatpart * bitC.floatpart * x) * bitC.floatpart * 0.5
There are 24 instructions here, but the number of multiplications is the same. Two of the fxch change instructions were removed because Visual Studio decided to leave x at st(1) and the result at st. I am not showing it here, but it then proceeds with fld st(1) when I would have used fstp. I assume this is to keep the result in a register so that it is ready for later printing, which is something that I could not determine myself without writing assembly specifically to display the result, which that I would never want to write.
Anyway, it seems that problems one and three were resolved. However, problem two is still an issue. I should not have to sacrafice the clarity that having two separate Newton steps in C code brings to get properly optimized assembly.
Does anyone know why Visual Studio 2008 Professional fails to do this optimization, even with /fp:fast set?
More...
View All Our Microsoft Related Feeds