How to solve the weirdness of the high-resolution counter

26 04 2008

In a previous post, some issues with QueryPerformanceCounter() were discussed.

Fortunately, I found a very good blog, Zooba’s Blog, which covers the problems of using counters like rdtsc and QueryPerformanceCounter. I think rdtsc is not a good choice: turning its result into a time requires the CPU frequency, and that either costs additional processing time to measure or is only an approximate value looked up in the registry.
So, the remaining option is QueryPerformanceCounter.

There are two issues to solve.

  1. To guarantee that the timing starts and ends exactly where you want it to.
     Because of optimization, the compiler may reorder instructions, and the CPU may execute them out of order. So, your “Start Measuring” command can end up earlier or later than intended.

  2. To obtain a reliable count.
     As discussed in the previous post, QueryPerformanceCounter doesn’t return a reliable count on multi-processor or multi-core systems.

To solve the 1st problem, a special kind of instruction called a “serializing instruction” should be executed.
According to Zooba’s Blog, there are 3 of them: iret, rsm, cpuid.
However, iret and rsm change the instruction pointer, so they are out. cpuid only retrieves information about the CPU, so it does no harm.
(What is a “serializing instruction”? It is an instruction that forces execution to be serialized: the instructions already in the queue are flushed out (completed) before the serializing instruction, such as cpuid, is processed. So, you can ensure that the instructions for starting and stopping the measurement run where you expect them to.)
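
As a rough illustration of the idea, here is a minimal sketch that runs cpuid right before each timestamp is taken. The helper names serialize_execution() and time_serialized_seconds() are made up for this example, and std::chrono::steady_clock is swapped in for QueryPerformanceCounter() only so that the sketch also builds outside Windows. On non-x86 targets it falls back to a compiler-only barrier, which is not a serializing instruction.

```cpp
#include <atomic>
#include <chrono>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // MSVC intrinsic: void __cpuid(int cpuInfo[4], int function_id)
#endif

// Hypothetical helper (not from the original post): executes cpuid only for
// its serializing side effect and throws the returned registers away.
inline void serialize_execution()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4];
    __cpuid(info, 0);
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // The "memory" clobber also keeps the compiler from moving memory
    // accesses across this point.
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
#else
    // Assumption: no cpuid on this target. This is a compiler-only barrier,
    // NOT a serializing instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical helper: times work() with a serialization point before each
// timestamp, in the same order as the START/STOP functions in this post.
template <typename Work>
double time_serialized_seconds(Work work)
{
    serialize_execution();
    const auto start = std::chrono::steady_clock::now();

    work();

    serialize_execution();
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_serialized_seconds() with a lambda that wraps the code under test returns the elapsed time in seconds. cpuid itself is slow, so its cost is part of every measurement and very short code sections will look slower than they are.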

The 2nd issue arises especially when your CPU is a multi-core processor or you have multiple processors. It also happens when your CPU has SpeedStep technology.
However, as mentioned in Zooba’s Blog, the SpeedStep case matters less, because in most cases the code whose performance you want to measure makes your CPU sweat enough. So, the most troublesome case is the multi-core, multi-processor case.
How to solve this problem? It is also explained in Zooba’s Blog. (Thank you, Zooba!)
If you make one specific processor run QueryPerformanceCounter(), it will return a reliable result. So, SetProcessAffinityMask() or SetThreadAffinityMask() can be used.
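
The pin-then-restore pattern can be sketched with a small RAII helper. This is only an illustration, not the code of this post: the class name ScopedThreadPin is made up, it uses the thread variant SetThreadAffinityMask() rather than the process variant, and on non-Windows builds it deliberately does nothing so the sketch still compiles there.

```cpp
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: pins the calling thread to the processors in `mask`
// for the lifetime of the object and restores the previous mask afterwards.
class ScopedThreadPin
{
public:
    explicit ScopedThreadPin(std::uintptr_t mask)
    {
#ifdef _WIN32
        // SetThreadAffinityMask returns the previous mask, or 0 on failure.
        previous_ = static_cast<std::uintptr_t>(
            SetThreadAffinityMask(GetCurrentThread(),
                                  static_cast<DWORD_PTR>(mask)));
#else
        (void)mask;   // Assumption: nothing to pin outside Windows in this sketch.
#endif
    }

    ~ScopedThreadPin()
    {
#ifdef _WIN32
        // Restore only if the pin succeeded.
        if (previous_ != 0)
            SetThreadAffinityMask(GetCurrentThread(),
                                  static_cast<DWORD_PTR>(previous_));
#endif
    }

    ScopedThreadPin(const ScopedThreadPin&) = delete;
    ScopedThreadPin& operator=(const ScopedThreadPin&) = delete;

    // True when the affinity was actually changed.
    bool pinned() const { return previous_ != 0; }

private:
    std::uintptr_t previous_ = 0;
};
```

On Windows, wrapping the counter read in a scope such as { ScopedThreadPin pin(0x01); QueryPerformanceCounter(&counter); } would read the counter on the first processor and then give the thread back its original affinity. Note that the mask is passed by value, and that a mask naming a processor outside the process affinity makes the call fail, in which case pinned() returns false.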

So, here is the code example.


// performance_measure.h
#ifndef PERFORMANCE_MEASURE
#define PERFORMANCE_MEASURE

// Affinity masks are DWORD_PTR, the type the affinity APIs expect (64-bit on x64).
#define DECLARE_GLOBAL_FOR_PERFORMANCE_MEASURE()\
    LARGE_INTEGER g_Start_Counter, g_End_Counter, g_Frequency;\
    DWORD_PTR g_Old_ProcessAffinityMask, g_New_ProcessAffinityMask, g_SystemAffinityMask;\
    HANDLE hCurrentProcess;

DECLARE_GLOBAL_FOR_PERFORMANCE_MEASURE()

inline void INIT_PERFORMANCE_MEASURE( void )
{
    hCurrentProcess = GetCurrentProcess();
    GetProcessAffinityMask( hCurrentProcess, &g_Old_ProcessAffinityMask, &g_SystemAffinityMask );

    QueryPerformanceFrequency( &g_Frequency );
}   

inline void START_PERFORMANCE_MEASURE( void )
{
    int CPUInfo[4];

    // Serializing instruction
    __cpuid( CPUInfo, 0 );  // used the intrinsic version of the cpuid

    // Pin to the first processor. The mask is passed by value, not by address.
    g_New_ProcessAffinityMask = 0x01;
    SetProcessAffinityMask( hCurrentProcess, g_New_ProcessAffinityMask );

    QueryPerformanceCounter( &g_Start_Counter );

    // Revert to the original affinity
    SetProcessAffinityMask( hCurrentProcess, g_Old_ProcessAffinityMask );
}

inline void STOP_PERFORMANCE_MEASURE( void )
{
    int CPUInfo[4];

    __cpuid( CPUInfo, 0 );  // Serializing instruction

    // Pin to the same processor that was used for the start count
    SetProcessAffinityMask( hCurrentProcess, g_New_ProcessAffinityMask );

    QueryPerformanceCounter( &g_End_Counter );

    // Revert to the original affinity
    SetProcessAffinityMask( hCurrentProcess, g_Old_ProcessAffinityMask );
}

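// Elapsed time in seconds between START and STOP.
// inline, so that the header can be included from more than one source file.
inline double GET_PERFORMANCE_MEASURE( void )
{
    // Subtract as 64-bit integers first so that no precision is lost in double
    return (double)(g_End_Counter.QuadPart - g_Start_Counter.QuadPart)/(double)g_Frequency.QuadPart;
}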

#endif

Use the above header in your code like this.


#include <windows.h>
#include <intrin.h>
#include <cstdio>   // printf
#include <ctime>    // clock, CLOCKS_PER_SEC
using namespace std;

// This header file contains above code
#include "performance_measure.h"

void matrix_multiplication( void )
{
    ...

    printf("Single\n");

    INIT_PERFORMANCE_MEASURE();

    START_PERFORMANCE_MEASURE();

    start_t = clock();

    for( iteration = 0; iteration < 90000; iteration++ )
    {
        for( i = 0; i < 8; i++ )
            for( j = 0; j < 8; j++ )
            {
                temp = 0;
                for( k = 0; k < 8; k++ )
                {
                    temp += matA[i][k]*matB[k][j];
                }
                matC[i][j] = temp;
            }
    }
    duration_t = clock() - start_t;

    STOP_PERFORMANCE_MEASURE();

    printf("Duration = %f (%f)\n", (double)duration_t/CLOCKS_PER_SEC,
        GET_PERFORMANCE_MEASURE() );
}

Now, you will get a reliable result.

Phew….





OBJC_API_VERSION and __OBJC2__

26 04 2008

I would like to post an answer from the objc-language mailing list.

On Apr 25, 2008, at 2:06 PM, JongAm Park wrote:
With C++, there is a macro, __cplusplus. Is there anything analogous for Objective-C 2.0?
Xcode 3.0 supports converting pre-2.0 code to Obj-C 2.0 code. However, there are people who still use pre-2.0 Obj-C, or who use 2.0 but want to maintain code compatibility with an old compiler.
So, if there were a macro such as __OBJC_2_0__, it would be very helpful for writing appropriate code for both pre-2.0 and 2.0 compilers.

So, the code will be like :

#ifdef __OBJC_2_0__
statements in Objective-C 2.0 syntax
#else
statements in Obj-C pre-2.0 syntax
#endif

There are no macros that match the available syntax. The available flags are:

OBJC_API_VERSION
This is set based on the low-level runtime API available. OBJC_API_VERSION==0 is the legacy API. OBJC_API_VERSION==2 means the function-based API added in Leopard is available. This is closer to what you want; it helps because it will disallow the new syntax when the deployment target is pre-Leopard, but if you’re compiling for 10.5+ then it doesn’t tell you whether your compiler knows about the new syntax.

__OBJC2__
This is set based on the ABI version (i.e. the metadata format on disk). This distinguishes between the legacy (i386+ppc) version and the modern (x86_64+ppc64) version. This isn’t what you want.

You could use OBJC_API_VERSION if your supported combinations are:
* Old compiler with deployment target older than Leopard
* New compiler with any deployment target

Thank you, Mr. Parker, for the information.
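
For the two supported combinations listed above, the check on OBJC_API_VERSION could be wrapped roughly as follows. This is only a sketch: MYAPP_USE_OBJC2_SYNTAX and objc_syntax_in_use() are made-up names, and pulling the macro in through <objc/objc-api.h> on Apple platforms is my assumption, not something stated in the reply. The preprocessor part is written so that it also compiles as plain C or C++, where the macro is simply undefined.

```cpp
// Assumption: in Objective-C sources on Apple platforms, this runtime header
// is what defines OBJC_API_VERSION.
#if defined(__OBJC__) && defined(__APPLE__)
#include <objc/objc-api.h>
#endif

// Hypothetical project macro: 1 when the function-based (Leopard) API is
// available, 0 for the legacy API or when the macro is not defined at all.
#if defined(OBJC_API_VERSION) && OBJC_API_VERSION >= 2
#define MYAPP_USE_OBJC2_SYNTAX 1
#else
#define MYAPP_USE_OBJC2_SYNTAX 0
#endif

// Hypothetical helper that reports which branch was compiled in.
inline const char* objc_syntax_in_use()
{
#if MYAPP_USE_OBJC2_SYNTAX
    return "Objective-C 2.0 syntax";
#else
    return "pre-2.0 syntax";
#endif
}
```

As the reply points out, this only works if an old compiler is never combined with a 10.5+ deployment target, because OBJC_API_VERSION says nothing about what syntax the compiler itself understands.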