How to solve weirdness of the high resolution counter

26 04 2008

In a previous post, some issues with QueryPerformanceCounter() were discussed.

Fortunately, I found a very good blog, Zooba’s Blog, on the problems of using counters like rdtsc and QueryPerformanceCounter(). To use the result of rdtsc you also need the CPU frequency, which either takes additional processing time to measure or is only an approximate value looked up in the registry. So, I think rdtsc is not a good choice.
That leaves QueryPerformanceCounter() as the last option.

There are two issues to solve.

  1. To guarantee that the timing starts and ends exactly where you want it to.
     Because of optimization, the compiler may reorder instructions, and the CPU itself may execute them out of order. So, your “Start Measuring” command can effectively be placed earlier or later than you wrote it.

  2. To obtain a reliable count.
     As discussed in the previous post, QueryPerformanceCounter() doesn’t return a reliable count on multi-processor or multi-core machines.

To solve the 1st problem, a special kind of instruction called a “serializing instruction” should be executed.
According to Zooba’s Blog, there are 3 of them: iret, rsm, and cpuid.
However, iret and rsm change the instruction pointer, so they are out. cpuid just returns information about the CPU, so it does no harm.
(What is a “serializing instruction”? It is an instruction that forces execution to be serialized: the instructions already in the queue are completed before the serializing instruction, such as cpuid, is processed. So, you can ensure that the instructions for starting and stopping the measurement execute where you expect them to.)

The 2nd issue arises especially when your CPU is a multi-core processor or your machine has multiple processors. It also happens when your CPU has SpeedStep technology.
However, as mentioned in Zooba’s Blog, the SpeedStep case matters less, because the code whose performance you want to measure will make your CPU sweat enough in most cases. So, the most troublesome case is the multi-core, multi-processor case.
How to solve this problem? It is also explained in Zooba’s Blog. (Thank you, Zooba!)
If you make one specific processor run QueryPerformanceCounter(), it will return a reliable result. So, SetProcessAffinityMask() or SetThreadAffinityMask() can be used.

So, here is the code example.


// performance_measure.h
#ifndef PERFORMANCE_MEASURE
#define PERFORMANCE_MEASURE

// Affinity masks are DWORD_PTR, as GetProcessAffinityMask() and SetProcessAffinityMask() expect.
#define DECLARE_GLOBAL_FOR_PERFORMANCE_MEASURE()\
    LARGE_INTEGER g_Start_Counter, g_End_Counter, g_Frequency;\
    DWORD_PTR g_Old_ProcessAffinityMask, g_New_ProcessAffinityMask, g_SystemAffinityMask;\
    HANDLE hCurrentProcess;

DECLARE_GLOBAL_FOR_PERFORMANCE_MEASURE();

inline void INIT_PERFORMANCE_MEASURE( void )
{
    hCurrentProcess = GetCurrentProcess();
    GetProcessAffinityMask( hCurrentProcess, &g_Old_ProcessAffinityMask, &g_SystemAffinityMask );

    QueryPerformanceFrequency( &g_Frequency );
}   

inline void START_PERFORMANCE_MEASURE( void )
{
    int CPUInfo[4];

    // Serializing instruction
    __cpuid( CPUInfo, 0 );  // used the intrinsic version of the cpuid

    // Pin the process to the first processor while the counter is read.
    // The mask is passed by value, not by address.
    g_New_ProcessAffinityMask = 0x01;
    SetProcessAffinityMask( hCurrentProcess, g_New_ProcessAffinityMask );

    QueryPerformanceCounter( &g_Start_Counter );

    // Restore the original affinity
    SetProcessAffinityMask( hCurrentProcess, g_Old_ProcessAffinityMask );
}

inline void STOP_PERFORMANCE_MEASURE( void )
{
    int CPUInfo[4];

    __cpuid( CPUInfo, 0 );  // Serializing instruction
    SetProcessAffinityMask( hCurrentProcess, g_New_ProcessAffinityMask );

    QueryPerformanceCounter( &g_End_Counter );

    // Restore the original affinity
    SetProcessAffinityMask( hCurrentProcess, g_Old_ProcessAffinityMask );
}

inline double GET_PERFORMANCE_MEASURE( void )
{
    // Subtract the 64-bit counts first, then convert to seconds.
    return (double)(g_End_Counter.QuadPart - g_Start_Counter.QuadPart)/(double)g_Frequency.QuadPart;
}

#endif

Use the code above in your own code like this.


#include <windows.h>
#include <intrin.h>
#include <stdio.h>   // printf
#include <time.h>    // clock, CLOCKS_PER_SEC

// This header file contains above code
#include "performance_measure.h"

void matrix_multiplication( void )
{
    ...

    printf("Single\n");

    INIT_PERFORMANCE_MEASURE();

    START_PERFORMANCE_MEASURE();

    start_t = clock();

    for( iteration = 0; iteration < 90000; iteration++ )
    {
        for( i = 0; i < 8; i++ )
            for( j = 0; j < 8; j++ )
            {
                temp = 0;
                for( k = 0; k < 8; k++ )
                {
                    temp += matA[i][k]*matB[k][j];
                }
                matC[i][j] = temp;
            }
    }
    duration_t = clock() - start_t;

    STOP_PERFORMANCE_MEASURE();

    printf("Duration = %f (%f)\n", (double)duration_t/CLOCKS_PER_SEC,
        GET_PERFORMANCE_MEASURE() );

Now, you will get a reliable result.

Phew….





OBJC_API_VERSION and __OBJC2__

26 04 2008

I would like to post an answer from the objc-language mailing list.

On Apr 25, 2008, at 2:06 PM, JongAm Park wrote:
With C++, there is a macro, __cplusplus. Is there anything analogous for Objective-C 2.0?
Xcode 3.0 supports converting pre-2.0 code to Obj-C 2.0 code. However, there are people who still use pre-2.0 Obj-C, and even those who use 2.0 may want to keep their code compatible with an old compiler.
So, if there were a macro like __OBJC_2_0__, it would be very helpful for writing code that suits both pre-2.0 and 2.0 compilers.

So, the code will be like :

#ifdef __OBJC_2_0__
statements in Objective-C 2.0 syntax
#else
statement in Obj-C pre 2.0 syntax
#endif

There are no macros that match the available syntax. The available flags are:

OBJC_API_VERSION
This is set based on the low-level runtime API available. OBJC_API_VERSION==0 is the legacy API. OBJC_API_VERSION==2 means the function-based API added in Leopard is available. This is closer to what you want; it helps because it will disallow the new syntax when the deployment target is pre-Leopard, but if you’re compiling for 10.5+ then it doesn’t tell you whether your compiler knows about the new syntax.

__OBJC2__
This is set based on the ABI version (i.e. the metadata format on disk). This distinguishes between the legacy (i386+ppc) version and the modern (x86_64+ppc64) version. This isn’t what you want.

You could use OBJC_API_VERSION if your supported combinations are:
* Old compiler with deployment target older than Leopard
* New compiler with any deployment target
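
As a rough sketch of that suggestion, the check could look like the following. USE_OBJC2_SYNTAX and usesObjC2Syntax() are hypothetical names introduced only for illustration, and the sketch assumes that OBJC_API_VERSION has already been defined by the Objective-C runtime headers when building for Apple platforms. Anywhere else the macro is undefined and the legacy branch is taken.

```cpp
#include <cassert>

// Assumption: on Apple platforms the Objective-C runtime headers have been
// included before this point, so OBJC_API_VERSION is defined if available.
#if defined(OBJC_API_VERSION) && OBJC_API_VERSION >= 2
#define USE_OBJC2_SYNTAX 1   // hypothetical switch: function-based runtime API available
#else
#define USE_OBJC2_SYNTAX 0   // legacy runtime API, or not an Objective-C build at all
#endif

// Hypothetical helper so the decision can also be inspected at run time.
constexpr bool usesObjC2Syntax() { return USE_OBJC2_SYNTAX != 0; }
```

As the answer above points out, this only works if an old compiler is never combined with a Leopard or later deployment target.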

Thank you, Mr. Parker for the information.





Difference in Concurrency Model in MacOS X and the Windows (3)

25 04 2008

3. Event

Windows is built on an event-driven model. Events therefore play a very important role in the Windows environment, and many programs use them heavily, whether the programmer creates them or uses the ones provided by the OS.
Let’s first take a look at an example of how events are used.


int _tmain(int argc, _TCHAR* argv[])
{
    HANDLE hThread[kMaxThreads];

    int i;

    initEvent();

    for( i = 0; i < kMaxThreads; i++ )
    {
	// Threads wait on their events and trigger events for others.
        hThread[i] = CreateThread( NULL, 0, doMultiThreadWay, 0, 0, &gThreadID[i] );

        if( hThread[i] == NULL )
        {
		...
            ExitProcess(i);
        }
        else
        {
		...
        }
    }

    // At this point, all threads have been created and are waiting for their events.

    // Set (fire) the 1st event, gEvents[0].
    SetEvent( gEvents[0] );

    // Wait until all threads have terminated
    WaitForMultipleObjects( kMaxThreads, hThread, TRUE, INFINITE );

    // Close all thread handles
    for( i = 0; i < kMaxThreads; i++ )
        CloseHandle( hThread[i] );

    destroyEvent();

	return 0;
}

// This is how the events are initialized.
void initEvent( void )
{
    int i;

    for( i = 0; i < kMaxThreads; i++ )
    {
        // Auto-reset events: each one is reset automatically after it releases a waiting thread.
        gEvents[i] = CreateEvent( NULL, FALSE, FALSE, NULL );

        if( gEvents[i] == NULL )
            outputString( __T("Error in creating events\n"), FOREGROUND_RED | FOREGROUND_INTENSITY );
    }
}

// Threading function
DWORD WINAPI doMultiThreadWay( LPVOID lpParam )
{
    TCHAR msgBuf[kBuffSize];
    size_t cchStringSize;
    DWORD dwChars;
    DWORD threadID;
    DWORD dwWaitResult;
    WORD textColor;

    int i;

    threadID = GetCurrentThreadId();
    if( threadID == gThreadID[0] )
        textColor = FOREGROUND_GREEN | FOREGROUND_RED;
    else if( threadID == gThreadID[1] )
        textColor = FOREGROUND_BLUE | FOREGROUND_RED;
    else
        textColor = FOREGROUND_BLUE | FOREGROUND_GREEN;

    // Thread safe way of outputting
    StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d)\n"), threadID );
    outputString( msgBuf, textColor );

    for( i = 0; i < 5; i++ )
    {
        // Each thread waits for its own event.
        if( threadID == gThreadID[0] )
        {
            dwWaitResult = WaitForSingleObject( gEvents[0], INFINITE );
            outputString(__T("First thread says \"Do It\" to the second thread\n"), textColor );
            // The auto-reset event, e.g. gEvents[0], has already been reset by the wait.
            SetEvent( gEvents[1] );
        }
        else if ( threadID == gThreadID[1] )
        {
            dwWaitResult = WaitForSingleObject( gEvents[1], INFINITE );
            outputString(__T("Second thread says \"Do It\" to the third thread\n"), textColor );
            SetEvent( gEvents[2] );
        }
        else
        {
            dwWaitResult = WaitForSingleObject( gEvents[2], INFINITE );
            outputString(__T("Third thread says \"Do It\" to the First thread\n\n"), textColor );
            SetEvent( gEvents[0] );
        }

    }

    return 0;

}

Each thread waits for its own event and, once that event is triggered, fires the next event, so the threads wake up one after another.
That behaviour is exactly the effect events are meant to achieve.

If you don’t want to read the whole text above, here is a screenshot that shows at a glance what the code does.

Screenshot: what the code does.

As always, events could also be implemented using a mutex or semaphore. However, using events makes the implementation simpler.
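
To make that remark concrete, here is a minimal sketch of an auto-reset event built from a mutex plus a condition variable, using the C++ standard library rather than Win32. AutoResetEvent and runRing() are hypothetical names, and this only approximates what CreateEvent(), SetEvent() and WaitForSingleObject() do in the example above.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper that mimics a Win32 auto-reset event.
class AutoResetEvent {
public:
    // Comparable to SetEvent(): mark the event as signaled and wake one waiter.
    void set() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_one();
    }

    // Comparable to WaitForSingleObject(event, INFINITE).
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return signaled_; });
        signaled_ = false;  // auto-reset: the waiter that gets through consumes the signal
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Runs `threads` workers that wake each other in a ring for `rounds` rounds
// and returns the order in which the workers ran.
std::vector<int> runRing(int threads, int rounds) {
    assert(threads > 0 && rounds >= 0);
    std::vector<AutoResetEvent> events(threads);
    std::vector<int> order;
    std::vector<std::thread> workers;

    for (int id = 0; id < threads; ++id) {
        workers.emplace_back([&events, &order, id, threads, rounds] {
            for (int r = 0; r < rounds; ++r) {
                events[id].wait();                 // wait for my own event
                order.push_back(id);               // only the woken thread runs here
                events[(id + 1) % threads].set();  // wake the next thread
            }
        });
    }

    events[0].set();  // fire the first event, like SetEvent( gEvents[0] )
    for (auto& w : workers) w.join();
    return order;
}
```

With three threads and two rounds, the workers run strictly in turn, as in the Win32 version. The extra bookkeeping (a flag, a mutex and a condition variable) is the part that a ready-made event object hides.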

The Win32 synchronization model we have looked at so far has a characteristic of its own: you declare and set up an object such as a mutex, semaphore, or event, and then wait for the condition to occur using a function like WaitForSingleObject() or WaitForMultipleObjects(). This is the Win32 synchronization programming model to keep in mind, and it is what makes Windows different from other OSes like Unix, which have no function like WaitForSingleObject(). So, someone who learned multiprocessing and parallel computing on Unix or another non-Windows OS can be confused at first.
However, the model is easy to understand and logically designed, so there is no problem in learning it.

Windows multithreading (MFC)

MFC wraps the Win32 data types and behaviour described above in classes that are easy to use. In other words, it is a framework.

However, the MFC wrappers for synchronization do more than simply wrap things to make them easier to use.
They make the synchronization model of Windows look similar to that of Unix.
Let’s take a look at an example.


#include <afxmt.h>   // CMutex, CCriticalSection, CSingleLock

// Global Mutex Object
CMutex g_m;
int g_C;

UINT ThreadFunction1(LPVOID lParam)
{
    // Create object for Single Lock using the mutex
    CSingleLock lock(&g_m);

	// try obtaining a lock.
    lock.Lock();

    // code block protected by the lock.
	...

	// release the lock
    lock.Unlock();

    return 0;
}

UINT ThreadFunction2(LPVOID lParam)
{
    // Create a CSingleLock object for the same mutex
    CSingleLock lock(&g_m);

   // If the other thread already obtained the lock, this thread will wait here.
    lock.Lock();

    // code block protected by the lock.
	...

    lock.Unlock();

    return 0;
}

The lines where Lock() is called correspond to the lines where WaitForSingleObject() is used in Win32.
MFC provides the same programming model whatever the locking variable is: critical section, mutex, semaphore, or event. For example, the code above works with a critical section almost unchanged if you replace the CMutex g_m with a CCriticalSection. In other words, MFC offers a unified model in which all of these objects are locked and unlocked in the same style.
This is the major difference between the Win32 model and the MFC model.

Anyway, the way a block is locked and unlocked is similar to the Unix model.
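
The “same style for every locking variable” idea can be illustrated outside MFC as well. The following is only a sketch using the C++ standard library as an analogue: std::unique_lock plays the role of CSingleLock, and countWithLock() is a hypothetical helper that works with any lockable type.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: several threads increment one counter, and every
// increment is guarded by whatever lockable type the caller passes in.
template <typename Lockable>
int countWithLock(Lockable& lockable, int threads, int times) {
    assert(threads >= 0 && times >= 0);
    int counter = 0;
    std::vector<std::thread> workers;

    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lockable, &counter, times] {
            for (int i = 0; i < times; ++i) {
                std::unique_lock<Lockable> lock(lockable);  // comparable to CSingleLock + Lock()
                ++counter;                                   // code block protected by the lock
            }                                                // unlocked when `lock` goes out of scope
        });
    }

    for (auto& w : workers) w.join();
    return counter;
}
```

Calling it with a std::mutex or with a std::recursive_mutex needs no change to the locking code, which is the same property the text describes when CMutex is swapped for CCriticalSection.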

So far, we have looked at how synchronization works on Windows.
In the next post, let’s look at the Objective-C and Cocoa case.





Difference in Concurrency Model in MacOS X and MS Windows (2)

23 04 2008

This post is the 2nd part of a post I wrote a while ago. As I promised before, this series of posts is written in English and Korean.

OK. It is time to return to this issue, “multi-threading design” on Windows and Mac. When I studied multi-threading and synchronization on Windows after learning Unix, it was a little confusing. Although the Windows facilities are easy to learn and quite similar to those on Unix, there are some differences. The differences come from how the functions and facilities are designed.
Basically, they share the same model. However, they present it in slightly different ways.

In this post, the facilities that Windows provides for multi-threading are presented, along with a brief look at how to use them. In the next post, Apple’s approach with Objective-C and Cocoa will be explained.

1. Synchronization in Win32

1.1 Critical Section

The critical section seems to me to be the simplest synchronization method. By wrapping a code block in calls to two functions, it enables mutually exclusive access to the block.


    for( i = 0; i < 5; i++ )
    {
#ifdef USE_CRITICAL_SECTION
        EnterCriticalSection( &gCriticalSection );

        // Thread safe way of outputting
        StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d) : %d\n"), threadID, i );
        outputString( msgBuf, textColor );

        LeaveCriticalSection( &gCriticalSection );
#endif
    }

EnterCriticalSection() and LeaveCriticalSection() are those two functions.

1.2 Mutex ( Mutually Exclusive Semaphore ) & Semaphore

Windows provides dedicated functions for working with a mutex or, more generally, a semaphore: CreateMutex(), CreateSemaphore(), WaitForSingleObject(), WaitForMultipleObjects(), ReleaseMutex(), and ReleaseSemaphore().

Mutexes and semaphores are created by calling CreateMutex() and CreateSemaphore(), respectively. Once created, they can be used to access a code block (or rather, to regulate access to it), as seen below.


        dwWaitResult = WaitForSingleObject( gMutex, 5000L );
        switch( dwWaitResult )
        {
        case    WAIT_OBJECT_0:
                __try
                {
                    // Thread safe way of outputting
                    StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d) : %d\n"), threadID, i );
                    outputString( msgBuf, textColor );
                }
                __finally
                {
                    if( !ReleaseMutex( gMutex ) )
                    {
                        // Save old attribute for a console
                        WORD wPrevColorAttrs = normalTextCsbiInfo.wAttributes;

                        // Now, write in Red
                        if( !SetConsoleTextAttribute( hStdout, FOREGROUND_RED ) )
                        {
                            MessageBox( NULL, __T("SetConsoleTextAttribute"), __T("Console Error"), MB_OK );
                            break;
                        }

                        // Thread safe way of outputting
                        StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d) : Error in releasing mutex\n"), threadID );
                        outputString( msgBuf, FOREGROUND_GREEN );

                        if( !SetConsoleTextAttribute( hStdout, wPrevColorAttrs ) )
                        {
                            MessageBox( NULL, __T("SetConsoleTextAttribute"), __T("Console Error"), MB_OK );
                            break ;
                        }
                    }

                }
                // Leave the switch only after the __finally block has completed.
                break;

        case WAIT_TIMEOUT:
            break;

        case WAIT_ABANDONED:
            break;
        }

A thread that has got past the WaitForSingleObject() line owns the mutex, and it releases the mutex by calling ReleaseMutex(). The next thread waiting at the WaitForSingleObject() line then acquires the mutex, blocks other threads from getting it, and proceeds.
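
For readers who know the standard C++ library better than Win32, the same wait-with-timeout pattern can be sketched with std::timed_mutex. This is an analogue, not the Win32 API: WaitResult and appendWithTimeout() are hypothetical names, and there is no counterpart of WAIT_ABANDONED here.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical result type, loosely mirroring WAIT_OBJECT_0 and WAIT_TIMEOUT.
enum class WaitResult { Acquired, TimedOut };

// Tries to take the mutex for at most `timeout`, appends `line` to the shared
// log while holding it, and releases the mutex on every exit path.
WaitResult appendWithTimeout(std::timed_mutex& mutex,
                             std::vector<std::string>& log,
                             const std::string& line,
                             std::chrono::milliseconds timeout) {
    // Comparable to WaitForSingleObject( gMutex, 5000L ): try_lock_for() under the hood.
    std::unique_lock<std::timed_mutex> lock(mutex, timeout);
    if (!lock.owns_lock())
        return WaitResult::TimedOut;

    log.push_back(line);          // the protected block
    return WaitResult::Acquired;  // the destructor of `lock` plays the role of ReleaseMutex()
}
```

The destructor of the lock object does the job that the __finally block does in the Win32 code: the mutex is released even if the protected block exits early.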

Simple, isn’t it?
What differs from the Unix model is the form that comes from using calls like WaitForSingleObject(). However, the Windows way is also quite easy to understand and work with.

At this point, you may wonder why the critical section is necessary. You can implement a critical section using a mutex, so why is there a separate critical section? In fact, some OSes don’t have one. To understand the differences and similarities between Microsoft’s critical section and mutex, please read the MSDN document Critical Section Objects.

“A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process. Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction). Like a mutex object, a critical section object can be owned by only one thread at a time, which makes it useful for protecting a shared resource from simultaneous access. Unlike a mutex object, there is no way to tell whether a critical section has been abandoned.”

The decisive difference is that a critical section can be used only by the threads of a single process, and in that case it is faster.

This critical section is a good example of the things that confuse us when we develop on many different OSes. A few lines above, I said that some OSes don’t have the critical section. To be more accurate, I should revise that statement: it’s wrong. The concept of a critical section exists on every multi-process, multi-threading OS. If you use a mutex to force atomic access to a code block, then that block is a critical section. On the other hand, the critical section mentioned on the MSDN page is not the critical section as a general concept but Microsoft’s special structure, CRITICAL_SECTION, which is created with code like this:


CRITICAL_SECTION gCriticalSection;
InitializeCriticalSectionAndSpinCount( &gCriticalSection, 0x80000400 );

So, when you move from Windows to another OS, it is better not to think, “Oh, there is no critical section on xxx OS.”





Impressive battery life of 2007 Nov. version of MacBook

20 04 2008

Sometimes it is relaxing to go back to a non-technical subject.
I have used the base model of the MacBook for a few months. Usually I don’t believe a manufacturer’s claim on battery life. If they say that it lasts for 4 hours, a new battery will usually last about 2 hours and 30 minutes, and more realistically 2 hours. If you use that kind of notebook computer for about 1 year, it will settle at around 1 hour and 30 minutes.

However, somehow the MacBook is a different story.
I charged my machine last Sunday, and I turned it on today.
The remaining battery level is as follows:

The self-discharge is very low, and the battery actually lasts more than 4 hours in real use.

Is the battery really different from those used in Windows notebooks? Or is the Mac OS X power manager really that good?





QueryPerformanceCounter() equivalent on Mac OS X

20 04 2008

Timers are quite an issue for people who need to process images in real time or who want to measure very fast code.

Because QueryPerformanceCounter() and QueryPerformanceFrequency() were discussed in my previous post, one may ask, “Is there a similar function for Mac OS X?”

Yeah.. Actually, my blog stat showed that a few people searched with that term.

I found some functions like mach_absolute_time() and mach_timebase_info().

You can read a very nice explanation at MacResearch and on Apple’s Q&A page.
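
As a minimal sketch of how these two functions fit together: mach_absolute_time() returns a raw tick count, and mach_timebase_info() supplies the numerator and denominator that convert ticks to nanoseconds. monotonicNanoseconds() is a hypothetical helper name, and the non-Apple branch using std::chrono::steady_clock is my own fallback so the snippet builds elsewhere. It is not part of the Mach API.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__APPLE__)
#include <mach/mach_time.h>
#else
#include <chrono>
#endif

// Hypothetical helper: a monotonic timestamp in nanoseconds.
std::uint64_t monotonicNanoseconds() {
#if defined(__APPLE__)
    // The timebase does not change while the system is running, so read it once.
    static const mach_timebase_info_data_t timebase = [] {
        mach_timebase_info_data_t info;
        mach_timebase_info(&info);
        return info;
    }();
    // nanoseconds = ticks * numer / denom
    // (the multiplication could overflow for very large tick counts; fine for a sketch)
    return mach_absolute_time() * timebase.numer / timebase.denom;
#else
    // Assumption: fallback for non-Apple systems only.
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}
```

Taking one reading before and one after the code under test and subtracting them gives the elapsed time in nanoseconds, much like the QueryPerformanceCounter() and QueryPerformanceFrequency() pair on Windows.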





Weirdness of the High Resolution Counter, i.e. QueryPerformanceCounter()

19 04 2008

Most of the time, using clock() is enough for measuring the performance of a block of code.
However, there are cases where you want to compare two logically identical but differently implemented blocks.
Let’s assume that you want to compare the performance of the intrinsic version of strcpy with your own implementation of strcpy written with SIMD instructions.
In most cases, low-resolution timers like clock() and GetTickCount() will not reveal the difference between them.

So, you decide to use a high-performance, or high-resolution, timer. Windows provides these two functions for that purpose.

1. QueryPerformanceCounter( LARGE_INTEGER *pVal )
This function is like clock(). The value returned in the location pointed to by pVal is a number of counts, just as clock() returns a number of ticks.

2. QueryPerformanceFrequency( LARGE_INTEGER *pVal )
This returns how many times the counter ticks per second.

So, the time in seconds can be obtained by


    LARGE_INTEGER aVal, aFreq;
    __int64 duration_in_time;

    QueryPerformanceCounter( &aVal );
    QueryPerformanceFrequency( &aFreq );
    // Integer division: whole seconds only
    duration_in_time = aVal.QuadPart / aFreq.QuadPart;
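
To time a block of code rather than read the counter once, you take one reading before and one after, subtract, and divide by the frequency. The arithmetic can be sketched portably as follows. elapsedSeconds() is a hypothetical helper, and its arguments stand for the QuadPart values of the two counter readings and of the frequency.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts two raw counter readings into elapsed seconds.
// startCount and endCount would come from QueryPerformanceCounter(),
// countsPerSecond from QueryPerformanceFrequency().
double elapsedSeconds(std::int64_t startCount,
                      std::int64_t endCount,
                      std::int64_t countsPerSecond) {
    assert(countsPerSecond > 0);
    // Subtract as integers first, then divide in floating point so the
    // fraction of a second is kept.
    return static_cast<double>(endCount - startCount) /
           static_cast<double>(countsPerSecond);
}
```

For example, readings of 1000 and 4000 counts at 1000 counts per second correspond to 3 seconds.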

However, this has some glitches with contemporary CPUs.

Before mentioning the glitches, let’s take a look at how LARGE_INTEGER is declared.


typedef union _LARGE_INTEGER {
    struct {
        DWORD LowPart;
        LONG HighPart;
    };
    struct {
        DWORD LowPart;
        LONG HighPart;
    } u;
    LONGLONG QuadPart;
} LARGE_INTEGER;

LONGLONG is the __int64 type. So, if your compiler and CPU support a 64-bit data type, you can access the content of the LARGE_INTEGER through QuadPart.

The 1st glitch is the size of the returned value. Because current CPUs are so fast, the count quickly exceeds the 32-bit boundary.
(If you search on Google, you will find web pages on which people explain that it exceeds the 32-bit boundary, and they recommend using a 64-bit data type. At first it looked to me as if the value even exceeded the 64-bit boundary, but a signed 64-bit count running at a few GHz takes decades to wrap, so odd-looking values are more likely a matter of how the number is handled or printed.)
So, work with the full 64-bit QuadPart, or an unsigned __int64 if you prefer.
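
As a rough check on how quickly a signed 64-bit counter can wrap, the arithmetic can be written out as below. yearsUntilWrap() is a hypothetical helper, and the 3 GHz figure in the note is just an assumed counter rate for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: how many years a signed 64-bit counter that starts at
// zero can run at `countsPerSecond` before it reaches its maximum value.
double yearsUntilWrap(double countsPerSecond) {
    assert(countsPerSecond > 0.0);
    const double maxCount =
        static_cast<double>(std::numeric_limits<std::int64_t>::max());  // about 9.22e18
    const double secondsPerYear = 365.25 * 24.0 * 60.0 * 60.0;
    return maxCount / countsPerSecond / secondsPerYear;
}
```

For a counter ticking at 3 GHz this comes out at a little over 97 years, whereas a signed 32-bit count at the same rate would wrap in under a second.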

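The 2nd glitch is that you don’t get a useful value when you print aVal.QuadPart/aFreq.QuadPart with %I64d: it is an integer division, so the fraction of a second is lost.
%Lf doesn’t solve the problem either, because it expects a long double, not a 64-bit integer. Then how to display the value properly? Convert both operands to double before dividing: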


printf("%f", (double)aVal.QuadPart/(double)aFreq.QuadPart);

double is a 64-bit floating-point type, and it works.

The 3rd glitch is the real glitch.
Let’s take a look at this screenshot from a real run of the code.

Hmm… Why is the high-performance counter not reliable?
By searching on Google, I found a clue that it is due to SpeedStep or a similar technology that changes the CPU speed on demand.
Because the counter has such a high resolution, it is sensitive to this.
I read somewhere on Intel’s forum that Intel or MS was working on making the call measure on the FSB side instead of the inner core of the CPU.
By doing so, it is said, the function would return a more reliable value even when a battery-saving technology of the CPU is in use.

I assume that the GetTickCount() Win32 API function is also based on clock(). However, it seems to display the expected result reliably.
clock()/CLOCKS_PER_SEC displays 2 and 1.9… from time to time.

GetTickCount() probably has the lowest resolution.
However, one convenient side of GetTickCount() is that it returns a value in milliseconds, if you want “time” instead of a number of ticks.
So, you don’t need to divide it by a constant like CLOCKS_PER_SEC. In that case, it should have been named GetTickTime().
Well.. the function name is misleading again, but it is the brainchild of MS.

Finally, here is a screenshot when all of them return good results. :)





The GCC that comes with Mac OS X 10.4.x and 10.5.x doesn’t support flexible array members

11 04 2008

According to the GCC manual, GCC supports flexible array members.


    struct foo { int x; int y[]; };
    struct bar { struct foo z; };

    struct foo a = { 1, { 2, 3, 4 } };        // Valid.
    struct bar b = { { 1, { 2, 3, 4 } } };    // Invalid.
    struct bar c = { { 1, { } } };            // Valid.

The lines commented as valid should compile without any error. However, the GCC 4.x installed with Xcode 2.5 and 3.x on OS X 10.4.x and 10.5.x, respectively, doesn’t compile them without errors.

I reported this bug to Apple.

NEW on April 22, 2008
I got a message from Apple.

This is a follow-up to Bug ID# 5857390. Engineering has determined that this issue behaves as intended based on the following information:

Page 232 of the GCC manual available at http://gcc.gnu.org/onlinedocs/gcc-4.2.3/gcc.pdf states that:

“To avoid undue complication and confusion with initialization of deeply nested arrays, we simply disallow any non-empty initialization except when the structure is the top-level object.”

Compiling the example on page 232 will produce an error based on the two disallowed statements specifically marked as “Invalid”. Compiling the file with only the “Valid” lines works correctly.