Implementation of aligned memory allocation

12 06 2008

Thanks to evolution in CPU architecture, you have super computers in your house. However, to get benefit from it, it requires some techniques. There are something new, while there are something very traditional.

One of such traditional approach is to use aligned memory allocation. Because CPU access memory most efficiently when accessing in certain unit, it is better if the allocated memory lies in the certain unit boundary. For example, let’s assume a CPU architecture is most efficient when memory is aligned every 4 bytes. When it accesses memory located at multiples of 4 in its address space, it is much faster than when the accessed memory is at 0×03 or 0×05.

The Unix malloc functions usually return aligned memory space, while Windows version doesn’t. Instead, the Windows provide _aligned_malloc().

Then, how to create a aligned malloc() function? There can be some special cases that you want to implement your own aligned malloc, although I don’t imagine such a case. Let’s figure out how to by looking at one of existing implementations. ( You can search one using the Google, and you will find out that they are similar.)


// size : the size of allocated memory
//        The actual size of allocation will be greater than this size.
// alignment : the alignment boundary
void *aligned_memory_alloc( size_t size, size_t alignment )
{
	void *pa, *ptr;

	//pa=malloc(((size+alignment-1)&~(alignment-1))+sizeof(void *)+alignment-1);

	// 1
    pa=malloc((size+alignment-1)+sizeof(void *));
	if(!pa)
		return NULL;

	// 2
	ptr=(void*)( ((ULONG_PTR)pa+sizeof(void *)+alignment-1)&~(alignment-1) );

	// 3
	*((void **)ptr-1)=pa;

    printf("CAlignedAlloc::new(%d,%d)=%x\n", (ULONG)size, alignment, (ULONG)ptr);

	return ptr;
}

Point is to allocate more space than required and make it point to some position in the allocated memory. The pointed position is aligned location.

At 1, the total space to allocate is :

size ; the space a user want to allocate
+ (alignment-1) ; additional space due to the alignment
+ sizeof (void *) ; The head location of the newly allocated memory
; contains an address where the aligned memory block starts

The size part is obvious. The sizeof (void *) part follows the design of aligned memory allocation. Without this part, it will not know from where to free the aligned memory space, and from where to access the memory to read/write from/to the space.

For the (alignment-1), please take a look at this picture.

The red arrows show what the destination address should be if the allocated memory is 1, 2, 3 or 5, 6, 7. They are relocated to 4 and 8, respectively. So, it can be shifted up to 3 slots to the right. So, it is (alignment-1)

At 2, the ptr points to the location of aligned place. After reserving the space for storing where the whole allocated memory block is, i.e. pa, it calculates the aligned location. If you look at the picture above, you will see why (alignment-1) is added and the address is “AND”ed with 1’s complement of the (alignment-1). For example, 4-bytes alignment means masking out the last 2bits. It is like removing the last 2 bits.

At 3, “address length” bytes before, it saves the address which points where the whole allocated memory starts. Why it uses (void **) instead of (void *) is because the pointer (ptr-1) is points to an address, which is a pointer to pointer.

And finally it returns the aligned memory location, ptr.

How about the free() function? You can’t free the address to the aligned position. The whole memory space should be freed. That is why the starting address of the whole memory space was saved above.


void aligned_free( void *ptr )
{
    printf("CAlignedAlloc::free(%x)\n", (ULONG)ptr);
	if(ptr)
		free(*((void **)ptr-1));
}

At Just 1 address-width, i.e. 4bytes for 32bit CPU and 8bytes for 64bit CPU, before the location of the aligned space, it contains the starting address of the whole block.
It is better to use (void **) or (void *) to calculate how much space is required to save a pointer, because it works for 64bit architecture as well as 32bit architecture. Actually it works for any architecture.





Why OpenMP?

23 05 2008

   When I attended the WWDC, i.e. Apple’s World Wide Developer Conference, a few years ago, and if I remember it correctly, some people raised their hands and asked when the OpenMP support would be included in the GCC provided by the Apple. At that time, I didn’t not understand why it is important. I thought the OpenMP and MPI are for speicial market, like high-performance science and data analysis market. I thought they are for their own league.
   Also, I didn’t understand why we needed another threading and mutiprocessing/multithreading API when we already have the pthread and other message passing APIs. I would confess this. “Why should programmers learn another threading API? I don’t want to do so!”

   However, about 1 month ago, I found out that the OpenMP, at least, can boost performance of any individual programmer’s codes very “easilty”. I woudl like to put emphasis on “easily”. If a new library is announced, it should be easy to be tried without sacrificing your precious time in my opinion.
The OpenMP was turned out to be in that category.

   Actually, it looks like a collection of macros which utilize the pthread functions. However, actually it is built into compilers like the Visual C++’s compiler and GCC v. 4.2.x. So, in other words, you need a compiler which supports the OpenMP.

The great features of the OpenMP are :

  1. Very easy to use; Not so many new keywords to memorize; very straight forward to use.
  2. Enables applying “fine” level of multithreading without hassle.
  3. You can use almost same source codes for single threaded version and multithreaded version no matter how many threads you want to create.

Let’s talk about them more to get better idea what I mean.

1. The OpenMP keywords are very easy to learn. They are quite clean, and doesn’t introduce new concept. It consists of only a couple of keywords, and you can try them very easilty without modifying your logic much. Actually, embrace your logic, which should be handled by threads, with brackets (braces?) and their keywords. That’s it!

2. When codes are written in multithreaded way, they can be usually for coarse-grained multithreaded. It is because that it is tedious to write multithreaded codes in fine-grained way. I just create a new thread which uses a function as a thread function. But with the OpenMP, you can easily slice your time-wasting for-loop and give them to their own threads.

3. If OpenMP allows you to write multithreaded code, but if it makes you to change your codes a lot, it is not useful. The OpenMP allows to convert a single threaded version of code into multithreaded version by adding their new statements. Usually there is no need to change the structure of existing codes. If you want to use 3 threads instead of 2, you can just specify the number of threads to utilize without change the exisiting code structure!

Also, it is quite handy for current multicore processors. The main target of the OpenMP is multiprocessor or multicore processors in one computer. On the other hand, the MPI is for distributed environment.

Here is my sample code which uses the OpenMP. It shows how fast it can be if the OpenMP is used. I also tried using SIMD instructions if it can achieve faster performance than using multithreading. I think SIMDs are more efficient than using multithreads, because there is no overhead to create multiple threads, and maintain them. However, my code sample shows that poorly designed SIMD codes are slower than simpler but multithreaded codes.


// OpenMP.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <omp.h>
#include <cstdio>
#include <cmath>
#include <ctime>
#include <cstring>
#include <windows.h>
#include <intrin.h>
using namespace std;

#include "performance_measure.h"

#define NUM_THREADS 4
#define NUM_START 1
#define NUM_END 10

void test(int val)
{
    #pragma omp parallel if (val) num_threads(val)
    if (omp_in_parallel())
        #pragma omp single
        printf_s("val = %d, parallelized with %d threads\n",
                 val, omp_get_num_threads());
    else
        printf_s("val = %d, serialized\n", val);
}

void AnotherTest( void )
{
   int i, nRet = 0, nSum = 0, nStart = NUM_START, nEnd = NUM_END;
   int nThreads = 0, nTmp = nStart + nEnd;
   unsigned uTmp = (unsigned((abs(nStart - nEnd) + 1)) *
                               unsigned(abs(nTmp))) / 2;
   int nSumCalc = uTmp;

   if (nTmp < 0)
      nSumCalc = -nSumCalc;

   omp_set_num_threads(NUM_THREADS);

   #pragma omp parallel default(none) private(i) shared(nSum, nThreads, nStart, nEnd)
   {
      #pragma omp master
      nThreads = omp_get_num_threads();

      #pragma omp for
      for (i=nStart; i<=nEnd; ++i) {
            #pragma omp atomic
            nSum += i;
      }
   }

   if  (nThreads == NUM_THREADS) {
      printf_s("%d OpenMP threads were used.\n", NUM_THREADS);
      nRet = 0;
   }
   else {
      printf_s("Expected %d OpenMP threads, but %d were used.\n",
               NUM_THREADS, nThreads);
      nRet = 1;
   }

   if (nSum != nSumCalc) {
      printf_s("The sum of %d through %d should be %d, "
               "but %d was reported!\n",
               NUM_START, NUM_END, nSumCalc, nSum);
      nRet = 1;
   }
   else
      printf_s("The sum of %d through %d is %d\n",
               NUM_START, NUM_END, nSum);

}

void test2(int iter)
{
    #pragma omp ordered
    printf_s("test2() iteration %d by thread ID %d\n", iter, omp_get_thread_num());
}

void AnotherTest2( void )
{
    int i;
    #pragma omp parallel
    {
        #pragma omp for ordered
        for (i = 0 ; i < 5 ; i++)
            test2(i);
    }

}

/*
 * taylor.c
 *
 * This program calculates the value of e*pi by first calculating e
 * and pi by their taylor expansions and then multiplying them
 * together.
 */

#define num_steps 20000000

void sequential_taylor( void )
{
  double start, stop; /* times of beginning and end of procedure */
  double e, pi, factorial, product;
  int i;

  printf("Sequential Taylor\n");
  /* start the timer */
  start = clock();

  /* First we calculate e from its taylor expansion */
  printf("e started\n");
  e = 1;
  factorial = 1; /* rather than recalculating the factorial from
		    scratch each iteration we keep it in this varialbe
		    and multiply it by i each iteration. */
  for (i = 1; i<num_steps; i++) {
    factorial *= i;
    e += 1.0/factorial;
  }
  printf("e done\n");

  /* Then we calculate pi from its taylor expansion */
  printf("pi started\n");

  pi = 0;
  for (i = 0; i < num_steps*10; i++) {
    /* we want 1/1 - 1/3 + 1/5 - 1/7 etc.
       therefore we count by fours (0, 4, 8, 12...) and take
         1/(0+1) =  1/1
       - 1/(0+3) = -1/3
         1/(4+1) =  1/5
       - 1/(4+3) = -1/7 and so on */
    pi += 1.0/(i*4.0 + 1.0);
    pi -= 1.0/(i*4.0 + 3.0);
  }
  pi = pi * 4.0;
  printf("pi done\n");

    product = e * pi;

  stop = clock();

  printf("Reached result %f in %.3f seconds\n", product, (stop-start)/1000);

}

void parallel_taylor( void )
{
  double start, stop; /* times of beginning and end of procedure */
  double e, pi, factorial, product;
  int i;

  printf("Parallel Taylor\n");

  /* start the timer */
  start = clock();

  /* Now there is no first and seccond, we calculate e and pi */
#pragma omp parallel sections //shared(e, pi)
  {
#pragma omp section
    {
      printf("e started\n");
      e = 1;
      factorial = 1; /* rather than recalculating the factorial from
			scratch each iteration we keep it in this varialbe
			and multiply it by i each iteration. */
      for (i = 1; i<num_steps; i++) {
	factorial *= i;
	e += 1.0/factorial;
      }
      printf("e done\n");
    } /* e section */

#pragma omp section
    {
      /* In this thread we calculate pi expansion */
      printf("pi started\n");

      pi = 0;
      for (i = 0; i < num_steps*10; i++) {
	/* we want 1/1 - 1/3 + 1/5 - 1/7 etc.
	   therefore we count by fours (0, 4, 8, 12...) and take
             1/(0+1) =  1/1
	   - 1/(0+3) = -1/3
             1/(4+1) =  1/5
	   - 1/(4+3) = -1/7 and so on */
	pi += 1.0/(i*4.0 + 1.0);
	pi -= 1.0/(i*4.0 + 3.0);
      }
      pi = pi * 4.0;
      printf("pi done\n");
    } /* pi section */

  } /* omp sections */
  /* at this point the threads should rejoin */

  product = e * pi;

  stop = clock();

  printf("Reached result %f in %.3f seconds\n", product, (stop-start)/1000);

}

void display_matrix( unsigned char mat[][8] )
{
    int i, j;
    for( i = 0 ; i < 8; i++ )
    {
        for( j = 0 ; j < 8; j++ )
            printf("%3d ", mat[i][j] );
        printf("\n");
    }
    printf("\n");

}

const int PI = 3.141592;

void matrix_multiplication( void )
{

    unsigned __int64 time_stamp_start, time_stamp_duration;

    __declspec(align(16)) unsigned char matA[8][8] = {  { 0, 1, 2, 3, 4, 5, 6, 7 },
                        { 0, 0, 10, 11, 12, 13, 14, 15 },
                        { 0, 0, 0, 3, 4, 5, 6, 7 },
                        { 0, 0, 0, 0, 12, 13, 14, 15 },
                        { 0, 0, 0, 0, 0, 5, 6, 7 },
                        { 0, 0, 0, 0, 0, 0, 14, 15 },
                        { 0, 0, 0, 0, 0, 0, 0, 7 },
                        { 0, 0, 0, 0, 0, 0, 0, 0 }};
    __declspec(align(16)) unsigned char matB[8][8] = {  { 0, 0, 0, 0, 0, 0, 0, 0 },
                        { 0, 0, 0, 0, 0, 0, 0, 7 },
                        { 0, 0, 0, 0, 0, 0, 14, 15 },
                        { 0, 0, 0, 0, 0, 5, 6, 7 },
                        { 0, 0, 0, 0, 12, 13, 14, 15 },
                        { 0, 0, 0, 3, 4, 5, 6, 7 },
                        { 0, 0, 10, 11, 12, 13, 14, 15 },
                        { 0, 1, 2, 3, 4, 5, 6, 7 }};

    __declspec(align(16)) unsigned char matC[8][8];

    memset( matC, 0, 64*sizeof(char) );

    int i, j, k;
    int iteration;
    int nThreads;
    unsigned char temp;
    clock_t start_t, duration_t;

    printf("Single\n");

    INIT_PERFORMANCE_MEASURE();

    START_PERFORMANCE_MEASURE();

    start_t = clock();

    for( iteration = 0; iteration < 90000; iteration++ )
    {
        for( i = 0; i < 8; i++ )
            for( j = 0; j < 8; j++ )
            {
                temp = 0;
                for( k = 0; k < 8; k++ )
                {
                    temp += matA[i][k]*matB[k][j];
                }
                matC[i][j] = temp;
            }
    }
    duration_t = clock() - start_t;

    STOP_PERFORMANCE_MEASURE();

    printf("Duration = %f (%f)\n", (double)duration_t/CLOCKS_PER_SEC,
        GET_PERFORMANCE_MEASURE() );

    display_matrix( matC );
    memset( matC, 0, 64*sizeof(unsigned char) );

    printf("Multithreaded 2\n");

    START_PERFORMANCE_MEASURE();
    start_t = clock();

    #pragma omp parallel default(none) private(i,j,k,temp) shared(nThreads, matA, matB, matC) num_threads(2)
    {
        #pragma omp master
        nThreads = omp_get_num_threads();

        #pragma omp for
        for( iteration = 0; iteration < 90000; iteration++ )
        {
            for( i = 0; i < 8; i++ )
                for( j = 0; j < 8; j++ )
                {
                    temp = 0;
                    for( k = 0; k < 8; k++ )
                    {
                        temp += matA[i][k]*matB[k][j];
                    }
                    matC[i][j] = temp;
                }
        }
   }
    duration_t = clock() - start_t;

    STOP_PERFORMANCE_MEASURE();

    printf("Duration = %f (%f)\n", (double)duration_t/CLOCKS_PER_SEC, GET_PERFORMANCE_MEASURE() );

    display_matrix( matC );
    printf("nThreads : %d\n", nThreads);
    printf("\n");

    /////////////////////////////////////////////////////////////////////////////
    printf("Using SIMD\n");

    __m128i a, b, c, zero, high_temp, low_temp, high_temp2, low_temp2, temp_128, temp2_128;

    start_t = clock();

    zero = _mm_setzero_si128();
    for( iteration = 0; iteration < 90000; iteration++ )
    {
        for( i = 0; i < 8; i++ )
        {
            // ith row
            a = _mm_set_epi16( matA[i][7], matA[i][6],
                              matA[i][5], matA[i][4],
                              matA[i][3], matA[i][2],
                              matA[i][1], matA[i][0] );

            //a = _mm_unpacklo_epi8( a, zero ); // Now they are in 16bits

            for( j = 0; j < 8; j++ )
            {
                b = _mm_set_epi16( matB[7][j], matB[6][j], matB[5][j], matB[4][j],
                               matB[3][j], matB[2][j], matB[1][j], matB[0][j]);

                //b = _mm_unpacklo_epi8( b, zero ); // Now they are in 16bits

                low_temp = _mm_mullo_epi16( a, b );
                high_temp = _mm_mulhi_epi16( a, b );

                low_temp2 = _mm_unpacklo_epi16( low_temp, high_temp ); // 32bits
                high_temp2 = _mm_unpackhi_epi16( low_temp, high_temp ); // 32bits

                temp_128 = _mm_add_epi32( low_temp2, high_temp2 );
                temp2_128 = _mm_shuffle_epi32( temp_128, 0x4E );

                temp_128 = _mm_add_epi32( temp_128, temp2_128 );
                temp2_128 = _mm_shuffle_epi32( temp_128, 0xB1 );

                temp_128 = _mm_add_epi32( temp_128, temp2_128 );

                matC[i][j] = temp_128.m128i_i8[0];

            }
        }
    }
    duration_t = clock() - start_t;
    printf("Duration = %f\n", (double)duration_t/CLOCKS_PER_SEC );

    display_matrix( matC );
    printf("\n");

    /////////////////////////////////////////////////////////////////////////////
    printf("Using SIMD with threads\n");
    start_t = clock();

    start_t = clock();

    zero = _mm_setzero_si128();
#pragma omp parallel default(none) private(i, j, a, b, low_temp, high_temp, low_temp2, high_temp2, temp_128, temp2_128) shared( nThreads, matA, matB, matC, zero )  num_threads(2)
{
#pragma omp master
        nThreads = omp_get_num_threads();

#pragma omp for
    for( iteration = 0; iteration < 90000; iteration++ )
    {
        for( i = 0; i < 8; i++ )
        {
            // ith row
            a = _mm_set_epi16( matA[i][7], matA[i][6],
                              matA[i][5], matA[i][4],
                              matA[i][3], matA[i][2],
                              matA[i][1], matA[i][0] );

            //a = _mm_unpacklo_epi8( a, zero ); // Now they are in 16bits

            for( j = 0; j < 8; j++ )
            {
                b = _mm_set_epi16( matB[7][j], matB[6][j], matB[5][j], matB[4][j],
                               matB[3][j], matB[2][j], matB[1][j], matB[0][j]);

                //b = _mm_unpacklo_epi8( b, zero ); // Now they are in 16bits

                low_temp = _mm_mullo_epi16( a, b );
                high_temp = _mm_mulhi_epi16( a, b );

                low_temp2 = _mm_unpacklo_epi16( low_temp, high_temp ); // 32bits
                high_temp2 = _mm_unpackhi_epi16( low_temp, high_temp ); // 32bits

                temp_128 = _mm_add_epi32( low_temp2, high_temp2 );
                temp2_128 = _mm_shuffle_epi32( temp_128, 0x4E );

                temp_128 = _mm_add_epi32( temp_128, temp2_128 );
                temp2_128 = _mm_shuffle_epi32( temp_128, 0xB1 );

                temp_128 = _mm_add_epi32( temp_128, temp2_128 );

                matC[i][j] = temp_128.m128i_i8[0];
            }
        }
    }
}
    duration_t = clock() - start_t;
    printf("Duration = %f\n", (double)duration_t/CLOCKS_PER_SEC );

    display_matrix( matC );
    printf("nThreads : %d\n", nThreads);
    printf("\n");

}

int _tmain(int argc, _TCHAR* argv[])
{
    //test(0);
    //test(3);

    //AnotherTest();
    //printf("\n\n");
    //AnotherTest2();

    //sequential_taylor();
    //parallel_taylor();

    matrix_multiplication();

	return 0;
}

 

By the way, you can enable the OpenMP in the Visual C++ by setting like this.

For the GCC, 4.2.x versions or above are required. For the Mac OS X, if you log in the ADC web site, you can download 4.2.3(?) version or above. It is still kind of beta.

If you want to know more about the OpenMP, visit :
http://openmp.org/wp/
http://en.wikipedia.org/wiki/OpenMP

GOMP is for C/C++ and Fortran 95 in the GNU Compiler Collection, aka, GCC.
http://gcc.gnu.org/projects/gomp/





Variable argument list bug in Visual C++ 2005 library

22 05 2008

Today, I found a bug in a Visual C++ 2005 standard library related to variable argument list. The problem is that va_arg() doesn’t return correct value.


#include "stdafx.h"
#include <stdarg.h>

void var_tester( char *aString, ... )
{
    int num_arg = 1;
    va_list argument_ptr;
    int aVal;

    va_start( argument_ptr, aString );

    while( (aVal = va_arg( argument_ptr, int )) != NULL )
    {
        num_arg++;
        printf("%d st arg = %X\n", num_arg, aVal );
    }

    va_end( argument_ptr );
}

int _tmain(int argc, _TCHAR* argv[])
{
    var_tester( "Hmm..", 1, 2, 3 );
    printf("\n");
    var_tester( "Hmm..", 1, 2, 3, 4 );

    printf("\n");
    var_tester( "Hmm..", 1, 3 );

    printf("\n");
    var_tester( "Hmm..", 1, 2, 3, 4, 5 );

	return 0;
}

If the code is debugged, it works correctly. But if it is launched without debugging, it doesn’t.

Here are the screenshot.
Correct!

And.. here is the wrong one.
Wrong!

Update : han9kin left a comment which said that this was not a bug. Unix man page explains about it more well. However, I would like to put MSDN explanation here.

“va_arg retrieves a value of type from the location given by arg_ptr and increments arg_ptr to point to the next argument in the list, using the size of type to determine where the next argument starts. va_arg can be used any number of times within the function to retrieve arguments from the list.”

In a code sample following it, they passes -1 as the last parameter, and they check if -1 is retrieved ,and if so they exist the va_arg() loop.

And this hot fix, FIX: The va_arg function returns an incorrect value in a Visual C++ 2005 application , doesn’t explain what it fixes specifically.
Can anyone tell me what “the va_arg function returns an incorrect value” means?

Anyway, in the 1st sample at this site, checks it against NULL. And a sample in this site , uses a number of parameters as its 1st parameter.

Additionally, this site explains interesting topic about variable parameters.





Back to the basic : Pointer to Array, Array of Pointers…

21 05 2008

   To the beginner of C/C++, one of the most confusing concept is the pointer. Although the concept of the pointer is quite understandable, the notation for pointers and arrays seems to confuse people. When I was a freshman, other students kept mumbling, “Pointer to Pointer”, “Pointer to Array”, “Array of Pointer”, “Array Pointer” or “Pointer of Array”. Because it is in English, it made me quite confusing also. Pointer to Pointer is the easiest to understand. But others were very confusing. It is because Korean is different from English. If I set up concept with terms, “Pointer to Array” and “Array of Pointers”, some of my friends* approached to me and asked what “Array Pointer” is or what “Pointer of Array” is. Semantically, “Array Pointer” means a pointer which is made for array, and “Pointer of Array” means “Pointer to Array”. So, the threes, “Array Pointer”, “Pointer to Array” and “Pointer of Array”, were the most confusing. In Korean, “to…” can mean same to “of…”.

   When I came to the U.S., students from other countries didn’t seem to be confused by those. Probably some terms like “Pointer of Array” sound awkwardly. I don’t know if it is also true to native English speakers.

   When I got interviewed from U.S. companies, I was quite impressed that they asked about every detail of C/C++. The questions was quite creative. And it gave me impression that they know how to manage project and programmers. The way they interview is totally different from the way you do in Korea. ( Nowadays, Google imported their interview style to Korea, and I have heard that Haan soft interviews like americans. ) And they asked me how to make 2D array dynamically. I answered my best solution. ( I did lots of experiment with dynamic array when I made a 3D graphics engine before, and found a way which was very flexible and very powerful. This code will be introduced later. ) However, they said that my answer was wrong. And they showed me how to make a “Pointer to Array” and using it, they built 2D arrays. So, I figured out that their intervew question was not creative. I found one book which contained questions they asked me. Hmm.. It turned out that it was not creative questions.
So, I told them why using pointer to array was not flexible, and why my approach was better. They nodded affirmatively. However, if I didn’t answer things, which I actually know, but forgot because I was out of those problems for a while, they thought me that I didn’t know, although I just needed to remind me of those.. I’m quite experienced programmer, but my English capability limited any quick answer or chatting. So, it gave good impression that I was not compelling.

   Anyway, let’s go back to the root. It’s time to refresh things again!
In this post, I will show how to make “Pointer to Array”, and “Array of Pointers”, and also will show shortcomings of array notation and pointer notation to handle multi-dimensional arrays.

   Let’s start with array notation and pointer notation.


void outputUsingArray( int array[][4], int n_rows, int n_cols )
{
	int i, j;

	printf("Output Using array\n");
	for( i = 0; i < n_rows; i++ )
	{
		for( j = 0; j < n_cols; j++ )
		{
			// Either can be used.
			//printf("%2d ", array[i][j] );
			printf("%2d ", *( *(array + i) + j ) );
		}
		printf("\n");
	}
	printf("\n");
}

void outputUsingPointer( int (*array)[4], int n_rows, int n_cols )
{
	int i, j;

	printf("Output Using Pointer to Array i.e. int (*array)[4]\n");
	for( i = 0; i < n_rows; i++ )
	{
		for( j = 0; j < n_cols; j++ )
		{
			printf("%2d ", *(*(array+i) + j ) );
		}
		printf("\n");
	}
	printf("\n");
}

How it is used is :


int _tmain(int argc, _TCHAR* argv[])
{
	int array[4][4] = { { 0, 1, 2, 3 },
						{ 4, 5, 6, 7 },
						{ 8, 9, 10, 11 },
						{ 12, 13, 14, 15 } };

	outputUsingPointer( (int (*)[4])array, 4, 4 );

	outputUsingArray( array, 4, 4 );

   What is a problem with array notation is that dimension of array is static and functions like outputUsingArray() can be used only for the specific dimension.
However, accessing array using array notation is convenient.
Pointer to array notation can be used like outputUsingPointer(); However, this also has the same problem of the outputUsingArray() case.

   If the pointer to pointer is passed, and if the function is designed so, to access row, n_cols*i, term should be used.
It is flexible, because one function can access any dimension of arrays.
However, you can’t use array notation.

   By the way, before presenting the most flexible solution to it, let’s think about making a dynamic array using array of pointers.


	printf("Using array of pointers -- Half Dynamic\n");
	printf("------------------------\n");
	int *array3[4];
	int i;

	for( i = 0; i < 4; i++ )
		*(array3+i) = (int *)malloc( 4*sizeof( int ) );

	array3[0][0] = 0; array3[0][1] = 2; array3[0][2] = 3; array3[0][3] = 4;
	array3[1][0] = 0, array3[1][1] = 2, array3[1][2] = 3, array3[1][3] = 4;
	array3[2][0] = 0, array3[2][1] = 2, array3[2][2] = 3, array3[2][3] = 4;
	array3[3][0] = 0, array3[3][1] = 2, array3[3][2] = 3, array3[3][3] = 4;

	outputUsingPointer3( array3, 4, 4 );
	outputUsingArray3( array3, 4, 4 );

Then, the outputUsingPointer3() and outputUsingArray3() are :


void outputUsingPointer3( int **array, int n_rows, int n_cols )
{
	int i, j;

	printf("Output Using Pointer to Pointer i.e.\n");
	for( i = 0; i < n_rows; i++ )
	{
		for( j = 0; j < n_cols; j++ )
		{
			printf("%2d ", *(*(array+i) + j ) );
		}
		printf("\n");
	}
	printf("\n");
}
void outputUsingArray3( int **array, int n_rows, int n_cols )
{
	int i, j;

	printf("Output Using Array i.e. int array[][]\n");
	for( i = 0; i < n_rows; i++ )
	{
		for( j = 0; j < n_cols; j++ )
		{
			printf("%2d ", array[i][j]);
		}
		printf("\n");
	}
	printf("\n");
}

   As you can see, the array can be passed as a pointer to pointer, and you can access it using array notation and pointer notation easily. It is quite flexbile.
However, the number of colums are static when such an array is declared. So, I called it “Half-flexible”.

   By the way, can’t you pass the static array declared as int array[4][4] to the outputUsingPointer3() by casting it to (int (*)[4])? No. It is not possible. The Visual C++ 2005 compiler doesn’t allow it.
Also it would be handy if the array can be passed as a pointer to pointer but casted to (int (*)[n_cols] ) inside of the function. Then, the function can be used for any dimension of arrays. However, it is not possible also. Even with a cast statement, the compiler doesn’t allow putting a variable as its dimension. Probably, GCC allows it. Because GCC has its own extension which allows variables for setting dimension of arrays. But it is only for the GCC.

casting doesn\'t work

Now, it’s time to introduce the most flexible dynamic array.


	printf("Using Pointer to Pointer -- Fully Dynamic\n");
	printf("------------------------\n");
	int **array2;

	array2 = (int **)malloc( 4* sizeof( int * ) );

	// each row is a 1D array
	for( i = 0; i < 4; i++ )
		*(array2+i) = (int *)malloc( 4*sizeof( int ) );

	array2[0][0] = 0; array2[0][1] = 2; array2[0][2] = 3; array2[0][3] = 4;
	array2[1][0] = 0, array2[1][1] = 2, array2[1][2] = 3, array2[1][3] = 4;
	array2[2][0] = 0, array2[2][1] = 2, array2[2][2] = 3, array2[2][3] = 4;
	array2[3][0] = 0, array2[3][1] = 2, array2[3][2] = 3, array2[3][3] = 4;

	outputUsingPointer3( array2, 4, 4 );
	outputUsingArray3( array2, 4, 4 );

   By using pointer to pointer, instead of pointer to array, to declare multidimensional array, you can achieve the convenience of array notation and the power of pointer notation as well as flexibility. Here, the outputUsingPointer3() and outputUsingArray3() are same as the example above. In short, you can use any notation in any case for your convenience.

   Using pointer to pointer is semantically easy to understand and the most powerful.
This is the way I explained in one of previous interview. In most of books, only using poiner to array is explained, and I found out that many people had difficulty in understanding “Pointer to Array” and “Array of Pointer.”

   If you have difficulty in memorizing what notation is for pointer to array and array of pointer, try this :

int (*array)[4] : (, ) operators have higher priority because it is on the left of the [,]. So, it is a POINTER to [4].
Again.. It is a POINTER. So, it means it is a pointer to array.

int *array[4] : [,] have heigher priority than *, so, it is ARRAY of pointers.
It is not pointer, it is ARRAY.





How to solve weirdness of the high resolution counter

26 04 2008

In a previous post, some issues on QueryPerformanceCounter() was discussed.

Fortunately I found a very good blog, Zooba’s Blog on problems using counters like rdtsc and QueryPerformanceCounter. Because there is additional processing time needed to get the CPU frequency that is used along with the result of rdtsc, or because just approximate frequency is used by looking up a registry, I think it is not good to use the rdtsc.
So, the last option is to use the QueryPerformanceCounter.

There are two issues to solve.

  1. To guarantee the timing starts and ends where you want to do so.
  2. Because of optimization, the compiler may reorder instructions. So, your “Start Measuring” command can be placed earlier and later.

  3. To obtain reliable count.
  4. As it was discussed in the previous post, it doesn’t return reliable count number on multi-processors or multi-core processors.

To solve the 1st problem, special instructions called “serializing instruction” should be called.
According to the Zooba’s Blog, there are 3 of them : iret, rsm, cpuid.
However, the iret and rsm change the instruction pointer. So, they are out. The cpuid is for getting information a cpu. So, it has no harm.
(What is the “serialization instruction“? It is an instruction which forces codes to be serialized. So, instructions in the queue already will be flushed out, and an instruction like cpuid is processed. So, you can ensure that the instruction for starting and stopping measuring will be located as they are expected. )

The 2nd issue is raised especially when the CPU you use is multicore processor or multi processor. Also when your CPU has the speed-step technology, it happens.
However, as it was mentioned in the Zooba’s Blog, the speed-step case is minimized. Because in the code you want to measure its performance, it would make your CPU sweat enough in most cases. So, the most troublesome case is the multi-core, multi-processor case.
How to solve this problem? It is also explained in the Zooba’s blog. (Thank you, Zooba!)
If you set the a specific processor runs the QueryPerformanceCounter(), it will return reliable result. So, the SetProcessorAffinity() or the SetThreadAffinity() can be used.

So, here is the code example.


// performance_measure.h
#ifndef PERFORMANCE_MEASURE
#define PERFORMANCE_MEASURE

#define DECLARE_GLOBAL_FOR_PEFORMANCE_MEASURE()\
    LARGE_INTEGER g_Start_Counter, g_End_Counter, g_Frequency;\
    DWORD g_Old_ProcessAffinityMask,g_New_ProcessAffinityMask, g_SystemAffinityMask;\
    HANDLE hCurrentProcess;

DECLARE_GLOBAL_FOR_PEFORMANCE_MEASURE();

inline void INIT_PERFORMANCE_MEASURE( void )
{
    hCurrentProcess = GetCurrentProcess();
    GetProcessAffinityMask( hCurrentProcess, &g_Old_ProcessAffinityMask, &g_SystemAffinityMask );

    QueryPerformanceFrequency( &g_Frequency );
}   

inline void START_PERFORMANCE_MEASURE( void )
{
    int CPUInfo[4];

    // Serializing Information
    __cpuid( CPUInfo, 0 );  // used the intrinsic version of the cpuid

    g_New_ProcessAffinityMask = 0x01;
    SetProcessAffinityMask( hCurrentProcess, (DWORD_PTR)&g_New_ProcessAffinityMask );

    QueryPerformanceCounter( &g_Start_Counter );

    // Revert to back
    SetProcessAffinityMask(hCurrentProcess, (DWORD_PTR)&g_Old_ProcessAffinityMask );
}

inline void STOP_PERFORMANCE_MEASURE( void )
{
    int CPUInfo[4];

    __cpuid( CPUInfo, 0 );  // Serializing Information
    SetProcessAffinityMask( hCurrentProcess, (DWORD_PTR)&g_New_ProcessAffinityMask );

    QueryPerformanceCounter( &g_End_Counter );

    // Revert to back
    SetProcessAffinityMask(hCurrentProcess, (DWORD_PTR)&g_Old_ProcessAffinityMask );
}

double GET_PERFORMANCE_MEASURE( void )
{
    return ((double)g_End_Counter.QuadPart - (double)g_Start_Counter.QuadPart)/(double)g_Frequency.QuadPart;
}

#endif

Insert above code like this in your code.


#include <windows.h>
#include <intrin.h>
using namespace std;

// This header file contains above code
#include "performance_measure.h"

void matrix_multiplication( void )
{
    ...

    printf("Single\n");

    INIT_PERFORMANCE_MEASURE();

    START_PERFORMANCE_MEASURE();

    start_t = clock();

    for( iteration = 0; iteration < 90000; iteration++ )
    {
        for( i = 0; i < 8; i++ )
            for( j = 0; j < 8; j++ )
            {
                temp = 0;
                for( k = 0; k < 8; k++ )
                {
                    temp += matA[i][k]*matB[k][j];
                }
                matC[i][j] = temp;
            }
    }
    duration_t = clock() - start_t;

    STOP_PERFORMANCE_MEASURE();

    printf("Duration = %f (%f)\n", (double)duration_t/CLOCKS_PER_SEC,
        GET_PERFORMANCE_MEASURE() );

Now, you will get a reliable result.

Hew….





Difference in Concurrency Model in MacOS X and the Windows (3)

25 04 2008

3. Event

Windows is made based-on event-driven model. Therefore, events play very important role on Windows environment, and are used very often whether a programmer make one or use ones provided by the OS. Let’s take a look at how events are used.

Windows는 event-driven 모델을 써서 만들어졌다. 그러므로 event는 상당히 중요한 역할을 하고, 많은 프로그램들이 OS가 제공하는 event를 사용하건, 아니면 해당 프로그램에서 event를 만들건 이 event를 많이 사용한다.
우선 이 event가 사용되는 예를 보자.


int _tmain(int argc, _TCHAR* argv[])
{
    HANDLE hThread[kMaxThreads];

    int i;

    initEvent();

    for( i = 0; i < kMaxThreads; i++ )
    {
	// Threads wait on their events and trigger events for others.
        hThread[i] = CreateThread( NULL, 0, doMultiThreadWay, 0, 0, &gThreadID[i] );

        if( hThread[i] == NULL )
        {
		...
            ExitProcess(i);
        }
        else
        {
		...
        }
    }

   // Until now, all threads are created and wait for their events.

   // set the 1st event, gEvents[0], or fire an event.
    SetEvent( gEvents[0] );

    // Wait until all threads have terminated
    WaitForMultipleObjects( kMaxThreads, hThread, TRUE, INFINITE );

    // Close all thread handles
    for( i = 0; i < kMaxThreads; i++ )
        CloseHandle( hThread[i] );

    destroyEvent();

	return 0;
}

// This is how the events are initialized.
void initEvent( void )
{
    int i;

    for( i = 0; i < kMaxThreads; i++ )
    {
	// events are automatically reset if there are once set.
        gEvents[i] = CreateEvent( NULL, FALSE, FALSE, NULL );

        if( gEvents[i] == NULL )
            outputString( __T("Error in creating events\n"), FOREGROUND_RED | FOREGROUND_INTENSITY );
    }
}

// Threading function
DWORD WINAPI doMultiThreadWay( LPVOID lpParam )
{
    TCHAR msgBuf[kBuffSize];
    size_t cchStringSize;
    DWORD dwChars;
    DWORD threadID;
    DWORD dwWaitResult;
    WORD textColor;

    int i;

    threadID = GetCurrentThreadId();
    if( threadID == gThreadID[0] )
        textColor = FOREGROUND_GREEN | FOREGROUND_RED;
    else if( threadID == gThreadID[1] )
        textColor = FOREGROUND_BLUE | FOREGROUND_RED;
    else
        textColor = FOREGROUND_BLUE | FOREGROUND_GREEN;

    // Thread safe way of outputting
    StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d)\n"), threadID );
    outputString( msgBuf, textColor );

    for( i = 0; i < 5; i++ )
    {
	// Each threat wait for its event.
        if( threadID == gThreadID[0] )
        {
            dwWaitResult = WaitForSingleObject( gEvents[0], INFINITE );
            outputString(__T("First thread says \"Do It\" to the second thread\n"), textColor );
		// An event, e.g. gEvents[0], is automatically reset.
            SetEvent( gEvents[1] );
        }
        else if ( threadID == gThreadID[1] )
        {
            dwWaitResult = WaitForSingleObject( gEvents[1], INFINITE );
            outputString(__T("Second thread says \"Do It\" to the third thread\n"), textColor );
            SetEvent( gEvents[2] );
        }
        else
        {
            dwWaitResult = WaitForSingleObject( gEvents[2], INFINITE );
            outputString(__T("Third thread says \"Do It\" to the First thread\n\n"), textColor );
            SetEvent( gEvents[0] );
        }

    }

    return 0;

}

What the threading function does is to wait for their event and trigger the next event. It is to wake up threads one by one.
This illustrates the effect of using events.

위에 있는 쓰레드 함수가 하는 것은, 각 쓰레드에 대응하는 이벤트를 기다리다가, 자기 것이 트리거되면, 해당 쓰레드가 다음의 이벤트를 fire함으로써, 다음번 쓰레드가 깨어나게 하는 것이다.
이런 행동이 바로 이벤트를 사용함으로써 얻고자 하는 효과이다.

If you don’t want to read the whole text above, here is the screenshot which will help you what the codes do.

위의 긴 글을 읽기 싫다면, 다음의 스크린샷을 보면 위의 코드가 무엇을 하는지 대번에 눈치를 챌 수있을 것이다.
What the codes do.

As it has always been, events can be implemented using mutex or semaphore. However, using events will simplify things.

역시 여기서도 생각해 볼 수있는 것이, 이 Event라는 것도 semaphore나 mutex을 이용하면 구현할 수있을 거라는 생각이다. 하지만 event를 사용하면 편리하게 구현을 할 수가 있다.

It is characteristic that there are functions like WaitForSingleObject() and WaitForMultipleObjects(), and this makes the Windows different from other OSes like Unix. So, a student who learned multiprocessing and parallel computing model based on Unix and other Oses than Windows can be confused.
However, it is also easy and reasonable model, and there is no problem in learning this Windows model.

이상에서 살펴본 Win32에서의 synchronization 모델에는 그 특징이 있다.
Critical Section, Mutex, Semaphore, Event등을 선언하고 세팅한 후, WaitForSingleObject()와 같은 함수를 이용해서 해당 상황이 발생하는지 기다리는 것이다. 이것이 주목해야 할 Win32의 synchronization 프로그래밍 모델이다.
무척 이해하기가 쉽고 논리적으로 설계가 되었지만, 다른 OS에는 이런 WaitForSingleObject()와 같은 함수가 없다. 그러므로 Unix와 같은 다른 OS에서 프로그래밍을 하다가 Windows에서 하게 되었을때, 혼동을 일으킬 수있다.

Windows multithreading (MFC)

MFC contains lots of wrappers to Win32 data types and their behaviour. So, it is framework.
MFC는 바로 이상의 것들을 감싸서 사용하기 쉬운 클래스로 만들어준 것이다. 즉 Framework인 것이다.

However, the MFC wrappers to synchronization do more than that.
It makes the synchronization model of Windows look similar to that of the Unix.
Let’s take a look at an example.

그런데 synchronization에 관해서 MFC의 wrapper들은 단순히 wrapping해서 쓰기 쉽게만 해주는 것이 아니라, 그 모델을 Unix의 그것과 비슷하게 해준다.
자 예를 한번 보자.


// Global Mutex Object
CMutex g_m;
int g_C;

UINT ThreadFunction1(LPVOID lParam)
{
    // Create object for Single Lock using the mutex
    CSingleLock lock(&g_m);

	// try obtaining a lock.
    lock.Lock();

    // code block protected by the lock.
	...

	// release the lock
    lock.Unlock();

    return 0;
}

UINT ThreadFunction2(LPVOID lParam)
{
    // Single Lock Construct Mutex
    CSingleLock lock(&g_m);

   // If the other thread already obtained the lock, this thread will wait here.
    lock.Lock();

    // code block protected by the lock.
	...

    lock.Unlock();

    return 0;
}

Where the Lock() function is located is comparable to the lines where WaitForSingleObject() is used in Win32.
For critical section, i.e. CCriticalSection, can be also implented by replacing g_m with a CCriticalSection. So, for mutex, semaphore, event, and critical section, the style how they are locked and and unlocked are the same.
This is the major difference between the Win32 model and the MFC model.

Anyway, where it is locked and unlocked are similar to the model for the Unix.

Lock() 메소드가 쓰여진 부분이 바로, Win32의 경우에 WaitForSingleObject()가 쓰여진 부분에 대응한다고 볼 수있다.
MFC에서는 그 locking variable이 뭐던간에, 즉 critical section이냐, mutex냐, event냐에 상관없이 모두 같은 프로그래밍 모델을 제공한다. 즉 위의 코드에서 CMutex로 선언된 부분을 CCriticalSection으로 바꾸면, 거의 코드를 고칠 필요없이, 그대로 사용할 수있게 된다. 즉 다시 말하자면, 다른 locking variable에 대해서 통합된 모델을 제공한다는 것이다.

아무튼 전체적으로 lock을 하고 unlock을 하는 부분이 Unix를 닮은 부분이다.

So far, we tried figuring out how synchronization looks like on the Windows.
In the next post, let’s try the Objective-C and Cocoa case.

자 이상으로 Windows에서의 synchronization에 대해서 알아보았다.
다음에는 Objective-C와 Cocoa의 경우를 살펴보기로 하자.





Difference in Concurrency Model in MacOS X and MS Windows (2)

23 04 2008

This post is the 2nd part of the previous post a while ago. As I promised before, this series of post is written in English and Korean.

  OK. It is time to return back to this issue, “multi-threading design” on Windows and Mac. When I studied multi-threading and synchronization on Windows after learning Unix, it was a little confusing. Although those on Windows is easy to learn and similar to those on Unix, there are some difference. The reason of difference comes from how the functions and facilities are designed.
Basically they share the same model. However, they present it in slightly different

자 한동안 잊고 지냈던 multi-threading에 대한 이야기를 해보자. Unix를 배우고 나서, Windows의 muti-threading과 synchronization에 대해서 공부를 하게 되면, 약간 좀 헷갈리는 면이 생긴다. 상당히 흡사하면서도, 익히기 쉽게 되어 있는 Windows의 그것은 하지만 좀 다른 면도 있다. 그 이유는 어떻게 해당 함수들을 디자인했는가에 기인한다.

In this post, the facilities provided by the Windows for multi-threading are presented, and let’s figure out how to use them. In next post, those for Objective-C and Cocoa will be explained.
이 글에서는 multi-threading을 위해 Windows에서 마련해 놓은 여러 장치들을 알아보고, 그 쓰는 법을 간단히 살펴본다. 그리고 다음번에는 Objective-C와 Cocoa등 Apple이 접근하는 방법을 알아보기로 하자.

1. Synchronization in Win32

1.1 Critical Section

The critical section seesm to be the simplest synchronization method. By embracing a code block with two functions, it enables mutually-exclusive access to the block.

이 critical section은 개인적으로 볼때 가장 간단한 synchronization 방법이 아닌가 한다.  일련의 코드 블럭을  감싸는 두 함수를 호출함으로써, 해당 블럭에 대한 배타적 접근을 가능하게 한다.


 for( i = 0; i < 5; i++ )
 {
#ifdef USE_CRITICAL_SECTION
 	EnterCriticalSection( &gCriticalSection );

        // Thread safe way of outputting
        StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d) : %d\n"), threadID, i );
        outputString( msgBuf, textColor );

        LeaveCriticalSection( &gCriticalSection );
#endif
}

The EnterCriticalSection() and the LeaveCriticalSection() are those two functions.

EnterCriticalSection()과 LeaveCriticalSection()이 바로 그 두 함수이다.

1.2 Mutex ( Mutually Exclusive Semaphore ) & Semaphore

The Windows prepares special functions for realizing mutex, or more generally semaphore : CreateMutex(), CreateSemaphore(), WaitForSingleObject(), WaitForMultipleObjects(), ReleaseMutex(), and ReleaseSemaphore().

윈도우즈에선 mutex 혹은 좀더 일반적으로 말하자면 semaphore를 처리하기 위해서 특별한 함수들을 준비해 놓고 있는데, 바로 CreateMutex(), CreateSemaphore(), WaitForSingleObject(), WaitForMultipleObjects(), ReleaseMutex(), ReleaseSemaphore()와 같은 함수들이다.

Mutexs and Semaphores are created by calling CreateMutext() and CreateSemaphore(), respectively. After creating them, a code block can be accessed as seen below.

Mutex와 Semaphore는 각각 CreateMutex()와 CreateSemaphore()를 호출함으로써 만들어지고, 일단 만들어진 후에는 다음에 보이는 것처럼 코드 블락을 억세스하는데 사용할 수있다. ( 아니 오히려 억세스를 regulate한다라고 봐야하겠다. )


        dwWaitResult = WaitForSingleObject( gMutex, 5000L );
        switch( dwWaitResult )
        {
        case    WAIT_OBJECT_0:
                __try
                {
                    // Thread safe way of outputting
                    StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d) : %d\n"), threadID, i );
                    outputString( msgBuf, textColor );
                }
                __finally
                {
                    if( !ReleaseMutex( gMutex ) )
                    {
                        // Save old attribute for a console
                        WORD wPrevColorAttrs = normalTextCsbiInfo.wAttributes;

                        // Now, write in Red
                        if( !SetConsoleTextAttribute( hStdout, FOREGROUND_RED ) )
                        {
                            MessageBox( NULL, __T("SetConsoleTextAttribute"), __T("Console Error"), MB_OK );
                            break;
                        }

                        // Thread safe way of outputting
                        StringCchPrintf( msgBuf, kBuffSize, __T("doMultiThreadWay (%d) : Error in releasing mutex\n"), threadID );
                        outputString( msgBuf, FOREGROUND_GREEN );

                        if( !SetConsoleTextAttribute( hStdout, wPrevColorAttrs ) )
                        {
                            MessageBox( NULL, __T("SetConsoleTextAttribute"), __T("Console Error"), MB_OK );
                            break ;
                        }
                    }

                    break;
                }

        case WAIT_TIMEOUT:
            break;

        case WAIT_ABANDONED:
            break;
        }

So, when a thread which is at after the WaitForSingleObject() line releases the mutex by calling ReleaseMutex(). Then next thread waiting at the line WaitForSingleObject() get the mutex, blocks other thread to get the mutex, and proceeds.

WaitForSingleObject()를 넘어간 쓰레드는, mutex를 획득한 것인데, ReleaseMutex()를 호출함으로써 mutex를 놓게 된다. 그러면 WaitForSingleObject()에서 기다리고 있던 다음의 쓰레드가 이제 mutex를 획득하고, 처리를 계속해 나간다.

Simple, isn’t it?
What is somewhat different from the Unix model is to use calls like WaitForSingleObject(). However, it is quite easy to understand and manipulate.

간단하지 않은가?
이런 모델이 Unix의 모델과 다른 점은 WaitForSingleObject()와 같은 함수를 씀으로써 달라지는 형식이다. 하지만 이런 Windows의 방식도 굉장히 이해하기 쉽고, 다루기가 쉽다.

Actually, at this point, you may wonder why the critical section is necessary. You can implement critical section using mutex. Then why are there the critical section? Actually some OSes don’t have the critical section. Anyway, to understand the difference and similarity, please read MSDN document at Critical Section Objects.

이 시점에서, 왜 critical section이 필요한지 궁금할 수있다. 즉 mutex를 이용하면 critical section을 구현할 수가 있는데, 굳이 왜 critical section이란 것을 만들까?
실제로 어떤 OS에는 critical section이 없는것도 있다. 자 우선 MS의 critical section과 mutex등의 차이점에 대해선 MSDN의 Critical Section Objects라는 문서를 참조해 보자.

“A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process. Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction). Like a mutex object, a critical section object can be owned by only one thread at a time, which makes it useful for protecting a shared resource from simultaneous access. Unlike a mutex object, there is no way to tell whether a critical section has been abandoned.”

The clear difference is that critical section can be used only for the threads of a single process. And in that case, it is faster.

결정적인 차이는 바로 critical section은 single process의 thread에서만 쓸 수있다는 것이고, 그럴 경우 속도가 빠르다는 것이다.

One good example for things which make us confusing when we develop on many different OSes is this critical section. On some lines above, I said that some OSes didn’t have the critical section. Well, to make things more correct, I should revise the statement. It’s wrong. The concept of critical section exist on all multiprocess, multithreading OSes. If you use mutex to force atomic access to some code blocks, then it is the critical section. On the other hand, the critical section mentioned on a MSDN page is the MS’s special structure, CRITICAL_SECTION, rather than critical secition as general concept. A code example is here :

여러 OS에서 프로그래밍을 하다보면 헷갈리게 되는 게 생기는데, 그 좋은 예가 바로 이 critical section이다. 앞에서 잠깐 어떤 OS에선 critical section이 없다고 이야기 했는데, 지금와서 밝히자면 이 말은 좀 잘못된 말이다. critical section의 개념은 다 존재한다. mutex를 이용해서 특정 블럭에 대해서 atomic access를 하게 하면, 그게 critical section이다. 반면에 위의 MSDN 문서에서 언급하는 critical section이란 일반적 개념으로써의 critical section이 아니라 다음과 같은 코드로 만들어질 수있는, MS가 만든 특별한 구조체인 CRITICAL_SECTION이다.


CRITICAL_SECTION gOutputCriticalSection;
InitializeCriticalSectionAndSpinCount( &gCriticalSection, 0x80000400 );

So, it is rather safe not to think, “Oh, there is no critical section on xxx OS.”.
그러므로 Windows에서 프로그래밍을 하다가 혹 다른 OS에서 하게 될 경우 “critical section이 없네”하는 생각은 하지 않는게 옳다.





QueryPerformanceCounter() equivalent on Mac OS X

20 04 2008

Timer is quite an issue to some people who need to process image in realtime or who want to measure very fast code.

Because the QueryPerformanceCounter() and QueryPerformanceFrequency() are discussed in my previous post, one will raise a question, “Is there a similar function for the Mac OS X?”.

Yeah.. Actually, my blog stat showed that a few people searched with that term.

I found some functions like mach_absolute_time() and mach_timebase_info().

You can read very nice explanation here at the MacResearch and here at the Apple’s Q&A page.





Weirdness of the High Resolution Counter, i.e. QueryPerformanceCounter()

19 04 2008

For the most of time, using clock() for measuring performance for a block can be enough.
However, there are some cases where you want to compare two logically identical but differently implemented blocks.
Let’s assume that you want to compare performance of intrinsic version of strcpy and your own implementation of strcpy block written in SIMD instructions.
In most case, the clock()-based functions, like clock() and GetTickCount(), will not reveal the difference between them.

So, you decided to use high performance, or high resolution timer. The Windows supports these two functions for that purpose.

1. QueryPerformanceCounter( LARGE_INTEGER *pVal )
This function is like the clock(). the value returned in a location pointed by pVal is the number of counts, just like that the clock() returns number of ticks.

2. QueryPerformanceFrequency( LARGE_INTEGER *pVal )
This returns how many times it ocillates per a second.

So, the duration of time can be obtained by


    LARGE_INTEGER aVal, aFreq;
    __int64 durataion_in_time;

    QueryPerformanceCounter( &aVal );
    QueryPerformanceFrequency( &aFreq );
    duration_in_time = aVal.QuadPart / aFreq.QuadPart;

However this has some glitches with contemporary CPUs.

Before mentionting the glitch, let’s take a look at how the LARGE_INTEGER is declared.


typedef union _LARGE_INTEGER {
    struct {
        DWORD LowPart;
        LONG HighPart;
    };
    struct {
        DWORD LowPart;
        LONG HighPart;
    } u;
    LONGLONG QuadPart;
} LARGE_INTEGER;

The LONGLONG is __int64 type. So, if your compiler and CPU supports 64bit data type, you can access the content of the LARGE_INTEGER with the QuadPart.

The 1st glitch is that the returned value easily exceeds the boundary of the 64bits for the QuadPart, because current CPUs are so fast.
(If you search on the Google, you will find some web pages on which people explain that it exceeds the 32bit boundary.
And they recommend to use 64bit data type. Well, actually it even exceeds the 64bit boundary. )
So, probably you can use unsigned __int64 instead.

The 2nd glitch is that you can’t print them out properly when you use %I64d for aVal.QuadPart/aFreq.QuadPart.
Even %Lf doesn’t solve the problem. They are all for 64bit integer and real numbers. Then how to display them properly?


printf("%f", (double)aVal.QuadPart/(double)aFreq.QuadPart);

double is also 64bit real number type, and it works.

The 3rd glitch is the real glitch.
Let’s take a look at this screenshot from real invocation of the code.

Hmm… Why the high performance counter is not reliable?
By searching on the Google, I found a clue that it was due to the speed-step or similar technology which changes the CPU speed on demand.
Because it has very high resolution, it has the glitch.
I read somewhere in Intel’s forum that Intel or MS was working on making the call to measure on the FSB side instead of inner core of the CPU.
By doing so, it is said that the function would return more reliable value even when battery-saving technology in a CPU is used.

I assume that the GetTickCount() Win32 API function is also based on the clock(). However, it displays somewhat expected result seemilgy reliably.
The clock()/CLOCKS_PER_SEC displays 2 and 1.9… from time to time.

Probably the GetTickCount() has the lowest resolution.
However, one convenient side of using the GetTickCount() is that it returns a value in millisecond, if you want “time” instead of number of ticks.
So, you don’t need to divide it by some constant like CLOCKS_PER_SEC. Then it should be renamed to GetTickTime().
Well.. the function name again misleads, but it is the brain-child of the MS.

Finally, here is a screenshot when all of them return good results. :)





GCC comes with Mac OS X 10.4.x and 10.5.x doesn’t support flexible array member

11 04 2008

According to GCC manual, it supports flexible array member.


struct foo { int x; int y[]; };
    struct bar { struct foo z; };

    struct foo a = { 1, { 2, 3, 4 } };        // Valid.
    struct bar b = { { 1, { 2, 3, 4 } } };    // Invalid.
    struct bar c = { { 1, { } } };            // Valid.

The lines commented as valid should be compiled without any error. However, the GCC 4.x installed with the Xcode 2.5 and 3.x on OS X 10.4.x and 10.5.x respectively doesn’t compile it without errors.

I reported this bug to the Apple.

NEW on April, 22, 2008
I got message from Apple.

This is a follow-up to Bug ID# 5857390. Engineering has determined that this issue behaves as intended based on the following information:

Page 232 of the GCC manual available at http://gcc.gnu.org/onlinedocs/gcc-4.2.3/gcc.pdf states that:

“To avoid undue complication and confusion with initialization of deeply nested arrays, we simply disallow any non-empty initialization except when the structure is the top-level ob ject.”

Compiling the example on page 232 will produce an error based on the two disallowed statements specifically marked as “Invalid”. Compiling the file with only the “Valid” lines works correctly.