When not to use threads, part 2

September 25, 2010

The lesson learned from the previous experiment was that threads (at least pthreads on Linux) aren’t lightweight. They take quite some time to create and start executing. If your task is of pretty short duration you’ll spend more time setting up the thread than you will actually doing work. They also take up considerable memory. You’ll want to be considerate of these things when designing something using threads.

For my logging example task there’s another design that might be superior. Instead of starting a thread for each bit of data to be logged I could create a single thread that would log data and sleep. If more data is logged it can be passed to the sleeping thread and written. This will eliminate the setup time for each data write and greatly reduce the amount of system resources used if many writes are done in succession.

Let’s test that. Here’s the code to implement a single thread version:


// queue for log data
std::queue< std::string > LogQueue;
pthread_cond_t myconvar;
BMutex QueueLock;

// thread function
void* QueueThread( void* arg )
{
 while ( true )
 {
 QueueLock.lock();
 pthread_cond_wait( &myconvar, &QueueLock._mutex );
 while ( ! LogQueue.empty() )
 {
 fwrite( LogQueue.front().c_str(), LogQueue.front().size(), 1, fd );
 LogQueue.pop();
 }
 QueueLock.unlock();
 }
}

{
 fd = fopen( "log4.txt", "wt" );

 pthread_setconcurrency( 2 );

 pthread_attr_t attr;
 pthread_attr_init( &attr );
 pthread_attr_setdetachstate( &attr, PTHREAD_CREATE_JOINABLE );
 pthread_attr_setstacksize ( &attr, 65536 );

 pthread_cond_init( &myconvar, NULL );

 pthread_t thread;
 int rc = pthread_create( &thread, &attr, QueueThread, NULL );

 clock_t start = clock();

 for ( int i = 0; i < tests; i++ )
 {
 QueueLock.lock();
 // queue the data
 LogQueue.push( sz );
 if ( LogQueue.size() % 4 == 0 )
 // signal the thread to write it
 pthread_cond_signal( &myconvar );
 QueueLock.unlock();
 }

 clock_t stop = clock();

 double d = ((double)(stop-start))/ CLOCKS_PER_SEC;
 printf( "%8.3f seconds (%d clocks)\n", d, (int)(stop-start) );

 // wait for queue to empty
 bool done;
 do
 {
 QueueLock.lock();
 done = LogQueue.empty();
 QueueLock.unlock();
 BSleep::sleep( 10 );
 } while ( ! done );

 pthread_cancel( thread );
 pthread_cond_destroy( &myconvar );
 pthread_attr_destroy( &attr );
 fclose( fd );
}

I’m using pthreads condition variables to communicate to the sleeping thread when new data is available.

How does this perform?

This design is much better than the previous one.

I changed the test to use 10,000,000 iterations this time because the difference between the two designs was much lower. I needed that many iteration to magnify any differences enough to make them visible.

Manual data write:                0.920 seconds (920000 clocks)
Threaded write:                   14.640 seconds (14640000 clocks)

This is many orders of magnitude better than the previous design but the threaded code is 15.9 times slower still! This pretty much proves that none of the threaded designs was in any way superior to just writing the data directly to a file. The logging frameworks in all my programs aren’t going to use threading!

Have a great day

++djs


When not to use threads

September 19, 2010

As an exercise I recently tried applying threading to the task of logging. I wanted a very high performance program activity logging framework so I tried to determine if threading could provide any benefits. What I learned is a perfect example of when not to use threads.

The design:

To minimize overhead for the process that writes the log all manipulation of log information is moved to other systems. All it needs to accomplish is saving the information in an accessible location as fast as possible. The fastest logging framework basically just writes text to a disk file.

The disk system on modern computers is fast but it’s still very slow compared to using cache or random access memory. If the main process could pass the data to be logged to another thread then it could perform the file write operation and the main thread could continue to other tasks more quickly.

So, in summary:

The main process will:

* Copy the data to be logged.
* Start a new thread and pass the logged data to it.
The thread will:
* Establish a lock on the output file
* Write the log data to the file
* Release the file lock
* Delete the copy of the log data

You might wonder why the main process copies the data to be logged. Unfortunately you can’t just pass a pointer to the data and let the thread write it. If the main process erases or deallocates the log data string before the thread has written it then logging the wrong data, or program crashes, will occur.

The file will need to have a MUTEX to control access. The I/O operations in the C standard library are not guaranteed to be thread safe on all platforms. If you want to share a log from multiple threads you can’t share the same file. A single MUTEX guarantees sequential access and the common logging code eliminates the need for multiple file handles.

In this logging framework I don’t need to guarantee the order of the log entries, which makes the code a lot simpler. A date and time stamp on each log entry can be added to order the log if that’s desired.

The test system:

* A Quad Core Intel processor
* Operating system is Ubuntu 10.04

The code:

Test 1:
Here’s the base line for the tests. Write a string to a file handle 5000 times and determine how long it takes.

{
 // the open and close are done at initialization time
 // This will measure the run time performance
 fd = fopen( "log1.txt", "wt" );

 clock_t start = clock();

 for ( long i = 0; i < tests; i++ )
 {
 fwrite( sz.c_str(), sz.size(), 1, fd );
 }

 clock_t stop = clock();
 double d = ((double)(stop-start))/ CLOCKS_PER_SEC;
 printf( "%8.3f seconds (%d clocks)\n", d, (int)(stop-start) );

 fclose( fd );
}

Results:

5000 writes requires less than 10 milliseconds for all tests run. This is the minimum time resoltion of the clock() function. This test always returns zero clocks/zero seconds.

Test 2:
Test sleep lock threaded write to file. This code was much more complicated. The pthread library doesn’t seem to release system resources if the thread exits. You have to manually release it with a detach call. I also found that setting the concurrency level was necessary when using more than 380 threads on my system. If you don’t do this then it tries to allocate resources for all of the threads and just refuses to work correctly after it hits a failure condition. You also should set the stack size for the threads. The default value is much larger than this example requires.

The MUTEX is a “Sleep lock”. It will block the thread execution and sleep until the lock is obtained.

// this function is executed by the thread
void* thread_function( void* arg )
{
 // get a copy of the input string
 std::string* sz = (std::string*) arg;
 const char* psz = sz->c_str();
 int n = sz->size();

 FileLock.lock();
 fwrite( psz, n, 1, fd );
 FileLock.unlock();

 delete sz;
 // doesn't seem to reclaim storage unless you detach yourself... even if you exit
 pthread_detach( pthread_self() );
 pthread_exit( 0 );
}

// this is the test code:
{
 fd = fopen( "log2.txt", "wt" );

 pthread_setconcurrency( ConcurrentThreads );

 pthread_attr_t attr;
 pthread_attr_init( &attr );
 pthread_attr_setdetachstate( &attr, PTHREAD_CREATE_JOINABLE );
 pthread_attr_setstacksize ( &attr, 65536 );

 pthread_t thread[tests];

 clock_t start = clock();

 for ( int i = 0; i < tests; i++ )
 {
 // pass a copy of the string to be logged for the thread
 std::string* psz = new std::string;
 *psz = sz;
 // if necessary wait for threads to be created
 while ( true )
 {
 int rc = pthread_create( &thread[i], &attr, thread_function, (void*)psz );
 if ( rc == 0 )
 break;
 if ( rc == EAGAIN )
 // out of resources. wait for existing threads to finish
 continue;
 throw new std::runtime_error("pthread failed");
 }
 }

 clock_t stop = clock();
 double d = ((double)(stop-start))/ CLOCKS_PER_SEC;
 printf( "%8.3f seconds (%d clocks)\n", d, (int)(stop-start) );

 BSleep::sleep( 100 );

 pthread_attr_destroy( &attr );
 fclose( fd );
}

Results:

The concurrence is the number of simultaneous threads requested. The result was highly variable. Some tests required more than 7 seconds! The outliers in the data have been discarded.

set to concurrence 4
 0.300 seconds (300000 clocks)

 0.840 seconds (840000 clocks)

 0.140 seconds (140000 clocks)

 0.120 seconds (120000 clocks)

set to concurrence 8
 0.140 seconds (140000 clocks)

 0.240 seconds (240000 clocks)

 0.140 seconds (140000 clocks)

 0.350 seconds (350000 clocks)

 0.280 seconds (280000 clocks)

 0.340 seconds (340000 clocks)

 0.280 seconds (280000 clocks)

set to concurrence 12
 0.150 seconds (150000 clocks)

 0.270 seconds (270000 clocks)

 0.090 seconds (90000 clocks)

 0.090 seconds (90000 clocks)

 0.130 seconds (130000 clocks)

set to concurrence 32
 0.160 seconds (160000 clocks)

 0.230 seconds (230000 clocks)

 0.130 seconds (130000 clocks)

Test 3:
Test spin lock threaded write to file. This is the same as test 2 with a different MUTEX locking method.

The MUTEX is used here as a “Spin lock”. It will block the thread execution and burn processor cycles in a tight loop until the lock is obtained. If the wait time for the lock is less than the time required for the thread to go to sleep and then wake up you gain time at the expense of power usage and wasted processor cycles.

// thread function with a spin lock instead of a sleep lock
// just burn cpu cycles instead of trying to release to another thread
void* spin_function( void* arg )
{
 // get a copy of the input string
 std::string* sz = (std::string*) arg;
 const char* psz = sz->c_str();
 int n = sz->size();

 while ( FileLock.tryLock() == false )
 ;
 fwrite( psz, n, 1, fd );
 FileLock.unlock();

 delete sz;
 // doesn't seem to reclaim storage unless you detach yourself... even if you exit
 pthread_detach( pthread_self() );
 pthread_exit( 0 );
}

Results:

The concurrence is the number of simultaneous threads requested. The results were again highly variable.

set to concurrence 4
 0.110 seconds (110000 clocks)

 0.170 seconds (170000 clocks)

 0.470 seconds (470000 clocks)

 0.490 seconds (490000 clocks)

set to concurrence 8
 0.190 seconds (190000 clocks)

 0.340 seconds (340000 clocks)

 0.450 seconds (450000 clocks)

 0.410 seconds (410000 clocks)

 0.160 seconds (160000 clocks)

 0.120 seconds (120000 clocks)

 0.150 seconds (150000 clocks)

set to concurrence 12
 0.500 seconds (500000 clocks)

 0.280 seconds (280000 clocks)

 0.220 seconds (220000 clocks)

 0.540 seconds (540000 clocks)

 0.940 seconds (940000 clocks)

set to concurrence 32
 0.260 seconds (260000 clocks)

 0.300 seconds (300000 clocks)

 0.650 seconds (650000 clocks)

Conclusions:

Threading has so much overhead compared to the fwrite() call that it’s dramatically worse in performance for this application. The threaded code was also highly variable in performance (probably due to resource usage by the system).  The use of a spin lock instead of a sleep lock actually made the performance here markedly worse. I would have expected it would have perfomed better than the sleep lock.

* Threading can only be used efficiently in applications where the function being performed takes considerable time compared with the setup and tear down for threads (roughly 100 milliseconds for my system).
* It should also not be relied upon for time sensitive operations due to its very high variability.
* The number of threads did not make significant difference in the performance.

++djs