I don't write for absolute beginners, so this post is for someone who already knows at a high level how threads work but does not know the low-level machinery yet.
Some facts on threads first:
- A thread is the smallest unit of execution that the operating system scheduler can run.
- A process can have one or more threads.
- Threads inside the same process share the same address space.
- Each thread has its own stack: function calls, local variables, return addresses.
- Threads share the heap, global variables, file descriptors, sockets, and most process resources.
- Concurrency means multiple tasks can make progress during the same time window.
- Parallelism means multiple tasks are literally running at the same instant on different CPU cores.
- The useful part: while one thread is blocked on I/O, another thread can use the CPU.
- The scary part: because threads share memory, two innocent-looking lines of C can corrupt your program in very creative ways.
The last point is what this whole article is going to be about.
So let's lock in.
What pthreads actually gives you
In C, when people say "threads", they usually mean POSIX threads, also called pthreads.
Compile programs using pthreads like this:
gcc main.c -pthread -o main
That -pthread flag matters. It tells the compiler and linker that this program uses threading support.
The smallest possible pthread program looks like this:
#include <pthread.h>
#include <stdio.h>
void *worker(void *arg) {
int id = *(int *)arg;
printf("hello from worker %d\n", id);
return NULL;
}
int main(void) {
pthread_t t;
int id = 1;
pthread_create(&t, NULL, worker, &id);
pthread_join(t, NULL);
printf("main is done\n");
return 0;
}
There are two important functions here.
pthread_create starts a new thread.
pthread_join waits for that thread to finish.
If you remove the pthread_join, main can return before the worker thread ever runs, and returning from main ends the whole process, threads included. This is the same shape of bug as launching a goroutine and then letting main return immediately. The child thread was created, but nobody promised that the process would stay alive for it.
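pthread_join can also hand back the pointer that the worker returned, which is a simple way to get a result out of a thread. The following is a minimal sketch, not a canonical pattern: square_worker and square_in_thread are hypothetical names, and it assumes that aborting via assert is acceptable when thread creation fails.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Worker: reads an int argument and returns a heap-allocated result. */
static void *square_worker(void *arg) {
    int n = *(int *)arg;
    int *result = malloc(sizeof *result);
    if (result != NULL) {
        *result = n * n;
    }
    return result; /* handed to whoever joins this thread */
}

/* Hypothetical helper: runs square_worker in a new thread and waits for it. */
int square_in_thread(int n) {
    pthread_t t;
    void *ret = NULL;

    int rc = pthread_create(&t, NULL, square_worker, &n);
    assert(rc == 0); /* sketch: abort if creation fails */

    rc = pthread_join(t, &ret); /* blocks until the worker returns */
    assert(rc == 0 && ret != NULL);

    int value = *(int *)ret;
    free(ret);
    return value;
}
```

Passing &n is safe here only because the function joins the worker before n goes out of scope.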
Now let us create two threads.
#include <pthread.h>
#include <stdio.h>
void *worker(void *arg) {
int id = *(int *)arg;
printf("worker %d\n", id);
return NULL;
}
int main(void) {
pthread_t t1, t2;
int a = 1;
int b = 2;
pthread_create(&t1, NULL, worker, &a);
pthread_create(&t2, NULL, worker, &b);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
return 0;
}
The output may be:
worker 1
worker 2
or:
worker 2
worker 1
Both are correct.
Starting a thread is not the same thing as deciding exactly when it runs. You asked the OS to make the thread runnable. After that the scheduler gets involved, and the scheduler does not owe you a friendly sequential story.
This is where most concurrency confusion starts. The code is written in one order. The execution may happen in many orders.
Race condition
Consider a simple counter.
#include <pthread.h>
#include <stdio.h>
#define N 1000000
int counter = 0;
void *increment(void *arg) {
for (int i = 0; i < N; i++) {
counter++;
}
return NULL;
}
int main(void) {
pthread_t t1, t2;
pthread_create(&t1, NULL, increment, NULL);
pthread_create(&t2, NULL, increment, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
printf("counter = %d\n", counter);
return 0;
}
You expect:
counter = 2000000
But you may get:
counter = 1372841
or:
counter = 1849920
or sometimes even the correct answer, just to gaslight you.
Why?
Because counter++ is not one operation.
At the machine level it is more like:
load counter from memory into register
add 1 to register
store register back into memory
Now imagine two threads doing this at the same time:
counter is 10
thread A loads 10
thread B loads 10
thread A adds 1 -> 11
thread B adds 1 -> 11
thread A stores 11
thread B stores 11
Two increments happened, but the counter only increased by one.
This is a race condition. The result depends on timing. And timing is not something you control.
More formally: a data race happens when two threads access the same memory location concurrently, at least one access is a write, at least one access is not atomic, and no synchronization orders the accesses.
In C/C++, a data race is not just "maybe wrong". It is undefined behavior. The compiler is allowed to assume that data races do not happen and optimize under that assumption.
That sounds academic until the optimizer turns your bug from "wrong value sometimes" into "this loop never exits in production".
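The "loop never exits" failure usually shows up when a thread spins on a plain variable: since the compiler may assume no other thread writes it, it can read the flag once and keep the cached value. Here is a hedged sketch of the repaired shape using an atomic flag. The names stop_requested, spinner, and run_spinner_demo are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* With a plain `bool`, the compiler could legally read the flag once and
   turn the loop below into an infinite loop. An atomic load cannot be
   treated that way, so the store from the other thread is observed. */
static atomic_bool stop_requested = false;
static atomic_ulong iterations = 0;

static void *spinner(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_requested)) {
        atomic_fetch_add(&iterations, 1);
    }
    return NULL;
}

/* Hypothetical driver: starts the spinner, asks it to stop, and joins it.
   Returns true once the spinner has exited. */
bool run_spinner_demo(void) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, spinner, NULL);
    assert(rc == 0);

    atomic_store(&stop_requested, true);
    rc = pthread_join(t, NULL); /* returns because the flag is observed */
    assert(rc == 0);
    return true;
}
```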
Critical sections
A critical section is a region of code that must be executed by only one thread at a time.
For the counter example, this is the critical section:
counter++;
Only one thread should be allowed to read-modify-write counter at once.
The classic tool for this is a mutex.
Mutexes
Mutex stands for mutual exclusion.
Meaning: if one thread is inside the protected section, every other thread must wait outside.
Here is the fixed counter:
#include <pthread.h>
#include <stdio.h>
#define N 1000000
int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void *increment(void *arg) {
for (int i = 0; i < N; i++) {
pthread_mutex_lock(&lock);
counter++;
pthread_mutex_unlock(&lock);
}
return NULL;
}
int main(void) {
pthread_t t1, t2;
pthread_create(&t1, NULL, increment, NULL);
pthread_create(&t2, NULL, increment, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
printf("counter = %d\n", counter);
pthread_mutex_destroy(&lock);
return 0;
}
Now each increment happens with exclusive access.
The lock/unlock pair creates a boundary:
pthread_mutex_lock(&lock);
// shared data access
pthread_mutex_unlock(&lock);
The rule is simple:
If a variable is shared and mutable, protect every access to it with the same mutex.
Not "protect writes only".
Protect reads too, unless you have a very specific reason and are using atomics or another synchronization scheme.
Bad:
pthread_mutex_lock(&lock);
shared_value = 42;
pthread_mutex_unlock(&lock);
printf("%d\n", shared_value); // unprotected read
Good:
pthread_mutex_lock(&lock);
int snapshot = shared_value;
pthread_mutex_unlock(&lock);
printf("%d\n", snapshot);
The mutex is not attached magically to the variable. It is just an agreement between all threads.
If one thread ignores the agreement, your program is back in race condition land.
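One way to make the agreement harder to break is to keep the mutex next to the data and only touch the data through a few functions. This is a sketch under that assumption; guarded_counter_t and the guarded_* functions are illustrative names, not a standard API.

```c
#include <assert.h>
#include <pthread.h>

#define ADDERS 4
#define ADDS_PER_THREAD 10000

/* The lock lives next to the data it protects, and the field is only
   touched through the functions below. */
typedef struct {
    pthread_mutex_t lock;
    long value;
} guarded_counter_t;

void guarded_init(guarded_counter_t *c) {
    int rc = pthread_mutex_init(&c->lock, NULL);
    assert(rc == 0);
    c->value = 0;
}

void guarded_add(guarded_counter_t *c, long delta) {
    pthread_mutex_lock(&c->lock);
    c->value += delta;
    pthread_mutex_unlock(&c->lock);
}

long guarded_get(guarded_counter_t *c) {
    pthread_mutex_lock(&c->lock);
    long snapshot = c->value; /* reads take the lock too */
    pthread_mutex_unlock(&c->lock);
    return snapshot;
}

static void *adder(void *arg) {
    for (int i = 0; i < ADDS_PER_THREAD; i++) {
        guarded_add(arg, 1);
    }
    return NULL;
}

/* Hypothetical driver: several threads add concurrently. */
long guarded_demo(void) {
    guarded_counter_t c;
    pthread_t threads[ADDERS];

    guarded_init(&c);
    for (int i = 0; i < ADDERS; i++) {
        int rc = pthread_create(&threads[i], NULL, adder, &c);
        assert(rc == 0);
    }
    for (int i = 0; i < ADDERS; i++) {
        pthread_join(threads[i], NULL);
    }
    long result = guarded_get(&c);
    pthread_mutex_destroy(&c.lock); /* safe: all users have been joined */
    return result;
}
```

C cannot enforce this, since any code can still reach c->value directly, but a small accessor layer makes the rule visible in code review.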
Make the critical section small
A mutex serializes execution.
That is the point, but it also means you can accidentally remove most of your parallelism.
Bad:
pthread_mutex_lock(&lock);
read_from_network();
parse_large_file();
counter++;
pthread_mutex_unlock(&lock);
Here every other thread is blocked while one thread waits on the network and parses a file. That is usually not what you want.
Better:
int delta = read_from_network_and_parse();
pthread_mutex_lock(&lock);
counter += delta;
pthread_mutex_unlock(&lock);
Do expensive work outside the lock. Enter the lock only to touch the shared state.
This is one of the biggest differences between code that is merely correct and code that performs well.
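To make that concrete, here is a small sketch in which each thread sums its own slice of an array with no locking at all and takes the mutex exactly once to merge its partial result. The array contents, the sizes, and names such as sum_slice and parallel_sum_demo are assumptions chosen for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define CHUNK 1000

static int values[NUM_THREADS * CHUNK];
static long total = 0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker sums its own slice without any locking, then takes the
   lock once to merge the partial result into the shared total. */
static void *sum_slice(void *arg) {
    size_t start = *(size_t *)arg;
    long partial = 0;
    for (size_t i = start; i < start + CHUNK; i++) {
        partial += values[i];
    }
    pthread_mutex_lock(&total_lock);
    total += partial;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

long parallel_sum_demo(void) {
    pthread_t threads[NUM_THREADS];
    size_t starts[NUM_THREADS];

    for (size_t i = 0; i < NUM_THREADS * CHUNK; i++) {
        values[i] = 1;
    }
    total = 0; /* no workers exist yet, so no lock is needed here */
    for (size_t t = 0; t < NUM_THREADS; t++) {
        starts[t] = t * CHUNK;
        int rc = pthread_create(&threads[t], NULL, sum_slice, &starts[t]);
        assert(rc == 0);
    }
    for (size_t t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
    return total; /* safe: all workers have been joined */
}
```

The lock is taken NUM_THREADS times in total instead of once per element.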
Passing arguments to threads
This bug is extremely common:
pthread_t threads[5];
for (int i = 0; i < 5; i++) {
pthread_create(&threads[i], NULL, worker, &i);
}
You meant to pass each thread its own id.
But every thread receives the address of the same loop variable i.
By the time a worker reads it, the loop may have moved on. You may see multiple threads print 5, or duplicate ids, or whatever timing gives you.
Correct:
pthread_t threads[5];
int ids[5];
for (int i = 0; i < 5; i++) {
ids[i] = i;
pthread_create(&threads[i], NULL, worker, &ids[i]);
}
Now every thread gets a pointer to stable memory.
This is not a synchronization primitive, but it is part of writing thread-safe C: lifetime matters.
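When a thread needs more than one value, or may outlive the function that created it, another common option is to give each thread its own heap-allocated argument block and let the worker free it. This is a sketch of that idea; worker_args_t and spawn_with_heap_args are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define WORKERS 5

typedef struct {
    int id;
    int *seen; /* shared array; each worker writes only its own slot */
} worker_args_t;

static void *worker(void *arg) {
    worker_args_t *args = arg;
    args->seen[args->id] = 1;
    free(args); /* the worker owns its argument block */
    return NULL;
}

/* Returns how many distinct ids the workers reported. */
int spawn_with_heap_args(void) {
    pthread_t threads[WORKERS];
    int seen[WORKERS] = {0};

    for (int i = 0; i < WORKERS; i++) {
        worker_args_t *args = malloc(sizeof *args);
        assert(args != NULL);
        args->id = i; /* copied by value, so the loop can move on */
        args->seen = seen;
        int rc = pthread_create(&threads[i], NULL, worker, args);
        assert(rc == 0);
    }
    int count = 0;
    for (int i = 0; i < WORKERS; i++) {
        pthread_join(threads[i], NULL);
        count += seen[i]; /* read only after that worker was joined */
    }
    return count;
}
```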
Condition variables
A mutex protects shared state.
A condition variable lets a thread sleep until some shared state becomes true.
This is the shape:
pthread_mutex_lock(&lock);
while (!condition_is_true) {
pthread_cond_wait(&cond, &lock);
}
// condition is true, use shared state
pthread_mutex_unlock(&lock);
The while is not decoration. Use while, not if.
Why?
Because condition variables can wake up even if nobody signaled them. These are called spurious wakeups. Also, another thread may consume the condition before the awakened thread gets the lock back.
So the rule is:
Wait in a loop. Always re-check the predicate.
Now the weird part:
pthread_cond_wait(&cond, &lock) atomically unlocks the mutex and puts the thread to sleep. When it wakes up, it re-locks the mutex before returning.
This is exactly what you want.
If it did not unlock the mutex while sleeping, the producer would never be able to acquire the lock and change the state.
Let us build the smallest producer-consumer with one slot.
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
int slot = 0;
bool has_item = false;
void *producer(void *arg) {
for (int i = 1; i <= 5; i++) {
pthread_mutex_lock(&lock);
while (has_item) {
pthread_cond_wait(&not_full, &lock);
}
slot = i;
has_item = true;
printf("produced %d\n", i);
pthread_cond_signal(&not_empty);
pthread_mutex_unlock(&lock);
}
return NULL;
}
void *consumer(void *arg) {
for (int i = 1; i <= 5; i++) {
pthread_mutex_lock(&lock);
while (!has_item) {
pthread_cond_wait(&not_empty, &lock);
}
int item = slot;
has_item = false;
printf("consumed %d\n", item);
pthread_cond_signal(&not_full);
pthread_mutex_unlock(&lock);
}
return NULL;
}
int main(void) {
pthread_t p, c;
pthread_create(&p, NULL, producer, NULL);
pthread_create(&c, NULL, consumer, NULL);
pthread_join(p, NULL);
pthread_join(c, NULL);
pthread_mutex_destroy(&lock);
pthread_cond_destroy(&not_empty);
pthread_cond_destroy(&not_full);
return 0;
}
The condition variable itself does not store the condition.
This is important.
The condition is has_item.
The condition variable is only the sleeping/waking mechanism.
People often mentally model condition variables as queues of events. That model is dangerous. A signal sent when nobody is waiting is lost. It does not get saved for the future.
That is why the real state must live in normal variables protected by the mutex.
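A quick way to convince yourself is a one-shot event that is fired before anybody waits on it. The broadcast wakes nobody, yet the later waiter still returns, because it checks the flag before it ever sleeps. This is a sketch; event_t, event_fire, and event_wait are illustrative names, not pthread APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool fired; /* the real state; the condvar only wakes sleepers */
} event_t;

static event_t ev = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false
};

void event_fire(event_t *e) {
    pthread_mutex_lock(&e->lock);
    e->fired = true;
    pthread_cond_broadcast(&e->cond); /* may wake nobody; that is fine */
    pthread_mutex_unlock(&e->lock);
}

void event_wait(event_t *e) {
    pthread_mutex_lock(&e->lock);
    while (!e->fired) { /* checked before sleeping, so nothing is missed */
        pthread_cond_wait(&e->cond, &e->lock);
    }
    pthread_mutex_unlock(&e->lock);
}

static void *waiter(void *arg) {
    event_wait(arg);
    return NULL;
}

/* Fires the event before any waiter exists, then starts a waiter. */
bool fire_then_wait_demo(void) {
    pthread_t t;
    event_fire(&ev); /* nobody is waiting yet */
    int rc = pthread_create(&t, NULL, waiter, &ev);
    assert(rc == 0);
    pthread_join(t, NULL); /* returns: the waiter saw fired == true */
    return ev.fired;       /* safe to read: the waiter has been joined */
}
```

If event_wait called pthread_cond_wait without checking fired first, this program would hang forever.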
Producer-consumer with a real bounded buffer
The one-slot example teaches the idea. Real systems usually use a bounded queue.
Why bounded?
Because unbounded queues are a polite way to move your outage into memory usage.
If producers are faster than consumers, a bounded queue applies backpressure. Producers block when the queue is full.
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#define CAPACITY 8
#define ITEMS 40
typedef struct {
int data[CAPACITY];
int head;
int tail;
int count;
bool closed;
pthread_mutex_t lock;
pthread_cond_t not_empty;
pthread_cond_t not_full;
} queue_t;
void queue_init(queue_t *q) {
q->head = 0;
q->tail = 0;
q->count = 0;
q->closed = false;
pthread_mutex_init(&q->lock, NULL);
pthread_cond_init(&q->not_empty, NULL);
pthread_cond_init(&q->not_full, NULL);
}
void queue_destroy(queue_t *q) {
pthread_mutex_destroy(&q->lock);
pthread_cond_destroy(&q->not_empty);
pthread_cond_destroy(&q->not_full);
}
void queue_push(queue_t *q, int value) {
pthread_mutex_lock(&q->lock);
while (q->count == CAPACITY) {
pthread_cond_wait(&q->not_full, &q->lock);
}
q->data[q->tail] = value;
q->tail = (q->tail + 1) % CAPACITY;
q->count++;
pthread_cond_signal(&q->not_empty);
pthread_mutex_unlock(&q->lock);
}
bool queue_pop(queue_t *q, int *out) {
pthread_mutex_lock(&q->lock);
while (q->count == 0 && !q->closed) {
pthread_cond_wait(&q->not_empty, &q->lock);
}
if (q->count == 0 && q->closed) {
pthread_mutex_unlock(&q->lock);
return false;
}
*out = q->data[q->head];
q->head = (q->head + 1) % CAPACITY;
q->count--;
pthread_cond_signal(&q->not_full);
pthread_mutex_unlock(&q->lock);
return true;
}
void queue_close(queue_t *q) {
pthread_mutex_lock(&q->lock);
q->closed = true;
pthread_cond_broadcast(&q->not_empty);
pthread_mutex_unlock(&q->lock);
}
queue_t q;
void *producer(void *arg) {
for (int i = 0; i < ITEMS; i++) {
queue_push(&q, i);
}
queue_close(&q);
return NULL;
}
void *consumer(void *arg) {
int value;
while (queue_pop(&q, &value)) {
printf("consumed %d\n", value);
}
return NULL;
}
int main(void) {
pthread_t p;
pthread_t c1;
pthread_t c2;
queue_init(&q);
pthread_create(&p, NULL, producer, NULL);
pthread_create(&c1, NULL, consumer, NULL);
pthread_create(&c2, NULL, consumer, NULL);
pthread_join(p, NULL);
pthread_join(c1, NULL);
pthread_join(c2, NULL);
queue_destroy(&q);
return 0;
}
Two details are worth noticing.
First, queue_close uses pthread_cond_broadcast, not pthread_cond_signal.
If multiple consumers are sleeping and the producer is done forever, every consumer needs a chance to wake up, see closed == true, and exit.
Second, every queue field is read or written while holding the queue mutex.
That is the invariant. Once you have an invariant like that, the code becomes easier to reason about.
Semaphores
A semaphore is basically a counter with atomic sleep/wake behavior.
If the counter is positive, sem_wait decrements it and continues.
If the counter is zero, sem_wait blocks until somebody increments it with sem_post.
On POSIX systems:
#include <semaphore.h>
Simple example: allow at most 3 threads into a section.
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>
sem_t slots;
void *worker(void *arg) {
int id = *(int *)arg;
sem_wait(&slots);
printf("worker %d entered\n", id);
sleep(1);
printf("worker %d leaving\n", id);
sem_post(&slots);
return NULL;
}
int main(void) {
pthread_t threads[10];
int ids[10];
sem_init(&slots, 0, 3);
for (int i = 0; i < 10; i++) {
ids[i] = i;
pthread_create(&threads[i], NULL, worker, &ids[i]);
}
for (int i = 0; i < 10; i++) {
pthread_join(threads[i], NULL);
}
sem_destroy(&slots);
return 0;
}
Only three workers can be inside at once.
This is useful for limiting concurrent access to a finite resource:
- database connections
- file descriptors
- worker slots
- GPU jobs
- network requests
Can you use a binary semaphore like a mutex?
Technically, yes.
Should you?
Usually no.
A mutex has ownership: the thread that locks it should unlock it. A semaphore does not have the same ownership model. Any thread can post. That is exactly what makes semaphores useful in producer-consumer signaling, but it also makes them easier to misuse as locks.
Use mutexes for protecting shared state.
Use semaphores for counting resources or signaling availability.
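Here is what "signaling availability" can look like: one thread computes a value and posts, while a different thread waits. The poster never waited on the semaphore, which is exactly what a mutex would not allow. This sketch assumes a platform where unnamed semaphores via sem_init are supported (Linux is; macOS is not), and the names result_ready and wait_for_result_demo are made up.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t result_ready;
static int result = 0;

static void *compute(void *arg) {
    (void)arg;
    result = 6 * 7;
    sem_post(&result_ready); /* posted by a thread that never waited */
    return NULL;
}

int wait_for_result_demo(void) {
    pthread_t t;
    int rc = sem_init(&result_ready, 0, 0); /* starts at 0: nothing ready */
    assert(rc == 0);
    rc = pthread_create(&t, NULL, compute, NULL);
    assert(rc == 0);

    /* sem_wait can be interrupted by a signal handler; retry if so. */
    while (sem_wait(&result_ready) == -1 && errno == EINTR) {
    }
    int value = result; /* ordered after the write by post/wait */

    pthread_join(t, NULL);
    sem_destroy(&result_ready);
    return value;
}
```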
Producer-consumer with semaphores
The bounded buffer can also be expressed with semaphores.
We use:
- empty: how many empty slots are available
- full: how many filled slots are available
- lock: a mutex to protect the actual circular buffer indices
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#define CAPACITY 8
#define ITEMS 20
int buffer[CAPACITY];
int head = 0;
int tail = 0;
sem_t empty;
sem_t full;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void push(int value) {
sem_wait(&empty);
pthread_mutex_lock(&lock);
buffer[tail] = value;
tail = (tail + 1) % CAPACITY;
pthread_mutex_unlock(&lock);
sem_post(&full);
}
int pop(void) {
sem_wait(&full);
pthread_mutex_lock(&lock);
int value = buffer[head];
head = (head + 1) % CAPACITY;
pthread_mutex_unlock(&lock);
sem_post(&empty);
return value;
}
void *producer(void *arg) {
for (int i = 0; i < ITEMS; i++) {
push(i);
}
return NULL;
}
void *consumer(void *arg) {
for (int i = 0; i < ITEMS; i++) {
printf("consumed %d\n", pop());
}
return NULL;
}
int main(void) {
pthread_t p, c;
sem_init(&empty, 0, CAPACITY);
sem_init(&full, 0, 0);
pthread_create(&p, NULL, producer, NULL);
pthread_create(&c, NULL, consumer, NULL);
pthread_join(p, NULL);
pthread_join(c, NULL);
sem_destroy(&empty);
sem_destroy(&full);
pthread_mutex_destroy(&lock);
return 0;
}
This is clean, but notice that semaphores do not remove the need for a mutex. The semaphores count slots. The mutex protects the buffer data structure itself.
Synchronization primitives compose. They do not magically replace each other.
Atomics
Now suppose you only need a counter.
Using a mutex works, but it may be heavier than necessary.
C11 introduced atomics in <stdatomic.h>.
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#define N 1000000
atomic_int counter = 0;
void *increment(void *arg) {
for (int i = 0; i < N; i++) {
atomic_fetch_add(&counter, 1);
}
return NULL;
}
int main(void) {
pthread_t t1, t2;
pthread_create(&t1, NULL, increment, NULL);
pthread_create(&t2, NULL, increment, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
printf("counter = %d\n", atomic_load(&counter));
return 0;
}
Here the increment is atomic. No two threads can lose an update.
Atomics are great for simple shared values:
- counters
- flags
- reference counts
- statistics
- lock-free-ish coordination, if you really know what you are doing
But atomics are not a drop-in replacement for mutexes.
If you need to update multiple variables together, a mutex is usually the right tool.
Bad idea:
atomic_int balance_a;
atomic_int balance_b;
// transfer 10 from A to B
atomic_fetch_sub(&balance_a, 10);
atomic_fetch_add(&balance_b, 10);
Each operation is atomic individually, but the transfer as a whole is not atomic. Another thread can observe the intermediate state.
For invariants across multiple fields, use a lock.
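A sketch of the locked version of the transfer, with a second thread auditing the invariant that the two balances always add up to the same total. The starting balances, the loop counts, and the names mover, auditor, and audit_demo are assumptions for the example.

```c
#include <assert.h>
#include <pthread.h>

#define TRANSFERS 100000

static int balance_a = 1000000;
static int balance_b = 0;
static pthread_mutex_t bank_lock = PTHREAD_MUTEX_INITIALIZER;
static int invariant_violations = 0; /* written only by the auditor */

static void *mover(void *arg) {
    (void)arg;
    for (int i = 0; i < TRANSFERS; i++) {
        pthread_mutex_lock(&bank_lock);
        balance_a -= 10; /* both updates sit inside one critical   */
        balance_b += 10; /* section, so no thread can see the gap  */
        pthread_mutex_unlock(&bank_lock);
    }
    return NULL;
}

static void *auditor(void *arg) {
    (void)arg;
    for (int i = 0; i < TRANSFERS; i++) {
        pthread_mutex_lock(&bank_lock);
        if (balance_a + balance_b != 1000000) {
            invariant_violations++;
        }
        pthread_mutex_unlock(&bank_lock);
    }
    return NULL;
}

/* Returns how many times the auditor saw a broken total. */
int audit_demo(void) {
    pthread_t m, a;
    int rc = pthread_create(&m, NULL, mover, NULL);
    assert(rc == 0);
    rc = pthread_create(&a, NULL, auditor, NULL);
    assert(rc == 0);
    pthread_join(m, NULL);
    pthread_join(a, NULL);
    return invariant_violations;
}
```

With two separate atomics and no lock, the auditor could read between the subtract and the add and see a total that is 10 short.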
Memory ordering basics
This topic gets deep very quickly, but you need the basic map.
Modern CPUs and compilers reorder operations for performance.
If one thread writes:
data = 42;
ready = 1;
you may think another thread doing:
if (ready) {
printf("%d\n", data);
}
must see 42.
But without synchronization, this is broken.
There are two separate problems:
- The compiler/CPU may reorder or cache things in ways you did not expect.
- The program has a data race if data and ready are normal variables.
Atomics let you create ordering relationships.
The most important pair is release/acquire.
Release means: everything before this store becomes visible before the store is visible.
Acquire means: after I observe this value, I also observe the writes that happened before the matching release.
Example:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
int data = 0;
atomic_int ready = 0;
void *producer(void *arg) {
data = 42;
atomic_store_explicit(&ready, 1, memory_order_release);
return NULL;
}
void *consumer(void *arg) {
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
// spin
}
printf("data = %d\n", data);
return NULL;
}
The consumer can safely read data after the acquire load sees ready == 1, because the release store published the earlier write to data.
The default atomic operations use memory_order_seq_cst, sequential consistency. It is the strongest and easiest to reason about. It makes all sequentially consistent atomic operations appear in one global order.
That sounds nice because it is nice.
For most code, start with the default. Reach for weaker ordering only when you have measured a real performance issue and understand the proof.
The usual orders:
- memory_order_relaxed: atomicity only, no ordering. Good for approximate counters/statistics.
- memory_order_release: publish prior writes.
- memory_order_acquire: consume writes published by release.
- memory_order_acq_rel: both acquire and release, often for read-modify-write operations.
- memory_order_seq_cst: strongest, easiest mental model, default.
Here is a relaxed counter:
atomic_ulong requests = 0;
void record_request(void) {
atomic_fetch_add_explicit(&requests, 1, memory_order_relaxed);
}
This is fine if you only care that increments are not lost. You are not using the counter to publish access to other data.
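For completeness, here is a runnable sketch of that relaxed counter under contention. The thread count, the loop size, and the names handle_requests and count_requests_demo are invented for the example. The final read is reliable because pthread_join, not the relaxed ordering, orders the workers' increments before it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4
#define REQUESTS_PER_THREAD 100000

static atomic_ulong requests = 0;

static void *handle_requests(void *arg) {
    (void)arg;
    for (int i = 0; i < REQUESTS_PER_THREAD; i++) {
        /* Relaxed: the increment is atomic, but it orders nothing else. */
        atomic_fetch_add_explicit(&requests, 1, memory_order_relaxed);
    }
    return NULL;
}

unsigned long count_requests_demo(void) {
    pthread_t threads[THREADS];

    atomic_store(&requests, 0);
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, handle_requests, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    /* pthread_join orders the workers' increments before this load. */
    return atomic_load_explicit(&requests, memory_order_relaxed);
}
```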
But this would be wrong:
data = 42;
atomic_store_explicit(&ready, 1, memory_order_relaxed);
If ready is a publication flag, use release/acquire or a mutex.
Mutexes also give memory ordering
This is easy to forget.
When thread A does:
pthread_mutex_lock(&lock);
shared = 42;
pthread_mutex_unlock(&lock);
and thread B later does:
pthread_mutex_lock(&lock);
printf("%d\n", shared);
pthread_mutex_unlock(&lock);
the mutex is not only preventing simultaneous access. It is also creating the memory visibility relationship.
Unlock releases. Lock acquires.
So when in doubt, use a mutex. It gives both mutual exclusion and ordering.
Read-write locks
Sometimes many threads read shared data, but writes are rare.
A normal mutex allows only one reader at a time, even if readers do not modify anything.
A read-write lock allows:
- many readers at the same time
- one writer at a time
- no readers while a writer holds the lock
POSIX gives you pthread_rwlock_t.
#include <pthread.h>
pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int config_value = 0;
int read_config(void) {
pthread_rwlock_rdlock(&rw);
int value = config_value;
pthread_rwlock_unlock(&rw);
return value;
}
void update_config(int value) {
pthread_rwlock_wrlock(&rw);
config_value = value;
pthread_rwlock_unlock(&rw);
}
This sounds obviously better than a mutex, but it is not always.
Read-write locks have overhead. They can also create writer starvation depending on implementation and usage. If the critical section is tiny, a normal mutex may be faster and simpler.
The boring rule wins again: measure.
Deadlocks
A deadlock happens when threads wait forever for each other.
Classic example:
pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;
void *thread1(void *arg) {
pthread_mutex_lock(&a);
pthread_mutex_lock(&b);
// work
pthread_mutex_unlock(&b);
pthread_mutex_unlock(&a);
return NULL;
}
void *thread2(void *arg) {
pthread_mutex_lock(&b);
pthread_mutex_lock(&a);
// work
pthread_mutex_unlock(&a);
pthread_mutex_unlock(&b);
return NULL;
}
Possible execution:
thread1 locks a
thread2 locks b
thread1 waits for b
thread2 waits for a
Nobody can move.
The fix is boring and powerful:
Always acquire locks in the same global order.
void lock_both(void) {
pthread_mutex_lock(&a);
pthread_mutex_lock(&b);
}
void unlock_both(void) {
pthread_mutex_unlock(&b);
pthread_mutex_unlock(&a);
}
Now every thread follows the same order: a then b.
You can also use pthread_mutex_trylock in some designs, but do not use it as a way to avoid thinking. It often just turns deadlocks into livelocks, where threads keep politely stepping aside forever.
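A global order is easy with two named locks, but what if the locks belong to objects chosen at runtime? One common convention, shown here as a sketch and not as the only answer, is to order locks by address. The account_t type and the transfer function are hypothetical, and the code assumes the two accounts are distinct.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define ROUNDS 100000

typedef struct {
    pthread_mutex_t lock;
    long balance;
} account_t;

static account_t acct_x = {PTHREAD_MUTEX_INITIALIZER, 1000};
static account_t acct_y = {PTHREAD_MUTEX_INITIALIZER, 1000};

/* Locks both accounts in address order, whatever order the caller
   names them in. Assumes from != to. */
static void transfer(account_t *from, account_t *to, long amount) {
    account_t *first = (uintptr_t)from < (uintptr_t)to ? from : to;
    account_t *second = (first == from) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

static void *x_to_y(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        transfer(&acct_x, &acct_y, 1);
    }
    return NULL;
}

static void *y_to_x(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        transfer(&acct_y, &acct_x, 1);
    }
    return NULL;
}

/* Runs opposing transfers concurrently; returns the combined balance. */
long ordered_transfer_demo(void) {
    pthread_t t1, t2;
    int rc = pthread_create(&t1, NULL, x_to_y, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, y_to_x, NULL);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return acct_x.balance + acct_y.balance;
}
```

If transfer locked from first and to second, the two threads above would be the thread1/thread2 deadlock from earlier in this section.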
Four conditions for deadlock
Deadlocks require these conditions:
- Mutual exclusion: some resource can only be held by one thread at a time.
- Hold and wait: a thread holds one resource while waiting for another.
- No preemption: resources cannot be forcibly taken away.
- Circular wait: thread A waits for B, B waits for C, C waits for A.
Break one of these and you break the deadlock.
In practice, the most common technique is breaking circular wait with a lock ordering rule.
Write the rule down somewhere. Future you is not as disciplined as current you thinks.
Common locking bugs
Forgetting to unlock on an error path
Bad:
pthread_mutex_lock(&lock);
if (something_failed()) {
return -1;
}
pthread_mutex_unlock(&lock);
return 0;
The lock stays locked forever.
Common C style:
int result = 0;
pthread_mutex_lock(&lock);
if (something_failed()) {
result = -1;
goto out;
}
// work
out:
pthread_mutex_unlock(&lock);
return result;
This is one of the few places where goto in C is not evil. It gives you one cleanup path.
Destroying a lock too early
Do not destroy a mutex while another thread may still use it.
This usually means:
- Tell workers to stop.
- Wake them if they are sleeping.
- Join them.
- Destroy synchronization objects.
Holding a lock while calling unknown code
If you call a callback while holding a lock, the callback may try to acquire the same lock or another lock in a bad order.
pthread_mutex_lock(&lock);
callback(shared);
pthread_mutex_unlock(&lock);
This can be fine if you fully control callback, but it is dangerous as an API pattern.
Often better:
pthread_mutex_lock(&lock);
snapshot_t snapshot = make_snapshot(shared);
pthread_mutex_unlock(&lock);
callback(snapshot);
Lock around your state. Do not lock around the whole universe.
Thread cancellation and shutdown
Many examples online create threads and then ignore shutdown.
Real systems need a stop path.
A common pattern is a shared stop flag plus a condition variable.
typedef struct {
pthread_mutex_t lock;
pthread_cond_t cond;
bool stop;
int work_count;
} state_t;
Workers wait while there is no work and stop is false:
pthread_mutex_lock(&s->lock);
while (s->work_count == 0 && !s->stop) {
pthread_cond_wait(&s->cond, &s->lock);
}
if (s->stop && s->work_count == 0) {
pthread_mutex_unlock(&s->lock);
return NULL;
}
// take work
pthread_mutex_unlock(&s->lock);
Shutdown thread:
pthread_mutex_lock(&s->lock);
s->stop = true;
pthread_cond_broadcast(&s->cond);
pthread_mutex_unlock(&s->lock);
Then pthread_join all workers.
Do not rely on killing threads from outside unless you really understand the cleanup implications. Cooperative shutdown is much easier to reason about.
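Putting the pieces together, here is a runnable sketch of cooperative shutdown. The "work" is just a counter so the example stays short. The done_count field, the WORKERS constant, and shutdown_demo are additions for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define WORKERS 3

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool stop;
    int work_count; /* pending units of work */
    int done_count; /* completed units (added for this sketch) */
} state_t;

static void *worker(void *arg) {
    state_t *s = arg;
    for (;;) {
        pthread_mutex_lock(&s->lock);
        while (s->work_count == 0 && !s->stop) {
            pthread_cond_wait(&s->cond, &s->lock);
        }
        if (s->work_count == 0) { /* stop requested and nothing left */
            pthread_mutex_unlock(&s->lock);
            return NULL;
        }
        s->work_count--; /* take one unit */
        s->done_count++; /* real work would run after unlocking */
        pthread_mutex_unlock(&s->lock);
    }
}

/* Submits `units` of work, requests shutdown, and joins all workers.
   Returns how many units were completed. */
int shutdown_demo(int units) {
    state_t s;
    pthread_t threads[WORKERS];

    s.stop = false;
    s.work_count = 0;
    s.done_count = 0;
    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.cond, NULL);

    for (int i = 0; i < WORKERS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, &s);
        assert(rc == 0);
    }

    pthread_mutex_lock(&s.lock);
    s.work_count += units;
    s.stop = true; /* workers drain remaining work, then exit */
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.lock);

    for (int i = 0; i < WORKERS; i++) {
        pthread_join(threads[i], NULL);
    }
    /* Destroy only after every worker has been joined. */
    pthread_mutex_destroy(&s.lock);
    pthread_cond_destroy(&s.cond);
    return s.done_count;
}
```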
Thread pools
Creating a thread is not free.
If you create one thread per tiny task, you may spend more time managing threads than doing useful work.
A thread pool creates a fixed number of worker threads. Tasks go into a queue. Workers pop tasks and execute them.
The rough structure:
main thread:
    create queue
    create N worker threads
    push jobs into queue
    close queue
    join workers
worker thread:
    while queue_pop(job):
        run job
This is the C version of the same idea behind worker pools in Go, Java executors, Node worker pools, database connection pools, etc.
The important systems lesson is this:
Concurrency should usually be bounded.
Unbounded concurrency feels fast in toy examples and then becomes a production incident.
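Here is one way the rough structure could look in C: a fixed number of workers and a bounded job queue. Treat it as a sketch with many simplifications (fixed sizes, no error reporting from jobs). pool_t, pool_start, pool_submit, pool_shutdown, and pool_demo are all hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define POOL_SIZE 4
#define QUEUE_CAP 16

typedef struct {
    void (*fn)(void *);
    void *arg;
} job_t;

typedef struct {
    job_t jobs[QUEUE_CAP];
    int head, tail, count;
    bool closed;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
    pthread_t workers[POOL_SIZE];
} pool_t;

static void *pool_worker(void *arg) {
    pool_t *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->closed) {
            pthread_cond_wait(&p->not_empty, &p->lock);
        }
        if (p->count == 0) { /* closed and drained */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        job_t job = p->jobs[p->head];
        p->head = (p->head + 1) % QUEUE_CAP;
        p->count--;
        pthread_cond_signal(&p->not_full);
        pthread_mutex_unlock(&p->lock);
        job.fn(job.arg); /* run the job outside the lock */
    }
}

void pool_start(pool_t *p) {
    p->head = p->tail = p->count = 0;
    p->closed = false;
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    pthread_cond_init(&p->not_full, NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        int rc = pthread_create(&p->workers[i], NULL, pool_worker, p);
        assert(rc == 0);
    }
}

void pool_submit(pool_t *p, void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&p->lock);
    while (p->count == QUEUE_CAP) { /* backpressure: block when full */
        pthread_cond_wait(&p->not_full, &p->lock);
    }
    p->jobs[p->tail] = (job_t){fn, arg};
    p->tail = (p->tail + 1) % QUEUE_CAP;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}

/* Closes the queue, waits for queued jobs to finish, frees resources. */
void pool_shutdown(pool_t *p) {
    pthread_mutex_lock(&p->lock);
    p->closed = true;
    pthread_cond_broadcast(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < POOL_SIZE; i++) {
        pthread_join(p->workers[i], NULL);
    }
    pthread_mutex_destroy(&p->lock);
    pthread_cond_destroy(&p->not_empty);
    pthread_cond_destroy(&p->not_full);
}

static void add_to_sum(void *arg) {
    atomic_fetch_add((atomic_int *)arg, 1);
}

/* Submits `jobs` increments to the pool and returns the final sum. */
int pool_demo(int jobs) {
    pool_t pool;
    atomic_int sum = 0;
    pool_start(&pool);
    for (int i = 0; i < jobs; i++) {
        pool_submit(&pool, add_to_sum, &sum);
    }
    pool_shutdown(&pool);
    return atomic_load(&sum);
}
```

Both bounds are explicit here: POOL_SIZE caps how many jobs run at once, and QUEUE_CAP caps how many can pile up before submitters start blocking.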
Debugging threaded C
Threading bugs are annoying because adding printf can change timing and make the bug disappear.
Still, there are good tools.
ThreadSanitizer
Compile with:
clang -fsanitize=thread -g main.c -pthread -o main
./main
or:
gcc -fsanitize=thread -g main.c -pthread -o main
./main
ThreadSanitizer catches many data races and lock-order issues.
If you are learning concurrency in C, use it early. It will humble you, which is useful.
Helgrind
Valgrind has a tool called Helgrind:
valgrind --tool=helgrind ./main
It is slower, but useful, especially on codebases where sanitizers are harder to enable.
gdb
Inside gdb:
info threads
thread 3
bt
thread apply all bt
thread apply all bt is extremely useful when the program is stuck. It prints the stack trace of every thread. If you have a deadlock, you can often see each thread blocked inside pthread_mutex_lock or pthread_cond_wait.
Logging
When logging threaded programs, include the thread id.
printf("[thread %lu] acquired lock\n", (unsigned long)pthread_self());
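Do not overdo it inside hot loops, but thread-aware logs make many bugs much easier to see. One caveat: pthread_t is an opaque type, so the cast to unsigned long above works on Linux but is not guaranteed to be portable.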
A mental checklist
When you write threaded C, ask:
- What data is shared?
- Who owns it?
- Which lock protects it?
- Are all reads and writes protected by the same rule?
- Can this thread go to sleep while holding a lock?
- Is there a shutdown path?
- If two locks are needed, what is the order?
- Could a callback or error path skip unlock?
- Is concurrency bounded?
- Can ThreadSanitizer run on this?
Most thread bugs come from not having crisp answers to these questions.
Mutex vs condition variable vs semaphore vs atomic
Very rough cheat sheet:
Use a mutex when:
- protecting a shared data structure
- maintaining invariants across multiple fields
- you want the simplest correct thing
Use a condition variable when:
- a thread needs to sleep until shared state changes
- you already have a mutex protecting that state
- examples: queue not empty, queue not full, shutdown requested
Use a semaphore when:
- you are counting available resources
- you want to limit concurrency
- examples: at most N jobs, N database connections, N slots
Use an atomic when:
- the shared state is a simple independent value
- you need counters/flags/reference counts
- you understand what ordering you need
If unsure, start with a mutex. It is not a moral failure. It is often the correct engineering choice.
Final picture
Threads are powerful because they let one process have multiple flows of execution sharing the same memory.
That shared memory is also the trap.
A normal line of C like:
counter++;
can become a race because the CPU is doing loads, arithmetic, stores, cache coherence, and scheduling beneath it.
Synchronization is how we put structure back into that chaos.
Mutexes give exclusive access.
Condition variables let threads sleep until state changes.
Semaphores count resources.
Atomics give safe low-level operations on individual values.
Memory ordering explains when writes from one thread become visible to another.
The hard part is not memorizing the APIs. The hard part is deciding what the shared state is, what invariant must stay true, and which primitive expresses that invariant cleanly.
Once you start thinking like that, thread synchronization stops being dark magic and becomes just another systems design problem.