jwatte
Posts: 203
Joined: Sat Aug 13, 2011 7:28 pm

Is there a race condition in the pthread_condvar implementation?

Sun Mar 11, 2018 9:14 pm

I have an application that uses "desktop GL" mode with fake KMS.
It captures data from the camera, and displays it in a GL window, after processing.
When I furiously click on the window while this is running, the hand-off between capture and display locks up.
Looking at the threads involved, here are the backtraces:

Code:

(gdb) thread 8
[Switching to thread 8 (Thread 0x6dd8e2e0 (LWP 4902))]
#0  0x76eebebc in __lll_lock_wait (futex=0x2c28c0 <model_cond>, private=0) at lowlevellock.c:46
46	lowlevellock.c: No such file or directory.
(gdb) bt
#0  0x76eebebc in __lll_lock_wait (futex=0x2c28c0 <model_cond>, private=0) at lowlevellock.c:46
#1  0x76ee8a50 in __pthread_cond_wait (cond=0x2c28c0 <model_cond>, mutex=0x2c28a4 <model_mutex>)
    at pthread_cond_wait.c:117
#2  0x0002bf3c in model_thread_fun () at lib/model.cpp:191
#3  0x76ee1fc4 in start_thread (arg=0x6dd8e2e0) at pthread_create.c:335
#4  0x757eec68 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 6
[Switching to thread 6 (Thread 0x6edff2e0 (LWP 4863))]
#0  0x76eebeec in __lll_lock_wait (futex=0x2c28c0 <model_cond>, private@entry=0) at lowlevellock.c:43
43	in lowlevellock.c
(gdb) bt
#0  0x76eebeec in __lll_lock_wait (futex=0x2c28c0 <model_cond>, private@entry=0) at lowlevellock.c:43
#1  0x76ee9050 in __pthread_cond_signal (cond=0x2c28c0 <model_cond>) at pthread_cond_signal.c:40
#2  0x0002c45c in submit_model_input_write () at lib/model.cpp:274
#3  0x00014d9c in data_buffer (d=0x6e58f000, s=460800, pm=0x8326c0) at apps/drive/drive.cpp:148
#4  0x0001abcc in preview_buffer_callback (port=0x83b2d0, buffer=0x83dde8) at lib/RaspiCapture.cpp:537
#5  0x75bbf7dc in mmal_port_buffer_header_callback () from /opt/vc/lib/libmmal_core.so
#6  0x74a38c44 in mmal_vc_do_callback_loop () from /opt/vc/lib/libmmal_vc_client.so
#7  0x75bc29a0 in mmal_component_action_thread_func () from /opt/vc/lib/libmmal_core.so
#8  0x75beecc4 in vcos_thread_entry (arg=0x83a9d0)
    at /home/dc4/projects/staging/userland/interface/vcos/pthreads/vcos_pthreads.c:144
#9  0x76ee1fc4 in start_thread (arg=0x6edff2e0) at pthread_create.c:335
#10 0x757eec68 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6

You will note that the pthread_cond_signal and the pthread_cond_wait threads are waiting on the same lock: both are in __lll_lock_wait on futex 0x2c28c0.
That mutex is handled in my code with a simple C++ wrapper that locks the mutex on creation, and unlocks when leaving scope.
The fact that the code made it to the condvar functions means that the threads could acquire the mutex.
The fact that the "signal" function is blocked concerns me. That should basically never happen when I call pthread_cond_signal() with the associated mutex held, should it?

Code:

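void submit_model_input_write() {
    // The queue must have room: fewer than 2 entries in flight.
    assert(data_head - data_tail < 2);
    data_head += 1;  // publish the new input (done before taking the lock)
    {
        // Signal the condvar while holding the associated mutex.
        PLock l(model_mutex);
        pthread_cond_signal(&model_cond);
    }
}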

Code:

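        // Fast path: try to grab work without taking the lock.
        ModelData *work = get_model_input_read();
        if (!work) {
            PLock l(model_mutex);
            // Re-check under the lock so a signal sent in between is not missed.
            work = get_model_input_read();
            if (!work) {
                if (model_running) {
                    // Releases model_mutex while waiting, re-acquires it before returning.
                    pthread_cond_wait(&model_cond, &model_mutex);
                    continue;  // loop around and re-check for work
                } else {
                    break;  // model_running is false: leave the loop
                }
            }
        }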

Code:

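// RAII guard: locks the mutex on construction and unlocks it on scope exit.
class PLock {
    public:
        explicit PLock(pthread_mutex_t &mtx) : mtx_(mtx) {
            pthread_mutex_lock(&mtx_);
        }
        ~PLock() {
            pthread_mutex_unlock(&mtx_);
        }
        pthread_mutex_t &mtx_;
    private:
        // Non-copyable, so the mutex is unlocked exactly once.
        PLock(PLock const &) = delete;
        PLock &operator=(PLock const &) = delete;
};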

There is one more place in the code where I acquire the mutex and signal the condvar; it's in the shutdown path. I'm not taking the shutdown path; the GUI that would call it is still up, running, and responsive.

So: How can the pthread_cond_signal() function be blocked on the mutex like this?
Or maybe the "futex" argument to the blocking system call is not actually the mutex of the condvar, but something else?
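
For comparison, here is a minimal sketch of the same kind of hand-off reduced to its bare pattern, which can help rule out the locking logic itself when a hang like this shows up. The names Handoff, handoff_submit, handoff_take and handoff_stop are hypothetical and are not part of the application above; the sketch also assumes a simple pending counter rather than the data_head/data_tail queue, and it updates that counter under the mutex.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-in for the queue state; not the application's real types.
struct Handoff {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int pending;   // number of items waiting; guarded by mutex
    bool running;  // cleared on shutdown; guarded by mutex

    Handoff() : pending(0), running(true) {
        int rc = pthread_mutex_init(&mutex, nullptr);
        assert(rc == 0);
        rc = pthread_cond_init(&cond, nullptr);
        assert(rc == 0);
        (void)rc;
    }
    ~Handoff() {
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&mutex);
    }
    Handoff(Handoff const &) = delete;
    Handoff &operator=(Handoff const &) = delete;
};

// Producer side: publish one item and wake one waiter, all under the mutex.
inline void handoff_submit(Handoff &h) {
    pthread_mutex_lock(&h.mutex);
    h.pending += 1;
    pthread_cond_signal(&h.cond);
    pthread_mutex_unlock(&h.mutex);
}

// Shutdown side: clear the running flag and wake every waiter.
inline void handoff_stop(Handoff &h) {
    pthread_mutex_lock(&h.mutex);
    h.running = false;
    pthread_cond_broadcast(&h.cond);
    pthread_mutex_unlock(&h.mutex);
}

// Consumer side: block until an item is available or shutdown is requested.
// Returns true if an item was taken, false if there was none and we are stopping.
inline bool handoff_take(Handoff &h) {
    pthread_mutex_lock(&h.mutex);
    // Re-check the predicate in a loop; condition variables may wake spuriously.
    while (h.pending == 0 && h.running) {
        pthread_cond_wait(&h.cond, &h.mutex);
    }
    bool got = h.pending > 0;
    if (got) {
        h.pending -= 1;
    }
    pthread_mutex_unlock(&h.mutex);
    return got;
}
```

In a pattern like this, pthread_cond_signal() only takes short internal locks, so a signaller that stays blocked for good suggests that the condvar's memory is no longer in a valid state rather than that the waiting logic is wrong.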

jwatte

Re: Is there a race condition in the pthread_condvar implementation?

Sun Mar 11, 2018 11:41 pm

I figured out approximately what was causing this weird problem.
I was using a software rendering library that didn't clip input parameters, and thus rendered outside the pixel buffer.
This ended up overwriting a variety of globals and perhaps even some stack variables, which would then cause random problems in the program, this being one of them.
Interestingly, I had a watchpoint right at the end of the pixel buffer, and that didn't trip, because the rendered lines touched "pixels" further up in memory.
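
As an illustration of the kind of clipping that was missing, here is a minimal sketch of bounds-checked pixel writes. The rendering library is not named in this thread, so PixelBuffer, put_pixel_clipped and draw_hline_clipped are hypothetical names, and the one-byte-per-pixel, row-major layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical pixel buffer: one byte per pixel, stored row by row.
struct PixelBuffer {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;

    PixelBuffer(int w, int h)
        : width(w), height(h),
          pixels(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 0) {}
};

// Writes one pixel only if (x, y) lies inside the buffer.
// Returns true if the pixel was written, false if it was clipped away.
inline bool put_pixel_clipped(PixelBuffer &buf, int x, int y, std::uint8_t value) {
    if (x < 0 || y < 0 || x >= buf.width || y >= buf.height) {
        return false;
    }
    std::size_t idx = static_cast<std::size_t>(y) * static_cast<std::size_t>(buf.width)
                    + static_cast<std::size_t>(x);
    assert(idx < buf.pixels.size());
    buf.pixels[idx] = value;
    return true;
}

// Draws a horizontal run from x0 to x1 (inclusive) on row y, clamping both
// ends to the buffer before touching memory.
// Returns the number of pixels actually written.
inline int draw_hline_clipped(PixelBuffer &buf, int x0, int x1, int y, std::uint8_t value) {
    if (y < 0 || y >= buf.height) {
        return 0;  // the whole row is outside the buffer
    }
    if (x0 > x1) {
        std::swap(x0, x1);
    }
    x0 = std::max(x0, 0);
    x1 = std::min(x1, buf.width - 1);
    int written = 0;
    for (int x = x0; x <= x1; ++x) {
        put_pixel_clipped(buf, x, y, value);
        ++written;
    }
    return written;
}
```

Checking the coordinates rather than the final address matters here: an out-of-range y can land far beyond the end of the buffer, which is why a watchpoint placed just past the last pixel can stay silent.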
