Friday, January 22, 2010
Multi-threaded nightmare
For many months now I have been working on a project for a major company (I won't say who, but they're pretty big). The project was a DLL that communicates with outside hardware to run a financial transaction. Recently a problem with the DLL came up, and it took quite some time to figure out what was going on. So I thought I would post it here so that anyone who finds themselves in a similar situation will have a less painful time tracking down the problem.
The bug in question is a classic threading deadlock, but one achieved through somewhat unusual means. There are two classes of functions in the DLL: synchronous and asynchronous (there is really only one async function). The async function signals certain events by using C# delegates as callbacks.
The original API exposed only the synchronous functions; the async function was added later at the customer's request. Therein lies my downfall. I did not adequately analyze the consequences of having this mishmash of functions, and now my DLL's interface has a deadlock staring me right in the face.
What happens is this (I need a diagram to explain it best but don't have the time to make one). There are two threads involved: the main thread of control in the customer's software, which we will call T1, and the thread spawned to handle the async code, T2. T1 calls the async function, and the aforementioned T2 is spawned and begins doing its thing, holding the mutex that keeps the synchronous functions from trampling over its work. Properly, T1 should wait for T2 to send the final "OK, I'm done" event before doing anything else with the interface; however, "The best laid schemes o' mice an' men / Gang aft agley". When T1 decides it's tired of waiting, it calls a synchronous function, finds the mutex locked, and begins waiting for T2 to unlock it. At some point after this, T2 reaches an event point and uses the callback to signal the fact. Because of the way the customer's software must work, this callback actually cross-invokes a function on T1. But T1 is blocked on the mutex, so the callback never returns, T2 never releases the mutex, and we have a deadlock.
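Since I don't have the diagram, here is a rough sketch of the same sequence in code. It is Python rather than the C# of the real DLL, and every name in it (`run_scenario`, `t1_inbox`, `callback_onto_t1`) is made up for illustration. The real threads would wait forever; the sketch uses timeouts on both waits purely so that it terminates and can report what each thread saw.

```python
import queue
import threading


def run_scenario(timeout=0.2):
    """Replay the T1/T2 interaction and report what each thread observed."""
    mutex = threading.Lock()        # protects the DLL's shared state
    t1_inbox = queue.Queue()        # handlers marshalled onto T1 (hypothetical)
    t2_has_mutex = threading.Event()
    t1_about_to_block = threading.Event()
    result = {}

    def callback_onto_t1():
        # Cross-invoke: hand the handler to T1, then wait until T1 has run it.
        handled = threading.Event()
        t1_inbox.put(handled.set)
        # A real blocking invoke has no timeout; this one exists only so the
        # demo ends. It is longer than T1's wait so T1 always gives up first.
        return handled.wait(3 * timeout)

    def t2_async_worker():
        with mutex:                         # T2 holds the mutex while it works
            t2_has_mutex.set()
            t1_about_to_block.wait()        # let T1 lose patience first
            result["callback_returned"] = callback_onto_t1()

    t2 = threading.Thread(target=t2_async_worker)
    t2.start()
    t2_has_mutex.wait()

    # T1 (this thread) gets tired of waiting and calls a synchronous function.
    t1_about_to_block.set()
    got = mutex.acquire(timeout=timeout)    # real code would block forever
    result["sync_call_got_mutex"] = got
    if got:
        mutex.release()
    # T1 never looked at t1_inbox while it was blocked, so T2's callback
    # was never serviced.
    t2.join()
    return result


if __name__ == "__main__":
    print(run_scenario())
```

Both entries in the returned dictionary should come back `False`: T1 never got the mutex, and T2's callback never returned. Take the timeouts away and neither thread ever moves again.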
That took forever to find, and it was a classic eureka moment when I finally did. But now I am stuck. Originally the code would just say "Busy, leave me alone" when the mutex was locked, which avoided the problem. The customer was unsatisfied with that, though, and required that each function not return until its work was finished, which, given the semantics of the interface, brought the deadlock to the fore. I also offered to delay the event callbacks until T1 was unblocked and able to handle them, which was also shot down. So now I need to break the interface to fix the problem but can't... rock and a hard place indeed.
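For the record, the two options that got turned down look roughly like this. Again this is a Python sketch and not the DLL's actual code; the `Device` class and its method names are hypothetical. The first method is the old "busy" behaviour (try the mutex and return immediately if it is taken), and the other two show the idea of queueing events so that T1 runs them itself once it is free.

```python
import queue
import threading


class Device:
    """Hypothetical stand-in for the DLL's interface."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._pending = queue.Queue()   # event handlers held back for T1

    def sync_call_or_busy(self):
        # Option 1: never block. If the async work holds the mutex,
        # report "busy" instead of waiting for it.
        if not self._mutex.acquire(blocking=False):
            return "BUSY"
        try:
            return "OK"                 # the real work would happen here
        finally:
            self._mutex.release()

    def post_event(self, handler):
        # Option 2, worker side: queue the event instead of invoking onto T1,
        # so the worker never waits on a thread that may be blocked.
        self._pending.put(handler)

    def drain_events(self):
        # Option 2, T1 side: once T1 is no longer blocked it runs the queued
        # handlers in order and returns how many it ran.
        count = 0
        while True:
            try:
                handler = self._pending.get_nowait()
            except queue.Empty:
                return count
            handler()
            count += 1
```

Either one breaks the cycle because one of the two waits goes away: with the first, T1 never blocks on the mutex, and with the second, T2 never blocks on T1.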