Do you have an application that starts misbehaving randomly? You aren’t quite sure what is going on, but it seems like for some reason your Timer just stops firing its event handler, and once it stops it never starts back up.
This drove me nuts! But it is definitely a problem in Windows Server 2003 SP1. This bug is particularly devious because even if your application uses extensive logging, it may be nearly impossible to find proof of a timer not firing in your logs.
Short of implementing a Homer Simpson-style “everything is okay” alarm, is there anything you can do to determine whether or not your application might be suffering from dying timers?
When I realized that my application had timers that randomly stopped firing, I set out on an Internet quest to figure out whether anyone had ever had this problem before. And yes, I found many forum threads and blog posts already discussing this issue (just do a quick Google search if you don’t believe me). The problem was that everything I found was from over a year ago and only mentioned .NET 1.1. Well… I was having what seemed like the same issue, but with .NET 2.0.
It was nearly impossible to find anything from Microsoft about the issue. The only thing I came across was this article which had a misleading title, but contained the following (plausible) explanation:
The Timer classes are implemented as a linked list of timer objects. When the first System.Threading.Timer object is created, the thread pool manager starts a thread to process the linked list. Every timer object is added to the linked list. The thread that processes the linked list cycles through the linked list and determines when the timer event is expected to be signaled against the current clock count.
If the timer object has expired, the thread asynchronously queues a callback function before the thread updates the time that the timer event is expected to be signaled. After the thread has processed all timer objects in the linked list, the thread updates the time that the linked list was last processed. Then, the thread calculates the shortest time that the thread should sleep before the thread reprocesses the linked list for the next elapsed timer object.
Sometimes, when the system is under stress or when the linked list includes many timer objects, the processing thread may be pre-empted by a higher priority thread before the whole linked list has been processed. When this behavior occurs, the time that a timer event is next expected to be signaled is calculated to be earlier than the timestamp when the linked list was last processed. Therefore, the time that the timer event is expected to be signaled is in the past and never expires.
Because the time that the timer event is expected to be signaled has already passed, the thread may calculate a negative period to wait before the timer event must be signaled. When the thread has a negative period to wait, the thread enters a sleep state for a long time.
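To make that explanation a little more concrete, here is a minimal Python sketch of the scenario it describes. This is not Microsoft's code: the names `SimTimer`, `process_list`, and `as_unsigned_32` are made up for illustration, and the idea that the negative wait ends up in an unsigned 32-bit millisecond value is purely my assumption about why a negative period would turn into "a long time". The article itself does not say.

```python
from dataclasses import dataclass

# Assumption (not stated in the article): the wait is stored as an
# unsigned 32-bit count of milliseconds, so a negative value wraps around.
UINT32_MASK = 0xFFFFFFFF


@dataclass
class SimTimer:
    due_ms: int     # tick at which the timer should next fire
    period_ms: int  # interval between firings


def process_list(timers, start_ms, preempted_for_ms=0):
    """Simulate one pass over the timer list.

    Returns (fired_count, signed_wait_ms). The simulated thread is
    pre-empted for `preempted_for_ms` right after handling the first entry.
    """
    now = start_ms
    fired = 0
    for index, timer in enumerate(timers):
        if timer.due_ms <= now:
            fired += 1                           # "queue the callback"
            timer.due_ms = now + timer.period_ms  # schedule the next firing
        if index == 0:
            now += preempted_for_ms              # a higher priority thread runs here
    last_processed_ms = now                      # stamped only after the full pass
    wait_ms = min(t.due_ms for t in timers) - last_processed_ms
    return fired, wait_ms


def as_unsigned_32(wait_ms):
    """Reinterpret a signed wait as an unsigned 32-bit value (hypothetical)."""
    return wait_ms & UINT32_MASK
```

With two timers of period 100 ms that are both due at tick 0, a pass without preemption gives a wait of 100 ms. If the thread is pre-empted for 250 ms after the first entry, the first timer is rescheduled for tick 100 while the list is stamped as processed at tick 250, so the wait comes out as -150 ms. Under the unsigned assumption, `as_unsigned_32(-150)` is 4294967146 ms, which is roughly 49.7 days of sleep.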
I decided that the only way to be sure that this bug is caused by Windows Server 2003 SP1 (and not the .NET 1.1 Framework SP1, as the Microsoft article claims) would be to test using the .NET 2.0 Framework.
After some more grueling searching I came across a Google Groups topic that contained source code to test the System.Threading.Timer bug. Basically, all it does is create 30,000 Timers and tell them all to go off in 5 seconds. Then it prints out the updated count each time one fires, so if all of them fire the application will count up to 30,000.
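The shape of that test is simple enough to sketch. What follows is not the source from the Google Groups thread, and it is written in Python with `asyncio` rather than against System.Threading.Timer, so it will not reproduce the Windows bug. It only illustrates the idea of scheduling a large batch of timers and counting how many actually fire. The function name `count_fired_timers` and the `grace_s` parameter are my own inventions.

```python
import asyncio


async def count_fired_timers(timer_count=30_000, delay_s=5.0, grace_s=2.0):
    """Schedule `timer_count` one-shot timers and return how many fired."""
    loop = asyncio.get_running_loop()
    fired = 0

    def on_fire():
        # Each timer callback just bumps the shared counter.
        nonlocal fired
        fired += 1

    for _ in range(timer_count):
        loop.call_later(delay_s, on_fire)

    # Wait past the due time, plus a grace period, before reading the count.
    await asyncio.sleep(delay_s + grace_s)
    return fired


if __name__ == "__main__":
    total = 30_000
    fired = asyncio.run(count_fired_timers(total))
    print(f"{fired} of {total} timers fired")
```

A healthy run should report every timer as fired; a count below the total is the symptom to look for.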
I built the testing application from the source and ran it on the 2003 SP1 machine. The first time I ran the process all 30,000 timers fired (this got me worried), but after running it a few more times I saw that sometimes a large share of them never fired at all (the lowest count I saw was around 11,000).
This proved that this System.Threading.Timer problem is alive and well in the .NET 2.0 Framework and that it is in fact caused by Windows Server 2003 SP1 (not only the .NET 1.1 Framework SP1). I have tested the same process on SP2 and SP3, and each time all 30,000 timers fired.
I have not tried installing the hotfix that Microsoft says will also fix the problem, but it is available here: KB900822. Please feel free to respond to this post if the hotfix has worked for you.