AJ's blog

July 28, 2006

Anatomy of a bug…

Filed under: .NET, .NET Framework, ASP.NET, C#, Software Development — ajdotnet @ 5:31 pm

I should probably write this post while the blame is fresh and still hurts…😳

The Problem:
We encountered problems with our ASP.NET 1.1 application under load (of course this happened in production… 8O). The application became “less stable” than it used to be. Using a load test tool we could easily reproduce the problems in our development environment. One simulated user worked fine. Two users were fine as well. With 10 concurrent users we encountered the first problems; with 50 we had pages with a 25% to 35% error rate.
Note that we had load tested our application before, albeit only the major release before the current one. Unfortunately we had made quite a few changes since then, so it was reasonable to assume that one of those changes was the root cause. And yet, since the problem was related to timing, it might have been there before and just didn’t manifest itself. But let’s look at some details first.

The Situation:
We used pages that contained user controls (employing our own template mechanism), which in turn contained components (System.ComponentModel.Component). These components provided the data to be shown on the page, and they had to be initialized. In Page.OnInit we called the base class method (which should initialize the user controls and create the components) and afterwards ran our component initialization. Some snippets:

Code in SomeUserControl:

private void InitializeComponent() 
{ 
    this.components = new System.ComponentModel.Container(); 
    this.MyDataComponent = new DataComponent(this.components); 
    this.MyDataComponent.Name = "MyDataComponent"; 
}

Code in our page class:

protected override void OnInit(EventArgs e) 
{ 
    base.OnInit (e); 
    Controller.OnPageInit(); 
    // the controller calls ComponentManager.InitComponents()... 
}

With a single user (i.e. thread) active at a time it worked smoothly. When we debugged the application under load test conditions, some awkward exceptions occurred that gave us the impression that the user controls were being initialized asynchronously:

1. The components being created in InitializeComponent() were null. If we stopped in the debugger and set the instruction pointer back a few lines (or wrote a loop that slept a few milliseconds and retried until it actually got hold of the component), the component would eventually appear.

Code in our page class:

public ArrayList GetComponents() 
{ 
    if (_components==null) 
    { 
        bool ret= FindComponents(); 
        int loopCount= 1; 
        while(ret==false) 
        { 
            System.Threading.Thread.Sleep(10); 
            ++loopCount; 
            ret= FindComponents(); 
            if (loopCount>100) // 10*100 = 1 sec. 
                break; 
        } 
        if (loopCount>1) // an "else" cannot follow a while loop
            Trace.WriteLine( 
                "Page needed several attempts to get components: " 
                + loopCount); 
    } 
    return _components; 
}

Usually it took between 2 and 6 loops.

2. (A different error condition after some changes.) The Name property of the component (set in InitializeComponent()) is accessed during initialization. We checked the property in code and it had the default value (“<undefined>”). When we stopped in the debugger, the value was correct.

string name= component.Name.ToLower(); 
if (name=="<undefined>") 
    throw new Exception("<undefined>"); 
    // breakpoint on the throw => the debugger shows the correct name.

Ergo: The page/user control is obviously being initialized asynchronously. ASP.NET, on the other hand, guarantees that a page is processed by exactly one thread. We didn’t start any threads, so we had a contradiction. This obviously had to be a bug within the ASP.NET runtime… How likely could that be? Right. The probability would be close to 0. (If you don’t agree you may want to have a look at Testing ASP.NET 2.0 and Visual Web Developer.)

The Search
We finally decided to strip down the project to isolate the root cause of our problem. That took quite some work: our application consists of a substantial number of pages, user controls, components, business interfaces, business objects, etc., plus some sophisticated core functionality… Eventually our application was down to one page with one user control, and the bug was still reproducible. Next in line was getting rid of the various libraries that we used. Bang! Spring.NET! (The factory framework we used: http://www.springframework.net/.)

Spring.NET? The call chain looked like Page -> Controller -> ComponentManager, with the component manager iterating over the components on the page. And the component manager was created using Spring.NET. Some further research revealed that there have been multithreading issues with Spring.NET, see here for example. So we had a bug in Spring.NET. Just a quick look at the documentation… yes, looks good… yes… err… you’re kidding, aren’t you… Ahhrrggghhhh!
Some time later (about 2 hours) we had recovered…😡

The Cause
Spring.NET uses a configuration file in which the classes are registered for instantiation. One can also specify that a class shall only be instantiated once (i.e. as a singleton), effectively returning the same object for each subsequent request. And – heads up! – the default is singleton="true"… Nobody in our team would have guessed that.

The consequence was that all pages used the same component manager, which used its back reference to the page (a reference that concurrent requests overwrote constantly) to process the components. This not only explained the exceptions; it may also have supplied pages with the data of another user’s page.
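To make the mechanism concrete, here is a minimal sketch of what a shared manager with a page back reference does when two requests overlap. The classes are hypothetical stand-ins, not our real Page or ComponentManager, and the interleaving is written out sequentially instead of using real threads so that the outcome is deterministic. It also uses generics for brevity, which our ASP.NET 1.1 code did not have.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a page that owns one data component.
class Page
{
    public readonly List<string> Components = new List<string>();

    public Page(string user)
    {
        Components.Add(user + "-DataComponent");
    }
}

// Hypothetical stand-in for the component manager.
class ComponentManager
{
    // Back reference to the page that is currently being processed.
    public Page Page;

    public string InitComponents()
    {
        // Works on whatever page the field points to right now.
        return Page.Components[0];
    }
}

static class SingletonDemo
{
    // One instance for everybody, as a singleton-scoped factory hands out.
    static readonly ComponentManager Shared = new ComponentManager();

    public static string InterleavedRequest()
    {
        Page pageA = new Page("A");
        Page pageB = new Page("B");

        Shared.Page = pageA;            // request A starts
        Shared.Page = pageB;            // request B starts before A is done
        return Shared.InitComponents(); // request A continues on B's page
    }

    public static string IsolatedRequest()
    {
        // One manager per request, as with singleton="false".
        ComponentManager managerA = new ComponentManager();
        ComponentManager managerB = new ComponentManager();

        managerA.Page = new Page("A");
        managerB.Page = new Page("B");    // cannot disturb request A
        return managerA.InitComponents();
    }

    static void Main()
    {
        Console.WriteLine(InterleavedRequest()); // prints "B-DataComponent"
        Console.WriteLine(IsolatedRequest());    // prints "A-DataComponent"
    }
}
```

With real threads the overwrite happens at unpredictable moments, which would fit both the components that were null at first and the Name that was still “<undefined>” when checked in code.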

A simple addition of singleton="false" in that config file solved our problem.
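For illustration, such an object definition might look roughly like the fragment below. Only the singleton attribute is taken from our actual fix; the id, type and assembly names are made up, and depending on the Spring.NET version and setup the objects element may sit inside another configuration section.

```xml
<objects xmlns="http://www.springframework.net">
  <!-- id, type and assembly names are hypothetical -->
  <object id="componentManager"
          type="MyApp.Web.ComponentManager, MyApp.Web"
          singleton="false" />
</objects>
```

With singleton="false" every request to the factory yields a new ComponentManager, so each page gets its own back reference.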

The Conclusion
So, what do we learn from all this?

1. Load test your application! You never know what your application will do under stress.
2. Take this as a rule: The bug is not within the .NET Framework or one of the (widely used) libraries (or JRE if you happen to be from the other camp). The bug is in your code!
3. RTFM! Explicitly specify configuration values and don’t rely on default values – especially with libraries you use. What a sensible “default” is depends on subjective opinion, and opinions vary from developer to developer… (q.e.d.)

There’s also some positive experience (not for the first time): Any problem – no matter how intricate it appears to be – can be tracked down eventually – given enough time and the right people.

That’s all for now, folks,
AJ.NET
