Friday, December 26, 2008

Coding War Story - Sometimes You Just Have to be Facing the Right Direction

Sometimes seeing something flicker out of the corner of your eye turns out to be the key to tracking down a bug.

One Monday morning I came in and saw that one of the processes in the missile command & control (C2) system our team was developing had crashed, throwing up an exception traceback. This system ran across several rack-mounted workstations when it was deployed. However, it also functioned just fine on a single workstation, which simplified the development tasks for the developers. It was not uncommon for me to leave it running on my workstation overnight or over the weekend so that I could just pick up again when I came in the next morning. Since the system had been idle for hours, and had been fine when I left for the weekend it was a little puzzling that it would spontaneously crash like that. The first thing that pops into mind is some kind of memory exhaustion, but when I looked at the state of the system, and where the traceback showed the crash occurring, it didn't look to be memory, nor was there any other readily apparent cause.

When this system was idling, it was pretty much literally idling. I'll get into some of the details in a bit, but basically in the idle state all the threads are suspended waiting to be notified of there being something to do. And since there was no command and control activity going on on my workstation over the weekend, everything should have just stayed quietly suspended.

And the other odd thing was what the traceback indicated. A subprogram had been invoked that went and retrieved some data for processing, and since there was no data for processing--the system was idle and had been for hours--it started trying to process garbage and just crashed out. And the weird thing was that there was no rational explanation for why that subprogram chain got invoked. It was called only when a specific message had come in, and there was no record in the logs of that having occurred, nor was there any reason to transmit that message when the system was idle.

This is not what we were working on, but it kinda looks like itI tasked one of the developer to go off and see what he could track down, and some time later he came back, just as puzzled as I was about the cause.

So we let it go for the time being, and other than letting the system run overnight (and the problem didn't repeat), chalked it up at the time as just one of those intermittent bugs that would hopefully manifest itself more helpfully later.

Time goes by, and more development happens.

One of the other developers comes to me and asks for help in finding out why the system crashed on their desktop workstation. It had been simply idling and they could find no clear reason why a particular sequence of subprograms calls had been made that led to the exception being raised that crashed it out.

This was familiar, but again we could find no reason for the idle system to crash out like this.

More weeks go by.

This system did a lot of detailed logging of what it was doing, so a common practice amongst the developers was to "tail -f" one or more of its logs while the system was running so as to get some immediate feedback about what was going on under the hood as the developer checked out their changes.

I was doing a code review or something at my desk when I noticed something flicker over on my workstation monitor. I hadn't yet written any code or done any testing that day, so everything should've been quietly idling. Looking at the monitor to see what had just happened I saw it--the exception traceback and crash had just happened, which took down the process that was running in its own window, and recorded some tracing in the logfile I happened to be tailing.

This time I noticed something about the freshly logged entries, versus those that immediately preceded it: the timestamps differed by one day and a fraction of a second.

That was suggestive.

Now this C2 system was written in Ada, and I'd been working with that programming language for around 15 years by this time. So I knew the language REALLY well. And also over that interval I'd gained a deep understanding of the runtime environments (RTEs) within which Ada programs executed. It's not that I explicitly set out to learn about the Ada RTE, but over years of development and debugging you just absorb information about data formats, layouts, byte boundaries, thread suspension, scheduling, resumption policies, etc.

So when I saw that there had been a nearly exact 24-hour interval between the last nominal log message and the crash, the first connection I made in my mind was to the Alsys Ada Duration type. Why that type? Well, Duration is a predefined Ada type used to represent time, in the form of seconds and fractions thereof.

Now for those unfamiliar with the Ada programming language, Ada is a strongly typed programming language that allow programmers to define application-specific types, both composite (records, arrays) and scalar (integer-based, float-based, character, and enumeration). And for scalar type definitions, one can specify its range and, in the case of real types, precision.

The Ada 83 programming standard does not specify a range and precision for the Duration type, those were left to the implementor to make common-sense choices appropriate to the hardware/OS being targeted by the compiler.

While I don't recall the precision specification, I do know that Alsys Ada running under HP-UX predefined the Duration type to have a range of 0.0 .. 86400.0, i.e. 24 hours. And that was the only place I knew of in the Alsys Ada RTE that had a direct association with a 24-hour time period.

So how exactly could Duration's 24 hour type definition be related to this crash? Well, there was no clear and obvious relationship, there were no delays or anything going in the vicinity of the code where the crash sequence appeared to initiate.

But, once again, there was another connection to be made.

Duration is closely associated with Ada's tasking functionality, in fact they're both covered in the same chapter of the Ada Language Reference Manual. And tasking did play a significant role in this C2 system.

Tasks are Ada's built-in concurrency constructs, and are conceptually rather similarly to threads. Rather then employing a separate library of functions or classes for implementing parallelism, Ada's concurrency features are built in to the language itself. Nowadays most Ada compiler implementations implement tasking on top of the native OS or VM threads, but there's no requirement that they do that. The semantics of Ada tasking specified by the standard usually don't match up exactly with the thread primitives provided by an OS, so the vendor may just let the differences slide if the inconsistencies are minor or trivial. Or, a vendor (or third party) can create a tasking RTE that manages all tasks completely on its own, with no reliance on any underlying OS thread capabilities. There are pros and cons to system-supplied versus vendor-created tasking/threading approaches, but especially with the state of compiler and threading technology in the early days of Ada, having the tasking implementation totally under the vendor's control could usually be justified in order to reliably get the right tasking behavior. And that's what Alsys Ada did.

A task that is going to communicate with other tasks, including the main program, does so via "entries", which specify a name and parameters for the data that is to be exchanged. The actual language statement that receives an entry invocation is the "accept" statement. So if you had a task, like Command_Processor, that had an entry "Receive_Command", you would see this somewhere in the task:

accept Receive_Command(Command_Buffer : Buffer_Type) do
-- Extract the command data to a local buffer before processing
... statements ...
end Receive_Command;


When the thread of control within the task hits such an accept statement, task execution is suspended and blocked. And it stays blocked until a call is made on that entry. The task will stay blocked even if the main program is ready to exit. Ada requires that until all tasks are terminated the program cannot perform a normal exit.

If a task needs multiple entries so that is able to perform multiple functions (including being explicitly told to terminate), the "select" statement is used to block the task until any one of multiple entries is invoked. For example:

select
accept Receive_Command(Command_Buffer : Buffer_Type) do
-- Extract the command data to a local buffer before processing
... statements ...
end Receive_Command;
or
accept Shutdown_System;
end select;

Now this task is suspended and blocked waiting for either of its two entries to be called. (And note in the latter accept statement that there need be no parameters if no data is to be exchanged, and no body of statements to execute while the task rendezvous is occurring--a call to an entry defined like this is effectively simply a notification that the task can exit the select statement and resume executing at the first statement following the select.)

Now while there's nothing wrong with this form of select statement usage in a task, it is a bit of an atypical usage of Ada's tasking idiom. The reason for that is that it provides no way to shut itself down when the main program is ready to exit. The task with this select statement is going to remain suspended until one of those two entries is invoked, and it wouldn't have mattered whether that second one was named Shutdown_System or "Queue_Message", to the enclosing task they're all simply entries to wait on.

So the typical composition of a select statement in a task includes a "terminate" alternative:

select
accept Receive_Command(Command_Buffer : Buffer_Type) do
-- Extract the command data to a local buffer before processing
... statements ...
end Receive_Command;
or
accept Shutdown_System;
or
terminate;
end select;

Terminate is an Ada reserved word that in the context of a select statement terminates its enclosing task if the entire application is attempting to perform a normal exit. Thus there's no requirement that either of the entries be explicitly called in order to terminate the task and accomplish a program exit.

There are two alternatives to the "terminate alternative" that can be employed: The delay alternative and the conditional alternative. The delay alternative, which looks like:

select
accept Receive_Command(Command_Buffer : Buffer_Type) do
-- Extract the command data to a local buffer before processing
... statements ...
end Receive_Command;
or
accept Shutdown_System;
or
delay Time_To_Wait;
end select;

causes the select statement, after the delay interval has elapsed, to stop waiting for any entry calls, unblock task execution, and continue execution after the select statement. The delay expression can either be a literal, like 2.0, or a variable of type...Duration.

The other option is an "else clause" in the select statement, which means that if upon hitting the select statement there's no invocations pending on any of the accept alternatives that execution should go right into the sequence of statements that follow the "else".

Now the thing is that any one select statement can contain only one of these three alternatives: terminate, delay, or conditional. (I'm sure there's a very good reason for this, but I've never dug into the rationale.) This means that if you have a select statement enclosed in a loop that is supposed to cycle every so often--based on the delay expression--that you need to have an explicit way to shut down the task, because you're not allowed to have a terminate alternative to do it for you. Also, if your task holds system resources that need to be explicitly closed/released, you don't want to use the terminate alternative, because that just directly shuts down the task on program exit, with no opportunity within that task to release the resources.

So what what was in it in our problematic C2 system that was triggering a subprogram invocation from a task that was supposed to be blocked on a select statement?

Studying the code I saw that the select statement in the task was of that first, rather atypical select/accept style, i.e. just selecting on entry accept statements, there was no delay, terminate, or conditional alternative. It was written that way because that task held resources that were best explicitly released by the task itself, hence the terminate alternative was out; there was no reason for any kind of timeout, so no delay alternative was provided; nor was there any reason for a conditional alternative.

Since this was an uncommon form for a select statement, and 24-hour intervals of total inactivity are also rather unusual for a C2 software system, I decided to try something. I wrapped the select statement in a loop and added a delay alternative:

loop
select
accept Receive_Command(Command_Buffer : Buffer_Type) do
-- Extract the command data to a local buffer before processing
... statements ...
end Receive_Command;
or
accept Shutdown_System;
or
delay Duration'Last - 10.0;

end select;
end loop;


Basically I just told the select statement to timeout after 23 hours, 59 minutes, and 50 seconds had elapsed. Once it did it would simply abort the select then loop back up and reinitiate it. Now there were two other instances of this particular select statement usage in the system, but I left them unchanged as a control.

I rebuilt the system, kicked it off, waited for it to go idle, noted the time, and went off and did some other stuff for the rest of the day.

Came back the next day, and waited for the 24 hours of idle time to elapse. The time approached, and I got a crash and traceback--but this time a different subprogram chain had gotten invoked, which it was quick to verify had originated from one of the unmodified control select statements. I changed both of the remaining select statements to the same loop/delay composition, and started it off again.

Work, go home, come back to work, blah blah... 24 hours. Nothin'. No crash, no traceback, everything just kept idling. I ran through a verification scenario and everything performed just fine.

So that was it, somehow that particular "select/accept/no alternatives" sequence could not stand being suspended for 24 hours. Maybe it had something to do with the range definition of the Duration type, and maybe not, I never knew for sure, since Alsys Ada was almost an orphaned product by then, and support was pretty much negligible.

We let the other co-contractors know about this issue, since it is not impossible, or even improbable, that a C2 system would sit idle for many hours, and the last thing you want in a time and place of elevated tensions is for one of your missile command & control systems to abruptly crash on you, when it had been running just fine for hours!

Y'know, if I hadn't happened to be facing in the general direction of the monitor when that crash happened I might never have noticed it. It would've been easy for me to have an updated build ready to test, and so simply blindly terminated the current run without even noticing its state before doing so. But one little flicker of movement seen out of the corner of my eye caught my attention, and years of experience with the language and its RTE guided me right down the path to tracing it to a vendor bug, and instituting a work-around before it ever came near to being deployed.

Thursday, December 4, 2008

Snow in the Higher Elevations

We got a little bit of snow here in North Alabama a couple days ago. Though it snowed around much of the area, the weather forecasters said it might amount to something only in the higher elevations.

Well, I can see the higher elevations from my front porch, and they were right:


















The elevation of that ridge is about a thousand feet above my house, so it looks to me that about the top 100-200 feet of it were high enough to hold the snow for the day.