Tuesday, April 22, 2008

Why the Chart Wasn't Opening

Here's a not atypical experience when porting a humongous (1.3ish MSLOC) legacy application from one platform to another, in this case from SGI IRIX to Linux.

During a simulation run the operator can click on a button that opens up a chart displaying some statistics depicted as a line graph. In the working version of the port that chart wasn't opening up.


Because the chart module never received a "Create Line Graph" message.


Because that message is sent only when the simulation's "current time counter" is not 0.0, and it was not getting incremented.


Because no "Timestamp" message had been received.


(At this point I embarked on a Wild Goose Chase -- Marc)

Because the sending process never sent one.


Because the file it was reading turned out to be unexpectedly empty, and the process froze up after an "End_Error" exception was thrown.

What did the original version do?

Err..I guess the file is empty there too, but it throws an "Out_Of_Data" exception.


Ah...race condition.

So what's the net effect of the difference between how the two exceptions are handled?

Huh. None. They're both taken to mean "no data", which is a not unexpected condition, and so the exceptions are resolved and processing continues.

-- End of Wild Goose Chase. Backtracking to where I left off:


Because no "Timestamp" message had been received.


Because the sending module is experiencing a SEGFAULT.


Because some data being extracted from a database is getting stomped on with bad, bad values, triggering the segfault.


Compiler bug.


Looks that way. Though 99% of the time that I start to think "compiler bug" it turns out to be a programming error, this is one time it looks legit. A procedure is getting called that in turn calls additional subprograms to retrieve the data from the DB. Down at the bottom an exception is thrown that propagates back up to the calling routine. This is a "no data found" exception that is perfectly legitimate to have occur and propagate back up. When control returns to the calling procedure, though, some of its local variables have gotten clobbered, even those that are not part of the calling sequence. Everything is fine until that exception is handed up to the calling procedure. So the work-around for this was to catch the exception within that first called procedure, and change its parameter list to include a "found" flag, which is set according to whether the exception occurred or not. The caller then checks the flag and handles the response as if the exception had occurred.
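
The shape of that work-around can be sketched as follows. This is only an illustration in C++, not the application's actual code (the post doesn't show it, and the exception names suggest the original isn't C++). The names NoDataFound, fetch_row, and get_value are all hypothetical, and the key 42 is just a stand-in for "a row that exists".

```cpp
#include <stdexcept>

// Hypothetical stand-in for the "no data found" exception.
struct NoDataFound : std::runtime_error {
    NoDataFound() : std::runtime_error("no data found") {}
};

// Hypothetical low-level retrieval: signals a missing row by throwing.
int fetch_row(int key) {
    if (key != 42) {
        throw NoDataFound{};
    }
    return 1234;
}

// The first called procedure: the exception is caught here instead of
// propagating to the caller, and the outcome is reported via 'found'.
void get_value(int key, int& value, bool& found) {
    try {
        value = fetch_row(key);
        found = true;
    } catch (const NoDataFound&) {
        found = false;  // 'value' is left untouched when nothing is found
    }
}
```

The caller then tests found and takes the same branch it used to take in its exception handler, so no exception ever crosses the boundary where the locals were being clobbered.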

And then?

The chart still doesn't open.


In a color setting function, the name of the color is passed in and checked against a table that maps each color name to some internal data. That function lower-cases the color name parameter, since all the names in the table are lower case. The function, though, is modifying (via tolower()) the color name within the parameter itself, rather than in a local variable. For some reason trying to overwrite the parameter in place is causing another segfault. This is a less-than-desirable thing to be doing anyway, i.e. modifying the passed-in argument that should only be used as a lookup value, so the function was modified to lower-case the value into a local variable, which was then used for the table lookup.
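
A minimal sketch of the fixed approach, assuming a C-style string parameter. The function name lookup_color, the table contents, and the -1 "not found" value are all made up for illustration. The point is only that the argument is treated as read-only and the lower-casing happens on a local copy.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical color table; every key is lower case, as in the post.
// Returns the mapped value, or -1 if the name is not in the table.
int lookup_color(const char* name) {
    static const std::map<std::string, int> table = {
        {"red", 0xFF0000}, {"green", 0x00FF00}, {"blue", 0x0000FF}};

    // Copy the argument and lower-case the copy; 'name' is never written to.
    std::string lowered(name);
    for (char& c : lowered) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    auto it = table.find(lowered);
    return it == table.end() ? -1 : it->second;
}
```

Declaring the parameter as const char* also lets the compiler reject any attempt to write through it, which would have flagged the original in-place tolower() loop at compile time.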


The chart kinda opens, and then freezes.




Yes, down in the Lesstif code a null dereference is occurring.


Beats the hell out of me on this one. I built Lesstif from source, with debug, so I can find the line of code that's causing the problem but I really don't know what exact sequence of events is leading to this problem (I'm an applications, not a systems, programmer!) It does seem to be happening with the ScrolledList widget, when doing something pertaining to fonts.

What now?

Try exploratory code removal. Comment out the line that sets the font list and see if maybe some sort of default gets used.


Chart correctly opened and displayed, although the text is not italicized like it is on the original platform.

I can live with that.


Anonymous said...

I'd guess that the particular italicised font isn't installed.

Sounds like you earned your money today. And probably had some fun. I know I feel ecstatic when I've just squashed some particularly gnarly bug.

Prematurely Airconditioned Supermarket said...

Boy all this sounds familiar-- we ported our Motif-based system from IRIX to Linux several years ago.

We found that lesstif was and still is quite buggy, and if you use OpenMotif instead, you'll be much happier. Seems that lesstif works well for simple things, but the nastier the Motif involved, the less likely that lesstif will be bug-compliant.

The modify-a-string-in-place problem sounds like gcc is putting the strings in a read-only section, which is what it does for string constants. MIPSpro doesn't do that, so it usually worked.

I can't say what's up with the exceptions. They never quite work right for me.