Here's a not atypical experience when porting a humongous (1.3ish MSLOC) legacy application from one platform to another; in this case from SG/IRIX to Linux.
During a simulation run the operator can click on a button that opens up a chart displaying some statistics depicted as a line graph. In the working version of the port that chart wasn't opening up.
Why?
Because the chart module never received a "Create Line Graph" message.
Why?
Because that message is sent only when the simulation's "current time counter" is not 0.0, and it was not getting incremented.
Why?
Because no "Timestamp" message had been received.
Why?
(At this point I embarked on a Wild Goose Chase -- Marc)
Because the sending process never sent one.
Why?
Because the file it was reading turned out to be unexpectedly empty, and so it froze up, having thrown an "End_Error" exception.
What did the original version do?
Err..I guess the file is empty there too, but it throws an "Out_Of_Data" exception.
Why?
Ah...race condition.
So what's the net effect of the difference between how the two exceptions are handled?
Huh. None. They're both taken to mean "no data", which is a not unexpected condition, and so the exceptions are resolved and processing continues.
-- End of Wild Goose Chase. Backtracking to where I left off:
Why?
Because no "Timestamp" message had been received.
Why?
Because the sending module is experiencing a SEGFAULT.
Why?
Because some data being extracted from a database is getting stomped on with bad, bad values, triggering the segfault.
Why?
Compiler bug.
Really??
Looks that way. Though 99% of the time that I start to think "compiler bug" it turns out to be a programming error, this is one time it looks legit. A procedure is getting called that in turn calls additional subprograms to retrieve the data from the DB. Down at the bottom an exception is thrown that propagates back up to the calling routine. This is a "no data found" exception that is perfectly legitimate to have occur and propagate back up. When control returns to the calling procedure, though, some of its local variables have gotten clobbered--even those that are not part of the calling sequence. Everything is fine until that exception is handed up to the calling procedure. So the work-around for this was to catch the exception within that first called procedure, and change its parameter list to include a "found" flag, which is set according to whether the exception occurred or not. The caller then checks the flag and handles the response as if the exception had occurred.
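For the shape of that work-around, here is a rough sketch in C. C has no exceptions, so it only shows the flag-based interface the caller ends up with, not the exception handling itself. The names `fetch_unit`, `unit_record`, `fake_db` and `unit_name_length` are all made up for illustration; none of them come from the real code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type and an in-memory stand-in for the database. */
struct unit_record { int id; char name[16]; };

static const struct unit_record fake_db[] = { { 1, "alpha" }, { 2, "bravo" } };

/* Retrieves a record by id. "No data found" is reported through *found
   instead of being propagated up to the caller. */
void fetch_unit(int id, struct unit_record *out, bool *found)
{
    size_t i;

    *found = false;
    for (i = 0; i < sizeof fake_db / sizeof fake_db[0]; i++) {
        if (fake_db[i].id == id) {
            *out = fake_db[i];
            *found = true;
            return;
        }
    }
}

/* Caller: checks the flag where the exception handler used to be. */
int unit_name_length(int id)
{
    struct unit_record rec;
    bool found;

    fetch_unit(id, &rec, &found);
    if (!found)
        return 0;   /* same path the old "no data found" handler took */
    return (int)strlen(rec.name);
}
```

With this stand-in data, `unit_name_length(2)` goes down the found path and `unit_name_length(9)` goes down the not-found path, without anything having to unwind through the caller's stack frame.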
And then?
The chart still doesn't open.
Why?
In a color setting function, the name of the color is passed in and checked against a table that maps each color name to some internal data. That function lower-cases the color name parameter, since all the names in the table are lower case. The function, though, is modifying (via tolower()) the color name within the parameter itself, rather than in a local variable. For some reason trying to overwrite the parameter in place is causing another segfault. This is a less-than-desirable thing to be doing anyway, i.e. modifying the passed-in argument that should only be used as a lookup value, so the function was modified to lower-case the value into a local variable, which was then used for the table lookup.
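A minimal sketch of the lower-case-into-a-local approach, assuming a C function and a made-up table (`color_table`, `lookup_color` and the ids are hypothetical, not the real code). One plausible reason for the in-place version crashing, and this is only a guess, is a caller passing a string literal, which typically lives in read-only memory on Linux.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical color table; the real one maps names to internal data. */
struct color_entry { const char *name; int id; };

static const struct color_entry color_table[] = {
    { "red", 1 }, { "green", 2 }, { "blue", 3 },
};

/* Returns the color id, or -1 if the name is unknown or too long.
   The caller's string is only read, never written. */
int lookup_color(const char *name)
{
    char lowered[32];
    size_t len = strlen(name);
    size_t i;

    if (len >= sizeof lowered)
        return -1;

    /* Lower-case into the local buffer, leaving the argument untouched. */
    for (i = 0; i < len; i++)
        lowered[i] = (char)tolower((unsigned char)name[i]);
    lowered[len] = '\0';

    for (i = 0; i < sizeof color_table / sizeof color_table[0]; i++)
        if (strcmp(lowered, color_table[i].name) == 0)
            return color_table[i].id;
    return -1;
}
```

Declaring the parameter `const char *` also lets the compiler reject any later attempt to write through it, so `lookup_color("Blue")` is safe even when the argument is a literal.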
Now?
The chart kinda opens, and then freezes.
Why?
Segfault.
Again??
Yes, down in the Lesstif code a null dereference is occurring.
Why?
Beats the hell out of me on this one. I built Lesstif from source, with debug, so I can find the line of code that's causing the problem, but I really don't know what exact sequence of events is leading to it. (I'm an applications, not a systems, programmer!) It does seem to be happening with the ScrolledList widget, when doing something pertaining to fonts.
What now?
Try exploratory code removal. Comment out the line that sets the font list and see if maybe some sort of default gets used.
And...?
Chart correctly opened and displayed, although the text is not italicized like it is on the original platform.
I can live with that.
Tuesday, April 22, 2008
Why the Chart Wasn't Opening
Posted by Marc at 9:20 AM
2 comments
Labels: hubbard, lesstif, porting, segfault, Software Spelunking
Thursday, March 13, 2008
Otters Should Not Be Allowed to Design Software
I got nothing against otters.
They're cute, playful, inquisitive, and so on. We should take joy in their simple life, and protect their habitat.
Neither they nor their human analogues, however, should be allowed to design software.
When you're porting a large software system from one platform to another you spend a lot of time dealing with the design and implementation..uh..quirks of the original builders. Sometimes it's just stylistic stuff like typedef'ing (C) or renaming (Ada) every single freakin' standard type name. Other times, though, you'd swear that otters had been tasked with coming up with the design.
So the part I'm porting now has a central control process and a GUI process that communicate via sockets. All well and fine.
The latest issue I've been dealing with has to do with the operator clicking on a field in a table to change it, which pops up a menu of valid entries, one of which is selected and then the "Done" button is clicked. Somewhere, though, in the update processing chain the value was getting trashed, causing the update to fail.
In other words, run-of-the-mill porting issues.
So I traced through the code and verified that the operator's selection was getting properly packaged up into a message and sent out through the socket. I followed it right up to the socket write, so there was no doubt.
I then went to the recipient of the message, the central control process, and verified that the message was being properly received and decoded.
The next thing that's done after receiving the message is calling a function called Update_Table(). The invalid data error is being detected inside this function. However, the data from the message I just read in is not being passed into Update_Table. WTF? Why is it failing then?
So I dig down into Update_Table and see that what it's doing is going out to query for the data in the table row that's about to be updated. But it's not getting this data from a database. Nor is it accessing some internal data structure model. It's sending out a message.
To the GUI process.
And so now I go back to the GUI process and start tracing from where that message is received. So is the GUI process maintaining some data store of its own that it both displays to the operator and keeps for queries from the controlling process? Why, no, no it isn't. Instead it goes and gets the data out of the graphical widget that's displaying it. If the value is a number, it converts the displayed string representation of the number back to a number; otherwise it passes it back as a string.
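To make that read-back concrete, here is a sketch of what such a conversion might look like in C. The `cell_value` type and `cell_from_display` function are invented for illustration, and the real code pulls the text out of a widget rather than taking a plain string.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tagged value handed back to the control process. */
enum cell_kind { CELL_NUMBER, CELL_STRING };

struct cell_value {
    enum cell_kind kind;
    double number;   /* meaningful when kind == CELL_NUMBER */
    char text[64];   /* meaningful when kind == CELL_STRING */
};

/* Interpret the text shown in a table cell: a number if the whole
   string parses as one, otherwise the string itself. */
struct cell_value cell_from_display(const char *shown)
{
    struct cell_value v;
    char *end;

    memset(&v, 0, sizeof v);
    errno = 0;
    v.number = strtod(shown, &end);
    if (end != shown && *end == '\0' && errno == 0) {
        v.kind = CELL_NUMBER;
        return v;
    }

    v.kind = CELL_STRING;
    v.number = 0.0;
    /* text was zero-filled above, so the copy stays NUL-terminated */
    strncpy(v.text, shown, sizeof v.text - 1);
    return v;
}
```

Whatever the details, the only copy of the value is the string the widget happens to be displaying, so it exists exactly as long as the widget does.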
So not only does the central controlling process not have control of the data, it's outsourced that responsibility to the GUI, and that in turn is using the display widget as its data store.
Whoever came up with this bright idea is a complete, raving, ... otter.
Imagine: in a distributed system that does in fact utilize a database for data storage, you can lose all your active data if the GUI crashes.
(Oh, the problem turned out to be a data alignment mismatch due to the change in word sizes between the different platforms.)
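For the curious, this class of bug usually looks something like the following. This is a guess at the general pattern, not the actual message: the struct, its fields and the function names are all hypothetical. A struct declared with `long` fields is laid out differently when `long` is 4 bytes on one ABI and 8 on another, so copying it raw through a socket only works while both ends agree. A sketch that sidesteps this uses fixed-width fields and writes each one out explicitly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical update message. With `long` fields the layout would
   follow the platform word size; fixed-width types pin it down. */
struct table_update {
    int32_t row;
    int32_t column;
    int32_t value;
};

enum { TABLE_UPDATE_WIRE_SIZE = 12 };   /* three 4-byte fields */

/* Store a 32-bit value most-significant byte first. */
static void put_i32(unsigned char *p, int32_t x)
{
    uint32_t u = (uint32_t)x;
    p[0] = (unsigned char)(u >> 24);
    p[1] = (unsigned char)(u >> 16);
    p[2] = (unsigned char)(u >> 8);
    p[3] = (unsigned char)u;
}

/* Load a 32-bit value stored most-significant byte first. */
static int32_t get_i32(const unsigned char *p)
{
    uint32_t u = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    int32_t v;
    memcpy(&v, &u, sizeof v);   /* reinterpret the bits without a cast */
    return v;
}

/* Write the message field by field, so compiler padding and word size
   never reach the wire. */
void table_update_encode(const struct table_update *m, unsigned char *out)
{
    put_i32(out,     m->row);
    put_i32(out + 4, m->column);
    put_i32(out + 8, m->value);
}

void table_update_decode(struct table_update *m, const unsigned char *in)
{
    m->row    = get_i32(in);
    m->column = get_i32(in + 4);
    m->value  = get_i32(in + 8);
}
```

Both processes then read and write exactly `TABLE_UPDATE_WIRE_SIZE` bytes per message, regardless of how either compiler lays out the struct in memory.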
Posted by Marc at 11:02 AM
1 comment
Labels: goodhue, otters, porting, Software Spelunking
Tuesday, March 11, 2008
Primordial Program Porting Perils
Porting old code can be a pain.
I'm currently working on porting a large (well over a million SLOC of Ada and C) wargaming simulation system from a Silicon Graphics/IRIX platform to PC/Linux.
98.3% of the porting effort has gone pretty smoothly, but there have occasionally been some real showstoppers.
The latest issues I've been working with on the port have to do with the Xbae widget set. The developers of the original SG version of this app had grabbed a version of Xbae source code and frozen it to be evermore part of the code base.
Well, there were serious problems with that version of Xbae versus the Motif distribution that was installed on the designated Linux platform. Those have pretty much been taken care of, but there's been one thing left: When a particular text edit box is updated, it's supposed to automatically update the corresponding cell in an Xbae-provided matrix table, and that wasn't happening.
Since I'd grabbed all the source code for this stuff I was able to walk through what was going on in the debugger, and I discovered that when calling XmTextFieldSetString() to update the matrix cell, a check made in that function was rejecting the update. The check? Well, a quite reasonable one to make sure that the widget being updated was an XmText widget, so as to ensure that one was actually updating what one thought was being updated.
Okay, waitaminnit. It's checking that the widget is an XmText widget, but the code is expecting it to be an XmTextField widget. And isn't the latter what the XbaeMatrixWidget says is actually used?
Er...no. Or yes. Um, it depends on what you read.
The "What is it?" Xbae Matrix Widget page says: "While XbaeMatrix looks and acts like a grid of XmTextField widgets, it actually contains only one XmTextField."
But the Xbae Matrix Documentation page says: "While XbaeMatrix looks and acts like a grid of XmText widgets, it actually contains only one XmText".
So, the source code says it's an XmText widget. When did that happen?
Spelunking on Google we find:
* Swapped out the XmTextField widget to use the XmText widget to enable multi line rows and the like.
When did this happen? Version 4.7, from mid-1999. What version of Xbae am I using? 4.60. And this portion of the app was written in early 1997, well prior to Xbae's conversion from the XmTextField to the XmText widget.
So, I can't say the information wasn't out there; it was properly published in the release notes back then.
But because this is an old application, a "Mines of Moria" project, much of the code hasn't been looked at in years, and so there was never any incentive, hell, any reason, to keep it current with evolving utilities. I expect I'm going to run into this sort of thing again, but what I've got to get onto right now is locating and fixing any additional expectations of XmTextField widgets being used when interacting with the Xbae matrix.
This is gonna be so cool when it's done!