Here's a not-atypical experience when porting a humongous (1.3ish MSLOC) legacy application from one platform to another, in this case from SGI IRIX to Linux.
During a simulation run the operator can click on a button that opens up a chart displaying some statistics depicted as a line graph. In the working version of the port that chart wasn't opening up.
Why?
Because the chart module never received a "Create Line Graph" message.
Why?
Because that message is sent only when the simulation's "current time counter" is not 0.0, and it was not getting incremented.
Why?
Because no "Timestamp" message had been received.
Why?
(At this point I embarked on a Wild Goose Chase -- Marc)
Because the sending process never sent one.
Why?
Because when it went to read a file, the file turned out to be unexpectedly empty, and the process froze up after throwing an "End_Error" exception.
What did the original version do?
Err... I guess the file is empty there too, but it throws an "Out_Of_Data" exception.
Why?
Ah...race condition.
So what's the net effect of the difference between how the two exceptions are handled?
Huh. None. They're both taken to mean "no data", which is a not unexpected condition, and so the exceptions are resolved and processing continues.
-- End of Wild Goose Chase. Backtracking to where I left off:
Why?
Because no "Timestamp" message had been received.
Why?
Because the sending module is experiencing a SEGFAULT.
Why?
Because some data being extracted from a database is getting stomped on with bad, bad values, triggering the segfault.
Why?
Compiler bug.
Really??
Looks that way. Though 99% of the time that I start to think "compiler bug" it turns out to be a programming error, this is one time it looks legit. A procedure is getting called that in turn calls additional subprograms to retrieve the data from the DB. Down at the bottom an exception is thrown that propagates back up to the calling routine. This is a "no data found" exception that is perfectly legitimate to have occur and propagate back up. When control returns to the calling procedure, though, some of its local variables have gotten clobbered, even those that are not part of the calling sequence. Everything is fine until that exception is handed up to the calling procedure. So the work-around for this was to catch the exception within that first called procedure, and change its parameter list to include a "found" flag, which is set according to whether the exception occurred or not. The caller then checks the flag and handles the response as if the exception had occurred.
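To make the shape of that work-around concrete, here is a minimal sketch in C++. The original code is not shown in this post (and was presumably not C++), so every name here (`NoDataFound`, `query_db`, `fetch_value`) is hypothetical, and a `std::map` stands in for the database.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the "no data found" exception.
struct NoDataFound : std::runtime_error {
    NoDataFound() : std::runtime_error("no data found") {}
};

// Hypothetical low-level query: throws when the key is absent.
int query_db(const std::map<std::string, int>& db, const std::string& key) {
    assert(!key.empty() && "lookup key must be non-empty");
    auto it = db.find(key);
    if (it == db.end()) throw NoDataFound{};
    return it->second;
}

// Work-around shape: the first called procedure catches the exception
// itself and reports the outcome through a "found" flag, so nothing
// propagates up into the caller's frame.
void fetch_value(const std::map<std::string, int>& db,
                 const std::string& key, int& value, bool& found) {
    try {
        value = query_db(db, key);
        found = true;
    } catch (const NoDataFound&) {
        found = false;  // value is deliberately left untouched
    }
}
```

The caller then branches on `found` in the same place where it used to handle the exception, so the observable behavior stays the same.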
And then?
The chart still doesn't open.
Why?
In a color-setting function, the name of the color is passed in and checked against a table that maps each color name to some internal data. That function lower-cases the color name parameter, since all the names in the table are lower case. The function, though, is modifying (via tolower()) the color name within the parameter itself, rather than in a local variable. For some reason trying to overwrite the parameter in place is causing another segfault. This is a less-than-desirable thing to be doing anyway, i.e. modifying the passed-in argument that should only be used as a lookup value, so the function was modified to lower-case the value into a local variable, which was then used for the table lookup.
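A rough sketch of the fixed lookup, in C++ rather than whatever the original used. The table contents and the names `kColorTable` and `lookup_color` are made up for illustration. One plausible cause of the crash, which I did not confirm, is a caller passing a string literal that the new platform keeps in read-only memory; copying first sidesteps that either way.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical color table; all keys are lower case, as in the post.
static const std::map<std::string, unsigned> kColorTable = {
    {"red", 0xFF0000u}, {"green", 0x00FF00u}, {"blue", 0x0000FFu},
};

// Returns true and fills `rgb` if the name is known. The argument is
// only read, never written, so a string literal is a safe input.
bool lookup_color(const char* name, unsigned& rgb) {
    assert(name != nullptr);
    std::string key(name);  // local copy that we are free to modify
    for (char& c : key) {
        // The unsigned char cast avoids undefined behavior in tolower()
        // for negative char values.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    auto it = kColorTable.find(key);
    if (it == kColorTable.end()) return false;
    rgb = it->second;
    return true;
}
```

Taking the name as `const char*` also lets the compiler reject any later attempt to write through the parameter.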
Now?
The chart kinda opens, and then freezes.
Why?
Segfault.
Again??
Yes, down in the Lesstif code a null dereference is occurring.
Why?
Beats the hell out of me on this one. I built Lesstif from source, with debug, so I can find the line of code that's causing the problem, but I really don't know what exact sequence of events is leading to it. (I'm an applications, not a systems, programmer!) It does seem to be happening with the ScrolledList widget, when doing something pertaining to fonts.
What now?
Try exploratory code removal. Comment out the line that sets the font list and see if maybe some sort of default gets used.
And...?
Chart correctly opened and displayed, although the text is not italicized like it is on the original platform.
I can live with that.
Tuesday, April 22, 2008
Why the Chart Wasn't Opening
Posted by Marc at 9:20 AM
2 comments
Labels: hubbard, lesstif, porting, segfault, Software Spelunking
Thursday, March 13, 2008
Otters Should Not Be Allowed to Design Software
I got nothing against otters.
They're cute, playful, inquisitive, and so on. We should take joy in their simple life, and protect their habitat.
Neither they nor their human analogues, however, should be allowed to design software.
When you're porting a large software system from one platform to another you spend a lot of time dealing with the design and implementation... uh... quirks of the original builders. Sometimes it's just stylistic stuff like typedef'ing (C) or renaming (Ada) every single freakin' standard type name. Other times, though, you'd swear that otters had been tasked with coming up with the design.
So the part I'm porting now has a central control process and a GUI process that communicate via sockets. All well and fine.
The latest issue I've been dealing with has to do with the operator clicking on a field in a table to change it, which pops up a menu of valid entries, one of which is selected and then the "Done" button is clicked. Somewhere, though, in the update processing chain the value was getting trashed, causing the update to fail.
In other words, run-of-the-mill porting issues.
So I traced through the code and verified that the operator's selection was getting properly packaged up into a message and sent out through the socket. I followed it right up to the socket write, so there was no doubt.
I then went to the recipient of the message, the central control process, and verified that the message was being properly received and decoded.
The next thing that's done after receiving the message is calling a function called Update_Table(). The invalid data error is being detected inside this function. However, the data from the message I just read in is not being passed into Update_Table. WTF? Why is it failing then?
So I dig down into Update_Table and see that what it's doing is going out to query for the data in the table row that's about to be updated. But it's not getting this data from a database. Nor is it accessing some internal data structure model. It's sending out a message.
To the GUI process.
And so now I go back to the GUI process and start tracing from where that message is received. So is the GUI process maintaining some data store of its own that it both displays to the operator and keeps for queries from the controlling process? Why, no, no it isn't. Instead it goes and gets the data out of the graphical widget that's displaying it. If the value is a number, it converts the displayed string representation back to a number; otherwise it passes it back as a string.
So not only does the central controlling process not have control of the data, it's outsourced that responsibility to the GUI, and that in turn is using the display widget as its data store.
Whoever came up with this bright idea is a complete, raving, ... otter.
Imagine: in a distributed system that does in fact utilize a database for data storage, you can lose all your active data if the GUI crashes.
(Oh, the problem turned out to be a data alignment mismatch due to the change in word sizes between the different platforms.)
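For the curious, here is one common way to keep that kind of word-size mismatch from recurring, sketched in C++. This is not what the real code does, and the message layout and names (`TableUpdate`, `encode`, `decode`) are invented: rather than copying a raw struct whose field sizes and padding depend on the platform, each field is written at a fixed width and offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table-update message; the real fields are unknown.
struct TableUpdate {
    std::int32_t row;
    std::int32_t column;
    std::int64_t value;
};

// Size on the wire, independent of sizeof(TableUpdate) and its padding.
constexpr std::size_t kWireSize = 4 + 4 + 8;

// Write the low n bytes of v in big-endian order.
static void put_be(unsigned char* out, std::uint64_t v, std::size_t n) {
    assert(n >= 1 && n <= 8);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<unsigned char>(v >> (8 * (n - 1 - i)));
    }
}

// Read n big-endian bytes back into an unsigned value.
static std::uint64_t get_be(const unsigned char* in, std::size_t n) {
    assert(n >= 1 && n <= 8);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) v = (v << 8) | in[i];
    return v;
}

void encode(const TableUpdate& m, unsigned char (&buf)[kWireSize]) {
    put_be(buf + 0, static_cast<std::uint32_t>(m.row), 4);
    put_be(buf + 4, static_cast<std::uint32_t>(m.column), 4);
    put_be(buf + 8, static_cast<std::uint64_t>(m.value), 8);
}

TableUpdate decode(const unsigned char (&buf)[kWireSize]) {
    TableUpdate m;
    m.row = static_cast<std::int32_t>(static_cast<std::uint32_t>(get_be(buf + 0, 4)));
    m.column = static_cast<std::int32_t>(static_cast<std::uint32_t>(get_be(buf + 4, 4)));
    m.value = static_cast<std::int64_t>(get_be(buf + 8, 8));
    return m;
}
```

Both ends then agree on a 16-byte layout no matter how wide `long` is or how the compiler pads the struct, and the byte order is pinned down as a side effect.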
Posted by Marc at 11:02 AM
1 comments
Labels: goodhue, otters, porting, Software Spelunking
Monday, January 21, 2008
Software Development in the Mines of Moria
The majority of the projects I've worked on during my nearly 25-year-long career have a lot in common with the Mines of Moria. That is, they're vast edifices of architecture and detailed design; great effort, valor, and skill were expended to bring them into being; and much wealth was extracted during their lifetime of construction and operation.
Sometimes the projects are still active, with a small group of developers fixing bugs, improving performance, or adding new features. And sometimes they're pretty moribund, with me being the only active developer in the midst of a very large code base. I'm "cursed with competence" in this aspect, in that I've got a track record of being able to be dropped onto a project that needs something fixed or added, and learning enough of it quickly enough to make the needed changes or additions. (Now I've also handled some clean-sheet projects from the beginning, so that breadth of architectural and design experience has given me the experience I leverage when diving into another project's code base.)
The one characteristic all these "Moria" projects have in common is that at the time of their development, they were always developed using the 'hot technologies' of their day. And these go back awhile, so I dealt with a variety of technologies on these projects: Pascal, Ada, Windows NT, C++, Java, CORBA, UML 1.x, Coad/Yourdon, Rumbaugh's OMT, DEC Alphas, and others.
And while these were on the cutting edge back in their day, they're not now.
So it doesn't take a Gartner Group study to foresee that the projects being developed today with the latest and greatest technologies are also going to eventually turn into Moria projects, at least those that aren't discarded at some point. And be aware, something that's old, but still works and brings in revenue, is going to be kept around, no matter how crusty its implementation and how obsolete its technology.
This realization eventually leads one to become rather skeptical of whatever latest and greatest technology is being touted at any given time. You can understand why senior developers don't get excited over most new technologies, because time and time again whatever was hot and manages to achieve widespread success passes through the phases of becoming mainstream, then pervasive, then over-encumbered, disdained, and finally dismissed.
It's not necessarily a fear of change or the unknown, or a lessening ability to learn, that nudges software developers towards sticking with what they know as they get older, i.e. more experienced. Seeing one killer technology/language/methodology after another fall by the wayside of their career path, year after year, tends to encourage senior developers to stick with what they've seen to be proven approaches to successful software development.
It's certainly important to remain aware of the progress that is being made in the software development field, because genuine advancements in the practice do occur. In fact, those established, proven software technologies of the experienced developer were at one time themselves experimental, state of the art hot technologies.
Some developers do always want to be on the cutting edge of software technology; they find learning and experimenting and pushing the envelope interesting and challenging. And that's great: it's these developers who do the initial shakeout of the technologies, the initial cut to identify the candidates for eventual mainstream use. Other developers like to wait for that shakeout to occur before buying in. They're the ones who are going to put in the big commitments to do the big projects, knowing full well that new programming languages, methodologies, and tools are going to eventually supersede what they've done and how they did it. Doing something big, and doing it well, is what they're looking to accomplish, and that requires relying on proven tools and techniques that can handle the load.
In time of course the big project is done, deployed, maintained, and sometimes gradually, sometimes precipitously, evolves into a Moria project. The budget is reduced, the releases become fewer and more infrequent, and the magnitude of new functionality in each new release declines. The project still has value to its users and customers, and so keeping some lights on in the Mines is justified. And for those of us who find "software spelunking" and "software archeology" interesting, and have the experience with these technologies that used to be famous, it's not a bad way to make a living.
Posted by Marc at 2:05 PM
2 comments
Labels: Clearwater, Software Archeology, software engineering, Software Spelunking