Showing posts with label Ada. Show all posts

Thursday, May 22, 2008

JGNAT is coming back!

(If you're not interested in the Ada programming language, then this post has nothing for you :-)

AdaCore has announced in their latest newsletter that JGNAT, their Ada-to-Java Byte Code compiler, is being updated and will be made available in the 2nd quarter of 2008 (see the "In The Pipeline" sidebar):

A collection of add-on tools for interfacing between Ada and
Java is scheduled for release during Q2 2008. They support
mixed-language Ada/Java development, in particular:

- Calling natively-compiled Ada code from Java
- Compiling Ada to JVM bytecodes and communicating between Ada and Java directly.

The toolsuite exploits the Java Native Interface (JNI) for the
first scenario, but automates the generation of the JNI-related
“glue code” to ease the job of the developer. An updated
version of AdaCore’s JGNAT product handles the second
scenario. The tools take advantage of Ada 2005’s new
features to provide an interfacing mechanism that complies
with the Ada standard.
A future version of the toolsuite will support the invocation of
Java methods from natively-compiled Ada code.

The original release was never updated beyond version 1.1p, which I used a fair amount. I felt it was almost, but not quite, production quality. My experience was that it worked pretty well with Java 1.2, that tasking broke in 1.3, and that it was pretty much a non-starter for 1.4.

Robert Dewar, AdaCore's president, commented in 2004 that "the status of JGNAT is that we have kept the sources updated to the minimal extent that they compile, but we no longer support this product and it was never fully completed."

I don't know the impetus behind restarting JGNAT support, but I'm glad to see it happening.


Monday, February 11, 2008

Concretizing Static Typing Metadata

Well, that's a pretentious title, doncha think?

Steve Yegge writes in Portrait of a Noob that static typing is effectively meta-data ("we also know that static types are just metadata"), like comments, and so isn't strictly required for the compilation and execution of software. He's right in a limited context. If static typing is being used for nothing more than ensuring "type matching", and really doesn't add anything beyond that, then it is effectively just a stronger form of commenting, with the compiler acting in the role of object compatibility inspector.

This does get at why the argument for "type safety" has never achieved much success as a compelling reason for using a strongly typed language. Merely making sure your objects are compatible is a good thing, but it does constrain flexibility and extensibility, and it adds type management overhead (for the programmer).

If strong typing is going to be seriously valuable it has to do more than merely ensure type safety; it needs to actually add concrete information to the software.

Take a programming language like Ada, considered one of the paragons of strongly typed programming languages. It does all the type safety stuff, and Ada advocates are more than happy to promote that as one of its great virtues for creating and delivering reliable, safety-critical software. All true, but obviously type safety, accompanied by its supporting syntax and semantics, was not sufficiently compelling to drive any significant adoption outside the defense and aerospace industries (and in those fields, of course, much of the initial impetus was mandate-driven anyway).

What most of the Ada programming language advocates overlooked was the productivity gain possible by the language's specific implementation of strong typing. When its advocates talked about strong typing aiding productivity, it was nearly always in terms of error avoidance. Again, true, and a good thing, but hardly sexy. After all, how many programmers are going to willingly admit that they write buggy code and that maybe they should look into using a programming language that would help them avoid errors?

I went into some detail about this in The Fundamental Theory of Ada, describing how the specifics of Ada's "type model" allow the Ada programmer to implicitly embed scads of additional information with no effort beyond that of defining a type. The language specifies all the additional programmatic information directly accessible to the programmer pertaining to that type. In a sense, user-defined type definitions implicitly declare an associated class instance with information relevant to that type. Here's an excerpt from Ada:

type Speed_Range is range 0 .. 1000;

With nothing more than a reference to an object of that type:

Speed : Speed_Range;

One can know its minimum value (Speed_Range'First), maximum value (Speed_Range'Last), the minimum number of bits needed to represent all possible values of the type (Speed_Range'Size), the actual number of bits representing a variable of that type (Speed'Size, which is often larger than the type size since objects almost always occupy a whole number of bytes), the number of characters needed to represent the longest possible string representation of values of that type (Speed_Range'Width), etc. You can convert values to and from strings (Speed_Range'Image, Speed_Range'Value), do min/max comparisons (Speed_Range'Min(100, Speed), Speed_Range'Max(Current_Max, Speed)), use the type as a loop controller ("for S in Speed_Range loop" and "while S in Speed_Range loop"), and more. And none of this information needs to be explicitly programmed by a developer; it is all implicitly provided by the mere definition of the type.

This is where strong typing is far more than disposable metadata, like comments. This "aggressive" approach to strong typing, whether in Ada or a similarly conceived programming language, "concretizes" the metadata into practical use to not merely aid error avoidance, but to actively increase programmer productivity.
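
As a rough illustration of the attributes listed above, here is a minimal, self-contained sketch. The procedure name, the value 250, the initial value of Current_Max, and the Steps counter are all made up for the example; only Speed_Range, Speed, and the attributes themselves come from the text.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Speed_Attributes is

   type Speed_Range is range 0 .. 1000;

   Speed       : Speed_Range;
   Current_Max : Speed_Range := 100;  --  hypothetical running maximum
   Steps       : Natural := 0;        --  hypothetical loop counter

begin
   --  String to value; a string outside 0 .. 1000 raises Constraint_Error.
   Speed := Speed_Range'Value ("250");

   --  Bounds, sizes, and image width all come from the type definition.
   Put_Line ("First       :" & Speed_Range'Image (Speed_Range'First));
   Put_Line ("Last        :" & Speed_Range'Image (Speed_Range'Last));
   Put_Line ("Type bits   :" & Integer'Image (Speed_Range'Size));
   Put_Line ("Object bits :" & Integer'Image (Speed'Size));
   Put_Line ("Width       :" & Integer'Image (Speed_Range'Width));

   --  Min/max comparisons without writing any helper functions.
   Current_Max := Speed_Range'Max (Current_Max, Speed);
   Put_Line ("Max         :" & Speed_Range'Image (Current_Max));
   Put_Line ("Min         :" & Speed_Range'Image (Speed_Range'Min (100, Speed)));

   --  The type itself controls the loop: S visits every value in the range.
   for S in Speed_Range loop
      if S <= Speed then
         Steps := Steps + 1;
      end if;
   end loop;
   Put_Line ("Values up to Speed:" & Natural'Image (Steps));
end Speed_Attributes;
```

Nothing in the program restates the bounds 0 and 1000 after the type declaration. Changing the range in that one line changes every result that depends on it.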


Thursday, December 6, 2007

A Coding War Story: What's Your Point?

I had been assigned the task of porting a fairly large (about 400 KSLOC) missile launch command and control system to an upgraded OS version and new compiler and language version. Specifically, from Solaris 2.5.1 to Solaris 7, and from the Verdix Ada Development System (VADS), which was Ada 83, to Rational Apex Ada, which was Ada 95. VADS had been bought out by Rational, and its product obsoleted, although Rational did a pretty good job implementing compatible versions of VADS-specific packages to ease the transition to the Apex compiler.

Three other guys helped with the initial compilations, just to get clean compiles of the code, which took about two weeks, and then I was on my own to actually make the whole system work. Long story short, it was the worst design and implementation of a software system I'd ever seen, and so it took about two more months to successfully complete the port. It was then handed over for formal testing, which took several months as well. I fairly steadily fixed the bugs that were found as testing got going, but that rate quickly declined as it progressed (the original code was a production system after all, so its functionality was pretty solid; I just had to kill the bugs that came about due to adapting to the new compiler). Eventually I was reassigned to another project once everything appeared to be working as well as the original.

Then came the phone call on the Friday before Thanksgiving.

There was a missile test scheduled in about three weeks, and during a lab countdown test the command sequencing had locked up. In real life this would cause a test abort, and if this lock-up occurred within seconds of ignition, a number of irreversible actions would have taken place in support systems, causing a lengthy--and expensive--delay for reprocessing the missile. The missile would not have launched, but there would have been many, many very unhappy people seriously distressed over issues of time and much, much money. (Don't let anyone ever tell you that the Defense Department is cavalier about spending money--I've yet to meet a contract manager for whom budget wasn't their number 1 or 2 priority, with schedule being the other.)

Now this countdown test and many variations of it had been run hundreds of times in the preceding months, with only a handful of minor glitches. So this problem had a very low probability of occurrence, but unfortunately possessed a very high cost of occurrence. Multiply those together and the product was a bad Thanksgiving week for me and dozens of other engineers and managers.

As the guy who did the port this put the spotlight right on me.

As with most safety-critical defense systems, a lot of logging is captured, so it was fairly easy to locate the handful of lines of code that had been most recently executed when the system froze. And of course there was absolutely nothing questionable in those lines of code, and these same statements had already successfully executed literally thousands of times during that same run.

We put the Apex guys at Rational on notice, since it was their compiler and some of their vendor-supplied routines were being called in this area, and it was impressed on them (and everyone) that this was a problem of literally national importance that had to be tracked down. So they got their Thanksgiving week trashed as well.

Since the logs could only tell us so much, we needed to try to repeat the problem in the local lab. For something that pops up in only 1 in 1,000 test runs, that's not going to be easy. Amongst the conjectures as to root cause was that a call into the Unlock function of a vendor-supplied mutex (part of a VADS migration package) was not unlocking. The processing thread that made this call was handling a heartbeat message that nominally arrived every few seconds. So we tried upping the rate on that heartbeat to 10 Hz, i.e., 10 per second, and kicked it off. About an hour later the system locked up. And when reviewing the logs, we saw that the same sequence of logged messages was occurring as had taken place in the failed run. Several runs were made, and it would consistently lock up sometime between 45 and 90 minutes after starting, and each time had the same log trace. So even though we were not now technically running the same code--because of the increased heartbeat rate--the behavior was consistent and so we had high confidence that this stressing scenario was triggering the same problem.

The trick now was to figure out exactly where in the sequence of candidate statements the lock up was occurring.

The implementation of this system used Ada tasking, and used it extraordinarily poorly. Tasking is Ada's high-level concurrency construct, sorta like threads, only built into the language itself. When two tasks communicate, they do it by "rendezvousing", at which time they should exchange any data of interest, and then break the rendezvous and resume their independent executions. This system wasn't implemented that way. Instead, once rendezvous had been made with a target task, that target task would then rendezvous with another task, which in turn would rendezvous with another task, and so on, until eventually some processing would get done, after which all the rendezvous would be broken and each of the tasks would go on their merry way. So what you ended up with was the world's most expensive function calls, bringing an entire, "multi-tasking" process to a halt while it processed a piece of incoming data. It was only because the normal throughput was so low that this hadn't caused performance problems in the past.
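
For readers who haven't seen Ada tasking, here is a minimal sketch of the intended pattern: the two tasks meet, exchange data, and part ways before any real work is done. The task, entry, and variable names are invented for illustration and have nothing to do with the actual system.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Rendezvous_Sketch is

   --  Hypothetical worker task with a single entry for handing over data.
   task Worker is
      entry Hand_Off (Item : in Integer);
   end Worker;

   task body Worker is
      Local : Integer;
   begin
      loop
         select
            accept Hand_Off (Item : in Integer) do
               --  The caller is held only for as long as this copy takes.
               Local := Item;
            end Hand_Off;
            --  Rendezvous is over: the caller is free to run again while
            --  the worker does its processing independently.
            Put_Line ("Processed" & Integer'Image (Local * 2));
         or
            terminate;  --  lets the program end once the caller is done
         end select;
      end loop;
   end Worker;

begin
   for I in 1 .. 3 loop
      Worker.Hand_Off (I);  --  blocks only until the accept body completes
   end loop;
end Rendezvous_Sketch;
```

The anti-pattern described above amounts to doing the processing, including further entry calls to other tasks, inside the accept body, so that every caller in the chain stays blocked until the innermost work finishes.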

The point of this digression about tasking, though, is that when a rendezvous is requested or awaited upon, a "task switch" can occur. This means that the CPU can start processing a different task that's ready to run. So when one task becomes ready to rendezvous with another, a different task may jump in line and get executed, with control eventually getting passed back around to the rendezvousing tasks. Now there are other events that can also cause a task switch, one of which is calling an OS function, like what happens with printing or performing a mutex operation.

So in tracking down exactly which line was causing the problem I had to find a way to record the progress through the sequence of statements--while not triggering a task switch, which could prevent the problem from occurring. So doing Put_Line() was not an option; no system I/O of any sort could be done. I could set a counter variable or something like that, but how do I see what its value is to tell me how far it got, since I can't print it out?

Now one thing that had been observed in the log files about this executable was that while this heartbeat processing froze--which ultimately led to the process' I/O getting all blocked up, and preventing other necessary processing from occurring--other independent tasks within the executable continued to run. So the process as a whole wasn't getting blocked, just a (critical) task chain within it.

This was the wedge needed to get at locating the offending statement.

I created an Ada package containing an enumeration type, a global variable of that type, and a task. The enumeration literals were keyed to the specific statements in the problematic code sequence (like "Incrementing_Buffer_Index", "Locking_Mutex", "Mutex_Unlocked", etc.) and then into that sequence were inserted assignment statements that assigned the corresponding enumeration to the global variable. Because the object code for this was nothing more than storing a constant into a memory location it was extremely unlikely that a task switch could occur by executing such a statement. In fact, our primary suspicions centered on those statements that involved task switches, since the locking up behavior was consistent with execution not resuming (for some reason) after a task switch back.

The monitoring task then itself did nothing more than loop and periodically check to see if the global variable had changed value. Every time it did, it printed out the value to a file. It then delayed for a small interval, and made its next check. Now the reason I could write to a file from this task was that this task only ran when a task switch had occurred back in the problem area and this task had been selected to run. Whatever was done in this task should have no effect on other, unrelated, blocked tasks.

The behavior that was anticipated here, then, was that when the problem code area was entered it would do its thing and keep resetting the global variable as it progressed past each statement. It would then do something that caused a task switch, and because its execution rate (10 Hz) was slower than that of the monitoring task, the monitor could grab the value of the global variable and write it out. So under normal behavior I would expect to see a repeating sequence of a subset of the enumerations, specifically each of those that the variable last held before a task switch occurred. And when the freeze happened, that global variable value should no longer change and the last one recorded will indicate from exactly which statement execution never resumed.
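
The original instrumentation isn't reproduced here, but the idea can be sketched as follows. Everything other than the three enumeration literals quoted earlier is an assumption made for the example: the extra literals, the file name progress.log, the delay values, the Done flag, and the use of pragma Atomic. The short main loop merely stands in for the real heartbeat-handling code.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Progress_Monitor_Sketch is

   --  One literal per statement of interest in the suspect sequence.
   type Progress_Point is
     (Idle, Incrementing_Buffer_Index, Locking_Mutex,
      Mutex_Locked, Unlocking_Mutex, Mutex_Unlocked);

   --  Global progress marker; updating it is just a store to memory.
   Progress : Progress_Point := Idle;
   pragma Atomic (Progress);  --  assumption, not stated in the original

   Done : Boolean := False;   --  hypothetical flag to stop the sketch
   pragma Atomic (Done);

   task Monitor;

   task body Monitor is
      Last_Seen : Progress_Point := Idle;
      Current   : Progress_Point;
      Log       : File_Type;
   begin
      Create (Log, Out_File, "progress.log");  --  hypothetical file name
      while not Done loop
         Current := Progress;
         if Current /= Last_Seen then
            --  File I/O is safe here: it happens in the monitor task,
            --  never in the code under observation.
            Put_Line (Log, Progress_Point'Image (Current));
            Flush (Log);
            Last_Seen := Current;
         end if;
         delay 0.01;  --  polls faster than the instrumented code runs
      end loop;
      Close (Log);
   end Monitor;

begin
   --  Stand-in for the instrumented sequence.
   for Beat in 1 .. 5 loop
      Progress := Incrementing_Buffer_Index;
      Progress := Locking_Mutex;
      delay 0.05;                  --  stands in for a blocking call
      Progress := Mutex_Unlocked;
      delay 0.1;                   --  10 Hz heartbeat spacing
   end loop;
   Done := True;
end Progress_Monitor_Sketch;
```

The monitor only ever records the value the marker held when it happened to be scheduled, which is why the log shows a subset of the literals rather than every one. If the observed code never returns from a call, the last line in the file names the statement that preceded it.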

Ran the instrumented executable. It froze up. And the monitoring worked like a charm.

The logging of the progress monitoring variable displayed exactly the anticipated sequence, which eventually ceased with a value corresponding to having made a call to the Mutex Unlock function, with the value that should have been stored signaling the resumption of the task never showing up--like it had in the thousands of previous invocations.

So over to you Rational. The Apex engineers during this time had been feverishly analyzing their code and had found a place in the mutex code where it could theoretically block for good, but the odds of that happening were very remote because of everything that had to happen with the right sequencing and timing. Murphy's Law, guys, Murphy's Law.

What I did to work around this was to replace the calls to the vendor's mutex functions (which were built atop the OS' mutex functionality) protecting this particular sequence of code with a quick little native Ada mutex package, using that to control mutex access to the relevant area.
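
The replacement package itself isn't shown in this post. One plausible way to build a mutex from nothing but Ada 95 language features is a protected type, sketched below as an assumption rather than a record of what was actually delivered. The names Mutex, Guard, Counter, and Incrementer are invented for the example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Mutex_Sketch is

   --  A mutex built only from language features, with no vendor or
   --  OS mutex calls in the source.
   protected type Mutex is
      entry Lock;        --  callers queue here while the mutex is held
      procedure Unlock;
   private
      Locked : Boolean := False;
   end Mutex;

   protected body Mutex is
      entry Lock when not Locked is
      begin
         Locked := True;
      end Lock;

      procedure Unlock is
      begin
         Locked := False;
      end Unlock;
   end Mutex;

   Guard   : Mutex;
   Counter : Natural := 0;  --  shared data protected by Guard

   task type Incrementer;

   task body Incrementer is
   begin
      for I in 1 .. 1_000 loop
         Guard.Lock;
         Counter := Counter + 1;  --  the critical section
         Guard.Unlock;
      end loop;
   end Incrementer;

begin
   declare
      Workers : array (1 .. 4) of Incrementer;
   begin
      null;  --  the block does not exit until all four tasks finish
   end;
   Put_Line ("Counter =" & Natural'Image (Counter));
end Mutex_Sketch;
```

With four tasks each making 1,000 guarded increments, the final count should come out to 4,000 if the lock is doing its job. How the runtime implements the protected type underneath is up to the compiler, which is presumably why Rational disassembled the real package to check.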

I put this into the code and reran the test. Seven hours later it was still running.

My mutex package code was given to Rational, who compiled and disassembled it and verified that it was not using the same approach that the problematic mutex functions were using.

I then had the most well attended code inspection of my career :-) There were nearly a dozen engineers and managers in the room with me, and at least another dozen dialed in from all over the country, all to inspect about 20 lines of code.

It passed, the new executables were formally built, and it was handed over to the test organization for formal regression testing. A couple weeks later the missile countdown proceeded flawlessly and away it went.

It's a good thing I like cold turkey.

------------------------------------------------------------------------

Okay, this is all well and fine, but what's really the point of a coding war story?

This was a nasty, nasty problem. There was concurrency, over a dozen communicating processes, hundreds of KSLOCs, poor design, poor implementation, interfaces to embedded systems, and millions of dollars riding on the effort. No pressure, eh?

I wasn't the only developer working on this problem, though having done the original port I was of course the primary focus. But even though I did the porting, that doesn't mean I had intimate knowledge of hundreds of thousands of lines of code--or even a decent overview of it. Other engineers around the country were looking through the code and the logs as well, but I found that when they proposed a hypothesis to me about a root cause, it never took more than 30 seconds on my part to dismiss it. Likewise, when I was requested to provide various analyses, I would shove them off on to someone else because it was clear to me they were on the wrong track. Sound like arrogance on my part? Well, yeah, it does, but that's not why I dismissed these hypotheses and requests.

It was because I knew what the nature of the problem was. I didn't know exactly where it was occurring, nor why it was occurring, but I did know what was happening.

I've built up a lot of experience and knowledge over the years--I was an early adopter of Ada, understand concurrency and its pitfalls, I know how Ada runtime libraries handle tasking and concurrency, and I understand low-level programming at the level of raw memory, registers, and assembly language. In other words, I have deep knowledge of my niche of the industry. All of that was brought to bear in successfully tracking down this problem--not just working around the bug, but understanding how to put together an approach to finding the bug in a very sensitive execution environment.

The specifics of a coding war story probably aren't all that interesting to those who aren't familiar with the particulars of its nature and environment, but they are useful for gleaning an understanding of what it takes to solve really difficult problems.

To solve the really difficult problems you need to be more than a coder: you have to understand the "fate" of that code, how it interacts with its environment, and how its environment itself operates.

Then you too can get your Thanksgiving holiday all messed up.