Technology Stocks : Wind River going up, up, up!


To: MONACO who wrote (2481)12/10/1997 12:09:00 PM
From: SteveG
 
Fwiw, a WIND anecdote from an associate:

================

"... The Mars Pathfinder mission was widely proclaimed as "flawless" in the early days after its July 4th, 1997 landing on the Martian surface. Successes included its unconventional "landing" -- bouncing onto the Martian surface surrounded by airbags, deploying the Sojourner rover, and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web. But a few days into the mission, not long after Pathfinder started gathering meteorological data, the spacecraft began experiencing total system resets, each resulting in losses of data. The press reported these failures in terms such as "software glitches" and "the computer was trying to do too many things at once".

This week at the IEEE Real-Time Systems Symposium I heard a fascinating keynote address by David Wilner, Chief Technical Officer of Wind River Systems. Wind River makes VxWorks, the real-time embedded systems kernel that was used in the Mars Pathfinder mission.
In his talk, he explained in detail the actual software problems that caused the total system resets of the Pathfinder spacecraft, how they were diagnosed, and how they were solved. I wanted to share his story with each of you.

VxWorks provides preemptive priority scheduling of threads. Tasks on the Pathfinder spacecraft were executed as threads with priorities that were assigned in the usual manner reflecting the relative urgency of these tasks.

Pathfinder contained an "information bus", which you can think of as a shared memory area used for passing information between different components of the spacecraft. A bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes).

The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus to publish its data. When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and if the information bus thread then attempted to acquire this same mutex in order to retrieve published data, this would cause it to block on the mutex, waiting until the meteorological thread released the mutex before it could continue. The spacecraft also contained a communications task that ran with medium priority.

Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium priority) communications task to be scheduled during the short interval while the (high priority) information bus thread was blocked waiting for the (low priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running.
After some time had passed, a watchdog timer would go off, notice that the data bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.

This scenario is a classic case of priority inversion.
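
To make the sequence concrete, here is a minimal sketch of it as a tick-by-tick simulation in C. It is not Pathfinder or VxWorks code: the task set, the tick counts, and the simplification that a task holds the mutex for all of its work are assumptions made purely for illustration, and the function name high_finish_time is invented. It returns the tick at which the high-priority task finishes, with and without priority inheritance.

```c
#include <assert.h>
#include <stdbool.h>

enum { LOW = 0, MED = 1, HIGH = 2, NTASKS = 3 };

typedef struct {
    int  priority;    /* base priority, larger = more urgent */
    int  release;     /* tick at which the task becomes ready */
    int  remaining;   /* ticks of CPU work still needed */
    bool needs_mutex; /* assumption: all of its work is inside the critical section */
} Task;

/* Simulate a preemptive fixed-priority scheduler with one mutex.
 * Returns the tick at which the HIGH task finishes, or -1 if it never does. */
int high_finish_time(bool inherit)
{
    /* Illustrative numbers only: LOW = meteorological task,
     * MED = communications task, HIGH = bus management task. */
    Task t[NTASKS] = {
        [LOW]  = { 1, 0, 3,  true  },
        [MED]  = { 2, 2, 10, false },
        [HIGH] = { 3, 1, 1,  true  },
    };
    int owner = -1; /* index of the task holding the mutex, -1 if free */

    for (int now = 0; now < 1000; now++) {
        /* Invariant: a task that holds the mutex still has work to do. */
        assert(owner == -1 || t[owner].remaining > 0);

        int best = -1, best_pri = -1;
        for (int i = 0; i < NTASKS; i++) {
            if (t[i].release > now || t[i].remaining == 0)
                continue; /* not released yet, or already done */
            if (t[i].needs_mutex && owner != -1 && owner != i)
                continue; /* blocked waiting for the mutex */

            int pri = t[i].priority;
            if (inherit && i == owner) {
                /* The holder inherits the priority of its most urgent waiter. */
                for (int j = 0; j < NTASKS; j++)
                    if (j != i && t[j].needs_mutex && t[j].release <= now &&
                        t[j].remaining > 0 && t[j].priority > pri)
                        pri = t[j].priority;
            }
            if (pri > best_pri) { best = i; best_pri = pri; }
        }
        if (best < 0)
            continue; /* nothing runnable this tick */

        if (t[best].needs_mutex)
            owner = best; /* acquire (or keep holding) the mutex */
        if (--t[best].remaining == 0) {
            if (owner == best)
                owner = -1; /* release the mutex on completion */
            if (best == HIGH)
                return now + 1;
        }
    }
    return -1;
}
```

With these made-up numbers, high_finish_time(false) is 14, because the high-priority task ends up waiting out the whole medium-priority run, while high_finish_time(true) is 4, because the low-priority holder is boosted past the medium-priority task just long enough to release the mutex.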

HOW WAS THIS DEBUGGED?

VxWorks can be run in a mode where it records a total trace of all interesting system events, including context switches, uses of synchronization objects, and interrupts. After the failure, JPL engineers spent hours and hours running the system on the exact spacecraft replica in their lab with tracing turned on, attempting to replicate the precise conditions under which they believed that the reset occurred. Early in the morning, after all but one engineer had gone home, the engineer finally reproduced a system reset on the replica. Analysis of the trace revealed the priority inversion.

HOW WAS THE PROBLEM CORRECTED?

When created, a VxWorks mutex object accepts a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The mutex in question had been initialized with the parameter off; had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority data bus thread blocked on it while it held the mutex, causing it to be scheduled with higher priority than the medium-priority communications task, thus preventing the priority inversion. Once diagnosed, it was clear to the JPL engineers that using priority inheritance would prevent the resets they were seeing.
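
For readers who want to see what "creating a mutex with priority inheritance turned on" looks like in code, here is a rough analogue using POSIX threads rather than VxWorks. The anecdote does not show the actual VxWorks call, so this is a stand-in under that assumption; the function name init_bus_mutex and its boolean argument are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Initialise *m. When inherit is true, a thread holding the mutex is
 * temporarily scheduled at the priority of its highest-priority waiter.
 * Returns 0 on success or an errno-style code from the pthread calls.
 * (Hypothetical helper: POSIX analogue, not the VxWorks API.) */
int init_bus_mutex(pthread_mutex_t *m, bool inherit)
{
    assert(m != NULL);

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    /* PTHREAD_PRIO_NONE is the "parameter off" case from the story. */
    rc = pthread_mutexattr_setprotocol(
        &attr, inherit ? PTHREAD_PRIO_INHERIT : PTHREAD_PRIO_NONE);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

The point of the sketch is only that inheritance is chosen once, when the mutex is created; the lock and unlock calls in the tasks themselves do not change.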

VxWorks contains a C language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. The JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled. By coding convention, the initialization parameters for the mutex in question (and for two others that could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software, and available to the C interpreter. A short C program was uploaded to the spacecraft, which, when interpreted, changed the values of these variables from FALSE to TRUE. No more system resets occurred.
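
The mechanism described here, flags kept in named globals and reachable through a symbol table, can be sketched in a few lines of C. Everything below is hypothetical: the variable names, the table layout, and the patch_flag helper are invented, and the anecdote does not say what the uploaded program actually looked like or how the new values took effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FALSE 0
#define TRUE  1

/* Hypothetical globals holding the mutex creation parameters. */
int busMutexInherit  = FALSE;
int auxMutexAInherit = FALSE;
int auxMutexBInherit = FALSE;

/* A toy symbol table: maps a variable's name to its address. */
typedef struct {
    const char *name;
    int        *addr;
} Symbol;

static const Symbol symbol_table[] = {
    { "busMutexInherit",  &busMutexInherit  },
    { "auxMutexAInherit", &auxMutexAInherit },
    { "auxMutexBInherit", &auxMutexBInherit },
};

/* Set the named global to value. Returns 0 on success, -1 if the
 * name is not in the symbol table. */
int patch_flag(const char *name, int value)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof symbol_table / sizeof symbol_table[0]; i++) {
        if (strcmp(symbol_table[i].name, name) == 0) {
            *symbol_table[i].addr = value;
            return 0;
        }
    }
    return -1;
}
```

In this toy version, a call such as patch_flag("busMutexInherit", TRUE) plays the role of the uploaded program: it finds the variable by name and flips it, without rebuilding or reloading anything else.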

ANALYSIS AND LESSONS

First and foremost, diagnosing this problem as a black box would have been impossible. Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified.

Secondly, leaving the "debugging" facilities in the system saved the day. Without the ability to modify the system in the field, the problem could not have been corrected.

Finally, the engineers' initial analysis that "the data bus task executes very frequently and is time-critical -- we shouldn't spend the extra time in it to perform priority inheritance" was exactly wrong. It is precisely in such time-critical and important situations that correctness is essential, even at some additional performance cost.

HUMAN NATURE, DEADLINE PRESSURES

David told us that the JPL engineers later confessed that one or two system resets had occurred in their months of pre-flight testing. They had never been reproducible or explainable, and so the engineers, in a very human-nature response of denial, decided that they probably weren't important, using the rationale "it was probably caused by a hardware glitch".

Part of it too was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software. Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.

THE IMPORTANCE OF GOOD THEORY/ALGORITHMS

David also said that some of the real heroes of the situation were some people from CMU who had first identified the priority inversion problem and proposed the solution, in a paper he had heard presented many years ago. He apologized for not remembering the precise details of the paper or who wrote it. Bringing things full circle, it turns out that the three authors of this result were all in the room, and at the end of the talk were encouraged by the program chair to stand and be acknowledged.

They were Lui Sha, John Lehoczky, and Raj Rajkumar. When was the last time you saw a room of people cheer a group of computer science theorists for their significant practical contribution to advancing human knowledge? :-)

It was quite a moment.

POSTLUDE

For the record, the paper was:

L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. In IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep. 1990..."

============

Steve



To: MONACO who wrote (2481)12/10/1997 11:38:00 PM
From: Allen Benn
 
The strategic relationship between Northern Telecom and WIND is important for many reasons.

The most obvious reason is that the relationship underscores WIND's total domination of the telecom/datacom space. They have dealings with virtually every heavy hitter in the business, including, among many others: 3Com, Bay Networks, Cisco, Newbridge Networks, Lucent, Hughes Network Systems, Qualcomm, Motorola, Orbital Sciences and Fore Systems. This deal publicizes that VxWorks is the standard in telephony, making Tornado/VxWorks even more attractive to any company in this business. For example, this agreement should make WIND the vendor of choice also with joint venture partners of Northern Telecom, such as Cabletron and Shiva.

Now let's look deeper at the probable mechanics of the relationship. NT is a $15 billion company with 68,000 employees. That means development projects are managed using hierarchical organizational structures subject to lots of corporate-wide internal rules and regulations. I suspect that a project development team fifty levels down inside the organization is, and always has been, free to pick any RTOS vendor of their choosing - as long as a proper procurement process is followed, resulting in the vendor proposing the least-cost, most efficacious solution. As we have seen from recent postings, Tornado may be accused of many things, but being least cost is not one of them - unless you also include indirect costs, such as efficient use of available engineers. This means the poor blokes at NT probably got fed up having to justify their usual choice (WIND), and finally got the managers to negotiate an overall contract with WIND, enabling any project in the company to select Tornado and bypass burdensome procurement procedures.

Certainly there will be some projects in the company that will insist on using some other RTOS. There always are. Religious attachment to complicated things like RTOSs and their IDEs is unwavering, and will not disappear just because there is now an easier way. But almost all new projects will gravitate to Tornado over time. Most of these projects will be relieved to gain easy access to Tornado/VxWorks, and some will do it because they see the wisdom of capitulating to the will of the company.

Once this process permeates the company, speeded up by the fact that Tornado/VxWorks already has been a popular choice in the company, WIND's entrenchment in NT will be irrevocable. No doubt this is understood by NT and is seen as good. To NT the relationship not only means reduced procurement time, but it also means they can train development engineers to a single, adequate standard, and then assign staff flexibly as needed among projects.

The fact that this deal crystallizes WIND as the RTOS standard bearer in the telephony community also is attractive to NT. Picking the mainstream development environment virtually guarantees that the company need never suffer through conversion to yet another OS.

Finally, both WIND and NT win on costs. NT wins because they probably get to count both development seats and run-time units at a more aggregated level than before, enjoying high-volume discounts. WIND wins because they will end up in more NT projects than otherwise, with greatly reduced costs of sales. They both also win because WIND will be able to organize improved support services due to economies of scale and better visibility of future plans and schedules. This kind of relationship is one answer to the cost problems faced by the isolated Boeing employee posted about recently. It is also worth noting that low-volume projects that resist using Tornado/VxWorks will face a difficult cost comparison because unit royalties usually depend on volume.

From the beginning, one theme propounded on this thread was the importance of strategic relationships with large electronics companies like NT. Early on I mistakenly claimed WIND's relationship with HP was strategic, but recanted when I discovered there was no formal, strategic relationship in place with HP, albeit there may as well be one by the way VxWorks ends up in so many HP products. Well, the NT relationship is formal and it is strategic. NT is a huge company whose endorsement probably will be emulated often by other companies, with Boeing and the other telecom/datacom companies for starters.

Actually, the Adobe agreement also is strategic, although specific to an underlying printer OS, and with a company 1/20 the size of NT. There probably are many other strategic relationships in place by now, but I would wager few of the size and breadth of the NT deal.

Allen