On Building Systems That Will Fail

Turing Lecture Paper (1991)


Fernando J. Corbató

It is an honor and a pleasure to accept the Alan Turing Award. My own work
has been on computer systems and that will be my theme. It is the essence
of systems that they are integrating efforts, requiring broad knowledge
of the problem area to be addressed, and that the detailed knowledge required
is rarely held by one person. Thus the work of systems is usually done by
teams and so it is in my case too. Hence I am accepting this award on behalf
of the many whom I have worked with as much as for myself. It is not practical
to name them all, so I will not. Nevertheless I would like to give special
mention to Marjorie Daggett and Bob Daley for their parts in the birth of
CTSS and to Bob Fano and the late Ted Glaser for their critical contributions
to the development of the Multics System.

Introduction


Let me turn now to the title of this talk: On Building Systems That Will
Fail. Of course, the title I chose was a teaser. Some of the alternate titles
I came up with and discarded were: “On Building Messy Systems,”
which seemed too frivolous and suggested there is no systematic approach,
and “On Mastering System Complexity,” which sounded like I have all the answers.
The title that came closest, “On Building Systems That Are Likely to
Have Failures,” did not have the nuance of inevitability that I wanted
to suggest.

What I am really trying to address is the class of systems that, for want
of a better phrase, I will call “ambitious systems.” It almost
goes without saying that ambitious systems never quite work as expected.
Things usually go wrong and sometimes in dramatic ways. And this leads me
to my main thesis, namely, that the question to ask when designing such
systems is not “if something will go wrong,” but “when
will it?”

Some Examples


Now, ambitious systems that fail are really much more common than we may
realize. In fact, in some circumstances we strive for such failure, revelling
in the excitement of the unexpected. For example, let me remind you of our national
sport of football. The whole object of the game is for each team to play
at the limit of its abilities. Besides the sheer physical skills required,
one has the strategic intricacies, the ability to audibilize, and the quickness
to react to the unexpected, which are all a deep part of the game. Of course,
occasionally one team approaches perfection, all the plays work, and the
game becomes dull.

Another example of a system too ambitious for perfection, which I will not
dwell on because of its painful immediacy, is military warfare. The same
elements are there with opposing sides having to constantly improvise and
deal with the unexpected. In fact we get from the military that wonderful
acronym, SNAFU, which is politely translated as “situation normal,
all fouled up.” And if any of you are still doubtful, consider how
rapidly the phrases: “precision bombing” and “surgical strikes”
are replaced by “the fog of war” and “casualties from friendly
fire” as soon as hostilities begin.


On a somewhat more whimsical note, let me offer Boston driving as an example
of systems that will fail. Automobile traffic is an excellent case
of distributed control with a common set of protocols called traffic regulations.
The Boston area is notorious for the free interpretations which drivers
make of these pesky regulations and perhaps the epitome of it occurs in
the arena of the traffic rotary. A case can be made for rotaries. They are
efficient. There is no need to wait for sluggish traffic signals. They are
direct. And they offer great opportunities for creative improvisation, thereby
adding zest to the sport of driving.

One of the most effective strategies is for a driver approaching a
rotary to rigidly fix his head staring forward, all the while, of course, secretly
using peripheral vision to the limit. It is even more effective if the driver,
on entering the rotary, starts to speed up, and some drivers embellish this
last step by adopting a look of maniacal glee. The effect is, of course,
one of intimidation, and a pecking order quickly develops.

The only reason there are not more accidents is that most drivers have a
second component to the strategy, namely, they assume everyone else may
be crazy (they often are right) and every driver is really prepared to stop
with inches to spare. Again we see an example of a system where ambitious
tactics and prudent caution lead to an effective solution.

So far the examples I have given may suggest that the failures of ambitious
systems come from the human element and that at least the technical parts
of the system can be built correctly. In particular, turning to computer
systems, one might believe it is only a matter of getting the code debugged.
Some assume rigorous testing will do the job. Some put their hopes in proving
program correctness.
But unfortunately there are many cases where none of these techniques will
always work [15]. Let me offer a modest example illustrated in Figure 1.


Consider the case of an elaborate numerical calculation with a variable,
f, representing some physical value, being calculated for a set of points
over a range of a parameter, t. Now the property of physical variables
is that they normally do not exhibit abrupt changes or discontinuities.
Yet the curve for f in Figure 1 shows just such an abrupt jump.

So what has happened here? If we look at the expression for f, we
see it is the result of a constant, k, added to the product of two
other functions, g and h. Looking further we see that the
function g has a behavior that is exponentially increasing with t.
The function h, on the other hand, is exponentially decreasing with
t. The resultant product of g and h is almost constant
with increasing t until an abrupt jump occurs and the curve for f
goes flat.

What has gone wrong? The answer is that there has been floating-point underflow
at the critical point in the curve, i.e., the representation of the negative
exponent has exceeded the field size of the floating-point representation
for this particular computer, and the hardware has automatically set the
value of the function h to zero. Often this is reasonable, since
small numbers are correctly approximated by zero, but not in this case, where
our results are grossly wrong. Worse yet, since the computation of f
might be internal, it is easy to imagine that the failure shown here would
not be noticed.
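
To make the failure concrete, here is a minimal sketch of the figure's behavior
in Python. The particular functions, the constant k, and the flush-to-zero
threshold (a single-precision-style 1e-38) are all invented for illustration;
they merely mimic hardware that silently replaces underflowed values with zero.

```python
import math

K = 10.0  # the constant k added to the product

def g(t):
    # exponentially increasing with t
    return math.exp(t)

def h(t):
    # exponentially decreasing with t; emulate hardware that silently
    # flushes underflowed values to zero (1e-38 is an assumed,
    # single-precision-style minimum, chosen only for illustration)
    v = 5.0 * math.exp(-t)
    return 0.0 if abs(v) < 1e-38 else v

def f(t):
    # mathematically g(t) * h(t) == 5.0 for every t, so f should be 15.0
    return K + g(t) * h(t)

for t in range(0, 130, 10):
    print(f"t = {t:3d}   f = {f(t)}")
# f stays at 15.0 until roughly t = 90, where h underflows to zero and
# the curve silently goes flat at K alone: grossly wrong, and no warning.
```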

Because correctly handling the pathology this example represents is an
extra engineering bother, it should not be surprising that the problem of
underflow is frequently ignored. But the larger lesson to be learned from
this example is that subtle mistakes are very hard to avoid and to some
extent are inevitable.


I encountered my next example when I was a graduate student programming
on the pioneering Whirlwind computer. One night, while awaiting my turn to
use it, the graduate student before me began complaining of how “tough”
some of his calculations were. What he said he was doing was computing the
vibrational frequencies of a particular wing structure for a series of cases.
In fact his equations were cubics, and he was using the iterative Newton-Raphson
method. For reasons he did not understand, his method was finding one of
the roots but not “converging” for the others. The fix he was
making was to change his program so that when it encountered one of these
tough roots, it would abandon the iteration after a fixed number
of tries.

Now there were several things wrong. First, the coefficients to his cubic
equations were based on experimental data and some of his points were simply
bad so that, in fact, as Figure 2 illustrates, he only had one real root
and a pair of imaginaries. Thus his iterative method could never converge
for the second and third roots and the value of his first root was pure
garbage. Second, cubic equations have an exact analytic closed form solution
so that it was entirely unnecessary to use an iterative method. And third,
based on his incomplete model and understanding of what was happening, he
exercised very poor judgment in patching his program to ignore values that
were seemingly difficult to compute.
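
The situation is easy to reproduce. Below is a small Python illustration,
emphatically not the student's program: a stand-in cubic with the same character
(one real root and a complex-conjugate pair), Newton-Raphson in real arithmetic,
and the root finder that makes the iteration unnecessary.

```python
import numpy as np

# Stand-in cubic: x^3 - x^2 + x - 1 = (x - 1)(x^2 + 1),
# with roots 1, i, and -i: one real root, two complex.
coeffs = [1.0, -1.0, 1.0, -1.0]

def p(x):
    return ((coeffs[0] * x + coeffs[1]) * x + coeffs[2]) * x + coeffs[3]

def dp(x):
    return (3 * coeffs[0] * x + 2 * coeffs[1]) * x + coeffs[2]

def newton(x, iters=50):
    # Newton-Raphson in real arithmetic, as the student used it
    for _ in range(iters):
        x = x - p(x) / dp(x)
    return x

print(newton(2.0))   # converges to the lone real root, 1.0
print(newton(-0.5))  # also ends at 1.0; real iterates can never reach
                     # i or -i, so "converging" on them is hopeless

# A closed-form solution exists for cubics; any polynomial root finder
# likewise yields all three roots at once, complex pair included:
print(np.roots(coeffs))
```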

Ambitious System Properties


Let me turn next to some of the general properties of ambitious systems.
First, they are often vast and have significant organizational structure
going beyond that of simple replication. Second, they are frequently complicated
or elaborate and are too much for even a small group to develop. Third,
if they really are ambitious, they are pushing the envelope of what people
know how to do, and as a result there is always a level of uncertainty about
when completion is possible. Because one has to be an optimist to begin
an ambitious project, it is not surprising that underestimation of completion
time is the norm. Fourth, ambitious systems, when they work, often break
new ground, offer new services, and soon become indispensable. Lastly, it
is often the case that ambitious systems, by virtue of having opened up a
new domain of usage, invite a flood of improvements and changes.

Now one could argue that ambitious systems are really only difficult the
first time or two. It really is only a matter of learning how to do it.
Once one has, then one simply draws up the appropriate PERT charts, hires
good managers, ensures an adequate budget and gets on with it. Perhaps there
are some instances where this works, but at least in the area of computer
systems, there is a fundamental reason it does not.


A key reason we cannot seem to get ambitious systems right is change. The
computer field is intoxicated with change. We have seen galloping growth
over a period of four decades and it still does not seem to be slowing down.
The field is not mature yet and already it accounts for a significant percentage
of the Gross National Product, both directly and indirectly. More importantly,
the computer revolution, this second industrial revolution, has changed
our lifestyles and allowed the growth of countless new application areas.
And all this change and growth not only has changed the world we live in
but has raised our expectations, spurring on increasingly ambitious systems
in such diverse areas as airline reservations, banking, credit cards, and
air traffic control to name only a few.

Behind the incredible growth of the computer industry is, of course, the
equally mind-boggling change that has occurred in the raw performance of
digital logic. Figure 3, which is not precise and which many of you have
seen before in some form, gives the performance of a top-of-the-line computer
by decade. The ordinate, in MIPS, is logarithmic, as you can see. In particular,
in the last decade the graph becomes problem-dependent, so that the upper
right-hand end of the line should break up into some sort of whiskers as
more and more computers are tailored for special applications and for parallelism.

Complicating matters too is that parallelism is not a solution for every
problem. Certain calculations that are intrinsically serial, such as rocket
trajectories, have very limited benefit from parallel computers. And one
of course is reminded of the old joke about the Army way of speeding up
pregnancy by having nine women spend one month at the task.


As Figure 4 makes clear, it is not just performance that has fueled growth
but rather cost/performance, or simply put, the favorable economics. The
graph is an oversimplification but represents the cost for a given performance
computer model over the last four decades. Again the ordinate is logarithmic,
going from 10 million dollars in 1950 down to 1 thousand dollars in 1990.
As we approach the present, corresponding to a personal computer, the graph
really should become more complicated, since one consequence of computers
becoming super-cheap is that, increasingly, they are being embedded in other
equipment. The modern automobile is but one example. And it remains to be
seen how general-purpose the current wave of palm-sized computers will be
with their stylus inputs.

Further, when we look at a photograph taken back around 1960 of a “machine
room” staffed with one lone operator, we are reminded of the fantastic
changes that have occurred in computer technology. The boxes are huge, shower-stall
sized, and the overall impression is of some small factory. You were
supposed to be impressed and the operator was expected to maintain decorum
by wearing a necktie. And if he did not, at least you could be sure an IBM
maintenance engineer would.

Another reminder of the immense technological change which has occurred
is in the physical dimensions of the main memories of computers. For example,
if one looks at old photographs taken in the mid-1950s of core memory systems,
one typically sees a core memory plane roughly the size of a tennis racquet
head which could hold about 1000 bits of information. Contrast that with
today’s 4 megabit memory chips which are smaller than one’s thumb.

Now the basis of the award today is largely for my work on two pioneering
time-sharing systems, CTSS [1-2] and Multics [3-8]. Indeed it is out of
my involvement with those two systems that I gained the system building
perspective I am offering. It therefore seems appropriate to take a brief
retrospective look at these two systems as examples of ambitious systems
and to explore the reasons why the complexity of the tasks involved made
it almost impossible to build the systems correctly the first time [10].

CTSS, The Compatible Time-Sharing System


Looking first at CTSS, I have to remind you of the dark ages that then existed.
This was the early 1960s. The computers of the day were big and expensive
and the administrators of computing centers felt obliged to husband the
precious resource. Users, i.e. programmers, were expected to submit a computing
job as a deck of punched cards; these were then combined into a batch with
other jobs onto a magnetic tape and the tape was processed by the computer
operating system. It had all the glamour and excitement of dropping one’s
clothes off at a laundromat.

The problem was that even for a trivial input typing mistake, the job would
be aborted. Time-sharing, as most of you know, was the solution to the problem
of not being able to interact with computers. The general vision of modern
time-sharing was primarily spelled out by John McCarthy, who, I am pleased
to note, is a featured speaker at this conference. In England, Christopher
Strachey independently came up with a limited kind of interactive computing
but it was aimed mostly at debugging. Soon there were many groups around
the country developing various forms of interactive computing, but in almost
all cases, the resulting systems had significant limitations.


It was in this context that my own group developed our version of the time-sharing
vision. We called it The Compatible Time-Sharing System, or CTSS for short.
Our initial aspirations were modest. First, it was meant to be a demonstration
prototype before the more ambitious designs then being attempted by others
could be implemented. Second, it was intended that general-purpose programming
could be done. And third, it was meant to be possible to run most of the
large body of software that had been developed over the years in the batch-processing
environment. Hence the name.

The basic scheme used to run CTSS was simple. The supervisor program, which
was always in main memory, would commutate among the user programs, running
each in turn for a brief interval with the help of an interval timer. As
Figure 5 indicates, user programs could do input/output with the typewriter-like
terminals and with the disk storage unit as well.
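
In modern terms the scheme was preemptive round-robin scheduling. The toy
sketch below captures only the shape of the idea; CTSS itself was, of course,
not written in Python but for a modified IBM 7094, and every name here is
illustrative.

```python
from collections import deque

def user_program(name, steps):
    # model a user program as a generator that is preempted at each yield
    for i in range(steps):
        yield f"{name}: quantum {i}"

def supervisor(programs):
    ready = deque(programs)        # queue of runnable user programs
    while ready:
        prog = ready.popleft()     # commutate: pick the next program in turn
        try:
            print(next(prog))      # run it for one interval-timer quantum
            ready.append(prog)     # preempt it and put it at the rear
        except StopIteration:
            pass                   # program finished; drop it

supervisor([user_program("alice", 2), user_program("bob", 3)])
```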

But the diagram is oversimplified. The key difficulty was that main memory
was in short supply and not all the programs of the active users could remain
in memory at once. Thus the supervisor program not only had to move programs
to and from the disk storage unit, but it also had to act as an intermediary
for all I/O initiated by user programs. Thus, in a more faithful diagram,
all the I/O lines would point only to the supervisor program.

As a further complication, the supervisor program had to prevent user programs
from trampling over one another. To do this required special hardware modifications
to the processor such that there were memory bound registers that could
only be set by the supervisor. Nevertheless, despite all the complications,
the simplicity of the initial supervisor program allowed it to occupy only
about 22K bytes of storage, less storage than required for the text of this talk!
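
A toy model of that protection scheme, assuming nothing beyond what is
described above (bound registers, settable only in supervisor mode, checked
on every user-mode access), might look like this:

```python
class Processor:
    def __init__(self):
        self.base, self.limit = 0, 0
        self.supervisor_mode = True

    def set_bounds(self, base, limit):
        # the special hardware: only the supervisor may set the bounds
        if not self.supervisor_mode:
            raise PermissionError("user programs may not set bound registers")
        self.base, self.limit = base, limit

    def load(self, memory, addr):
        # every user-mode access is checked against the bound registers
        if not self.supervisor_mode and not (self.base <= addr < self.limit):
            raise MemoryError(f"access to {addr} outside [{self.base}, {self.limit})")
        return memory[addr]

cpu = Processor()
memory = list(range(100))
cpu.set_bounds(10, 20)        # supervisor assigns a user its region
cpu.supervisor_mode = False   # dispatch the user program
print(cpu.load(memory, 15))   # fine: within the user's bounds
# cpu.load(memory, 50)        # would trap: trampling on another program
```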


Most of the battles of creating CTSS involved solving problems which at
the time did not have standard solutions. For example:

There were no standard terminals. There were no simple modems. I/O to the
computer was by word and not by character, and worse yet, did not accommodate
lower case letters. The computers of the day had neither interrupt timers
nor calendar clocks. There was no way to prevent user programs from issuing
raw I/O instructions at random. There was no memory protection scheme. And,
there was no easy way to store large amounts of data with relatively rapid
random access.


The overall result of building CTSS was to change the style of computing,
but several effects seem worth noting. One of the most important
was that we discovered that writing interactive software was quite different
from writing software for batch operation, and even today, in this era of personal
computers, the evolution of interactive interfaces continues.

In retrospect, several design decisions contributed to the success of CTSS,
but two were key. First, we could do general-purpose programming and, in
particular, develop new supervisor software using the system itself. Second,
by making the system able to accommodate older batch code, we inherited
a wealth of older software ready-to-go.

One important consequence of developing CTSS was that for the first time
users had persistent on-line storage of programs and data. Suddenly the
issues of privacy, protection and backup of information had to be faced.
Another byproduct of the development was that because we operated terminals
via modems, remote operation became the norm. Also the new-found freedom
of keeping information on-line in the central file system suddenly made
it especially convenient for users to share and exchange information among
themselves.

And there were surprises too. To our dismay, users who had been enduring
several-hour waits between jobs run under batch processing were suddenly
restless when response times were more than a second. Moreover, many of the
simplifying assumptions that had allowed CTSS to be built so simply, such
as a one-level file system, suddenly began to chafe. It seemed like the
more we did, the more users wanted.


There are two other observations that can be made about the CTSS system.
First, it lasted far longer than we expected. Although CTSS had been demonstrated
in primitive form in November 1961, it was not until 1963 that it came into
wide use as the vehicle of a Project MAC Summer Study. For a time there
were two copies of the system hardware, but by 1973 the last copy was turned
off and scrapped, primarily because the maintenance costs of the IBM 7094
hardware had become prohibitive; up to the bitter end, there
were users desperately trying to get in a few last hours of use.

Second, the then-new transistors and large random-access disk files were
absolutely critical to the success of time-sharing. The previous generation
of vacuum tubes was simply too unreliable for sustained real-time operation
and, of course, large disk files were crucial for the central storage of
user programs and data.

A Mishap

Now my central theme is to try to convince you that when you have a multitude
of novel issues to contend with while building a system, mistakes are inevitable.
And indeed, we had a beauty while using CTSS. Let me describe it:


What happened was that one afternoon at Project MAC, where CTSS was being
used as the main time-sharing workhorse, any user who logged in found that
instead of the usual message of the day typing out on his terminal, he got
the entire file of user passwords. This went on for fifteen or twenty
minutes until one particularly conscientious user called up the system administrator
and began the conversation with: “Did you know that…?” Needless
to say, there was general consternation at this colossal breach of security;
the system was hastily shut down and the next twelve hours were spent heroically
changing everyone’s password. The question was how this could have happened.
Let me explain.

To simplify the organization of the initial CTSS system, a design decision
had been made to have each user at a terminal associated with his own directory
of files. Moreover the system itself was organized as a kind of quasi-user
with its own directory that included a large number of supporting applications
and files, including the message of the day and the password file. So far,
so good. Normally a single system programmer could log in to the system directory
and make any necessary changes. But the number of system programmers had
grown to about a dozen, and, further, the system by then was being
operated almost continuously, so that live maintenance of
the system files had become essential. Not surprisingly, the system programmers
saw the one-user-to-a-directory restriction as a big bottleneck for themselves.
They thereupon proceeded to cajole me into letting the system directory
be an exception, so that more than one person at a time could be logged into
it. They assured me that they would be careful not to make mistakes.

But of course a mistake was made. Overlooked was a software design decision
in the standard system text editor: it was assumed that the editor would
only be used by one user at a time, working in one directory, so that a temporary
file could have the same name for all instantiations of the editor. But
with two system programmers editing at the same time in the system directory,
the editor temporary files got swapped, and the disaster occurred.
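
The flaw is timeless and takes only a few lines to demonstrate. Here is a
hypothetical sketch in Python; the file and directory names are invented,
not the actual CTSS ones:

```python
import os

TEMP_NAME = "EDITOR.TEMP"   # one fixed scratch-file name per directory,
                            # shared by every running instance of the editor

def editor_save(directory, contents):
    # the editor stages its output through the shared scratch file
    path = os.path.join(directory, TEMP_NAME)
    with open(path, "w") as tmp:
        tmp.write(contents)
    return path

os.makedirs("system_dir", exist_ok=True)

# Two system programmers editing different files in the shared system
# directory unknowingly write into the same scratch file:
editor_save("system_dir", "the message of the day")
editor_save("system_dir", "the password file")   # clobbers the first

with open(os.path.join("system_dir", TEMP_NAME)) as tmp:
    print(tmp.read())   # the message-of-the-day session now reads
                        # back the password data instead
```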

One can draw two lessons from this: First, design bugs are often subtle
and occur by evolution with early assumptions being forgotten as new features
or uses are added to systems; and second, even skilled programmers make
mistakes.

Multics

Let me turn now to the development of Multics [9]. I will be brief since
the system has been documented well and there have already been two retrospective
papers written [11, 13]. The Multics system was meant to do time-sharing
“right” and replace the previous ad hoc systems such as CTSS.
It started as a cooperative effort among Project MAC of MIT, the Bell Telephone
Laboratories, and the Computer Department of General Electric, later acquired
by Honeywell. In our expansiveness of purpose we took on a long list of
innovations.


Some of the most important ones were: First, we introduced into the processor
hardware the mechanisms for paging and segmentation along with a careful
scheme for access control. Second, we introduced an idea for rings of protection
around the supervisor software. Third, we planned from the start that the
system would be composed of interchangeable multiple processors, memory
modules, and so forth. And fourth, we made the decision to implement nearly
all of the system in the newly defined compiler language, PL/I.

Let me make a few observations about the Multics experience. The novel hardware
we had commissioned meant that the system had to be built from the ground
up, so we had an immense task on our hands.

The decision to use a compiler to implement the system software was a good
one, but what we did not appreciate was that the new language, PL/I, presented
us with two big difficulties. First, the language had constructs in it which
were intrinsically complicated, and it took the system programmers a learning
period to discover how to avoid them; second, no one knew how
to do a good job of implementing the compiler. Eventually we overcame these
difficulties, but it took precious time.

That Multics succeeded is remarkable, for it was the result of a cooperative
effort of three highly independent organizations and had no administrative
head. This meant decisions were made by persuasion and consensus. As a consequence,
it was difficult to reject weak ideas until considerable time and effort
had been spent on them.

The Multics system did turn into a commercial product. Some of its major
strengths were: the virtual memory system, the file system, the attention
to security, the ability to do online reconfiguration, and the information
backup system for the file system.

And, as was also true with CTSS, many of the alumni of the Multics development
have gone on to play important roles in the computing field [16].


A few more observations can be made about the ambitious Multics experience.
In particular, we were misled by our earlier successes with previous systems
such as CTSS, where we were able to build them “brick-by-brick,”
incrementally adding ideas to a large base of already working software.

We also were embarrassed by our inability to set and meet accurate schedules
for completion of the different phases of the project. In hindsight we should
not have been, for we had never done anything like it before. However, in
many cases our estimates should have been called guesses.

The UNIX system [12] was a reaction to Multics. Even the name was a joke.
Ken Thompson was part of the Bell Laboratories’ Multics effort, and, frustrated
with the attempts to bring a large system development under control, decided
to start over. His strategy was clear. Start small and build up the ideas
one by one as he saw how to implement them well. As we all know, UNIX has
evolved and become immensely successful as the system of choice for workstations.
Still there are aspects of Multics that have never been replicated in UNIX.

As a commercial product of Honeywell and Bull, Multics developed a loyal
following. At the peak there were about 77 sites worldwide and even today
many of the sites tenaciously continue for want of an alternative.

Sources of Complexity


The general problem with ambitious systems is complexity. Let me next try
to abstract some of the major causes. The most obvious complexity problems
arise from scale. In particular, the larger the personnel required, the
more levels of management there will be. We can see the problem even if
we use simplistic calculations. Thus, if we assume a fixed supervision ratio,
for example six, the levels of management will grow as the logarithm of
the number of personnel. The difficulty is that with more layers of management,
the topmost layers become out of touch with the relevant bottom issues and
the likelihood of random serendipitous communication decreases.
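
The simplistic calculation is easy to spell out. This sketch assumes only
the fixed supervision ratio of six mentioned above:

```python
import math

def management_levels(people, span=6):
    # levels of hierarchy needed if every manager supervises `span` people
    return max(1, math.ceil(math.log(people, span)))

for p in (10, 100, 1000, 10000):
    print(f"{p:6d} people -> {management_levels(p)} levels of management")
# 10 -> 2, 100 -> 3, 1000 -> 4, 10000 -> 6: headcount must grow roughly
# six-fold to add a layer, yet each new layer further separates the top
# of the organization from the bottom.
```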

Another problem of organizations is that subordinates hate to report bad
news, sometimes for fear of “being shot as the messenger” and
at other times because they may have a different set of goals than the upper
management.

And lastly, large projects encourage specialization so that few team members
understand all of the project. Misunderstandings and miscommunication begin,
and soon a significant part of the project resources are spent fighting
internal confusion. And, of course, mistakes happen.


My next category of complexity arises because of new design domains. The
most vivid examples come from the world of physical systems, but software
too is subject to the same problems albeit often in more subtle ways.

Consider the destruction of the Tacoma Narrows Bridge, in Washington State,
on November 7, 1940. The bridge had been proudly opened about four months
earlier. Many of you have probably seen the amateur movie that was fortuitously
made of the collapse. What happened is that a strong but not unusual cross-wind
blew that day. Soon the roadbed, suspended by cables from the main span,
began to vibrate like a reed, and the more it flexed, the better cross-section
it presented to the wind. The result was that the bridge tore itself apart
as the oscillations became large and violent. What we had was a case of
a new design domain where the classic bridge builder, concerned with gravity-loaded
structures, had entered into the realm of aeronautics. The result was a
major mistake.

Next let us look at the complexities that arise from human usage of computer
systems. In using online systems which allow the sharing or exchanging of
information (and here networked workstations clearly fall in this class), one
is faced with a dilemma: if one places total trust in all other users, one
is vulnerable to the anti-social behavior of any malicious user (consider
the case of viruses); but if one tries to be totally reclusive and isolated,
one is not only bored, but one’s information universe will cease to grow
and be enhanced by interactions with others. The result is that most of
us operate in a complicated tradeoff zone with various arrangements of trust
and security mechanisms. Even such simple ideas as passwords are often a
problem: they are a nuisance to remember, they can easily be compromised
inadvertently, and they cannot be selectively revoked if shared. Privacy
and security issues are particularly difficult to deal with since responsibilities
are often split among users, managers, and vendors. Worse yet, there is
no way to simply “look” at a system and determine what the privacy
and security implications are. It is no wonder that mistakes happen all
the time in this area.

One of the consequences of using computer systems is that increasingly information
is being kept online in central storage devices. Computer storage devices
have become remarkably reliable, except when they break, and that is the rub.
Even the most experienced computer user can find himself lulled into a false
sense of security by the almost perfect operation of today’s devices. The
problem is compounded by the attitude of vendors, not unlike the initial
attitude of the automobile industry toward safety, where inevitable disk
failure is treated as a negative issue that dampens sales.

What is needed is constant vigilance against a long list of “what ifs”:
hardware failure, human slips, vandalism, theft, fire, earthquakes, long-term
media failure, and even the loss of institutional memories concerning recovery
procedures. And as long as some individuals have to “learn the hard
way,” mistakes will continue to be made.

A further complication in discussing risk or reliability is that there is
not a good language with which to carry on a dialog. Statistics are as often
misapplied as they are misunderstood. We also get absurd absolutes such
as “the Strategic Defense Initiative will produce a perfect unsaturatable
shield against nuclear attack” [14] or “it is impossible for the
reactor to overheat.” The problem is that we always have had risks
in our lives, we never have been very good at discussing them, and with
computers we now have a lot of new sources.


Another source of complexity arises with rapid change, change which is often
driven by technology improvements. A result is that changes in procedures
or usage occur and new vulnerabilities can arise. For example, in the area
of telephone networks, the economies and efficiencies of fiber optic cables
compared to copper wire are rapidly causing major upgrades and replacements
in the national telephone plant. Because one fiber cable can carry at a
reasonable cost the equivalent traffic of thousands of copper wires, fiber
is quickly replacing copper. As a result a transformation is likely to occur
where network links become sparser over a given area and multiply interconnected
nodes become less connected.

The difficulty is that there is reduced redundancy and a much higher vulnerability
to isolated accidents. In the Chicago area not long ago there was a fire
at a fiber optics switching center that caused a loss of service to a huge
number of customers for several weeks. More recently in New York City there
was a shutdown of the financial exchanges for several hours because of a
single mishap with a backhoe in New Jersey. Obviously in both instances,
efficiency had gotten ahead of robustness.


The last source of complexity that I will single out arises from the frailty
of human users when forced to deal with the multiplicity of technologies
in modern life. In a little more than a century, there has been an awesome
progression of technological changes, from telephones and electricity through
automobiles, movies, and radio; I won’t even try to complete the list since
we all know it well. The overall consequence has been to produce vast changes
in our lifestyles, and we see such changes still happening today. Consider
the changes in the television editing styles that have occurred over a few
decades, the impact of viewgraph overhead projectors on college classrooms,
and the way we now do our banking with automatic teller machines. And the
progression of life style changes continues at a seemingly more rapid pace
with word processing, answering machines, facsimile machines, and electronic
mail.

One consequence of the many lifestyle changes is that some individuals feel
stressed and overstimulated by the plethora of inputs. The natural defense
is to increasingly depend on others to act as information filters. But the
combination of stressful lifestyles and insulation from original data will
inevitably lead to more confusion and mistakes.

Conclusions

I have spent most of this talk trying to persuade you that failures in complex,
ambitious systems are inevitable. However, I would be remiss if I did not
address what can be done about it. Unfortunately, the list I can offer is
rather short, but it is worthy of brief review.


First, it is important to emphasize the value of simplicity and elegance,
for complexity has a way of compounding difficulties and, as we have seen,
creating mistakes. My definition of elegance is the achievement of a given
functionality with a minimum of mechanism and a maximum of clarity.

Second, the value of metaphors should not be underestimated. Metaphors have
the virtue that they have an expected behavior that is understood by all.
Unnecessary communication and misunderstandings are reduced. Learning and
education are quicker. In effect metaphors are a way of internalizing and
abstracting concepts such that one’s thinking can be on a higher plane and
low-level mistakes are avoided.

Third, use of constrained languages for design or synthesis is a powerful
methodology. By not allowing a programmer or designer to express irrelevant
ideas, the domain of possible errors becomes far more limited.

Fourth, one must try to anticipate both errors of human usage and of hardware
failure and properly develop the necessary contingency paths. This process
of playing “what if” is not as easy as it may sound since implicit
is the need to attach likelihoods of occurrence to events and to address
issues of the independence of failures.

Fifth, it should be assumed in the design of a system, that it will have
to be repaired or modified. The overall effect will be a much more robust
system, where there is a high degree of functional modularity and structure,
and repairs can be made easily.

Sixth, and lastly, on a large project, one of the best investments that
can be made is the cross-education of the team so that nearly everyone knows
more than he or she needs to know. Clearly with educational redundancy,
the team is more resilient to unexpected tragedies or departures. But in
addition, the increased awareness of team members can help catch global
or systemic mistakes early. It really is a case of “more heads are
better than one.”


Finally, I have touched on many different themes in this talk but I will
single out three:

First, the evolution of technology supports a rich future for ambitious
visions and dreams that will inevitably involve complex systems.

Second, one must always try to learn from past mistakes, but at the same
time be alert to the possibility that new circumstances require new solutions.

And third, one must remember that ambitious systems demand a defensive philosophy
of design and implementation. Or in other words, “Don’t wonder if
some mishap may happen, but rather ask what one will do about it
when it does occur.”

References

1. Corbató, F. J., Daggett, M. M., and Daley, R. C., “An Experimental
Time-Sharing System,” Proceedings of the Spring Joint Computer Conference,
May 1962.

2. Corbató, F. J., Daggett, M. M., Daley, R. C., Creasy, R. J., Hellwig,
J. D., Orenstein, R. H., and Horn, L. K., The Compatible Time-Sharing
System: A Programmer’s Guide,
M.I.T. Press, June 1963.

3. Corbató, F. J., and Vyssotsky, V. A., “Introduction and Overview
of the Multics System,” Proceedings FJCC, 1965.

4. Glaser, E. L., Couleur, J. F., and Oliver, G. A., “System Design of
a Computer for Time-Sharing Applications,” Proceedings FJCC, 1965.

5. Vyssotsky, V. A., and Corbató, F. J., “Structure of the Multics
Supervisor,” Proceedings FJCC, 1965.

6. Daley, R. C., and Neumann, P. G., “A General-Purpose File System for
Secondary Storage,” Proceedings FJCC, 1965.

7. Ossanna, J. F., Mikus, L., and Dunten, S. D., “Communications and
Input-Output Switching in a Multiplex Computing System,” Proceedings
FJCC, 1965.

8. David, E. E., Jr. and Fano, R. M. “Some Thoughts About the Social
Implications of Accessible Computing,” Proceedings FJCC, 1965.

9. Organick, E. I. The Multics System: An Examination of its Structure,
MIT Press, 1972.

10. Corbató, F. J. “Sensitive Issues in the Design of Multi-Use
Systems,” (Unpublished), Lecture transcription of Feb. 1968, Project
MAC Memo M-383.

11. Corbató, F. J., Clingen, C. T., and Saltzer, J. H., “Multics:
The First Seven Years,” Proceedings of the SJCC, May 1972, pp.
571-583.

12. Ritchie, D. M. and Thompson, K., “The UNIX Time-Sharing System,”
CACM 17, 7 (July 1974), 365-375.

13. Corbató, F. J., and Clingen, C. T., “A Managerial View of
the Multics System Development,” in Research Directions in Software
Technology, P. Wegner (ed.), M.I.T. Press, 1979. (Also published in
Tutorial: Software Management, Reifer, Donald J. (ed.), IEEE Computer
Society Press, 1979; Second Edition 1981; Third Edition, 1986.)

14. Parnas, D. L. “Software Aspects of Strategic Defense Systems,”
American Scientist, Nov. 1985. An excellent critique on the difficulties
of producing software for large-scale systems.

15. Brooks, F. P., Jr. “No Silver Bullet,” IEEE Computer,
April 1987, 10-19.

16. Apropos the theme of this lecture, P. G. Neumann, a Multics veteran,
has become a major contributor to the literature of computer related risks.
He is the editor of the widely-read network magazine “Risks-Forum”,
writes the “Inside Risks” column for the CACM, and periodically
creates digests in the ACM Software Engineering Notes.

