Tradeoffs Between Risk and Learning
Engineering is a process of designing and then trying out those designs. We sometimes forget how much trial and error the engineering process involves. Agilists like to refer to “learning from failure”. That sounds scary, but it refers to this trial-and-error process. Failure is never good in itself, but to learn, you often have to fail. Of course, it is better if failure occurs in a test setting rather than with real users, especially if failure could mean loss of life.
But what if failure merely results in a few users not having access to their data until the problem is fixed?
The cost of failure is a major factor in the balance between trying something now and doing more analysis and simulation to perfect the design before trying it. The balance shifts when the cost of failure is large: the possible loss of the system under test and of other resources involved, including real people such as test pilots, as well as the time and energy needed to set up and conduct the test.
For software, the cost of running a test is usually much smaller than for hardware, but the cost can still be large in terms of compute resources and time. That shifts the tradeoffs.
At SpaceX, they have a “51% rule”: if they are at least 51% confident that an approach is right, they consider it worth testing. They call that a “hardware-rich” approach: they are very willing to lose test instances in order to try out design improvements and iterate rapidly.
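One way to read that threshold, purely as an illustration and not anything SpaceX has published, is as a break-even calculation: if what a successful test confirms is worth roughly as much as what a failed test costs, then testing has positive expected value as soon as confidence passes 50%. The sketch below, with made-up numbers, shows that arithmetic.

```python
# Illustrative sketch only: a rough expected-value reading of a "51% rule".
# The model and the numbers are assumptions, not SpaceX's actual decision process.

def expected_value_of_testing(confidence, gain_if_right, loss_if_wrong):
    """Expected net value of running the test at a given level of confidence."""
    return confidence * gain_if_right - (1 - confidence) * loss_if_wrong

# When the gain from a successful test roughly equals the loss from a failed one,
# the break-even point sits at 50% confidence:
print(expected_value_of_testing(0.51, 1.0, 1.0))  # ~ +0.02: barely worth testing
print(expected_value_of_testing(0.49, 1.0, 1.0))  # ~ -0.02: keep analyzing instead
```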
The hardware-rich approach is not a new idea, though. Thomas Edison was famous for trying one design after another until he found one that worked. And in SpaceX’s own business, rocketry, the Soviets were practicing “experimental design” back in the 1960s and 70s. Their approach was essentially SpaceX’s hardware-rich approach: keep trying design improvements and measure how well they work.
The question of whether you are ready to test is therefore a judgment call, in which you balance:
- The resources it takes to set up and run the test, including the calendar time spent and, possibly, the loss of the system under test.
- The potential benefit of what you might learn from the test.
As we said, for software the risk is low: the resources at stake are usually server time and calendar time. But the tradeoff is still there. For hardware, the risk is higher: building a prototype to put on a test stand is expensive in materials and engineering time. The tradeoff is therefore different, shifted more in favor of being thoughtful up front, that is, putting more effort into verifying the design with computer models before setting up a test.
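To make that shift concrete, here is a rough illustration; the model and the numbers are invented for the sake of the example, not taken from any real program. It frames the decision as an expected-value comparison and shows how a higher test cost raises the confidence level at which testing starts to pay.

```python
# Rough illustration only: invented numbers and a deliberately simple model.
# "Effort units" are arbitrary; the point is how test cost moves the break-even confidence.

def net_value_of_testing_now(confidence, learn_value, failure_loss, test_cost):
    """Expected value of testing now: pay test_cost either way, gain learn_value
    if the design works, and absorb failure_loss if it does not."""
    return confidence * learn_value - (1 - confidence) * failure_loss - test_cost

# Hypothetical scenarios, in arbitrary effort units.
scenarios = {
    "software (cheap to rerun)":      dict(learn_value=10, failure_loss=1, test_cost=1),
    "hardware (prototype destroyed)": dict(learn_value=10, failure_loss=8, test_cost=3),
}

for name, params in scenarios.items():
    # Lowest confidence at which testing now has positive expected value.
    breakeven = next(p / 100 for p in range(1, 101)
                     if net_value_of_testing_now(p / 100, **params) > 0)
    print(f"{name}: worth testing at roughly {breakeven:.0%} confidence")
```

With cheap, repeatable tests the break-even confidence is low, which is why software teams can afford to just try things; with expensive, destructive tests it climbs, which is why more up-front modeling pays off.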
But techniques that reduce the cost and time of creating a hardware prototype, such as metal laser sintering, also known as 3D metal printing, shift the tradeoffs closer to what they are for software, though not entirely: it is still expensive to work with alloys like Inconel, the printing process is slow, some parts cannot be made that way, and the parts must still be assembled, hooked up to instrumentation, and tested, often in a specialized environment.
The tradeoff is therefore a matter of judgment: are we confident enough in our design to risk the time and resources needed to test it? Will testing it now resolve unknowns and save us time, by keeping us from continuing down an approach that will not work?