Unveiling the Inescapable Truth of Functional Safety
Kerry Johnson and Chris Hobbs from BlackBerry QNX share insights on error detection in safety-critical automotive systems
As more cars come with advanced driver assistance systems (ADAS) and automated driving features, the need for powerful CPUs in the automotive sector is rising. However, with this push toward cutting-edge technology, hardware reliability is declining for the first time. This decline stems from two key factors: physics and complexity.
On the physics side, CPUs now run at higher clock speeds, generate more heat, and use smaller transistors, some only a few atoms wide. Heat accelerates the wear-out of these components: the hotter they run, the sooner they fail. Smaller transistors are also more vulnerable to electromagnetic interference, cosmic particles, and crosstalk between neighbouring cells. This isn't just a CPU problem; for similar reasons, modern DRAM systems experience bit errors roughly once an hour.
When it comes to complexity, manufacturers are packing more interconnected functions into each CPU. But no CPU is perfect; many have bugs that only become apparent after the chip is in use. These bugs, documented in manufacturers' errata sheets, can corrupt computations and introduce safety risks. The likelihood of such errors directly affects the ISO 26262 ASIL (Automotive Safety Integrity Level) rating a system can achieve.
To detect and compensate for these errors, system designers use a range of techniques. One common method is to run each computation multiple times and compare the outcomes. Another is hardware lockstep, where two CPUs execute the same instructions simultaneously and dedicated hardware checks the results. If there's a mismatch, an independent diagnostic determines which CPU was faulty, and the system software then takes action. However, hardware lockstep only supports identical replicas and can't catch software bugs, since both CPUs will execute the flawed code identically. It also doesn't scale well to more replicas and isn't practical for complex, high-performance hardware.
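A minimal sketch of the first approach, run-and-compare, is shown below: the same (entirely hypothetical) torque check is executed twice on separate threads, and the result is only used if both executions agree. The function names, the check itself, and the fallback behaviour are illustrative assumptions, not a real automotive API.

```cpp
#include <future>
#include <iostream>
#include <stdexcept>

// Hypothetical safety check: is the requested torque within the permitted envelope?
bool torque_within_limits(double requested, double limit) {
    return requested >= 0.0 && requested <= limit;
}

// Run the same computation twice on separate threads and compare the results.
// A mismatch suggests a transient fault in one of the executions.
bool checked_torque_decision(double requested, double limit) {
    auto first  = std::async(std::launch::async, torque_within_limits, requested, limit);
    auto second = std::async(std::launch::async, torque_within_limits, requested, limit);

    bool r1 = first.get();
    bool r2 = second.get();
    if (r1 != r2) {
        // The executions disagree: do not trust either result; escalate instead.
        throw std::runtime_error("replica mismatch: entering safe state");
    }
    return r1;
}

int main() {
    try {
        std::cout << std::boolalpha << checked_torque_decision(120.0, 250.0) << '\n';
    } catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
    }
}
```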
In practice, software can also be used to verify hardware operation by running multiple software replicas. These replicas execute safety-critical tasks, such as deciding whether acceleration is permitted under the current conditions, with middleware managing the synchronisation points transparently.
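One way such a synchronisation point might be built is sketched below, assuming each replica can summarise its intermediate state as a checksum: the call blocks until every replica has checked in and returns true only if all the submitted values match. The SyncPoint class and its interface are invented for illustration; production middleware would also deal with timeouts, ordering, and recovery.

```cpp
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative synchronisation point: each replica submits a checksum of its
// intermediate state, and the call succeeds only if all replicas agree.
class SyncPoint {
public:
    explicit SyncPoint(std::size_t replicas) : expected_(replicas) {}

    bool check(std::uint64_t state_checksum) {
        std::unique_lock<std::mutex> lock(m_);
        values_.push_back(state_checksum);
        if (values_.size() == expected_) {
            // Last replica to arrive performs the comparison and wakes the others.
            agreed_ = all_equal();
            done_ = true;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [this] { return done_; });
        }
        return agreed_;
    }

private:
    bool all_equal() const {
        for (auto v : values_)
            if (v != values_.front()) return false;
        return true;
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::uint64_t> values_;
    std::size_t expected_;
    bool done_ = false;
    bool agreed_ = false;
};

int main() {
    SyncPoint sync(2);
    auto replica = [&sync](std::uint64_t checksum) {
        bool ok = sync.check(checksum);
        std::cout << (ok ? "replicas agree\n" : "replica mismatch\n");
    };
    std::thread r1(replica, 0xABCDu), r2(replica, 0xABCDu);
    r1.join();
    r2.join();
}
```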
Each replication approach has its pros and cons. With identical replicas, two computations running in separate threads with separate memory should produce the same, correct result unless a transient hardware or software error occurs. Typically such an error affects only one replica, allowing the middleware to identify the correct result.
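With three or more identical replicas, identifying the correct result can be as simple as a majority vote. The sketch below assumes three replicas producing integer results; the replica count, data type, and voting policy are illustrative choices, not a prescribed design.

```cpp
#include <array>
#include <iostream>
#include <optional>

// 2-out-of-3 voter: a transient fault in a single replica is outvoted by the
// other two. Returns the majority value, or no value if all three disagree.
std::optional<int> majority_vote(const std::array<int, 3>& results) {
    if (results[0] == results[1] || results[0] == results[2]) return results[0];
    if (results[1] == results[2]) return results[1];
    return std::nullopt;  // no majority: escalate to a safe state
}

int main() {
    // The middle replica has suffered a transient fault; the voter masks it.
    std::array<int, 3> results = {42, 43, 42};
    if (auto v = majority_vote(results))
        std::cout << "agreed result: " << *v << '\n';
    else
        std::cout << "no majority: entering safe state\n";
}
```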
To catch bugs in the software itself, however, a system might use fraternal replicas, which perform the same task using different algorithms. If both replicas reach the same outcome, confidence in the result's accuracy is higher.
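The following sketch, built on invented numbers and simplified physics, shows the idea: both fraternal replicas answer the question "is it safe to keep accelerating?", one by comparing the stopping distance at the current speed against the measured gap, the other by comparing the current speed against the maximum speed from which the vehicle could still stop within that gap. The decision is trusted only if the two algorithms agree.

```cpp
#include <cmath>
#include <iostream>

// Replica A: compare the stopping distance at the current speed against
// the measured gap to the vehicle ahead.
bool safe_by_stopping_distance(double speed_mps, double gap_m, double max_decel) {
    double stopping_distance = (speed_mps * speed_mps) / (2.0 * max_decel);
    return stopping_distance < gap_m;
}

// Replica B: compare the current speed against the maximum speed from
// which the vehicle could still stop within the gap.
bool safe_by_speed_limit(double speed_mps, double gap_m, double max_decel) {
    double max_safe_speed = std::sqrt(2.0 * max_decel * gap_m);
    return speed_mps < max_safe_speed;
}

int main() {
    double speed = 25.0, gap = 60.0, decel = 6.0;  // illustrative values
    bool a = safe_by_stopping_distance(speed, gap, decel);
    bool b = safe_by_speed_limit(speed, gap, decel);

    if (a == b)
        std::cout << "replicas agree: acceleration "
                  << (a ? "permitted" : "not permitted") << '\n';
    else
        std::cout << "replica mismatch: defaulting to the safe decision\n";
}
```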
To implement these replication schemes, a system designer can use middleware that sits in the communication paths between components. In a microkernel operating system, where components interact through message passing, the middleware can interpose on those messages and manage replication without the components needing to be aware of it.
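The sketch below models that interposition in plain C++, with ordinary function calls standing in for kernel message passing: the middleware fans a client request out to every replica of a service, compares the replies, and forwards an answer only if they agree. The types, names, and fan-out policy are assumptions made for illustration and do not correspond to any particular microkernel's API.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

using Request = std::string;
using Reply   = std::string;
using Service = std::function<Reply(const Request&)>;

// The middleware sends each client request to every replica of the service,
// compares the replies, and only forwards an agreed answer to the client.
Reply replicated_call(const std::vector<Service>& replicas, const Request& req) {
    if (replicas.empty())
        throw std::runtime_error("no replicas configured");

    std::vector<Reply> replies;
    for (const auto& service : replicas)
        replies.push_back(service(req));

    for (const auto& r : replies)
        if (r != replies.front())
            throw std::runtime_error("replica disagreement: safe state required");

    return replies.front();
}

int main() {
    // Two hypothetical replicas of an "is acceleration allowed?" service.
    std::vector<Service> replicas = {
        [](const Request&) { return Reply("allow"); },
        [](const Request&) { return Reply("allow"); },
    };

    try {
        std::cout << "decision: " << replicated_call(replicas, "accelerate?") << '\n';
    } catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
    }
}
```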