ABSTRACT Assessing the robustness of safety‐critical deep learning (DL) systems is of utmost importance, as these systems can cause harm when deployed in the real world. Metamorphic testing (MT) is a commonly used method for evaluating the robustness of DL systems because it does not require expensive labelled ground‐truth data. This paper tackles two challenges. (1) In regulated domains such as the automotive industry, testers must provide a traceable argument for why a certain metamorphic relation (MR) was chosen. We adapt the idea of defect‐based testing to MT and argue that an MR is traceable if it can be linked to a defect hypothesis. We demonstrate how to assess the robustness of safety‐critical DL systems using the example of LiDAR object detectors. To this end, we create three new MRs for the LiDAR domain and identify five MRs that can be reused by adapting them from related domains. Our experiments on the nuScenes dataset with three different object detectors produce 3.9 million test verdicts, of which 0.7 million are test failures. This shows that our defect‐based MRs effectively uncover failures. (2) Executing numerous metamorphic test cases can generate an impractically high number of failures. We show how to prioritize the most critical failures, such as failures that occur close to the ego vehicle. By prioritizing, we reduce the observed 685,000 failures to 5397 safety‐critical failures, a 127‐fold reduction.
Speth et al. studied this question.
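The two ideas summarized in the abstract, a metamorphic verdict that needs no ground-truth labels and a prioritization of failures by proximity to the ego vehicle, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Detection` type, the matching tolerance `tol`, the 10 m `max_range`, and the assumed MR (a transformation under which detections should stay unchanged).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection in the ego-vehicle frame (ego at the origin)."""
    x: float  # metres
    y: float  # metres
    label: str


def missed_detections(source, follow_up, tol=0.5):
    """Source detections with no same-label follow-up detection within tol metres.

    Assumed MR: the follow-up input is a transformed copy of the source input
    for which the detector output should not change. No labels are needed.
    """
    return [
        s for s in source
        if not any(
            f.label == s.label and math.hypot(f.x - s.x, f.y - s.y) <= tol
            for f in follow_up
        )
    ]


def prioritize(failures, max_range=10.0):
    """Keep failures within max_range metres of the ego vehicle, nearest first."""
    near = [d for d in failures if math.hypot(d.x, d.y) <= max_range]
    return sorted(near, key=lambda d: math.hypot(d.x, d.y))


# Toy data: detections on the source input and on the follow-up input.
source = [
    Detection(3.0, 1.0, "car"),
    Detection(40.0, -5.0, "car"),
    Detection(6.0, 0.0, "pedestrian"),
]
follow_up = [Detection(3.1, 1.0, "car")]

failures = missed_detections(source, follow_up)
critical = prioritize(failures)
```

In this toy run, two source detections vanish in the follow-up output, but only the pedestrian 6 m ahead survives the range filter. The distant car is still a failure, only a lower-priority one.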