Building reliable vehicle detection models for intelligent transportation systems calls for large, well-annotated datasets—yet gathering and labelling real traffic data remains both costly and labour-intensive. This paper introduces Traffic Synth, an automated pipeline that generates synthetic training datasets by altering real traffic camera images rather than constructing entirely artificial scenes. The system begins by detecting vehicles through instance segmentation and removing them from the frame. It then places new vehicles directly into the cleared regions using diffusion-based inpainting, all while retaining the original road layout, lighting, and camera perspective. Doing so preserves the realistic scene context while broadening the visual variety of vehicles in the dataset. To ensure that the resulting traffic looks physically plausible, we incorporate a lane-aware prompting mechanism that matches each vehicle’s orientation to the direction of travel as seen from the camera. The system further draws on a weighted vehicle brand database that mirrors the makes and colours commonly found on European roads to better match actual deployment conditions. Class-specific mask processing—involving anisotropic scaling and relative dilation—rounds out the pipeline by improving generation quality across different vehicle size categories. The final output is a set of images with automatically generated annotations in a standard object detection format.
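As a rough illustration of the lane-aware prompting, weighted brand sampling, and class-specific mask processing described above, the following minimal sketch shows one way these steps might fit together. It is not the authors' implementation: the names `Lane`, `build_prompt`, and `expand_box`, the prompt wording, and all weights and mask parameters are hypothetical. The mask is also simplified to an axis-aligned box rather than a segmentation mask.

```python
import random
from dataclasses import dataclass

# Hypothetical weights, for illustration only (not taken from the paper).
BRAND_WEIGHTS = {"Volkswagen": 0.30, "Skoda": 0.25, "Renault": 0.20,
                 "Peugeot": 0.15, "Fiat": 0.10}
COLOUR_WEIGHTS = {"white": 0.30, "black": 0.25, "grey": 0.25,
                  "blue": 0.10, "red": 0.10}

# Hypothetical per-class mask parameters: (x scale, y scale, relative dilation).
MASK_PARAMS = {"car": (1.1, 1.0, 0.05), "truck": (1.0, 1.2, 0.10)}


@dataclass(frozen=True)
class Lane:
    name: str
    towards_camera: bool  # True if traffic in this lane approaches the camera


def weighted_choice(weights, rng):
    """Draw one key with probability proportional to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def build_prompt(vehicle_class, lane, rng):
    """Compose an inpainting prompt whose viewpoint matches the lane direction."""
    brand = weighted_choice(BRAND_WEIGHTS, rng)
    colour = weighted_choice(COLOUR_WEIGHTS, rng)
    if lane.towards_camera:
        view = "front view, driving towards the camera"
    else:
        view = "rear view, driving away from the camera"
    return f"a {colour} {brand} {vehicle_class}, {view}"


def expand_box(box, sx, sy, rel, img_w, img_h):
    """Scale a box anisotropically about its centre, then dilate it.

    The dilation margin is relative: a fraction `rel` of the shorter
    side of the scaled box. The result is clipped to the image bounds.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * sx / 2
    half_h = (y1 - y0) * sy / 2
    pad = rel * 2 * min(half_w, half_h)
    return (max(0.0, cx - half_w - pad),
            max(0.0, cy - half_h - pad),
            min(float(img_w), cx + half_w + pad),
            min(float(img_h), cy + half_h + pad))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    lane = Lane(name="left", towards_camera=True)
    print(build_prompt("car", lane, rng))
    sx, sy, rel = MASK_PARAMS["car"]
    print(expand_box((100, 100, 200, 150), sx, sy, rel, 640, 480))
```

In a full pipeline, the prompt and the expanded region would presumably be passed to the diffusion inpainting model. Making the dilation margin relative to object size is one plausible reading of "relative dilation", since it gives small and large vehicles a proportionate border.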
Gachulinec et al. (Mon,) studied this question.