This work investigates automatic segmentation of Brazilian Sign Language (Libras) videos for translation systems, addressing challenges posed by the visual-spatial modality of signed languages. We introduce the JW-Bible-Libras dataset, the largest resource for this task, and evaluate two segmentation approaches: Optical Flow-based models and Spatio-Temporal Graph Convolutional Networks (ST-GCN). Segmentation performance is analyzed both intrinsically and in relation to downstream translation using the gloss-free Sign2GPT architecture. Results show that the nine-layer ST-GCN with a bidirectional LSTM achieves the best segmentation results (F1: 0.7358, IoU: 0.5820), while the unidirectional variant yields the strongest translation scores (BLEU-1: 9.31, ROUGE: 9.49). Notably, a simple heuristic based on average sentence duration performs competitively, highlighting the gap between segmentation accuracy and translation quality. Our findings demonstrate the importance of segmentation strategies while revealing opportunities for integrating linguistic cues and boundary-aware learning to advance sign language translation.
Ramos et al. studied this question.
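To make the average-duration baseline and the IoU metric mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the heuristic cuts a video into consecutive windows whose length equals an average sentence duration, and that IoU is the temporal intersection-over-union between a predicted and a reference interval. The names `fixed_duration_segments` and `segment_iou`, and the representation of segments as `(start, end)` pairs in seconds, are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is (start, end) in seconds.
Segment = Tuple[float, float]


def fixed_duration_segments(video_length: float, avg_duration: float) -> List[Segment]:
    """Cut [0, video_length] into consecutive windows of avg_duration seconds.

    Assumption: this mirrors the spirit of an average-sentence-duration
    heuristic; the exact rule used in the paper is not stated in the abstract.
    """
    if video_length <= 0 or avg_duration <= 0:
        raise ValueError("video_length and avg_duration must be positive")
    segments: List[Segment] = []
    start = 0.0
    while start < video_length:
        # The last window is clipped to the end of the video.
        end = min(start + avg_duration, video_length)
        segments.append((start, end))
        start = end
    return segments


def segment_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two intervals (assumed definition)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, `fixed_duration_segments(10.0, 4.0)` yields three windows, the last one shortened to fit the video, and each window can then be compared against a reference sentence span with `segment_iou`. A baseline of this kind uses no visual information at all, which is why its competitive translation scores point to a gap between segmentation accuracy and translation quality.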