March 3, 2026
Open Access

VaultGemma be like "TOXIC" --- Evaluating VaultGemma’s ability to classify toxic text

Key Points

VaultGemma detects toxic language more accurately with few-shot prompting than with zero-shot prompting.
Accuracy, precision, recall, and F1 score were used to evaluate the models' performance on two datasets, SU and HF.
Comparing the two verbalizer sets, Toxic/Non-toxic and Toxic/Neutral, revealed large variations in the models' outputs.
The findings suggest that further research is needed to validate VaultGemma's effectiveness in classifying toxic language.
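
The four metrics named in the key points can be computed from normalized labels roughly as follows. This is a minimal sketch, not the authors' code. It assumes binary labels with "toxic" as the positive class, because the page does not say which class the thesis treated as positive. `BinaryMetrics` and `binary_metrics` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive="toxic"):
    """Score normalized predictions against gold labels (hypothetical helper).

    `positive` is the class treated as positive (assumed to be "toxic").
    Predictions that could not be normalized (e.g. None) count as wrong
    and as negative predictions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)
```

Undefined ratios, such as precision when the model never predicts "toxic", are reported as 0.0 here. That is a common convention, but the page does not say how the thesis handled this case.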

Abstract

Introduction: This thesis evaluates VaultGemma's ability to detect toxic language, how that ability compares with Gemma 3 1B and Perspective API, and how the prompting method affects performance.

Research Question: The two research questions are "How does the classification differ between zero-shot prompting and few-shot prompting?" and "How does VaultGemma's classification of text compare to Gemma 3 1B and Perspective API?"

Method: The two verbalizer sets, Toxic/Non-toxic and Toxic/Neutral, were run on VaultGemma and Gemma 3 1B with both zero-shot and few-shot prompting on two datasets, SU and HF. The models' outputs were normalized and analyzed to compute accuracy, precision, recall, and F1 score.

Results: The results indicate both similarities and differences among the three models in their ability to detect toxic language. Primarily, VaultGemma's ability improves with few-shot prompting. The results also show that the choice of dataset and verbalizer has a large effect on the models and changes how they perform relative to each other.

Discussion: The findings show that few-shot prompting performs better than zero-shot prompting for VaultGemma. The comparison of the three models does not support a strong conclusion. The findings also point to the possibility of using VaultGemma to detect toxic language in text. Further research is needed to draw stronger conclusions about VaultGemma's and the other models' ability to classify toxic language.
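
The Method paragraph describes two steps: running a verbalizer set under zero-shot or few-shot prompting, and then normalizing the models' free-text outputs. The page gives neither the prompt template nor the normalization rule. The sketch below is therefore only an illustration under stated assumptions. The instruction wording, the `VERBALIZERS` table, and the helpers `build_prompt` and `normalize_output` are hypothetical. In this sketch, normalization takes the first verbalizer word that appears in the output and returns `None` when no verbalizer word appears.

```python
import re

# Hypothetical verbalizer sets mirroring the two pairs named in the abstract.
# Each maps a canonical class label to the word the model is asked to output.
VERBALIZERS = {
    "toxic_nontoxic": {"toxic": "Toxic", "non-toxic": "Non-toxic"},
    "toxic_neutral": {"toxic": "Toxic", "non-toxic": "Neutral"},
}


def build_prompt(text, verbalizer, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    The instruction wording is an illustrative assumption, not the thesis prompt.
    `examples` is a sequence of (text, canonical_label) pairs.
    """
    options = " or ".join(verbalizer.values())
    lines = [f"Classify the text as {options}. Answer with one word.", ""]
    for example_text, label in examples:
        # Each demonstration shows the verbalizer word, not the canonical label.
        lines += [f"Text: {example_text}", f"Label: {verbalizer[label]}", ""]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)


def normalize_output(raw, verbalizer):
    """Map free-text model output to a canonical label, or None if no verbalizer word appears."""
    # re.search returns the leftmost match, so "non-toxic" is found at its "n"
    # before the "toxic" inside it; longest-first order breaks ties at one position.
    words = sorted((w.lower() for w in verbalizer.values()), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    match = re.search(pattern, raw.lower())
    if match is None:
        return None
    word_to_label = {w.lower(): label for label, w in verbalizer.items()}
    return word_to_label[match.group(0)]
```

A `None` result marks an output that could not be mapped to either class. The page does not say how such outputs were counted. This simple rule also reads an off-verbalizer answer such as "non-toxic" under the Toxic/Neutral set as Toxic, so a real pipeline would need an explicit policy for such cases.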

Cite This Study

Wiman et al. studied this question.

synapsesocial.com/papers/69a76869badf0bb9e87e49a9
