Key points are not available for this paper at this time.
The findings in this experiment support the hypothesis that GPTs based on LLMs perform well on prompts that are more popular and have reached a general consensus yet struggle on controversial topics or topics with limited data. The variability in the applications's responses underscores that the models depend on the quantity and quality of their training data, paralleling the system of crowdsourcing that relies on diverse and credible contributions. Thus, while GPTs can serve as useful tools for many mundane tasks, their engagement with obscure and polarized topics should be interpreted with caution. LLMs' reliance on probabilistic models to produce statements about the world ties their accuracy closely to the breadth and quality of the data they're given.
Waldo et al. (Fri,) studied this question.