April 28, 2023 · Open Access

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Abstract

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
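To make the idea of a name cloze query concrete, here is a minimal sketch, assuming the query works by hiding a single proper name in a passage and asking a model to restore it. The `[MASK]` token, the function names, and the `predict` callable (a stand-in for a call to a language model) are illustrative assumptions, not details taken from the abstract.

```python
import re

MASK = "[MASK]"  # assumed placeholder token; the abstract does not specify one


def make_name_cloze(passage: str, name: str) -> str:
    """Return the passage with its single occurrence of `name` replaced by MASK."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, MASK, passage)


def cloze_accuracy(examples, predict) -> float:
    """Fraction of (passage, name) pairs for which `predict` recovers the name.

    `predict` is a hypothetical callable that takes a masked passage and
    returns the model's guess as a string.
    """
    if not examples:
        return 0.0
    correct = 0
    for passage, name in examples:
        guess = predict(make_name_cloze(passage, name))
        # Case-insensitive exact match on the held-out name.
        correct += guess.strip().lower() == name.lower()
    return correct / len(examples)
```

Under these assumptions, a high accuracy on passages from one book, where the name cannot be guessed from context alone, would suggest that the model has seen that book. Any real study would also need a sampling scheme for passages and a baseline for guessable names.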

Cite this study

Chang et al. (2023) studied this question.

www.synapsesocial.com/papers/6a0e9a73f59e0974004c461b — DOI: https://doi.org/10.48550/arxiv.2305.00118

Authors

Kent K. Chang

Mackenzie Hạnh Cramer

Sandeep Soni
