The rapid advancement of generative AI models heightens the need for effective data valuation techniques. Among the many available methods, Shapley value estimation techniques such as Data Shapley have attracted attention for their data valuation capabilities, despite their computational cost on large datasets. This paper introduces an empirically driven batch method that aims to speed up data valuation while preserving precision. The method optimizes training batch sizes and testing subsets to balance computational efficiency against valuation accuracy, an important step given the volume of data processed in contemporary machine learning tasks. Different Shapley value estimation techniques are evaluated, and TMC-Shapley stands out for its efficacy. The paper also explores the model-agnostic nature of Shapley value estimation by using diverse machine learning models across distinct training phases. This demonstrates the versatility of Shapley value methods and their adaptability and generalizability across varied model architectures. The approach and findings presented here provide a foundation for future work on optimizing data valuation and for more nuanced and efficient methodologies.
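To make the TMC-Shapley (Truncated Monte Carlo Shapley) idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy one-dimensional dataset and a nearest-centroid classifier as the utility, and it treats the utility of the empty set as 0.0. The names `utility`, `tmc_shapley`, `n_permutations`, and `tolerance` are hypothetical and chosen only for illustration. The paper's batch method, which tunes training batch sizes and testing subsets, is not reproduced here.

```python
import random


def utility(train_idx, X, y, X_test, y_test):
    """Test accuracy of a nearest-centroid classifier fit on the selected points.

    Assumption: an empty training set scores 0.0.
    """
    if not train_idx:
        return 0.0
    sums, counts = {}, {}
    for i in train_idx:
        sums[y[i]] = sums.get(y[i], 0.0) + X[i]
        counts[y[i]] = counts.get(y[i], 0) + 1
    centroids = {c: sums[c] / counts[c] for c in sums}
    correct = 0
    for x, label in zip(X_test, y_test):
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += int(pred == label)
    return correct / len(X_test)


def tmc_shapley(X, y, X_test, y_test, n_permutations=200, tolerance=0.01, seed=0):
    """Estimate per-point Shapley values by sampling random permutations.

    Within a permutation, once the running score is within `tolerance` of the
    full-data score, the remaining marginal contributions are taken as zero
    (the truncation step), which skips further utility evaluations.
    """
    rng = random.Random(seed)
    n = len(X)
    values = [0.0] * n
    full_score = utility(list(range(n)), X, y, X_test, y_test)
    for t in range(1, n_permutations + 1):
        perm = list(range(n))
        rng.shuffle(perm)
        subset = []
        prev_score = utility(subset, X, y, X_test, y_test)
        for i in perm:
            subset.append(i)
            if abs(full_score - prev_score) < tolerance:
                new_score = prev_score  # truncated: no retraining
            else:
                new_score = utility(subset, X, y, X_test, y_test)
            marginal = new_score - prev_score
            # Running mean of the marginal contribution of point i.
            values[i] += (marginal - values[i]) / t
            prev_score = new_score
    return values


# Toy data (illustrative only): the last training point is deliberately mislabeled.
X_train = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 0.05]
y_train = [0, 0, 0, 1, 1, 1, 1]
X_test = [0.0, 0.15, 0.1, 1.0, 1.15, 1.05]
y_test = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    estimates = tmc_shapley(X_train, y_train, X_test, y_test)
    for idx, value in enumerate(estimates):
        print(f"point {idx}: estimated value {value:+.3f}")
```

With `tolerance=0.0` no truncation occurs, so the estimates sum to the full-data utility minus the empty-set utility; a larger tolerance trades some of that exactness for fewer model evaluations, which is the cost that dominates on large datasets.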
Yilu Yang (Mon,) studied this question.