1 code implementation • 9 Oct 2022 • Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora.