What's new

Big Data & Database Research

Covers query-by-example and big data.

5 Techniques to Identify Clusters In Your Data

Entity Resolution Tools


Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for entity resolution, but it is the only one (as far as we know) that treats entity resolution as its primary task. In addition to removing duplicate entries within a single dataset, Dedupe can also perform record linkage across disparate datasets. Dedupe also scales fairly well: in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.
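To make the idea of deduplication concrete, here is a minimal sketch using only the Python standard library. It is not Dedupe's API and does not use a learned model: a fixed string-similarity threshold stands in for the weights Dedupe would learn from labeled examples, and the function names (`similarity`, `cluster_duplicates`), the sample records, and the `0.85` threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b, fields):
    """Average string similarity (0.0 to 1.0) over the given fields."""
    scores = [
        SequenceMatcher(None, a[f].strip().lower(), b[f].strip().lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def cluster_duplicates(records, fields, threshold=0.85):
    """Group record ids whose pairwise similarity meets the threshold.

    Hypothetical helper: compares every pair, then merges matches
    with a small union-find so that chains of matches share a cluster.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b], fields) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(c) for c in clusters.values())


# Illustrative data only: records 1 and 2 differ in case and whitespace.
records = {
    1: {"name": "Acme Corporation", "city": "Chicago"},
    2: {"name": "ACME Corporation ", "city": "chicago"},
    3: {"name": "Bolt Hardware", "city": "Denver"},
}
clusters = cluster_duplicates(records, ["name", "city"])
print(clusters)
```

Comparing every pair of records is quadratic in the number of records, so this sketch only suits small inputs. Record linkage across two datasets follows the same pattern, except that pairs are drawn one record from each dataset rather than from within a single one.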
 