What's new

Big Data & Database Research

Includes query-by-example and big data.

SQL to Text

SQL Notebooks

SEOSS-Queries - a Dataset for Querying Software Repositories

Database Optimization for Novelty Detection

5 Techniques to Identify Clusters In Your Data

Entity Resolution Tools

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only Python tool for entity resolution, but it is the only one (as far as we know) that treats entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also link records across disparate datasets. Dedupe also scales fairly well: in this post we demonstrate the library on a relatively small dataset of a few thousand records and on a very large dataset of several million.
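
To make the idea of deduplication concrete, here is a minimal sketch using only the Python standard library. It is not Dedupe's API or its method: the `similarity` and `deduplicate` functions, the 0.85 threshold, and the sample records are all assumptions made for illustration. It scores every pair of records with a plain string-similarity measure and groups the pairs that score above the threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average string similarity over the fields two records share."""
    fields = sorted(set(a) & set(b))
    if not fields:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def deduplicate(records, threshold=0.85):
    """Group record ids whose pairwise similarity meets the threshold.

    records: dict mapping record id -> dict of field values.
    Returns a sorted list of sorted id lists, one per presumed entity.
    The threshold is a hypothetical, hand-picked value.
    """
    # Union-find so that matches chain together (A~B and B~C => one group).
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Compare every pair of records; merge the groups of matching pairs.
    for left, right in combinations(records, 2):
        if similarity(records[left], records[right]) >= threshold:
            parent[find(left)] = find(right)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(ids) for ids in clusters.values())


# Made-up sample data: records 1 and 2 describe the same company.
records = {
    1: {"name": "Acme Corporation", "city": "Chicago"},
    2: {"name": "ACME Corporation.", "city": "Chicago"},
    3: {"name": "Globex Inc", "city": "Springfield"},
}
clusters = deduplicate(records)
print(clusters)
```

Comparing every pair is quadratic in the number of records, so a sketch like this is only practical for small inputs. A fixed threshold on raw string similarity is also a crude stand-in for the learned matching that the paragraph above attributes to Dedupe.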