Project author: bhaskar24

Project description:
Clean Answers over Dirty Databases: A Probabilistic Approach
Language: Python
Project URL: git://github.com/bhaskar24/Clean_Answers_over_Dirty_Database.git


Clean Answers over Dirty Databases: A Probabilistic Approach

Course Code: CS702

Course Project: Distributed Database Management System

Overview

The authors of [1] propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. This repository contains a simulation of their work, written as a Python [3] script. The script rewrites queries over a database containing duplicates so that each answer is returned together with the probability that it is in the clean database.

Reference Dataset

Synthetic Data Generator, UIS Database Generator and Cora Dataset

Running the Simulator

Run the simulator script as follows:

  python simulator.py

Simulator Command Format

  Select Attribute1,Attribute2,...,AttributeN
  from Table1,Table2
  where condition1,condition2,...,conditionN
  groupBy Attribute1,...,AttributeN
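
To make the four-clause format above concrete, here is a minimal sketch of how such a command could be split into its clauses. It is not the repository's parser: `parse_command` is a hypothetical helper, and it assumes one clause per line, a leading keyword matched case-insensitively, and comma-separated items with no commas inside an item.

```python
def parse_command(text):
    """Split a simulator-style command into clause lists (hypothetical helper)."""
    keywords = ("select", "from", "where", "groupby")
    clauses = {}
    for line in text.strip().splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        key = parts[0].lower()  # assumption: keywords are case-insensitive
        if key not in keywords:
            raise ValueError(f"unknown clause: {parts[0]}")
        body = parts[1] if len(parts) > 1 else ""
        # Assumption: items are comma-separated and contain no commas themselves.
        clauses[key] = [item.strip() for item in body.split(",") if item.strip()]
    return clauses


example = parse_command("""
select id,sum(prob)
from customer
where balance>10
groupby id
""")
print(example)
```

Splitting on commas is the simplest choice that matches the format as written; a real parser would need more care if conditions or expressions could themselves contain commas.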

Query Re-Writing Example

Dataset Snippet of Customer Table

id  custId  name    balance  prob
c1  m1      John    20       0.7
c1  m2      John    30       0.3
c2  m3      Mary    27       0.2
c2  m4      Marion  5        0.8

A normal SQL query to fetch the id of customers with balance > 10 returns one row per matching duplicate:

  select id,prob
  from customer
  where balance>10

Result:

id  prob
c1  0.7
c1  0.3
c2  0.2

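With the clean-answers rewriting over the probabilistic database, the query groups the duplicates by id and sums their probabilities, so each id appears once with the probability that it is in the clean answer:

  select id,sum(prob)
  from customer
  where balance>10
  groupby id

Result:

id  prob
c1  1.0
c2  0.2

For c1 both duplicates satisfy the condition, so the probability is 0.7 + 0.3 = 1.0. For c2 only the duplicate m3 has balance > 10, so the probability is 0.2.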
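
The rewritten query can be checked against the Customer snippet with a short script. This is a minimal sketch that uses Python's built-in `sqlite3` module and standard SQL (`GROUP BY`) rather than the simulator's own command syntax; the function name `clean_answers` and the in-memory table are assumptions made for illustration, not part of the repository.

```python
import sqlite3

# The four rows of the Customer snippet: (id, custId, name, balance, prob).
ROWS = [
    ("c1", "m1", "John", 20, 0.7),
    ("c1", "m2", "John", 30, 0.3),
    ("c2", "m3", "Mary", 27, 0.2),
    ("c2", "m4", "Marion", 5, 0.8),
]


def clean_answers(min_balance=10):
    """Return (id, probability) pairs for customers with balance > min_balance."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer "
        "(id TEXT, custId TEXT, name TEXT, balance INTEGER, prob REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", ROWS)
    # Rewritten query: sum the probabilities of the duplicates that share an id.
    cur = conn.execute(
        "SELECT id, SUM(prob) FROM customer "
        "WHERE balance > ? GROUP BY id ORDER BY id",
        (min_balance,),
    )
    result = cur.fetchall()
    conn.close()
    return result


answers = clean_answers()
for cust_id, prob in answers:
    print(cust_id, round(prob, 2))
```

Summing is valid here because the duplicates of one id are mutually exclusive alternatives: exactly one of them belongs to the clean database, so their probabilities add up.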

References

[1] P. Andritsos, A. Fuxman, R.J. Miller, “Clean Answers over Dirty Databases: A Probabilistic Approach”, Proceedings of the 22nd International Conference on Data Engineering, 2006.

[2] https://github.com/mysql/mysql-server

[3] https://www.python.org/