This script creates duplicates in known quantities for testing record linkage (e.g. MDM) systems. To test fuzzy matching, it can optionally transpose and mistype characters in the duplicated columns.
Note: this script is only a thin wrapper over the existing Faker library.
- Install Python 3+
- Download the repository
- Install the referenced modules with: pip install -r requirements.txt
python duplicate_data_generator.py --column_file sample_column_files/en_US_columns.json --localization en_US --output sample_data_files/US_data.csv --rows 100000 --duprate .10
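--column_file: The file path to the JSON column configuration file
--output: The CSV output file name
--rows: The total number of rows you would like to produce
--duprate: The known duplication rate of the records to produce
--localization: The Faker locale to use for generated values (e.g. en_US). See the Faker documentation for a list of possible values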
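As a rough illustration of how the --rows and --duprate values from the sample command relate, the sketch below splits a total row count into original and duplicate rows. It assumes the duplication rate is the fraction of the total rows that are duplicates; the script itself may interpret the rate differently, and `split_row_counts` is a hypothetical helper, not part of the script.

```python
def split_row_counts(rows, duprate):
    """Split a total row count into (original, duplicate) counts.

    Assumption: duprate is the fraction of the total rows that are duplicates.
    """
    duplicates = round(rows * duprate)
    return rows - duplicates, duplicates


originals, duplicates = split_row_counts(100000, 0.10)
print(originals, duplicates)  # 90000 10000
```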
The JSON config file takes an array of Column objects, each of which can be defined with the following properties:
The name of the column in the generated CSV file.
The type of the column. Supported types are:
first_name, last_name, company_name, street_address, secondary_address, city, state, postcode, country_code, phone_number, email, gender, date_of_birth, formatted_string, uuid
For formatted strings, the additional property "str_format" is required. In the format, each "?" is replaced with a letter and each "#" with a digit, e.g. "??-####" -> "AB-1234"
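The placeholder syntax can be sketched with the standard library alone. The `fill_format` helper below is hypothetical and only illustrates the substitution; Faker offers a `bothify` method with the same "?" and "#" placeholders, but whether the script calls it is an assumption. Uppercase letters are used here to match the "AB-1234" example.

```python
import random
import string


def fill_format(str_format, rng=random):
    """Replace each '?' with a random uppercase letter and each '#' with a random digit."""
    out = []
    for ch in str_format:
        if ch == "?":
            out.append(rng.choice(string.ascii_uppercase))
        elif ch == "#":
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # literal characters pass through unchanged
    return "".join(out)


print(fill_format("??-####"))  # same shape as "AB-1234"; the value varies per run
```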
The percentage of rows to populate for the given column.
The number of characters to transpose (i.e. swap around).
The number of characters to mistype. Please note that this simply substitutes a random key from the keyboard.
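The two kinds of corruption described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: `transpose_chars`, `mistype_chars`, and `KEYBOARD_KEYS` are hypothetical names, transposition is assumed to swap adjacent characters, and the key set is a guess.

```python
import random

# Hypothetical key set; the script's actual keyboard characters may differ.
KEYBOARD_KEYS = "abcdefghijklmnopqrstuvwxyz0123456789"


def transpose_chars(value, count, rng=random):
    """Swap `count` randomly chosen pairs of adjacent characters."""
    chars = list(value)
    if len(chars) < 2:
        return value  # nothing to swap
    for _ in range(count):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def mistype_chars(value, count, rng=random):
    """Overwrite `count` randomly chosen positions with a random key.

    Positions may repeat, and a key may happen to equal the original character.
    """
    chars = list(value)
    if not chars:
        return value
    for _ in range(count):
        i = rng.randrange(len(chars))
        chars[i] = rng.choice(KEYBOARD_KEYS)
    return "".join(chars)


rng = random.Random(42)  # seeded so repeated runs give the same output
print(transpose_chars("Jonathan", 1, rng))
print(mistype_chars("Jonathan", 1, rng))
```

Passing a seeded `random.Random` keeps the corrupted values reproducible, which is convenient when the generated duplicates are used as a fixed test set for a matching system.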