duplicate-data-generator

This script creates duplicate records in known quantities for testing record linkage (e.g. MDM) systems. To test fuzzy matching, it can optionally transpose and mistype characters in the duplicated columns.

Note: this script is only a thin wrapper over the existing Faker library.

Installation

Install Python 3+
Download the repository
Install the referenced modules with: pip install -r requirements.txt

Usage Example

python duplicate_data_generator.py --column_file sample_column_files/en_US_columns.json --localization en_US --output sample_data_files/US_data.csv --rows 100000 --duprate .10

Command Line Parameters

column_file

The file path to the JSON column configuration file.
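As a rough illustration of what such a file could contain, the sketch below writes a small column configuration as a JSON array. The property names (name, type, transposition_chars, mistype_chars) come from the Column Configuration File Settings section; the helper `write_column_file`, the file name, and the specific values are hypothetical, and optional settings such as fill_rate are left out.

```python
import json

# Hypothetical column definitions. The property names follow this README;
# the chosen columns and values are made up for illustration only.
columns = [
    {"name": "first_name", "type": "first_name", "transposition_chars": 1},
    {"name": "last_name", "type": "last_name", "mistype_chars": 1},
    {"name": "email", "type": "email"},
]


def write_column_file(path, column_defs):
    """Write the column definitions to `path` as a JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(column_defs, handle, indent=2)


# Example (writes a file in the current directory):
# write_column_file("my_columns.json", columns)
```

The resulting file could then be passed to the script through --column_file. The files under sample_column_files remain the authoritative examples of the expected layout.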

output_name

The name of the CSV output file.

rows

The total number of rows you would like to produce.

duprate

The known duplication rate of the records to produce, given as a decimal fraction (e.g. .10, as in the usage example above).
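To make the relationship between rows and duprate concrete, here is a minimal sketch of the arithmetic. It assumes that duprate is the fraction of the total rows that are duplicates of other rows; the script may count duplicates differently, so treat `split_row_counts` as a hypothetical helper rather than the script's own logic.

```python
def split_row_counts(rows, duprate):
    """Return (original_rows, duplicate_rows) for a given total.

    Assumption: duprate is the share of ALL rows that are duplicates.
    """
    duplicate_rows = round(rows * duprate)
    return rows - duplicate_rows, duplicate_rows


# With the values from the usage example (--rows 100000 --duprate .10):
originals, duplicates = split_row_counts(100000, 0.10)
```

Under that assumption, the usage example would yield 90000 original rows and 10000 duplicate rows.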

localization

The locale used to generate the data (e.g. en_US). See the Faker documentation for a list of possible values.

Column Configuration File Settings

The JSON config file takes an array of column objects, each of which can define the following settings:

name (Required)

The name of the column in the generated csv file.

type (Required)

The type of the column. Supported types are:

first_name, last_name, company_name, street_address, secondary_address, city, state, postcode, country_code, phone_number, email, gender, date_of_birth, formatted_string, uuid

For formatted_str, an additional property "str_format" is required. It takes a pattern such as "??-####", which produces values like "AB-1234".
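The pattern syntax can be sketched in a few lines. The version below assumes, based only on the "??-####" -> "AB-1234" example, that each ? becomes a random uppercase letter and each # a random digit, with every other character copied through. `expand_pattern` is a hypothetical helper, not the script's code; Faker provides a similar `bothify` method, and whether the script uses it is not stated here.

```python
import random
import string


def expand_pattern(pattern, rng=random):
    """Expand a "??-####" style pattern into a random string.

    Assumption: "?" -> uppercase letter, "#" -> digit, anything else is kept.
    """
    out = []
    for ch in pattern:
        if ch == "?":
            out.append(rng.choice(string.ascii_uppercase))
        elif ch == "#":
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


# A seeded generator makes the result repeatable between runs.
sample = expand_pattern("??-####", random.Random(42))
```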

fill_rate (Not Required)

The percentage of rows to populate for the given column.

transposition_chars (Not Required)

The number of characters to transpose (i.e. switch around).
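One plausible reading of this setting is sketched below: each transposition swaps a randomly chosen character with its right-hand neighbour. This is an assumption made for illustration; the script may choose positions or pairs differently, and `transpose` is a hypothetical name.

```python
import random


def transpose(value, count, rng=random):
    """Swap `count` randomly chosen pairs of adjacent characters.

    Assumption: a transposition means swapping two neighbouring characters.
    """
    chars = list(value)
    if len(chars) < 2:
        return value  # nothing to swap
    for _ in range(count):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# The result always contains the same characters, only in a different order.
scrambled = transpose("Smith", 1, random.Random(1))
```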

mistype_chars (Not Required)

The number of characters to mistype. Note that each mistyped character is simply replaced with a random key from the keyboard.
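A minimal sketch of this behaviour, under stated assumptions: `count` distinct positions are picked at random and each is overwritten with a key drawn from a fixed set. The key set `KEYBOARD_KEYS` (lowercase letters and digits) and the function name `mistype` are hypothetical; the script's actual key set is not documented here.

```python
import random

# Assumed key set for illustration; the real script may use other keys.
KEYBOARD_KEYS = "abcdefghijklmnopqrstuvwxyz0123456789"


def mistype(value, count, rng=random):
    """Replace up to `count` characters with random keyboard keys."""
    chars = list(value)
    if not chars:
        return value
    # Pick distinct positions, never more than the string has.
    positions = rng.sample(range(len(chars)), min(count, len(chars)))
    for pos in positions:
        chars[pos] = rng.choice(KEYBOARD_KEYS)
    return "".join(chars)


typo = mistype("Johnson", 2, random.Random(7))
```

Because the replacement key is random, it can occasionally match the original character, so the number of visible changes may be lower than `count`.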
