datsom1/db-diff

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChangelogLicense

db-diff is the most advanced and user-friendly Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. Designed specifically for database workflows, it delivers precise, human-readable or machine-readable diffs, supports custom key columns, handles massive files efficiently with mode, and offers flexible field selection.



  • Compare CSV, TSV, and JSON files for differences
  • Human-readable and machine-readable (JSON) output
  • JSON output to file with filename and path support
  • Detects added, removed, and changed rows and columns
  • Supports custom key columns for row identity
  • Field inclusion/exclusion for focused diffs
  • mode for very large files (memory efficient)
  • Can output to terminal or JSON file
  • Automatic delimiter and encoding detection
  • Python 3.6+ compatible

Install the latest version directly from :

pip install git+https://.com/datsom1/db-diff.git

To upgrade to the latest version:

pip install --upgrade --force-reinstall git+https://.com/datsom1/db-diff.git

Suppose you have two CSV files:

one.csv

Id,name,age
1,Cleo,4
2,Pancakes,2

two.csv

Id,name,age
1,Cleo,5
3,Bailey,1

Compare them using:

db-diff one.csv two.csv --key=Id

Sample output:

1 rows changed, 1 rows added, 1 rows removed

1 rows changed

  Id: 1
    age: "4" => "5"

1 rows added

  Id: 3
  name: Bailey
  age: 1

1 rows removed

  Id: 2
  name: Pancakes
  age: 2

db-diff is a flexible CLI tool for comparing two data files (CSV, TSV, or JSON). It detects added, removed, and changed rows and columns, and can output results in a human-readable or machine-readable format.

Basic usage:

db-diff [OPTIONS] PREVIOUS CURRENT
  • PREVIOUS and CURRENT are the file paths to the two files you want to compare.
  • The tool auto-detects file format by extension, or you can specify with --format.

Key features:

  • Custom Key Column: Use --key to specify which column uniquely identifies rows.
  • Output Formats: Choose between human-readable (readable), JSON (json), or save JSON to a file (jsonfile).
  • Field Selection: Use --fields to compare only specific columns, or --ignorefields to exclude columns.
  • Mode: For very large files, use -- (CSV/TSV only, files must be sorted by key).
  • Encoding: Specify file encoding with --encoding.
  • Show Unchanged: Use --showunchanged to display unchanged fields for changed rows.
  • List Fields: Use --listfields to print available columns and exit.
  • Timing: Use --time to display how long the diff operation took.

See all options:

db-diff --help

You can use db-diff as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.

from db_diff import load_csv, load_json

# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
    prev = load_csv(f, key="Id")

# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
    curr = load_json(f, key="Id")
from db_diff import compare

# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)
  • show_unchanged: If True, includes unchanged fields for changed rows.
  • fields: Pass a set of field names to only compare those fields.
  • ignorefields: Pass a set of field names to ignore during comparison.
from db_diff import _compare_csv

diff = _compare_csv(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)

The result of compare or _compare_csv is a dictionary:

{
    "added": [ ... ],            # List of added rows (dicts)
    "removed": [ ... ],          # List of removed rows (dicts)
    "changed": [                 # List of changed rows
        {
            "key": "row_id",
            "changes": {
                "field1": ["old", "new"],
                ...
            },
            "unchanged": { ... } # (optional) if show_unchanged=True
        },
        ...
    ],
    "columns_added": [ ... ],    # List of columns added
    "columns_removed": [ ... ]   # List of columns removed
}
from db_diff import human_text

print(human_text(diff, key="Id", current=curr))
from db_diff import load_csv, compare, human_text

with open("one.csv") as f1, open("two.csv") as f2:
    prev = load_csv(f1, key="Id")
    curr = load_csv(f2, key="Id")
    diff = compare(prev, curr, show_unchanged=True)
    print(human_text(diff, key="Id", current=curr))

See all available options with:

db-diff --help

A summary of key options:

OptionDescription
--key TEXTColumn to use as a unique ID for each row (default: first column header)
--output TEXTOutput format: readable, json, or jsonfile (default: readable)
--outfilename FILEFile to write JSON output to (used with --output=jsonfile)
--outfilepath DIRDirectory to save the output file (used with --output=jsonfile)
--fields TEXTComma-separated list of fields to compare (all others ignored)
--ignorefields TEXTComma-separated list of fields to ignore during comparison
--showunchangedShow all fields for changed records, not just changed fields
--timeMeasure and display elapsed time for the diff operation
--format TEXTExplicitly specify input format: csv, tsv, or json (default: auto-detect)
--encoding TEXTInput file encoding (default: utf-8)
--Use mode for very large CSV/TSV files (requires files to be sorted by key)
--listfieldsList available fields/columns in the input files and exit
--versionShow the version and exit
-h, --helpShow help message and exit

Show unchanged fields for changed rows:

db-diff one.csv two.csv --key=Id --showunchanged

Output as JSON:

db-diff one.csv two.csv --key=Id --output=json

Save JSON output to a file:

db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json

Compare only specific fields:

db-diff one.csv two.csv --key=Id --fields=Id,name

Ignore specific fields:

db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate

mode for large files:

db-diff large1.csv large2.csv --key=Id --

  • For very large files, use -- mode (CSV/TSV only, files must be sorted by key).
  • Supports efficient memory usage and fast comparison for millions of rows.

Clone the repository:

git clone https://.com/datsom1/db-diff.git
cd db-diff

Install dependencies:

pip install -e .

This project is licensed under the Apache License 2.0.


Author: Thomas Coyle
Repository: https://.com/datsom1/db-diff

About

Python CLI tool and library for comparing CSV database dumps and finding differences.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%