Remove Duplicates Without Losing Important Records: A Checklist
Overview
A concise checklist to safely remove duplicate records while preserving unique or important data. Use this when cleaning datasets in spreadsheets, databases, or CSV files.
Pre-cleaning
- Backup: Make a copy of the original file or export a backup snapshot.
- Scope: Define which fields determine a duplicate (e.g., email, ID, name+date).
- Objective: Decide whether to fully delete duplicates or consolidate them (merge fields).
Detection
- Identify exact duplicates: Find rows where all fields match.
- Identify partial duplicates: Find rows matching on key fields but differing elsewhere.
- Flag vs. delete: Mark suspected duplicates first instead of deleting immediately.
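The detection steps above can be sketched in pandas. This is a minimal example on hypothetical data, assuming `email` is the key field; the column names are illustrative, not prescribed:

```python
import pandas as pd

# Hypothetical contact list; "email" is the key field, other columns may differ.
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "name":  ["Ann", "Ann", "Bob", "Bobby"],
})

# Exact duplicates: every field matches an earlier row.
df["exact_dup"] = df.duplicated(keep="first")

# Partial duplicates: key field matches an earlier row, but the row is not an exact copy.
df["partial_dup"] = df.duplicated(subset=["email"], keep="first") & ~df["exact_dup"]
```

Adding flag columns like this, rather than deleting rows right away, leaves the suspected duplicates reviewable before anything is removed.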
Prioritization rules (choose one)
- Keep newest: Preserve the row with the latest timestamp.
- Keep most complete: Preserve the row with fewest empty fields.
- Keep highest priority source: Preserve rows from trusted sources over others.
- Keep lowest ID: Preserve the record with the smallest primary key if IDs are meaningful.
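The "keep highest priority source" rule can be implemented by ranking sources and sorting before deduplication. A sketch on hypothetical data, assuming a `source` column and that a CRM system is the trusted source:

```python
import pandas as pd

# Hypothetical records from two systems; "crm" is the trusted source.
df = pd.DataFrame({
    "email":  ["a@x.com", "a@x.com", "b@x.com"],
    "source": ["import", "crm", "crm"],
})

# Lower rank = more trusted; unknown sources sort last.
priority = {"crm": 0, "import": 1}
df["rank"] = df["source"].map(priority).fillna(len(priority))

# Sort by trust, keep the first row per key, then drop the helper column.
kept = (df.sort_values("rank")
          .drop_duplicates(subset=["email"], keep="first")
          .drop(columns="rank")
          .sort_index())
```

The same sort-then-deduplicate pattern implements the other rules: sort by timestamp for "keep newest" or by ID for "keep lowest ID".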
Consolidation
- Field-level merge: For duplicates with complementary data, create a merged record (e.g., prefer non-empty fields).
- Audit trail: Record which records were merged and their original IDs.
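A field-level merge with an audit trail might look like the following sketch, on hypothetical data with complementary fields. It relies on pandas' `groupby(...).first()` taking the first non-null value per column:

```python
import pandas as pd

# Hypothetical duplicates with complementary data.
df = pd.DataFrame({
    "id":    [1, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "phone": [None, "555-0101", None],
    "city":  ["Oslo", None, "Bergen"],
})

# Audit trail: record which original IDs fold into each merged record.
audit = df.groupby("email")["id"].apply(list)

# Field-level merge: first non-empty value per column wins.
merged = df.groupby("email", as_index=False).agg("first")
```

Here the two `a@x.com` rows collapse into one record carrying both the phone and the city, and `audit` preserves the original IDs `[1, 2]` for that record.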
Validation
- Spot-check: Manually review a random sample of changes.
- Integrity checks: Run consistency checks (unique constraints, referential integrity).
- Compare counts: Verify expected reduction in row count and that key metrics (e.g., total revenue) remain reasonable.
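These validation steps can themselves be automated as assertions. A minimal sketch on hypothetical before/after frames, assuming `email` is the key and `revenue` is the metric being sanity-checked:

```python
import pandas as pd

before = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@x.com"],
                       "revenue": [100, 100, 50]})
after = before.drop_duplicates(subset=["email"], keep="first")

# Uniqueness: the key must now be unique.
assert after["email"].is_unique

# Counts: rows removed should equal the number of flagged duplicates.
expected_removed = before.duplicated(subset=["email"]).sum()
assert len(before) - len(after) == expected_removed

# Metric sanity check: duplicates double-counted revenue, so the total should drop
# by exactly the duplicated amount (250 -> 150 here).
assert after["revenue"].sum() == 150
```

Failing assertions stop a cleaning pipeline before a bad deduplication reaches production data.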
Automated tools & commands
- Excel/Sheets: Use Remove Duplicates, UNIQUE(), FILTER(), or conditional formatting to highlight duplicates.
- SQL: Use GROUP BY, ROW_NUMBER() OVER (PARTITION BY …) to identify and delete/keep rows.
- Python/pandas: Use df.drop_duplicates(subset=[…], keep=…) or groupby + agg to consolidate.
Post-cleaning
- Store backups: Archive pre- and post-clean datasets.
- Document rules: Save the deduplication rules and scripts used.
- Monitor: Schedule periodic checks to prevent duplicate buildup.
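A periodic check can be as small as counting duplicates on the key columns and alerting when the count is nonzero. A sketch, with the function name and sample data purely illustrative:

```python
import pandas as pd

def duplicate_count(df: pd.DataFrame, key_cols: list[str]) -> int:
    """Count rows that duplicate an earlier row on the key columns.

    Run on a schedule (e.g., cron) and alert when the result is nonzero.
    """
    return int(df.duplicated(subset=key_cols).sum())

df = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@x.com"]})
duplicate_count(df, ["email"])  # one duplicate on email
```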
Quick examples
- Excel: Data → Remove Duplicates → select key columns → OK.
- SQL (keep newest):
```sql
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
```
- pandas (keep most complete):
```python
df['non_nulls'] = df.notnull().sum(axis=1)
df = (df.sort_values('non_nulls', ascending=False)
        .drop_duplicates(subset=['email'], keep='first')
        .drop(columns='non_nulls'))
```
Final checks
- Re-run detection to ensure no unintended duplicates remain.
- Confirm that business metrics are unchanged where they should be.
- Roll back plan: Ensure you can restore the original if issues appear.