Data teams spend a shocking share of their time solving problems they did not create. Missing values, inconsistent formats, clipped text, phantom whitespace, duplicated records, and mismatched codes silently erode model performance and business trust. You can build elegant pipelines, only to watch them buckle under the weight of messy inputs. Over the past few years, I have folded ChatGPT into my cleaning workflow, not as a toy but as a practical, day-to-day tool. It speeds up the grunt work, surfaces edge cases I might miss when tired, and drafts code that I refine with real context.
Used well, a model that can write code becomes a second pair of hands. Used poorly, it becomes a generator of plausible nonsense. The difference shows up in three places: how you prompt, how you test, and how you fold the outputs into existing tooling and standards. The goal is not to let a bot clean your data. The goal is to let it draft first passes that you tighten into reliable, predictable steps.
The real cost of imperfect data
At one retail client, duplicate customers inflated lifetime value by 18 to 22 percent in a single quarterly analysis. Another team once shipped an attribution report that assigned 40 percent of conversions to an “Unknown” channel because email UTMs were a mix of lowercase, uppercase, and badly encoded characters. In a healthcare setting, a stray space in a diagnosis code caused 3 to 5 percent of claims to fall out of downstream rules. None of this is glamorous work. All of it matters.
Data cleaning failures hide in averages. Outliers tell the story. The value of tightening the first mile compounds: every clean, well-documented step prevents downstream arguments and ad hoc patches. When I added LLM-generated code to my process, the improvements were immediate: faster prototype turnarounds, more complete regex coverage, and quicker iteration on schema enforcement. The caveat is that you cannot blindly trust generated logic. You have to test it, the same way you would a new analyst’s pull request.
Where ChatGPT fits in the cleaning lifecycle
For most teams, cleaning stretches across discovery, transformation, validation, and monitoring. ChatGPT helps in each phase, but in different ways.
Exploration and profiling benefit from quick, ad hoc snippets. Ask for a pandas profile report, a summary of distinct values with counts and null rates, or a function to detect mixed datatypes within a column. You will get a working draft in seconds. Those drafts are often verbose, which is fine during exploration.
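The kind of profiling draft I ask for tends to look roughly like this. A minimal sketch with pandas; the `profile` name and the mixed-type heuristic are mine, not a standard API:

```python
import pandas as pd

def profile(df: pd.DataFrame, max_values: int = 5) -> dict:
    """Quick ad hoc profile: null rates, top values, and mixed-dtype flags."""
    report = {}
    for col in df.columns:
        series = df[col]
        report[col] = {
            "null_rate": float(series.isna().mean()),
            "top_values": series.value_counts(dropna=True).head(max_values).to_dict(),
            # a column whose non-null cells hold more than one Python type
            # is usually hiding strings mixed in with numbers
            "mixed_types": series.dropna().map(type).nunique() > 1,
        }
    return report
```

Running it on a small frame with `3` hiding in an email column flags `mixed_types` immediately, which is exactly the kind of problem averages conceal.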
Normalization and transformation are where generated code can save hours. Standardizing date formats, trimming whitespace, replacing exotic Unicode characters, decoding HTML entities, deduplicating near-identical text, harmonizing country codes to ISO 3166, or mapping product categories to a controlled vocabulary all lend themselves to code that can be templated and refined. Given examples, ChatGPT can generate the mapping logic and the tests around it.
Validation and testing become easier when you have the model write unit tests, Great Expectations suites, or SQL checks. Ask for tests that enforce referential integrity, verify that categorical columns contain only known values, or fail the pipeline if null rates exceed thresholds. The model is good at scaffolding boilerplate and proposing edge cases you might not catch on the first pass.
Monitoring requires lightweight checks and inexpensive alerts. Here, I have used ChatGPT to draft dbt tests tailored to my schema, as well as snippets that compute population stability indices on key columns and flag drift beyond a set band. You still tune thresholds and decide what triggers a ticket, but the scaffolding arrives quickly.
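The population stability index itself is small enough to sketch in the standard library. The 0.1 and 0.25 bands in the docstring are the usual rule of thumb, not thresholds from this project:

```python
import math
from collections import Counter

def psi(expected: list, actual: list, smoothing: float = 1e-6) -> float:
    """Population stability index between two categorical samples.
    Rule of thumb: < 0.1 stable, 0.1 to 0.25 drifting, > 0.25 investigate."""
    categories = set(expected) | set(actual)
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for cat in categories:
        e = e_counts[cat] / len(expected) + smoothing  # smooth to avoid log(0)
        a = a_counts[cat] / len(actual) + smoothing
        score += (a - e) * math.log(a / e)
    return score
```

Identical distributions score near zero; a 90/10 split flipping to 10/90 lands well past the 0.25 band, which is the sort of shift worth a ticket.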
Prompting in a way that yields maintainable code
The quality of generated code tracks the quality of your prompt. Specificity pays. Instead of asking “clean this dataset,” define the structure and the rules. The model needs schemas, examples, and constraints, not vibes.
Describe the incoming schema with dtypes. State the output schema you want. Give concrete examples of bad values and how they should be fixed. Name the library and version you intend to use, and the runtime target. If your team defaults to pandas 1.5 with Python 3.10, say so. If this step will run inside Spark on Databricks with pyspark.sql.functions, state that. Mention memory constraints if you are cleaning tens of millions of rows. That steers the model away from row-wise Python loops and toward vectorized operations or window functions.
I also specify design constraints: pure functions with explicit inputs and outputs, no hardcoded paths, logging hooks for counts and sampling, deterministic behavior. If you want a function to return a DataFrame and a dictionary of metrics rather than printing to stdout, say it. These constraints keep you from receiving code that “works” on a laptop and dies in production.
Turning examples into amazing transforms
Few tasks illustrate the value of generated code quite like standardizing dates. Most datasets have at least three date formats in the wild. One file uses MM/DD/YYYY, another uses DD-MM-YYYY, and a third uses “May 3, 2023 14:22:09 UTC” with stray time zones. I will provide a small example table with desired outputs, then ask for a function that handles these and returns ISO 8601 strings in UTC. The first draft usually works for 80 percent of cases. From there, I harden it with edge cases: leap days, missing time zones, noon versus midnight confusion, and plainly malformed records that should be flagged, not coerced.
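A hardened draft ends up looking something like this. A sketch using only the standard library; the format list and the assume-UTC-when-unspecified policy are illustrative choices you would confirm with the data owner:

```python
from datetime import datetime, timezone
from typing import Optional

# Formats seen in the feeds so far; extend as new ones appear.
KNOWN_FORMATS = [
    "%m/%d/%Y",
    "%d-%m-%Y",
    "%b %d, %Y %H:%M:%S %Z",
]

def parse_date_to_utc(raw: str) -> Optional[str]:
    """Return an ISO 8601 UTC string, or None to flag (not coerce) bad rows."""
    if not raw or not raw.strip():
        return None
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # policy decision: naive timestamps are assumed UTC
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).isoformat()
    return None  # flag for quarantine rather than guess
```

Returning None instead of a coerced default keeps malformed rows visible downstream, which matches the flag-not-coerce rule above.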
Generated regex for phone numbers, emails, and IDs is another sweet spot. Ask for E.164 phone normalization with country detection based on prefixes and fallback assumptions. The first pass often overfits. Give counterexamples, and ask the model to simplify. Push it toward using vetted libraries where licensing and runtime allow. phonenumbers in Python is more reliable than a custom regex, and the model will suggest it if you mention that third-party packages are acceptable.

Text normalization benefits from clarity about character classes. I once inherited a product description feed with hidden soft hyphens and narrow no-break spaces. Regular trimming missed them. I asked ChatGPT for a function that normalizes Unicode, removes zero-width characters, and collapses multiple whitespace into a single space without touching intraword punctuation. The generated code used NFKC normalization, an explicit set of zero-width code points, and a concise regex. I kept it, wrapped it in a helper, and added a log of how many rows changed beyond trivial whitespace. That metric caught upstream changes two months later when a CMS editor started pasting content from a new WYSIWYG with different code points.
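The shape of that helper, reconstructed as a minimal sketch (the exact zero-width set in the real version may differ):

```python
import re
import unicodedata

# Invisible code points that survive ordinary trimming:
# soft hyphen, zero-width space/joiners, word joiner, BOM.
ZERO_WIDTH = dict.fromkeys([0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF])

def normalize_text(raw: str) -> str:
    """NFKC-normalize, strip zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # also folds no-break spaces to plain spaces
    text = text.translate(ZERO_WIDTH)          # None values delete the characters
    text = re.sub(r"\s+", " ", text)           # collapse runs of any whitespace
    return text.strip()
```

Intraword punctuation such as hyphens in “well-known” passes through untouched, which is the property that distinguishes this from naive scrubbing.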
From single-use snippets to reusable components
The early wins happen in notebooks. The lasting value comes when you lift snippets into small, reusable modules. I dedicate a utilities file in each project to common cleaning tasks: parse_date, normalize_whitespace, coerce_boolean, to_title_case with locale awareness, and a deduplicate function that respects a composite business key plus fuzzy matching on a descriptive field.
ChatGPT can draft these utilities with docstrings, type hints, and examples embedded in tests. I ask it to create pytest tests, with parameters that reflect the actual mess I see. Then I run them locally, fix issues, and have it regenerate once I adjust the prompt to reflect the failure modes. The conversational loop helps. The code ends up shaped to my context rather than being generic.
If your stack runs on dbt, ask for Jinja macros that implement cleaning logic at the SQL layer. For example, write a macro to standardize text fields by trimming, lowercasing, and removing non-breaking space code points, then apply it across staging models. The model can infer patterns from a few examples and produce consistent macros across sources.
Schema enforcement that prevents quiet rot
A schema is a contract. When it is loose, silent errors creep in and your pipeline smiles while lying. I ask ChatGPT to generate pydantic models or pandera schemas that capture my expectations. That might include numeric ranges, category enumerations, and column-level constraints such as uniqueness or nullable flags. When new data arrives, I validate and log failures. If the schema breaks, I want the job to stop or shunt bad records to a quarantine table with a failure reason. It is better to pay the cost of failing fast than to ship wrong numbers to leaders.
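pandera and pydantic do this properly; purely to show the quarantine pattern, here is a stdlib sketch. The column names and rules are made up for illustration:

```python
from typing import Any, Callable, Dict, List, Tuple

# Column -> predicate. Note age is explicitly nullable: None passes,
# so no one is tempted to fillna(0) just to satisfy the check.
SCHEMA: Dict[str, Callable[[Any], bool]] = {
    "customer_id": lambda v: isinstance(v, str) and len(v) > 0,
    "age": lambda v: v is None or (isinstance(v, int) and 0 <= v <= 120),
    "country": lambda v: v in {"US", "GB", "NG", "DE"},
}

def validate(rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split rows into (clean, quarantined); quarantined rows keep their failure reasons."""
    clean, quarantined = [], []
    for row in rows:
        reasons = [col for col, check in SCHEMA.items() if not check(row.get(col))]
        if reasons:
            quarantined.append({**row, "_failed_checks": reasons})
        else:
            clean.append(row)
    return clean, quarantined
```

The quarantine rows carry their reasons with them, so the failure log writes itself.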
The model helps by drafting the schema objects quickly. It also suggests checks that I might overlook if I were coding from scratch. This is where you keep your principles front and center. If nullable means nullable, do not let the model sneak in fillna with 0 just to pass a check. Cleaning should not erase meaning. If zero is not the same as null in your domain, defend that line.
Matching and deduplication with just enough complexity
Entity resolution tempts overengineering. For most business data, you can get far with conservative rules that are explainable and auditable. I use ChatGPT to draft layered matching logic: first on strong identifiers, then on email or phone with normalization, then on name plus address with fuzzy thresholds. I have it surface the thresholds as configuration and return not only the deduplicated table but also a match report with counts by rule. That report has saved uncomfortable meetings more than once, because stakeholders see exactly how many records merged under which criteria.
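The layered matcher plus report can be sketched in a few lines. This is a simplified illustration that stops at exact keys; the fuzzy name-plus-address layer is omitted, and the rule names are hypothetical:

```python
from collections import Counter
from typing import Callable, List, Tuple

# Layered rules, strongest first; email normalization is deliberately conservative.
RULES: List[Tuple[str, Callable[[dict], object]]] = [
    ("customer_id", lambda r: r.get("customer_id")),
    ("email", lambda r: (r.get("email") or "").strip().lower() or None),
    ("phone", lambda r: r.get("phone")),
]

def deduplicate(records: List[dict]) -> Tuple[List[dict], Counter]:
    """Keep the first record per match key; count merges per rule for the report."""
    seen: set = set()
    survivors: List[dict] = []
    report: Counter = Counter()
    for rec in records:
        keys = [(name, fn(rec)) for name, fn in RULES]
        keys = [(name, key) for name, key in keys if key is not None]
        hit = next((name for name, key in keys if (name, key) in seen), None)
        if hit:
            report[hit] += 1  # merged away under this rule
        else:
            survivors.append(rec)
            seen.update(keys)
    return survivors, report
```

The per-rule Counter is the match report: it tells stakeholders how many records merged under the id rule versus the softer email and phone rules.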
I learned that asking for explainability forces the model to structure the code around transparency. Instead of an opaque “score,” I request flags per rule. The code then produces a clear lineage for each merge decision. When a customer support team questions a merge, I trace the logic and show the evidence. ChatGPT is surprisingly good at building those breadcrumb trails when you ask for them.
Bringing unit tests and data checks into the habit loop
The biggest gift a code generator can give your cleaning process is a test scaffold. Ask for pytest unit tests for each function with both happy paths and adversarial examples. Ask for property-based tests for date parsing that verify roundtrips under formatting variations. Ask for Great Expectations suites that assert null bounds, uniqueness, allowed sets, and value distributions. Even if you only adopt 70 percent of what it generates, you are ahead.
I keep the tests close to the code, run them in CI, and measure coverage for core utils. For data checks, I prefer cheap and frequent checks over heavy snapshots. For example, compute day-over-day null rates and standard deviations for key columns, then page only when the change exceeds a z-score threshold or crosses a hard bound. You can ask ChatGPT to write the SQL for those checks against your warehouse, returning a small result set to your alerting tool.
Responsible use: when not to trust and how to verify
A model will happily produce code that looks right and is wrong. It might hallucinate packages or gloss over timezone semantics. I put a few safeguards in place.
I do not accept black-box cleaning steps for anything that touches revenue, compliance, or safety. If a function’s behavior is not obvious from code and tests, it does not ship. I also set traps. For example, I include deliberately malformed dates and identifiers in test fixtures to ensure the function fails loudly instead of guessing. Where possible, I prefer library calls with known behavior over custom regex. And I review any use of eval-like operations or dynamic code generation with heightened caution. If performance matters, I benchmark before adopting. Generated code can be elegant and slow. For large datasets, Spark or SQL offload often wins over pandas for join-heavy cleaning.
Finally, I never let the model invent business rules. It drafts implementations for rules I specify. If a rule is ambiguous, I confirm it with the business owner, then reflect the decision in code and tests. The point of discipline is not to slow you down. It is to keep speed from turning into rework.
A walk-through: from raw data to clean dataset with a reproducible trail
Consider a customer table landing daily from three source systems. You see mixed-case emails with whitespace, phone numbers in multiple formats, duplicate customers across systems, addresses with extraneous punctuation, and dates in inconsistent formats. Here is a practical path I would take with ChatGPT in the loop.
I start by profiling a 50 to 100 thousand row sample, enough to find patterns without wrestling memory. I ask ChatGPT for a pandas snippet that prints value counts for key columns, null shares, and basic regex matches for phone and email. I feed it sample rows that show the worst problems and tell it the target formats: lowercase emails trimmed of whitespace, E.164 phones where possible, address normalization that removes trailing punctuation and normalizes whitespace, and dates in ISO 8601 in UTC.
Next, I request small, pure helper functions: normalize_email, normalize_phone, normalize_address, parse_date_to_utc. I specify that normalize_phone should use phonenumbers if allowed, otherwise a clean fallback with conservative rules. I ask for docstrings, type hints, and logging of how many values changed. I then paste in my sample bad values and verify outputs. If normalize_phone guesses too aggressively, I rein it in. Conservative beats clever for contact fields.
For deduplication, I ask for a function that groups by a stable customer_id where present, otherwise email after normalization, otherwise phone. If multiple records remain, it should pick the most recently updated row and keep the one with the most non-null fields. I ask for a report that counts how many rows resolved at each rule layer and a join that surfaces conflicts for manual review. The draft code usually nails the skeleton. I adjust threshold logic and add an override mechanism for known false matches.
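The survivorship rule within a duplicate group reduces to a sort key. A sketch that reads the recency-first, completeness-as-tiebreak policy literally; it assumes updated_at is an ISO 8601 string, so lexicographic order is chronological:

```python
from typing import List

def pick_survivor(duplicates: List[dict]) -> dict:
    """Among duplicate rows, prefer the most recently updated;
    break ties by the count of non-null, non-empty fields."""
    def score(row: dict):
        completeness = sum(1 for v in row.values() if v not in (None, ""))
        return (row.get("updated_at") or "", completeness)
    return max(duplicates, key=score)
```

Encoding the policy as a tuple-valued key makes the precedence explicit and easy to change when the business owner decides completeness should win outright.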
For validation, I ask for a pandera schema with column types and constraints, plus pytest tests for the helper functions, and dbt tests for downstream models. I paste a subset of the pandera checks directly into the cleaning script so bad rows get quarantined with reasons. I add a sampling function that writes five example corrections per column to a scratch table. Those samples become part of a daily Slack post to the data channel, which builds trust and catches surprises.
Finally, I run a timed benchmark on the full day’s data, measure wall-clock time, and ask ChatGPT to suggest vectorization or parallelization where needed. If the job runs in Spark, I ask for the pyspark equivalents and compare joins and window operations against the pandas version. I keep whatever meets my performance budget with headroom. Then the code moves into the pipeline, with tests as gates.
Beyond pandas: SQL, Spark, and dbt realities
Half the time, your cleaning lives in SQL. The model does well at generating ANSI SQL for trimming, case normalization, regex replacements, and safe casts. It can also use window functions to deduplicate based on business logic. If you specify your warehouse dialect, the output improves. Snowflake’s REGEXP_REPLACE differs slightly from BigQuery’s. I always include the target dialect in my prompt, and I ask for safe casting patterns that produce null on failure and log counts of failed casts. In a dbt project, I have the model generate macros for repeated transforms and tests for accepted values and uniqueness.
In Spark, performance traps are common. The model might default to UDFs when built-in functions would be faster. Tell it to avoid Python UDFs unless absolutely necessary, to prefer pyspark.sql.functions, and to reduce shuffles by avoiding wide groupBy operations until needed. Then profile. If a join explodes, ask for a broadcast hint for small dimensions. The model can add these patterns, but you still validate with real sizes.
Privacy, compliance, and reproducibility
Cleaning often touches sensitive fields. If you work with regulated data, do not paste raw examples into a chat. Redact or generate synthetic analogs that preserve structure. Better yet, use a secured, approved environment that integrates the model with your own data protections. For auditability, ensure that every transformation is versioned and that you can rerun the same job with the same parameters to reproduce an output table. I include a checksum of input files, the Git commit hash of the cleaning code, and the schema version in a run metadata table. ChatGPT can generate the function that writes this metadata, but you need to plug in your storage and governance patterns.
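The metadata writer is the kind of function I hand off to the model and then review closely. A minimal sketch using only the standard library; the field names are illustrative, and the git call falls back to "unknown" outside a repository:

```python
import hashlib
import subprocess
from datetime import datetime, timezone

def run_metadata(input_path: str, schema_version: str) -> dict:
    """Audit record for one run: input checksum, code commit, schema version, timestamp."""
    digest = hashlib.sha256()
    with open(input_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # stream large files
            digest.update(chunk)
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip() or "unknown"
    except OSError:
        commit = "unknown"
    return {
        "input_sha256": digest.hexdigest(),
        "git_commit": commit,
        "schema_version": schema_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
```

Writing this record to a run metadata table on every execution is what makes “rerun the same job, get the same table” a verifiable claim rather than a hope.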
Measuring impact and knowing when to stop
You can do this work forever. Pick measurable goals. Reduce duplicate customer records by a target share. Cut null rates on phone and email below a threshold. Enforce ISO date compliance across all event tables by a given date. Add tests to prevent regression. Ask ChatGPT to propose a small scorecard with five metrics, then implement it. Share trends with stakeholders. When the graph of bad records flattens, move on.
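One possible five-metric scorecard, sketched over plain dicts. The metric choices and field names here are hypothetical; the point is that each goal above maps to one number you can trend:

```python
import re

_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")

def scorecard(rows: list) -> dict:
    """Five-number cleaning scorecard for the stakeholder dashboard."""
    n = len(rows)
    if n == 0:
        return {"row_count": 0}
    def null_rate(field: str) -> float:
        return sum(1 for r in rows if not r.get(field)) / n
    ids = [r["customer_id"] for r in rows if r.get("customer_id")]
    return {
        "row_count": n,
        "duplicate_id_rate": 1 - len(set(ids)) / len(ids) if ids else 0.0,
        "email_null_rate": null_rate("email"),
        "phone_null_rate": null_rate("phone"),
        "iso_date_rate": sum(
            1 for r in rows if _ISO_DATE.match(str(r.get("created_at") or ""))
        ) / n,
    }
```

Run it daily, chart the five numbers, and the flattening curve tells you when to stop.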
There will be a point where added cleaning yields diminishing returns. For example, chasing the last 1 percent of malformed addresses may not pay off if your marketing team only sends digital campaigns. Be explicit about trade-offs. Document what stays unsolved and why. I usually include a “known issues” section in the repo that tracks decisions, and I use the model to generate the first draft from commit messages and tests.
What experienced teams do differently with generated code
A few patterns separate teams that win with ChatGPT from those that churn.
They treat the model as a sketchpad, not an oracle. Draft, test, refine. They seed prompts with schemas, examples, and constraints. They insist on deterministic, pure functions where possible. They add tests immediately, not later. They manage risk by starting in non-critical paths and graduating proven steps into production. They keep humans in the loop where ambiguity is high, such as entity matching beyond clear identifiers. And they make small investments in tooling that pay off daily: a standard log format, a sampling mechanism, and a run metadata table.
A quick checklist to upgrade your data cleaning with ChatGPT
- Specify schema, target formats, and constraints in your prompt, including library versions and runtime targets.
- Ask for pure, reusable functions with docstrings, type hints, and logging of changes.
- Generate tests alongside code: unit tests for functions, data checks for tables, and thresholds for alerts.
- Prefer vetted libraries and built-in functions over custom regex or UDFs unless necessary.
- Measure performance and correctness, then promote code to production with versioning and run metadata.
Closing thought
Data cleaning is not a side quest. Done well, it becomes the backbone of reliable analytics. ChatGPT will not remove the work, but it can accelerate the parts that repeat and increase your confidence about edge cases. Keep your standards high, your prompts specific, and your tests plentiful. Over time, you will spend less time chasing ghosts and more time answering questions that matter.