Skip to main content
← Back to Blog

Build a data cleaner that stops your GTM workflows from eating garbage

Colin Gillingham··4 min read
gtm-automationhubspotai-agentssales-automationcrm

This post is part of the GTM Automation Playbook — a 13-part series on building AI-powered GTM agents with HubSpot.


Every workflow I've built in this series depends on CRM data being at least roughly correct. The ICP scorer reads industry and employee count. The personalization writer pulls job titles and company descriptions. When that data is inconsistent or missing, the outputs range from slightly off to completely useless.

I watched an SDR send a personalized email referencing "Acme, Inc." when the prospect's company had already rebranded six months prior. The data was stale, the opener made no sense, and the thread died on read. That's the tax you pay on dirty data. Not one catastrophic failure, but hundreds of small ones compounding across every automated touchpoint.

What dirty actually means

It's rarely dramatic. The problems are mundane. Phone numbers stored as (555) 867-5309, 5558675309, +1 555-867-5309, and 555.867.5309 across four contacts at the same company. Company names entered as "HubSpot", "Hubspot, Inc.", "hubspot inc", and "HUBSPOT". Job titles left blank on 40% of records because the import skipped that column.

These inconsistencies break filters, duplicate detection, and list segmentation. Two records for the same company show up as separate accounts because one has "Inc." appended. Your routing workflow splits them to different reps.

The n8n cleanup workflow

I run this nightly. Six nodes, no AI required. Pure data hygiene.

Node 1: Schedule Trigger. Set to run at 2 AM. Off-peak keeps you well under HubSpot's 100-requests-per-10-seconds rate limit. The workflow processes contacts updated in the last 24 hours, so each run handles only the delta.

Node 2: HTTP Request, pull contacts. Hit POST /crm/v3/objects/contacts/search to fetch contacts where lastmodifieddate is greater than yesterday. Pull phone, company, jobtitle, firstname, lastname, email. Cap the batch at 100 records per request and loop if needed using the after cursor from the response's paging.next object.

Node 3: Code node, normalize phones. Strip everything that isn't a digit, then apply E.164 formatting. For US/CA numbers, that means prepending +1 if the result is 10 digits. For international numbers, you'll need a country code lookup based on the contact's country property. Here's the core logic:

for (const item of $input.all()) {
  let phone = (item.json.phone || '').replace(/\D/g, '');
  if (phone.length === 10) {
    phone = '+1' + phone;
  } else if (phone.length === 11 && phone.startsWith('1')) {
    phone = '+' + phone;
  }
  item.json.phone_normalized = phone || null;
}
return $input.all();

Node 4: Code node, standardize company names. Strip legal suffixes, normalize whitespace, apply title case. This makes downstream dedup matching far more reliable.

const suffixes = /,?\s*(Inc\.?|LLC|Ltd\.?|Corp\.?|Co\.?|Limited|Incorporated)$/i;
for (const item of $input.all()) {
  let name = (item.json.company || '').trim();
  name = name.replace(suffixes, '').trim();
  name = name.replace(/\s+/g, ' ');
  name = name.split(' ').map(w =>
    w.charAt(0).toUpperCase() + w.slice(1).toLowerCase()
  ).join(' ');
  item.json.company_normalized = name || null;
}
return $input.all();

Node 5: IF node, flag enrichment gaps. Route contacts missing jobtitle or phone to a separate branch. On that branch, add them to a HubSpot static list called "Needs Enrichment" via POST /crm/v3/lists/{listId}/memberships/add. Your enrichment tool (Clay, Clearbit, or a manual process) picks up from there. Contacts with complete data continue to the write-back node.

Node 6: HTTP Request, write back. PATCH /crm/v3/objects/contacts/{contactId} with the normalized phone and company name. Use the batch update endpoint (POST /crm/v3/objects/contacts/batch/update) if you're processing more than 10 records per run. Batch accepts up to 100 records per call and cuts your API usage dramatically.

Catching soft duplicates

After normalization, run a simple comparison. If two contacts share the same company_normalized value and one was created in the last 24 hours, send a Slack notification to your ops team with both record IDs. Don't auto-merge. Automated merging sounds efficient until it combines a current champion with a churned contact from 2019 and overwrites the lifecycle stage. Flag it, let a human decide.

HubSpot's built-in duplicate manager catches exact matches. It misses "Hubspot Inc" vs "HubSpot" because it doesn't normalize before comparing. Your workflow does.

What "clean enough" looks like

You don't need perfection. You need consistency. I define clean enough as: phone numbers in E.164 format, company names stripped of suffixes and casing variations, no contact missing both job title and phone, and soft duplicates flagged within 24 hours of creation.

Track it with a HubSpot report. Create an active list filtered to contacts where phone doesn't start with "+" or jobtitle is empty. That list should trend toward zero over time. If it's growing, your inbound forms or import process is creating dirty records faster than the workflow cleans them. Fix the source.

The cleaner doesn't generate revenue on its own. It makes everything downstream more accurate. Your ICP scores improve because the industry field is consistent. Personalization actually lands because the company name is correct. Reps stop working the same account in parallel because duplicates get caught early. Data quality is the unglamorous foundation that determines whether your automation stack actually works.

Colin Gillingham

Need a Fractional Head of AI?

I help companies build an AI operating system — shared context across teams, AI handling the repetitive work, and your people focused on what actually matters.

15+

Years in Tech

12+

AI Products Shipped

3

Fortune 500 Brands