Garbage In, Gospel Out: How Bad Data Derails AI Systems

Bruce Mullan
May 7
5 min read

Key points

Around 80% of an organisation's data is not AI-ready.
AI amplifies problematic data by automating poor decisions faster, generating errors in workflows, and eroding trust.
Start small with a data quality assessment focusing on your most critical AI use cases. Australia's AI Governance framework provides practical guidance to manage data for AI systems.

This week, I'm onto data. Not just any data, the bad data. Bad data is all around. In the past, it was the perilous "data migration" stage that always had just enough issues to bring a system implementation project to its knees.

Fast forward, no-one really talks about "data migration" in an AI context. I think this is because AI is "unleashed" rather than installed. But the same bad data problem exists, only it's worse, because AI works across all the data you let it access, including obsolete, duplicate, unstructured or incomplete data.

Show me the data

Whenever I stay with someone for a few days in their home, the first day is spent finding things. Where are your coffee mugs? Where's the bathroom? Where's a good place to eat? Also, where not to go: don't go into my teenage son's bedroom, there are clothes all over the floor.

The same thing happens when I walk into a client site. I'm entering their house, and their way of organising things. I need to ask: where are the policies stored? Where are your procedures? Where is the data the AI system is being trained on?

Over the last 15 years, I've seen plenty of ways of organising things, especially data. From ceiling-high stacks of paper files, shared network drives, PC hard drives, USB sticks, cloud storage to structured computer systems. Data is literally everywhere.

Since the job of keeping the data in the house tidy was usually owned by no-one, people developed their own methods and solutions. When they left, someone else came in, re-arranged everything and created a folder on departure called Arthur's Stuff.

In the early days of IT, data storage was expensive, so only the important stuff was digitally stored. Staff were routinely advised to remove large attachments from their email because there was a duplicate version already stored on the network. Then, as storage became cheap, the amount of data collected and stored ballooned. In addition, data retention rules appeared, such as requiring a company to keep records on children (in education or care settings) for a lifetime.

AI systems amplify bad data

So when AI is layered on top of years of document and data accumulation, AI amplifies the problematic data by automating poor decisions faster, embedding errors into workflows, and eroding trust among users and stakeholders.

What's worse is Garbage In, Gospel Out.

Garbage In, Gospel Out describes the dangerous tendency to treat computer-generated data, models, or AI outputs as absolute, unquestionable truth (the gospel) regardless of the accuracy, quality, or biases of the initial input data (the garbage).

Garbage In, Gospel Out warns us against blind trust in automated systems.

How much of the data we hold is actually useful for AI? Research suggests only around 20% is AI-ready, 50% is obsolete and 30% should be kept confidential.

So 80% of an organisation's data is not AI-ready. That's the big problem no-one knows about until you run an AI pilot.

Unleashing AI over poor-quality data is one of the most common reasons AI initiatives fail. In traditional systems, we painstakingly go through a data migration phase to clean up the existing data, transform it and ensure it's ready for the new system.

AI systems don’t fix bad data, they amplify it. Your AI still needs a data migration phase.

While organisations often focus on models, tools, and platforms, the real issue is the quality, structure, and governance of their underlying data.

A practical starting point

You will need to think somewhat deeply about your data and how it's organised:

Can the AI access it?
Is the data usable for AI?
Is the data confidential or sensitive?
Can it be read by a machine?
Who owns this document or data?

A practical starting point is a data quality assessment focusing on your most critical AI use cases. Rather than trying to fix everything, you can prioritise data that directly feeds AI models or high-impact decisions. Don't boil the ocean!
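As a rough illustration, the questions above could be captured as a simple checklist per data asset. This is only a sketch: the `DataAsset` fields and the `assessment_issues` helper are names I've made up for the example, not part of any standard or tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataAsset:
    """One document or dataset being assessed (hypothetical structure)."""
    name: str
    ai_can_access: bool       # Can the AI access it?
    usable_for_ai: bool       # Is the data usable for AI?
    confidential: bool        # Is the data confidential or sensitive?
    machine_readable: bool    # Can it be read by a machine?
    owner: Optional[str]      # Who owns it? None means no-one is accountable.


def assessment_issues(asset: DataAsset) -> List[str]:
    """Return the reasons this asset is not yet AI-ready (empty list = ready)."""
    issues = []
    if not asset.ai_can_access:
        issues.append("not accessible to the AI system")
    if not asset.usable_for_ai:
        issues.append("not usable for AI")
    if asset.confidential:
        issues.append("confidential or sensitive")
    if not asset.machine_readable:
        issues.append("not machine-readable")
    if asset.owner is None:
        issues.append("no defined owner")
    return issues


# Example: a scanned policy that nobody owns.
scanned_policy = DataAsset(
    name="leave-policy-scan.pdf",
    ai_can_access=True,
    usable_for_ai=True,
    confidential=False,
    machine_readable=False,
    owner=None,
)
print(assessment_issues(scanned_policy))
```

Even a spreadsheet with these five columns would do the same job. The point is to answer the questions for the data behind your most critical use case first.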

You won't have to search too far to find outdated records, legacy documents, duplicated files or historical practices that are not relevant to today's practices, policies or operating conditions.

AI models struggle with different versions of the same document - which one to use? In machine learning contexts, duplicates can bias models and distort training outcomes. Incomplete data limits the ability of AI to identify patterns or make confident predictions, reducing accuracy and increasing the likelihood of hallucinations.
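One simple way to surface duplicated files is to compare a fingerprint of each document's text. The sketch below is an assumption on my part about how you might do this: the `find_duplicates` function is hypothetical, and it only catches copies that are identical after tidying up whitespace and letter case. It will not detect different versions of the same document.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_duplicates(documents: Dict[str, str]) -> List[List[str]]:
    """Group document names whose normalised text is identical."""
    groups = defaultdict(list)
    for name, text in documents.items():
        # Collapse whitespace and ignore case so trivial differences don't hide a copy.
        normalised = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Keep only fingerprints shared by more than one document.
    return [sorted(names) for names in groups.values() if len(names) > 1]


docs = {
    "shared-drive/leave-policy.txt": "Staff accrue four weeks of leave.",
    "arthurs-stuff/leave-policy-copy.txt": "Staff accrue  four weeks of leave. ",
    "shared-drive/travel-policy.txt": "Book travel through the portal.",
}
print(find_duplicates(docs))
```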

Most importantly, you'll need to define data ownership. That’s probably a key reason for the bad data in the first place. Without defined accountability, data becomes fragmented across systems and teams, with no-one responsible for quality, access controls, or the data lifecycle.

Together, these issues create “data entropy”, a gradual degradation of data reliability over time as disorder, uncertainty, or randomness builds up within a dataset. AI doesn't like data entropy! But neither do traditional systems.

Establish your data governance processes

Once you've completed your data quality assessment you'll probably need to categorise the data as follows:

[Figure: Value vs Risk categorisation for AI systems. Classification model for data in AI systems.]

Addressing your data issues requires a deliberate shift to treating it as a managed asset. In the AI Governance Standard, Data Quality is covered by Statement 16: Ensure data quality is acceptable and Statement 17: Validate and select data for AI use.

Data Ownership is covered under Statement 1: Define an Operational model. You will need to assign clear data accountability for business units (e.g. Operations, Finance, HR), with defined responsibilities for quality, definitions, and access. Avoid heavy bureaucracy; all you need is clarity on who is responsible for what.

Next, you need to establish your Data Supply Chain processes (Statement 13). Where possible, automate data controls at the point of AI ingestion rather than trying to clean data downstream (including archiving and destruction of data not fit for AI use).
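To make "controls at the point of ingestion" a little more concrete, here is a minimal sketch of a gate that decides what happens to each item before the AI ever sees it. Everything in it is an assumption for illustration: the metadata keys (`confidential`, `owner`, `last_reviewed`), the two-year `MAX_AGE` threshold and the four outcomes are mine, not taken from the AI Governance Standard.

```python
from datetime import date, timedelta

# Hypothetical threshold: anything not reviewed in two years is treated as stale.
MAX_AGE = timedelta(days=730)


def ingestion_decision(meta: dict, today: date) -> str:
    """Return 'ingest', 'archive', 'quarantine' or 'exclude' for one item."""
    # If confidentiality is unknown, assume the worst and keep it out.
    if meta.get("confidential", True):
        return "exclude"
    # No owner means no-one can vouch for quality, so hold it for review.
    if not meta.get("owner"):
        return "quarantine"
    last_reviewed = meta.get("last_reviewed")
    if last_reviewed is None or today - last_reviewed > MAX_AGE:
        return "archive"
    return "ingest"


today = date(2025, 5, 7)
print(ingestion_decision(
    {"confidential": False, "owner": "HR", "last_reviewed": date(2024, 11, 1)}, today))
print(ingestion_decision(
    {"confidential": False, "owner": "HR", "last_reviewed": date(2019, 3, 1)}, today))
```

The design choice worth noting is the default: anything with missing metadata is kept away from the AI rather than let through.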

Data quality is not a one-off project because data degrades over time (aka data entropy). Establishing continuous monitoring and feedback loops (Statement 25) will be very important.

It also helps to establish data metrics such as data completeness, accuracy and timeliness, and to track these alongside AI performance (Statement 12: Define Success Criteria).
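Two of these metrics are easy to calculate once you decide what they mean for your data. The sketch below uses my own working definitions, which are assumptions: completeness is the share of required fields that are filled in, and timeliness is the share of records updated within a chosen number of days. Accuracy is left out because it needs a trusted source to compare against.

```python
from datetime import date
from typing import List


def completeness(records: List[dict], required_fields: List[str]) -> float:
    """Share of required fields that are filled in across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for record in records for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def timeliness(records: List[dict], today: date, max_age_days: int) -> float:
    """Share of records whose 'updated' date is within max_age_days of today."""
    if not records:
        return 0.0
    fresh = sum(
        1 for record in records
        if record.get("updated") is not None
        and (today - record["updated"]).days <= max_age_days
    )
    return fresh / len(records)


records = [
    {"name": "A", "email": "a@example.com", "updated": date(2024, 1, 1)},
    {"name": "B", "email": "", "updated": date(2020, 1, 1)},
]
print(completeness(records, ["name", "email"]))      # 0.75
print(timeliness(records, date(2024, 6, 1), 365))    # 0.5
```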

When your AI outputs are incorrect, you may need to trace the issue back to your data (Statement 4: Auditing) and fix the root cause.

Most organisations will find around 20% of their data is AI-ready. Successful AI adoption depends less on shiny, sophisticated AI tools and more on disciplined data practices and the adoption of governance standards. First, find the right starting point for your AI projects. Get something working, build trust in the data and the AI. Expand from there. And remember to always clean up your bedroom, grandma is coming over.

Stay safe, Bruce

ABOUT ME

I write all my own content, you can tell by the odd typo and occasional missing word. I use AI for my research. I also teach organisations how to implement the Australian AI Governance Standard and confidently transition to AI systems. To learn about my upcoming public AI Governance workshops visit: Public workshops

To learn more about AI Governance, check out my podcast: Hitchhikers Guide to AI Governance Podcast

Bruce Mullan hosts the Hitchhikers Guide to AI Governance podcast.
