Getting Clean Data From Less Than Clean Places

Data orthodoxies in the bin

An oft-identified barrier to adopting digital innovations in oil and gas is that the data is too ‘dirty’. But is it true?

The Dirty Data Speedbump

I was delivering a presentation this past week to an oil and gas concern about digital adoption drawn from my first book, and a question from the audience resonated with me. Their objection crops up repeatedly in conversations about digital, and it’s usually phrased as “what do we do with digital adoption when our data is suspect?”

It’s not at all surprising that the data assets of oil and gas companies are viewed with suspicion by those tasked with figuring out how to exploit digital, and by extension, data. Much of that data originates in places that are cold, dark, wet and muddy. It’s easy to assume that it’s the conditions that make for poor quality data. But that’s not the whole story.

There are many reasons why data is so vexatious for oil and gas:

Oil and gas is a business that’s conducted outdoors, exposed to the elements, and in extreme conditions (from arctic cold to Middle Eastern heat, from onshore farmland to offshore deep ocean). It’s tough to do anything consistently in this setting. Sensors collecting data will occasionally deliver weird results. As climate change grips our industrial infrastructure, it will also impact sensors that were not designed to cope with the new extremes.

Operations data originates with SCADA systems, DCSs, PLCs, and other assorted acronyms. While these systems have been modernising of late, the bulk of the installed base predates cloud computing, machine learning, and open systems. These operations systems are tightly coupled to the physical assets involved (pumps, compressors, well heads), which also predate concepts like wireless networks and the internet of things. The data they collect is for control and not analysis, and they were never designed for our new data-intense world.

Accountability for data quality is linked to the business unit where the data originates, which is a polite way of saying that data is highly fragmented as a resource. If the originating business unit doesn’t directly get value from high-quality data, it likely won’t (or can’t) invest to improve it. For example, operations has tons of sensors generating data, but the data is typically used for moment-by-moment control. Capturing context, fixing outliers, enhancing searchability, and improving accessibility (the features useful for data analysis) are not a priority.

Across oil and gas, data is not generally viewed as an asset and does not attract much capital. If anything, data’s value is tied to the cost of the technology used to store the data, with such costs having fallen to near zero (and so what is the value of the data in storage?). In those increasingly common cases where data is housed by a cloud services company, the value of the data might well be equated with the value of the services contract.

As an expense item, data generally attracts cost-cutting attention (i.e., annual budget scrutiny, and extra pressure when commodity prices are unfavorable). The diesel budget for an oil and gas company will far outweigh the budget for data systems, so the data budget won’t warrant much interaction among managers.

Data is frequently tied to the application system that generates the data, and most legacy application software companies have little incentive (and much disincentive) to promote easy access to their application data. Locking the data into walled gardens ties customers to the vendor’s specific analytic solutions.

Oil and gas lags in sorting out problems with its data. For example, there are few leaders in oil and gas carrying the title ‘chief data officer’ who would own the accountability to sort out data issues. For fun, I composed a little people search on LinkedIn (all possible connections, based in the US, working in oil and gas, job title of chief data officer), and found only four people in oil and gas (not services or technology companies serving oil and gas) who call themselves CDO. As a career path, data accountability looks underwhelming, and may not attract top-flight professionals.

The Data Orthodoxies

Underpinning this situation in oil and gas are a handful of orthodoxies about data that no longer hold in a digital world. Orthodoxy comes from two Greek words (‘orthos’, meaning right, true or straight, and ‘doxa’, meaning opinion) and translates to the ‘correct opinion’. An example of a correct opinion that we now know to be profoundly incorrect (thank you, COVID) is that people must work in offices to be effective at their jobs.

For an industry that relies so tightly on math, engineering, science and physics, oil and gas still clings tightly to its correct opinions on things that are not made of steel.

Here are some deeply held beliefs, or orthodoxies, about data that need confronting.

Data has no value.

This is objectively untrue. Capital markets now reward data-intense businesses with valuations well above the molecule and electron businesses in energy. Designing new plant and equipment without thought to leveraging the data it produces is folly, yet still frequent.

Data is best viewed as an expense.

This is an accounting rule. Accounting is itself a human-invented means to accommodate our human reality. Accounting rules, unlike the laws of gravity, are well within our remit to change.

Correcting data requires much manual effort.

Correcting data manually ignores the possibility of mechanically improving data quality using digital tools, either to manage away the impacts of outliers, or to fix data directly.

Data must first be clean to be useful.

The real issue is that not all decisions in oil and gas are of equal weight, and that imperfect data for some decisions is likely of little consequence.

How To Approach the Dirty Data Problem

I regret that when I’m unexpectedly asked about specific solutions to complex problems, I’ll often resort to offering a simple, immediate answer, when I should really pause and think first.

And so back to the question at the outset—how to approach dirty data. At an earlier time, the only way to clean up dirty data was to have a more senior individual with depth of experience and a range of expertise review the data, flag the problem items, add contextual insight, discuss possible fixes to the data, and undertake the correction. This is now impractical—senior talent is in short supply and high demand, the work to review data is unappealing and low value, and the business benefits from clean data are harder to quantify relative to other opportunities.

I suggested that an immediate action that the manager could take to deal with dirty data in his or her quite specific situation was to use the power of digital tools to help clean up the dirty data. These tools are exceptional at identifying outliers, missing data, data inconsistencies, obscure patterns, gaps in data, drift, and a host of other dimensions, and in making data consistently formatted, structured, and compliant.
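To make that a little more concrete, here is a minimal sketch of the kind of check such tools run. It is not any particular vendor's product. The pressure readings, the one-minute sampling interval, the function name `flag_readings`, and the outlier threshold are all assumptions I've invented for illustration. The sketch flags missing values, gaps in the time series, and outliers measured against the median (which, unlike the average, is not dragged around by the bad readings themselves).

```python
from datetime import datetime, timedelta
from statistics import median


def flag_readings(readings, expected_step, mad_threshold=5.0):
    """Flag missing values, time gaps, and outliers in sensor readings.

    readings: list of (timestamp, value or None), sorted by timestamp.
    expected_step: the assumed normal interval between readings.
    mad_threshold: assumed cut-off, in multiples of the median absolute
        deviation (MAD), beyond which a value is called an outlier.
    """
    values = [v for _, v in readings if v is not None]
    med = median(values)
    mad = median([abs(v - med) for v in values])

    flags = []
    prev_ts = None
    for ts, value in readings:
        issues = []
        # A gap is a longer-than-expected wait since the previous reading.
        if prev_ts is not None and ts - prev_ts > expected_step:
            issues.append("gap")
        if value is None:
            issues.append("missing")
        elif mad > 0 and abs(value - med) / mad > mad_threshold:
            issues.append("outlier")
        flags.append((ts, value, issues))
        prev_ts = ts
    return flags


# Hypothetical wellhead pressure readings, nominally one per minute.
start = datetime(2024, 1, 1, 0, 0)
minute = timedelta(minutes=1)
readings = [
    (start, 101.2),
    (start + 1 * minute, 101.4),
    (start + 2 * minute, None),    # sensor returned nothing
    (start + 3 * minute, 101.1),
    (start + 4 * minute, 250.0),   # implausible spike
    (start + 9 * minute, 101.3),   # five minutes since the last reading
    (start + 10 * minute, 101.5),
]

flags = flag_readings(readings, expected_step=minute)
for ts, value, issues in flags:
    if issues:
        print(ts.isoformat(), value, ", ".join(issues))
```

Note that the sketch only flags the suspect readings. Deciding what to do with them (drop, interpolate, or ask the field) still depends on the decision the data is meant to support.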

This will certainly help, but what else could they do?

Determine if the right problem is being targeted, and by extension, the right data.

For example, a mining company faced a problem of frequently broken teeth on the shovel buckets that extracted the ore. The engineers asked for help from their data scientists in predicting when teeth would break so that they could better plan for outages (shifting from reactive to predictive maintenance). A project to build a predictive algorithm was funded, the required data was identified, and the clean-up action started.

However, the digital team asked why the teeth were breaking in the first place. Was it operator behaviour, steel composition, welding approach, shovelling technique, digging speed, ore composition? This analysis required quite different data, but by eliminating, or drastically reducing, teeth failure, operations stood to capture a far greater productivity gain than from just predicting failure. No complex algorithms either.

Clean only the data that is required to capture the benefit. 

Oil and gas is sometimes not surgical enough in approaching data clean-up, and will want to clean up everything (“while you’re at it, how about doing this too?”). By sharpening the problem focus, the minimal amount of data to clean comes more clearly into view, and the cost to clean data becomes more manageable relative to the benefit to be captured.

The bigger solutions—creating the position of Chief Data Officer, converting data from an expense line item to a balance sheet asset, inventing performance metrics for data quality, changing sourcing rules for software applications to promote open data—are generally not within the mandate of a unit manager trying to get a digital project successfully deployed, but should be actively considered by executive leadership and the Board.


Questionable data looks all but inevitable in an industry that works where it is cold, dark, wet and muddy. However, the reaction to, and treatment of, dirty data is well within the control of leaders in oil and gas charged with digital deployment. Immediate tactics are helpful, but we are only going to be more digital in the future, so longer-term changes are the way to go.

Check out my latest book, ‘Carbon, Capital, and the Cloud: A Playbook for Digital Oil and Gas’, available on Amazon and other on-line bookshops.

You might also like my first book, ‘Bits, Bytes, and Barrels: The Digital Transformation of Oil and Gas’, also available on Amazon.

Take Digital Oil and Gas, the one-day on-line digital oil and gas awareness course on Udemy.

Take the one-hour Digital for the Front Line Worker in Oil and Gas, on Udemy.

Biz card: 🪪 Geoffrey Cann on OVOU
Mobile: ☎️ +1(587)830-6900
