(Data) Cleanliness Is Next To Godliness

I’m pleased to welcome Palisade Six Sigma Partner Edward Biernat of Consulting with Impact as featured guest blogger. As well as running a successful consultancy, Ed is a noted Six Sigma educator and author.


–Steve Hunt





I recently had dinner with Eric Alden, a Master Black Belt for Xerox Corporation.  Eric had just gotten back from the American Society for Quality’s (ASQ) headquarters in Milwaukee, where he was one of 200 Master Black Belts worldwide who generated the questions for the upcoming ASQ Master Black Belt certification examination (more on that in an upcoming post).  Eric had also recently completed a mini-course for the local ASQ chapter on data integrity.  We shared some war stories and came up with some common threads regarding data integrity.


1.       Just because it is a number doesn’t mean it is worth anything.  People get enamored with tons of data from process instrumentation, shop floor collection sources, or Excel spreadsheets.  There is a false sense of security in this pile of data, and managers often look to the Black Belt to ‘sort it out’, because with all that data, the answer must be in there somewhere.  Many a Belt has crashed on the rocky reefs of bad data, often after tons of time, effort, and credibility were wasted generating false answers.

2.       GIGO.  The Garbage In – Garbage Out philosophy of computing applies especially to existing corporate databases.  Here are a few recent examples of GIGO.

a.       A Belt wanted to analyze the specific timing of events in a shop floor process and had tons of data from the process instrumentation, with times down to the fraction of a second.  After lengthy analysis, the Belt found a significant difference between two shifts and forced the lesser shift to adopt the sequence of the more uniform shift.  Only after the change had introduced costly production problems and actually hurt the overall process was it discovered that the sensors were faulty and that operators had been manipulating the process to generate the ‘pretty charts’ that everyone expected.

b.      Office areas are not immune.  Something as simple as a checksheet used to record when a particular computer error occurred can be suspect, especially when the clerk fills in the times from memory at the end of the shift rather than logging each event as it occurs.

3.       Good data in bad spreadsheets.  Even if you get good data, having an inexperienced person set up the spreadsheet can cause problems.  It is analogous to someone using word processing software and building a table out of spaces and tabs.  It looks like a great table until you have to manipulate it.  Then it falls apart.  Problems like merged cells, embedded subtotals, random formulas inserted in cells, etc. can make a Belt weep and cause significant errors in the resulting analyses.
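As an illustration (hypothetical data, sketched in Python’s pandas rather than Excel), here is what cleaning up two of those spreadsheet sins, blanks left by merged cells and subtotal rows embedded in the data, can look like:

```python
import pandas as pd

# Hypothetical export of a "pretty" spreadsheet: merged cells arrive
# as blanks, and subtotal rows are mixed in with the real records.
raw = pd.DataFrame({
    "Department": ["Claims", None, None, "Billing", None],
    "Errors":     [3, 5, 8, 2, 2],
    "Note":       ["", "", "Subtotal", "", "Subtotal"],
})

# Drop the embedded subtotal rows -- they are derived, not data.
clean = raw[raw["Note"] != "Subtotal"].copy()

# Re-fill the values that a merged cell left blank.
clean["Department"] = clean["Department"].ffill()

print(clean[["Department", "Errors"]])
```

The point is not the specific tool: keeping one flat, unmerged table of raw records makes every later analysis easier.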

4.       Useless manipulation.  Often a big issue is that management wants data sliced a certain way for no good reason.  This sometimes leads to the proliferation of additional spreadsheets or databases that needlessly add complexity.  (Note: If you have an ERP system like Oracle or SAP, USE IT!  These systems are designed to house data and protect its integrity, and their data entry screens typically allow for better and more accurate entry.  Few things are more wasteful than entering everything in the ERP system and then re-entering it into a spreadsheet to appease a manager’s inability to adapt and change.)


What are some tactics for resolving these issues?

1.       On a macro level, start ensuring that the data your company is collecting is sound, as part of the preparation for a Six Sigma launch or as part of plain old good business.  Bad data slows a Six Sigma project down or stops it dead in its tracks, changing it from getting something done to fixing the data.

a.       Catalog your databases, including the extra ones (Excel, Access) that are usually relied upon but undocumented.

b.      Prioritize the data sources by synchronizing them with your Six Sigma launch sequencing. 

c.       Sample the data to ensure its usefulness.  If it is bad, fix it.  This will give teams better data to start off with and will allow time for that data to accumulate for analysis.
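The sampling step can be as simple as pulling random records and flagging values that are impossible for the process.  A minimal sketch in Python, with hypothetical cycle-time data and hypothetical plausibility limits:

```python
import random

random.seed(1)

# Hypothetical cycle-time records (minutes); a few impossible values
# have crept in, as often happens with manual entry.
records = [4.2, 3.9, 5.1, -1.0, 4.8, 0.0, 412.0, 4.4, 5.0, 3.7] * 10

# Audit a random sample rather than the whole database.
sample = random.sample(records, 30)

# Flag values outside a plausible range for this process.
suspect = [x for x in sample if not 0.5 <= x <= 60.0]
suspect_rate = len(suspect) / len(sample)

print(f"{len(suspect)} suspect values in {len(sample)} sampled "
      f"({suspect_rate:.0%})")
```

A high suspect rate in even a small sample tells you to fix the collection process before launching any analysis.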

2.       For specific projects, conduct a Measurement System Analysis (MSA) on your data sources.  (This tool is often used in the Measure phase of the DMAIC model.)  We often think of MSAs when it comes to physical measurements, but they are just as critical for ‘softer’ data.

a.       Pull the correct sample size.  In StatTools, under Statistical Inference there is a Sample Size Selection tool that can be used to determine the correct amount of data needed for the analysis.
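StatTools handles this through its interface; purely as an illustration of the idea behind such tools, the classic formula for the sample size needed to estimate a mean within a margin of error E, given a standard deviation estimate σ, is n = (z·σ/E)², which can be sketched as:

```python
import math

def sample_size_for_mean(sigma, margin, confidence_z=1.96):
    """Classic sample size for estimating a mean to within
    +/- margin: n = (z * sigma / margin)^2, rounded up.
    confidence_z defaults to 1.96 for 95% confidence."""
    return math.ceil((confidence_z * sigma / margin) ** 2)

# e.g. process std dev ~5 minutes, want the mean within +/- 1 minute
n = sample_size_for_mean(sigma=5, margin=1)
print(n)  # 97
```

The numbers here (σ = 5, E = 1) are hypothetical; the takeaway is that the required sample grows with the square of the precision you demand.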

b.      Pull your data randomly and follow the trail to the actual entry point.  That may mean watching how individuals enter data, probing for special circumstances, etc.

c.       In your analysis, look for random factors such as vacation fill-ins.  Eric and I both had several experiences where one person was filling in for someone who was out sick or on vacation and, usually due to inadequate training, deviated from the expected process.
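A quick way to surface that kind of special cause is to summarize the data by the person who entered it.  A minimal sketch in Python with pandas, using a hypothetical entry log:

```python
import pandas as pd

# Hypothetical entry log: the regular clerk versus a vacation fill-in.
log = pd.DataFrame({
    "operator":   ["pat"] * 6 + ["fill_in"] * 3,
    "cycle_time": [4.1, 4.0, 4.2, 3.9, 4.1, 4.0, 7.8, 8.1, 7.5],
})

# Summarize by operator; a large gap is a cue to go look, not proof.
summary = log.groupby("operator")["cycle_time"].agg(["mean", "count"])
print(summary)
```

A gap like this one doesn’t tell you who is ‘right’; it tells you where to go ask questions about training and procedure.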

3.       Pivot Tables are our friends.  Start today upgrading the skill sets of the people who do the actual data entry and first-level analysis.  Train them in how to use tools like Pivot Tables that slice the data but leave the actual spreadsheet intact.  The fewer merged cells, etc., that we fight with, the better.
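For readers working in pandas rather than Excel, the same idea, deriving a pivoted view while leaving the raw table untouched, looks like this (hypothetical defect log):

```python
import pandas as pd

# Hypothetical defect log kept as one flat table -- the raw sheet
# stays intact, and the pivot is a separate, derived view.
defects = pd.DataFrame({
    "shift":  ["A", "A", "B", "B", "A", "B"],
    "defect": ["scratch", "dent", "scratch", "scratch", "dent", "dent"],
    "count":  [3, 1, 4, 2, 2, 1],
})

pivot = pd.pivot_table(defects, values="count", index="defect",
                       columns="shift", aggfunc="sum", fill_value=0)
print(pivot)
```

Either way, the design principle is the same: one flat table of raw records, with every slice-and-dice view generated from it on demand.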

4.       Managers – Trust your Belt.  If they say the data is bad, it probably is.  No matter how much you want an answer today, you may not be able to get one.  The good news is that some processes can be modeled using @RISK to begin improvement that is directionally correct while waiting for the data to compile.  Then the better data can be used to either update or replace the early model.
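@RISK builds such models directly in the spreadsheet; purely as an illustration of the idea, here is a rough stand-in in plain Python: total lead time as the sum of two uncertain steps, each described by a triangular distribution (low, high, most likely) estimated by the people who run the process.

```python
import random

random.seed(42)

# Hypothetical model: lead time = prep + build, each step a
# triangular distribution; random.triangular(low, high, mode).
def lead_time():
    prep = random.triangular(2, 8, 4)    # days
    build = random.triangular(5, 15, 9)  # days
    return prep + build

trials = sorted(lead_time() for _ in range(10_000))

mean = sum(trials) / len(trials)
p90 = trials[int(0.9 * len(trials))]
print(f"mean ~ {mean:.1f} days, 90th percentile ~ {p90:.1f} days")
```

A directional model like this can guide early improvement work, and it is easy to swap in real distributions once the cleaned-up data accumulates.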

5.       Go hunting.  Find extraneous datasets and merge them / kill them.  The fewer that are out there, the more likely you will be able to ensure the integrity of those that remain.


Remember that data analysis is a funnel.  Tons of data lead to bunches of information, which can then help us make some decisions.  Throwing bad data into the system is like throwing bad tomatoes into the food distribution system.  The end results can be pretty messy and difficult to clean up.


Also, don’t miss Ed Biernat’s free live webcast DMAIC and Using a Non-Intuition Approach, Thursday, 11AM Eastern Time.


Sign up here:






Edward Biernat is the president of Consulting With Impact, Ltd., a training, coaching, and consultancy located in Canandaigua, NY that he founded in 1998.

Published by shunt27

I am a Lean Six Sigma Black Belt at Palisade Corporation.
