
Automating Error Checks in Data Analysis: A Comprehensive Guide


This article is the fourth installment in a series aimed at guiding you through the process of developing software for automatic scientific data analysis. The initial article introduced the rationale and fundamental steps, while the second focused on organizing datasets to facilitate automated analysis and identifying test conditions. The third entry explained how to implement a loop for performing calculations on test outcomes and storing these results. In this fourth piece, we will address a crucial aspect of this entire process: verifying data and analysis outcomes for errors to prevent them from skewing final results.

Ensuring the Integrity of Testing and Analysis Results

One of the main criticisms of automating data analysis concerns its reliability. An algorithm that executes calculations without discernment may fail to recognize poorly executed tests or analytical mistakes, leading to erroneous final results. This concern is not unfounded, particularly in laboratory settings where tests can deviate from the plan, making error identification during data analysis essential.

Fortunately, there are methods to incorporate error-checking into your programs. Techniques such as printing interim outputs and plotting data do require some manual checks, yet they demand considerably less human effort compared to entirely manual analysis. Additionally, developing an automated data-checking algorithm allows the program to independently verify data quality, significantly minimizing the time required for data validation. When combined with the inherent repeatability of computer programs, these strategies can yield a data analysis workflow that is both more reliable and faster than manual calculations.

The following sections will explore three methods for ensuring data quality in Python programs.

Printing Intermediate Outputs

Printing intermediate outputs is akin to revealing all calculations. This practice not only aids in debugging during program development but also enables others to verify results, fostering confidence in the automated data analysis process. Many users may prefer not to delve into Python code directly, making it essential to provide ample intermediate outputs for independent calculations.

The core idea behind printing intermediate outputs is to display as many calculation steps as possible in a format reminiscent of Excel. This approach simplifies result verification, allowing others to comprehend the calculations and compare their findings against the Python output. This can generally be accomplished through two key steps:

  1. Include as many calculation details as possible within the data frame. While external variables or lists may be necessary, they should be used sparingly. Centralizing all data and calculations within a single data frame enhances clarity and facilitates verification.
  2. Save each test's data frame to a distinct .csv file.

Some of this happens naturally: most calculations are performed directly on the data frame, and their results are stored within it. Other details require extra effort, such as adding a column for constants so they can be verified alongside the rest of the output table.
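For example, a fixed property such as the water's specific heat can be written into its own column so it appears in the saved table. Here is a minimal sketch; the column names, the outlet-temperature column, and the 8.33 lb/gal conversion for water are assumptions for this example:

# Store a constant alongside the measured data so reviewers can see it
Data['Specific Heat (BTU/lb-F)'] = 1.0

# Use the column in calculations, keeping every step inside the data frame
Data['Heat Transfer (BTU/min)'] = (Data['Hot Flow Rate (gal/min)'] * 8.33 *
                                   Data['Specific Heat (BTU/lb-F)'] *
                                   (Data['Hot Inlet Temperature (deg F)'] -
                                    Data['Hot Outlet Temperature (deg F)']))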

Creating a new .csv file for each test ensures that all calculations across the project are preserved, avoiding overwriting previous results. This can be accomplished by 1) establishing a new folder for each test and 2) saving results using dynamic file names that reflect the current test conditions. The following code demonstrates this process while employing the techniques highlighted in Part 2 to track test conditions:

import os

# Extract the nominal test conditions from the filename
# (Find_Between is the helper function introduced in Part 2)
Flow_Hot = Find_Between(filename, '_FlowHot=', '_FlowCold')
Flow_Cold = Find_Between(filename, '_FlowCold=', '_TemperatureHot')
Temp_Hot = Find_Between(filename, '_TemperatureHot=', '_TemperatureCold')
Temp_Cold = Find_Between(filename, '_TemperatureCold=', '.csv')

# Build a folder path that encodes the nominal conditions of this test
Folder = (r'C:\Users\JSmith\DataAnalysis\Data\Flow_Hot=' + Flow_Hot +
          'Flow_Cold=' + Flow_Cold + 'Temp_Hot=' + Temp_Hot +
          'Temp_Cold=' + Temp_Cold)

if not os.path.exists(Folder):
    os.makedirs(Folder)

# Calculations on Data go here

# Save the full data frame, intermediate results included, inside the new folder
Data.to_csv(Folder + '\\Flow_Hot=' + Flow_Hot + 'Flow_Cold=' + Flow_Cold +
            'Temp_Hot=' + Temp_Hot + 'Temp_Cold=' + Temp_Cold + '.csv',
            index=False)

This code operates as follows:

  1. It identifies the nominal conditions of the current dataset, ensuring the program has the necessary information for naming folders and files while maintaining an organized structure for future reference.
  2. It uses the conditions from Step 1 to create a unique variable for the folder path.
  3. It checks for the existence of a similarly named folder and creates one if it does not exist.
  4. All calculations on the data are performed.
  5. Finally, the data frame is written to a new .csv file located within the newly created folder, ensuring the naming structure reflects the nominal test conditions for easy retrieval later.

Utilizing Plots for Quality Assessment

One effective method for assessing the quality of test data and the performance of data analysis is through plots. Automated data analysis programs can swiftly generate plots for numerous tests, significantly reducing the manual effort required to produce them individually. This enables users to quickly scan the plots and ascertain data quality.

You can now automate the process of generating and saving plots for each test. This generally involves using the techniques outlined in Automating Scientific Data Analysis Part 2 and Automating Analysis of Scientific Data Sets to create a program that cycles through all datasets, performs the necessary calculations, generates plots, and saves the results.
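A minimal sketch of that loop structure might look like the following, assuming the data files live in the folder used earlier and that the calculation and plotting steps are filled in per project:

import os
import pandas as pd

Data_Folder = r'C:\Users\JSmith\DataAnalysis\Data'

# Cycle through every test file, analyze it, and save its plots
for filename in os.listdir(Data_Folder):
    if not filename.endswith('.csv'):
        continue
    Data = pd.read_csv(os.path.join(Data_Folder, filename))
    # ... perform the calculations from Part 3 on Data ...
    # ... generate and save the verification plots described below ...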

The key here is to create enough plots to swiftly and visually verify that each test was conducted correctly. For instance, when considering a heat exchanger example, the saved plots should enable users to quickly confirm:

  1. That the hot-side and cold-side flow rates closely align with nominal conditions outlined in the test plan.
  2. That the inlet temperatures on both sides match the expected conditions.
  3. That all parameters remained stable enough to guarantee quality, steady-state operation.
  4. That the filters used to define the steady state period accurately captured the correct segment of the dataset.
  5. That the final steady-state effectiveness value is both stable and reasonable.

These objectives can typically be met with three plots.

Figure 1 displays an example plot of the water flow rates on both sides of the heat exchanger. For this illustration, assume the nominal flow rate conditions for the test are set at 3.5 gal/min on each side. In the plot, both flow rates fluctuate between 3.4 and 3.6 gal/min, which is a minor variation that falls within acceptable test parameters, thus confirming the first condition.

Figure 1 also demonstrates that the flow rates satisfy condition three. While there is minor scatter in the data, the long-term trend remains stable around 3.5 gal/min. A test would be deemed unsatisfactory if, for example, the flow rate fell to 3.0 gal/min before returning to the target rate.
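A plot like Figure 1 might be generated with matplotlib along these lines. This is a sketch: the time and flow-rate column names are assumptions, and Folder is the per-test folder created earlier:

import matplotlib.pyplot as plt

Nominal_Flow = 3.5  # gal/min, from the test plan

fig, ax = plt.subplots()
ax.plot(Data['Time (min)'], Data['Hot Flow Rate (gal/min)'], label='Hot Side')
ax.plot(Data['Time (min)'], Data['Cold Flow Rate (gal/min)'], label='Cold Side')
ax.axhline(Nominal_Flow, color='black', linestyle='--', label='Nominal')
ax.set_xlabel('Time (min)')
ax.set_ylabel('Flow Rate (gal/min)')
ax.legend()
fig.savefig(Folder + '\\Flow_Rates.png')
plt.close(fig)  # free memory when generating plots for many tests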

Figure 2 serves a similar purpose for temperature data, assuming a hot-side inlet temperature of 100.4 °F and a cold-side inlet temperature of 50 °F. The test is valid if the recorded inlet temperatures are close to these nominal values and do not fluctuate significantly during the steady state phase. Both criteria are met, indicating that this plot satisfies conditions two and three.

Figure 3 illustrates the calculated effectiveness of the heat exchanger. The dataset is filtered to show only data after the valves switch to the test flow rate. The effectiveness rating hovers around 0.34 with expected variation due to fluctuations in temperature and flow rate. Initial data points indicate a transitional period when the filter was first applied, but their limited number minimizes their impact on the average effectiveness. The stability of the effectiveness data confirms that conditions four and five are satisfied.
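The steady-state filter mentioned above can be as simple as a boolean mask on the data frame. In this sketch, the 'Valve Position (-)' column and its value are illustrative stand-ins for whatever signal marks the switch to the test flow rate:

# Keep only the data recorded after the valves switch to the test flow rate
Steady = Data[Data['Valve Position (-)'] == 1].copy()

# The reported effectiveness is then the average over the steady-state window
Effectiveness_Avg = Steady['Effectiveness (-)'].mean()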

Having access to these three plots enables users to quickly verify that the test was executed correctly and that the data is reliable, requiring just seconds of active engagement. Integrating a section within an automated data analysis program to create and save these plots proves to be an efficient and effective method for ensuring data quality.

Developing an Automated Data Checker

The most thorough and automated approach for checking data quality is to develop an automated data checker. This script examines the recorded data to assess what was measured, compares it against nominal test conditions, evaluates their acceptability, and reports any unacceptable results to the user. This can be especially beneficial, as it reduces the number of plots requiring manual review. Projects involving hundreds of tests can easily produce thousands of plots, so minimizing the number that need examining saves considerable time, effort, and budget.

The following code illustrates how this process can be implemented. Assume the program has a data frame named “Temp” for temporarily storing information about questionable results and another data frame called “SuspiciousTests” containing the comprehensive list.

import numpy as np
import pandas as pd

if abs(Temperature_Hot - np.mean(Data['Hot Inlet Temperature (deg F)'])) > Threshold_Difference_Temperature:
    # Record the questionable test in the temporary data frame
    Temp.loc[0, 'Filename'] = filename
    Temp.loc[0, 'Test Parameters'] = 'H' + str(Flow_Hot) + '-C' + str(Flow_Cold) + '-T' + str(Temperature_Hot)
    Temp.loc[0, 'Code'] = 'Temperature_HotInlet_Avg'
    Temp.loc[0, 'Value'] = np.mean(Data['Hot Inlet Temperature (deg F)'])
    # DataFrame.append was removed in recent pandas releases; concat is equivalent
    SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index=True)

SuspiciousTests.to_csv(r'C:\Users\JSmith\DataAnalysis\Suspicious Tests.csv', index=False)

The code operates as follows:

  1. It compares the nominal hot-side inlet temperature, Temperature_Hot, with the average measured value. If the difference exceeds a predefined threshold (Threshold_Difference_Temperature), an issue is flagged.
  2. If an issue is detected, the test parameters—including the filename, nominal test conditions, a code for the unmet condition, and the measured value—are recorded in the SuspiciousTests data frame.
  3. After processing all data, the SuspiciousTests data frame is exported to a .csv file, documenting which tests appear questionable.

The example provided involves a single check for the average hot-side inlet temperature against the nominal condition. A comprehensive program would also incorporate checks for other test parameters and standard deviations to confirm the stability of all parameters.
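For instance, a stability check on the same measurement might compare its standard deviation against a second threshold, mirroring the structure above. This is a sketch; Threshold_StDev_Temperature is an assumed constant defined elsewhere in the program:

if np.std(Data['Hot Inlet Temperature (deg F)']) > Threshold_StDev_Temperature:
    Temp.loc[0, 'Filename'] = filename
    Temp.loc[0, 'Test Parameters'] = 'H' + str(Flow_Hot) + '-C' + str(Flow_Cold) + '-T' + str(Temperature_Hot)
    Temp.loc[0, 'Code'] = 'Temperature_HotInlet_StDev'
    Temp.loc[0, 'Value'] = np.std(Data['Hot Inlet Temperature (deg F)'])
    SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index=True)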

Here are some general guidelines for developing an automated data checking algorithm:

  • It should verify all nominal test conditions to ensure they are acceptable in both average and standard deviation.
  • It should assess the results of any filters to confirm they captured the correct data range.
  • It should evaluate the final calculated output to ensure it falls within the expected range and that the test provided reliable signals.
  • It must be rigorously tested to ensure it correctly identifies problematic tests while not flagging acceptable ones, as the sketch after this list illustrates.
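One practical way to perform that last test is to run the checker over a handful of tests whose quality is already known from manual review and confirm the flags match. A sketch, with illustrative filenames:

# Hand-verified examples: every file in Known_Bad should be flagged,
# and no file in Known_Good should appear in SuspiciousTests
Known_Bad = ['Test_A.csv', 'Test_B.csv']
Known_Good = ['Test_C.csv', 'Test_D.csv']

Flagged = set(SuspiciousTests['Filename'])
assert all(f in Flagged for f in Known_Bad), 'A known-bad test was not flagged'
assert not any(f in Flagged for f in Known_Good), 'A known-good test was flagged'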

Next Steps

The articles thus far have guided you through structuring datasets for automation, opening and analyzing files automatically, and verifying datasets for errors. The next focus will be on storing these results in a manner that naturally supports the development of regressions. This will be covered in the upcoming article.
