A discussion on data validation

Validation is a critical consideration when considering sourced data, that data over which the system designer does not have complete control.  Validation is also an essential part of interpersonal relationships, but that is out of scope for this article…  When looking at sourced data (and interpersonal relationships, actually), we can see that data can be made up of historical data, integration-related data, and\or user-entered data.  This data is called out because it was created by systems that are not necessarily making the same assumptions about data as your system.  Since there are different assumptions, there is a chance that this provided data will not work within your system.  This is where validation comes in; a way of examining data at run-time to ensure “correctness.”  This may be as simple as the length of a field to something more complicated and businessy, such as a verified email address or a post-office approved address.  This examination for “correctness” is validation. 

In the grand scheme of things, validation is simply the detection, measurement, correction, and prevention of defects caused directly, or indirectly, by weak quality data.  There are three different types of validation, internal validation, external validation, and process validation.  Internal validation methods are embedded in the production system at design time to ensure the internal validity of data by eliminating any inconsistencies. External validation, on the other hand, makes use of knowledge about the real world. This use of the real world means that, when looking at a data model, external validity of that model is determined by comparing it against the real world.  Finally, process validation focuses on actual data processing operations. This focus on operations means that all data defects are synonymous with process defects, and thus, correction of process defects effectively leads to the prevention of future data defects. Efforts around process correction are based upon the fact that once a piece of incomplete data is in a database, it will be challenging to fix.

lotsa words needs a picture of ducks!

Validation and correctness are somewhat vague words, so let us add some context around them.  The easiest way to do that would be to talk about the different types of validation that may need to be done.  The most common forms of validation are range and constraint checking, cross-reference checking, and consistency checking.

  • Range and constraint check – this approach is where the validation logic examines data for consistency with a minimum/maximum range, such as spatial coordinates that have a real boundary, or consistency with a test for evaluating a sequence of characters, such evaluating whether a US phone number has ten digits or a US postal code has either 5 or 9 numeric values.  A presence check, where a value is validated based on the expectation that it is populated, is a unique example of a range and constraint check.
  • Cross-reference check – this type of check is when a value is validated by determining if it exists elsewhere.  This is the kind of validation where a user-entered postal code would be checked against a database of postal codes do determine validity.  This is different from the constraint check, which was simply checking whether the characters in the entered string were appropriate and not making any determination as to whether those values are real.  Another example of a cross-reference check would be a uniqueness test, such as evaluating whether or not an email address has already been used within the system.  You can depend upon cross-reference checks being more expensive than are range and constraint checks.
  • Consistency check – where the system determines whether the data entered is consistent.  An example of this would be checking to see if the date shipped on an order is before the date the order was placed.

The last thing for consideration as part of validation is post-validation actions, the steps to be taken after data validation.  The easiest decision is when validation is successful, as the business process can continue.  The more significant decision is what to do when validation fails.  Three different actions can happen in the event of a validation failure, an enforcement action, and advisory action, or a verification action.  I guess there is always the option of taking no action at all, but that kind of makes the performed work to perform the validation of a big ole’ waste of time…

  • Enforcement action – When taking an enforcement action, the expectation is that data that vailed validation will be put into a correct state.  There are typically two different ways that could happen, either the user corrects the information based upon feedback from the validator, such as by providing a missing value OR the system automatically fixes the data, such as fixing the capitalization in the LastName field when entered with AllCaps.
  • Advisory action – this approach is when the data is either accepted or rejected by the system, with an advisory is issued about the failed validation.  This approach tends to be taken when there is no user involvement and thus no ability to perform an enforcement action and could be represented as an item being loaded with a status that indicates the validation problem.
  • Verification action – This type of approach could be considered a select type of advisory action.  Also, by design, it is not a strict validation approach.  An example of this approach is when a user is asked to verify that the entered data is what they wanted to enter.  Sometimes there be a suggestion to the contrary.  This approach is relatively common in those sites that validate addresses, for instance.  In that case, the address validation may indicate that the street name is incorrectly spelled, so you end up giving the user the option of accepting the recommendation and replacing the value with the corrected value or keeping their version.

Validation is an important consideration when building any application that interacts with data.  There are various types of validation, such as internal validation, external validation, and process validation.  There are several different types of checking that could be performed for the validation, including range and constraint check, cross-reference check, and consistency check.  Lastly, there are three major approaches for managing an error: and enforcement action, and advisory action, and a verification action.

The next posts in the Validation series will show how to implement differing validation strategies.  Stay tuned!