Defining a Data Strategy

Trustworthy Product Validation

Continuous delivery relies on the organization’s ability to rapidly test new product versions in such a way that if the tests pass, the new version is deemed ready for release to customers. For that to be possible, the tests must be a true indicator of market readiness, which requires that the conditions of the test faithfully represent real-world use of the product - including the test data.

See Data For Testing.

Developers Need to Understand the Organization’s Information

To design and build products, developers need to understand the environment in which the product will operate. Products that consume and produce data operate within a data environment, so developers need an awareness and understanding of that data.

Today, the data environment takes several forms:

  • The APIs of services. These APIs receive data as input and emit data as output. A service’s API is usually defined by a schema or specification; one common specification format is OpenAPI.

  • Databases maintained by a service. Today it is common for service databases to be schema-less - meaning that there is no specification for the content of the database.

  • Data that is sent to a data lake. This data is often received as a “document” in a format such as JSON. (A sketch of such a document, with its fields described, follows this list.)
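
As an illustration only - the event, field names, and descriptions below are hypothetical - the following Python sketch shows a JSON “document” as it might arrive in a data lake, together with a lightweight, human-maintained description of each field. This is the kind of documentation that, when absent, forces analysts to reverse engineer the meaning of the data.

    import json

    # A hypothetical "order placed" event, as it might land in a data lake.
    raw_document = json.loads("""
    {
      "order_id": "A-1001",
      "customer_id": "C-42",
      "placed_at": "2023-04-01T12:30:00Z",
      "total": "19.99",
      "currency": "USD"
    }
    """)

    # A minimal, human-maintained description of each field. When something
    # like this is missing, analysts must reverse engineer what the data means.
    field_descriptions = {
        "order_id": "Unique identifier assigned by the ordering service",
        "customer_id": "Identifier of the customer in the CRM system",
        "placed_at": "UTC timestamp (ISO 8601) when the order was submitted",
        "total": "Order total as a decimal string, in the given currency",
        "currency": "ISO 4217 currency code",
    }

    # A simple check that the document matches its documented fields.
    undocumented = set(raw_document) - set(field_descriptions)
    missing = set(field_descriptions) - set(raw_document)
    print("Undocumented fields:", undocumented or "none")
    print("Missing fields:", missing or "none")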

Unfortunately, the dominant Agile frameworks say little about data, and so it is common for developers to begin work on a system with little knowledge of the organization’s data. This has resulted in several unintended outcomes:

  1. Service APIs are poorly understood, and fields in different APIs might sound similar but mean different things.

  2. Programmers who work on a service they did not design often must wade through the code to reverse engineer it and figure out the structure of its database. That is error-prone and wasteful.

  3. The data in data lakes ends up unusable for business intelligence and machine learning, because analysts and machine learning experts are unable to decipher it or match it up across applications.

The remedy for these outcomes is threefold:

  • Restore the role of the data architect - a role that has lapsed in many organizations since the introduction of Agile frameworks.

  • Train the data architect to work in an agile manner, so that they can, in effect, “maintain the airplane while it is flying”. Traditional data architecture methods are top-down: they need to become just-in-time.

  • Establish behavioral norms that elevate awareness of the importance of data, and the importance of documenting data.

See Information Modeling and Machine Learning and Business Intelligence.

Managing Sensitive Data

Data privacy carries legal implications that are very difficult to manage at a purely technical level; expert help is required.

Make sure that you include analysts with experience in data privacy law who can help devise ways to manage data so that it can be made available for testing while still protecting privacy. There are techniques for doing this, including masking and surrogating, but the issues are complex: if these techniques are not applied properly, it can still be possible to “de-anonymize” the data.
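
As a minimal sketch only - the field names are hypothetical, and an HMAC-based surrogate is just one possible technique, not a complete anonymization scheme - the Python below illustrates masking and surrogating: direct identifiers are replaced with stable surrogates so the test data stays internally consistent, a quasi-identifier is coarsened, and a sensitive value is removed. Whether the result is actually safe to use still requires expert review, because combinations of the remaining fields can re-identify individuals.

    import hashlib
    import hmac

    # Secret key managed outside the test environment; keyed hashing yields
    # surrogates that are consistent across records but not reversible.
    SURROGATE_KEY = b"replace-with-a-managed-secret"

    def surrogate(value: str, prefix: str) -> str:
        """Replace an identifier with a stable, non-reversible surrogate."""
        digest = hmac.new(SURROGATE_KEY, value.encode(), hashlib.sha256).hexdigest()
        return f"{prefix}-{digest[:12]}"

    def mask_record(record: dict) -> dict:
        """Mask one (illustrative) customer record for use as test data."""
        return {
            "customer_id": surrogate(record["customer_id"], "cust"),
            "email": surrogate(record["email"], "user") + "@example.test",
            "name": "REDACTED",
            # Coarsen a quasi-identifier: keep only the birth year.
            "birth_year": record["birth_date"][:4],
            # Fields that are not identifying can pass through unchanged.
            "plan": record["plan"],
        }

    print(mask_record({
        "customer_id": "C-42",
        "email": "jane.doe@example.com",
        "name": "Jane Doe",
        "birth_date": "1985-07-14",
        "plan": "premium",
    }))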

Managing Data Biases

Data biases likewise carry legal implications that are very difficult to manage at a purely technical level; expert help is required.

Even if real-world data reflects biases that exist in current reality, using that data can run afoul of laws that are intended to counteract social biases. For example, data might indicate that members of a particular minority have a higher recorded rate of criminal activity; but the cause might be societal rather than innate, and so building decisions or models on that data has the effect of cementing in that societal pattern.

That is why there are laws, with more appearing all the time, that restrict the use of data containing embedded biases, even if those biases reflect reality. Make sure that you include analysts with experience in the applicable laws governing how machine learning and other methods may be used, which kinds of biases must be removed, and how to remove them. This is an area that requires deep expertise.
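
As an illustration only - the data, group labels, and threshold below are assumptions, and no single metric substitutes for legal and statistical expertise - the Python sketch shows one simple check such analysts might start from: comparing the rate of favorable outcomes across groups, sometimes expressed as a disparate impact ratio.

    # Hypothetical model decisions: (group, received_favorable_outcome)
    decisions = [
        ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
        ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
    ]

    def favorable_rate(group: str) -> float:
        """Fraction of favorable outcomes for one group."""
        outcomes = [ok for g, ok in decisions if g == group]
        return sum(outcomes) / len(outcomes)

    rate_a = favorable_rate("group_a")
    rate_b = favorable_rate("group_b")

    # Disparate impact ratio: the less-favored group's rate divided by the
    # more-favored group's rate. A common (jurisdiction-dependent) rule of
    # thumb flags ratios below 0.8 for further review.
    ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    print(f"group_a={rate_a:.2f}  group_b={rate_b:.2f}  ratio={ratio:.2f}")
    if ratio < 0.8:
        print("Potential disparate impact; needs legal and expert review.")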