At Mphasis Datalytyx we are obsessed with DataOps. We’ve spent a lot of time thinking and talking about all things DataOps, contributing to community projects such as #TrueDataOps and the DataOps Platform for Snowflake. As a result, we are often asked what DataOps actually means and how it differs from DevOps. This is an absolutely valid question, but it is perhaps not the right question to ask, as it tends to pit DevOps and DataOps against each other.

The better question is: What issues does DataOps address that haven’t been catered for by traditional DevOps?

Before addressing the differences, let’s look at the similarities.

Philosophically speaking, DevOps is no real departure from Agile, and DataOps has a similar relationship to DevOps: philosophically, it is no real departure either. If you zoom out, they are all trying to achieve the same thing: delivering high-quality products or services early and continuously improving them through frequent iteration, all whilst keeping a laser sight on the customer.

Whilst Agile was born out of the software development industry, it has become a domain-independent philosophy that is not explicit about how to address all of the real-world issues that software development has encountered over the years. DevOps therefore draws on the Agile philosophy and creates a domain-specific set of tools and practices that address those issues. DataOps, in turn, draws on the Agile philosophy and the tools and practices of DevOps, adding to or modifying them whenever an issue that is especially relevant to the Data & Analytics domain is not addressed by traditional DevOps. There are many such issues, but let’s talk about the main five.

5 Issues in the Domain of Data & Analytics

1. The Reliance on Realistic Test Data

What does this look like in the world of Data & Analytics?

In the field of Software Development, it is entirely possible that a new feature can be developed, tested and deployed without the notion of “Test Data” ever being a part of the conversation*. This is not the case in the world of Data & Analytics.

If we generalise the components in a data platform, we can identify two key functions: data pipelines and data storage. A fully comprehensive, static and sufficiently large data set is needed to develop and test any given data pipeline or storage mechanism. At the risk of explaining three (fairly) self-descriptive terms, let me qualify that statement by defining ‘fully comprehensive’, ‘static’ and ‘sufficiently large’.

In the context of a data pipeline or storage mechanism, a test data set can be thought of as fully comprehensive if it contains a range of attributes or properties that allow for every element of functionality in the pipeline or storage mechanism to be tested. A trivial example would be a database table that needs to store some combination of strings, dates and integers; a test data set for this functionality could not possibly be fully comprehensive if it did not include a mix of strings, dates and integers (in the appropriate structure and format, of course). Finally, a test data set is sufficiently large if its volume allows the capacity and performance of the data platform to be tested meaningfully.

A static data set is simply a data set that does not change over time. This is a crucial property for test data as it is important that an unexpected test result can be confidently attributed to the functionality exhibited by the data pipeline or storage mechanism and not due to a change in the input data.
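These two properties can be sketched in code. Here is a minimal, illustrative example (all names and values are invented, not taken from any real platform) of generating a test data set that is comprehensive for the trivial table above (a mix of strings, dates and integers) and static, because a fixed random seed makes regeneration produce identical rows every time:

```python
import random

def build_test_rows(n=100, seed=42):
    """Generate a deterministic test data set with mixed column types."""
    rng = random.Random(seed)  # fixed seed => same rows on every run (static)
    names = ["alice", "bob", "carol", "dave"]
    rows = []
    for i in range(n):
        rows.append({
            "id": i,                                                          # integer
            "name": rng.choice(names),                                        # string
            "created": f"2023-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",  # date string
            "amount": rng.randint(0, 10_000),                                 # integer
        })
    return rows

# Static: two independent generations yield exactly the same data,
# so an unexpected test result cannot be blamed on changing input.
assert build_test_rows() == build_test_rows()
```

In practice the same idea scales up with a data-generation tool, but the principle is identical: the test data is reproducible by construction.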

Does DevOps Provide us with a Solution to this Issue?

Even in data teams that have embraced the DevOps way of life, sourcing suitable test data is still a manual and time-consuming process. There is a certain set of practices and tools that is omnipresent in any organisation that has adopted DevOps, yet among them you will not find an elegant solution to this issue.

Some data sources have dev or test instances, which can be helpful, but this by no means guarantees that you will have fully comprehensive or static test data; in fact, the most common scenario is that there is no test data available at all. Therefore, we can’t look to source systems to provide test data sets, as there is no standardisation or robustness.

Often the next place organisations look is production data, and this is a reasonable approach so long as robust Data Protection & Governance measures are taken to protect PII and other sensitive data. However, this is often still a manual process, susceptible to human error and inefficiency.
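To make this concrete, here is an illustrative sketch (not a complete Data Protection solution, and the column names and salt are assumptions) of deriving test data from production rows by pseudonymising PII columns with a keyed, deterministic hash. Determinism matters: the same input always maps to the same pseudonym, so joins between tables still work in the masked test data:

```python
import hashlib

# Hypothetical PII columns and secret; a real deployment would pull the
# salt from a secrets manager, never hard-code it.
PII_COLUMNS = {"email", "full_name"}
SALT = "replace-with-a-secret-from-your-vault"

def mask_value(value):
    """Deterministically pseudonymise a value with a salted SHA-256 hash."""
    digest = hashlib.sha256((SALT + str(value)).encode()).hexdigest()
    return digest[:12]  # short, stable pseudonym

def mask_row(row):
    """Mask only the PII columns; leave everything else untouched."""
    return {k: (mask_value(v) if k in PII_COLUMNS else v) for k, v in row.items()}

prod_row = {"id": 1, "email": "jane@example.com", "amount": 250}
test_row = mask_row(prod_row)
assert test_row["email"] != prod_row["email"]     # PII is masked
assert mask_row(prod_row) == test_row             # deterministic: joins survive
```

Automating this step is exactly the kind of repetitive, error-prone work that a platform should do by default rather than leaving to humans.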

This is one of the reasons that organisations must look to TrueDataOps. A TrueDataOps implementation is built upon the 7 Pillars of DataOps and will allow for the easy creation and management of suitable test data sets in a robust and compliant manner – no mean feat.

2. Lack of Control over Data Sources

What does this look like in the world of Data & Analytics?

It was hinted at in the previous section that data teams are often at the mercy of the teams that manage operational source systems. In an ideal world, any upstream changes to data sources would be communicated to the data team and changes would be incorporated and tested ahead of time – we do not live in an ideal world.

Does DevOps Provide us with a Solution to this Issue?

Again, this is not a scenario that is explicitly catered for in DevOps. There is, of course, the notion of dependencies, but these sit within the realm of programming rather than data, and they are all within the internal control of any given development team.

Data platforms must be designed robustly and defensively so that they can absorb upstream changes and, where possible, continue to deliver functionality unperturbed. The spirit of ELT, implemented with appropriate technologies, will deal with these kinds of issues without breaking a sweat. Another crucial issue dealt with by TrueDataOps.
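The defensive pattern described above can be sketched in a few lines. This is an illustrative example only (the field names are hypothetical): land the raw record untouched, then extract just the fields downstream models need, tolerating columns that appear or disappear upstream without warning:

```python
# Fields the downstream models depend on, with safe defaults.
EXPECTED_FIELDS = {"order_id": None, "customer_id": None, "total": 0.0}

def extract_known_fields(raw_record):
    """Pull only expected fields from a raw record.

    Unknown upstream columns are ignored; missing ones fall back to a
    default instead of breaking the pipeline.
    """
    return {field: raw_record.get(field, default)
            for field, default in EXPECTED_FIELDS.items()}

# Upstream added "channel" and dropped "total"; the pipeline absorbs both.
upstream = {"order_id": 7, "customer_id": 42, "channel": "web"}
assert extract_known_fields(upstream) == {"order_id": 7, "customer_id": 42, "total": 0.0}
```

Because the raw record is retained in full, newly arrived columns are not lost; they can be promoted into the model later, on the data team's schedule rather than the source team's.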

3. The Need to be Compliant with Regulation

What does this look like in the world of Data & Analytics?

There is an ever-increasing need for organisations to comply with strict regulations that protect the privacy of individuals and ensure the secure storage and handling of other sensitive data. Not only is it morally and ethically important to ensure that sensitive data is handled with care, but failure to comply can also come with a huge financial penalty. Traditionally it has been the role of the data governance function to both define policies and to ensure that they are being implemented and adhered to.

There can be friction between the data governance function and the data analysis team when a data source or set of data sources contains sensitive data that needs to be handled with care. The data analysis team insist they merely need a subset of the data or an aggregate in order to provide useful insights, but the data governance team is not convinced that data lineage and provenance can be maintained, or perhaps they have concerns over the granularity of the access controls that can be implemented.

All of this tends to slow down the rate of delivery and often ends up with data being buried in source systems rather than being ingested into the analytics platform. This is an overly cautious and costly technique for remaining compliant and there is a better way.

Does DevOps Provide us with a Solution to this Issue?

Security and compliance are, of course, key areas within DevOps. However, data protection, auditability, data provenance and lineage, and attribute-level access controls have not been addressed explicitly in any of the common tools or practices.

When we embrace TrueDataOps, the combination of design patterns and technologies implemented can provide Governance by Design, meaning that the platform itself enforces the data governance policies, freeing the people in data governance from a job that is repetitive and not well suited to humans.
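The essence of Governance by Design is that access policies become data that the platform enforces automatically, rather than rules a governance team checks by hand. Here is a deliberately tiny sketch of attribute-level access control (the roles, columns and redaction behaviour are all invented for illustration):

```python
# Policies declared as data: which roles may see which columns.
# Columns absent from this mapping are unrestricted.
COLUMN_POLICIES = {
    "salary": {"hr_analyst"},
    "email": {"hr_analyst", "support"},
}

def apply_policies(row, role):
    """Return the row with any column the role may not see redacted."""
    return {col: (val if role in COLUMN_POLICIES.get(col, {role}) else "***REDACTED***")
            for col, val in row.items()}

row = {"employee_id": 9, "salary": 55_000, "email": "x@example.com"}
assert apply_policies(row, "support")["salary"] == "***REDACTED***"
assert apply_policies(row, "hr_analyst")["salary"] == 55_000
```

In a real platform this enforcement lives in the database or access layer (for example, as masking policies), but the principle is the same: the policy is declared once and applied everywhere, with no manual gatekeeping in the delivery path.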

4. Complex Data Transformations

What does this look like in the world of Data & Analytics?

There is no way of getting around it, sometimes, in order to produce an elegant and insightful output from a large and varied data set you need to perform some complicated joins and data transformations to get there.

The need to perform complex data transformations is not an issue in its own right (this is why data engineers get out of bed in the morning!). The issue lies in the design and implementation of those transformations, which have, in the past, manifested themselves as thousand-line stored procedures or unwieldy god-class data pipelines that are hard to read, complicated to understand and completely unmaintainable.

Does DevOps Provide us with a Solution to this Issue?

The shape of this problem is very familiar to all developers: don’t write large and complex blocks of code when you can abstract, simplify and de-couple functionality. There is a host of principles and best practices out there to guide developers towards writing better code.

If we zoom out to the level of a product or service and apply the same principles, we get the concept of microservices, one of the common practices you will see in the world of DevOps.

Despite the generic problem being well understood, and largely addressed within software development, it is still very common to find these gargantuan blocks of complex logic stitched together in the realm of data engineering.

I think there are many potential reasons why data engineering is lagging behind software development in this regard, but whatever those reasons are, the outcome is the same: data engineering needs an equivalent toolbox of technologies, methodologies and frameworks, of the kind used by software developers, to bring some rigour, standardisation and elegance to how data transformations are designed and built.

This example is a fantastic illustration of the relationship between DevOps and TrueDataOps: it is simply a matter of taking the set of philosophies and best practices that have been battle-hardened over many years in software development and applying them to a different domain.

5. The Expectation for Rapid Changes

What does this look like in the world of Data & Analytics?

End users want an improved product or service, and they want it yesterday. The expectation of rapid additions or improvements to a data platform is becoming more common. This can be difficult in a landscape of large, varied and complex data sets that all need to be of sufficient quality and compliant with regulation.

A number of issues have already been referenced in this article that should give a reasonable idea of why rapid turnaround times have been difficult to achieve in data & analytics.

Does DevOps Provide us with a Solution to this Issue?

Ironically, rather than DevOps solving this particular issue, it has, in a way, created it. The widespread adoption of DevOps has meant that it is now the norm across a variety of applications to see new or improved functionality delivered rapidly, to a higher quality, with ever-decreasing turnaround times.

TrueDataOps solutions combine the philosophies, frameworks and tools previously mentioned to enable true CI/CD and automated testing and monitoring capabilities for Data & Analytics – something that has been a serious challenge in the past.
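What automated testing looks like for data can be sketched very simply. In this illustrative example (the checks and column names are invented), simple data quality assertions run against every new batch in CI and fail the pipeline before bad data reaches users:

```python
def check_not_null(rows, column):
    """Every record must have a value for the column."""
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    """No two records may share a value for the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def run_checks(rows):
    """Run all data quality checks; return a list of failure messages."""
    failures = []
    if not check_not_null(rows, "order_id"):
        failures.append("order_id contains nulls")
    if not check_unique(rows, "order_id"):
        failures.append("order_id is not unique")
    return failures

batch = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}]
assert run_checks(batch) == ["order_id is not unique"]
```

In a CI/CD pipeline, a non-empty failure list would block the deployment, giving data teams the same fast, automated safety net that software teams take for granted.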

Conclusion

The title of this article asks the question: DataOps is just DevOps, right? What I have attempted to explain is that the answer is a resounding “it depends”. Feeling deeply unsatisfied now? You need not be.

It all comes down to the lens through which you are analysing the problem. If we abstract and generalise enough, then there really is nothing new in DataOps that is not addressed in the philosophy and spirit of DevOps. But recall that the same can be said of DevOps in comparison to Agile. If we decide to zoom in and look at some of the characteristics of the Data & Analytics domain, we can see that there are some challenges that have escaped the attention of DevOps (and Agile, of course). This is through no fault of DevOps; its principles were developed within a domain that does not need to solve the same problems as the Data & Analytics domain.

This is why I believe DataOps is a useful term: the technology landscape is vast and complicated, and we need to be able to navigate that space effectively. Speaking about DataOps as opposed to DevOps gives focus and context, instantly framing a problem or discussion around the intended domain: Data & Analytics.

*There are cases where test data is needed in software / application development but it tends to be at the event / transactional level and therefore small and well defined.