
Data Engineering Trends for 2023

Yaniv Ben Hemo
Published: February 6, 2023

We distilled 160 conversations with tech leaders across enterprises, startups, NASDAQ companies, and legacy organizations, from team leaders to CTOs and VPs of R&D. Here is their top-five list of data engineering trends that are likely to take shape in 2023.

1. Data Contracts

Data contracts are API-based agreements between the software engineers who own services and the data consumers who understand how the business works, with the goal of generating well-modeled, high-quality, trusted data. Look closely at a typical organization and you will find at least ten different data producers and multiple consumers, written in different languages and interacting with various databases (SQL and NoSQL) and, the holy grail, data models. It’s a mess. Data contracts are still mostly a management or operational concept, but we are starting to see more and more traction and conversation around them (Chad Sanderson covers the subject in depth in his newsletter).

The end goals of data contracts are to:

  • Increase the quality of produced data. 
  • Make maintenance easier.
  • Apply governance and standardization over a federated data platform.
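
To make the idea concrete, here is a minimal sketch of what a contract check could look like in Python. It is only an illustration under stated assumptions: the `order_created` event, its field names, and the `validate` helper are hypothetical and do not come from any specific data contract tool.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Field:
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an "order_created" event; the field names are
# illustrative and not taken from any real service.
ORDER_CREATED_V1 = (
    Field("order_id", str),
    Field("user_id", str),
    Field("amount_cents", int),
    Field("coupon_code", str, required=False),
)


def validate(record: dict[str, Any], contract: tuple[Field, ...]) -> list[str]:
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    for field in contract:
        if field.name not in record:
            if field.required:
                errors.append(f"missing required field: {field.name}")
            continue
        value = record[field.name]
        if not isinstance(value, field.type_):
            errors.append(
                f"{field.name}: expected {field.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about are reported as well.
    known = {f.name for f in contract}
    for name in record:
        if name not in known:
            errors.append(f"unexpected field: {name}")
    return errors
```

A producer could run a check like this before publishing, so that a breaking change is caught on the producer's side rather than discovered later by the consumers.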


2. A New Role — Data Reliability Engineer (DRE)

One of the most common challenges leaders raised is how to narrow the technological gap between the different data stakeholders: engineers, analysts, BI developers, and scientists.


This gap is not only the source of over-complicated architectures but also a significant cost generator. BI developers, analysts, and scientists each have their own stack, with dedicated languages like SQL and R. Beyond the technical differences, each group has different interests and works in something of a bubble, unlike teams that form one unit with one clear goal, such as the familiar triangle of IT, DevOps, and developers.

Given the growing complexity of data and the increasing in-house investment in making data more cost-effective, more accessible, and a real growth engine, a new position must be filled. Just as the SRE (Site Reliability Engineer) narrowed the gap between developers and DevOps engineers, so will the DRE, by bringing a Swiss Army knife of capabilities: business understanding and requirements, data structures and SQL, theoretical concepts in ML and AI, and finally the ability to build straight-to-the-point pipelines that gather the data the other layers need.

3. Streaming and Real-Time

Data is growing too fast to process as a whole. That’s a simple truth.
These days we can find highly intelligent and efficient algorithms that produce output in milliseconds, yet each pull that brings the data in takes minutes or even hours. If the entire process were refactored to generate results per single event or in small batches, the output would arrive in a reasonable amount of time, not hours.
This is one example of many. Not all use cases can run in real time, and refactoring is hard, so the mindset should be real-time from the get-go.
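
The difference between pulling everything and processing per event can be sketched in a few lines of Python. This is a simplified illustration, not a benchmark: the event shape (`user_id`, `amount_cents`) and the class name `RunningRevenue` are assumptions made for the example.

```python
from collections import defaultdict


class RunningRevenue:
    """Keeps per-user totals up to date one event at a time."""

    def __init__(self) -> None:
        self.totals: dict[str, int] = defaultdict(int)

    def on_event(self, event: dict) -> int:
        # Constant work per event: earlier events are never re-read.
        self.totals[event["user_id"]] += event["amount_cents"]
        return self.totals[event["user_id"]]


def batch_totals(events: list[dict]) -> dict[str, int]:
    """Batch equivalent: re-scan the full history to get the same totals."""
    totals: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["user_id"]] += event["amount_cents"]
    return dict(totals)


events = [
    {"user_id": "u1", "amount_cents": 500},
    {"user_id": "u2", "amount_cents": 250},
    {"user_id": "u1", "amount_cents": 100},
]

stream = RunningRevenue()
for e in events:
    stream.on_event(e)  # a fresh answer is available after every event
```

Both paths arrive at the same totals. The streaming version simply has an up-to-date answer after each event instead of only after the next full pull.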

4. Tracking Stream Lineage

To enable the growth of streaming and increase its usability, the barriers to troubleshooting must be lowered. For example, most respondents said they use a message broker to enable real-time pipelines, and for them the broker is a black box: something comes in, something comes out, and by the end of the pipeline some events are dropped and some get ingested. In addition, failures follow no clear pattern, and debugging requires assembling different teams and engineers, which keeps architects from going deeper into real-time. To overcome these barriers, engineers need better, context-based observability that can display the full evolution of a single event, all the way from the first producer (a stage in a pipeline) to the very last consumer. Several products and projects have started to address this challenge, including with embedded event’s journey, Confluent, OpenLineage, Monte Carlo, and more.
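
One way to picture an event's journey is an envelope that collects a hop record at every stage it passes through. The Python sketch below is a toy model under that assumption. The `Event` and `Hop` types and the stage names are hypothetical, and it does not reflect how any of the products mentioned above implement lineage.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Hop:
    stage: str
    action: str  # e.g. "produced", "processed", "consumed", or "dropped"
    at: float    # wall-clock timestamp of the hop


@dataclass
class Event:
    payload: dict
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    trail: list[Hop] = field(default_factory=list)

    def record(self, stage: str, action: str) -> "Event":
        # Each stage appends its own hop, so the trail travels with the event.
        self.trail.append(Hop(stage, action, time.time()))
        return self

    def journey(self) -> str:
        return " -> ".join(f"{hop.stage}:{hop.action}" for hop in self.trail)


event = Event({"user_id": "u1"}).record("checkout-service", "produced")
event.record("enricher", "processed")
event.record("warehouse-loader", "dropped")
print(event.journey())
# checkout-service:produced -> enricher:processed -> warehouse-loader:dropped
```

With a trail like this, a dropped event can be traced to the stage that dropped it without pulling several teams into the investigation.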

5. Event Sourcing Is Coming Back

How do you structure or highlight a user’s journey?
Let’s take, for example, the journey of some users in an eCommerce store.

  1. They entered the store.
  2. They searched.
  3. They found something / They didn’t.
  4. They purchased something within two minutes of entering / They arrived at the checkout and left.

You could describe this journey in ten different SQL tables or in one NoSQL document, run some joins and aggregations, and finally take action long after the user has left your store. There is a better approach to both storage and action, and it’s called event sourcing. In simple terms, there is a queue, and you push every event a given user generates into that queue instead of into a database. So far, pretty straightforward; the payoff is that we can perform real-time actions derived from the user’s behavior pattern while they are still in the store.
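
The store journey above can be sketched as an append-only log that reacts as each event arrives. This is a minimal in-memory illustration in Python, not a production design: the event kinds, the two reactions (`fast_buyer`, `abandoned_checkout`), and the `EventLog` class are assumptions made for the example, and a real system would sit on top of a broker or queue.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserEvent:
    user_id: str
    kind: str   # "entered", "searched", "checkout", "purchased", or "left"
    at: float   # seconds since the start of the session clock


class EventLog:
    """Append-only log; state is derived by reading the events back."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[UserEvent]] = defaultdict(list)

    def append(self, event: UserEvent) -> Optional[str]:
        # Events are only ever appended, never updated or deleted.
        self._by_user[event.user_id].append(event)
        return self._react(event.user_id)

    def _react(self, user_id: str) -> Optional[str]:
        events = self._by_user[user_id]
        kinds = [e.kind for e in events]
        last = events[-1]
        if last.kind == "purchased":
            entered = next((e for e in events if e.kind == "entered"), None)
            # Purchase within two minutes (120 seconds) of entering.
            if entered is not None and last.at - entered.at <= 120:
                return "fast_buyer"
        if last.kind == "left" and "checkout" in kinds and "purchased" not in kinds:
            return "abandoned_checkout"
        return None

    def history(self, user_id: str) -> list[str]:
        return [e.kind for e in self._by_user.get(user_id, [])]


log = EventLog()
log.append(UserEvent("u1", "entered", 0.0))
log.append(UserEvent("u1", "checkout", 200.0))
action = log.append(UserEvent("u1", "left", 230.0))  # "abandoned_checkout"
```

Because the log keeps every event, new reactions can be added later and replayed over the same history without changing how events are stored.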

To conclude, even though it seems we hear it all the time, data keeps growing, arriving from multiple sources in different shapes and sizes. As a result, mastering the streams and lakes can benefit any organization by reducing costs, increasing sales, becoming more efficient, and, most importantly, understanding the customer on the other side.