Big Data: An Evolutionary Perspective on Data Warehouse Architecture

By Avanmag

Building an enterprise data system that meets millisecond transaction response times while integrating data quickly enough for near real-time analysis is a significant challenge. Many companies struggle with multiple data silos across platforms that include transactional database systems, data warehousing, Big Data systems, NoSQL, in-memory stores, and message buses, often carrying years of technical debt that complicates integration and performance.

Understanding how this complexity evolved helps in finding a way forward. In the early 1990s, when I worked as an intern on a database architecture team at a large car factory in Brazil, the state-of-the-art technology was an IBM mainframe running IMS and DB2 databases. The mainframe provided a mature and consistent data management system with structured models, physical schemas, metadata, and standardization. However, reporting capabilities were severely limited. Generating month-end reports required physically transporting stacks of printed reports to analysts, who manually entered the data into Excel for further analysis.

With the rise of client-server computing, Oracle databases presented a new alternative to mainframes. While earlier data architecture principles still applied, the biggest challenge became integrating data across different platforms. In the early days, common obstacles included physical design, ETL architecture, network latency, unstable storage systems, and database servers that were just beginning to incorporate parallel processing features. The emergence of Data Warehousing helped consolidate marketing and financial data, bringing structured information together into a unified system.

Despite references to data mining and unstructured data in Data Warehousing literature, no viable technology outside of mainframes could process massive amounts of data effectively. Even as relational enterprise data warehouses (EDW) matured, they were not designed to handle unstructured datasets. As data growth accelerated, the limitations of EDW became apparent, particularly with the rise of e-commerce and the explosion of web traffic, logs, and social network data. A new massive parallel processing paradigm was necessary to keep up with evolving data needs.
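The massive parallel processing paradigm that answered this need is easiest to see in the map-reduce model that Hadoop popularized: work is split across independent workers (map), then partial results are combined (reduce). Below is a minimal, self-contained Python sketch of that idea; the log lines and status-code counting are hypothetical stand-ins for the web traffic and log data mentioned above, not a real Hadoop job.

```python
from collections import defaultdict
from multiprocessing import Pool

# Hypothetical web-server log lines standing in for large-scale traffic data.
LOG_LINES = [
    "GET /home 200",
    "GET /cart 500",
    "POST /checkout 200",
    "GET /home 200",
]

def map_phase(line):
    # Map step: each worker independently emits a (key, 1) pair per record.
    status = line.split()[-1]
    return (status, 1)

def reduce_phase(pairs):
    # Reduce step: sum the counts per key, as a reducer would after the shuffle.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

if __name__ == "__main__":
    # The map phase is embarrassingly parallel: here two local processes
    # stand in for a shared-nothing cluster of nodes.
    with Pool(2) as pool:
        mapped = pool.map(map_phase, LOG_LINES)
    print(reduce_phase(mapped))  # counts per HTTP status code
```

The point of the pattern is that no map worker needs another worker's data, which is exactly what lets the approach scale horizontally where a single EDW server could not.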

The introduction of Hadoop provided a breakthrough for data architects, addressing gaps in unstructured data management, scientific processing, and data mining. As Hadoop enabled powerful new capabilities, many came to view Data Warehousing as obsolete. However, harnessing the power of unstructured data requires integrating and modeling it alongside core transactional data to provide context. A successful data system must combine both EDW and Hadoop while challenging outdated beliefs about enterprise data architecture.

The traditional approach assumed that all data needed to reside in a single, monolithic EDW server, that analytical data duplication was inherently negative, and that EDW was strictly a downstream system. However, these assumptions no longer hold true. Instead, the future of data architecture involves shifting from the traditional “warehouse” model to a flexible, distributed “store” model, where data is rationalized and stored in an integrated platform dedicated to real-time streaming, batch processing, and core EDW metrics.

In the evolving Hadoop ecosystem, solutions like HBase, Impala, and Drill indicate a trend toward performing traditional data warehouse functions on a cheaper, open-source, shared-nothing architecture. However, full maturity has yet to be reached. A balanced approach integrates storage and access patterns while maintaining data governance, quality, and a clear source of record. Expanding all layers of the architecture allows for a flexible, real-time, and democratized data platform without compromising control.

Achieving this vision involves merging the Operational Data Store (ODS) with the core EDW layer to create a lower-latency repository for source records and core metrics, ensuring distribution for various analytical uses. Lower latency is further enhanced by integrating real-time data streaming and analytics engines like Storm with Hadoop into the data integration layer, enabling data processing closer to the source while maintaining performance and service level agreements.
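The serving pattern behind this merged ODS/EDW layer can be sketched in a few lines: a periodically recomputed batch value (e.g. from a Hadoop job) is combined with increments arriving on a real-time stream (e.g. from a Storm topology). This is a simplified illustration of the pattern, assuming a single additive metric; the class and method names are hypothetical, not a Storm or Hadoop API.

```python
from dataclasses import dataclass

@dataclass
class MetricServer:
    """Serves a core metric by combining a batch-layer value,
    recomputed periodically, with increments from a real-time stream."""
    batch_total: float = 0.0
    stream_delta: float = 0.0

    def on_batch_recompute(self, total: float) -> None:
        # A fresh batch run absorbs everything seen so far,
        # so the real-time increment resets to zero.
        self.batch_total = total
        self.stream_delta = 0.0

    def on_stream_event(self, value: float) -> None:
        # Low-latency path: apply each event as it arrives.
        self.stream_delta += value

    def current(self) -> float:
        # Queries see batch accuracy plus real-time freshness.
        return self.batch_total + self.stream_delta

server = MetricServer()
server.on_batch_recompute(1000.0)  # e.g. a nightly batch job
server.on_stream_event(5.0)        # e.g. events from a streaming engine
server.on_stream_event(7.5)
print(server.current())            # 1012.5
```

The design choice here is that the stream layer only has to be correct between batch runs, which keeps the low-latency path simple while the batch layer remains the accuracy backstop.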

The outdated EDW assumptions must be redefined within this new architecture. Instead of a monolithic system, data is rationalized across an integrated platform supporting real-time streaming, batch processing, and core EDW metrics. Instead of avoiding data replication, it is strategically distributed across relational databases and Hadoop while preserving the concept of a single source of truth. Rather than treating EDW as a downstream system, it evolves into an enterprise data store that functions as the authoritative source for core datasets and metrics, blending EDW and ODS concepts into a unified integration environment.
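Strategic replication with a single source of truth implies that each copy of a dataset is tracked and exactly one platform is designated authoritative. A minimal sketch of such a governance registry, with entirely hypothetical dataset and platform names, might look like this:

```python
# Hypothetical registry: each core dataset is replicated to several
# platforms, but exactly one is designated the source of record.
DATASETS = {
    "orders": {
        "source_of_record": "edw",
        "replicas": ["hadoop", "nosql_cache"],
    },
    "clickstream": {
        "source_of_record": "hadoop",
        "replicas": ["edw_aggregates"],
    },
}

def authoritative_store(dataset: str) -> str:
    """Resolve where a core metric must be read from when
    consistency matters more than latency."""
    entry = DATASETS.get(dataset)
    if entry is None:
        raise KeyError(f"{dataset!r} is not a governed dataset")
    return entry["source_of_record"]

def all_locations(dataset: str) -> list:
    """Every platform holding a copy: replication is deliberate
    and tracked, not accidental duplication."""
    entry = DATASETS[dataset]
    return [entry["source_of_record"], *entry["replicas"]]

print(authoritative_store("orders"))  # edw
print(all_locations("clickstream"))   # ['hadoop', 'edw_aggregates']
```

Making the replica list explicit is what turns "data duplication is bad" into "data duplication is managed": every copy has a known owner and a known authoritative parent.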

Leveraging the strengths of data warehousing, Big Data technologies, and cloud computing principles enables the creation of a scalable, service-oriented data platform. This approach allows data and insights to be delivered as a service, ensuring that organizations can keep pace with the demands of a rapidly evolving data landscape.
