Key Component Areas
A complete data warehouse architecture includes data and technical elements. Thornthwaite breaks down the architecture into three broad areas. The first, data architecture, is centered on business processes. The next area, infrastructure, includes hardware, networking, operating systems, and desktop machines. Finally, the technical area encompasses the decision-making technologies that will be needed by the users, as well as their supporting structures. These areas are detailed in the sub-sections below.
As stated above, the data architecture portion of the overall data warehouse architecture is driven by business processes. For example, in a manufacturing environment the data model might include orders, shipping, and billing. Each area draws on a different set of dimensions. But where dimensions intersect in the data model the definitions have to be the same—the same customer who buys is the same that builds. So data items should have a common structure and content, and involve a single process to create and maintain.
Thornthwaite says that organizations often ask how data should be represented in the warehouse—entity/relationship or dimensional? “If you have a star schema1 then use dimensional. Is your detail normalized2 or dimensional? Will users be querying detail? Then use dimensional.” He adds that most data warehousing experts are in substantial agreement; the [data] sources are typically entity/relationship models and the front end is a dimensional model. The only issue is where you draw the line between the warehouse itself and the data staging area.
As you work through the architecture and present data to your users, tool choices will be made, but many choices will disappear as the requirements are set. For example, he explains that product capabilities are beginning to merge, like MOLAP and ROLAP. “MOLAP is okay if you stay within the cube you’ve built. It’s fast and allows for flexible querying—within the confines of the cube.” Its weaknesses are size (overall and within a dimension), design constraints (limited by the cube structure), and the need for a proprietary data base.
With the required hardware platform and boxes, sometimes the data warehouse becomes its own IS shop. Indeed, there are lots of “boxes” in data warehousing, mostly used for data bases and application servers.
The issues with hardware and DBMS choices are size, scalability, and flexibility. In about 80 percent of data warehousing projects this isn’t difficult; most businesses can get enough power to handle their needs.
In terms of the network, check the data sources, the warehouse staging area, and everything in between to ensure there’s enough bandwidth to move data around. On the desktop, run the tools and actually get some data through them to determine if there’s enough power for retrieval. Sometimes the problem is simply with the machine, and the desktops must be powerful enough to run current-generation access tools. Also, don’t forget to implement a software distribution mechanism.
The technical architecture is driven by the meta data catalog. “Everything should be meta data-driven,” says Thornthwaite. “The services should draw the needed parameters from tables, rather than hard-coding them.” An important component of technical architecture is the data staging process, which covers five major areas:
- Extract - data comes from multiple sources and is of multiple types. Data compression and encryption handling must be considered at this area, if it applies.
- Transform - data transformation includes surrogate key management, integration, de-normalization, cleansing, conversion, aggregation, and auditing.
- Load - loading is often done to multiple targets, with load optimization and support for the entire load cycle.
- Security - administrator access and data encryption policies.
- Job control - this includes job definition, job scheduling (time and event), monitoring, logging, exception handling, error handling, and notification.