Knowledge Modeling Methods For Knowledge Warehouse | by Mariusz Kujawski

11 min learn

Jun 19, 2023

Photograph by Zdeněk Macháček on Unsplash

Knowledge modeling is a course of of making a conceptual illustration of the info and its relationships inside a corporation or system. Dimensional modeling is a complicated method that makes an attempt to current information in a approach that’s intuitive and comprehensible for any consumer. It additionally permits for high-performance entry, flexibility, and scalability to accommodate modifications in enterprise wants.

On this article, I’ll present an in-depth overview of information modeling, with a particular concentrate on Kimball’s methodology. Moreover, I’ll introduce different methods used to current information in a user-friendly and intuitive method. One significantly attention-grabbing method for contemporary information warehouses is storing information in a single vast desk, though this strategy is probably not appropriate for all question engines. I’ll current methods that can be utilized in Knowledge Warehouses, Knowledge Lakes, Knowledge Lakehouses, and so forth. Nonetheless, it is very important select the suitable methodology on your particular use case and question engine.

Each dimensional mannequin consists of a number of tables with a multipart key, known as the very fact desk, together with a set of tables generally known as dimension tables. Every dimension desk has a main key that exactly corresponds to one of many elements of the multipart key within the reality desk. This distinct construction is usually known as a star schema. In some circumstances, a extra intricate construction known as a snowflake schema can be utilized, the place dimension tables are linked to smaller dimension tables

Dimensional modeling offers a sensible and environment friendly strategy to organizing and analyzing information, ensuing within the following advantages:

  • Simplicity and understandability for enterprise customers.
  • Improved question efficiency for sooner information retrieval.
  • Flexibility and scalability to adapt to altering enterprise wants.
  • Ensured information consistency and integration throughout a number of sources.
  • Enhanced consumer adoption and self-service analytics.

Now that we now have mentioned what dimensional modeling is and the worth it brings to organizations, let’s discover learn how to successfully leverage it.

Whereas I intend to primarily concentrate on Kimball’s methodology, let’s briefly contact upon a number of different common methods earlier than diving into it.

Inmon suggests using a normalized information mannequin inside the information warehouse. This technique helps the creation of information marts. These information marts are smaller, specialised subsets of the info warehouse that cater to particular enterprise areas or consumer teams. These are designed to supply a extra tailor-made and environment friendly information entry expertise for explicit enterprise capabilities or departments.

Knowledge Vault is a modeling methodology that focuses on scalability, flexibility, and traceability. It consists of three core elements: the Hub, the Hyperlink, and the Satellite tv for pc.

Hubs

Hubs are collections of all distinct entities. For instance, an account hub would come with account, account_ID, load_date, and src_name. This enables us to trace the place the file initially got here from when it was loaded, and if we want a surrogate key generated from the enterprise key.

Hyperlinks

Hyperlinks set up relationships between hubs and seize the associations between totally different entities. They comprise the overseas keys of the associated hubs, enabling the creation of many-to-many relationships.

Satellites

Satellites retailer the descriptive details about the hubs, offering extra context and attributes. They embody historic information, audit data, and different related attributes related to a particular time limit.

Knowledge Vault’s design permits for a versatile and scalable information warehouse structure. It promotes information traceability, auditability, and historic monitoring. This makes it appropriate for eventualities the place information integration and agility are vital, comparable to in extremely regulated industries or quickly altering enterprise environments.

OBT shops information in a single vast desk. Utilizing one huge desk, or a denormalized desk, can simplify queries, enhance efficiency, and streamline information evaluation. It eliminates the necessity for complicated joins, eases information integration, and could be useful in sure eventualities. Nonetheless, it could result in redundancy, information integrity challenges, and elevated upkeep complexity. Take into account the precise necessities earlier than choosing a single giant desk.

Picture by creator
Picture by creator
WITH transactions AS (
SELECT 1000001 AS order_id, TIMESTAMP('2017-12-18 15:02:00') AS order_time,
STRUCT(65401 AS id, 'John Doe' AS title, 'Norway' AS location) AS buyer,
[
STRUCT('xxx123456' AS sku, 3 AS quantity, 1.3 AS price),
STRUCT('xxx535522' AS sku, 6 AS quantity, 500.4 AS price),
STRUCT('xxx762222' AS sku, 4 AS quantity, 123.6 AS price)
] AS orders
UNION ALL
SELECT 1000002, TIMESTAMP('2017-12-16 11:34:00'),
STRUCT(74682, 'Jane Smith', 'Poland') AS buyer,
[
STRUCT('xxx635354', 4, 345.7),
STRUCT('xxx828822', 2, 9.5)
] AS orders
)

choose *

from
transactions

Within the case of 1 vast desk we don’t want to hitch tables. We will use just one desk to combination information and make analyzes. This technique improves efficiency in BigQuery.

Picture by creator
choose buyer.title, sum(a.amount)

from
transactions t, UNNEST(t.orders) as a
group by buyer.title

The Kimball methodology locations vital emphasis on the creation of a centralized information repository generally known as the info warehouse. This information warehouse serves as a singular supply of reality, integrating and storing information from varied operational methods in a constant and structured method.

This strategy gives a complete set of tips and finest practices for designing, creating, and implementing information warehouse methods. It locations a powerful emphasis on creating dimensional information fashions and prioritizes simplicity, flexibility, and ease of use. Now, let’s delve into the important thing ideas and elements of the Kimball methodology.

Entity mannequin to dimensional mannequin

In our information warehouses, the sources of information are sometimes present in entity fashions which can be normalized into a number of tables, which comprise the enterprise logic for functions. In such a situation, it may be difficult as one wants to know the dependencies between tables and the underlying enterprise logic. Creating an analytical report or producing statistics usually requires becoming a member of a number of tables.

Picture by creator

To create a dimensional mannequin, the info must endure an Extract, Remodel, and Load (ETL) course of to denormalize it right into a star schema or snowflake schema. The important thing exercise on this course of includes figuring out the very fact and dimension tables and defining the granularity. The granularity determines the extent of element saved within the reality desk. For instance, transactions could be aggregated per hour or day.

Picture by creator

Let’s assume we now have an organization that sells bikes and bike equipment. On this case, we now have details about:

  • Transactions
  • Shops
  • Purchasers
  • Merchandise

Based mostly on our enterprise information, we all know that we have to acquire details about gross sales quantity, amount over time, and segmented by areas, clients, and merchandise. With this data, we are able to design our information mannequin. The transactions’ desk will function our reality desk, and the shops, purchasers, and merchandise tables will act as dimensional tables.

Reality desk

A reality desk sometimes represents a enterprise occasion or transaction and contains the metrics or measures related to that occasion. These metrics can embody varied information factors comparable to gross sales quantities, portions bought, buyer interactions, web site clicks, or another measurable information that provides insights into enterprise efficiency. The very fact desk additionally contains overseas key columns that set up relationships with dimension tables.

Picture by creator

One of the best follow within the reality desk design is to place all overseas keys on the highest of the desk after which measure.

Reality Tables Sorts

  1. Transaction Reality Tables provides a grain at its lowest degree as one row represents a file from the transaction system. Knowledge is refreshed each day or in actual time.
  2. Periodic Snapshot Reality Tables seize a snapshot of a reality desk at a time limit, like as an example the top of month.
  3. Accumulating Snapshot Reality Desk summarizes the measurement occasions occurring at predictable steps between the start and the top of a course of.
  4. Factless Reality Desk retains details about occasions occurring with none masseurs or metrics.

Dimension desk

A dimension desk is a sort of desk in dimensional modeling that accommodates descriptive attributes like as an example details about merchandise, its class, and sort. Dimension tables present the context and perspective to the quantitative information saved within the reality desk.

Dimension tables comprise a singular key that identifies every file within the desk, named the surrogate key. The desk can comprise a enterprise key that may be a key from a supply system. A superb follow is to generate a surrogate key as an alternative of utilizing a enterprise key.

There are a number of approaches to making a surrogate key:

  • -Hashing: a surrogate key could be generated utilizing a hash operate like MD5, SHA256(e.g. md5(key_1, key_2, key_3) ).
  • -Incrementing: a surrogate key that’s generated by utilizing a quantity that’s all the time incrementing (e.g. row_number(), identification).
  • -Concatenating: a surrogate key that’s generated by concatenating the distinctive key columns (e.g. concat(key_1, key_2, key_3) ).
  • -Distinctive generated: a surrogate key that’s generated by utilizing a operate that generates a singular identifier (e.g. GENERATE_UUID())

The strategy that you’ll select depends upon the engine that you simply use to course of and retailer information. It will probably impression efficiency of querying information.

Dimensional tables usually comprise hierarchies.

a) For instance, the parent-child hierarchy can be utilized to signify the connection between an worker and their supervisor.

Picture by creator

b) Hierarchical relationships between attributes. For instance, a time dimension might need attributes like yr, quarter, month, and day, forming a hierarchical construction.

Picture by creator

Kinds of dimension tables

Conformed Dimension:

A conformed dimension is a dimension that can be utilized by a number of reality tables. For instance, a area desk could be utilized by totally different reality tables.

Degenerate Dimension:

A degenerate dimension happens when an attribute is saved within the reality desk as an alternative of a dimension desk. As an example, the transaction quantity could be present in a reality desk.

Junk Dimension:

This one accommodates non-meaningful attributes that don’t match effectively in present dimension tables, or are mixtures of flags and binary values representing varied mixtures of states.

Position-Taking part in Dimension:

The identical dimension key contains a couple of overseas key within the reality desk. For instance, a date dimension can confer with totally different dates in a reality desk, comparable to creation date, order date, and supply date.

Static Dimension:

A static dimension is a dimension that sometimes by no means modifications. It may be loaded from reference information with out requiring updates. An instance might be a listing of branches in an organization.

Bridge Desk:

Bridge tables are used when there are one-to-many relationships between a reality desk and a dimension desk.

Slowly altering dimension

A Slowly Altering Dimension (SCD) is an idea in dimensional modeling. It handles modifications to dimension attributes over time in dimension tables. SCD offers a mechanism for sustaining historic and present information inside a dimension desk as enterprise entities evolve and their attributes change. There are six kinds of SCD, however the three hottest ones are:

  • SCD Sort 0: On this kind, solely new information are imported into dimension tables with none updates.
  • SCD Sort 1: On this kind, new information are imported into dimension tables, and present information are up to date.
  • SCD Sort 2: On this kind, new information are imported, and new information with new values are created for modified attributes.

For instance, when John Smith strikes to a different metropolis, we use SCD Sort 2 to maintain details about transactions associated to London. On this case, we create a brand new file and replace the earlier one. Because of this, historic reviews will retain data that his purchases have been made in London.

Picture by creator
Picture by creator
MERGE INTO consumer AS tgt
USING (
SELECT
Client_id,
Title,
Surname,
Metropolis
GETDATE() AS ValidFrom
‘20199-01-01’ AS ValidTo
from client_stg
) AS src
ON (tgt.Clinet_id = src.Clinet_id AND tgt.iscurrent = 1)
WHEN MATCHED THEN
UPDATE SET tgt.iscurrent = 0, ValidTo = GETDATE()
WHEN NOT MATCHED THEN
INSERT (Client_id, title, Surname, Metropolis, ValidFrom, ValidTo, iscurrent)
VALUES (Client_id, title, Surname, Metropolis, ValidFrom, ValidTo,1);

That is how SCD 3 seems after we maintain new and former values in separate columns.

Star schema vs. snowflake schema

The most well-liked strategy to designing a knowledge warehouse is to make the most of both a star schema or a snowflake schema. The star schema has reality tables and dimensional tables which can be in relation to the very fact desk. In a star schema, there are reality tables and dimensional tables which can be immediately associated to the very fact desk. However, a snowflake schema consists of a reality desk, dimension tables associated to the very fact desk, and extra dimensions associated to these dimension tables.

Picture by creator

The primary variations between these two designs lie of their normalization strategy. The star schema retains information denormalized, whereas the snowflake schema ensures normalization. The star schema is designed for higher question efficiency. The snowflake schema is particularly tailor-made to deal with updates on giant dimensions. When you encounter challenges with updates to in depth dimension tables, take into account transitioning to a snowflake schema.

Knowledge loading methods

In our information warehouse, information lake, and information lake home we are able to have varied load methods like:

Full Load: The complete load technique includes loading all information from supply methods into the info warehouse. This technique is often used within the case of efficiency points or lack of columns that would inform about row modification.

Incremental Load: The incremental load technique includes loading solely new information for the reason that final information load. If rows within the supply system can’t be modified, we are able to load solely new information based mostly on a singular identifier or creation date. We have to outline a “watermark” that we’ll use to pick out new rows.

Delta Load: The delta load technique focuses on loading solely the modified and new information for the reason that final load. It differs from incremental load in that it particularly targets the delta modifications somewhat than all information. Delta load methods could be environment friendly when coping with excessive volumes of information modifications and considerably scale back the processing time and assets required.

The most typical technique to load information is to populate dimension tables after which reality tables. The order right here is necessary as a result of we have to use main keys from dimension tables the truth is tables to create relationships between tables. There may be an exception. When we have to load a reality desk earlier than a dimension desk, this method title is late arriving dimensions.

On this method, we are able to create surrogate keys in a dimension desk, and replace it by ETL course of after populating the very fact desk.

Abstract

After a radical studying of the article, in case you have any questions or want to additional focus on information modeling and efficient dimensional fashions, be at liberty to achieve out to me on LinkedIn. Implementing information modeling can unlock the potential of your information, offering worthwhile insights for knowledgeable decision-making whereas gaining information in strategies and finest practices.

Leave a Reply