This article is authored by Michael Olschimke, co-founder and CEO at Scalefree International GmbH, and co-authored by Tim Kirschke, Senior BI Consultant at Scalefree. The technical review was done by Ian Clarke and Naveed Hussain, GBBs (Cloud Scale Analytics) for EMEA at Microsoft.

Introduction

In the previous articles of this series, we created a Raw Data Vault to store our raw data. In addition to capturing and integrating the data, we applied examples of soft business rules inside the Business Vault. This article focuses on using the data from the combined Data Vault model (that is, both the Raw Data Vault and the Business Vault) and transforming it into valuable information for business users.

Information Delivery

The Raw Data Vault and Business Vault capture the raw data from the source systems and the results of the business logic required by the business users. One could argue that the job is done at that point. In reality, however, end users typically don't want to work with Raw Data Vault or Business Vault entities directly. Valid reasons include a lack of knowledge of Data Vault modeling and, hence, uncertainty about how to query a Data Vault implementation. In addition, most end users are already familiar with other data consumption methods, typically dimensional models such as star or snowflake schemas, or fully denormalized flat-and-wide tables. In this article, we discuss fact entities and slowly changing dimensions (SCD) of types 1 and 2. Most information delivery tools, such as dashboarding tools like Microsoft Power BI or SQL Server Analysis Services for producing OLAP cubes, are also easy to use with such models.

How to Deliver Information with Data Vault 2.0

Regardless of the desired delivery format, information can be queried directly out of the Raw Data Vault entities. The Data Vault model follows an optimized schema-on-read design: the raw data is stored as-is, and transformations such as business logic and structural changes are applied at query time. The only structural change on the way in is that the incoming source data is broken down into its fundamental components: business keys, relationships, and descriptive data. This decomposition is the optimization of the storage, and it simplifies both the application of business rules and the transformation into any desired target information schema.

Business Vault entities are used during information delivery to apply business rules. In most cases, the raw data is insufficient for reporting: it contains erroneous data, some data is missing, or values need to be converted, for example from one currency to another. However, some of the raw data is good enough for reporting as-is. Therefore, in many cases, information models, such as a dimensional model, are derived from both the Raw Data Vault and the Business Vault by joining the required entities.

Information delivery requirements typically include a historization requirement. A Slowly Changing Dimension (SCD) Type 1 only includes the current state of the descriptive attributes, while an SCD Type 2 contains their full history. Data Vault follows a multi-temporal approach and leverages multiple timelines to implement such solutions:

The load date timestamp is the technical timeline that indicates when data arrived at the data platform. This timeline must be defined (and controlled) by the data platform team.

The snapshot timestamp indicates when information should be delivered to the end user. This timeline is regular (e.g., every morning at 8 a.m.) and is defined by the business user.
Business timelines come from the source data and indicate when something happened in the business. Examples include birth dates, valid-from and valid-to dates, change dates, and deletion dates.

Separating these timelines makes it much more straightforward to build multi-temporal solutions in which some data is back-dated or post-dated. However, that is beyond the scope of this article.

Implementation Walk Through

To fulfill the business requirements, let's start as simply as possible. For various reasons, it is highly recommended to implement information marts as SQL views initially and to use physical tables only if performance or processing times and costs require it. Helper structures such as PIT and bridge tables typically make a virtualized solution sufficient. We follow this recommendation in this article and start with a dimension view and a fact view.

Store Dimension

Many dimension entities are derived from a hub and its satellites. If no business rules need to be applied, the dimension can be derived directly from the Raw Data Vault entities. For example, the following CREATE VIEW statement implements an SCD Type 1 store dimension:

CREATE VIEW InformationMarts.DIM_STORE_SCD1 AS
SELECT
    hub.hk_store_hub AS StoreKey,
    hub.store_id AS StoreID,
    sat.address_street AS AddressStreet,
    sat.postal_code AS PostalCode,
    sat.country AS Country
FROM DV.store_hub hub
LEFT JOIN DV.store_address_crm_lroc_sat sat
    ON hub.hk_store_hub = sat.hk_store_hub
WHERE sat.is_current = 1

This simple query accesses the store_hub and joins it to the store_address satellite. It selects the business key from the hub because typical business users want to include it in the dimension, and it renames all descriptive attributes from the satellite to make them more readable. The hash key is added as StoreKey for efficient joins from fact entities. Finally, a WHERE clause leverages the is_current flag of the satellite to include only the latest descriptive data. This flag is calculated in a view on top of the actual satellite table; thus, the view is joined, not the table. Only this specific WHERE clause makes the dimension an SCD Type 1. Removing it would automatically turn it into an SCD Type 2; in that case, it would make sense to additionally include the load_date and load_end_date columns of the satellite view.

Transaction Fact

The following CREATE VIEW statement implements a fact entity. In this simple example, no aggregations are defined: the granularity of the derived fact entity matches the underlying data of the non-historized link. Therefore, the fact view can be derived directly from the non-historized link without a grain shift, i.e., without a GROUP BY clause:

CREATE VIEW InformationMarts.FACT_STORE_TRANSACTIONS AS
SELECT
    nl.transaction_id AS TransactionID,
    s_hub.hk_store_hub AS StoreKey,
    c_hub.hk_customer_hub AS CustomerKey,
    nl.transaction_date AS TransactionDate,
    nl.amount AS Amount
FROM DV.store_transaction_nlnk nl
LEFT JOIN DV.store_hub s_hub
    ON nl.hk_store_hub = s_hub.hk_store_hub
LEFT JOIN DV.customer_hub c_hub
    ON nl.hk_customer_hub = c_hub.hk_customer_hub

This query selects from the non-historized link and joins both hubs via their hash keys. From the hubs, the hash keys are taken as dimension references; from the non-historized link, the relevant transaction details are selected. A filter for historization is not required because both hubs and non-historized links only capture non-changing data. Capturing changing facts, which in theory should never happen but does happen in reality, is also possible with non-historized links, but that is beyond the scope of this article.
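With the dimension and the fact views in place, the information mart can be consumed like any other star schema. The following query is only a minimal sketch of such a report query, assuming the views defined above; it counts transactions and sums the transaction amount per country:

SELECT
    dim.Country,
    COUNT(*) AS TransactionCount,
    SUM(f.Amount) AS TotalAmount
FROM InformationMarts.FACT_STORE_TRANSACTIONS f
INNER JOIN InformationMarts.DIM_STORE_SCD1 dim
    ON f.StoreKey = dim.StoreKey
GROUP BY dim.Country

Reporting tools generate comparable SQL when business users drag measures and dimension attributes onto a report, which is exactly why the dimensional model is delivered on top of the Data Vault.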
Pre-Calculated Aggregations

In most business environments, BI developers would now connect their reporting tool of choice to the provided dimensional model to create custom reports. It is common to aggregate data, especially fact data, to calculate sums, counts, averages, or other aggregated values. Depending on the data volume, the reporting tool, and the complexity of the aggregation, this can become a challenge for business users. To simplify usage and, in some cases, optimize query performance, pre-aggregating in the dimensional layer might be the best choice.

For example, the following CREATE VIEW statements implement additional store transaction fact views that already include the requested aggregations. Since aggregations are always based on a GROUP BY clause, the views implement two grain shifts to calculate the number and amount of transactions per store and per customer:

CREATE VIEW InformationMarts.FACT_AGG_STORE_TRANSACTIONS AS
SELECT
    s_hub.hk_store_hub AS StoreKey,
    COUNT(nl.transaction_id) AS TransactionCount,
    SUM(nl.amount) AS TotalAmount,
    AVG(nl.amount) AS AverageAmount
FROM DV.store_transaction_nlnk nl
LEFT JOIN DV.store_hub s_hub
    ON nl.hk_store_hub = s_hub.hk_store_hub
GROUP BY s_hub.hk_store_hub
GO

CREATE VIEW InformationMarts.FACT_AGG_CUSTOMER_TRANSACTIONS AS
SELECT
    c_hub.hk_customer_hub AS CustomerKey,
    COUNT(nl.transaction_id) AS TransactionCount,
    SUM(nl.amount) AS TotalAmount,
    AVG(nl.amount) AS AverageAmount
FROM DV.store_transaction_nlnk nl
LEFT JOIN DV.customer_hub c_hub
    ON nl.hk_customer_hub = c_hub.hk_customer_hub
GROUP BY c_hub.hk_customer_hub

In both queries, only one hub is required. The hash key of that hub is used in the GROUP BY clause, and three basic aggregations are applied to determine the number of transactions and to calculate the sum and average transaction amount. While this reduces the workload on the business user side, the implementation might still be slow or produce high processing costs. In that case, it makes sense to materialize the aggregated fact entity or to introduce a bridge table.

A bridge table is similar to a pre-aggregated fact table in a dimensional model. However, it is much more customizable, as it only implements the grain shift operation (in this case, the GROUP BY clause), the measure calculations, and the timelines. It also contains the hub references, which are turned into dimension references, as seen in the previous examples. The definition of the bridge table is provided in the following statement:

CREATE TABLE [DV].[CUSTOMER_TRANSACTIONS_BB] (
    SnapshotDate DATETIME2(7) NOT NULL,
    CustomerKey CHAR(32) NOT NULL,
    TransactionCount BIGINT NOT NULL,
    TotalAmount MONEY NOT NULL,
    AverageAmount MONEY NOT NULL
);

The code to load the bridge table is similar to the fact view:

INSERT INTO [DV].[CUSTOMER_TRANSACTIONS_BB]
SELECT
    SYSDATETIME() AS SnapshotDate,
    nl.hk_customer_hub AS CustomerKey,
    COUNT(nl.transaction_id) AS TransactionCount,
    SUM(nl.amount) AS TotalAmount,
    AVG(nl.amount) AS AverageAmount
FROM DV.store_transaction_nlnk nl
GROUP BY nl.hk_customer_hub;

In many other cases, the bridge table also contains complex business calculations. Still, the focus is on the grain shift operation, which can take a considerable amount of time on many traditional database systems due to their row-based storage. Microsoft Fabric, however, uses a storage format that is optimized for aggregations, typically at the price of joins. The bridge table aims to improve the query performance of fact entities; in turn, that means it is acceptable to pre-join other data into the bridge table if the join performance is insufficient. A common requirement is the addition of a time dimension.
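Once the bridge table is loaded, a thin fact view can expose it to the information mart. The following view is a minimal sketch, assuming the bridge table defined above; the view name is illustrative, and the view simply delivers the most recent snapshot:

CREATE VIEW InformationMarts.FACT_AGG_CUSTOMER_TRANSACTIONS_BRIDGE AS
SELECT
    br.CustomerKey,
    br.TransactionCount,
    br.TotalAmount,
    br.AverageAmount
FROM DV.CUSTOMER_TRANSACTIONS_BB br
WHERE br.SnapshotDate = (SELECT MAX(SnapshotDate) FROM DV.CUSTOMER_TRANSACTIONS_BB)

Because the expensive grain shift has already been materialized in the bridge table, this view only filters and projects, so it can stay virtual.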
Snapshot-Based Information Delivery

So far, the store dimension presented in this article has been an SCD Type 1 dimension, that is, a dimension without history. In many cases, however, the business wants to relate facts to the version of the dimension member that was valid at the time the fact occurred. For example, an order was placed before the customer relocated to another state. In a Type 1 scenario, the order's revenue would be associated with the customer's current state. Depending on the information requirements, this might not be correct: the revenue should instead be associated with the state the customer lived in at the time of the transaction. This information requirement demands an SCD Type 2 dimension with history. Point-in-time (PIT) tables are recommended to produce such dimensions efficiently, and this section discusses the necessary steps to create one.

A good starting point is a date table. It serves as a reference table for dates and can be used both to produce a date dimension and to populate the PIT table. The following statements create the table and initialize it with one row per day for 10,000 days, starting on January 1, 2020:

CREATE SCHEMA CONTROL;
GO

CREATE TABLE CONTROL.Ref_Date_v0 (
    snapshot_datetime datetime2(6),
    snapshot_date date,
    year int,
    month int,
    quarter int,
    week int,
    day int,
    day_of_year int,
    week_day int,
    beginning_of_year bit,
    beginning_of_quarter bit,
    beginning_of_month bit,
    beginning_of_week bit,
    end_of_year bit,
    end_of_quarter bit,
    end_of_month bit,
    end_of_week bit
);

WITH date_base AS (
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
),
date_basic AS (
    SELECT TOP (10000)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM date_base ones, date_base tens, date_base hundreds, date_base thousands
    ORDER BY 1
),
snapshot_base AS (
    SELECT
        CAST(DATEADD(DAY, rn - 1, '2020-01-01 07:00:00') AS datetime2) AS snapshot_datetime,
        CAST(DATEADD(DAY, rn - 1, '2020-01-01') AS date) AS snapshot_date
    FROM date_basic
),
snapshot_extended AS (
    SELECT
        snapshot_datetime,
        snapshot_date,
        DATEPART(YEAR, snapshot_date) AS year,
        DATEPART(MONTH, snapshot_date) AS month,
        DATEPART(QUARTER, snapshot_date) AS quarter,
        DATEPART(WEEK, snapshot_date) AS week,
        DATEPART(DAY, snapshot_date) AS day,
        DATEPART(DAYOFYEAR, snapshot_date) AS day_of_year,
        DATEPART(WEEKDAY, snapshot_date) AS week_day
    FROM snapshot_base
)
INSERT INTO CONTROL.Ref_Date_v0
SELECT *,
    CASE WHEN day_of_year = 1 THEN 1 ELSE 0 END AS beginning_of_year,
    CASE WHEN day = 1 AND month IN (1, 4, 7, 10) THEN 1 ELSE 0 END AS beginning_of_quarter,
    CASE WHEN day = 1 THEN 1 ELSE 0 END AS beginning_of_month,
    CASE WHEN week_day = 2 THEN 1 ELSE 0 END AS beginning_of_week,
    CASE WHEN snapshot_date = EOMONTH(snapshot_date) AND month = 12 THEN 1 ELSE 0 END AS end_of_year,
    CASE WHEN snapshot_date = EOMONTH(snapshot_date) AND month IN (3, 6, 9, 12) THEN 1 ELSE 0 END AS end_of_quarter,
    CASE WHEN snapshot_date = EOMONTH(snapshot_date) THEN 1 ELSE 0 END AS end_of_month,
    CASE WHEN week_day = 1 THEN 1 ELSE 0 END AS end_of_week
FROM snapshot_extended
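Before walking through this statement in detail, note that the same reference table can also back a simple date dimension in the information mart. The following view is only a sketch under that assumption; the view name DIM_DATE is illustrative:

CREATE VIEW InformationMarts.DIM_DATE AS
SELECT
    d.snapshot_date AS DateKey,
    d.year AS Year,
    d.quarter AS Quarter,
    d.month AS Month,
    d.week AS Week,
    d.day AS Day,
    d.end_of_month AS IsEndOfMonth
FROM CONTROL.Ref_Date_v0 d

Fact entities can reference this date dimension via their date columns, and the same reference table drives the snapshot timestamps of the PIT table described next.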
Looking at the statement in detail, the first part is a simple DDL that creates the reference date table. It is followed by an INSERT statement that leverages multiple Common Table Expressions (CTEs) to simplify the logic. The first CTE, date_base, simply generates the digits 0 to 9. The next CTE, date_basic, CROSS JOINs four copies of this CTE, creating 10 * 10 * 10 * 10 = 10,000 rows, and a ROW_NUMBER() turns them into an ascending number ranging from 1 to 10,000. The CTE snapshot_base uses this ascending number in a DATEADD() on top of a specified start date, '2020-01-01 07:00:00', to generate a list of daily dates, once with data type datetime2 and once with data type date. The last CTE, snapshot_extended, adds attributes such as month, year, and so on. Finally, boolean columns that mark the beginning and end of weeks, months, quarters, and years are added, and everything is inserted into the reference date table.

This reference date table can now be used to create and load a point-in-time (PIT) table. The PIT table precalculates, for each snapshot date timestamp (SDTS), which satellite entry is valid for each business key. The granularity of a PIT is therefore (number of snapshots) * (number of business keys) = number of rows in the PIT. The following code creates and populates a simple PIT example for stores:

CREATE TABLE DV.STORE_BP (
    hk_d_store CHAR(32) NOT NULL,
    hk_store_hub CHAR(32) NOT NULL,
    snapshot_datetime datetime2(6) NOT NULL,
    hk_store_address_crm_lroc_sat CHAR(32) NULL,
    load_datetime_store_address_crm_lroc_sat datetime2(6) NULL
);

WITH pit_entries AS (
    SELECT
        CONVERT(CHAR(32), HASHBYTES('MD5', CONCAT(hub.hk_store_hub, '||', date.snapshot_datetime)), 2) AS hk_d_store,
        hub.hk_store_hub,
        date.snapshot_datetime,
        COALESCE(sat1.hk_store_hub, '00000000000000000000000000000000') AS hk_store_address_crm_lroc_sat,
        COALESCE(sat1.load_datetime, CONVERT(DATETIME2, '1900-01-01T00:00:00', 126)) AS load_datetime_store_address_crm_lroc_sat
    FROM DV.store_hub hub
    INNER JOIN CONTROL.Ref_Date_v0 date
        ON hub.load_datetime <= date.snapshot_datetime
    LEFT JOIN DV.store_address_crm_lroc_sat sat1
        ON hub.hk_store_hub = sat1.hk_store_hub
        AND date.snapshot_datetime BETWEEN sat1.load_datetime AND sat1.load_end_datetime
)
INSERT INTO DV.STORE_BP
SELECT *
FROM pit_entries new
WHERE NOT EXISTS (SELECT 1 FROM DV.STORE_BP pit WHERE pit.hk_d_store = new.hk_d_store)

The single CTE pit_entries defines the whole set of PIT entries. The store hub is joined against the snapshot table, restricted to snapshots after the hub entry appeared, to reduce the number of rows. Since there is no more specific join condition, the result of this join is already a multiple of the number of rows in the hub. Next, the only satellite attached to the store hub, store_address_crm_lroc_sat, is joined. It is joined on the hash key, and additionally the load_datetime and load_end_datetime are leveraged with a BETWEEN predicate to determine the record that is valid for a specific SDTS.

The SELECT list of this CTE introduces a new concept: a dimensional key, hk_d_store, generated by hashing the store hub hash key together with the SDTS. This creates a unique column that can be used for the primary key constraint and for incremental loads. Additionally, both components of this dimensional key, hk_store_hub and snapshot_datetime, are selected. The hash key and load_datetime of the satellite are also selected to uniquely identify one row of the satellite. They are renamed to include the satellite's name, which helps when joining multiple satellites instead of just one: a typical PIT brings together all satellites connected to a specific hub and therefore contains one pair of hash key and load_datetime columns per satellite. Ultimately, only rows whose dimensional key does not already exist in the target PIT are inserted; this additional clause enables incremental loading.
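To illustrate the repeating column pairs, the following DDL sketches how the PIT could look if the store hub had a second satellite. The satellite store_details_crm_lroc_sat and the table name are hypothetical and only serve as an illustration:

CREATE TABLE DV.STORE_PIT_TWO_SATS (
    hk_d_store CHAR(32) NOT NULL,
    hk_store_hub CHAR(32) NOT NULL,
    snapshot_datetime datetime2(6) NOT NULL,
    -- one pair of columns per satellite attached to the hub
    hk_store_address_crm_lroc_sat CHAR(32) NULL,
    load_datetime_store_address_crm_lroc_sat datetime2(6) NULL,
    hk_store_details_crm_lroc_sat CHAR(32) NULL,
    load_datetime_store_details_crm_lroc_sat datetime2(6) NULL
);

The loading logic would simply add one more LEFT JOIN per satellite, following the same pattern as above.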
The PIT table DV.STORE_BP can now serve as the starting point for a snapshot-based store dimension. To produce a historized (SCD Type 2) store dimension, the PIT is joined with the hub and the satellite:

CREATE VIEW InformationMarts.DIM_STORE_SB AS
SELECT
    pit.snapshot_datetime AS SnapshotDatetime,
    hub.store_id AS StoreID,
    sat.address_street AS AddressStreet,
    sat.postal_code AS PostalCode,
    sat.country AS Country
FROM DV.STORE_BP pit
INNER JOIN DV.store_hub hub
    ON hub.hk_store_hub = pit.hk_store_hub
INNER JOIN DV.store_address_crm_lroc_sat sat
    ON pit.hk_store_address_crm_lroc_sat = sat.hk_store_hub
    AND pit.load_datetime_store_address_crm_lroc_sat = sat.load_datetime

With the entire history precalculated in the PIT, the actual dimension can be virtual again, because the only operations required are INNER JOINs. Additional information and patterns regarding PIT and bridge tables can be found on the Scalefree blog.

Conclusion

Data Vault has been designed to integrate data from multiple data sources, creatively destruct the data into its fundamental components, and store and organize it so that any target structure can be derived quickly. This article focused on generating information models, often dimensional models, using virtual entities. They are used in the data architecture to deliver information to the business. After all, dimensional models are easier to consume for dashboarding solutions, and business users know how to use dimensions and facts to aggregate their measures. However, PIT and bridge tables are usually needed to maintain the desired performance level. They also simplify the implementation of dimension and fact entities and are, for those reasons, frequently found in Data Vault-based data platforms.

This article completes the information delivery part of this series. The following articles will focus on the automation aspects of Data Vault modeling and implementation.

<<< Back to Blog Series Title Page