Thursday, June 18, 2009

Modeling Average Metrics in Aggregate Tables

OBIEE has some excellent features for incorporating aggregate tables into the metadata modeling, and these tables are usually pretty easy to create and populate in the database using simple SQL scripts. One of the only wrinkles I have ever run into involved the need for average metrics along with aggregate tables. This blog posting explains the issues that were encountered and provides a solution.

SCENARIO

Suppose you have the following Fact Table, which represents patient visits at a hospital:


  • VISIT_ID is the Primary Key
  • YEAR_KEY is a Foreign Key to the Year Dimension
  • PATIENT_ID is a Foreign Key to the Patient Dimension
  • CHARGES is the only Fact (metric) in the table currently

Let’s say you need to create two CHARGES metrics for your Answers users:
  1. Charges (represent a sum of CHARGES)
  2. Avg Charges (represents an average of CHARGES per patient visit)

You could satisfy these requirements by creating two metrics in the logical layer of the repository, both mapped to the CHARGES column, with the Default aggregation rules for these metrics set to Sum and Avg respectively. For now, let’s assume you choose this route.

Now suppose you want to use an aggregate table to improve query performance for queries that only utilize the year dimension along with visit Facts. The aggregate table might look like this:



NOTE: (in the above example, there were only 6 records in the original table, so an aggregate table would be trivial, but imagine if the VISITS_FACT table had 6 million rows and your aggregate table reduced the row count from 6 million down to 2).

You could easily use the following query to populate the table above:



Note that I have included a VISIT_COUNT column, which might not have been an initial requirement, but its purpose will soon become apparent.


ISSUES

The above method for creating an average metric in an aggregate table may seem like a good approach, but it falls apart in certain query scenarios. Remember your “Avg Charges” metric? Its Default aggregation rule is set to Avg, and this metric is now mapped to two logical table sources:

  1. VISIT_FACT.CHARGES
  2. VIST_FACT_AGG_YEAR.AVG_CHARGES_PER_VISIT
When you run an Answers request that contains the Year dimension along with the “Avg Charges” metric, the result is correct; however, what if you run an Answers request that only contains the “Avg Charges” metric and nothing else? There’s a good chance that the Oracle BI Server will still choose to use the aggregate logical table source for this query, and if it does, the results will have an incorrectly calculated “Avg Charges” metric. Why is this? It’s because you can’t average an average. Averaging is not an additive operation. When you try to average a group of numbers that already represent averages, you do not get the proper weighting factors of the original averages and the formula is wrong.

In our example, the correct average for all 6 rows would be (100+100+200+200+850+950)/6 = 400; however, the “average of the averages” would be (900+150)/2 = 525, which is incorrect.

SOLUTION

I suppose you could create two separate “Avg Charges” metrics in the logical layer and map each metric to only one logical table source, and then set the content levels for these metrics, but this solution is impractical, overcomplicated, and does not follow best practices. Besides, wouldn’t you rather have just one “Avg Charges” metric available to end users? The backend aggregate tables and processing should be seamless to the end user anyhow, that’s the whole point of the business layer.

Therefore, here is my proposed solution that should still help you reap the performance benefits of using aggregate tables, and will also allow you to define just one “Avg Charges” metric that will have the correct calculation no matter what the query scenario is:

  1. Do not create an AVG_CHARGES_PER_VISIT column in your aggregate table at all.
  2. Instead, create the VISIT_COUNT column in the aggregate table.
  3. In the logical layer, create a “# Visits” metric based on the VISIT_COUNT. Set the Default aggregation rule to Sum (this is the appropriate aggregation rule since counts are additive; here we do not use Count as the aggregation rule because it makes no sense to count a count.) But wait, our VISIT_FACT table does not have a VISIT_COUNT column, so how do we map the “# Visits” metric to this logical table source? Here’s the trick, just map this column to a constant number 1:



    (Think about why this works…counting individual rows is equivalent to tallying the rows by assigning a value of 1 to each row and then summing all the 1’s).
  4. Now that we have a proper “# Visits” metric mapped to both of our logical table sources, create a new logical column for “Avg Charges” using the formula “Avg Charges” = “Charges” / “# Visits” (remember that our “Charges” metric already represents a sum of charges):



  5. No matter what the Answers request looks like, our “Avg Charges” metric will calculate the average correctly, using the appropriate logical table source, and using the following formula, which applies the default aggregation rules in the correct manner: “Avg Charges” = SUM(“Charges”)/SUM(“# Visits”). This formula even works at the most granular level when you are only averaging one row: 100 = 100/1.

This is the best solution I have found for working with averages along with aggregate tables. If anyone has any other ideas, I’d be interested to hear them.

No comments:

Post a Comment