Dynamic Cubes - Near Real Time and Cross Database Queries
Introduction
Purpose of Document
IBM Cognos Dynamic Cubes delivers tremendous performance gains for data at rest. In today’s changing business environments there is also a need to reduce how stale data is allowed to become, or to refresh it more often. Examples of this requirement include financial processes, where an adjustment is made at the end of a period and the results need to be visualized soon afterwards to ensure the figures are correct, or a sales environment where up-to-the-minute data alongside historic data is required to help spot trends and improve business outcomes.
This document assumes knowledge of IBM Cognos Dynamic Cubes and IBM Cognos Cube Designer and that the reader has an advanced level of knowledge of cube modelling. For more information on cube modelling see the IBM Cognos Dynamic Cubes Redbook and the Dynamic Cubes User Guide. There are two main concepts described in this document:
- The ability to have high performance queries from cached and aggregate-aware historic data structures merged with near real-time structures of data where new data and deltas are applied.
- The ability to do this spanning two supported Relational Database Management Systems (RDBMS).
Please note the introduction of a second RDBMS is usually due to the requirement for cross functional analysis – for example, the billing warehouse is in DB2 whereas the returns warehouse is in SQL Server. In this case, IBM Cognos Dynamic Cubes can combine base cubes with partially differing structures (grain and number of dimensions) within a virtual cube to show, for example, which customers’ invoices had all of their line items returned. In this document we assume that the data structures are the same and that near real-time will be achieved by spanning different database platforms.
It is more likely that this technique of near real-time analysis would be performed within the same RDBMS, such as DB2. The decision to use two different RDBMS engines in this document was to demonstrate a more complex environment.
Approach To Near Real-Time Dynamic Cubes
The approach stems from one briefly described as Time Partitioning in the IBM Cognos Dynamic Cubes Redbook (see the Resources section at the end of this document) but with the delta cube set as near real-time. It is relatively simple and works on the premise that the largest volume of data is historic and does not change.
New data (inserts of new rows, deltas or removal of rows) tends to be only for the current period. Similarly, the largest amount of data, and therefore the data most in need of caching, aggregates and other optimisations, tends to be historic versus the proportionately smaller current period. The larger the amount of data, the longer it takes to re-populate caches, calculate in-database aggregates and load in-memory aggregates, so it is not always feasible to re-populate optimisations intra-day. By splitting data based on periodicity, the need to frequently reload these optimisations is reduced.
The solution is to split data based on volatility with non-changing data in one (or more) cubes and real-time data in another. This means that optimisations such as summary tables need not be loaded and calculated so frequently and real-time data is small enough to be aggregated on the fly. Although this applies equally to splitting data by volatile and static datasets, for this document we will use the analogy of historic and current period. This does not refer to the dates in the dataset but rather the date the changed data is entered - a delta is performed within the current period but may apply to a past transaction record.
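As a minimal illustration of this split (assuming a hypothetical LOAD_DATE column that records when each row was entered; the sample schema used later in this document instead keeps the two data sets in separate fact tables in different databases), the historic and current period data could be exposed as two views, one per cube:

```sql
-- Static rows entered before the current period feed the historic cube.
CREATE VIEW MAINFACT_HISTORIC AS
  SELECT *
  FROM MAINFACT
  WHERE LOAD_DATE < '2012-06-01';   -- period boundary, maintained by the roll-over process

-- Volatile rows entered during the current period feed the near real-time cube.
CREATE VIEW MAINFACT_CURRENT AS
  SELECT *
  FROM MAINFACT
  WHERE LOAD_DATE >= '2012-06-01';
```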
Depending on the data volumes involved we could have history, current month and today as separate dynamic cubes, with only the today dynamic cube not leveraging an aggregate cache. We could then combine the physical cubes together to create a set of virtual cubes, which would combine all the data into one virtual cube for reporting.
For the sake of simplicity, in this document we will have only a historic cube that contains non-changing data and a real-time current period cube which is built upon the changing data of the current period with no aggregates or data/result set cache. Figure 1 shows how these historic and current period cubes can be merged into one virtual cube that users can report from.
Figure 1 - Real-time Dynamic Cube with historic and current period cubes merged into a virtual cube
Note that a virtual cube in IBM Cognos BI 10.2 can currently only combine two cubes. However, one may be a virtual cube thereby allowing for more historic data cubes by chaining them together.
Environment
The environment for this exercise is a simple, single virtual machine running all the required components. Figure 2 shows the various pieces of this environment, which include:
- IBM Cognos Dynamic Query Analyzer
- IBM Cognos BI 10.2 single server installation
- Firefox 10.0.11 and Internet Explorer 8 web browsers
- Microsoft IIS 7.5 Web Server
- IBM DB2 10.1 running the IBM Cognos BI Content Store and containing 100 million rows of legacy data
- Microsoft SQL Server 2008 housing the real time data set
Figure 2 - Diagram of environment used for this document
This virtual machine is also used for other workloads and is therefore larger than needed for the data volumes used here. For example, during this testing the system was actually using less than 8 GB of RAM.
Dataset Description
The dataset used for this document is a modified version of the standard IBM Cognos GO Sales sample, represented as a star schema. For the purposes of this document we are concentrating on the modelling and delivery techniques, rather than the actual data. Therefore there may be some duplication of data due to the need to generate reasonable data volumes. There are two data sets - one is held in an IBM DB2 database and the other in a Microsoft SQL Server database.
DB2 dataset
Tables 1-8 below detail the structure of each dimension from the DB2 datasource. The number of members for each level in a dimension will help to provide context as to how sparse the data may be.
Table 1 - Products Dimension
Levels | Number of Members |
---|---|
Product Lines | 5 members |
Product Types | 21 members |
Product | 115 members |
Table 2 - Order Method Dimension
Levels | Number of Members |
---|---|
Order Methods | 7 Members |
Table 3 - Periods Dimension
Levels | Number of Members |
---|---|
Millennium | 2 members |
Year | 112 members |
Quarter | 448 members |
Month | 1344 members |
Day | Days in respective Month |
Table 4 - Slice Dimension
Levels | Number of Members |
---|---|
Slice | 63 members |
Table 5 - Retailer Dimension By Type
Levels | Number of Members |
---|---|
Retailer Type | 8 members |
Retailer | 109 members |
Table 6 - Retailer Dimension By Location
Levels | Number of Members |
---|---|
Retailer Region | 21 members |
Retailer City | 233 members |
Retailer Site | 391 members |
Table 7 - Staff Dimension
Levels | Number of Members |
---|---|
City | 28 members |
Staff | 104 members |
Table 8 - MainFact fact table
Measure | Datatype |
---|---|
Quantity | bigint |
Unit Cost | decimal (19,2) |
Unit Price | decimal (19,2) |
Unit Sale Price | decimal (19,2) |
Sale Total | decimal (19,2) |
Gross Profit | decimal (19,2) |
Total Rows | bigint |
Total number of rows in the MainFact fact table is 100,825,767.
SQL Server Dataset
Tables 9-16 below detail the structure of each dimension from the SQL Server datasource. The number of members for each level in a dimension will help to provide context as to how sparse the data may be.
Table 9 - Products Dimension
Levels | Number of Members |
---|---|
Product Line | 5 members |
Product Type | 21 members |
Product | 115 members |
Table 10 - Order Method Dimension
Levels | Number of Members |
---|---|
Order Method | 7 members |
Table 11 - Periods Dimension
Levels | Number of Members |
---|---|
Millennium | 2 members |
Year | 112 members |
Quarter | 448 members |
Month | 1344 members |
Day | Days in respective Month |
Table 12 - Slice Dimension
Levels | Number of Members |
---|---|
Slice | 1 member (RT Slice) |
Table 13 - Retailer Dimension By Type
Levels | Number of Members |
---|---|
Retailer Type | 8 members |
Retailer | 109 members |
Table 14 - Retailer Dimension By Location
Levels | Number of Members |
---|---|
Retailer Region | 21 members |
Retailer City | 233 members |
Retailer Site | 391 members |
Table 15 - Staff Dimension
Levels | Number of Members |
---|---|
City | 28 members |
Staff | 104 members |
Table 16 - MainFactRT fact table
Measure | Datatype |
---|---|
Quantity | bigint |
Unit Cost | decimal (19,2) |
Unit Price | decimal (19,2) |
Unit Sale Price | decimal (19,2) |
Sale Total | decimal (19,2) |
Gross Profit | decimal (19,2) |
Total Rows | bigint |
Cube Design
The design of both cubes is the same, with the only exceptions being the calculations used for the Millennium and Quarter levels, where the resultant values are the same but RDBMS-specific functions were used. The DB2 historic data set also has some optimisations added as part of the tuning exercise. These optimisations include database aggregates (summary tables) and in-memory aggregates as well as having the data and result set caches enabled.
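As an illustration of what such RDBMS-specific expressions might look like (the actual expressions used in the model are not shown in this document and the ORDER_DATE column name is an assumption), the two levels could be derived as follows:

```sql
-- DB2 (historic data): hypothetical Millennium and Quarter level expressions.
SELECT CASE WHEN YEAR(ORDER_DATE) < 2000 THEN '1000s' ELSE '2000s' END AS MILLENNIUM,
       QUARTER(ORDER_DATE) AS QUARTER_NUMBER
FROM LARGE_PERIODS;

-- SQL Server (near real-time data): the same values using T-SQL functions.
SELECT CASE WHEN YEAR(ORDER_DATE) < 2000 THEN '1000s' ELSE '2000s' END AS MILLENNIUM,
       DATEPART(quarter, ORDER_DATE) AS QUARTER_NUMBER
FROM dbo.Large_Periods;
```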
This document will first test without the historic cube optimisations so that relevant comparisons of timings can be made.
The star schema with a single snowflake dimension is shown below in Figure 3 with the hierarchies and levels added. The diagram shows the MainFact table in the middle of the snowflake with the tables Large_Periods, Staff_Dimension, Order_Method_Dimension, Slice, Retailer_Dimension and Product_Dimension joined to it. The Product_Dimension table also has Product_Type joined to it, which in turn is joined to Product_Line, completing the relationships required for Products_Dimension.
Figure 3 - Star schema diagram of historic data
Test Reports
The reports used for this test were designed with two purposes, to simulate simple ad-hoc navigation and to simulate a more complex analysis that may be saved as a report output in the future. Table 17 below lists the report names that will be referenced later in the document. The reports and data were created in an IBM Cognos Insight application for ease of result analysis.
Table 17 - Test Report Names
Report Number | Report Name |
---|---|
1 | Basic Crosstab Products by Order Method |
2 | Drill Down on Products |
3 | Drill Up |
4 | Replace Columns |
5 | Drill Down to Years |
6 | Multi Level Set |
7 | Multi Level Calculate and Top |
8 | Real-time Test |
Figure 4 is a simple report that shows an initial ad-hoc query for all data, the total number of rows for each product line by order method. Notice the highlighted circle which shows the total rows returned being 100,825,767.
Figure 4 – The Basic Crosstab Products by Order Method report
Figure 5 shows a simple crosstab report that simulates a drill down on the largest product line (Mortgages) to show the mortgage product types by order method.
Figure 5 – The Drill Down on Products report
Figure 6 demonstrates the user drilling back up to the original view of total number of rows for each product line by order method.
Figure 6 – The Drill Up report
Figure 7 shows a crosstab report that changes the report structure to simulate the user investigating over time - the total number of rows for each product line by time. In this case the 1000s and 2000s columns refer to which millennium the date falls in. For instance, 1996 would fall into the 1000s column.
Figure 7 – The Replace Columns report
In Figure 8, the user drills down to investigate the current millennium.
Figure 8 - The Drill Down to Years report
In Figure 9, the user lists the total rows by all product types for two years alongside two quarters and three months.
Figure 9 - The Multi Level Set report
Figure 10 shows the same report as Figure 9 but this time there is a calculation showing the percentage contribution of June 2012 to the overall year 2012 and then shows the top five product types based on their contribution.
Figure 10 – The Multi Level Calculate and Top report
Figure 11 shows the top five product types calculation that was used. This was accomplished by using the properties of the Set Definition and setting the Top or Bottom properties to a Type of Top and the Number of items to 5.
Figure 11 - Top five product type calculation used in Figure 10
Figure 12 is a simple report output listing total rows by slice where the real-time slice is differentiated.
Figure 12 - The Real-time Test report
The Approach To Timing
Timings are all taken using the total query time shown within Dynamic Query Analyzer (DQA) when the report is run directly within the DQA interface (Figure 13).
Figure 13 - The total reported time for the query processing shown in Dynamic Query Analyzer
The two underlying RDBMSs are not restarted between tests and it is possible that any caching of data or query plans at that layer may skew the results slightly. Therefore, report executions were performed three times and an average taken.
As the environment is virtual, it is also possible that there are variations in the results due to other workloads. Where appropriate, the Dynamic Cubes were restarted after each run, with the only exception being the merged virtual cube where restarting it would have negated the benefits of some of the caching at the historic cube layer.
Finally, after all tuning was performed, each run had two iterations - an initial run and a re-run so that the benefits of the data and result set caches could be seen.
All results from the last phase of tuning were copied into Final times (1st run) and Final times rerun for easy comparison across tuned historic data and near real-time and merged data. As previously mentioned, all the timings were recorded in an IBM Cognos Insight application for ease of result analysis.
The Historic Cube
Description
This star schema contains just over 100 million rows of static data and is modelled the same as any normal Dynamic Cube. The cube then had an optimisation cycle with workloads applied to improve performance.
Although 100 million rows is small it does give us a baseline - before tuning, the total average report query run time is 585.54 seconds (just under 10 minutes), long enough for us to see the impact of tuning and compare with the near real-time cube later.
Configuring the cube
A Dynamic Cube model was built on top of the star schema, deployed and the deployment was then used for the initial timings. The Aggregate Advisor was used to generate the summaries for both of the databases. In-memory aggregates and some manually created summaries that were not part of the advisor results were added from a previous exercise. The maximum amount of memory IBM Cognos BI uses for the aggregate cache was changed to 1024 MB from the default of 0 MB (Figure 14) in order to allow for sufficient memory to house the in-memory aggregates needed for the cube.
Figure 14 - Aggregate cache memory setting
For this volume of data the Aggregate Advisor recommended in-memory aggregates totalling 190 MB of RAM. For more information on this process and Dynamic Cubes overall, please see the chapter titled Cognos Dynamic Cubes administration in the IBM Cognos Dynamic Cubes User Guide or the IBM Cognos Dynamic Cubes Redbook.
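As an illustration of the general shape of an in-database aggregate, a manually created DB2 summary table (materialized query table) at the Product Type by Order Method grain might look like the sketch below. The actual summaries for this exercise were generated by the Aggregate Advisor and are not listed here, and the table and column names are assumptions.

```sql
-- Hypothetical DB2 summary table (MQT) at the Product Type by Order Method grain.
CREATE TABLE MAINFACT_PT_OM_AGG AS
  (SELECT P.PRODUCT_TYPE_CODE,
          F.ORDER_METHOD_KEY,
          SUM(F.QUANTITY)   AS QUANTITY,
          SUM(F.SALE_TOTAL) AS SALE_TOTAL,
          SUM(F.TOTAL_ROWS) AS TOTAL_ROWS
   FROM MAINFACT F
        JOIN PRODUCT_DIMENSION P ON P.PRODUCT_KEY = F.PRODUCT_KEY
   GROUP BY P.PRODUCT_TYPE_CODE, F.ORDER_METHOD_KEY)
  DATA INITIALLY DEFERRED REFRESH DEFERRED;

-- Populate the summary; this is the step that has to be repeated whenever the
-- historic fact table changes (see the roll-over discussion later in this document).
REFRESH TABLE MAINFACT_PT_OM_AGG;
```

Whether such a summary is reached through the database optimizer or modelled explicitly as an in-database aggregate in IBM Cognos Cube Designer, the benefit comes from answering queries at the summary grain rather than scanning the 100+ million base fact rows.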
Report timings
Figure 15 below shows the IBM Cognos Insight report displaying the report timings and demonstrates that the impact of any summary aggregate is substantial. Simply adding the in-database aggregates and using the built-in aggregate awareness of IBM Cognos Dynamic Cubes reduced the overall report run time from 585.54 seconds to 0.92 seconds. It also shows that the Drill Up report (Report 3) and the Multi Level Calculate and Top report (Report 7) never took more than 0.01 of a second.
Figure 15 – Tuning results in seconds
Changing the view to milliseconds (ms) shows fairly consistent performance for the Drill Up and the Multi Level Calculate and Top reports across all levels of tuning - between 6 and 7 milliseconds for the slowest query (Figure 16).
Figure 16 - Tuning results in milliseconds
The reason why performance is similar across all levels of tuning differs for each report. The Drill Up report (Report 3) is able to reuse the result set from the Basic Crosstab Products by Order Method report (Report 1), thereby not requiring a query to the database and showing that the data and result set caches should aid ad-hoc queries when drilling back to a prior view, or when outputting the results to a different format. For the Multi Level Calculate and Top report (Report 7), the query time is also consistent. This is because the data cache from the Multi Level Set report (Report 6) can be re-used, but there is an extra five milliseconds for the calculation and the topcount function to be performed on top of the data cache. This not only demonstrates the value of the hybrid query engine but also shows how analysis would benefit as users perform calculations and other operations.
As illustrated in Figure 17, the average final time on the re-run of the Drill Up report (Report 3) is nearly a millisecond more than that of just using in-database aggregates, and the in-memory result is also slightly slower. Both Reports 1 and 3 are fundamentally the same query; the variance is largely due to fluctuations in the virtualized environment and is too small to be significant.
Figure 17 - Report execution results for Reports 1 and 3
Once in-memory aggregates are added, the calculation performance is fast enough that the need to create a result set cache versus re-calculating and retrieving data from a data cache is negated because the query length has dropped from 340 milliseconds down to 73.99 milliseconds. The Minimum query execution time before a result set is considered for caching (milliseconds) setting determines the performance characteristic that drives the creation of a result set cache. This setting can be found in the Tuning properties of the Query Service using IBM Cognos Administration and defaults to 50 milliseconds (Figure 18).
Figure 18 - Dynamic Cube advanced setting for minimum query execution time for caching
Using the result set cache would incur less cost, but for a query that now falls well under the 50 ms threshold (retrieval from the data cache takes 0.88 ms versus an initial in-memory aggregate calculation of 73.99 ms), the maintenance is not worth it.
The impact of using the result set cache across all reports becomes clearer if we ignore the column labelled 1 No Aggregates run, as this skews the times due to the need to query 100+ million rows directly. Figure 19 shows some of the results. The initial run of the Basic Crosstab Products by Order Method report (Report 1) shows an initial run time of 122,268.60 ms when run without any form of pre-created aggregates. Compare that to the second execution where in database aggregates were leveraged, dropping the report execution time down to 340.94 ms and the third execution using in memory aggregates which dropped the report execution time down to 73.99 ms. Similar results are seen with the Replace Columns report (Report 4) where the initial run time was 242,232.15 ms, second run was 205.37 ms and third run was 12.05 ms.
Figure 19 - Report timings showing initial run versus cache leveraged results
Interestingly, for the Drill Down to Years report (Report 5), the addition of in-memory aggregates increased the overall run time from 74.01 ms to 102.407 ms. This is due to overzealous in-database summary creation at a finer grain. This matched the report better than the in-memory aggregate and was selected first. In practice this is not a good scheme as it requires maintenance of an in-database summary that is of little value. For ad-hoc queries, a larger in-database aggregate may produce better results. For the purpose of this document this will be ignored, as the difference in average query time is less than 50 milliseconds and re-runs effectively use caches that negate this anomaly.
Ultimately, the query performance benefits from the use of aggregates, caching and the hybrid query engine are significant across all reports, reducing the overall run times from minutes down to milliseconds.
The Near Real-Time Cube
Description
The near real-time cube is identical in structure to the historic data set. The base table is populated with just over 1.5 million rows to simulate mid-day load and a stored procedure is used to add data at a rate of 362 rows per second.
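The stored procedure itself is not reproduced in this document; the sketch below, which assumes a hypothetical dbo.MainFactSeed staging table with the same column layout as dbo.MainFactRT, illustrates one simple way such a load could be simulated in SQL Server.

```sql
-- Hypothetical T-SQL sketch of a procedure that mimics transactions arriving
-- in the real-time fact table at roughly 362 rows per second.
CREATE PROCEDURE dbo.SimulateRealTimeLoad
AS
BEGIN
    SET NOCOUNT ON;
    WHILE 1 = 1
    BEGIN
        -- Copy a small batch of pre-tagged rows (RT Slice) into the real-time fact table.
        INSERT INTO dbo.MainFactRT
        SELECT TOP (362) *
        FROM dbo.MainFactSeed;       -- assumed staging table, not part of the sample schema

        -- Pause for one second before the next batch.
        WAITFOR DELAY '00:00:01';
    END;
END;
```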
With the exception of the member cache that is required by Dynamic Cubes, all caching is disabled and aggregates are not used.
Configuring the real-time cube
Once the cube is published and running we need to disable caching. In IBM Cognos Administration select the Status tab and choose System. Click on the server to drill into, then click on the Dispatcher and select QueryService. From here you can administer any Dynamic Cube that has been published to that Dispatcher. Right-click on the QueryService, choose Set properties from the menu (Figure 20) and then select the Settings tab (Figure 21).
Figure 20 - Set properties for QueryService in IBM Cognos Administration
Figure 21 – Settings tab for the Query Service
Under the Settings tab, click on the Edit... link for the Dynamic cube configurations entry (Figure 22) and then click the Edit configuration icon for the real-time cube, which in this case is named MainFactRT (Figure 23).
Figure 22 - QueryService Settings tab
Figure 23 - Dynamic Cube configuration
Ensure the Disable result set cache check box is enabled, the Data cache size limit (MB) is set to 0 and the Disable external aggregates checkbox is enabled (Figure 24). Once set, click OK three times to return to the main screen and save the changes.
Figure 24 - Dynamic Cube properties
We need to wait for the changes to be written to the Content Store and then propagate back to the Dynamic Cube server - waiting for 30 seconds should be sufficient. We can then restart the cube and, once it has restarted, we can see from the report outputs that it is now working in near real-time. In IBM Cognos Administration select the Status tab and choose System. Click on the server to drill into and then click on the Dispatcher and select QueryService. Right-click on the Dynamic Cube, in this case MainFactRT, and select Restart (Figure 25).
Figure 25 - QueryService menu showing restart for real-time cube
As the real-time database is loaded with data, you can see the number of rows for the corresponding product and order methods slowly increasing. We now have a system that mimics a real-world scenario of real-time data being loaded into a production database.
Report timings
As there are no in-database or in-memory optimisations, these are excluded from the results. The re-run does show an improvement over the first run; however, this is due to the Dynamic Cube resources being loaded and prepared as well as improved RDBMS query and plan caching at the data layer. Figure 26 shows the report execution time differences between the first run and the average of the next six report executions. Query performance improved by 45% on average.
Figure 26 - Real-time report timings
This can be proven to a certain degree within Dynamic Query Analyzer. For one example report run, the data retrieval time for the query against the database was 361 ms against a total processing time of just under 379 ms. The second run was 60 ms database time against 93 ms total processing time. In both cases, approximately 30ms of processing took place outside of the data query. These results can be seen in Figure 27 below.
Figure 27 - Dynamic Query Analyzer results showing data retrieval times across runs
The XCubeRetrieval process represents a query against the underlying RDBMS. Analysing the queries issued, we find that they are identical across query runs. Example of a query used:
SELECT "Order_Method_Dimension"."Order_Method_Key" AS "Order_Method_Dimension_Order_Method_Key" , "Product_Type"."Product_Type_Code" AS "Product_Dimension_Product_Type_Code" , SUM("MainFactRT"."Total Rows")AS "Total_Rows" FROM "DQMSampleDataRT"."dbo"."Order_Method_Dimension" "Order_Method_Dimension" INNER JOIN "DQMSampleDataRT"."dbo"."MainFactRT" "MainFactRT" ON "Order_Method_Dimension"."Order_Method_Key" = "MainFactRT"."Order_Method_Key" INNER JOIN "DQMSampleDataRT"."dbo"."Product_Dimension" "Product_Dimension" ON "Product_Dimension"."Product_Key" = "MainFactRT"."Product_Key" INNER JOIN "DQMSampleDataRT"."dbo"."Product_Type" "Product_Type" ON "Product_Dimension"."Product_Type_Code" = "Product_Type"."Product_Type_Code" INNER JOIN "DQMSampleDataRT"."dbo"."Product_Line" "Product_Line" ON "Product_Type"."Product_Line_Code" = "Product_Line"."Product_Line_Code" WHERE "Product_Line"."Product_Line_Code" IN (1 ) GROUP BY "Order_Method_Dimension"."Order_Method_Key", "Product_Type"."Product_Type_Code"
When issuing the same query within the RDBMS tools we can prove that the database is in fact completing the requests more quickly for the same client. Figure 28 is taken directly from the RDBMS query tool statistics and does not include any IBM Cognos components. It shows the timings for this query are 855 ms for the first run, followed by 149 ms to 158 ms for subsequent runs, even though the data is changing. One can conclude, without further investigation, that the RDBMS is holding some of the data pages in cache, that the new data is still in cache and that the queries are executed in close succession time-wise, hence the improvement in performance.
Figure 28 - Execution timings for query executed in RDBMS tool
As other queries are run this benefit disappears, so by the next iteration we see the same effect. Only on the third restart and run of the Dynamic Cube did the results appear consistent between the first and second runs.
Further tracing at the RDBMS layer might explain the differences; however, as the results are at most a second or two apart and this document focuses on Dynamic Cubes rather than the underlying RDBMS, this anomaly will be ignored.
The expectation is that subsequent queries may slow over time on the real-time cube as data is added and the base queries take longer. Within our virtualized environment, all queries running against 1.5 to 3 million rows of incrementing data took between 87 ms and 1.7 seconds, which is acceptable performance for our infrastructure.
The Virtual Cube
The virtual cube plays a vital part in the system as this is where we combine the tuned static data with the changing data by merging the historic and near real-time cubes into one cube. For our tests, the historic cube has been left fully cached to simulate a real world scenario while the reports are run twice in succession and the process repeated three times for each run.
How to create and configure the virtual cube
Using Cube Designer go to the Project Explorer pane, right-click on Model, select New, then select Virtual Cube (Figure 29).
Figure 29 – Create a new virtual cube in Cube Designer
In the Select Base Cubes dialog, select the historic cube Large Sales Dynamic Cube and the real-time cube MainFactRT (Figure 30).
Figure 30 - Select Base Cubes dialog with Large Sales Dynamic Cube and MainFactRT cube selected
If all the data was in the same database then the dimensions would be the same and match automatically. In some cases, as in this one, they are named differently because they come from different sources. We need to define how the virtual dimension will merge the underlying cube dimensions and measures - reference dimensions should be merged on keys. This is done in the properties of the virtual cube that was created (Figure 31).
Figure 31 - Virtual cube dimension mapping
We need to remove all real-time dimensions that do not match the historic ones by deleting them from the editor pane. Next, ensure the like dimensions are mapped properly between the two physical cubes. Figure 32 shows the mapping dialog used to map the like dimensions between the Large Sales Dynamic Cube and the MainFactRT cube. Both cubes contain a product dimension; however, the Large Sales Dynamic Cube contains a dimension called Products while the MainFactRT cube's product dimension is called Product_Dimension. To map the dimension, click the ellipsis corresponding to the cube containing the dimension you wish to merge, then in the dialog select the radio button for the dimension from that cube that you wish to merge.
Figure 32 - Virtual cube dimension mapping select source dialog
The next step is to ensure the hierarchies are mapped in the virtual dimensions as well. Double-click on each virtual dimension and ensure the proper hierarchies are mapped. If hierarchies have the same name then this will be done automatically for you; otherwise you’ll have to select the proper hierarchies by clicking the ellipsis under each dimension from the base cube and choosing the correct hierarchy. In Figure 33 below, you can see that the Order Methods dimension in the virtual cube has the Order Methods hierarchy from one cube and the Order_Method_Dimension hierarchy from the other cube mapped.
Figure 33 - Virtual cube dimension properties
Finally, check the levels in each hierarchy to ensure they align as well by double-clicking on each virtual hierarchy. Figure 34 shows the levels used in the Periods and Large_Periods hierarchies merged together into the Periods hierarchy of the virtual cube. In this case the level names all match; if they do not in your cube, you will have to select the appropriate level that corresponds to the desired level for mapping in the virtual cube.
Figure 34 - Virtual cube periods hierarchy level mapping dialog
For this near real-time example the level names should all match but if we were merging different subject areas within the same database or from separate ones, they may not match and there may be extra dimensions and levels within hierarchies. We need to ensure that the measures are properly merged using the sum operator. In Figure 35 below, the virtual measure Quantity is selected and the property Merge Operator is changed to Sum. This will ensure that when Quantity is used, the result is a sum operation between the values pulled from the two underlying cubes.
Figure 35 - Virtual Cube measures mapping dialog
Calculated measures such as Total Cost can also be used in a host of areas. Instead of storing measure values in the RDBMS and incurring the overhead of retrieving the totals as well as the computation overhead of aggregates and summaries, a calculated measure may be used that multiplies Quantity and Unit Cost. For multi-subject areas such as sales and returns, a calculated measure would be used for order quantity minus returns.
Another good example for performance is in the case of averages such as Average Sales Total. Instead of having the database calculate the averages and potentially avoid summary usage (the base fact is needed to calculate an average), it is good practice to include a total row measure in the summaries and base tables. We can then use a calculated measure for Average by dividing the desired measure by the total rows. This can then efficiently use summaries and caches that may already exist. For more information on the use of summaries and caches please see the IBM Cognos Dynamic Cubes Redbook referenced in the Resources section at the end of this document.
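To illustrate the arithmetic, and using the hypothetical summary table from the earlier sketch, the average can be re-derived correctly from pre-aggregated sums by dividing the summed measure by the summed row count; in the cube, this same division would be modelled as a calculated measure rather than issued as SQL.

```sql
-- Average Sale Total derived from a summary table: because both the measure and
-- the row count are additive, the division remains correct at any grain.
SELECT PRODUCT_TYPE_CODE,
       ORDER_METHOD_KEY,
       SUM(SALE_TOTAL) / SUM(TOTAL_ROWS) AS AVERAGE_SALE_TOTAL
FROM MAINFACT_PT_OM_AGG
GROUP BY PRODUCT_TYPE_CODE, ORDER_METHOD_KEY;
```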
Finally, the result set cache and data cache need to be disabled on the virtual cube. For more information on these settings, refer to the steps in the section titled Configuring the real-time cube, this time editing the virtual cube’s configuration. Figure 36 below shows the properties for the virtual cube. Enabling the Disable result set cache check box and setting the Data cache size limit (MB) to 0 will disable both the result set and data caches.
Figure 36 - Virtual cube settings to disable the result set cache and data cache
In a system where there is more than one historic cube - for example, cubes named History All Years, History Month and Real-Time - it may prove better to have a virtual cube for the two historic cubes with the data cache and result set cache enabled. This arrangement could move rebuilds of the entire history aggregates to the end of the month, resulting in only the current month's optimisations having to be re-built each day instead of the whole history. This saves memory and processing as the historic and monthly base cubes would have both of their caches disabled and the overhead of recalculating the data cache from the optimised cubes may be minimal. A second virtual cube would then be used to merge the historic virtual cube and the real-time cube in the same way as described in this document.
Report timings for virtual cube
As can be seen from the results shown below in Figure 37, the overall query times for a report are at most 1.31 seconds, demonstrating how the same reports can efficiently run against the virtual cube made up of historic and real time data. Across the result set of over 100 million rows with changing data, this is deemed to be acceptable on this infrastructure.
Figure 37 - Virtual cube report timings first run versus rerun
If we compare the first run to the second and third runs, we can see a subtle difference in consistency (Figure 38). The first query run of some reports appears to be a second or so slower than subsequent reruns. In this test the cube was not restarted so that the benefits of the historic caches could be fully realized, hence the second and third runs are more consistent as the Dynamic Cube server resources and RDBMS page caching start to take effect. Slight anomalies can be seen at the millisecond level; these may simply be the virtualized environment competing for resources.
Figure 38 - Virtual cube first, second and third run timings
Overall Report Timings
When evaluating results across all scenarios, the total average report run time for the 100+ million row dataset with no changing data or optimisation was nearly 10 minutes (586 seconds). Once the changing data had been added and the system optimised, the total average report run time fell to just under 4 seconds (3870.17 milliseconds).
If we simply look at the total average report run time for the optimised historic data, the result was 15.85 ms, so one could deduce that the addition of near real-time data and the virtual cube added 3854.32 ms to the processing of the queries. The actual tested total average run time of the near real-time cube on its own was greater, at 4096.02 ms. With the odd exception, this result can be seen across the individual reports (Figure 39).
Figure 39 - Historic, Real-time and Virtual cube timings for all reports
The results below show that the addition of near real-time data in a virtual cube has made little or no impact. Figure 40 shows that with the exception of three reports (Multi Level Set, Multi Level Calculate and Top, and Basic Crosstab Products by Order Method), the addition of real-time data with the cube optimizations actually resulted in a decrease in report execution times ranging from 28 milliseconds to 151 milliseconds. Even those reports that increased in execution time only saw an increase of between 14 milliseconds and 147 milliseconds. These are values which will have no real impact on the actual users.
Figure 40 - Time difference between Virtual Cube and Real-time cube
In this example most average query times actually decreased in comparison to the base real-time runs. This is attributed in part to the ability to process virtual cubes in parallel (which can also happen with Dynamic Cubes), but also to the effect of not restarting the cubes after each run, meaning that the underlying RDBMS is able to perform better on the same connections versus new connections. This effect can be seen below in Figure 41 where the first run shows that the total execution time for the reports against the virtual cube took 4337 milliseconds, versus 3437 milliseconds for the real-time cube. Figure 42 shows the second run and we see the performance improvement where the total report execution on the virtual cube was 3581 milliseconds versus 4110 milliseconds for the real-time cube.
Figure 41 - First run report timings of a virtual cube versus real-time cube
Figure 42 - Second run report timings of a virtual cube versus real-time cube
Overall the results show that a running system with connections created against the underlying database can run near real-time queries on volumes greater than 100 million rows. In fact, ad-hoc queries should be possible on even larger volumes providing the real-time dataset is small enough to return the results within acceptable times.
Near Real-Time Cube Configuration
Throughout this document the term near real-time has been used. This is partly because the data is not streaming in real time. We need to query data at the point an output is requested and latency in the query may mean that the data has already changed marginally. This is of course pedantic as in most cases being able to request results and have a view within a second or so is considered to be real-time. The main reason for stating that this is near real-time is because the dimensional framework - all the dimensional members and their attributes - are loaded at cube start-up time. In the example used in this document, this is not an issue as there are no new members created - all real-time data uses keys that already exist (such as Products). Even times and dates would normally be fully populated. Should there be a requirement for the member cache to be refreshed to accommodate new members, IBM Cognos BI provides the functionality to do so.
When refreshing the member cache, the new cache is loaded into memory from the database and then swapped with the running cache when ready. This has the performance advantage of user requests not being queued while the member refresh queries are running, however more RAM is required on the server hosting the IBM Cognos BI Dispatcher that is housing the cube in order to hold two copies of the member cache during the refresh.
The most effective method to refresh the member cache is to create an administration task that can then be scheduled or triggered. From IBM Cognos Administration, select the Configuration tab and in the left pane select Content Administration. Click the New Query service administration task dropdown and then select Dynamic cube... from the menu that appears (Figure 43).
Figure 43 - New Query Service administration task for Dynamic Cube refresh
Give the new task a name - Reload Dimensions is being used here - and if desired a description and screen tip. Click the Next button (Figure 44).
Figure 44 – New Query Service Administration Task wizard dialog to specify the task name
Select the Refresh member cache option from the Operation drop-down and ensure the Server Group and Dispatcher are properly set. Create this task only for the real-time cube as this action will force a refresh of the virtual cube member cache. Do not do this against the historic cube as it would force a reload of all the in-memory aggregates and invalidate the cache, thereby impacting historic performance. This is done by ensuring that only the MainFactRT cube is selected from the list in the Cubes section (Figure 45).
Figure 45 - Dynamic Cube task options completed
Click the Next button and in the Select an action screen, set the Action to Save and schedule and click Finish (Figure 46).
Figure 46 – Save and schedule for the new administration task for the Query Service
The report shown in Figure 47 was run before running the administration task. It shows the real-time slice of data (excluding the seeded rows): there are 1,600,409 rows of data in the Historical Cube and the number of rows added into our real-time database (represented by RT Slice) is 47,421.
Figure 47 - Report displaying members before task execution
The administration task is executed and for this model the total refresh time is approximately 2 seconds. For a larger dimensional framework with millions of members, this time could increase so testing should be performed. After the administration task has run we can now see the new member RT New Member with 1 row and the RT Slice now having 72,302 rows of data (Figure 48).
Figure 48 - Report displaying RT New Member added after task execution
It is worth emphasising that this administration task can be scheduled to run at intervals as short as one minute, or based on an external trigger. For a system where new member inserts can be detected, the trigger could be used to ensure the dimensional framework is only updated when required. For example, Event Studio could detect a change in the count of members and then execute an administration job to refresh the member cache. For small dimensional frameworks, the overhead of running the queries against the dimension tables using Event Studio may outweigh the benefit compared with simply refreshing the member cache on a time-scheduled basis.
Roll-Over Of Data
As the performance of the real-time cube is dependent on the volume of data and the RDBMS in which it is stored, a suitable mechanism for moving the committed, real-time rows to the historic data set needs to be put in place. As a roll-over also requires the historic aggregates to be updated and the in-memory aggregates to be reloaded, the frequency of this needs to be balanced with real-time data volumes.
Although more complex, this is where using multiple historic cubes may be more advantageous by reducing interim roll-over times. The process of rolling data over is a simple concept and may use anything from ETL tools through to simple SQL scripts to perform the following tasks (a minimal SQL sketch follows the list).
- Move data from the real-time fact table to the historic fact.
- If the dimensions are not the same, add any new members to the historic dimensions.
- Update the historic in-database aggregates.
- Issue a restart command for all the cubes.
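A minimal sketch of such a script is shown below. The table and summary names follow the earlier hypothetical examples, and a real implementation would need to handle the cross-database copy (for example via federation, a linked server or an ETL tool) as well as any new dimension members.

```sql
-- 1. Move the committed real-time rows into the historic fact table.
--    (In this environment the two tables live in different RDBMSs, so this
--    copy would in practice be a federated query or an ETL step.)
INSERT INTO MAINFACT
SELECT * FROM MAINFACTRT;

DELETE FROM MAINFACTRT;

-- 2. Update the historic in-database aggregates (summary name from the earlier sketch).
REFRESH TABLE MAINFACT_PT_OM_AGG;

-- 3. Restart the cubes, for example by firing the external trigger of the
--    Restart Near Real-time job described below.
```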
Simply refreshing the member cache on the historic and real-time cubes would have a similar effect to restarting those cubes, in that all the optimizations and caches will be refreshed and may minimize downtime at the expense of the extra RAM needed to process the change.
In order to create a job to restart all the cubes, create three separate query service administration tasks and add them to a single job. The process for creating the administration tasks was described in the previous section, with only the operations selected from the Operation dropdown list being different. The first task should stop the virtual cube using the Stop immediately or the Stop after active tasks complete operation, the second task should restart the base cubes using the Restart operation and ensuring the base cubes are selected, and the third task should start the virtual cube using the Start operation.
Next create a new job (Figure 49) to combine the three tasks to be executed in a specific order and in the Name field type Restart Near Real-time (Figure 50). To have the tasks executed in sequence, add them to the job in the order you wish them to execute and ensure the In Sequence radio button is selected (Figure 51).
Figure 49 - Cognos Administration new job creation
Figure 50 - New Job creation name and description
Figure 51 - New Job with tasks in execution order
This job can now be used to restart the whole set of cubes manually by running the job interactively or based on a schedule. In many instances, the most useful scheduling option is through the use of a trigger as the job can be invoked by an external process such as an ETL process.
It should be noted that the need to first stop the virtual cube is due to IBM Cognos BI 10.2 requiring all dependent virtual cubes to be stopped before a cube can be restarted.
The Impact Of A Multi-Cube System
Throughout this document you have seen that the use of virtual cubes adds minimal overhead to query times in general. However, there is a cost associated with the use of multiple cubes and, in this case, a virtual cube. This cost includes an increase in processing requirements as more threads are required to handle parallel execution. Dynamic Cubes benefit from more CPU computing power and this can play a significant part in improving the performance of a single, complex query.
It should be noted that for a multi-cube system, there is the requirement for sufficient RAM. In IBM Cognos BI 10.2, each cube requires its own dimension member cache. If the member cache is refreshed while a cube is running, there also needs to be sufficient RAM to store a copy of the member cache. This is to minimise the impact on queries as, once loaded, the new member cache is swapped for the current member cache. In practice this means that for this near real-time method, up to six times the member cache may be required. For smaller systems this might have minimal impact however for large systems the impact on RAM requirement is increased substantially.
For more information on server sizing see the article Dynamic Cubes Hardware Sizing Recommendations referenced in the Resources section of this article.
Conclusion
In conclusion, the Dynamic Cube capability within IBM Cognos BI 10.2 is not only able to deliver significant performance gains on large datasets, facilitating ad-hoc queries, but is also able to add near real-time information in order to help with, for example, reporting on large volumes where iterative adjustment and reporting cycles need to be made.
Although the addition of the real-time data did, as expected, decrease the overall performance of the system, given the right balance of historic to real-time data on a suitable infrastructure containing an RDBMS that delivers the required query performance and a suitable regime for data management, it is possible to cater for scenarios requiring near real-time data with acceptable response times.
In addition, the use of differing RDBMS platforms shows that, providing the dimension tables are replicated and the main factual data can be retrieved from a single table, it would be possible to do this using a filtered operational system (or mirrored transactional database) and a warehouse.
It is important to note that the impact on the operational system would need to be measured for potential problems; mirroring technology may alleviate these while still providing near real-time reporting. IBM Cognos BI versions 10.2 and 10.2.1 only support star and snowflake schemas for good reason, and even though it may be possible to use views on a transactional system to create a star structure for the real-time element, it is not advised as the performance of such views can impact query times.
Although not required, the use of two RDBMSs has shown that it is possible for a virtual cube to merge information from two disparate data warehouses. This aids multi-subject area analytics in organisations that have multiple data warehouse strategies, or where one application, such as the financials system, and another, such as the asset management system, are each provided with their own out-of-the-box warehouses.