Optimizing tm1 calculations (with pictures!)

Found myself using both of my new toy-scripts while stress-testing TM1:

  • I vizualize model data flow (way bigger than PlanSample in this picture, but node filtering helps).
  • Looking at such graph it’s easy to see model inputs (to left side), hardest calculations (these are the gordian knots) and reporting cubes.
  • So I just start pouring data input and reads in suspiciously looking model parts. And if they suffocate, it’s up to optimizing rules or turning rules into TI processes.
  • Redraw the picture and repeat until happy )

Very much like unjamming the creek. Spring )

Stress testing TM1 models

Ever thought about load testing your dear-loved TM1 project? Emulating a hundred (or hundreds) users reading and writing data simultaneously? it’s easy and simple!

Never trust sales slogans like above. Any kind of stress testing system should be approached with a huge amount of doubt, as well as this one. But it’s a simple and useful tool for testing your calculations for scalability (IMHO).

We needed to stress-test one of our TM1 models and looked around for solutions. Most of them are discussed in a very interesting thread on olapforums, but going via Excel + VB way seemed a bit too much of ‘Excel + F9′ to emulate the load. Turned out that it’s rather easy to roll-out a simple customizable Java application for this task, so here it is.

What the tool does

This tool tests your model for scalability and potential locking problems while performing calculations.
It doesn’t test you client connectivity (Excel, Contributor, Websheets) because it works via Java API. It doesn’t test your environment (network latency etc) as well.
It allows you to find and optimize the long-running and potentially blocking calculations in the model.
It allows you to simulate different types of users (like “What will happen when brand managers are inputting sales plan and analysts are working with investment portfolio and CashFlow at the same time?”).

Attention: Tool writes data to your model, don’t use it in production.

How does it work

You fill a user profile, containing:

  • logins&passwords list to be used
  • cubes for reading
  • cubes for writing
  • a list of dimension and their elements for writing — this is used for various purposes like simulating users from different regions (by specifying region in the list). Data will be written only to this elements in all the cubes containing dimension.
  • number of user session to be started with this profile

When you run the tool with filled in profile (you can run different profiles at the same time), it creates a required number of user sessions and in each session:

  • flips a coin to decide whether to read or write
  • (if read) creates a random view in one of the cubes for writing and reading and reads it contents
  • (if write) writes a random number (less than 100) of randomly picked cells into one of cubes for writing
  • waits 3 seconds and goes on again. Maybe this wait interval should be random as well )

After starting the tool you fire up TM1Top and enjoy the results.

You’re interested in locking problems and long running calculations that are nicely depicted there (an example is underlined in the picture above).
You can run TM1perfmon to see the hardware side of the problem at the same time

After you’ve got enough, just hit CTRL + C in command window and it will stop.
As a side bonus you’ll get a test log in the file test_results_profile_name.csv which will contain something like this:

How to install&use it

  1. You will need a only Java machine (jre), which you most likely already have, because of number of web-plugins developed in Java. This the only requirement, there’s no need to have TM1 installed or anything.
  2. Download the tool, unpack to some directory (c:\tm1stress, for example)
  3. Cryptic step of adding security certificate to your Java.
    1. Find jre\bin\keytool.exe utility on your computer
    2. Open the command line and navigate to jre\bin
    3. Run command line keytool -keystore ..\lib\security\cacerts -alias Company -import -file c:\tm1stress\applixca.der
    4. Enter ‘changeit’ as password
    5. Type yes
  4. Create a new user profile xml file, use the planning sample one as an example (if don’t need some sections — just leave them blank)
  5. Open command line, navigate to c:\tm1stress, run java tm1_stress_test your_profile_filename.xml
  6. You should get some informational messages (logins, operations, errors) in command output

There’s application source code in the download archive, it may serve as Java API usage sample.

Analyzing lots of Cognos Enterprise Planning Application Links

Sometimes you find yourself with a full-blown EP model, containing a 100 or so Contributor Application links and you need to actually find what they do. Opening them one-by-one is very time-consuming. One of my colleagues stumbled in such situation, so I’ve made a quick script that processes application link export files (.cal ones) and records their steps metadata (what cubes and apps are connected) into a single csv file that you can analyze in Excel easily.
Result looks like this:
LinkName SourceApplicationState SourceApplication SourceCube TargetApplication TargetCube
Link 1 DEVELOPMENT source_app1 source_cube1 target_app target_cube1
Link 1 DEVELOPMENT source_app2 source_cube2 target_app target_cube1
How to use:
1) install python-2.7 (should work in 3, but I didn’t test).
2) download script archive, unzip it into a separate directory
3) from Contributor Administration Console export needed links into some directory (wish I knew how to automate this)
3) open command prompt, go to directory from 2) and run
python eplinks.py -i directory_you_exported_links_to -o file_with_results.csv
After a minute or so (depends on your links quantity and size) you’ll get your file complete.
You can monitor execution in epllinks_log.txt file.
PS:
It’s pretty easy to draw a picture of connected cubes out of it, note me if you’re interested.
PPS:
All hail to Paul McGuire, author of pyparsing, that I’m using for this script and a much more complex application that does the same for all kinds of TM1 connections (rules and processes).

BI&DWH Booklist

More time I spend doing BI\DWH projects, I become more aware that such projects a totally a ‘craft‘, as my professor once said. And this ‘craft’ has many different aspects in it, each having a number of of very good texts. telling you all you need to know to avoid simple mistakes and deliver predictable results. But I encounter a very small number people who read even half of these books. Maybe it’s the ease of ‘bicycle invention’, maybe it’s boredom of reading about others mistakes, instead of earning your own scars. So it’s another ‘book-list’ of texts that make up a competent BI\DWH professional.
I’m trying to split it up into section of interest.

DWH Classic texts

There is no way you can do DWH modeling or ETL design, or even reports developing without reading at least some of Kimball Classics (Dimensional Modeling, ETL Toolkit, Lifecycle Toolkit). And if you’re into this for more than one project — you should start memorizing Lifecycle by heart.
You should read Bill Inmon as well to be aware of 2 different approaches to modeling. Also see Data Modeling section just below.

Data Modeling.

These books are not DWH-specific and most modeling tricks don’t apply directly, but you should be aware of typical problems and solutions, since this allows fast source system model examination and profiling — you can recognize modeling techniques used in developing and potential caveats in data.
There are numerous data modeling books out there, I’d recommend reading Hoberman and Witt&Simsion.
And I insist on reading Data Model Resource Books. It has 3 volumes and I suggest starting with 3rd one and then picking your business area of interest (HealthCare, Telco, Insurance — there’s a lot to choose from) and looking the model in first two volumes.

Achieving DWH performance

If you’re interested in getting it all faster, take a look at Mastering DataWarehouse Aggregates. And there’s a deep insight out there in Relational Database Index Design and the Optimizers — read this one only if you’re really into performance )

Data Quality

It’s a very important, yet most forgotten topic. After reading ETL Toolkit and Lifecycle parts on data quality, I suggest following with one of works by Larry English. If you’re very serious about separate DQ project, pick up Executing Data Quality Projects.

Data Visualization & Report Design

Data visualization is a most expected BI-revolution that won’t come in a way vendors think about it (another feature pack or product). It will come by evolution, just not of BI products, but of us, BI professionals  and data analysts. As we will get more acquainted with analysis techniques and get more experience with well designed visual aids (as opposite to Flash-based eye candy), we’ll shift our demands towards more analytical oriented tools and vendors will have to provide them )
I’d recommend anyone to skim through Head’s First Data Analysis, since it’s a good introductory book on how gain some insights from data.
Anyone designing any dashboard should read Stephen Few’s books. It’s an absolute must. Stop Pie-Charts invasion!
If you’re up to thinking about some new ways of representing information or gaining a deeper understanding of how things work, read Tufte’s works. And then follow up with Collin Ware deep texts on how our minds really work while processing visual information.

Product documentation

And, of course, you should read all the documentation provided by vendors of your tools and follow up most prominent blogs (there are ones on almos every tool out there) and forums. It’s a shame to see people inventing built-in functionality or not knowing that’s a screwdriver they’re are trying to slam a nail with. I find myself doing this with TM1 recently, so I’m back to reading manuals
I’ll update this post when I’ll encounter a good text on MDM or more good books on above mentioned subjects.
Any suggestions?

Gartner BI Quadrant for 2010 is out

It’s an interesting reading indeed.
I read vendor sections in these reports with interest, because after some years it’s possible to wipe all vendor-based marketing claims and assumptions made by people (analysts) who aren’t using the products daily and see something resembling “true state”.
For example, if you take a statement “QlikTech offers limited metadata management” and read it with cynic-hat on, you’ll read out that there’s no single metadata repository for an installation. Each so-called application packs it’s own metadata (+ data loading scripts), so you can easily have ‘Net Sales’ sourced from different tables in applications developed for different users. “Single source of truth” ;)
With this hat on we get:
- Oracle 11 was too long in development labs and didn’t get much out of it. But all those new stuff will be added some time later.
- SAP still haven’t decided what they bought BO for — that’s hysterically funny
- Microsoft BI is still a DIY kit, but now with more shiny new tools (PowerPivot) inside. But once again packed without assembly instruction )
- Cognos is hard to implement and slow, ain’t it?
- Tableau is terribly nice, useful and furry. Just small for enterprise-scale
- etc
Seems like at Gartner guys like SPSS acquisition by IBM much. And consider Cognos 10 a break-through in performance. Need to check it myself. After SP2 )

Anaplan

In case you haven’t heard about it yet — “Adaytum team” is back to business. With their new planning solution ) — Anaplan, or “In-Memory Analyst in the clouds”.
Take a look at demos and site, especially History page — reminds you anything?
My takes:

  • Technology has all needed buzzwords, although I’d prefer some more common, not ‘home-made’ parts, like HFS for file-system. Not so many details out there, nothing to comment more on.
  • I’m really fascinated by implementation cycle proposed. I’ve been doing it almost ten years up to now and met only one 3-4 weeks project on the way. Doing requirements gathering off-site, from Excel templates — implementation team must be totaly great.
  • Take a look at prices. It’s not just cheap, it’s just for free. In this case, to earn money, Anaplan must attract literally thousands of clients. SalesForce for planning?
  • Data integration. This is an interesting part, since a planning system is just a fake if it sits in the air (cloud :) ) without strong roots to the ground (actual data). In a complete CPM solution — DW srores your actuals and loads them into planning on demand. And loads back accepted plan figures for analysis. How this integration is to be done, when a part of the system is out on the cloud — is not a magic, but some tricky task anyway. Integrating with other cloud-based services (i.e. Saleforce) should be easy.
  • People. I’d really love to chat with those guys one day
  • Overall: I haven’t heard of a great success of any cloud-based BI services (and there was and is much hype on topic). Planning is bit like BI, because there are a few\none industry-standard solutions, it’s mostly an open LEGO constructor for you to build a castle.
    And if you don’t have a repeated\reusable model, you don’t earn that much money scaling up number of customers, since building from scratch is required in every case. SalesForce provides a standard CRM with some customizations for client, but the required functionality is typical for all client base. This allows ‘reusing’.
    Can this be done with Enterprise Planning?

    It looks very interesting, although cloud-based thingie won’t go in Russia right now, I’ll be watching. I really liked idea behind Analyst.
    And as I’ve already said, I’d rather tweak APL and J technology for 64bit and multi-core….

    Cognos Enterprise Planning 10.1 is out

    What’s new

    Fix list

    Nothing revolutionary, performance improvements, fighting with Java-based Rich Client memory footprint and performance. Publishing enhancements seem interesting.

    Since BI upgrade is required as well, I’d wait for first couple of BI service packs before trying this one out.

    Should we use TM1 for reporting?

    Another point worth clearing out: I really think TM1 is not a very good choice for OLAP reporting engine. On a decent data volumes, of course, millions of rows will be fast as lightning ;) Let me explain why.

    There’s a fundamental difference choice you take when you’re building an OLAP engine:  you decide whether it will be bulk updated or data will be inserted real-time. Consequences are substantial, you either optimize system for faster reads or inserts.
    TM1 was designed for inserts, so it lacks the main ‘read-only’ system optimization: aggregates.
    Aggregates are a common solution to performance issues that rise up when you ask “So how many socks did we sell last year” on dataset of billion rows. You can either try to walk through all detail records and count socks, or you can store an answer beforehand, creating a so-called aggregate view (products,years). Moreover this aggregate view can be used to answer question like ‘How were our sales at all last year’, since this view is way smaller than original data. But to do so, system must be ‘hierarchy-aware’, which is to know that you can calculate all product sales by omitting product dimension from (products,years) view. This property is sometimes called aggregate navigation. This all boils down to cube lattices that i like dearly. Working with such lattices was my PhD topic )
    All read-optimized OLAP engines are built around this concept:
    - aggregate views are called partitions in Cognos Transformer
    - aggregate views in Oracle Essbase
    - aggregate in Microsoft Analysis Services
    Each of these engines has a lot of optimizations in query rewriting. aimed on optimal aggregates usage.
    As a side note: so does Oracle with materialized views and dimensions concept and IBM MQ with cube views )
    But there’s no such thing in TM1. There are stargate views, which simply store the opened view and it’s very unclear how good is query-rewriting afterwards. At least nobody I talked with knows anything about that and my tests didn’t show good results.
    Generally, TM1 projects were all about planning, which doesn’t involve billions of rows, so performance questions didn’t come up often.

    Another big question, of course, is data loading speed.
    TM1 loads both data and dimension at the same speed of 30k records per second (it depends on record size, but we couldn’t raise it much higher) in one CPU thread (say hello to Transformer). I’ve loaded data ten times faster in Oracle Essbase and it’s not the maximum you can reach.
    30k\s is actually not bad enough, but inability to scale up is very bad.

    Nowadays IBM positions TM1 as reporting-OLAP server as well as foundation for planning products, which, I think, is a bit inaccurate. I’ve tried to sum up my concerns, feel free to comment on your experiences.

    ETL Testing

    I’m a fan of testing in each DWH project, because it allows to:
    - be sure that at least tested parts work
    - change logic w\o retesting all results
    Receiving “All OK” in the morning means that there’s time for a cup of coffee )
    A list of approaches I use for testing ETL-procedures:

    ‘Water in a Sieve’

    Checking whether we could carry all required data into DWH without ‘spilling’ it out.
    Common things to check:
    - row counts in source\DWH
    - grand totals in source\DWH
    Common mistakes found:
    - precision errors
    - missing dimension mappings

    ‘Excluded Middle’

    Checking dimension mapping for a selected dimension. This extends previous test by including dimension totals for checking.
    So if we want to check whether ‘products’ were mapped correctly — we compare totals(sums) by time, store, etc, listing all dimensions except products.
    If counts(sums) differ in source\DWH — we know that products were mapped incorrectly.  Moreover we know the specific data subset containing error, which helps a lot as well.

    ‘Do your maths’

    Checking DWH calculations. If we’re doing some data transformation\calculating something in DWH — it should be tested as well.
    2 approaches to testing:
    - overall logic testing. For example, if we’re allocating HQ expenses to get regional reports it logical to expect the overall sum of expenses to stay the same )
    - testing a specific data subset. We can select a single account and verify logic on it.
    These two approaches should be combined.

    ‘Like in good old days’

    Checking some heuristic expectations, based on previously loaded data.
    1) There’s no way daily sales can jump 70% compared to quarter average
    2) We’re usually getting about this number of rows from this source
    3) If we’re reloading 3 months of data daily — we expect modest amount of changes in past days.
    If you’re using partitioned loads ‘last_load’ partition can be used for such testing.

    It’s best when tests are written not by procedure developer himself. You can always apply cross-checking (you write tests for my procedures, I write for yours).
    Pay attention to Chapter 4 of ETL Toolkit — this post is just a simple list of typical tests, the methodology is described there.

    Microsoft SQL Server reporting database configuration

    A check-list for using Microsoft SQL Server as a reporting database. This is a ‘be aware’ list for myself in first place, since I forget some details as time passes.

    Table of contents

    Overall database settings

    Use simple recovery mode

    Set up simple recovery mode for database to minimize performance issues with transaction log records. Usually you can use daily backups with DWH, since data is loaded daily. In more complex situations you can use ETL-based ‘checkpoints’, but in general, daily backups are more than enough.

    Separate log and data io channels

    Use decent I\O systems and separate data and log I\O channels. This is obvious, just reiterating.
    RAID 1+0 seems to be preferable choice of disk configuration for DWH loads. See this entrythis one or test it yourself )

    2008 r2

    Use SQL Server 2008r2 Enterprise if you have a choice. 2008 adds table\index compression (which is great) and get’s partitioning to work way better in queries. And I’ve  20% performance speed-up against 2005 on the same query set (with same execution plans and that surprised me).

    Auto shrink Off

    Set Auto shrink off, it kicks unpredictable I\O strain. Shrink database daily, just after backup in the same maintenance task.

    Table configuration

    identity columns as primary keys

    Use identity columns as table primary keys. And using this column as clustered index will help I\O in most cases.

    Set up foreign keys

    Again, obvious, but I’ve seen this violated more than once. Use foreign keys for fact-dimension table reference. This will help optimizer a lot.
    You can disable these keys to speed up inserts, but in 90% cases it’s not needed.

    Table partitioning

    Always use partitioned tables (even fake, with 1 partition) to:
    - achieve partition elimination during query execution. Especially in Ms SQL 2008
    - use partition switching for data load (check this article as well)

    Table compression in 2008

    SQL Server 2008 introduces table compression which helps to reduce I\O (but increases CPU load), which is very good for reporting tasks. Use it for fact tables.

    ETL

    partition switching as default load strategy

    More benefits of partition switching for loading:
    - while you’re loading data in ‘new_data’ partition, base table is accessible and consistent
    - after you’ve prepared ‘new_data’ partition you can check it’s data for quality (planning to write about later)
    You can implement switching by creating another temporary partition, which I like to call ‘last load’. So you switch:
    - current -> last_load
    - new_data -> current
    in one transaction
    After that you have ‘last_load’ to:
    - switch back in emergency
    - check data from current\previous load for some more quality checks

    Filtered indexes for calculations

    Another 2008-only feature is called filtered indexes. These indexes can be essentially smaller than common ones if you know for sure what region of data you’ll be accessing. And this turns out to be very efficient for SQL code you’re using for different calculations. Since I mostly work with Oracle Data Integrator as an ETL tool and it’s an SQL-generating machine: I get up to10 times calculation speed up by building small, calculation-specific filtered indexes.

    Tablock hint

    All inserts should be done with TABLOCK hint, which speeds up insert by acquiring a full table lock during the whole transaction. Since in DWH main inserts are performed by ETL there should be no conflicts. TABLOCK also speeds up logging during bcp.

    Query tuning

    Indexes

    Really a vast topic, but to sum it up ) :
    - columns in indexes should be ordered by selectivity
    Be very-very careful, ‘your mileage may vary’:
    - creating a huge number of large indexes may actually slow system down due to I\O overload
    - you should try to create indexes that can help a number of reports\query instead of tuning a single request
    See Books session below, skip through at least one of them before starting it out wild.

    Filtered indexes

    As with ETL calculations — filtered indexes may be used for tuning specific queries.

    Index compression

    Indexes can be compressed in the same way as tables (which is obvious given that a table is a clustered index) and this speeds queries by reducing IO

    Books to read

    I’d recommend following
    And if you have time and will