<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Applied dimensionality &#187; etl</title>
	<atom:link href="http://ykud.com/blog/category/etl/feed" rel="self" type="application/rss+xml" />
	<link>http://ykud.com/blog</link>
	<description></description>
	<lastBuildDate>Sat, 28 Jan 2012 12:01:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>New recruit to my ETL toolbox</title>
		<link>http://ykud.com/blog/etl/new-recruit-to-my-etl-toolbox</link>
		<comments>http://ykud.com/blog/etl/new-recruit-to-my-etl-toolbox#comments</comments>
		<pubDate>Sat, 17 Dec 2011 11:29:54 +0000</pubDate>
		<dc:creator>ykud</dc:creator>
				<category><![CDATA[etl]]></category>

		<guid isPermaLink="false">http://ykud.com/blog/?p=743</guid>
		<description><![CDATA[I&#8217;ve recently completed my first real DataStage project and took the chance to get certified while everything is still “fresh”. The certification itself is quite complex, and I hadn&#8217;t used most of the tricks covered in the questions until one of the jobs had to process a quarter of a billion rows [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://ykud.com/blog/wp-content/uploads/2011/12/5572518350_ec16922708.jpg"><img class="alignleft size-medium wp-image-779" title="5572518350_ec16922708" src="http://ykud.com/blog/wp-content/uploads/2011/12/5572518350_ec16922708-300x200.jpg" alt="" width="210" height="140" /></a>I&#8217;ve recently completed my first real DataStage project and took the chance to get certified while everything is still “fresh”. The certification itself is quite complex, and I hadn&#8217;t used most of the tricks covered in the questions until one of the jobs had to process a quarter of a billion rows in a reasonable timeframe. From that point on I learned quite a lot about partitioning, balancing, debugging and choosing the right stages for the job (who would&#8217;ve thought that the RemoveDuplicates stage is waaaaay slower than Sort with the Remove Duplicates option? Why have an RD stage at all?). Anyhow, now I&#8217;m also an IBM Certified Solution Developer, InfoSphere DataStage v8.5 )</p>
<p>So my current ETL tools breakdown goes something like this (not counting PoCs and the like):</p>
<ul>
<li>Oracle Data Integrator: 3 projects</li>
<li>Pentaho Data Integration: 2 projects</li>
<li>IBM InfoSphere DataStage: 1 project</li>
</ul>
<p>And that&#8217;s my current preference list as well. I love ODI&#8217;s flexibility (it&#8217;s actually very simple once you get how it works, and it&#8217;s extremely configurable) and its ELT approach (I&#8217;d rather tune my DBMS than a DBMS plus a separate ETL engine). PDI is very open and quite user-friendly (compared to DS, for example), and it&#8217;s easier to understand &amp; debug than ODI. The PDI community edition is enough for most integration projects with small data volumes, and the enterprise version is very affordable. DataStage is terrifically well-suited to high-volume tasks and parallel processing, but is overkill for small projects.</p>
<p>It&#8217;s interesting that although I&#8217;ve done quite a bit of DWH model design, I have written just a few posts on the topic. But every time I think about writing out some advice, I conclude that the best advice is to just go <a href="http://ykud.com/blog/bicpm/bidwh-booklist">read the books</a>. And if you still have questions, reread them ) I&#8217;ve reread Kimball&#8217;s books a few times already, and every time I get an “ah, that&#8217;s what they meant” moment based on recent experience.</p>
<p>Anyhow, my last couple of major DWH projects were for government agencies, and I&#8217;ve collected a number of simple but effective modeling tips specifically for them. Hopefully I&#8217;ll write them up in the near future. Just need a free weekend or two.</p>
]]></content:encoded>
			<wfw:commentRss>http://ykud.com/blog/etl/new-recruit-to-my-etl-toolbox/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Using session variables in Cognos BI</title>
		<link>http://ykud.com/blog/ibm/using-session-variables-in-cognos-bi</link>
		<comments>http://ykud.com/blog/ibm/using-session-variables-in-cognos-bi#comments</comments>
		<pubDate>Mon, 22 Aug 2011 15:52:14 +0000</pubDate>
		<dc:creator>ykud</dc:creator>
				<category><![CDATA[bi]]></category>
		<category><![CDATA[etl]]></category>
		<category><![CDATA[ibm]]></category>

		<guid isPermaLink="false">http://ykud.com/blog/?p=676</guid>
		<description><![CDATA[Just a quick Cognos BI hint: you can use session variables to store project-level constant values. I&#8217;m a big fan of &#8216;feature-rich&#8217; ETL reports that show not only which dimension elements mismatch between systems, but also allow seamless editing of the element mapping. This usually means drilling down from the report into an external application for dimension mapping. Parameters [...]]]></description>
			<content:encoded><![CDATA[<div>
<p><a href="http://ykud.com/blog/wp-content/uploads/2011/08/owl_wink.jpg"><img class="alignleft size-medium wp-image-677" title="owl_wink" src="http://ykud.com/blog/wp-content/uploads/2011/08/owl_wink-300x233.jpg" alt="" width="126" height="98" /></a>Just a quick Cognos BI hint: you can use session variables to store project-level constant values.</p>
</div>
<div>I&#8217;m a big fan of &#8216;feature-rich&#8217; ETL reports that show not only which dimension elements mismatch between systems, but also allow seamless editing of the element mapping. This usually means drilling down from the report into an external application for dimension mapping. Parameters are usually passed via URL (the easiest possible way).</div>
<div>After the server name and, therefore, the URL changed for the second time in the current project, I set up a project-level &#8216;servername&#8217; constant to avoid an XML find\replace in each report.</div>
<div>It&#8217;s really easy:</div>
<p>1) Add a session variable to your Framework Manager project and put the required constant value there, e.g. &#8216;servername&#8217; = &#8216;awesomebi&#8217;.</p>
<p>2) Use it in Report Studio: just type</p>
<p>#sq($variable_name)#</p>
<p>in the expression editor. sq encloses the string in single quotes.</p>
]]></content:encoded>
			<wfw:commentRss>http://ykud.com/blog/ibm/using-session-variables-in-cognos-bi/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ETL Testing</title>
		<link>http://ykud.com/blog/etl/etl-testing</link>
		<comments>http://ykud.com/blog/etl/etl-testing#comments</comments>
		<pubDate>Fri, 24 Sep 2010 11:44:37 +0000</pubDate>
		<dc:creator>ykud</dc:creator>
				<category><![CDATA[etl]]></category>

		<guid isPermaLink="false">http://ykud.com/blog/?p=446</guid>
		<description><![CDATA[I&#8217;m a fan of testing in each DWH project, because it lets you: - be sure that at least the tested parts work - change logic without retesting all results Receiving &#8220;All OK&#8221; in the morning means that there&#8217;s time for a cup of coffee ) A list of approaches I use for testing ETL procedures: &#8216;Water in [...]]]></description>
			<content:encoded><![CDATA[<div>I&#8217;m a fan of testing in each DWH project, because it lets you:</div>
<div>- be sure that at least the tested parts work</div>
<div>- change logic without retesting all results</div>
<div>Receiving &#8220;All OK&#8221; in the morning means that there&#8217;s time for a cup of coffee )</div>
<div>A list of approaches I use for testing ETL procedures:</div>
<div><div id='toc' class='post-446'><div id='toc_title'></div>
<ul><li><a href="#Water-in-a-Sieve">&#8216;Water in a Sieve&#8217;</a></li>
<li><a href="#Excluded-Middle">&#8216;Excluded Middle&#8217;</a></li>
<li><a href="#Do-your-maths">&#8216;Do your maths&#8217;</a></li>
<li><a href="#Like-in-good-old-days">&#8216;Like in good old days&#8217;</a></li>
</ul>
</div></div>
<h3 id='Water-in-a-Sieve'>&#8216;Water in a Sieve&#8217;</h3>
<div>Checking whether we can carry all the required data into the DWH without &#8216;spilling&#8217; any of it.</div>
<div>Common things to check:</div>
<div>- row counts in source\DWH</div>
<div>- grand totals in source\DWH</div>
<div>Common mistakes found:</div>
<div>- precision errors</div>
<div>- missing dimension mappings</div>
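<div>A minimal sketch of this check in Python, using an in-memory SQLite database as a stand-in for both systems. The table names (src_sales, dwh_sales), the sieve_check helper and the tolerance value are assumptions made for illustration, not part of any particular project.</div>

```python
import math
import sqlite3


def sieve_check(conn, source_table, dwh_table, measure, tolerance=0.005):
    """Return a list of mismatches between a source table and its DWH copy."""
    query = "SELECT COUNT(*), COALESCE(SUM({m}), 0) FROM {t}"
    src_rows, src_total = conn.execute(query.format(m=measure, t=source_table)).fetchone()
    dwh_rows, dwh_total = conn.execute(query.format(m=measure, t=dwh_table)).fetchone()

    problems = []
    if src_rows != dwh_rows:
        problems.append("row count: source=%d, dwh=%d" % (src_rows, dwh_rows))
    # Totals are compared with an absolute tolerance so rounding noise is ignored.
    if not math.isclose(src_total, dwh_total, rel_tol=0.0, abs_tol=tolerance):
        problems.append("grand total: source=%.2f, dwh=%.2f" % (src_total, dwh_total))
    return problems


# Hypothetical demo data: both "systems" live in one in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (store TEXT, product TEXT, amount REAL);
    CREATE TABLE dwh_sales (store TEXT, product TEXT, amount REAL);
    INSERT INTO src_sales VALUES ('north', 'apple', 10.25), ('north', 'pear', 5.5), ('south', 'plum', 7.0);
    -- 'plum' has no dimension mapping, so its row never reached the DWH
    INSERT INTO dwh_sales VALUES ('north', 'apple', 10.25), ('north', 'pear', 5.5);
""")
problems = sieve_check(conn, "src_sales", "dwh_sales", "amount")
for problem in problems:
    print(problem)
```

<div>An empty list plays the role of the morning &#8220;All OK&#8221;; in the demo both the row count and the grand total are reported because one row was lost to a missing mapping.</div>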
<h3 id='Excluded-Middle'>&#8216;Excluded Middle&#8217;</h3>
<div>Checking the dimension mapping for a selected dimension. This extends the previous test by also comparing totals broken down by dimension.</div>
<div>So if we want to check whether &#8216;products&#8217; were mapped correctly, we compare totals (sums) by time, store, etc., listing all dimensions except products.</div>
<div>If the counts (sums) differ in source\DWH, we know that products were mapped incorrectly. Moreover, we know the specific data subset containing the error, which helps a lot as well.</div>
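<div>The same idea as a rough Python sketch, again on an in-memory SQLite database. The excluded_middle helper, the table names and the column names are hypothetical; real data would also need a tolerance instead of the exact comparison used here.</div>

```python
import sqlite3


def excluded_middle(conn, source_table, dwh_table, other_dims, measure):
    """Compare totals grouped by every dimension except the one being checked.

    Returns {group_key: (source_total, dwh_total)} for the groups that differ.
    """
    cols = ", ".join(other_dims)
    query = "SELECT {c}, SUM({m}) FROM {t} GROUP BY {c}"

    def totals(table):
        rows = conn.execute(query.format(c=cols, m=measure, t=table))
        return {tuple(row[:-1]): row[-1] for row in rows}

    src, dwh = totals(source_table), totals(dwh_table)
    # Exact comparison keeps the sketch short; use a tolerance for real data.
    return {
        key: (src.get(key), dwh.get(key))
        for key in sorted(set(src) | set(dwh))
        if src.get(key) != dwh.get(key)
    }


# Hypothetical demo data: the 'plum' product was lost during product mapping.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (day TEXT, store TEXT, product TEXT, amount REAL);
    CREATE TABLE dwh_sales (day TEXT, store TEXT, product TEXT, amount REAL);
    INSERT INTO src_sales VALUES
        ('2011-12-01', 'north', 'apple', 10.0),
        ('2011-12-01', 'north', 'plum', 7.0),
        ('2011-12-01', 'south', 'pear', 5.0);
    INSERT INTO dwh_sales VALUES
        ('2011-12-01', 'north', 'apple', 10.0),
        ('2011-12-01', 'south', 'pear', 5.0);
""")
diff = excluded_middle(conn, "src_sales", "dwh_sales", ["day", "store"], "amount")
print(diff)
```

<div>The keys of the result point at the day and store where the product mapping went wrong, which is the data subset to investigate.</div>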
<h3 id='Do-your-maths'>&#8216;Do your maths&#8217;</h3>
<div>Checking DWH calculations. If we&#8217;re transforming data or calculating something in the DWH, that should be tested as well.</div>
<div>Two approaches to testing:</div>
<div>- overall logic testing. For example, if we&#8217;re allocating HQ expenses to get regional reports, it&#8217;s logical to expect the overall sum of expenses to stay the same )</div>
<div>- testing a specific data subset. We can select a single account and verify logic on it.</div>
<div>These two approaches should be combined.</div>
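<div>Both approaches can be sketched on the HQ expenses example. The allocation rule (proportional to headcount), the account names and all figures below are made up for illustration.</div>

```python
import math


def allocate(amount, weights):
    """Split one amount across regions in proportion to the given weights."""
    total_weight = sum(weights.values())
    return {region: amount * w / total_weight for region, w in weights.items()}


# Hypothetical HQ expenses per account and headcount-based allocation weights.
hq_expenses = {"rent": 1200.0, "it": 300.0}
headcount = {"north": 30, "south": 50, "west": 20}

allocated = {account: allocate(amount, headcount) for account, amount in hq_expenses.items()}

# Overall logic test: allocation moves money around but must not create or lose any.
total_before = sum(hq_expenses.values())
total_after = sum(sum(by_region.values()) for by_region in allocated.values())
assert math.isclose(total_before, total_after, abs_tol=0.01)

# Subset test: verify the logic by hand on a single account and region.
assert math.isclose(allocated["rent"]["south"], 600.0, abs_tol=0.01)
```

<div>The first assertion catches allocations that leak or invent money anywhere; the second catches a rule that preserves the total but distributes it wrongly.</div>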
<h3 id='Like-in-good-old-days'>&#8216;Like in good old days&#8217;</h3>
<div>Checking some heuristic expectations, based on previously loaded data.</div>
<div>1) There&#8217;s no way daily sales can jump 70% compared to the quarterly average</div>
<div>2) We&#8217;re usually getting about this number of rows from this source</div>
<div>3) If we&#8217;re reloading 3 months of data daily, we expect only a modest amount of changes in past days.</div>
<div>If you&#8217;re using <a href="http://ykud.com/blog/bicpm/microsoft/microsoft-sql-server-reporting-database-configuration-practices">partitioned loads</a>, the &#8216;last_load&#8217; partition can be used for such testing.</div>
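<div>A small sketch of the first two expectations in Python. The 70% limit comes from the example above; the 30% limit on row counts, the function names and the sample numbers are assumptions.</div>

```python
from statistics import mean


def relative_change(history, today):
    """Relative change of today's value against the average of previous loads."""
    baseline = mean(history)
    return (today - baseline) / baseline


def heuristic_warnings(sales_history, sales_today, rows_history, rows_today,
                       max_sales_jump=0.7, max_rows_deviation=0.3):
    """Return warnings for values that look implausible given earlier loads."""
    warnings = []
    sales_change = relative_change(sales_history, sales_today)
    if abs(sales_change) > max_sales_jump:
        warnings.append("daily sales changed by %.0f%% against the average" % (sales_change * 100))
    rows_change = relative_change(rows_history, rows_today)
    if abs(rows_change) > max_rows_deviation:
        warnings.append("row count changed by %.0f%% against the average" % (rows_change * 100))
    return warnings


# Hypothetical history: sales almost doubled today, row count looks normal.
warnings = heuristic_warnings(
    sales_history=[100.0, 110.0, 90.0, 100.0], sales_today=180.0,
    rows_history=[1000, 1020, 980], rows_today=990,
)
for warning in warnings:
    print(warning)
```

<div>These are warnings rather than hard failures: a real jump in sales is possible, so a person should look at the load before anything is rejected.</div>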
<p></p>
<div>It&#8217;s best when tests are written by someone other than the procedure&#8217;s developer. You can always apply cross-checking (you write tests for my procedures, I write tests for yours).</div>
<div>Pay attention to Chapter 4 of <a href="http://www.amazon.com/Data-Warehouse-ETL-Toolkit-Techniques/dp/0764567578">ETL Toolkit</a>: this post is just a simple list of typical tests; the methodology is described there.</div>
]]></content:encoded>
			<wfw:commentRss>http://ykud.com/blog/etl/etl-testing/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Freebase GridWorks &#8212; a data-driven approach to ETL</title>
		<link>http://ykud.com/blog/etl/freebase-gridworks-a-data-driven-approach-to-etl</link>
		<comments>http://ykud.com/blog/etl/freebase-gridworks-a-data-driven-approach-to-etl#comments</comments>
		<pubDate>Tue, 11 May 2010 14:11:32 +0000</pubDate>
		<dc:creator>ykud</dc:creator>
				<category><![CDATA[etl]]></category>

		<guid isPermaLink="false">http://ykud.com/blog/?p=341</guid>
		<description><![CDATA[Take 5 minutes to watch the screencast for Freebase Gridworks, an interesting new approach to data transformation. Instead of the &#8216;transformation-based&#8217; approach of every other tool on the market, this tool uses a &#8216;data-based&#8217; approach, which looks rather intriguing. Especially the &#8216;Undo-Redo&#8217; function ) I keep wondering when to use this tool. User-driven data loads into a DWH? This [...]]]></description>
			<content:encoded><![CDATA[<p>Take 5 minutes to watch the screencast for <a href="http://code.google.com/p/freebase-gridworks/">Freebase Gridworks</a>, an interesting new approach to data transformation. Instead of the &#8216;transformation-based&#8217; approach of every other tool on the market, this tool uses a &#8216;data-based&#8217; approach, which looks rather intriguing. Especially the &#8216;Undo-Redo&#8217; function )</p>
<p>I keep wondering when to use this tool. User-driven data loads into a DWH? The tool lacks MDM capabilities for that case. Manual cleaning of CSV files?&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://ykud.com/blog/etl/freebase-gridworks-a-data-driven-approach-to-etl/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

