Sunday, June 21, 2009

Semantic Matrix

The Semantic Matrix is a simple, practical tool for showing the intersection of metrics and dimensions during the functional requirements-gathering phase. In the article below, Kimball describes how it works and recommends its use; at the end, an article by Margy Ross rounds out the subject.

Ralph Kimball

Over the years, I have found that a matrix depiction of the data warehouse plan is a pretty good planning tool once you have gathered the business requirements and performed a full data audit. This matrix approach has been exceptionally effective for distributed data warehouses without a center. Most of the new Web-oriented, multiple organization warehouses we are trying to build these days have no center, so it is even more urgent that we find a way to plan these beasts.

The matrix is simply a vertical list of data marts and a horizontal list of dimensions. Figure 1 (page 26) is an example matrix for the enterprise data warehouse of a large telecommunications company. You start the matrix by listing all the first-level data marts that you could possibly build over the next three years across the enterprise. A first-level data mart is a collection of related fact tables and dimension tables that is typically:

Derived from a single data source

Supported and implemented by a single department

Based on the most atomic data possible to collect from the source

Conformed to the “data warehouse bus.”

First-level data marts should be the smallest and least risky initial implementations of an enterprise data warehouse. They form a foundation on which a larger implementation can be brought to completion in the least amount of time, while still being guaranteed to contribute to the final result without becoming incompatible stovepipes.

You should try to reduce the risk of implementation as much as possible by basing the first-level data marts on single production sources. In my experience, the cost and complexity of data warehouse implementation, once the “right” data has been chosen, turns out to be proportional to the number of data sources that must be extracted. Each separate data source can be as much as a six-month programming and testing exercise. You must create a production data pipeline from the legacy source through the data staging area and on to the fact and dimension tables of the presentation part of the data warehouse.

In Figure 1, the first-level data marts for the telecommunications company are many of the major production data sources. An obvious production data source is the customer billing system, listed first. This row in the matrix is meant to represent all the base-level fact tables you expect to build in this data mart. Assume this data mart contains one major base-level fact table, the grain of which is the individual line item on a customer bill. Assume the line item on the bill represents the class of service provided, not the individual telephone call within the class of service. With these assumptions, you can check off the dimensions this fact table needs. For customer bills, you need Time, Customer, Service, Rate Category, Local Service Provider, Long Distance Provider, Location, and Account Status.
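As a rough sketch of what this matrix row implies physically (table and column names are hypothetical, not from the article), the billing fact table carries one foreign key per checked dimension plus the line-item measure:

```sql
-- Hypothetical sketch: the Customer Billing mart's base fact table,
-- grain = one line item on a customer bill, one foreign key per checked dimension.
CREATE TABLE fact_customer_billing (
    date_key                   INTEGER NOT NULL,   -- Time
    customer_key               INTEGER NOT NULL,   -- Customer
    service_key                INTEGER NOT NULL,   -- Service
    rate_category_key          INTEGER NOT NULL,   -- Rate Category
    local_provider_key         INTEGER NOT NULL,   -- Local Service Provider
    long_distance_provider_key INTEGER NOT NULL,   -- Long Distance Provider
    location_key               INTEGER NOT NULL,   -- Location
    account_status_key         INTEGER NOT NULL,   -- Account Status
    billed_amount              DECIMAL(12,2)       -- additive measure for the line item
);
```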

Continue to develop the matrix rows by listing all the possible first-level data marts that could be developed in the next three years, based on known, existing data sources. Sometimes I am asked to include a first-level data mart based on a production system that does not yet exist. I usually decline the offer. I try to avoid including “potential” data sources, unless there is a very specific design and implementation plan in place. Another dangerously idealistic data source is the grand corporate data model, which usually takes up a whole wall of the IT department. Most of this data model cannot be used as a data source because it is not real. Ask the corporate data architect to highlight with a red pen the tables on the corporate data model that are currently populated with real data. These red tables are legitimate drivers of data marts in the planning matrix and can be used as sources.

The planning matrix columns indicate all the dimensions a data mart might need. A real enterprise data warehouse contains more dimensions than those in Figure 1. It is often helpful to attempt a comprehensive list of dimensions before filling in the matrix. When you start with a large list of dimensions, it becomes a kind of creative exercise to ask whether a given dimension could possibly be associated with a data mart. This activity could suggest interesting ways to add dimensional data sources to existing fact tables. The Weather dimension in Figure 1 is an example of such a creative addition. If you study the details of Figure 1, you may decide that more X’s should be filled in, or that some significant dimensions should be added. If so, more power to you! You are using the matrix as it was intended.

Inviting Data Mart Groups to the Conforming Meeting

Looking across the rows of the matrix is revealing. You can see the full dimensionality of each data mart at a glance. Dimensions can be tested for inclusion or exclusion. But the real power of the matrix comes from looking at the columns. A column in the matrix is a map of where the dimension is required.


FIGURE 1 The Matrix Plan for the enterprise data warehouse of a large telecommunications company. First-level data marts are directly derived from production applications. Second-level data marts are developed later and represent combinations of first-level data marts.



The first dimension, Time, is required in every data mart. Every data mart is a time series. But even the Time dimension requires some thought. When a dimension is used in multiple data marts, it must be conformed. Conformed dimensions are the basis for distributed data warehouses, and using conformed dimensions is the way to avoid stovepipe data marts. A dimension is conformed when two copies of the dimensions are either exactly the same (including the values of the keys and all the attributes), or else one dimension is a perfect subset of the other. So using the Time dimension in all the data marts implies that the data mart teams agree on a corporate calendar. All the data mart teams must use this calendar and agree on fiscal periods, holidays, and workdays.

The grain of the conformed Time dimension needs to be consistent as well. An obvious source of stovepipe data marts is the reckless use of incompatible weeks and months across the data marts. Get rid of awkward time spans such as quad weeks or 4-4-5-week quarters.
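A minimal sketch of what conformance means in practice, with hypothetical names: a single daily-grain calendar table, defined once with the agreed fiscal periods, holidays, and workdays, and referenced by the fact tables of every mart.

```sql
-- One corporate calendar at daily grain, defined once and shared by every mart.
CREATE TABLE dim_date (
    date_key          INTEGER PRIMARY KEY,  -- e.g. 20090621
    calendar_date     DATE NOT NULL,
    fiscal_month      VARCHAR(10) NOT NULL,
    fiscal_quarter    VARCHAR(10) NOT NULL,
    holiday_indicator CHAR(1) NOT NULL,     -- 'Y' / 'N'
    workday_indicator CHAR(1) NOT NULL
);

-- A second, hypothetical mart references exactly the same rows and keys as the
-- billing fact table sketched earlier; identical copies (or perfect subsets) of
-- the dimension are what make it conformed.
CREATE TABLE fact_service_orders (
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
    customer_key INTEGER NOT NULL,
    order_count  INTEGER
);
```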

The second dimension in Figure 1, Customer, is even more interesting than Time. Developing a standard definition for “customer” is one of the most important steps in combining separate sources of data from around the enterprise. The willingness to seek a common definition of the customer is a major litmus test for an organization intending to build an enterprise data warehouse. Roughly speaking, if an organization is unwilling to agree on a common definition of the customer across all data marts, the organization should not attempt to build a data warehouse that spans these data marts. The data marts should remain separate forever.

For these reasons, you can think of the planning matrix columns as the invitation list to the conforming meeting! The planning matrix reveals the interaction between the data marts and the dimensions.

Communicating With the Boss

The planning matrix is a good communication vehicle for senior management. It is simple and direct. Even if the executive does not know much about the technical details of the data warehouse, the planning matrix sends the message that standard definitions of calendars, customers, and products must be defined, or the enterprise won’t be able to use its data.

A meeting to conform a dimension is probably more political than technical. The data warehouse project leader does not need to be the sole force for conforming a dimension such as Customer. A senior manager such as the enterprise CIO should be willing to appear at the conforming meeting and make it clear how important the task of conforming the dimension is. This political support is very important. It gets the data warehouse project manager off the hook and puts the burden of the decision making process on senior management’s shoulders, where it belongs.

Second-Level Data Marts

After you have represented all the major production sources in the enterprise with first-level data marts, you can define one or more second-level marts. A second-level data mart is a combination of two or more first-level marts. In most cases, a second-level mart is more than a simple union of data sets from the first-level marts. For example, a second-level profitability mart may result from a complex allocation process that associates costs from several first-level cost-oriented data marts onto products and customers contained in a first-level revenue mart. I discussed the issues of creating these kinds of profitability data marts in my October 26 column, “Not so Fast.”

The matrix planning technique helps you build an enterprise data warehouse, especially when the warehouse is a distributed combination of far-flung data marts. The matrix becomes a resource that is part technical tool, part project management tool, and part communication vehicle to senior management.

By Margy Ross



Many of you are already familiar with the data warehouse bus architecture and matrix given their central role in building architected data marts. Ralph’s most recent Intelligent Enterprise article (http://www.intelligententerprise.com/021030/517warehouse1_1.shtml) reinforces the importance of the bus architecture. The corresponding bus matrix identifies the key business processes of an organization, along with their associated dimensions. Business processes (typically corresponding to major source systems) are listed as matrix rows, while dimensions appear as matrix columns. The cells of the matrix are then marked to indicate which dimensions apply to which processes.

In a single document, the data warehouse team has a tool for planning the overall data warehouse, identifying the shared dimensions across the enterprise, coordinating the efforts of separate implementation teams, and communicating the importance of shared dimensions throughout the organization. We firmly believe drafting a bus matrix is one of the key initial tasks to be completed by every data warehouse team after soliciting the business’ requirements.

While the matrix provides a high-level overview of the data warehouse presentation layer “puzzle pieces” and their ultimate linkages, it is often helpful to provide more detail as each matrix row is implemented. Multiple fact tables often result from a single business process. Perhaps there’s a need to view business results in a combination of transaction, periodic snapshot or accumulating snapshot perspectives. Alternatively, multiple fact tables are often required to represent atomic versus more summarized information or to support richer analysis in a heterogeneous product environment.

We can alter the matrix’s “grain” or level of detail so that each row represents a single fact table (or cube) related to a business process. Once we’ve specified the individual fact table, we can supplement the matrix with columns to indicate the fact table’s granularity and corresponding facts (actual, calculated or implied). Rather than merely marking the dimensions that apply to each fact table, we can indicate the dimensions’ level of detail (such as brand or category, as appropriate, within the product dimension column).

The resulting embellished matrix provides a roadmap to the families of fact tables in your data warehouse. While many of us are naturally predisposed to dense details, we suggest you begin with the more simplistic, high-level matrix and then drill down into the details as each business process is implemented. Finally, for those of you with an existing data warehouse, the detailed matrix is often a useful tool to document the “as is” status of a more mature warehouse environment.

Sunday, June 14, 2009

Kimball University: The 10 Essential Rules of Dimensional Modeling

A great article, straight from the team of the pope of DW.



Follow the rules to ensure granular data, flexibility and a future-proofed information resource. Break the rules and you'll confuse users and run into data warehousing brick walls.


By Margy Ross
May 29, 2009


A student attending one of Kimball Group's recent onsite dimensional modeling classes asked me for a list of "Kimball's Commandments" for dimensional modeling. We'll refrain from using religious terminology, but let's just say the following are not-to-be-broken rules together with less stringent rule-of-thumb recommendations.

Rule #1: Load detailed atomic data into dimensional structures.

Dimensional models should be populated with bedrock atomic details to support the unpredictable filtering and grouping required by business user queries. Users typically don't need to see a single record at a time, but you can't predict the somewhat arbitrary ways they'll want to screen and roll up the details. If only summarized data is available, then you've already made assumptions about data usage patterns that will cause users to run into a brick wall when they want to dig deeper into the details. Of course, atomic details can be complemented by summary dimensional models that provide performance advantages for common queries of aggregated data, but business users cannot live on summary data alone; they need the gory details to answer their ever-changing questions.
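As a hedged illustration of this rule, with hypothetical table names: the atomic fact table is loaded at transaction-line grain, and any summary table is derived from it as a complement, never as a substitute.

```sql
-- Atomic grain: one row per sales transaction line (hypothetical names).
CREATE TABLE fact_sales_line (
    date_key     INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    store_key    INTEGER NOT NULL,
    quantity     INTEGER,
    sales_amount DECIMAL(12,2)
);

-- Complementary summary, derived from the atomic table for common queries;
-- assumes a dim_date table carrying a fiscal_month attribute.
CREATE TABLE fact_sales_month AS
SELECT d.fiscal_month,
       f.product_key,
       f.store_key,
       SUM(f.quantity)     AS quantity,
       SUM(f.sales_amount) AS sales_amount
FROM fact_sales_line f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.fiscal_month, f.product_key, f.store_key;
```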

Rule #2: Structure dimensional models around business processes.

Business processes are the activities performed by your organization; they represent measurement events, like taking an order or billing a customer. Business processes typically capture or generate unique performance metrics associated with each event. These metrics translate into facts, with each business process represented by a single atomic fact table. In addition to single process fact tables, consolidated fact tables are sometimes created that combine metrics from multiple processes into one fact table at a common level of detail. Again, consolidated fact tables are a complement to the detailed single-process fact tables, not a substitute for them.

Rule #3: Ensure that every fact table has an associated date dimension table.

The measurement events described in Rule #2 always have a date stamp of some variety associated with them, whether it's a monthly balance snapshot or a monetary transfer captured to the hundredth of a second. Every fact table should have at least one foreign key to an associated date dimension table, whose grain is a single day, with calendar attributes and nonstandard characteristics about the measurement event date, such as the fiscal month and corporate holiday indicator. Sometimes multiple date foreign keys are represented in a fact table.

Rule #4: Ensure that all facts in a single fact table are at the same grain or level of detail.

There are three fundamental grains to categorize all fact tables: transactional, periodic snapshot, or accumulating snapshot. Regardless of its grain type, every measurement within a fact table must be at the exact same level of detail. When you mix facts representing multiple levels of granularity in the same fact table, you are setting yourself up for business user confusion and making the BI applications vulnerable to overstated or otherwise erroneous results.

Rule #5: Resolve many-to-many relationships in fact tables.

Since a fact table stores the results of a business process event, there's inherently a many-to-many (M:M) relationship between its foreign keys, such as multiple products being sold in multiple stores on multiple days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, such as the multiple diagnoses associated with a health care encounter or multiple customers with a bank account. In these cases, it's unreasonable to resolve the many-valued dimensions directly in the fact table, as this would violate the natural grain of the measurement event. Thus, we use a many-to-many, dual-keyed bridge table in conjunction with the fact table.
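A minimal sketch of the bridge-table technique, using hypothetical names for the health care example: the fact table keeps its natural grain and points to a diagnosis group, and the bridge resolves the many-to-many relationship.

```sql
-- The fact table keeps the grain of one encounter; the group key is never null.
CREATE TABLE fact_encounter (
    encounter_date_key  INTEGER NOT NULL,
    patient_key         INTEGER NOT NULL,
    diagnosis_group_key INTEGER NOT NULL,
    charge_amount       DECIMAL(12,2)
);

-- Bridge: one row per (group, diagnosis), resolving the many-to-many relationship.
CREATE TABLE bridge_diagnosis_group (
    diagnosis_group_key INTEGER NOT NULL,
    diagnosis_key       INTEGER NOT NULL,
    weighting_factor    DECIMAL(5,4),    -- optional, for allocating measures
    PRIMARY KEY (diagnosis_group_key, diagnosis_key)
);

CREATE TABLE dim_diagnosis (
    diagnosis_key         INTEGER PRIMARY KEY,
    diagnosis_code        VARCHAR(10),
    diagnosis_description VARCHAR(200)
);
```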

Rule #6: Resolve many-to-one relationships in dimension tables.

Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you've spent most of your career designing entity-relationship models for transaction processing systems, you'll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.

It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
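A minimal sketch of a flattened dimension, with hypothetical names: the product, brand, and category levels of a fixed-depth M:1 hierarchy collapsed into a single denormalized table.

```sql
-- Product, brand and category collapsed into one flat dimension row.
CREATE TABLE dim_product (
    product_key         INTEGER PRIMARY KEY,  -- surrogate key
    product_code        VARCHAR(20),          -- natural key; 1:1 with its description
    product_description VARCHAR(200),
    brand_name          VARCHAR(100),         -- many products roll up to one brand
    category_name       VARCHAR(100)          -- many brands roll up to one category
);
```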

Rule #7: Store report labels and filter domain values in dimension tables.

The codes and, more importantly, associated decodes and descriptors used for labeling and query filtering should be captured in dimension tables. Avoid storing cryptic code fields or bulky descriptive fields in the fact table itself; likewise, don't just store the code in the dimension table and assume that users don't need descriptive decodes or that they'll be handled in the BI application. If it's a row/column label or pull-down menu filter, then it should be handled as a dimension attribute.

Though we stated in Rule #5 that fact table foreign keys should never be null, it's also advisable to avoid nulls in the dimension tables' attribute fields by replacing the null value with "NA" (not applicable) or another default value, determined by the data steward, to reduce user confusion if possible.
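As a hedged sketch (hypothetical names): both the code and its decode live in the dimension, and null source attributes are replaced with a default value during the load.

```sql
CREATE TABLE dim_order_status (
    order_status_key         INTEGER PRIMARY KEY,     -- surrogate key
    order_status_code        VARCHAR(10)  NOT NULL,   -- cryptic operational code
    order_status_description VARCHAR(100) NOT NULL    -- decode used for labels and filters
);

-- During the load, null source attributes are replaced with a default value
-- (hypothetical staging table; surrogate key assignment simplified).
INSERT INTO dim_order_status (order_status_key, order_status_code, order_status_description)
SELECT stg.status_key,
       COALESCE(stg.status_code, 'NA'),
       COALESCE(stg.status_text, 'Not applicable')
FROM stg_order_status stg;
```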

Rule #8: Make certain that dimension tables use a surrogate key.

Meaningless, sequentially assigned surrogate keys (except for the date dimension, where chronologically assigned and even more meaningful keys are acceptable) deliver a number of operational benefits, including smaller keys which mean smaller fact tables, smaller indexes, and improved performance. Surrogate keys are absolutely required if you're tracking dimension attribute changes with a new dimension record for each profile change. Even if your business users don't initially visualize the value of tracking attribute changes, using surrogates will make a downstream policy change less onerous. The surrogates also allow you to map multiple operational keys to a common profile, plus buffer you from unexpected operational activities, like the recycling of an obsolete product number or acquisition of another company with its own coding schemes.
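A minimal sketch of a dimension keyed by a surrogate, with the housekeeping columns commonly added to track attribute changes as new rows (type 2 changes); names are illustrative.

```sql
-- Surrogate key plus the columns needed to add a new row for each profile change.
CREATE TABLE dim_customer (
    customer_key          INTEGER PRIMARY KEY,   -- meaningless surrogate assigned by the ETL
    customer_id           VARCHAR(20) NOT NULL,  -- operational (natural) key; repeats across versions
    customer_name         VARCHAR(100),
    customer_segment      VARCHAR(50),           -- attribute whose history is tracked
    row_effective_date    DATE NOT NULL,
    row_expiration_date   DATE,                  -- open or high date for the current version
    current_row_indicator CHAR(1) NOT NULL       -- 'Y' on the version new facts should join to
);
```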

Rule #9: Create conformed dimensions to integrate data across the enterprise.

Conformed dimensions (otherwise known as common, master, standard or reference dimensions) are essential for enterprise data warehousing. Managed once in the ETL system and then reused across multiple fact tables, conformed dimensions deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes. The Enterprise Data Warehouse Bus Matrix is the key architecture blueprint for representing the organization's core business processes and associated dimensionality. Reusing conformed dimensions ultimately shortens the time-to-market by eliminating redundant design and development efforts; however, conformed dimensions require a commitment and investment in data stewardship and governance, even if you don't need everyone to agree on every dimension attribute to leverage conformity.
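As a hedged sketch of drilling across two processes through a conformed product dimension (all names hypothetical), each fact table is aggregated separately and the results are merged on the shared attribute.

```sql
-- Each process is aggregated separately through the shared dim_product,
-- then the two result sets are merged on the conformed attribute.
SELECT COALESCE(s.brand_name, f.brand_name) AS brand_name,
       s.sales_amount,
       f.forecast_amount
FROM (SELECT p.brand_name, SUM(fs.sales_amount) AS sales_amount
      FROM fact_sales fs
      JOIN dim_product p ON p.product_key = fs.product_key
      GROUP BY p.brand_name) s
FULL OUTER JOIN
     (SELECT p.brand_name, SUM(ff.forecast_amount) AS forecast_amount
      FROM fact_forecast ff
      JOIN dim_product p ON p.product_key = ff.product_key
      GROUP BY p.brand_name) f
  ON f.brand_name = s.brand_name;
```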

Rule #10: Continuously balance requirements and realities to deliver a DW/BI solution that's accepted by business users and that supports their decision-making.

Dimensional modelers must constantly straddle business user requirements along with the underlying realities of the associated source data to deliver a design that can be implemented and that, more importantly, stands a reasonable chance of business adoption. The requirements-versus-realities balancing act is a fact of life for DW/BI practitioners, whether you're focused on the dimensional model, project strategy, technical/ETL/BI architectures or deployment/maintenance plan.

If you've read our Intelligent Enterprise articles, Toolkit books or monthly Design Tips regularly, these rules shouldn't be news to you, but here we've consolidated our rules into a single rulebook that you can refer to when you are gathered to design or review your models.

Good luck!

Modeling

Data Modeling

As with production databases (OLTP), the need for fast SQL retrieval is very great, but in a DW it is even greater, because the volume of data in the DW is much larger than in an operational database. This is precisely why DW databases are not fully normalized the way production databases are. Normalization is extremely useful for saving storage space, since its whole purpose is to reduce duplication, but it means creating more tables and makes queries more complex and therefore slower. To gain query performance, the DW is not fully normalized; in other words, it allows certain kinds of duplication in the data.

In an operational database, data is stored in tables; in a data warehouse, data is stored in a cube format, the data cube. The cube is the figure that represents the various interrelated dimensions of data, characteristic of a multidimensional system.

Dimensional Modeling

In a DW, the model must be designed to meet the needs of the business; there is no single right or wrong model. One of the best ways to do this modeling well is to use dimensional modeling.

Dimensional modeling is a logical design technique that seeks to present data in a standard form that is intuitive to the people who access it and that allows high-performance access.

It differs from entity-relationship (E-R) modeling, which seeks to eliminate redundancy in order to save space. Eliminating redundancy creates many entities in the model, which makes the model harder to understand for users without an advanced background. E-R modeling also forces SQL retrievals to use more complex joins, uniting many tables, which slows down the results.



The components of the dimensional model are:

  • Facts: observations arising from the running of the business. They are not known in advance, that is, their value is not known until the event has happened. They consist basically of numeric fields. Example: “quantity of products sold.”
  • Business attributes: descriptions of the characteristics of the business. They are known in advance and are characterized by textual fields. Example: product name.

Tables of the dimensional model:

  • Fact tables: the tables that hold the business data, all the information arising from the running of the business that is not known in advance. Facts may be additive, that is, they can be accumulated, and they have continuous values.
  • Dimension tables: the tables that hold the business attributes. They can be used to constrain queries against the fact tables, and they serve as column headings.

There are several ways of representing data in conventional database environments. We can generalize, without losing information, as follows:

  • Dates and times: one of the main characteristics of a DW, through its operations, is the ability to analyze data historically. Since the possibilities for analysis come from constraining dimensions and grouping data, dates and times are always a good indication of dimension attributes, not facts.
  • Textual: textual data are descriptions that are easy for people to understand, so it is natural to use them to describe elements of the business. Since the possibilities for analysis come from constraining dimensions and grouping data, textual descriptions are always a good indication of dimension attributes, not facts.
  • Additive facts: they are numeric and can be summed across the existing dimensions, because their semantics allow such operations. Whenever a numeric value appears in a model, it is a good indication of a fact attribute. In general, additive facts represent measures of business activity. Examples: sale amount, quantity of products sold.
  • Semi-additive facts: also numeric, but they cannot be summed across all existing dimensions, because their semantics do not allow it. In general, semi-additive facts represent readings or measures of the intensity of the business; snapshots of these readings are what enter the DW, and the current value already takes past values into account. Examples: inventory level, daily/monthly account balance.
  • Non-additive facts: some observations that are neither numeric nor dates/times may still be facts of the events. Free-format text fields are poor candidates for either dimensions or facts; fields such as “remarks” are typical examples, since their domain is unrestricted. Example: in a DW recording traffic accidents we might have car 1, car 2, motorcycle 1, motorcycle 2, date/time, description of the accident, description of the weather (rain, ...) and description of the road. The table and SQL sketch below summarize and illustrate these categories.

              Additive         Semi-Additive       Non-Additive
Data type     Numeric          Numeric             Textual and logical
Operation     Sums             Averages            Counts
As facts      Good facts       Reasonable facts    Poor facts
Browsing      Easy browsing    Medium browsing     Medium browsing
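The difference between additive and semi-additive facts shows up directly in how they can be aggregated. A minimal SQL sketch, assuming hypothetical fact_sales and fact_daily_inventory tables:

```sql
-- Additive: a sales amount can be summed across every dimension, including time.
SELECT product_key, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY product_key;

-- Semi-additive: a stock-level snapshot can be summed across products or stores,
-- but over time it must be averaged (or the latest snapshot taken), never summed.
SELECT product_key, AVG(stock_level) AS avg_stock_level
FROM fact_daily_inventory
GROUP BY product_key;
```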

Granularity

Granularity is the most specific level of representation used to store the data. Each grain represents, in essence, an entity within the E-R model. In many situations there are hierarchies of concepts, in which the model sits at the lowest level. A database with the highest granularity is an extremely normalized database.

The level of granularity of a DW must be studied so that the warehouse is as simple as possible for any user to understand and use. The lower the level of granularity chosen for the DW, the greater the amount of data stored (closer to the transactional system). The granularity chosen for the DW does not have to be the same as the one used in the operational system, which is why it will always be equal to or higher (more aggregated).

Star Schema

A star schema has an entity that contains the instance identifier, the values of the descriptive dimensions for each instance, and the values of the facts, or measures, for that instance. This entity is called the fact table. The fact table has at least one dimension table which, as defined above, stores the data about the business attributes. In a very simple case, the dimension table has one row for each valid value of the dimension; these values correspond to the values found in the column for that dimension in the fact table.

[Figure: example star schema, with the fact table in the center linked to the surrounding dimension tables]

The figure above shows an example of the star schema, in which the fact table sits in the center and the dimension tables sit at the points. While the fact table connects to the dimension tables through several relationships, each dimension table connects only to the fact table, through a single relationship.

The fact table is where the numeric measures of the fact being represented are stored. Each of these measures is taken at the intersection of all the dimensions. In the example, a typical query would select facts from the FATOSVENDAS table based on values supplied for each dimension.
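Assuming hypothetical dimension tables and column names around FATOSVENDAS, such a query might look like the following sketch:

```sql
-- Constrain each dimension, join through the fact table's foreign keys and
-- aggregate the measures taken at the intersection of the dimensions.
SELECT t.year, p.product_name, s.city,
       SUM(f.quantity_sold) AS quantity_sold,
       SUM(f.sales_amount)  AS sales_amount
FROM FATOSVENDAS f
JOIN dim_time    t ON t.time_key    = f.time_key
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_store   s ON s.store_key   = f.store_key
WHERE t.year = 2009
  AND s.city = 'Sao Paulo'
GROUP BY t.year, p.product_name, s.city;
```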

Below is a brief explanation of the variations of the star schema:

Star schema with multiple fact tables: occurs when there are unrelated facts, making it possible to have more than one fact table, or when the load frequency of the operational data differs. Example: a fact table of actual sales and a fact table of forecast sales.

Associative tables: some keys may be unfolded in the fact table when there are many-to-many relationships.

Star schema with outboard tables: in this schema, one or more dimension tables may contain a foreign key that references the primary key of another dimension table, which may also be called a secondary dimension table.

Snowflake schema: the snowflake schema is a variation of the star schema in which all dimension tables are normalized to third normal form (3NF). It reduces redundancy, but it increases the complexity of the schema and consequently makes it harder for users to understand. It also complicates the implementation of data visualization tools and prevents the use of more efficient indexing schemes such as bitmap indexing.
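As a hedged illustration of the extra joins a snowflake introduces, the product dimension of a star could be normalized into 3NF sub-tables like these (hypothetical names):

```sql
-- The flat product dimension of the star is split into normalized sub-tables,
-- so every query that needs brand or category pays for extra joins.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(100)
);

CREATE TABLE dim_brand (
    brand_key    INTEGER PRIMARY KEY,
    brand_name   VARCHAR(100),
    category_key INTEGER NOT NULL REFERENCES dim_category (category_key)
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(200),
    brand_key    INTEGER NOT NULL REFERENCES dim_brand (brand_key)
);
```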

Multi-star tables: so called when the key of the fact table is complemented by additional fields that are not dimensions.

Constellations: when there are multiple fact tables that share the same dimensions, we say the schema is a fact constellation. This happens when the measures in the fact tables differ with respect to the events that generate them. Examples: actual sales versus forecast sales, or unit sales versus discounts on bundled sales.

Advantages of the star model:

  • The star model has a standard, predictable architecture. Query tools and user interfaces can take advantage of this to make their interfaces friendlier and their processing more efficient;
  • All dimensions in the model are equivalent, that is, they can be seen as symmetric entry points into the fact table. The user interfaces are symmetric, the query strategies are symmetric, and the SQL generated from the model is symmetric;
  • The dimensional model is fully flexible in supporting the inclusion of new data elements, as well as changes that occur in the design. This flexibility shows itself in several ways, among them:

o All fact and dimension tables can be changed simply by adding new columns to the tables;

o No query or reporting tool needs to be changed to accommodate the changes;

o All applications that existed before the changes continue to run without problems;

  • There is a set of standard approaches for handling common situations in the business world. Each of these has a well-defined set of alternatives that can then be specifically programmed into report generators, query tools and other user interfaces. Among these situations are:

o Slowly changing dimensions: occur when a given dimension evolves slowly and asynchronously;

o Heterogeneous products: when a business, such as a bank, needs to track different lines of business together, within a common set of attributes and facts, while at the same time it needs to describe and measure the individual lines of business using incompatible measures;

  • Another advantage is that a growing number of administrative utilities and software processes are able to manage and use aggregates, which are essential for good response performance in a data warehouse.

Multidimensional Databases

The figures below show the traditional way of representing data in a relational database and the representation of a two-dimensional matrix.

Relational DB

Model    Color    Sales
Van      Blue     6
Van      Red      5
Van      White    4
Coupe    Blue     3
Coupe    Red      5
Coupe    White    5
Sedan    Blue     4
Sedan    Red      3
Sedan    White    2

Two-Dimensional Matrix

Model    Blue    Red    White
Van      6       5      4
Coupe    3       5      5
Sedan    4       3      2

As can be seen, in the first table the data are laid out in the traditional relational fashion. It is clear that the second layout is cleaner and easier to understand, because the sales values sit at the intersections of the X and Y axes, forming a 3x3 two-dimensional matrix.
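For readers who want to see the equivalence in SQL terms, a pivot query like the sketch below (table and column names are hypothetical) turns the relational rows into the two-dimensional matrix shown above:

```sql
-- One row per model, one column per color: the relational rows become the matrix.
SELECT model,
       SUM(CASE WHEN color = 'Blue'  THEN sales ELSE 0 END) AS blue,
       SUM(CASE WHEN color = 'Red'   THEN sales ELSE 0 END) AS red,
       SUM(CASE WHEN color = 'White' THEN sales ELSE 0 END) AS white
FROM car_sales
GROUP BY model;
```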

In a two-dimensional matrix, values are stored very easily: to add more, it is enough to add another column or another row.

A multidimensional array has a fixed number of dimensions. The positions along a dimension are also called elements or members. Dimensions may be composed of multiple levels, by which the data can be grouped. The members at the lowest level of a dimension’s hierarchy are called input members.

Data models

Data modeling is very important in the development of a DW. When it is done uniformly across all aspects of the development, maintenance is much simpler and much less effort is needed when all the data models involved in the project have to be brought together, because every component of the system will be using the same data structure.


10 Mistakes to Avoid in a Business Intelligence Delivery

A very good article. Before starting a project, hand it to the team.

At least you won't get the basics wrong ....


Lalitha Chikkatur

Behind every success or failure are people. People are the only differentiators. Every data warehousing (DW) and business intelligence (BI) project, whether successful or not, teaches us something. It is generally on failures that we base our new successes. Having said that, you don't have to fail yourself in order to learn; you can also learn from others' failures, 10 of which are discussed here.

Mistake #1: A non-BI background project manager managing the end-to-end delivery of a BI initiative.

BI project management requires different techniques and methods to succeed. The breakthroughs in work process and methodology that form the foundation of data warehouse delivery include concepts such as iterations and phased delivery, and from a non-data-warehouse perspective it is hard to appreciate how truly revolutionary and critical these concepts are for successful BI delivery.

A project manager who drives the complete BI initiative from end to end has to be educated on at least the basics of DW and BI to deliver the BI project successfully. No matter how successful an individual may be, or how much expertise he or she has in managing non-data-warehousing projects, he or she will never deliver DW projects successfully without understanding the phased delivery approach of a data warehouse. It is often very challenging to convince a non-DW project manager that the analysis and design phases in DW projects run side by side, not one after the other as in traditional project delivery. If this important aspect is ignored, the schedule and budget will suffer, as one always encounters changing requirements in DW projects, whether one likes it or not. Additionally, the fallout will be arguments and politics rather than a focus on technical solutions.

Every project delivery requires a methodology, which a project manager uses to deliver the project successfully. A project manager who by definition plans, controls and reviews all project activities must understand that a data warehousing project delivery cannot use the traditional “waterfall” methodology. The data warehouse methodology must take into account the fact that the delivery of BI projects happens in iterations. The success of data warehousing projects is in its phased approach.

A project manager who is not knowledgeable about BI is not able to make appropriate staffing selections for his team. The team suffers from the lack of guidance in the leadership role just as much as the goal of the BI initiative suffers from the management.

Mistake #2: Being in a “pleasing” mode with the clients rather than concentrating on feasibility and value-add from the BI project.

The client sponsoring the DW project and the end users have to accept the solution being built by the implementation team; there is no doubt about this fact. At the end of the day, the solution has to be liked and should demonstrate value-add to the clients, and the time and effort spent on a particular initiative should demonstrate value for money. But a word of caution: in this process, the implementation team, which is most often a service provider company offering offshore support as well, should not slip into the "pleasing" mode with the clients and users. It might not be practically possible to implement the client's entire wish list, and this should be communicated in a firm but polite way. The requirements driving the DW initiative should be validated very critically so that the best solution can be built. What cannot be done should be communicated as clearly as what can be done. Clients will definitely appreciate and welcome this kind of assessment in the initial stages of a project, rather than receiving explanations about architecture and infrastructure just before production or during user acceptance testing, when it is too late.

Mistake #3: Assuming service provider companies own everything about the successful delivery of the project.

This is yet another critical factor for a successful BI delivery. A service provider who has signed a contract to put the BI project into production definitely has ownership of the delivery. That being said, the delivery cannot be a success without active participation from the client and end users in each stage and phase of the entire lifecycle. Service providers are specialized consultants who can give you options and best practices, much like a professional home decorator. Because it is your home, you have to give the consultant your input and exact specifications. If this does not happen, the decorator will decorate the home according to his assumptions about your likes and dislikes, which you may or may not approve of. And if this happens at the last minute, not only will you end up paying for the work that has already been done, but you will also invest more time and money in rework. Without active and adequate client involvement at every phase, no BI delivery can ever have assured success.

Mistake #4: Bringing in a solution architect halfway into the project and assuming that he/she is going to magically fulfill all the deficiencies.

This is the most common scenario one sees in most of the BI projects. When things are not happening the way they should, the management thinks the immediate remedy is to get a solution architect. What one has to understand is that a solution architect cannot just walk in and wield a magic wand to set things right. The solution architect’s experience will determine how soon he/she can start delivering the value-adds. Also, the time at which you bring in the solution architect is a driving factor for success. Often when it comes to BI architecture, business users are from Mars and IT people are from Venus. To get them to a common platform is in itself a premium skill for a solution architect.

Every BI delivery must have a solution architect with expertise and wide skills in DW and BI. This is vital, as they bring a wealth of knowledge from reference architectures and similar implementations with them. This “ready-recon” eventually reduces the cost and time to implement technology solutions.

The best time to start their involvement is the analysis stage itself. If not then, do it at least before you spend a large amount of time exploring technology options and assessing the appropriate solutions. Carefully set expectations for the value a solution architect will deliver. If you engage a solution architect when your data model is near completion and expect the architect to work magic and produce a performance-tuned, efficient design, you are expecting too much, as things will already have passed certain stages involving a good amount of time and cost. At that stage, again, issue resolution becomes more political than technical.

Mistake #5: Lack of the right people with the right skills evaluating BI tools for the implementation.

When a new BI tool is being evaluated, a big crowd of stakeholders is often involved in the evaluation. This might include people who are directly or indirectly associated with the BI initiative for which the tool is being evaluated. If a formal process is not followed, it can lead to endless arguments and loops without ever closing the evaluation. Each individual will come up with their own comments and wishes, and this will lead to a laundry list of features expected from the tool. Any BI tool is just a medium that helps deliver a definite functionality. What makes the initiative a success is the right people selecting the right kind of tool to deliver the right functionality.

Key people evaluating the BI tool should clearly understand that each BI tool is pretty much designed for some defined, specific purpose. No tool as such is good or bad. It is the decision-makers, not the tools themselves, that make an implementation a success. Imagine if a director whose area of expertise throughout his whole career has been the infrastructure domain is given the authority to make the decision on an analytic tool. You guessed right: it would be a disaster.

The BI tool evaluation team must include a combination of the BI team, the BI solution architect, users and the procurement team. Input from the team should be considered and analyzed to a specified extent to avoid analysis paralysis in the evaluation. If the right people keep their expectations straight, the evaluation process should be relatively smooth. The important factor to consider while evaluating is what exactly the problem statement is. Answer questions like: is the tool expected to serve a multiterabyte data warehouse, or one of less than a terabyte? This will be a big driving factor, as your choice of tools will depend on it. You don't need a tool that comes at a premium price to serve a data warehouse of less than a terabyte.

Mistake #6: Business users driving the data modeling.

This could be one of the biggest mistakes, and it can cause the complete BI initiative to fail. The data model is the heart of the data warehouse and will determine all other aspects such as performance, ease of reporting and scalability. There is no doubt that active participation of business users in data modeling is required, but the modeling should be done by modelers who specialize in data modeling and dimensional modeling. Business users have to define and explore the links and dependencies between the various business areas or subject areas; they have complete ownership of understanding the data. With this knowledge as input, data modelers have to define the most appropriate way of placing each measure and dimension of the subject areas in a star schema or a customized schema, whichever is appropriate for that environment. In this exercise, data modelers have to take ownership of designing the schema and be very careful in defining things, even if that means being firm sometimes. They have to be careful of a situation I once faced: a business user was pestering me to define a data field called "premium" in investment banking as a dimension. This field was holding currency data. It took me two full days to explain and convince the user why that field would not qualify as a dimension and should actually be a measure.

When they let themselves be swayed by business users, data modelers easily fall into a pleasing mode. They are then likely to deviate from the modeling rules, which, down the line, will lead to a lot of rework and remodeling.

Mistake #7: Counting on your vendor to deliver all that they represented in the presentation.

Avoid overdependence on a vendor's claims about their product and its performance. Everyone wants to be the best when presenting their products, and the value and performance of the products always look promising; they just might be. But one must do one's own homework about the product rather than blindly accepting vendor presentations and claims. The key here is that tools are not the only critical success factor for a successful data warehouse. There may be instances where the capabilities being evaluated are not available in the current release; the vendor might showcase them as planned for the next release. Making decisions based on these kinds of assumptions is very risky, for the simple reason that you are building a hard dependency on the delivery of a separate entity over which you have no control. When making a decision about a tool on which you are spending a fortune, collect references from the vendor and network with them to see how the product has been doing in a similar environment. You should leverage references where work has already been completed. Apart from the references, it may also be worthwhile to consider the analyses of those products done by research firms such as Gartner and Forrester.

Mistake #8: Assuming data quality can be managed “somehow.”

As we speak about the maturity of the data warehouse today, there is still a lot to be explored and learned about the severity of the data quality problem. Assuming data quality can somehow be taken care of may lead to a lot of inconsistencies in the downstream systems. The quality of the data has to be checked and cleansed at the source, or at least before it enters the data warehouse; it is inappropriate to do quality checks in the data warehouse itself.

Companies depend heavily on information to make decisions regarding profit, effective operation and customer satisfaction. Inaccuracy and inconsistency in the data will hinder the company's ability to compete. An effective data quality program is almost a must in these maturing systems; it allows companies to analyze better and make more meaningful decisions. The data quality initiative need not start as a big bang or with the purchase of an expensive tool. It can be a step-by-step approach that is automated with a tool once it has matured in the organization.
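A minimal sketch of the kind of check meant here, assuming hypothetical staging and reject tables: rows that fail basic rules are diverted before they ever reach the warehouse.

```sql
-- Divert rows that violate basic rules to a reject table before the warehouse load.
INSERT INTO stg_sales_rejects
SELECT *
FROM stg_sales
WHERE customer_id IS NULL
   OR sales_amount < 0;

-- Load only the rows that passed the checks.
INSERT INTO stg_sales_clean
SELECT *
FROM stg_sales
WHERE customer_id IS NOT NULL
  AND sales_amount >= 0;
```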

Mistake #9: Overdependency on contractors and ignoring the need to build BI capabilities in house.

Hiring contractors for specialty skills has benefited data warehouse delivery within organizations. However, contractors benefit specific projects, not the effort on an ongoing basis. If there is overdependency on contractors who come in, do good work and leave upon delivery, they not only take their deep knowledge with them but are also unavailable for clarifications and fixes if the need arises. Don't treat contractors as employees: draw a very clear line between what contractors can help with and what stays in house.

Contractors can very easily be caught up in office politics. This is even truer if your contractors are coming from a specific software vendor. It is practically not possible for them to be unbiased about their products. This is where a knowledgeable in-house person with BI skills is able to evaluate and advise the clients if anything is derailing or if the technology is just not fit for their environment.

Mistake #10: Assuming you are done once the data warehouse project is in production.

The nature of a data warehouse is change, becoming more and more productive with each iteration, and so is its delivery. Once the project is in production, you have just completed a phase; you are still not done. You have a new world to explore and an application to keep improving continuously. The investment in data warehousing projects is always shared between actual delivery and research.

With the successful completion of the implementation phase, you should research what other systems can benefit from it and which other systems can be integrated so that you get the best value and results. This keeps on going, both from the point of view of improvement of the existing phase and of initiating new phases.

We have all faced the scenarios mentioned in this article at one time or another during a BI delivery. I have addressed 10 common mistakes that impact a BI delivery, all of which I have seen myself. I solved a few of them and was able to prevent some damage; for others I was just a witness and could not do anything. I believe these precious lessons, learned the hard way over the years, can help my BI peers.