Test Blog: database basics

Showing posts with label database basics. Show all posts

Saturday, November 28, 2009

Kimball vs. Inmon: The TDWI perspective

Surfing through the blog posts trying to be "inspired", I stumbled upon a year old post comparing Kimball and Inmon's architecture. I learned about the differences between their architectures during my courses at the TDWI conference. Only having read Kimball's literature, I was intrigued by this rivalry.

As it was explained to me, Kimball's "Bus Architecture" defines a data warehouse as the combination of all the data marts, which would make the data warehouse responsible for the intake, integration, distribution, delivery, and access of data. Comparatively, Inmon's "Hub and Spoke Architecture" defines the data warehouse as "a subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs." Basically, it is responsible for the intake, integration, and distribution of the data. The argument proposed in the course was which was better to use?

To me, it seems this depends on the level of the BI program at the business in question. Is your business new to BI or finding their previous BI project poorly integrated? You could do well to subscribe to Kimball's approach, as it quickly allows the users to get what they need. However, as the BI program matures and more data marts are developed the issue of maintaining the "bus," the rules that define the conformed dimensions necessary for the architecture, become harder to keep aligned. This is where Inmon's approach makes sense, by pushing the conformity back to the warehouse, it is easier to administrate changes to rules and enforce integrity.

True, the line where it makes sense to implement the "hub" in the Inmon's architecture is hard to draw, but aside from that blurred area, I'm not sure there is much to arguing one as "better" than the other. To me it's simply a matter of BI Program Maturity.

Thursday, February 5, 2009

SQL: Just the Basics

What is SQL?

SQL used be "SEQUEL" and stood for 'Structured English Query Language', but it was changed to SQL and now stands for 'Structured Query Language.' Supposedly, the change was due to a trademark infringement with another company.

SQL is what users like you and I use to access and manipulate databases.

SQL can....

execute queries against a database
retrieve data from a database
insert records in a database
update records in a database
delete records in a database
create new databases
create new tables in a database
create stored procedures in a database
create views in a database
etc.

SQL is just a general term used to describe the language used to interact with databases.

There are in fact several variations of SQL that are widely used; such as,

SQL/PSM -- SQL/Persistent Stored Modules (ANSI/ISO Standard)

PSQL -- Procedural SQL (Interbase/Firebird)

SQL PL -- SQL Procedural Language (IBM)

T-SQL -- Transact SQL (Microsoft/Sybase)

SQL/PSM - SQL/Persistent Stored Module (MySQL)

PL/SQL -- Procedural Language/SQL (Oracle)

PL/pgSQL -- Procedural Language/PostgreSQL Structured Query Language (PostgreSQL)

PL/PSM -- Procedural Language/Persistent Stored Modules (PostgreSQL)

I won't go into detail because that is for a later date and a different post.

As the title states, this is just the basics of SQL.

Before we get started, let me explain that I will only be covering the DML part of SQL. To refresh your memory, DML stands for Data Manipulation Language. We'll probably go over the DDL (Data Definition Language) part later.

So, let's get started!

Below are two tables: Employee table and Salary table.

These tables will be used in each exercise to demonstrate how each SQL command works.

In nearly every query, the commands SELECT and FROM will always be used. Why? Because in order to retrieve data from databases, we need to SELECT what it is that we want to see and need to specify FROM which table(s) to retrieve the data.

SELECT

This statement is used to select data from a database.

The syntax for SELECT is:

SELECT table_name.column_name(s)

FROM table_name;

**NOTE:

1. SQL is not case sensitive, but it is good practice to keep all caps for easy readability.

2. 'FROM' does not have to be on a separate line; however, keeping it on a separate line is considered best practice due easy readability.

3. Placing the name of the table with the column name is best practice because future queries will require joining tables and tables may have the same column name and the only way the query can distinguish between columns with similar names is by indicating what tables the columns belong to.

An example:

From the Employee table, select the last name, first name, and occupation columns.

SELECT Employee.Lastname, Employee.Firstname, Employee.Occupation

FROM Employee;

SELECT DISTINCT

This statement is used when selecting only distinct, meaning non-repetitive, unique, data from a table.

The syntax for SELECT DISTINCT is:

SELECT DISTINCT table_name.column_name(s)

FROM table_name

An example of SELECT DISTINCT:

From the Employee table, select distinct values from the Location’s column.

SELECT DISTINCT Employee.Location

FROM Employee;

WHERE

This clause is used to select records that meet a specific condition(s).

The syntax for WHERE is:

SELECT table_name.column_name(s)

FROM table_name

WHERE table_name.column_name 'operator value';

Examples of operator values:

'=' (equal)

'>' (greater than)

'<' (less than) '>=' (greater than or equal to)

'<=' (less than or equal to) '<>' (not equal to)

'LIKE' (string comparison test)

An example of WHERE:

From the Employee table column, select all of the columns where the Location column is equal to Minneapolis.

SELECT *

FROM Employee

WHERE Employee.Location = ‘Minneapolis’;

**NOTE: When the condition references a text value, then you must enclose the condition with single quotes; however, numerical values do not require single quotes.

AND & OR

These operators are used when there are more than 1 condition in the query. They can be used separately or together.

The syntax for each one is:

SELECT table_name.column_name(s)

FROM table_name

WHERE table_name.column_name 'operator value' AND table_name.column_name 'operator value';

(This query means that BOTH conditions must be met in order for the select value(s) to appear in the result-set. If one condition is not met, then no value(s) will appear in the result-set.)

SELECT table_name.column_name(s)

FROM table_name

WHERE table_name.column_name 'operator value' OR table_name.column_name 'operator value';

(This query will return value(s) provided that either one of the two conditions are met. For instance, if the condition is column_name = 'red' OR column_name = 'blue' and only the first condition can be met, then the result-set will show the rows that meet the first condition.)

SELECT column_name(s)

FROM table_name

WHERE table_name.column_name 'operator value' AND (table_name.column_name 'operator value' OR table_name.column_name 'operator value');

(This query combines both AND & OR resulting in a result-set that must still meet both conditions, but with the second condition containing a separate condition that does or does not have to meet the conditions.)

An example of AND & OR:

From the Employee table, select all columns that meet the following conditions: Location equals Minneapolis and Firstname equals Jane or Bob.

SELECT *

FROM Employee

WHERE Employee.Location = ‘Minneapolis’ AND (Employee.FirstName = ‘Jane’ OR Employee.FirstName = ‘Bob’);

IN & BETWEEN

The IN operator works when there are several conditions. It works the same way as the OR operator, but makes the SQL look more legible and intelligible.

The BETWEEN operator is used when selecting a range of data that is between two values.

The syntax for IN and BETWEEN:

SELECT table_name.column_name(s)

FROM table_name

WHERE table_name.column_name IN ('value', value, 'value', etc);

NOTE: The conditional portion of the query could also be written using OR, but would make the query look long: WHERE tale_name. column_name = 'value' OR table_name.column_name = value OR table_name.column_name ='value' etc. Using IN requires less typing and makes the query look more clean and intelligible.

SELECT table_name.column_name(s)

FROM table_name

WHERE table_name.column_name BETWEEN value AND value;

NOTE: The conditional portion of the query can also be rewritten using AND; for example, WHERE table_name.column_name >= value AND table_name.column_name <= value, where 'value' is a numerical value.

An example of IN and BETWEEN:

From the Employee table, select all employees with the LastName equal to Alba, Bower or Davis.

SELECT *

FROM Employee

WHERE Employee.LastName IN (‘Alba’, ‘Bower’, ‘Davis’);

From the Employee table, select all employees whose Location is between Minneapolis and Minneapolis.

SELECT *

FROM Employee

WHERE Employee.Location BETWEEN ‘Minneapolis’ AND ‘Minneapolis’;

Note: Depending on the database that you are using, the BETWEEN function will work different. It may include or exclude the test values.

ORDER BY ... ASC/DESC

This command is used to sort the result-set in ascending or descending order on a specific column. Keep in mind that to sort in ascending order, one can either use 'ORDER BY ... ASC' or just 'ORDER BY', because 'ORDER BY' defaults to ascending order anyways.

The syntax for ORDER BY:

SELECT table_name.column_name(s)

FROM table_name

ORDER BY table_name.column_name(s) ASC / DESC;

An example of ORDER BY:

Select all the employees and rank them by Location in ascending order.

SELECT *

FROM Employee

ORDER BY Employee.Location ASC;

AGGREGATION FUNCTIONS

Aggregation functions are used to calculate numerical values in a specified column.

Below are 6 commonly used aggregate functions:

MIN () - returns the smallest value

MAX() - returns the largest value

AVG() - returns the average value

SUM() - returns the sum value

COUNT() - returns the number of values

COUNT(*) - returns the total number of rows in a table

The syntax for the aggregate functions:

SELECT AVG(table_name.column_name)

FROM table_name

WHERE table_name.column_name 'operator value';

SELECT COUNT(table_name.column_name)

FROM table_name;

SELECT COUNT(*)

FROM table_name;

(**This will return the number of rows for the selected table**)

An example of an Aggregate Function:

SELECT AVG(Salary.Salary)

FROM Salary;

Note: In the result-set, the name of the column is AVG(Salary.Salary). To give the column name a more appropriate title, you can give the column name an alias. For instance, SELECT AVG(Salary.Salary) AS Avg. Salary. The column header will now appear as Avg. Salary.

GROUP BY

This statement is used to group the result-set by one of more columns and is used in conjunction with the aggregate functions mentioned above.

The syntax for GROUP BY:

SELECT table_name.column_name(s), aggregate_function(table_name.column_name)

FROM table_name

GROUP BY table_name.column_name(s);

An example of GROUP BY:

Create a list of the total salaries of employees and group them by occupation.

SELECT Salary.Occupation,SUM(Salary.Salary) as Total_Salary

FROM Salary

GROUP BY Occupation;

HAVING

This clause is used in conjunction with ORDER BY and places a condition on the column(s) in the GROUP BY clause. Also, this clause is used with ORDER BY because WHERE cannot be used with aggregate functions.

The syntax for HAVING:

SELECT table_name.column_name(s), aggregate_function(table_name.column_name)

FROM table_name

GROUP BY table_name.column_name(s)

HAVING aggregate_function(table_name.column_name) operator value;

An example of HAVING:

Find the occupations that have a total salary less than 200000.

SELECT Salary.Occupation, SUM(Salary.Salary)

FROM Salary

GROUP BY Salary.Occupation

HAVING SUM(Salary.Salary) <>

What I've shared with you now is just the tip of the SQL iceberg.

Continue to check back on this blog post as I will continue to update it with additions and edits.

Tuesday, November 11, 2008

Data Modeling Continued... Kimball vs. Inmon: The Basics

As I discussed last time, many in the field of BI are strongly sided with the methodology of either Ralph Kimball or Richard Inmon. As mentioned last week, there are more similarities than differences, but today I'll just point out the main differences between their philosophies for anyone unfamiliar with them.

The main difference is that Kimball's architecture, also known as the Bus Architecture, is based on loading individual data marts directly from the operational system through the data staging area using conformed dimensions. An operational data store or intermediate data structure may or may not be necessary depending on existing data sources and business requirements. In this design, what is referred to as the data warehouse is actually just the collection of data marts. Kimball's basic architecture is shown in the diagram to the left. Inmon argues that this approach is inflexible without a centralized warehouse and changes cannot be made as gracefully as with his approach, which is explained below.

Inmon's Corporate Information Factory, or CIF architecture, is based on the idea that a complete data warehouse should be created in third normal form. Data marts are then created separately using the warehouse as their source. These data marts can be denormalized as the designers see fit, often into a star schema. This architecture is depicted in the diagram below.Those in Kimball's camp argue that the design, implementation, and maintenance of this data warehouse, along with its associated additional ETL processes, are often unnecessary and take much more time to get off the ground than projects using the BUS archeticture.

The differences and arguments between these two approaches go far beyond what I've mentioned here, but this should help to explain the basic split between the methodologies. I've read many of the arguments for both sides out there, and although there are plenty of hard liners in both camps, the verdict seems to be that the answer to which architecture is better is....it depends. Yes, boring I know, but I've read many comments by designers claiming that they have either used hybrids or, used both successfully at different times depending on the existing architecture and business requirements.

For every opinion I've read advocating one or the other, I read another praising the merits of both. I also read one claiming that Richard (not Ralph) Kimball's methodology is superior, which made me laugh, because I made the same mistake once in conversation shortly after learning his name. My colleagues somehow seemed skeptical that the fictional character from the movie "The Fugitive" has his own data warehouse methodology.

Hopefully this helped explain the main differences to anyone new to their methodologies. As I mentioned in the preceding post, I encourage anyone involved with a BI project at any level to pick up a book by both men to fully understand their ideas. Also, although this topic has already been argued at length, I encourage comments from anyone with significant experience using either or both methodologies.

Friday, October 31, 2008

Data Modeling

One of the most important aspects of a BI project is the underlying data warehouse model. This may seem like a no-brainer, right?

In practice, I don't believe the data model is always given the proper time or expertise necessary for a successful implementation. Often, seemingly small flaws or shortcuts in the design of data warehouses or marts can cause much larger problems down the road in terms of functionality, performance, and flexibility. It's imperative that any organization faced with the task of designing their own data warehouse has a project team that understands the importance of a well-planned dimensional model. Changes can always be made, but often at an exponentially higher cost when issues are uncovered late in the game.

Anyone involved with a BI project at any level without a sound understanding of data warehouse modeling techniques should consider doing some research on the topic. One book in particular which I highly recommend is The Data Warehouse Toolkit, 2nd Ed. by Ralph Kimball and Margy Ross. If you're unfamiliar with Ralph Kimball, he's considered a pioneer in the field of data warehousing.

Another highly regarded data warehouse guru is BIll Inmon, who has also written several books on the subject. Most experts in the field strongly side with either Kimball or Inmon. There have been many arguments around whose philosophy on the subject is superior, but of course I'd hate to go picking a fight in my first post by endorsing one over the other here. Maybe we'll hit the topic of their fundamental differences next time. In actuality, their methodologies are very similar and have become more so over time. Anyone interested in learning more about warehouse design should pick up a book on the subject by either or both of them.

An understanding of their techniques and a well designed warehouse won't guarantee a successful implementation, but it's a great step in the right direction. Obviously, we'll never be able to foresee every issue to arise or every request thrown at us, but a strong warehouse design will allow us to deal with both much more easily.

Tuesday, October 7, 2008

Cache Basics

Part of any OBIEE Implementation should be spent defining the proper caching strategy specific to a given project. This post is a basic look at what cache means to OBIEE.

WHAT IS CACHE?

Pronounced "cash", it is generally an easily accessible place to store frequently used information. The most recognizable use of cache relates to the manner in which a personal computer utilizes this information. As it relates to PC, there are two basic types of cache: memory caching and disk caching.

Both types follow the same principle of storing recent or frequently used information in a memory buffer. Memory cache uses a portion of static RAM memory. Disk caching uses conventional main memory. When a piece of information is required, the system (or application) first checks the cache to see if the data is there.

The main purpose of cache is to improve performance. Accessing information stored in a memory buffer can be thousands of times faster than accessing a byte on a hard disk.

OBIEE CACHE
Caching as it relates to OBIEE is a bit different, but the same general purpose applies. Rather than utilizing memory buffer to quickly access recent information, OBIEE will access a file containing a stored record set of information rather than spend processing time querying the database upon each request.

When a request, or query, is made for a set of records, OBIEE can store this as cached information in a file to be used later. When a similar request is made for this same information, OBIEE will retrieve this information from the cache rather than spending processing time negotiating with the database. This can greatly improve the performance of the application.

In the OBIEE/Siebel Analytics world, this is referred to this as ‘query cache’. By utilizing query cache, the cost of database processing only needs to be paid for the one time the query is run against the database. Subsequent identical queries will run against the cache.