A caveat for the experienced people: this is a highly simplistic introduction to college grads. I only mean to provide some overall view.
I am a relatively new comer to data warehousing. I come from OLTP world, where we take lot of transactions and put them through the databases. Early in my career, I worked in data warehousing, specifically archiving.
Historically (For the most readers at least), RDBMS have been the mainstay of the OLTP system. Get a high end RISC (*nix) computer and put Oracle on it – this is the standard OLTP in most shops. There have been lot of changes in the last 10 years, starting with the rise of low-end databases with different performance characteristics (lot of reads and few writes or explicit transaction control etc). But, I digress.
You have your database where all your data is storied; but the people upstairs want a different database for other purposes. Why? What other purposes could there be? Why not use the same database?
Consider the case of a Acme company that sells widgets to coyotes on an e-commerce site. Naturally, they use an OLTP system to sell their widgets – from the record of the sale to current status of the order. But, if they want to generate standard reports (say, how many widgets sold in which state for the last month), they cannot go against the OLTP database – obviously, that can interfere with serving your e-commerce. Or, you may want to look at the trends – which products are selling fast in the last month. Naturally, you do not want to hit the OLTP database.
What you need is a separate DSS (Decision support system) to support your queries and reports. Here is the way it would fit together.
Of course, this is highly simplified picture We will see how it can evolve and can reflect real-world scenarios as we proceed.
A small note: You will encounter words like OLAP. OLAP is what one kind of DSS system does. You also will hear about Datamart. You can think of it as a subset of data (or a mini data warehouse). DSS database is called data warehouse for the most part. DSS is used for the applications+data warehouse.
Why did we have to introduce the ETL tool at all? Consider the following possibilities:
- The schema in OLTP is optimized for specific purpose: to run transactions. The data warehouse is optimized for a different purpose – ability to run queries and reports faster. So, that schema is different from OLTP. ETL tool can do the translation.
- The data in the warehouse may have to come from multiple different databases. Typically, you have the departmental, corporate warehouses that collate the information from multiple sources. You need ETL for that.
- The data may lie in different kinds of databases as well as other sources. ETL deals with all those different sources.
Before going further, you are right to ask: What other kinds of uses are there for databases? As a web developer, I am familiar only with OLTP databases. Tell me about the other standard ways that databases are used.
Here is a simple view, that will suffice for the beginners:
Of course, to solve these different needs, we do have different kind of databases (and constantly invented as well). Recall that most databases speak SQL, which is a bit of a burden on developing a new database. Fortunately, some databases break that taboo for greater glory. In another note, we will get into those details.
Here are a few take-away’s for the beginners:
- Standard database that we generally learn in college or while programming the web, is good for transactions but not well-optimized for other needs.
- There are lot of other ways databases are needed; and there are different kind of databases that support different needs.
- As you are creating all these different kind of databases, you will encounter different data transformation needs and other needs pertaining to integration – which is solved by yet another piece of software like ETL (and a few others – you will see later).