This speech was presented at Networking 2000, the South Atlantic Regional Conference of the Special Libraries Association on March 12, 1999.
Copyright © 1999 by Gordian Solutions, Inc.
Permission is granted to duplicate and distribute this diskette as is, without modification or alteration of any kind. This is a promotional diskette.

James Callan is president of Gordian Solutions, Inc., a software consulting firm based in Cary, NC. Prior to this post, Mr. Callan served as a consulting director for Oracle Corporation and as Vice President of R & D for TSI Incorporated, a telecommunications company. As author of Collaborative Computing with Delphi 3, Mr. Callan is a world authority on collaborative computing architectures and applications. James has authored two books and numerous articles and remains a sought-after speaker in Data Warehousing and Project Management. His 21 years of experience with software spans 22 industries. Mr. Callan holds four degrees and a software patent, and has an additional patent pending.

Gordian Solutions, Inc. provides a variety of consulting services to help our clients achieve more for their business and their people through technology. We strive to create enduring software architectures, so that what we do for our clients is what our clients do for their future. Visit our web site or drop Mr. Callan an email to find out more about how we can help you make the most of your technology investments.
Ladies and gentlemen, it's a real pleasure to speak with you today on a topic that you should find stimulating. I have entitled today's topic, Harnessing Information After Dawn. Ambrose Bierce called the dawn the time when men of reason go to bed. Being a writer myself, I can identify with a late night or two. However, I think that after today you will agree that the dawn is when information can best be harnessed. But doing so today, unlike in Bierce's day, demands unreasonable men, and women.
Yes, a new age has dawned. Our digital future is here and now. The Internet changed everything. In fact these were the exact words spoken by Bill Gates when he realized that the Internet had indeed changed everything for him and his company, Microsoft. What with cellular phones, computers and satellites we live on a digital planet. What was to be has become. "It will" has become "it has," and we now live in a digital world.
They say that information is power, but I disagree. Leveraged information rules. Simply finding answers is not good enough anymore. Mere retrieval has become commonplace. People want more. They want interesting information. They want news. People pay for news. They crave new ideas and events. They also want ideas and answers. But, unlike yesterday, they no longer are content to wait. People want instant gratification. Instant answers win.

Information is your business. For that matter, information is my business. In the dawn of this new age, information from all sources is accumulating at an alarming rate. More information means more questions. More questions mean that you and I must find more answers. We must adapt. Information providers must adapt. We have no choice. Resistance is futile. You must adapt, and today I want to talk about how to adapt.

How has our world changed? Let's get just a glimpse of the early light. We now live in a connected world. From my own personal experience over the last six months, I can tell you about how I received critical comments from China, or how I collaborated with people in Brazil on a management course, or how I exchanged web links with a Polish company. I can share with you how a German emailed me to inquire about changes to some software that I released, or how a Finnish company was able to double the price of their product after I made a personal endorsement. Finally, I can tell you about how last week I was contacted by an Italian firm about developing a software product in Italian for release in Italy. I can't even speak or read Italian.

We live in a small world getting smaller. We live in a world where media and the press move us and control us and, if you see movies like The Truman Show, in a world that watches our every move. Where else could someone named Monica connect with 70 million people and clog up the Internet for two days of chit chat? It is also a fascinating time to be alive. With a mere click we can tune in the Discovery channel and watch a live ascent of Mt. Everest.

It sure is a faster world than it used to be. Last week Raleigh had a tornado warning. I was working in the Research Triangle away from my office and noticed the sky getting dark and the wind picking up. It was looking rather spooky. Three or four clicks and I had a doppler radar scanning live on my screen. No problem. We also live in a world that is fast for business. When else in history could a small company named Netscape started by graduate students spin off $4 billion in 4 years flat? Wow! That is what I call leverage. And, this is just the kind of leverage that we need as information providers.

Archimedes, a Greek who used to spend a lot of time in bathtubs, once said, "Give me a place to stand and I will move the earth." Archimedes, when not sitting in bathtubs and running around Greece with only a towel, studied levers. He was talking about leverage. What I am talking about is information leverage. In an era when anyone can get answers a mere click away, how can we as Information Providers leverage information?
"Facts is facts," but are they really? Surely you must have had one client give you a search, only to hear, "But that is old news." Not all information is equal. Although it might be a click away, do not believe everything you read on the Internet or in a newspaper. For this very reason a friend of mine in Dallas calls the Internet, "The great sea of mediocrity."
So, although anyone can find an answer, not everyone can find the answer. When people ask you for answers you probe them for specifics. Why? Specificity sells. People want the right answer. They pay for this. This, they call quality. The right answer brings them news. It surprises them. In this sense, Information Providers are in the business of selling depreciating surprise. The longer you wait, the less valuable the answer becomes.
This is why people want answers quickly. Only a few people can get answers quickly. No sooner have they asked the question than they ask, "When can I get it?" We often think, "You want it when?" When we live in a world of instant pizza and overnight delivery, answers must come quickly. When they do we can leverage them.
What do I mean by information leverage? Let's look at a few examples. Information leverage is when you can compare performance across the organization or when you can anticipate a competitor's actions and responses to your actions. It's when you can optimize your inventory turns or cycles or when you want to study customer behavior and buying patterns. It's when you want to analyze what people want most. Now, that's valuable. Leverage is when you need information to decide how to restructure your sales organization and prepare to move into new territories. It is times like these when quality information delivered quickly provides leverage.
With these examples it becomes clear that information providers, especially those that operate special libraries, must adapt. But, how?

I recommend that we look to the retail industry for answers. Here is where we hear the phrase, "We can ship two today if you will just give us your credit card." If Wal-Mart, the most successful retailer, does it, then so can we. Wal-Mart's stores employ three principles to allow them to leverage capital. We can apply these same principles to meeting our future demands for information.

Sam Walton, the founder of Wal-Mart and until he died one of the wealthiest men in America, after years in retailing discovered that inventory enables immediacy. If you have the goods on hand, then you do not have far to walk. He made sure that he always had goods when people wanted to buy them. Sam also discovered that if you stage the goods, you can ship them faster. This he employed in his Sam's Clubs and in his distribution centers. Ever walk through a Sam's Club and see goods resting on pallets? These are staged goods.

Lastly, Sam Walton discovered the same thing that Henry Ford discovered. Standard orders reduce your processing. Henry used to say about his Model T, "You can have any color you like as long as it's black." Now, I am not recommending a lack of choice or that you stop providing custom reports, but rather, I want you to at least entertain the notion of standard requests.

As Information Providers, we can apply the same three principles to leverage information.

The computer industry has since 1991 been employing these principles in a retail oriented data management approach that we call Data Warehousing. Perhaps the best definition of a data warehouse that I could think of when preparing this topic is also the most terse. A data warehouse is a storehouse of Information Packages. The term Information Package was first coined by a Sybase consultant named Tom Hammergren shortly after data warehousing became popular. I liked Tom's idea of packaging information, and his notion has stuck with me ever since.

An Information Package is information structured to adapt well to change and to provide accurate answers to new questions. A data warehouse is therefore a set of tools and processes that leverage data, transforming it into valuable information.

It is helpful when first encountering data warehouses to think of the retail distribution chain. Manufacturers create widgets and ship these to distribution warehouses. Retail stores order the widgets from widget catalogs, whereupon the distributors ship them to the stores for sale to consumers. Data warehouses work exactly the same way.

If we were to dissect a data warehouse--to examine its anatomy so to speak--we would see something like that depicted on the left of this slide. Lots of data manufacturers produce data. The warehouse is stocked with data from multiple data sources. An extraction process obtains the data from all these data sources and places it into a staging area. Because the data comes from so many sources it seldom looks the same. Consequently, the data must be scrubbed and cleaned up so that it can be combined. Once combined, the data represents quite a vast storehouse of knowledge, but it is only in a raw state. To provide quick answers it must be summarized. Averages and totals are pre-calculated in a step called aggregation. This aggregated data is next organized for fast and easy retrieval.
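
For readers who would like to see the stages just described in concrete form, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the two source systems, their record layouts and the function names extract, scrub and aggregate are all hypothetical, and a real warehouse would use dedicated loading tools rather than a few lists in memory.

```python
from collections import defaultdict

# Hypothetical records from two source systems with different layouts.
SALES_SYSTEM = [
    {"cust": "ACME CORP ", "amt": "100.50", "month": "1999-01"},
    {"cust": "acme corp", "amt": "49.50", "month": "1999-01"},
]
BILLING_SYSTEM = [
    ("Zenith Ltd", 1999, 1, 200.0),
    ("Acme Corp", 1999, 2, 75.0),
]


def extract():
    """Pull raw rows from every source into one staging list, tagged by origin."""
    staging = [("sales", row) for row in SALES_SYSTEM]
    staging += [("billing", row) for row in BILLING_SYSTEM]
    return staging


def scrub(staging):
    """Convert every staged row to one common layout: (customer, month, amount)."""
    clean = []
    for source, row in staging:
        if source == "sales":
            customer, month, amount = row["cust"], row["month"], float(row["amt"])
        else:
            name, year, month_number, amount = row
            customer, month = name, f"{year}-{month_number:02d}"
        # Trim stray spaces and unify capitalization so names can be combined.
        clean.append((customer.strip().title(), month, amount))
    return clean


def aggregate(clean):
    """Pre-calculate totals per customer and month for fast retrieval."""
    totals = defaultdict(float)
    for customer, month, amount in clean:
        totals[(customer, month)] += amount
    return dict(totals)


warehouse = aggregate(scrub(extract()))
for key in sorted(warehouse):
    print(key, warehouse[key])
```

The two differently spelled Acme rows end up in a single January total, which is the point of scrubbing before combining.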

Data consumers order the data by browsing through a data catalog that contains data about the data stored in the warehouse. This is called a meta-data dictionary. We shall cover these shortly. Once the data has been identified, desktop tools retrieve the data and present it to data consumers.

Now on the surface this process may look very much like some of your existing operational systems. The difference lies in the nature and type of the data managed by the system. Operational systems focus on current day-to-day business operations, whereas data warehouses focus on trends and on answering more strategic and far-reaching questions. They focus on the so-called high value questions.

Let's look at the steps in data warehousing a little more closely. Suppose you wanted to put together a competitive information system. This is rather plausible because many special libraries exist specifically for this purpose. Where would you source the data?

Determining the data sources for a data warehouse is called "sourcing the warehouse." I agree that the verbing of the noun is strange, but verbing nouns is another interesting aspect of the age in which we live.

In sourcing data, you might first look to obvious, easy to reach sources of data like your existing operational systems. Here you might find systems that handled sales, circulation, cataloguing, reserves, serials, collections, purchasing, inventory and the like. Next, you might reach outside your organization to publishers and commercial data providers. Here you would find books, periodicals, newsletters and reports. You probably would want to consider free government sources of data as well.

Next, you might turn to the press. Here you would examine newspapers, trade publications, the Associated Press wire feed and news sources like CNN. You might next turn to the Internet and on-line subscription sources. Here you would source data from competitor web sites, industry web sites and sources like Lexis-Nexis and CompuServe. Finally, you might turn to private sources of information and seek the advice of industry watchers, luminaries and consultants.

This is how a data warehouse gets data. The process might sound vaguely familiar to people already in the information industry.

Having sourced the information, you now have a huge pile of raw data. Like a lot of things that begin raw, things can begin to smell if you just let them sit there. You have to cook the raw data. At the very least you have to re-organize it into a more useful form. This soon leads you to issues related to data quality.

You will find duplicate data. On the surface it doesn't look the same, but when you delve in you find that the data has been duplicated. You will also find multiple formats and a total lack of standards. Expect this. Your source systems were never built with your data warehouse in mind. You might choose to use foreign data in which case you have the language translation monster to slay. You might encounter technical problems just getting the data into digital form. You may also find that although your source systems work well with their data, there just is no consistency in data validation.

Take a common person's name. One system might accept the data like John A. Doe. Two people entering in the same person might come up with John A. Doe and John A. Doe, Jr. for the same person. Similarly, another system might prefer the last name first as Doe, John A. and be entered by a person in a hurry as Doe, John. Seems innocent enough, until you ask a computer to treat all these as the same John Doe.
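
To make the John Doe problem concrete, here is a small Python sketch that reduces each of the four spellings above to one matching key. It is a simplification, not a production matcher: the function name normalize_name and the suffix list are hypothetical, and dropping middle initials and suffixes such as Jr. is an assumption that would wrongly merge a father and son who share a name.

```python
# Hypothetical list of suffixes to ignore when matching.
SUFFIXES = {"jr", "sr", "ii", "iii"}


def normalize_name(raw):
    """Reduce a name to a (last, first) key; middle initials and suffixes are dropped."""
    # Split on commas to handle "Doe, John A." and "John A. Doe, Jr." alike.
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    # Discard parts that are only a suffix, such as "Jr."
    parts = [p for p in parts if p.rstrip(".").lower() not in SUFFIXES]
    if not parts:
        return ("", "")
    if len(parts) == 2:
        # "Last, First Middle" form.
        last, given = parts[0], parts[1]
    else:
        # "First Middle Last" form.
        words = parts[0].split()
        last, given = words[-1], " ".join(words[:-1])
    first = given.split()[0] if given else ""
    return (last.lower(), first.lower())


variants = ["John A. Doe", "John A. Doe, Jr.", "Doe, John A.", "Doe, John"]
keys = {normalize_name(v) for v in variants}
print(keys)
```

All four variants collapse to the same key, so a computer can now treat them as one person.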

You may also find that all your sources of information update their data on different intervals. One may be always up to date. Another may be current weekly. Still two more aren't current until the end of the month. Another might not close the month until the end of the following month. Still others may decide to rewrite history months later. Determining when data is current and history is in the past often depends upon when you ask for the data.

Assume that you overcome these problems. How do you organize the information? Better yet, how do you know what you have? How would you know what you have in a warehouse? Take inventory. When you take inventory in a data warehouse, you record the answers in a meta-data repository. Meta-data is simply a clever term for data about data. It's a way of knowing what you know, knowing how you know what you know and knowing why you know what you know. Clearly, if you know this then you know what types of questions you are capable of answering within the data warehouse.
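
As a rough illustration of taking inventory, the following Python sketch models a tiny meta-data repository. Everything in it is hypothetical: the CatalogEntry fields, the three data sets and the lookup function are invented for this example and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One meta-data record: what a data set is, where it came from, how fresh it is."""
    name: str
    description: str   # what we know
    source: str        # how we know it
    refresh: str       # how often the source is updated
    subjects: list = field(default_factory=list)


# Hypothetical inventory of a small competitive information warehouse.
CATALOG = [
    CatalogEntry("press_mentions", "Competitor mentions in the trade press",
                 "news wire feed", "daily", ["competitors", "press"]),
    CatalogEntry("serials_usage", "Circulation counts per periodical",
                 "library circulation system", "weekly", ["circulation"]),
    CatalogEntry("patent_filings", "Patent filings by competitor",
                 "government database", "monthly", ["competitors", "research"]),
]


def what_do_we_know_about(subject):
    """Return the names of all cataloged data sets that cover a subject."""
    return [entry.name for entry in CATALOG if subject in entry.subjects]


print(what_do_we_know_about("competitors"))
```

A query for a subject that returns nothing tells you, before anyone asks, which questions the warehouse cannot yet answer.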

Time does not permit a technical discussion of how data warehouses are organized internally. These days, relational databases like DB2, Oracle, Sybase and Informix often appear as the storage mechanism for data warehouses. Data is collected into fact tables, which are related to one another. If one diagrams these fact tables and their relationships, one usually sees what looks like a star or snowflake. Although architecture is one of the most important elements in a successful data warehouse, these star and snowflake patterns have been over-emphasized in the industry. The really important thing is to structure the data in such a way as to balance the need to continuously load new data and to get good performance from trend queries. Any structure that facilitates this is appropriate.
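
For those curious about what a star looks like in practice, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the commonly described layout in which a central fact table points to surrounding lookup tables, often called dimension tables; the table names, columns and figures are hypothetical and chosen only to show a trend query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
-- The fact table sits at the center of the star and references each lookup table.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    dollars    REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Desktop"), (2, "Laptop")])
conn.executemany("INSERT INTO dim_region VALUES (?, ?)", [(1, "East"), (2, "West")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [(1, 1998, 4), (2, 1999, 1)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 1000.0),
    (1, 2, 1, 500.0),
    (2, 1, 2, 2000.0),
    (2, 2, 2, 1500.0),
    (1, 1, 2, 250.0),
])

# A trend query: total dollars sold per quarter across all products and regions.
trend = conn.execute("""
    SELECT t.year, t.quarter, SUM(f.dollars)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY t.year, t.quarter
    ORDER BY t.year, t.quarter
""").fetchall()
print(trend)
```

New facts are simply appended to fact_sales, while trend queries join out to whichever lookup tables they need, which is one way to balance loading against query performance.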

Three additional terms surface when building a data warehouse. These are dimensions, categories and measures. Although we shall cover these shortly it is best to just understand that these are merely terms for processes that you already perform when researching the answer to a particular question. In the same way that you organize your research results, data warehousing practitioners organize dimensions, categories and measures.

Just as you must service many customers, so must a data warehouse. Specificity enables you to segment your search to various aspects of a subject area. Similarly, data warehouses segment user access by subject area. Each subject area may have its own unique vocabulary. For this and performance reasons data warehouses segment access to the data in different subject areas into Data Marts. A data mart is a specialty retail store--a boutique--that sells data related to one kind of subject area, like sales or manufacturing. Data warehouses segment their data via packaged reports or access to specific subsets of data used to create ad-hoc reports.

Over the years experience with data warehouses has led developers to a few proven ways of presenting data to users. By far the most popular is through data visualization tools. Data visualization tools allow users to explore a virtual data space. Like true space, data space is multi-dimensional. Each dimension in this virtual information space represents a different aspect of the subject being explored. Users have the ability to drill down to more detail or drill up for a high-level view. This ability to drill down and drill up was first introduced with spreadsheets and early Executive Information Systems.

As tools progressed, the ability to visualize data in more ways than charts and graphs became important. People began plotting data on maps. This gave rise to geographic information systems or GIS for short. Multi-dimensional tools have evolved into what are now termed On-line Analytical Processing systems, or OLAP systems, whereas GIS has evolved into a specialty field of its own. Vendors have emerged and introduced their own vernacular so as to confuse the competition. So, you may run into phrases like ROLAP and MOLAP as well as DOLAP, which all describe variations on the same idea.

All this gets rather dry without something to sink your teeth into, so I took an example from my book and constructed a slide to explain multi-dimensional analysis. Let's pick a common subject area like sales. For libraries, we could consider it sales of special research projects. For this example, we will look at personal computer sales. What are the dimensions of sales? More fundamentally, what is a sale? A sale is a business event based on a transaction. It occurs at a particular time and place and involves a particular product. We therefore have at least four dimensions. We can depict them as a four dimensional diagram.

 

Actually, I can only depict a three dimensional diagram, but you get the idea. So, sales, geography, time and product are our dimensions. What are we interested in knowing? This becomes our measure or fact. Suppose we want to track the dollar amount actually sold. As we begin breaking down sales in this manner we might find it convenient to group related values along a dimension into categories. These form natural hierarchies. Thus, we collect days into months and months into quarters and quarters into years.
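
The idea of collecting days into months, quarters and years can be sketched in a few lines of Python. The sales figures, the TIME_LEVELS table and the roll_up function below are hypothetical and stand in for what an OLAP tool does behind the scenes when a user drills up or down along the time dimension.

```python
from collections import defaultdict
from datetime import date

# Hypothetical facts: (sale date, region, product, dollars sold).
SALES = [
    (date(1998, 11, 3), "East", "Desktop", 1200.0),
    (date(1998, 12, 15), "West", "Laptop", 2100.0),
    (date(1999, 1, 20), "East", "Laptop", 1900.0),
    (date(1999, 2, 8), "East", "Desktop", 1100.0),
    (date(1999, 2, 25), "West", "Desktop", 900.0),
]

# Each level of the time hierarchy maps a date to a category label.
TIME_LEVELS = {
    "month": lambda d: f"{d.year}-{d.month:02d}",
    "quarter": lambda d: f"{d.year}-Q{(d.month - 1) // 3 + 1}",
    "year": lambda d: str(d.year),
}


def roll_up(sales, level):
    """Total the dollar measure along the time dimension at the chosen level."""
    totals = defaultdict(float)
    for sold_on, _region, _product, dollars in sales:
        totals[TIME_LEVELS[level](sold_on)] += dollars
    return dict(totals)


by_year = roll_up(SALES, "year")        # drill up to the high level view
by_quarter = roll_up(SALES, "quarter")
by_month = roll_up(SALES, "month")      # drill down to more detail
print(by_year)
print(by_month)
```

The same approach applies to the geography and product dimensions: each needs only its own table of levels.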

Tools that support OLAP or multi-dimensional analysis present information organized by dimensions.

Here is a screen shot taken from an OLAP tool called Knowledge Point. It depicts the hierarchy of our sales dimension. Similarly, the tool allows us to specify different measures. Once we have organized the data, the tool helps us to visualize our subject area graphically. We can drill up or down to look at more or less detail. This is the essence of data visualization tools, multi-dimensional visualization tools in particular.

Pretty neat, but you are probably thinking to yourself, but this just is a way of presenting data--it doesn't tell me anything new. This is akin to preparing the pretty research report with all the research data. Where is the beef? Where is the leverage?

Here we are talking about innovative insights. Where do these come from? This is Information Innovation, and it is what gives data warehouses their leverage.

What do I mean by Information Innovation? Let's look at some examples. Examining a whole industry and determining competitive product research trends is innovative. How about looking at dozens of symptoms to be able to predict who will have the next heart attack? That's pretty innovative. For those more financially motivated, we can seek patterns of activity in the stock market. We could analyze price elasticity and set better prices for our products and services. You might be in insurance and want to know if the last person you signed is a good risk. These are all innovative uses of technology. They are also very very hard to do.

Each of these examples has the same three things in common. They look at patterns in the data. They apply rules of thumb--more commonly called heuristics. They are deductive. Computers capable of producing new information like this from vast quantities of data appear to think. Humans are not capable of processing all the data and seeing the patterns. The vast quantities of data overload our circuitry. If we could analyze the data, however, we could discover a wealth of new ideas.

This is the essence of Data Mining.

Data Mining products identify interesting new facts from large samples of existing facts. When I say large, I mean databases consisting of hundreds of terabytes of data. A terabyte is a thousand gigabytes and a gigabyte is a thousand million bytes of data. The word zillions comes to mind. To data mining products these zillions of facts are literally food for thought.

 

How do they work? Data mining algorithms fall into two basic categories, which is convenient for people who always divide the world into two groups. Algorithms are either statistical or heuristic in nature. Hybrid algorithms exist, but these combine multiple approaches to mining the data. There are six approaches currently being employed by commercial products to mine data from data warehouses. First there is classification, wherein vast quantities of data are classified according to common criteria. This is basically divide and conquer applied to data itself.

Next, and similar to classification, comes clustering. In clustering, products look at the frequency distribution of data values and determine natural cluster points for values. These cluster points form interesting starting points from which many useful new insights typically arise. In addition to frequency distribution, data mining products also employ other statistical techniques. Some look at series regression using the same techniques taught to first year MBA students. Other products specialize in analyzing series over time. Still others combine values derived from multiple events over time. These sequence analysis products produce useful insights but require highly skilled operators. Finally, there are the pure associative reasoning engines that apply decision trees and neural networks to seek interesting patterns in data.
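
To give a feel for clustering, here is a short Python sketch that finds natural cluster points in a list of numbers. It uses a simple one-dimensional k-means procedure, which is a stand-in chosen for illustration and not necessarily what any commercial product does; the function name k_means_1d and the purchase amounts are hypothetical.

```python
def k_means_1d(values, k, iterations=20):
    """Simple k-means on one-dimensional data; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Seed the centers with evenly spaced values taken from the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


# Hypothetical purchase amounts that fall into three natural bands.
purchases = [12, 15, 14, 13, 98, 102, 101, 99, 55, 52, 50, 53]
centers = k_means_1d(purchases, k=3)
print(centers)
```

The three centers mark small, medium and large purchases, which an analyst could then use as a starting point for asking who buys in each band and why.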

Dozens of data mining products are currently on the market. All claim to be better in some ways than other products. I would recommend paying the vendor to develop an example mining scenario prior to full deployment. Not every product will work well for every situation.

While I am on the topic of recommendations, perhaps I should digress and cover some specific recommendations for building data warehouses. In building your data warehouse it is absolutely essential that you involve the top people. Data warehouses cut across organizational boundaries and get very political very quickly. In determining which subject areas to attack, focus on critical business events. These typically manifest themselves as transactions, but need not. For instance, in a competitive information system, you might choose to focus on competitive press events and public disclosures. Focus, however, is key. Divide the project into manageable pieces. Begin by answering the ten top business questions. Resist the temptation to please everyone at once.

 

In sourcing your data it is easiest to manage a single data format and normalize all data coming into the warehouse in the staging areas. When deciding what to aggregate, study who will use the data and how they want to use it. Herein lie the answers for how to aggregate the data. Above all, focus on the long term. Today's acute problem becomes a monthly management report, so plan on keeping the warehouse as a long term business asset.

To make data warehouses real Lois asked me to share with you a recent scenario from the education industry.

My case in point comes from an engagement in Princeton, New Jersey that I completed last year. It was with a company called Educational Testing Service. Those of you unfamiliar with the education industry will recognize ETS by their products and services. They administer the GMAT, GRE, TOEFL, SAT and PSAT tests as well as many others. Worldwide, they are the largest assessment organization. They have been at it about 40 years and have about five years' worth of purely digital records for virtually every college graduate in Generation X.

The company grew by taking on administrative contracts. Although they administer all kinds of tests, there is very little operational commonality across divisions. Divisions operate separately based upon the needs of the primary client for that division. They originally wanted to seek commonality across their divisions in order to lower costs. They also had outsourced much of their administration to Sylvan Learning Systems and wanted to begin getting universities to conduct computer based testing. For this they faced a huge logistical nightmare. Above all was the ETS mission to ensure equity in assessment. They constantly experiment with questions and evaluate the statistical fairness of their tests. Tests must measure precisely what they are supposed to measure, and do so fairly for everyone.

The data warehouse team developed a GIS to help with logistical planning. This was a great success. We also developed a set of OLAP hypercubes that we made available via client-server workstations and across the ETS Intranet. From the web, users could slice and dice data from across all of their graduate programs. We also provided many packaged and ad-hoc reports, also available via the web. Above all our team laid a foundation for the future, but the warehouse was not an overwhelming success. ETS made some dramatic discoveries and as a result has some tough strategic decisions ahead.

What did we learn from the experience? Many decisions had already been made before I became involved. The project was sponsored at the vice president level, which in my opinion is much too low. The first week I was there I insisted on escalation. Eventually, the project was endorsed by the CEO, almost too late in the project. It is easy to get distracted by all the source systems that must one day provide data for the warehouse. Focus on sources required to answer your top questions first. You can always add new ones later.

 

We also learned that vendors can ruin a project. I will not mention these vendors' names because at least one is represented at this conference. Let me just say buyer be careful. We also learned that bad data is bad any way you slice it. This is a real business issue. Technology just makes it pretty. Speaking of technology, the problem is never really technical. It is always a business problem--usually a people problem. Data warehouses bring out people's worst fears. Most people fear true accountability. We also found that some managers do not want their peers poking around in their data. Sometimes ignorance truly is bliss. You cannot please everyone. There will be winners and losers. This is why focus is important. However, you can make these things work for you and your organizations.

We have covered a lot today in this whirlwind tour of data warehousing and data mining. What should you leave with today? The digital future is here. We cannot escape it. Any information, including the type you deal with every day, can be leveraged through restructuring and packaging. Data warehouses and mining are your future. In considering your data warehouse, plan carefully. Your approach does make a difference. Have a vision. Focus on the long term. Ask questions that make a big impact. Most importantly, enjoy the ride.

I hope that you have enjoyed this ride and have some questions for me. Again, I am James Callan and I am confident that you can apply these ideas for Harnessing Information After Dawn. Ladies and gentlemen, Lois, thank you very much.