12.04.2004
Data Mining
Quarter 1, 2004 Vol. 9, Issue 1
This Way Failure Lies
Nine simple rules you won't want to follow.
Not all mining projects are successful. This pronouncement may come as a surprise (but it probably won't). Some mining projects are successful, sometimes spectacularly so. The fact that this second pronouncement may come as a surprise to many readers is unfortunate.
Although there are many paths to data mining success, the paths to failure are followed all too often. Data miners heading for failure seem to follow rules, or worst practices, just as those seeking success try to follow best practices.
Certain Disaster
Here is a selection of worst practices I've encountered much too often over the past year. These rules are in no particular order. To court failure, simply make a selection from this list.
Rule 1. Jump right in. To guarantee failure from the beginning, simply take the data on hand and start applying whatever tools are available. Don't stop to consider how answers to this sort of business problem could best be discovered. Most particularly, don't consider what the business users need from the process. Instead, focus on the data that's most immediately available and the tools you're most familiar with. That way, you'll produce results of the sort that you are most comfortable with producing. And don't stop to consider how the result has to be applied in the real world. A successful project produces results that the business can actually apply. If you're aiming for failure, you don't need to consider how the results will be applied until your results have already been generated and delivered.
Rule 2. Frame the problem in terms of the data. Don't consider alternative solutions to the business situation. Because the problem has been passed to a data miner, the best way to solve the problem is through data analysis. Don't consider whether other decision-making or data-gathering efforts would be appropriate, or whether some other method of dealing with the business problem would help. Don't consider the other challenges faced by the company or the specifics of the industry the company competes in. Consider only what the data set as it currently exists has to reveal. Whatever the data can be persuaded to reveal, recast the business objective in terms that the data addresses.
Rule 3. Focus only on the most obvious way to frame the problem. Don't waste time trying to explore or reconfigure the data. The best results are to be had by concentrating on the statistical and technical criteria provided by the tool of choice. Those criteria are technically exact measures that can easily justify the quality of the final mined models that are discovered. Concentrate on improving the technical merits of the model until it reaches the highest degree of technical perfection.
Rule 4. Rely on your own judgment. An experienced miner has the best judgment as to what should and should not be included in the model. Because the data contains all the necessary information, mining's job is only to extract and reveal the relationships that it contains. Input from others, particularly from the business managers, is likely to be distracting and should be ignored, discounted, or at least recast into terms that the data actually does address. Remember that the miner probably does know best, and that, if properly and correctly applied, the tools will reveal all.
Rule 5. Find the best algorithms. Mining data is fundamentally about algorithms. For any data set, some particular algorithm will produce the best model. In order to discover the best model, it's very important to use the appropriate algorithm. In fact, fitting an appropriate algorithm (mining) to the data is what data mining is all about.
Rule 6. Rely on memory. Most data mining projects are simple enough that you can hold most important details in your head. There's no need to waste time in documenting the steps you take. By far, the best approach is to keep pressing the investigation forward as fast as possible. Should it ever be necessary to duplicate the investigation or, in the unlikely event, to justify the results at some future time, recreating the original investigation and the line of reasoning you used will be easy and straightforward.
Rule 7. Intuition is more important than standard practice. All data is different, and all data sets are unique, so each should be approached as an individual and unique entity. Standard practices are really only guides for those who aren't experienced and, therefore, haven't developed the intuition necessary to work with each data set as a unique experience.
Rule 8. Minimize interaction between miners and business managers. A skilled miner can easily explore a data set and discover all the interesting, insightful, and useful relationships without any interaction, guidance, or involvement from the business managers. In fact, it's the primary business task for a data miner to take a data set, analyze it, and return a comprehensive report that is insightful and relevant, relying exclusively on the information contained in the data. Data mining tools are so powerful at discovering relationships that a miner's main job is simply to report the deep insights the tools discover.
Rule 9. Minimize data preparation. The most important and interesting part of data mining is creating models using state-of-the-art mining algorithms. Tools are quite capable of preparing any data set automatically, so that it will reveal all the significant, interesting, and relevant relationships contained therein. Preparing data is not only boring and tedious, it also takes a long time. It slows the mining process to a crawl. It's best by far to do as little preparation as possible and get right to the modeling part.
If you adopt all nine of these rules, you will almost guarantee that the data mining project fails. Of course, most readers will want to do the exact opposite. And, although reversing these rules won't guarantee success (rules to guarantee success are far harder to come by than rules to guarantee failure), you'll at least avoid the most common causes of failure.
Be particularly careful with Rule 9, probably the single largest cause for failure. Data preparation consists of two totally separate steps. Some mining tools (although not many) do a reasonable job of the second step (preparing the variables for modeling). But the key to many mining projects is actually the first data preparation step: assembling the source data for mining from one or more source files. Because data assembly is often the largest single contributor to mining project success or failure, it's worth looking at in more detail.
Assembling Data
The biggest problem faced by almost any data analyst, and certainly by most data miners, is assembling a data set that addresses the business problem of interest. Of course, most data is already assembled in some form (in tables, marts, warehouses, spreadsheets, flat files, and so on) and often in quite a variety of formats. Data is hardly data unless it is assembled in some form. But data assembled in some form and data assembled in a form that enables a miner to address a business problem are totally different entities.
Assembling data to address a business problem requires understanding what the business situation is (which isn't always an easy task) and then framing the data so that it can be used to provide information about the business situation. If the situation is clear and well understood and the business objectives are apparent, the required framing is relatively obvious. A simple example: Corporate management may explicitly say that its business is framed in terms of sales of products to customers over time. In this case, it's relatively easy to design a data structure (mart, warehouse, and so on) in which these three dimensions of a business are explicitly represented. Transactions are translated into records in tables that represent the behavior of customers by product per unit of time, for instance.
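As a rough illustration of the simple example above, the following sketch rolls raw transactions up into one record per customer, product, and month. The field layout, the sample rows, and the `summarize` helper are all assumptions made for illustration; they don't come from any particular mart or warehouse design.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transactions: (customer_id, product, purchase_date, amount)
transactions = [
    ("c1", "milk", date(2004, 1, 5), 2.50),
    ("c1", "milk", date(2004, 1, 19), 2.50),
    ("c1", "bread", date(2004, 2, 2), 1.80),
    ("c2", "milk", date(2004, 1, 7), 2.50),
]


def summarize(transactions):
    """Roll transactions up to one record per (customer, product, month)."""
    totals = defaultdict(lambda: {"purchases": 0, "revenue": 0.0})
    for customer, product, day, amount in transactions:
        # The time dimension is reduced to (year, month); the exact day is dropped.
        key = (customer, product, (day.year, day.month))
        totals[key]["purchases"] += 1
        totals[key]["revenue"] += amount
    return dict(totals)


summary = summarize(transactions)
for key in sorted(summary):
    print(key, summary[key])
```

Note what the roll-up keeps (customer, product, period) and what it throws away (the exact day, the order of purchases within the month). Anything discarded at this stage is invisible to whatever tool is applied later.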
However, a miner faces additional challenges. Although mining may discover the obvious in a data set, it's also expected to discover unknown, unexpected, and potentially useful (business actionable) insights. But data structured for a specific use (as in the simple example I mentioned) isn't actually structured to reveal the unknown; in fact, it hides the unknown.
First of all, the initial data isn't captured to reveal as much as possible, but to reflect specific features of worldly events. For instance, a supermarket shopping cart transaction record contains much information but it specifically rejects more, even though the unrecorded features might be both relevant and interesting. A supermarket transaction record seldom captures information about whether the shopper prefers paper or plastic bags or brings bags from home, for example. Yet this information might be enormously indicative of shopping style, or perhaps it isn't. Because the data isn't routinely recorded (in other words, because this information is deliberately ignored), it's impossible to know.
This deliberate rejection of information carries over into the structures of a data set. To continue with the shopping basket example: A transaction report often provides information about the order in which the items pass through the register. This information, although noisy and distorted, turns out to roughly reflect the order in which the items were placed in a basket, which, in turn, is a rough estimate of the path the shopper takes through the store, which is an estimate of shopping style, which can be used (with additional information) to infer which special offers the shopper might respond to. However, in a typical data structure, information about item order isn't included; therefore, a miner can't make any such discovery because that particular relationship has been specifically excluded by the data structure.
I've used a shopping cart example because shopping in a supermarket is a common experience for most people; however, the same principles apply to all data sets. All data sets are designed to reveal specific features and (more important for the data miner) are, consciously or unconsciously, designed to hide (or at least, not reveal) other potentially important features.
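To make the point concrete, here is a minimal sketch contrasting a typical basket summary with an order-preserving view of the same register log. The sample items, the department labels, and both helper functions are hypothetical; a real register feed would look different.

```python
# Hypothetical register log: (item, department) in the order items were scanned.
scanned = [
    ("bananas", "produce"),
    ("lettuce", "produce"),
    ("milk", "dairy"),
    ("bread", "bakery"),
    ("yogurt", "dairy"),
]


def as_basket(scanned):
    """Typical summary structure: item counts only, scan order discarded."""
    basket = {}
    for item, _dept in scanned:
        basket[item] = basket.get(item, 0) + 1
    return basket


def department_path(scanned):
    """Order-preserving feature: departments visited, consecutive repeats collapsed."""
    path = []
    for _item, dept in scanned:
        if not path or path[-1] != dept:
            path.append(dept)
    return path


basket = as_basket(scanned)
path = department_path(scanned)
print(basket)
print(path)
```

From the basket alone there is no way to recover that this shopper went back to the dairy section after the bakery. Only the second structure keeps that (noisy) trace of the route through the store.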
Of course, in order to make data usable, this exclusion is essential. Even so, it's rarely a data miner's job to discover the known or the obvious, and restructuring the data set is an essential part of mining the data. However, more than any other part of mining data, data restructuring is an art, not a science, and creativity and imagination are required. It's also usually the longest and most challenging part of any mining project. However, restructuring is often where the difference between success and failure lies.
Although data restructuring is indeed an art (that is to say, a practice of human skill, intuition, insight, and understanding), there are approaches and skill sets that can be learned and improved with practice.
One of the keys to preparing the data for mining is to understand the business situation in terms that allow the data to be framed appropriately. As any practitioner of data mining knows (although many business users may not), data mining isn't carried out as an activity separate from the business. No data miner, however skilled in the practices of mining, can produce the best results without a good understanding of the business situation. The sole purpose of data mining is to provide a foundation for better and more informed decision-making. No data mining exercise is carried out without the intention, or at least the hope, that the results will provide the insight for a company (or perhaps an individual) to act on the results.
Making it Work
The miner's problem is to interpret the business-oriented expression of the problem at hand and translate it into a form that can be used to direct data assembly.
One useful technique for translating the problem is to use a goal hierarchy, a device for working from the most general objective down to more specific statements.
Figure 1: A sample goal hierarchy.

Figure 1 shows a small fragment of a goal hierarchy that starts with a very general objective: Help marketing make the best use of their dollars. This objective is so high-level that no miner could directly interpret it into terms that would allow data to be discovered and assembled to answer the question. But the miner can get to each lower level in the hierarchy by asking, "What objectives need to be met to achieve this goal?" Eventually, the miner will arrive at a level that lets the objective be interpreted in terms of relationships between recorded features of the world.
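One possible way to keep such a hierarchy in a form you can work with is a small tree, sketched below. Only the top-level objective comes from the text; every lower-level goal here is invented for illustration and is not taken from Figure 1. The `Goal` class and `leaf_goals` helper are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Goal:
    statement: str
    subgoals: List["Goal"] = field(default_factory=list)


def leaf_goals(goal):
    """Return the most specific goals: those with no further subgoals."""
    if not goal.subgoals:
        return [goal.statement]
    leaves = []
    for sub in goal.subgoals:
        leaves.extend(leaf_goals(sub))
    return leaves


# Top-level objective from the article; everything below it is hypothetical.
root = Goal("Help marketing make the best use of their dollars", [
    Goal("Spend on the customers most likely to respond", [
        Goal("Relate past campaign response to recorded customer features"),
    ]),
    Goal("Spend through the most effective channels", [
        Goal("Relate channel used to response rate"),
        Goal("Relate channel used to cost per contact"),
    ]),
])

leaves = leaf_goals(root)
for statement in leaves:
    print(statement)
```

The leaves are the statements phrased as relationships between recorded features, which is the level at which data assembly can begin.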
Data mining is a tool for discovering relationships in data, which must contain recorded features of the world. But a data set is to a miner as paint and canvas are to a painter: a place to start. From these raw materials an artist may construct a mediocre work or a masterpiece.
Mining tools discover relationships between the data set records or between data set variables. Any business object of interest (such as ROI, lifetime value, attrition, response, performance) must be represented appropriately. Similarly, the features with which a useful relationship is to be discovered must also be represented.
For instance, if "days until Thanksgiving" is an important feature, no tool will discover this relationship unless a variable representing it is present, regardless of whether date and time are present in the data set. If "distance to store" is important, mining won't reveal its relationship, even if ZIP codes or latitude and longitude are present, unless geographic position is interpreted into an explicit representation of distance.
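Both features can be derived from data that is already there, as this sketch suggests. It assumes the US holiday (the fourth Thursday of November), a great-circle (haversine) distance, and a mean Earth radius of 6,371 km; the function names are made up for illustration.

```python
import math
from datetime import date, timedelta


def thanksgiving(year):
    """US Thanksgiving: the fourth Thursday of November (assumed definition)."""
    first = date(year, 11, 1)
    # weekday(): Monday=0 ... Thursday=3
    offset = (3 - first.weekday()) % 7
    return first + timedelta(days=offset + 21)


def days_until_thanksgiving(day):
    """Days from `day` to the next Thanksgiving (0 on the day itself)."""
    target = thanksgiving(day.year)
    if day > target:
        target = thanksgiving(day.year + 1)
    return (target - day).days


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula, Earth radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


print(days_until_thanksgiving(date(2004, 11, 1)))
print(round(distance_km(0.0, 0.0, 1.0, 0.0), 1))
```

A transaction date or a pair of coordinates becomes useful to a mining tool only once it has been turned into explicit variables like these and added to each record.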
You can use the goal hierarchy to discover what the important features are. Then, look at the data and determine how it would be possible to represent those features. Consider what could possibly be related in interesting ways and how those features could be represented. If the features aren't represented in the data set, then no mining tool or technique will discover or characterize the relationship.
Avoiding Failure
No one deliberately sets out to fail at mining data. Whatever actions are taken are intended to have an effect that will improve the situation, whatever the situation is. Yet it is all too easy to slide into failure simply by not paying attention to the basics. The most basic of all activities that any data miner ever engages in is assembling data and creating the relevant features. Mining tools can only discover what is in the data set to begin with. At rock bottom, mining tools discover relationships between variables and among records. If the variables or records don't reflect the relevant relationships, they won't be discovered. To a large degree it's the miner's job to structure and restructure the data so that the relevant information is there to be discovered. That is the art of data mining. After that, failure is avoided by not mining data in a vacuum.
Data mining is part, and a very small part, of a much larger business process. It may be an essential part of a data mining project, but incorporating the results of mining with all the related parts of the corporate project is equally, if not more, important for ultimate success.
Posted by kiig at 12.04.2004 14:50