Monday, September 11, 2006

The lure of histograms

I had been doing a lot of research into histograms. I scooped a lot of references (which I will list later), and filled my head with no shortage of verbiage. Our current dilemma now is, "Does one introduce more problems when proactively creating targeted histograms (frequency histograms on non-unique columns) than you solve?" I am of the opinion that the benefits outweigh the risks. But I am being cautious because people smarter than I (ie, Jonathan Lewis, Wolfgang Breitling, Dave Ensor) all exercise extreme prejudice and use histograms as little as absolutely necessary. Or at least, that is my understanding. I firmly believe that if you tell the CBO absolutely all the facts about your data and resources, you will get the optimal plan 100% of the time. Which infers that if you fail to tell the truth 100% of the time, you are running the chance of getting a "bad" plan, or even the "worst" plan.

And than the whole topic of equi-depth histograms, or "height balanced". Granted, there will always be a limit to the number of buckets you can use, currently 254 in Oracle 10.2.0.2. If you have more distinct values than max buckets, you will have to resort to summarizing your data as best as possible. This is a vast, grey area. On the one hand, you can shoot in the dark, sample your data and hope for the best (which works most of the time). A slightly better method would be to describe your most popular values. However, the whole exercise is moot if your business needs require specific information about the least-popular values. This is the part that makes equi-depth histograms so messy. There is no silver bullet.

For this reason alone, I wish Oracle would vastly increase the number of histogram buckets one could collect, so as to avoid the whole guessing game. You have to admit, it is much easier to calculate frequency histograms than dive into your data and business logic to determine which data most needs to be described. Yes, it would also be more resource intensive, but my point is it would be easier on the dba.

No comments: