Bad, vile and meaningless: Another day, another crash from Alan's clob

Learn a new thing each day (quite possibly while tearing your hair out)

DB2 does not cache LONG VARCHAR column contents in its buffer pool. With one exception: it "does" if the table lives in system managed space (SMS, i.e. the host filesystem), because there the OS page cache does the caching for you. In database managed space (DMS), DB2 itself does no caching for these columns at all.
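For reference, the flavor is picked when the tablespace is created. A minimal sketch of the two variants, with made-up container paths and sizes:

    -- SMS: containers are OS directories; regular files mean the OS page
    -- cache still covers LONG VARCHAR reads
    CREATE TABLESPACE app_sms
        MANAGED BY SYSTEM
        USING ('/db2/containers/app_sms');

    -- DMS: DB2 manages a preallocated container itself (size in pages),
    -- which is where the LONG VARCHAR caching gap bites
    CREATE TABLESPACE app_dms
        MANAGED BY DATABASE
        USING (FILE '/db2/containers/app_dms.dat' 51200);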

How we got into this situation: we benchmarked a bunch of column types (VARCHAR, CHAR, BLOB, CLOB and LONG VARCHAR) and found that LONG VARCHAR gave good performance with a long maximum data length and efficient packing, so we saw no reason not to use it. But as you probably guessed, we ran our tests on SMS, while the production database used DMS.

Since then, we have watched our DB performance degrade steadily with each passing month, seen the disks eat huge bandwidth, and been unable to fix it. I said it was obviously about caching, and the answer was staring me in the face the whole time, even subconsciously: the slowest tables were always the ones with LONG VARCHAR columns. (But they were also the largest, so I thought maybe there wasn't enough buffer cache for the indices or something like that.) Well, now we finally know what's going on. Thank heavens, at last.

So don't ever, ever, ever use LONG VARCHAR for any purpose. It's not worth the lack of caching. You would need a totally write-only table with huge bandwidth, a large average column length and a small tablespace page size; in other words, no, that doesn't happen, but yes, in that corner case the thing might have its use.

Forget the above paragraph, except for the first sentence. If you do use it, you'll curse yourself later on. Particularly if you weren't aware of the SMS/DMS subtlety around LONG VARCHAR, chances are you're in for rude surprises.

Alternatives

With DB2, a CLOB seems to run at about 2-3x the speed of LONG VARCHAR, but a VARCHAR, if you can use it, is at least 10x faster. Of course, these figures are somewhat meaningless in general; here I'm talking about mostly short entries with the occasional longer one. Think an average length of 60 characters and a median of 50, something like that. And why exactly did we use a column that can store 32 kB for entries averaging around 60 characters? Again, because we wanted to avoid designing in limitations that somebody would be likely to hit. We knew a database schema is very hard to change afterwards, and figured we could afford some performance tradeoff for schema longevity. Besides, occasionally there is a long entry.
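To make the tradeoff concrete, here are the contenders as they would be declared in DB2 (the table and column names are made up; the speed notes are just our measurements from above):

    CREATE TABLE type_bench (
        v  VARCHAR(3000),   -- cached normally; our baseline, ~10x LONG VARCHAR speed
        c  CLOB(32K),       -- ~2-3x LONG VARCHAR speed in our tests
        lv LONG VARCHAR     -- up to 32700 bytes, but uncached in DMS
    );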

A good method would be to store the data in two tables based on value length: use tight VARCHARs for the majority of entries and let the occasional long entry spill into another table. This design is naturally not as simple as the "all longs" version, but you can alleviate the reading pain with a view, and probably get stellar performance all around.
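Something like this, with hypothetical names and sizes:

    -- The common case: short values, held in a normal, cached VARCHAR
    CREATE TABLE entry_short (
        entry_id INTEGER NOT NULL PRIMARY KEY,
        val      VARCHAR(255) NOT NULL
    );

    -- The rare long value spills over here
    CREATE TABLE entry_long (
        entry_id INTEGER NOT NULL PRIMARY KEY,
        val      CLOB(32K) NOT NULL
    );

    -- A view hides the split from readers
    CREATE VIEW entry_all (entry_id, val) AS
        SELECT entry_id, CAST(val AS CLOB(32K)) FROM entry_short
        UNION ALL
        SELECT entry_id, val FROM entry_long;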

Wait, I bet I know what this is: is it the "one true lookup table"?

This isn't exactly the OTLT antipattern but close.

Our table is a key-value store with arbitrary keys (or "columns" from the SQL viewpoint): a key_id plus a free-form string representation of the value, much like a standard SQLite table. It ended up this way because the DB was used as a storage backend and there was no way to know at deployment time what kind of schema would be stored, so a dynamic method was required. The schema would be uploaded by the end user in a web browsing session, and the user may change the schema at any time in any way (and the old data must not be lost).
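In other words, the layout is roughly this (names are illustrative, not our actual schema):

    -- One row per (record, key) pair; the value is always a string
    CREATE TABLE kv_store (
        record_id INTEGER NOT NULL,
        key_id    INTEGER NOT NULL,   -- which user-defined "column" this is
        val       LONG VARCHAR,       -- the mistake this whole post is about
        PRIMARY KEY (record_id, key_id)
    );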

Generally the dynamic methods for creating "tables" are:

1. issuing real DDL at runtime, i.e. creating and altering actual tables to match the user's schema;
2. a generic key-value table like ours, one row per stored value;
3. a fixed table with a set of general-purpose columns that get mapped to the user's fields;
4. serializing the whole record into a single column in some structured format.

If I were to redesign this code now, I'd consider a combination of 3 and 4. That is, sometimes you need a "relational" access method to a small subset of the values in the larger data, and I see no better way than declaring general-purpose fields for storing those, so they are always accessible, while the actual data may take any format. A good choice would be any well-supported, language-independent format, such as XML or YAML, or whatever your pet format is. I'm not so hot about XML generally, but in this case XML would be good because it's already being used as an interface format in many places.
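Roughly like this (everything here is hypothetical, sketching the 3-plus-4 combination rather than any code we actually have):

    CREATE TABLE record_store (
        record_id INTEGER NOT NULL PRIMARY KEY,
        -- Method 3: a few general-purpose slots, mapped per deployment to
        -- whichever user fields need relational lookups
        gp_str1   VARCHAR(255),
        gp_str2   VARCHAR(255),
        gp_num1   DECIMAL(15,2),
        -- Method 4: the full record, serialized (XML in our case)
        doc       CLOB(1M) NOT NULL
    );

    CREATE INDEX record_gp_str1 ON record_store (gp_str1);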

But either way, this type of problem seems very awkward to solve in SQL. SQL is bad at organizing information with loose structure, much to everyone's surprise...