Store very large texts

I've done a small amount of this, years ago... pre-Mac even! tedd's right on the "best" method, but there are a number of "quick and dirty" approaches that save _some_ space, without resorting to a "unique dictionary" built based on your contents. (Which may well be almost totally unknown - other than being "English" - at the time you're trying to build the dictionary anyway.) This note may be below the level of detail you're looking for; if so, email me privately and we'll discuss it further.

Generally, the first step is to determine what portions of the text you're going to carry "straight", and what portions you'll encode. Part of this is deciding how you'll indicate to the "decoder" that a word is encoded; this is generally through use of the high-order bit in ASCII, but since the Mac uses this bit, that's not recommended - which means you'll need some sort of "escape" sequence. The simplest is any non-printing character, although it wastes a full byte. (And you have to be sure to strip any naturally-occurring cases of that character in your input text!) [Uh; there are even script systems on the Mac that use two bytes...]

Following the escape character, you put one, two, or more bytes that will be treated as a number. (I'd use two; 256 "words" in your dictionary is not enough, but more than 65535 is probably overkill.)

Then for decoding, you simply use that number as an index into your dictionary, which is an array, loaded from a resource. When encoding, you search the input text for each word in your dictionary, replacing it with the appropriate number. (Munger is the way to do the replacing.)
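The escape-plus-index scheme above can be sketched in a few lines. This is Python rather than FB, the escape byte ESC and both function names are my own assumptions, and the real thing would use Munger for the replacements:

```python
import struct

ESC = b"\x01"  # assumed escape byte; must be stripped from the input text first

def encode(text: bytes, dictionary: list[bytes]) -> bytes:
    # Replace longest words first, so "communications" isn't eaten by "communication".
    for i, word in sorted(enumerate(dictionary), key=lambda p: -len(p[1])):
        text = text.replace(word, ESC + struct.pack(">H", i))  # escape + 2-byte index
    return text

def decode(data: bytes, dictionary: list[bytes]) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i+1] == ESC:
            # The two bytes after the escape are the dictionary index.
            (index,) = struct.unpack(">H", data[i+1:i+3])
            out += dictionary[index]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

As the note says, decoding is a single fast pass, while encoding does one scan per dictionary word.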

Building the dictionary... here is the art of it! Obviously, it's pointless to replace any one, two, or three-character words, since you'd either lose storage space or break even. So I'd probably take a wide selection of my stored emails, assuming that's what I was going to index, and write a program to sort words by number of occurrences, then word length, discarding anything less than five letters long, or that occurs less than "x" number of times. (Where "x" is on the order of 1/10 of the number of input emails. You'd save a lot of space by indexing "antidisestablishmentarianism", or "supercalifragilisticexpialidocious", but only one time... on the other hand, "communication" will probably show up a lot.)
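A rough cut at that frequency sort might look like this sketch; min_len and min_count are my placeholders for the "five letters" and "x" figures above:

```python
from collections import Counter

def pick_dictionary(emails: list[str], min_len: int = 5,
                    min_count: int = 10, size: int = 65536) -> list[str]:
    """Candidate dictionary: words sorted by occurrence count, then length."""
    counts = Counter(w for mail in emails for w in mail.split())
    # Keep only words long enough and frequent enough to pay for themselves.
    candidates = [(n, len(w), w) for w, n in counts.items()
                  if len(w) >= min_len and n >= min_count]
    candidates.sort(reverse=True)  # most frequent first, longest breaking ties
    return [w for _, _, w in candidates[:size]]
```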

Now, what is a "word"?!? Obviously "hello" and "goodbye" are different, but what about "hello" and "Hello"? I'd say that capitalization would make a word "different" - so "Hello" and "hello" would be two different entities to my codec. We've got plenty of codes to waste, and if "Hello" truly shows up enough times to be in the top section, why not store a code for it? But what _breaks_ one word from another? I'd go the "simple" route, and say that any character with an ASCII code less than 33 is a word break. Obviously, some punctuation, etc., will be "part of the word" using this scheme, but that really shouldn't affect the results much. If it does, you can exclude other characters, calling them word breaks as well.
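That "simple" word-break rule (anything with an ASCII code below 33) is easy to write down; split_words is a hypothetical name, and note it deliberately keeps case and trailing punctuation as part of the word:

```python
def split_words(text: str) -> list[str]:
    """Break on any character with code < 33; keep case and punctuation."""
    words, current = [], []
    for ch in text:
        if ord(ch) < 33:          # space, tab, newline, other controls
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

So "Hello" and "hello." come out as distinct entries, exactly as the paragraph above intends.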

Pick the top couple of hundred entries (you can always add more later) and run them through several spell checkers! (Don't save misspelled words, unless they're very common.) Number your selections. Then _manually_ convert _one_ email message using your hacked-together table, and see how it goes. Now convert it back. You'll learn from the manual process what pitfalls to avoid in writing the code. :-)

Write the code to work with your micro-index, and test it out. You _should_ get about a 5% reduction in storage space, or better, even with this limited set. Then adding the next "n" thousand words to your index should drop your storage to about 70-75% of the original space, given "average" English text. All much simpler than the "complicated" approaches, although not likely to be as space-saving.

Be warned that the _encoding_ is not a fast process! The decoding can be very quick.

P.S. Exercise for the student: how much space can be saved by replacing all double-spaces with a single otherwise-unused non-printing character? What about other common "pairs", such as period-space? How many unused non-printing characters are there (not) in your input text?
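For anyone working the exercise, here's one way to measure those savings; both helper names are mine, not anything from the scheme above:

```python
def pair_savings(text: str, pair: str) -> int:
    """Bytes saved by replacing every non-overlapping `pair` with one byte."""
    return text.count(pair) * (len(pair) - 1)

def unused_nonprinting(text: str) -> list[int]:
    """Non-printing ASCII codes (1-31) that never occur in the input text."""
    used = set(map(ord, text))
    return [c for c in range(1, 32) if c not in used]
```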


We used a technique much like this to compress the data in Cliffs StudyWare.
The difference is that, instead of using a fixed dictionary based on some sample text, we custom-built the dictionary for each "book." We scanned the book, and built an array which contained every unique word that was longer than our "token" size, along with the number of bytes that would be saved by replacing that word with a token (this number was based on the length of the word and the number of times it occurred in the book). If there were more unique words than we were able to tokenize, we'd then sort them according to how many bytes they'd save, and tokenize the "n" words that gave us the most savings. We typically saved 30% or so this way.