November 03, 2003

Performance

I enjoy some informed Microsoft bashing as much as the next guy, but I have to quibble with a couple of things:

Surely WinFS is going to make applications better? I mean, XML metadata for every file. Common data shared transparently between applications. Automatic searching and grouping. What could be better than that? Well, it won’t work. WinFS is going to be glacial. Whatever benefits WinFS holds for applications will be overwhelmed by performance so poor as to make them unusable.

Now, BFS was not a speed demon, and it wasn’t centered around XML, but it was a perfectly usable filesystem. It’s certainly not ridiculous on its face to have this kind of capability built into the file system.

Of the 140,000 files, there is one file I care about more than any other, my Outlook .PST file. This one file is a repository of all my emails, sent and received, all my calendar items, and all my contacts. Know why it is one file? For performance. Try storing every email, appointment, or contact in a separate file, and you’ll have the slowest PIM known to man.

Is this really true? The system I use for email storage (Maildir) is the complete opposite of this approach: one file for every message. Now, there are positives and negatives to this scheme, but performance-wise there is certainly nothing to complain about, and I’ve got some big mailboxes. As an aside, Outlook’s one-big-file approach has only one actual benefit from a user perspective: dead-simple backup. What it doesn’t allow for is the use of any tools outside of Outlook for accessing, sorting or searching your email. But anyway.
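To make the "any tool can read it" point concrete, here is a minimal sketch using Python's standard mailbox module. It writes a few messages into a throwaway Maildir, then searches them by opening the per-message files directly, without going through any mail client. The function names, addresses and subjects are made up for illustration, and this says nothing about speed on a large mailbox.

```python
import mailbox
import os
import tempfile
from email import message_from_binary_file
from email.message import EmailMessage


def build_demo_maildir(root, subjects):
    """Create a Maildir at `root` (which must not exist yet), one file per message."""
    box = mailbox.Maildir(root, create=True)
    for subject in subjects:
        msg = EmailMessage()
        msg["From"] = "alice@example.com"  # placeholder address
        msg["To"] = "bob@example.com"      # placeholder address
        msg["Subject"] = subject
        msg.set_content("Body of: " + subject)
        box.add(msg)  # new mail lands as its own file under root/new
    box.close()


def find_by_subject(root, needle):
    """Search by reading the message files directly, as any outside tool could."""
    hits = []
    for sub in ("new", "cur"):
        folder = os.path.join(root, sub)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), "rb") as fh:
                msg = message_from_binary_file(fh)
            if needle.lower() in (msg["Subject"] or "").lower():
                hits.append(msg["Subject"])
    return hits


if __name__ == "__main__":
    root = os.path.join(tempfile.mkdtemp(), "Maildir")
    build_demo_maildir(root, ["Lunch plans", "WinFS performance"])
    print(find_by_subject(root, "winfs"))
```

The search function never touches the mailbox module; it only lists a directory and parses ordinary files, which is the whole appeal of the layout.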

Posted by Bill Stilwell at November 3, 2003 08:56 PM
Comments

In my experience as a developer, we generally put everything in one big file in order to prevent users from being able to manipulate the individual files (e.g., accidentally removing part of a collection of files).

Reading a data stream out of files should be faster than reading it out of a database stored in one big file. This is because for each block of data in the latter case, you have to look up the block in whatever data structures the database uses, and then to retrieve the block you do whatever the file system does. Reading data direct from a file skips that first look-up operation.

Using one big file means fewer directory look-ups (you have to search for the one file, rather than for each of the smaller files). On the other hand, if the database is structured internally with a directory structure, then you ARE in effect doing directory look-ups and this may or may not be faster than the file system's code.

If your data items are smaller than a disk block, then your individual files will be wasteful of disk space (and this means reading in data wastes the time spent reading in the part of the blocks not in use).
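The waste is easy to put a number on. The sketch below assumes the simplest possible model: every file occupies a whole number of blocks and nothing else is counted (no directory entries, no metadata, no packing of small files). The function names and the sizes in the usage example are hypothetical, chosen only to show the arithmetic.

```python
def allocated_bytes(size, block_size):
    """Bytes occupied on disk by a file of `size` bytes, rounded up to whole blocks."""
    blocks = -(-size // block_size)  # ceiling division using integers only
    return blocks * block_size


def slack_fraction(sizes, block_size):
    """Fraction of the allocated space that holds no data at all."""
    used = sum(sizes)
    allocated = sum(allocated_bytes(s, block_size) for s in sizes)
    return (allocated - used) / allocated if allocated else 0.0


if __name__ == "__main__":
    # Four hypothetical 1 KB items stored as separate files on 4 KB blocks.
    print(slack_fraction([1024] * 4, 4096))
```

Under this model, four 1 KB items in separate files on 4 KB blocks leave three quarters of the allocated space empty, while the same items packed into one file would fit in a single block.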

If your file system (a) has relatively inefficient directory structures and (b) has a large block size, then glomming everything together into one file may well help efficiency. Windows' FAT file system falls into this category. On the other hand, some operating systems have file systems designed to tackle those problems. ReiserFS on Linux is an example -- it is designed to have fast directory searches and efficient storage of small files.

Posted by: Damian Cugley at November 4, 2003 09:55 AM