Date: Mon, 31 Oct 94 03:56:42 JST
From: Stephen Turnbull
To: eliz AT is DOT elta DOT co DOT il
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: stat()/fstat() for DJGPP, v.02

While I agree with Eli's responses to Morten Welinder's comments about
stat()/fstat(), I'd like to make some comments based on my own
experience and system configuration.

EZ> 5. Directory size is not reported zero by stat(); the number
EZ>    of used directory entries (sans the ``.'' and ``..''
EZ>    pseudo-entries) multiplied by entry size is returned

MW> This could be expensive and is misleading.  If you create a
MW> directory with 1000 files then delete them all, the size of
MW> the directory should not change.

How can this be more misleading than DOS's normal approach of showing
directory sizes as 0?  If you add 1000 files to a directory, its size
*should* change, and that is necessarily going to happen more often
than the case Morten describes!  The only time I can see this as
misleading is in something like du, and a du using Eli's stat() is
always going to give a better approximation than one using DOS
functions for the files' sizes.

EZ> Not too expensive, as my experience shows (stat() is already an
EZ> expensive function).  Misleading? not entirely.  For regular
EZ> files, sizes are also reported as only the number of *used* bytes
EZ> they hold; the last cluster (may be as large as 16KB) is usually
EZ> incomplete, but this doesn't bother us.  It is true that rewriting
EZ> a file usually returns unused clusters to the system, while
EZ> deleting files in a directory doesn't, but to report this slack
EZ> part of the directory is *indeed* expensive (you must work on BIOS
EZ> level and read the FAT for this), and is totally impossible on
EZ> networked drives.

Really?  I guess that the raw disk-reading functions are BIOS-level
and wouldn't be available for network drives, but it's not the FAT you
need to read: a directory is just a regular file with the directory
bit set (except for the root directory, and doing stat()s on the root
directory can't be that common).  Right?  So couldn't you try some
dodge like resetting the directory attribute (I suppose this might
also require BIOS-level functions) and reading the raw directory data?
I have no idea whether something like that would work, but it's weird
enough that it might.  (It also looks very unsafe.  And
expensive---two extra disk operations to reset and set the directory
attribute.)

EZ> So, I chose a (hopefully useful) compromise.

Be that as it may, I'm not sure I agree with this compromise.  One can
imagine a program that compares directory statistics and runs the
defragger based on (among other things) the number of directories that
are much bigger than their file count justifies.

EZ> After all, directories with a large number of unused entries are
EZ> rare (unless you didn't run your favorite defragger for 5 years or
EZ> so ;-)

Until I repartitioned my disk, I had 16KB clusters *and* always had at
least one directory with a couple dozen KB of unused entries: my
Ghostscript build directory after 'make clean'.  I would guess that
people who do a lot of beta testing of such large programs would have
several such directories.  So in terms of "will you find such a
problem on a given system," they're common.  Of course, at that time I
had about 2000 directories and at most 3 such beta-test directories;
in terms of the percentage of directories with such a problem, even
that 0.15% is going to be pretty high for most lusers.  :-)  (I didn't
run a defragger for a long time because my favorite one was Norton
Speedisk, which chokes on big drives---mine is an old version, from
about Norton Utilities 5.0.)
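By the way, counting just the used entries really shouldn't be
expensive---it doesn't even need BIOS-level access, plain DOS
directory searches will do.  An untested sketch (not Eli's actual
code; the names are mine), using the Borland-style findfirst() that
DJGPP supplies:

/* Untested sketch: report a directory's size as (used entries, sans
 * "." and "..") times the 32-byte DOS directory entry size.
 */
#include <dir.h>
#include <stdio.h>
#include <string.h>

#define DIR_ENTRY_SIZE 32   /* a FAT directory entry is always 32 bytes */

long dir_size_estimate(const char *path)
{
    char pattern[260];
    struct ffblk ff;
    long entries = 0;
    int done;

    sprintf(pattern, "%s\\*.*", path);
    /* Include hidden/system/subdirectory entries; DOS also returns
       the "." and ".." pseudo-entries, which we skip. */
    done = findfirst(pattern, &ff,
                     FA_DIREC | FA_HIDDEN | FA_SYSTEM | FA_RDONLY | FA_ARCH);
    while (!done) {
        if (strcmp(ff.ff_name, ".") && strcmp(ff.ff_name, ".."))
            entries++;
        done = findnext(&ff);
    }
    return entries * DIR_ENTRY_SIZE;
}

This sees only live entries, of course---the slack Morten is worried
about stays invisible---which is exactly the compromise at issue.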
EZ> 3. I don't know how to obtain time fields for root directories,

MW> You could use the volume label as a better fall back.

EZ> A disk is not required to have a label; in fact, most floppies
EZ> don't have one.  Even if a label is present, it can easily be
EZ> changed, thus changing its time stamp.  In my view, this makes the
EZ> label method unreliable.

A lot of DOSes (well, IBM's, anyway) automatically put serial numbers
on floppies.  I believe this is done using the volume label bit, but
I'm not sure.  As for lack of reliability....

MW> Also, I think there are some time stamp in the boot record.
MW> Semi-expensive

EZ> AFAIK, there is no time stamp in the boot record, but if you know
EZ> otherwise, please tell me where in the boot record it dwells.

In my small experience, boot records are more likely to change than
volume labels!  (For hard drives, anyway.)  Most of my colleagues
still have disks labelled "MSDOS_5" or the like, and most of them have
some sort of multiboot utility installed after the initial
installation.  (I don't know how common this is outside of Oriental
countries; here average users need multiboot because Japanese DOS and
English DOS don't like each other's programs very much.)

I assume that the rationale for setting the date of the root directory
to Anno Gatesii 0 is that the root directory is the earliest object
created in most file systems.  But this is not universal.  For
example, the MSDOS 5 file system is incompatible with that of MSDOS 4,
so to upgrade one would probably back up one's system, then restore
after reformatting the hard drive and reinstalling DOS.  The root
directory is then younger than anything but new system files.

On Unix, this can happen even without a backup and restore.  I've been
doing a lot of fiddling with my Linux system.  To minimize
backup/restore cycles, I created 5 partitions: ROOT, ROOT-TEST, USR,
USR-TEST, and HOME.  ROOT contains my current working system's minimal
boot and system-repair utilities, ROOT-TEST the corresponding new
installation.  Now, typically the new installation doesn't include
lots of the utilities I use, so I mount the USR-TEST partition on /usr
and the USR partition on /stable-usr, and often *everything* in
/stable-usr is older than /.  (HOME of course contains the directories
where I do my non-system work.)

Simpler than that would be having the superuser 'touch /', which ought
to work.  (I haven't tried it, but why not?)  This could even happen
as a typo....

If you want to stick with the "oldest file" rationale, it might be
worth considering setting the default date to Anno Unixii 0 instead,
since Linuxers using the UMSDOS file system *YUCK* may be able to read
their Unix files from MSDOS (shudder, talk about security holes).
Which leads to the rather bizarre concept that somebody could get hold
of an old PDP-11 tar and restore a Unix file system from before DOS
was born onto their MS-DOS disk.  I guess this is pretty silly, and
pre-1980 files are going to be far rarer than directories with 500
deleted file entries....  Although one can imagine someone touch'ing a
file to such a date (why, I don't know, and why they wouldn't want to
touch it to a "before Unix" date I can't figure out either).

Given all this, it's not clear to me what "reliability" of the root
directory's date might mean.
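(A footnote on Anno Unixii 0: a FAT time stamp is two packed 16-bit
words, with the year counted from 1980, so the on-disk format can't
encode 1970 at all; stat() would have to special-case the root
directory and just return a time_t of 0.  The usual unpacking, in an
untested sketch---the function name is mine:)

#include <time.h>

/* Date word: bits 15-9 year-1980, 8-5 month, 4-0 day.
   Time word: bits 15-11 hours, 10-5 minutes, 4-0 seconds/2. */
time_t dos_to_time_t(unsigned fdate, unsigned ftime)
{
    struct tm tm = {0};

    tm.tm_year = ((fdate >> 9) & 0x7f) + 80;  /* struct tm counts from 1900 */
    tm.tm_mon  = ((fdate >> 5) & 0x0f) - 1;   /* struct tm months are 0-11 */
    tm.tm_mday =   fdate        & 0x1f;
    tm.tm_hour =  (ftime >> 11) & 0x1f;
    tm.tm_min  =  (ftime >> 5)  & 0x3f;
    tm.tm_sec  =  (ftime & 0x1f) * 2;         /* FAT has 2-second granularity */
    tm.tm_isdst = -1;                         /* let mktime() decide DST */
    return mktime(&tm);
}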
I don't find the date of the volume label at all implausible, assuming
it exists.  In some sense it's the last major change the user has made
to the root directory.  (E.g., when I reuse a floppy I typically
'del /fsxyz *.*' and change the volume label---this is much less
obstructive when running DESQview/X than a reformat.)

One way to deal with all this might be to set up the sources so that
individuals could easily arrange their own preferred order of checking
the various alternatives, as in the sketch below.  I think it's
probably a good idea to make the library f?stat as standard as
possible, but if there's some reason the date of / can matter, it
ought to be possible to alter the default behavior.
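For concreteness, here's how that might look---an untested sketch, and
all the names are mine, not the library's.  Each strategy tries to
produce a DOS date/time for the root directory and says whether it
succeeded; f?stat walks the table in order, so changing the default is
just a matter of reordering (or extending) the table:

#include <dir.h>      /* DJGPP's Borland-style findfirst() */
#include <stdio.h>

typedef int (*root_time_fn)(const char *drive,
                            unsigned *fdate, unsigned *ftime);

/* Morten's suggestion: use the volume label's time stamp.  With the
   volume-label attribute alone, DOS's FindFirst matches only the
   label entry (if any) in the root directory. */
static int try_volume_label(const char *drive,
                            unsigned *fdate, unsigned *ftime)
{
    char pattern[16];
    struct ffblk ff;

    sprintf(pattern, "%s\\*.*", drive);   /* drive is e.g. "c:" */
    if (findfirst(pattern, &ff, FA_LABEL) != 0)
        return 0;                         /* no label on this disk */
    *fdate = ff.ff_fdate;
    *ftime = ff.ff_ftime;
    return 1;
}

/* Last resort: Anno Gatesii 0, i.e. midnight, 1-Jan-1980. */
static int try_dos_epoch(const char *drive,
                         unsigned *fdate, unsigned *ftime)
{
    (void)drive;
    *fdate = (0 << 9) | (1 << 5) | 1;     /* year 1980, month 1, day 1 */
    *ftime = 0;
    return 1;                             /* always succeeds */
}

/* Reorder to taste; the last entry should never fail. */
static root_time_fn root_time_order[] = {
    try_volume_label,
    try_dos_epoch,
};

void root_dir_time(const char *drive, unsigned *fdate, unsigned *ftime)
{
    int i;
    for (i = 0;
         i < (int)(sizeof root_time_order / sizeof root_time_order[0]);
         i++)
        if (root_time_order[i](drive, fdate, ftime))
            return;
}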