On Fri, Sep 27, 2002 at 06:20:27PM -0700, Ryan Cumming wrote:
>
> This is while deleteing an old fsstress directory (a full fsck had been
> performed since the last time the fsstress directory had been touched) while
> running a few instances of the attached program.
>
> You guys have any idea what's going on yet?

Some ideas yes, although I don't have a complete solution yet.

I've been able to replicate it now fairly reliably, with the attached
shell script and 2.4.19 with the 2.4.19-2 dxdir patch. It appears to
be somewhat timing dependent, as where the directory corruption occurs
is not consistent, but I believe it is in the split code. Since
e2fsck -fD packs all of the directories completely, any attempt to add
a file to a directory is guaranteed to cause at least one split, and
possibly two levels of tree splits.

Since the -D option to e2fsck has only relatively recently been
available, I believe this is why it hasn't been noticed up until now
in the testing; directories which are indexed "naturally" as they grow
don't appear to trigger the problem, or are very, very unlikely to
trigger the problem. (One potential avenue for exploration is that the
-D option perfectly sorts all of the directory entries in hash order,
which doesn't normally occur for naturally grown directories, and this
may be triggering a fencepost error in the split code.)

The other thing which I've developed is a patch to e2fsprogs 1.29
(also attached) which fixes the directory corruption without causing
files to end up in lost+found. I'm using the dxdir patch in
production, and I was first able to replicate your problem after I ran
e2fsck -fD on my /usr partition. At that point, /usr/bin and
/usr/share/man/man8 got corrupted. So I modified e2fsck to be able to
correct the problem without throwing a directory block's worth of
dirents into lost+found.

The nature of the corruption is that a directory entry of size 8
(which is enough room for a zero-length name) is left in the directory.
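
To make the size-8 case concrete, here is a rough sketch in C of the
kind of per-entry check involved. It is only loosely modeled on the
kernel's directory-entry validation: the struct, enum, and function
names (dirent_hdr, de_status, check_dirent) are made up for
illustration, and it ignores the little-endian on-disk byte order.
The 8 bytes are exactly the fixed header (inode, rec_len, name_len,
file_type), so such an entry has no room for even a one-character
name.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed 8-byte header of an ext2/ext3 directory entry; the name
 * follows it on disk, padded to a multiple of 4 bytes.
 * (Illustrative struct, not the kernel's own definition.) */
struct dirent_hdr {
    uint32_t inode;     /* inode number, 0 means unused */
    uint16_t rec_len;   /* distance to the next entry in the block */
    uint8_t  name_len;  /* length of the name in bytes */
    uint8_t  file_type; /* file type hint */
};

#define DIR_HDR_LEN 8u
/* Space needed for an entry with a name of the given length. */
#define DIR_REC_LEN(name_len) (((name_len) + DIR_HDR_LEN + 3u) & ~3u)

enum de_status {
    DE_OK = 0,
    DE_TOO_SMALL,      /* rec_len below the minimum for a 1-char name */
    DE_UNALIGNED,      /* rec_len not a multiple of 4 */
    DE_NAME_OVERFLOW,  /* rec_len too small for name_len */
    DE_CROSSES_BLOCK   /* entry runs past the end of the block */
};

/* Check one entry found at byte `offset` within a directory block. */
static enum de_status check_dirent(const struct dirent_hdr *de,
                                   unsigned offset, unsigned blocksize)
{
    if (de->rec_len < DIR_REC_LEN(1))
        return DE_TOO_SMALL;
    if (de->rec_len % 4 != 0)
        return DE_UNALIGNED;
    if (de->rec_len < DIR_REC_LEN(de->name_len))
        return DE_NAME_OVERFLOW;
    if (offset + de->rec_len > blocksize)
        return DE_CROSSES_BLOCK;
    return DE_OK;
}
```

With this sketch, an entry with rec_len 8 and name_len 0 fails the
very first test, since DIR_REC_LEN(0) is 8 but the smallest record
that can hold a name, DIR_REC_LEN(1), is 12.
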
Such an entry is harmless, but it should never happen normally, and so
the ext3 sanity-checking code flags it as an error. With this patch,
e2fsck is much smarter about salvaging corrupt directories, and so it
can do so without causing any directory entries to be lost. (This
corrupted, too-small directory entry appears at the beginning of the
directory block, which is another reason why I strongly suspect the
dx_split code.)

BTW, Andreas, I've tried your stack-usage reduction patch (modified to
sanely deal with out of memory conditions instead of panicking), but
that doesn't seem to fix the problem. So whatever it is, it's
something else, although your patch is still a good one and I've
committed it to the 2.4 ext3-dxdir BK tree.

- Ted