* Question on readdir implementation @ 2009-09-15 9:57 Zhang Huan 2009-09-15 14:41 ` Andreas Dilger 2009-09-15 14:53 ` Theodore Tso 0 siblings, 2 replies; 6+ messages in thread From: Zhang Huan @ 2009-09-15 9:57 UTC (permalink / raw) To: linux-ext4 Hi all, I'm reading EXT4 codes and has some questions about readdir implementation. Why traverse the directory in hash order? This brings lots of code to build and traverse a red-black tree. Why not just plainly traverse the directory's blocks? Since the red-black tree is built every time a NFS readdir request comes in, in case of hash collision, the nfs client may receive duplicate dir entries if the buffer is not large enough to return all entries with the same hash value in once. Thanks. -- Zhang Huan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question on readdir implementation 2009-09-15 9:57 Question on readdir implementation Zhang Huan @ 2009-09-15 14:41 ` Andreas Dilger 2009-09-15 14:53 ` Theodore Tso 1 sibling, 0 replies; 6+ messages in thread From: Andreas Dilger @ 2009-09-15 14:41 UTC (permalink / raw) To: Zhang Huan; +Cc: linux-ext4 On Sep 15, 2009 17:57 +0800, Zhang Huan wrote: > I'm reading EXT4 codes and has some questions about readdir > implementation. > > Why traverse the directory in hash order? This brings lots of code to > build and traverse a red-black tree. Why not just plainly traverse the > directory's blocks? Because htree does not maintain constant ordering of directory entries. Consider a readdir running at the same time as files are being added: readdir proc create proc read block 0 read block 1 create new file hash of filename puts entry into block 0 block 0 is full, split it add new block 3 copy 1/2 of block 0 entries to block 3 add new filename to block 0 read block 2 read block 3 When the readdir returns block 3, 1/2 of the entries in that block are the same as were returned with the original block 0 read. This is a violation of POSIX readdir() semantics that require each existing entry only be returned a single time (it doesn't matter if the new filename is returned or not). > Since the red-black tree is built every time a NFS readdir request comes > in, in case of hash collision, the nfs client may receive duplicate dir > entries if the buffer is not large enough to return all entries with the > same hash value in once. This is because NFSv2 only has a 32-bit cookie value, and if there is a whole buffer full of values with the same 32-bit hash there isn't anything that can be done about it except drop those extra duplicates (a very rare case). It would have a much more serious problem if there was a directory larger than 2^32 bytes in size, which would result in all entries beyond 2GB being lost. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question on readdir implementation 2009-09-15 9:57 Question on readdir implementation Zhang Huan 2009-09-15 14:41 ` Andreas Dilger @ 2009-09-15 14:53 ` Theodore Tso 2009-09-15 17:56 ` Florian Weimer 2009-09-16 5:47 ` Zhang Huan 1 sibling, 2 replies; 6+ messages in thread From: Theodore Tso @ 2009-09-15 14:53 UTC (permalink / raw) To: Zhang Huan; +Cc: linux-ext4 On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote: > Hi all, > > I'm reading EXT4 codes and has some questions about readdir > implementation. > > Why traverse the directory in hash order? This brings lots of code to > build and traverse a red-black tree. Why not just plainly traverse the > directory's blocks? > > Since the red-black tree is built every time a NFS readdir request comes > in, in case of hash collision, the nfs client may receive duplicate dir > entries if the buffer is not large enough to return all entries with the > same hash value in once. The reason why we have to do this is because of the telldir() and seekdir() interfaces. NFS is implicated in this as well, because it traverses the directory using a 32-bit offset on the protocol wire. The fundamental problem is they presume that directory is stored as a linear array. For file systems which use some kind of B-tree or other non-linear data structure to speed up directory lookups, the problem is what to do when we need to split a B-tree leaf. When this happens, half of the directory entries in the that B-tree leaf must be moved to a new directory block. However, readdir() must still return every single entry in the directory once and exactly once; and this must be true even if even if telldir()/seekdir() is used, and even if NFS decides to wait for several minutes or hours between reading the first set of directory entries, and then reading the next set of directory entries, sending to the NFS server nothing but the 32-bit offset token. It is for this reason that we must traverse the directory in hash order; so that directory entries are returned once and only once even in the face of node splits. We try very hard to avoid a hash collisions, including using a keyed hash which is kept secret to prevent users from deliberately trying to trigger the failure mode. Looking more closely at what we have done, we should be able to do better on architectures where off_t is 64 bits. Internally, we use a 64-bit hash, but we only put the high 32 bits of the hash in f_offset because we can't tell whether the telldir() or telldir64() call might be used by an application, and we can't tell whether the NFS server is speaking NFSv2 (where the readdir cookie is only 32 bits) or NFSv3 (where the readdir cookie is cookie is 64 bits). On platforms where off_t is 64-bytes, we could store the full 64-bit hash value in f_offset, but we would swap the low and high 32-bit values, so that 32 LSB are in the high 32 bits of f_offset and vice versa. The NFS v2 server would still get a 64-bit f_offset, but it currently doesn't do any range checking before dropping it into the 32-bit cookie, so the high 32 bits would get truncated. This would result in the same behaviour we have today for those poor benighted souls which are still using NFSv2, but allow for better behavior for NFSv3 at least on 64-bit platforms. So this is something we could do in the future. In practice, no one has complained about this as far as NFS is concerned, so it's not high priority for me to pursue. Were you worried about this as a practical matter, or as a theoretical one? Regards, - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question on readdir implementation 2009-09-15 14:53 ` Theodore Tso @ 2009-09-15 17:56 ` Florian Weimer 2009-09-15 18:38 ` Theodore Tso 2009-09-16 5:47 ` Zhang Huan 1 sibling, 1 reply; 6+ messages in thread From: Florian Weimer @ 2009-09-15 17:56 UTC (permalink / raw) To: Theodore Tso; +Cc: Zhang Huan, linux-ext4 * Theodore Tso: > So this is something we could do in the future. In practice, no one > has complained about this as far as NFS is concerned, so it's not high > priority for me to pursue. Were you worried about this as a practical > matter, or as a theoretical one? readdir returning entries in essentially randomized order is a practical performance problem for many things, from grep -r to tar. 8-( (My recent FIBMAP/FIEMAP question was related to that, too.) ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question on readdir implementation 2009-09-15 17:56 ` Florian Weimer @ 2009-09-15 18:38 ` Theodore Tso 0 siblings, 0 replies; 6+ messages in thread From: Theodore Tso @ 2009-09-15 18:38 UTC (permalink / raw) To: Florian Weimer; +Cc: Zhang Huan, linux-ext4 On Tue, Sep 15, 2009 at 05:56:32PM +0000, Florian Weimer wrote: > * Theodore Tso: > > > So this is something we could do in the future. In practice, no one > > has complained about this as far as NFS is concerned, so it's not high > > priority for me to pursue. Were you worried about this as a practical > > matter, or as a theoretical one? > > readdir returning entries in essentially randomized order is a > practical performance problem for many things, from grep -r to tar. 8-( > (My recent FIBMAP/FIEMAP question was related to that, too.) Well, it's not _that_ hard for applications to sort the directory entries by inode number. I've written a LD_PRELOAD, called spd_readdir() for people who want to use it. The mutt application does this natively, and it makes the problem go away. We could do this in the kernel, but for very large directories, you will end up pinning large amounts of memory --- and if an application holds a directory fd open for a long time, the memory needs to be kept pinned until the dfd is closed. Still, for moderate sized directories, it's a possibility. - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question on readdir implementation 2009-09-15 14:53 ` Theodore Tso 2009-09-15 17:56 ` Florian Weimer @ 2009-09-16 5:47 ` Zhang Huan 1 sibling, 0 replies; 6+ messages in thread From: Zhang Huan @ 2009-09-16 5:47 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 On Tue, Sep 15, 2009 at 10:53:37AM -0400, Theodore Tso wrote: > On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote: > > Hi all, > > > > I'm reading EXT4 codes and has some questions about readdir > > implementation. > > > > Why traverse the directory in hash order? This brings lots of code to > > build and traverse a red-black tree. Why not just plainly traverse the > > directory's blocks? > > > > Since the red-black tree is built every time a NFS readdir request comes > > in, in case of hash collision, the nfs client may receive duplicate dir > > entries if the buffer is not large enough to return all entries with the > > same hash value in once. > > The reason why we have to do this is because of the telldir() and > seekdir() interfaces. NFS is implicated in this as well, because it > traverses the directory using a 32-bit offset on the protocol wire. > > The fundamental problem is they presume that directory is stored as a > linear array. For file systems which use some kind of B-tree or other > non-linear data structure to speed up directory lookups, the problem > is what to do when we need to split a B-tree leaf. When this happens, > half of the directory entries in the that B-tree leaf must be moved to > a new directory block. However, readdir() must still return every > single entry in the directory once and exactly once; and this must be > true even if even if telldir()/seekdir() is used, and even if NFS > decides to wait for several minutes or hours between reading the first > set of directory entries, and then reading the next set of directory > entries, sending to the NFS server nothing but the 32-bit offset > token. > > It is for this reason that we must traverse the directory in hash > order; so that directory entries are returned once and only once even > in the face of node splits. We try very hard to avoid a hash > collisions, including using a keyed hash which is kept secret to > prevent users from deliberately trying to trigger the failure mode. > > Looking more closely at what we have done, we should be able to do > better on architectures where off_t is 64 bits. Internally, we use a > 64-bit hash, but we only put the high 32 bits of the hash in f_offset > because we can't tell whether the telldir() or telldir64() call might > be used by an application, and we can't tell whether the NFS server is > speaking NFSv2 (where the readdir cookie is only 32 bits) or NFSv3 > (where the readdir cookie is cookie is 64 bits). On platforms where > off_t is 64-bytes, we could store the full 64-bit hash value in > f_offset, but we would swap the low and high 32-bit values, so that 32 > LSB are in the high 32 bits of f_offset and vice versa. > > The NFS v2 server would still get a 64-bit f_offset, but it currently > doesn't do any range checking before dropping it into the 32-bit > cookie, so the high 32 bits would get truncated. This would result in > the same behaviour we have today for those poor benighted souls which > are still using NFSv2, but allow for better behavior for NFSv3 at > least on 64-bit platforms. > Thanks for replying to explain in such details. So there isn't much we can do before we improve nfsd's offset range checking. > So this is something we could do in the future. In practice, no one > has complained about this as far as NFS is concerned, so it's not high > priority for me to pursue. Were you worried about this as a practical > matter, or as a theoretical one? > Oh, I was asking because some guy I work with told me he has seen this once before. He was testing a proprietary Windows NFS v3 client then, using EXT3 as backend filesystem. I do not know it is hash collision that cause the problem until I spent some time reading EXT3 codes. > Regards, > > - Ted -- Zhang Huan ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2009-09-16 5:47 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-09-15 9:57 Question on readdir implementation Zhang Huan 2009-09-15 14:41 ` Andreas Dilger 2009-09-15 14:53 ` Theodore Tso 2009-09-15 17:56 ` Florian Weimer 2009-09-15 18:38 ` Theodore Tso 2009-09-16 5:47 ` Zhang Huan
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.