* [PATCH 0/3] Three VFS patch resends @ 2017-11-09 3:22 NeilBrown 2017-11-09 3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: NeilBrown @ 2017-11-09 3:22 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel Hi, these three patches are all completely separate, but are all related to the VFS. I've posted them all previously but with, as yet, no response (except an ack from Paul McKenney for the RCU one). One was, admittedly, posted early last year... I would appreciate review and, hopefully, acceptance. Thanks, NeilBrown --- NeilBrown (3): VFS: use synchronize_rcu_expedited() in namespace_unlock() Improve fairness when locking the per-superblock s_anon list VFS: close race between getcwd() and d_move() fs/dcache.c | 42 ++++++++++++++++++++++++++++++------------ fs/namespace.c | 2 +- fs/super.c | 1 + include/linux/fs.h | 1 + 4 files changed, 33 insertions(+), 13 deletions(-) -- Signature ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() 2017-11-09 3:22 [PATCH 0/3] Three VFS patch resends NeilBrown @ 2017-11-09 3:22 ` NeilBrown 2017-11-09 3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown 2017-11-09 3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown 2 siblings, 0 replies; 21+ messages in thread From: NeilBrown @ 2017-11-09 3:22 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel The synchronize_rcu() in namespace_unlock() is called every time a filesystem is unmounted. If a great many filesystems are mounted, this can cause a noticeable slow-down in, for example, system shutdown. The sequence: mkdir -p /tmp/Mtest/{0..5000} time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done time umount /tmp/Mtest/* on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and 100 seconds to unmount them. Boot the same VM with 1 CPU and it takes 18 seconds to mount the tmpfs filesystems, but only 36 to unmount. If we change the synchronize_rcu() to synchronize_rcu_expedited(), the umount time on a 4-cpu VM drops to 0.6 seconds. I think this 200-fold speed-up is worth the slightly higher system impact of using synchronize_rcu_expedited(). Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (from general rcu perspective) Signed-off-by: NeilBrown <neilb@suse.com> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index d18deb4c410b..23cdf6c62895 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1420,7 +1420,7 @@ static void namespace_unlock(void) if (likely(hlist_empty(&head))) return; - synchronize_rcu(); + synchronize_rcu_expedited(); group_pin_kill(&head); } ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 3:22 [PATCH 0/3] Three VFS patch resends NeilBrown 2017-11-09 3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown @ 2017-11-09 3:22 ` NeilBrown 2017-11-09 19:52 ` Linus Torvalds 2017-11-09 3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown 2 siblings, 1 reply; 21+ messages in thread From: NeilBrown @ 2017-11-09 3:22 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel bit-spin-locks, as used for dcache hash chains, are not fair. This is not a problem for the dcache hash table, as different CPUs are likely to access different entries in the hash table, so high contention is not expected. However, anonymous dentries (created by NFSD) all live on a single hash chain "s_anon", and the bitlock on this can be highly contended, resulting in soft-lockup warnings. So introduce a per-sb (fair) spinlock and take it before grabbing the bitlock on s_anon. This provides fairness and makes the warnings go away. Signed-off-by: NeilBrown <neilb@suse.com> --- fs/dcache.c | 11 ++++++++++- fs/super.c | 1 + include/linux/fs.h | 1 + 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/fs/dcache.c b/fs/dcache.c index f90141387f01..d5952306206b 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -51,6 +51,8 @@ * - the dcache hash table * s_anon bl list spinlock protects: * - the s_anon list (see __d_drop) + * dentry->d_sb->s_anon_lock protects: + * - the s_anon bl bitlock, ensuring fairness. 
* dentry->d_sb->s_dentry_lru_lock protects: * - the dcache lru lists and counters * d_lock protects: @@ -484,7 +486,12 @@ void __d_drop(struct dentry *dentry) else b = d_hash(dentry->d_name.hash); - hlist_bl_lock(b); + if (b == &dentry->d_sb->s_anon) { + spin_lock(&dentry->d_sb->s_anon_lock); + hlist_bl_lock(b); + spin_unlock(&dentry->d_sb->s_anon_lock); + } else + hlist_bl_lock(b); __hlist_bl_del(&dentry->d_hash); dentry->d_hash.pprev = NULL; hlist_bl_unlock(b); @@ -1965,7 +1972,9 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected) spin_lock(&tmp->d_lock); __d_set_inode_and_type(tmp, inode, add_flags); hlist_add_head(&tmp->d_u.d_alias, &inode->i_dentry); + spin_lock(&tmp->d_sb->s_anon_lock); hlist_bl_lock(&tmp->d_sb->s_anon); + spin_unlock(&tmp->d_sb->s_anon_lock); hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_anon); hlist_bl_unlock(&tmp->d_sb->s_anon); spin_unlock(&tmp->d_lock); diff --git a/fs/super.c b/fs/super.c index 994db21f59bf..af644ae93445 100644 --- a/fs/super.c +++ b/fs/super.c @@ -209,6 +209,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, if (s->s_user_ns != &init_user_ns) s->s_iflags |= SB_I_NODEV; INIT_HLIST_NODE(&s->s_instances); + spin_lock_init(&s->s_anon_lock); INIT_HLIST_BL_HEAD(&s->s_anon); mutex_init(&s->s_sync_lock); INIT_LIST_HEAD(&s->s_inodes); diff --git a/include/linux/fs.h b/include/linux/fs.h index 885266aae2d7..9df9bace53fb 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1359,6 +1359,7 @@ struct super_block { const struct fscrypt_operations *s_cop; + spinlock_t s_anon_lock; /* needed for fairness */ struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */ struct list_head s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev; ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown @ 2017-11-09 19:52 ` Linus Torvalds 2017-11-09 20:50 ` Al Viro 2017-11-10 8:50 ` Christoph Hellwig 0 siblings, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2017-11-09 19:52 UTC (permalink / raw) To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote: > bit-spin-locks, as used for dcache hash chains, are not fair. > This is not a problem for the dcache hash table as different CPUs are > likely to access different entries in the hash table so high contention > is not expected. > However, anonymous dentries (created by NFSD) all live on a single hash > chain "s_anon" and the bitlock on this can be highly contended, resulting > in soft-lockup warnings. This really seems idiotic. Let me rephrase this commit message so that you can see why I think it's wrong: "NFSd uses a single hash chain for all dentries, which can cause horrible lock contention, in ways that the normal hashing does not. This horrendous contention can cause the machine to have bad latency spikes, which can cause soft lockup warnings. Instead of just fixing the bad design decision that causes this horrible contention, we'll just special-case this code and use an additional lock that hides the problem by serializing the actual contending access". Honestly, looking at the code, the whole s_anon thing seems entirely broken. There doesn't even seem to be much reason for it. In pretty much all cases, we could just hash the damn dentry. The only reason for actually having s_anon seems to be that we want some per-superblock list of these unconnected dentries for shrink_dcache_for_umount(). Everything else would actually be *much* happier with just having the dentry on the regular hash table. 
It would entirely get rid of this stupid performance problem, and it would actually simplify all the code elsewhere, because it would remove special cases like this if (unlikely(IS_ROOT(dentry))) b = &dentry->d_sb->s_anon; else b = d_hash(dentry->d_name.hash); and just turn them into b = d_hash(dentry->d_name.hash); so I really wonder if we could just get rid of s_anon entirely. Yes, getting rid of s_anon might involve crazy things like "let's just walk all the dentries at umount time", but honestly, that sounds preferable. Especially if we can just then do something like - set a special flag in the superblock if we ever use __d_obtain_alias() - only scan all the dentries on umount if that flag is set. Hmm? The other alternative would be to just try to make the bitlocks fairer. I would find that much less distasteful than this nasty hack. I could imagine, for example, just turning the bitlocks into "spinlock on hashed array". That's basically what we did with the page lock, which _used_ to be basically a blocking bitlock, and where there was no possible way to not have contention on some pages. We could afford to have two bits for the bitlock (lock and contention) to make something like that more viable. But I really get the feeling that the problem here is not the locking primitive, but that whole s_anon decision. And my gut feeling is that it really should be fixable. Because wouldn't it be much nicer if the horrible nfsd contention just went away rather than be worked around? But maybe I'm missing some _other_ reason why s_anon has to be its own separate list? Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 19:52 ` Linus Torvalds @ 2017-11-09 20:50 ` Al Viro 2017-11-09 23:09 ` NeilBrown 2017-11-10 0:02 ` Linus Torvalds 2017-11-10 8:50 ` Christoph Hellwig 1 sibling, 2 replies; 21+ messages in thread From: Al Viro @ 2017-11-09 20:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: NeilBrown, linux-fsdevel, Linux Kernel Mailing List On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote: > Honestly, looking at the code, the whole s_anon thing seems entirely > broken. There doesn't even seem to be much reason for it. In pretty > much all cases, we could just hash the damn dentry, > > The only reason for actually having s_anon seems to be that we want > some per-superblock list of these unconnected dentries for > shrink_dcache_for_umount(). > > Everything else would actually be *much* happier with just having the > dentry on the regular hash table. It would entirely get rid of this > stupid performance problem, and it would actually simplify all the > code elsewhere, because it would remove special cases like this > > if (unlikely(IS_ROOT(dentry))) > b = &dentry->d_sb->s_anon; > else > b = d_hash(dentry->d_name.hash); > > and just turn them into > > b = d_hash(dentry->d_name.hash); > > so I really wonder if we could just get rid of s_anon entirely. > > Yes, getting rid of s_anon might involve crazy things like "let's just > walk all the dentries at umount time", but honestly, that sounds > preferable. Especially if we can just then do something like > > - set a special flag in the superblock if we ever use __d_obtain_alias() Automatically set for a lot of NFS mounts (whenever you mount more than one tree from the same server, IIRC)... > - only scan all the dentries on umount if that flag is set. > > Hmm? That looks like a bloody painful approach, IMO. I'm not saying I like Neil's patch, but I doubt that "let's just walk the entire dcache on umount" is a good idea. 
I wonder if separating the d_obtain_alias() and d_obtain_root() would be a good idea; the former outnumber the latter by many orders of magnitude. The tricky part is that we could have a disconnected directory from d_obtain_alias() with children already connected to it (and thus normally hashed by d_splice_alias()) and fail to connect the whole thing to parent. That leaves us with an orphaned tree that might stick around past the time when we drop all references to dentries in it. And we want to get those hunted down and shot on umount. Could we * make s_anon hold d_obtain_root ones + orphans from such failed reconnects * make final dput() treat hashed IS_ROOT as "don't retain it" * have d_obtain_alias() put into normal hash, leaving the "move to s_anon" part to reconnect failures. * keep umount side of things unchanged. I agree that temporary insertions into ->s_anon are bogus; hell, I'm not even sure we want to put it on _any_ list initially - we want it to look like it's hashed, so we could set ->next to NULL and have ->pprev point to itself. Then normal case for d_obtain_alias() would not bother the hash chains at all at allocation time, then have it put into the right hash chain on reconnect. And on reconnect failure the caller would've moved it to orphan list (i.e. ->s_anon). ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 20:50 ` Al Viro @ 2017-11-09 23:09 ` NeilBrown 2017-11-09 23:19 ` Al Viro 0 siblings, 1 reply; 21+ messages in thread From: NeilBrown @ 2017-11-09 23:09 UTC (permalink / raw) To: Al Viro, Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List On Thu, Nov 09 2017, Al Viro wrote: > On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote: >> Honestly, looking at the code, the whole s_anon thing seems entirely >> broken. There doesn't even seem to be much reason for it. In pretty >> much all cases, we could just hash the damn dentry, >> >> The only reason for actually having s_anon seems to be that we want >> some per-superblock list of these unconnected dentries for >> shrink_dcache_for_umount(). >> >> Everything else would actually be *much* happier with just having the >> dentry on the regular hash table. It would entirely get rid of this >> stupid performance problem, and it would actually simplify all the >> code elsewhere, because it would remove special cases like this >> >> if (unlikely(IS_ROOT(dentry))) >> b = &dentry->d_sb->s_anon; >> else >> b = d_hash(dentry->d_name.hash); >> >> and just turn them into >> >> b = d_hash(dentry->d_name.hash); >> >> so I really wonder if we could just get rid of s_anon entirely. >> >> Yes, getting rid of s_anon might involve crazy things like "let's just >> walk all the dentries at umount time", but honestly, that sounds >> preferable. Especially if we can just then do something like >> >> - set a special flag in the superblock if we ever use __d_obtain_alias() > > Automatically set for a lot of NFS mounts (whenever you mount more than one > tree from the same server, IIRC)... > >> - only scan all the dentries on umount if that flag is set. >> >> Hmm? > > That looks like a bloody painful approach, IMO. 
I'm not saying I like > Neil's patch, but I doubt that "let's just walk the entire dcache on > umount" is a good idea. > > I wonder if separating the d_obtain_alias() and d_obtain_root() would be > a good idea; the former outnumber the latter by many orders of magnitude. > The tricky part is that we could have a disconnected directory from > d_obtain_alias() with children already connected to it (and thus normally > hashed by d_splice_alias()) and fail to connect the whole thing to parent. > > That leaves us with an orphaned tree that might stick around past the > time when we drop all references to dentries in it. And we want to > get those hunted down and shot on umount. Could we > * make s_anon hold d_obtain_root ones + orphans from such > failed reconnects > * make final dput() treat hashed IS_ROOT as "don't retain it" > * have d_obtain_alias() put into normal hash, leaving the > "move to s_anon" part to reconnect failures. > * keep umount side of things unchanged. > > I agree that temporary insertions into ->s_anon are bogus; hell, I'm not > even sure we want to put it on _any_ list initially - we want it to look > like it's hashed, so we could set ->next to NULL and have ->pprev point > to itself. Then normal case for d_obtain_alias() would not bother > the hash chains at all at allocation time, then have it put into the > right hash chain on reconnect. And on reconnect failure the caller > would've moved it to orphan list (i.e. ->s_anon). I looked back at the original bug report, and it was d_obtain_alias() that was particularly suffering. As this holds two spinlocks while waiting for the bl_lock, a delay can affect other code that doesn't touch s_anon. So improving d_obtain_alias() would be a good goal (as long as we don't just move the problem of course). It isn't only "reconnect failure" that leaves things on s_anon. We normally don't bother trying to connect non-directories. 
So if an NFS server is getting lots of read/write requests without opens or other pathname lookups, it could easily have lots of disconnected files being repeatedly accessed. Keeping the dentries on s_anon means we don't need to keep allocating new ones for every request. So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but it might make sense to add the dentry to the per-fs list of IS_ROOT dentries at that time. One possible approach would be to use d_child rather than d_hash to link together dentries that don't have a parent. We could assign a random number to d_name.hash so it could appear to be hashed without imposing on any one hash chain. We would still need a spinlock in the superblock to manage the anon list that links the d_child entries together... I might try to see how the code looks. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 23:09 ` NeilBrown @ 2017-11-09 23:19 ` Al Viro 0 siblings, 0 replies; 21+ messages in thread From: Al Viro @ 2017-11-09 23:19 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List On Fri, Nov 10, 2017 at 10:09:09AM +1100, NeilBrown wrote: > So if an NFS server is getting lots of read/write request without opens > or other pathname lookups, it could easily have lots of disconnected > files being repeatedly accessed. Keeping the dentries on d_anon means > we don't need to keep allocating new ones for every request. > > So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but > it might make sense to add the dentry to the per-fs list of IS_ROOT > dentries at that time. Watch out for dput() fast path (see fast_dput()) if you go that way. > One possible approach would be to use d_child rather than d_hash to link > together dentries that don't have a parent. > We could assign a random number to d_name.hash so it could appear to be > hashed without imposing on any one hash chain. We would still need a > spinlock in the superblock to manage the d_anon list that links the > d_child's together... > I might try to see how the code looks. Keep in mind that d_hash() includes bits of ->d_parent, aka. dentry itself. So no need for fake ->d_name.hash; you'll get spread from that part. ->d_child is... delicate. There are very interesting games around d_walk vs. dput already; I'd be very careful with that one. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 20:50 ` Al Viro 2017-11-09 23:09 ` NeilBrown @ 2017-11-10 0:02 ` Linus Torvalds 1 sibling, 0 replies; 21+ messages in thread From: Linus Torvalds @ 2017-11-10 0:02 UTC (permalink / raw) To: Al Viro; +Cc: NeilBrown, linux-fsdevel, Linux Kernel Mailing List On Thu, Nov 9, 2017 at 12:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > That looks like a bloody painful approach, IMO. I'm not saying I like > Neil's patch, but I doubt that "let's just walk the entire dcache on > umount" is a good idea. There may be other approaches, but how often do you umount a filesystem that you have NFS-exported? Not often. So making that slower doesn't really sound horrible. > I wonder if separating the d_obtain_alias() and d_obtain_root() would be > a good idea; the former outnumber the latter by many orders of magnitude. > The tricky part is that we could have a disconnected directory from > d_obtain_alias() with children already connected to it (and thus normally > hashed by d_splice_alias()) and fail to connect the whole thing to parent. Well, the real issue (I think) is that we abuse that hash list for the s_anon thing. I'd like to get rid of that part. Now, the reason we do that is because we want to have some way to find dentries that come from that filesystem, but aren't connected. So I really would like to get rid of s_anon entirely. We could easily use _another_ list entirely for holding the "unconnected dentries". Because we do have such a list: the dentry sibling list. So what if we instead of using s_anon, we create a "disconnected root", and make all the disconnected dentries be children of that disconnected root instead? Wouldn't that be nicer? The "reconnect" would be basically a "move from disconnected parent to right place". Maybe I'm missing something, but that sounds much more logical than the current s_anon list. 
> That leaves us with an orphaned tree that might stick around past the > time when we drop all references to dentries in it. And we want to > get those hunted down and shot on umount. Could we > * make s_anon hold d_obtain_root ones + orphans from such > failed reconnects > * make final dput() treat hashed IS_ROOT as "don't retain it" > * have d_obtain_alias() put into normal hash, leaving the > "move to s_anon" part to reconnect failures. > * keep umount side of things unchanged. I guess that would work too, but I'm not seeing why s_anon is so wonderful. If we want to make those entries look hashed, let's just hash them, and use a special parent for it (that "disconnected root"). Why wouldn't that work? That would make the umount side _simpler_, because the disconnected root would be handled exactly like we currently handle the real root. So we'd just do that same "do_one_tree()" on the disconnected root, the same way we do on the real one. That sounds _so_ straightforward that I'm probably missing something important. And if the "move from disconnected state" is very common, then maybe the disconnected root lock ends up being a horrible choke-point instead, simply because we'd take that parent lock all the time for the add/move operations.. So maybe it's a bad idea. But it sounds clean. Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list 2017-11-09 19:52 ` Linus Torvalds 2017-11-09 20:50 ` Al Viro @ 2017-11-10 8:50 ` Christoph Hellwig 1 sibling, 0 replies; 21+ messages in thread From: Christoph Hellwig @ 2017-11-10 8:50 UTC (permalink / raw) To: Linus Torvalds Cc: NeilBrown, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote: > Yes, getting rid of s_anon might involve crazy things like "let's just > walk all the dentries at umount time", but honestly, that sounds > preferable. Especially if we can just then do something like > > - set a special flag in the superblock if we ever use __d_obtain_alias() > > - only scan all the dentries on umount if that flag is set. > > Hmm? The scan would be extremely painful with our current global dcache. But then again the global dcache isn't exactly a good idea to start with. If we had a per-sb dcache using e.g. the resizable hash table the networking people seem to be fond of, your idea would work really well. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 3:22 [PATCH 0/3] Three VFS patch resends NeilBrown 2017-11-09 3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown 2017-11-09 3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown @ 2017-11-09 3:22 ` NeilBrown 2017-11-09 11:41 ` Nikolay Borisov 2017-11-09 20:23 ` Linus Torvalds 2 siblings, 2 replies; 21+ messages in thread From: NeilBrown @ 2017-11-09 3:22 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel d_move() will call __d_drop() and then __d_rehash() on the dentry being moved. This creates a small window when the dentry appears to be unhashed. Many tests of d_unhashed() are made under ->d_lock and so are safe from racing with this window, but some aren't. In particular, getcwd() calls d_unlinked() (which calls d_unhashed()) without d_lock protection, so it can race. This race has been seen in practice with lustre, which uses d_move() as part of name lookup. See: https://jira.hpdd.intel.com/browse/LU-9735 It could race with a regular rename(), and result in ENOENT instead of either the 'before' or 'after' name. The race can be demonstrated with a simple program which has two threads, one renaming a directory back and forth while another calls getcwd() within that directory: it should never fail, but does. See: https://patchwork.kernel.org/patch/9455345/ We could fix this race by taking d_lock and rechecking when d_unhashed() reports true. Alternatively, we can remove the window, which is the approach this patch takes. When __d_drop and __d_rehash are used to move a dentry, an extra flag is passed which causes d_hash.pprev to not be cleared, and to not be tested. 
Signed-off-by: NeilBrown <neilb@suse.com> --- fs/dcache.c | 31 ++++++++++++++++++++----------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index d5952306206b..3130d62f29c9 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -471,8 +471,11 @@ static void dentry_lru_add(struct dentry *dentry) * reason (NFS timeouts or autofs deletes). * * __d_drop requires dentry->d_lock. + * ___d_drop takes an extra @moving argument. + * If true, d_hash.pprev is not cleared, so there is no transient d_unhashed() + * state. */ -void __d_drop(struct dentry *dentry) +static void inline ___d_drop(struct dentry *dentry, bool moving) { if (!d_unhashed(dentry)) { struct hlist_bl_head *b; @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry) } else hlist_bl_lock(b); __hlist_bl_del(&dentry->d_hash); - dentry->d_hash.pprev = NULL; + if (likely(!moving)) + dentry->d_hash.pprev = NULL; hlist_bl_unlock(b); /* After this call, in-progress rcu-walk path lookup will fail. */ write_seqcount_invalidate(&dentry->d_seq); } } + +void __d_drop(struct dentry *dentry) +{ + ___d_drop(dentry, false); +} EXPORT_SYMBOL(__d_drop); void d_drop(struct dentry *dentry) @@ -2387,10 +2396,10 @@ void d_delete(struct dentry * dentry) } EXPORT_SYMBOL(d_delete); -static void __d_rehash(struct dentry *entry) +static void __d_rehash(struct dentry *entry, bool moving) { struct hlist_bl_head *b = d_hash(entry->d_name.hash); - BUG_ON(!d_unhashed(entry)); + BUG_ON(!moving && !d_unhashed(entry)); hlist_bl_lock(b); hlist_bl_add_head_rcu(&entry->d_hash, b); hlist_bl_unlock(b); @@ -2406,7 +2415,7 @@ static void __d_rehash(struct dentry *entry) void d_rehash(struct dentry * entry) { spin_lock(&entry->d_lock); - __d_rehash(entry); + __d_rehash(entry, false); spin_unlock(&entry->d_lock); } EXPORT_SYMBOL(d_rehash); @@ -2580,7 +2589,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode) raw_write_seqcount_end(&dentry->d_seq); fsnotify_update_flags(dentry); } - 
__d_rehash(dentry); + __d_rehash(dentry, false); if (dir) end_dir_add(dir, n); spin_unlock(&dentry->d_lock); @@ -2642,7 +2651,7 @@ struct dentry *d_exact_alias(struct dentry *entry, struct inode *inode) alias = NULL; } else { __dget_dlock(alias); - __d_rehash(alias); + __d_rehash(alias, false); spin_unlock(&alias->d_lock); } spin_unlock(&inode->i_lock); @@ -2828,8 +2837,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target, /* unhash both */ /* __d_drop does write_seqcount_barrier, but they're OK to nest. */ - __d_drop(dentry); - __d_drop(target); + ___d_drop(dentry, true); + ___d_drop(target, exchange); /* Switch the names.. */ if (exchange) @@ -2838,9 +2847,9 @@ static void __d_move(struct dentry *dentry, struct dentry *target, copy_name(dentry, target); /* rehash in new place(s) */ - __d_rehash(dentry); + __d_rehash(dentry, true); if (exchange) - __d_rehash(target); + __d_rehash(target, true); /* ... and switch them in the tree */ if (IS_ROOT(dentry)) { ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown @ 2017-11-09 11:41 ` Nikolay Borisov 2017-11-09 13:08 ` Matthew Wilcox 2017-11-09 20:23 ` Linus Torvalds 1 sibling, 1 reply; 21+ messages in thread From: Nikolay Borisov @ 2017-11-09 11:41 UTC (permalink / raw) To: NeilBrown, Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel On 9.11.2017 05:22, NeilBrown wrote: > d_move() will call __d_drop() and then __d_rehash() > on the dentry being moved. This creates a small window > when the dentry appears to be unhashed. Many tests > of d_unhashed() are made under ->d_lock and so are safe > from racing with this window, but some aren't. > In particular, getcwd() calls d_unlinked() (which calls > d_unhashed()) without d_lock protection, so it can race. > > This race has been seen in practice with lustre, which uses d_move() as > part of name lookup. See: > https://jira.hpdd.intel.com/browse/LU-9735 > It could race with a regular rename(), and result in ENOENT instead > of either the 'before' or 'after' name. > > The race can be demonstrated with a simple program which > has two threads, one renaming a directory back and forth > while another calls getcwd() within that directory: it should never > fail, but does. See: > https://patchwork.kernel.org/patch/9455345/ > > We could fix this race by taking d_lock and rechecking when > d_unhashed() reports true. Alternatively, we can remove the window, > which is the approach this patch takes. > > When __d_drop and __d_rehash are used to move a dentry, an extra > flag is passed which causes d_hash.pprev to not be cleared, and > to not be tested. 
> > Signed-off-by: NeilBrown <neilb@suse.com> > --- > fs/dcache.c | 31 ++++++++++++++++++++----------- > 1 file changed, 20 insertions(+), 11 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index d5952306206b..3130d62f29c9 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -471,8 +471,11 @@ static void dentry_lru_add(struct dentry *dentry) > * reason (NFS timeouts or autofs deletes). > * > * __d_drop requires dentry->d_lock. > + * ___d_drop takes an extra @moving argument. > + * If true, d_hash.pprev is not cleared, so there is no transient d_unhashed() > + * state. > */ > -void __d_drop(struct dentry *dentry) > +static void inline ___d_drop(struct dentry *dentry, bool moving) > { > if (!d_unhashed(dentry)) { > struct hlist_bl_head *b; > @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry) > } else > hlist_bl_lock(b); > __hlist_bl_del(&dentry->d_hash); > - dentry->d_hash.pprev = NULL; > + if (likely(!moving)) > + dentry->d_hash.pprev = NULL; nit: wouldn't if (unlikely(moving)) be a bit more explicit? I suspect the end result is the same; however, it's easy to miss the !. It's not a big deal, just wondering. > hlist_bl_unlock(b); > /* After this call, in-progress rcu-walk path lookup will fail. 
*/ > write_seqcount_invalidate(&dentry->d_seq); > } > } > + > +void __d_drop(struct dentry *dentry) > +{ > + ___d_drop(dentry, false); > +} > EXPORT_SYMBOL(__d_drop); > > void d_drop(struct dentry *dentry) > @@ -2387,10 +2396,10 @@ void d_delete(struct dentry * dentry) > } > EXPORT_SYMBOL(d_delete); > > -static void __d_rehash(struct dentry *entry) > +static void __d_rehash(struct dentry *entry, bool moving) > { > struct hlist_bl_head *b = d_hash(entry->d_name.hash); > - BUG_ON(!d_unhashed(entry)); > + BUG_ON(!moving && !d_unhashed(entry)); > hlist_bl_lock(b); > hlist_bl_add_head_rcu(&entry->d_hash, b); > hlist_bl_unlock(b); > @@ -2406,7 +2415,7 @@ static void __d_rehash(struct dentry *entry) > void d_rehash(struct dentry * entry) > { > spin_lock(&entry->d_lock); > - __d_rehash(entry); > + __d_rehash(entry, false); > spin_unlock(&entry->d_lock); > } > EXPORT_SYMBOL(d_rehash); > @@ -2580,7 +2589,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode) > raw_write_seqcount_end(&dentry->d_seq); > fsnotify_update_flags(dentry); > } > - __d_rehash(dentry); > + __d_rehash(dentry, false); > if (dir) > end_dir_add(dir, n); > spin_unlock(&dentry->d_lock); > @@ -2642,7 +2651,7 @@ struct dentry *d_exact_alias(struct dentry *entry, struct inode *inode) > alias = NULL; > } else { > __dget_dlock(alias); > - __d_rehash(alias); > + __d_rehash(alias, false); > spin_unlock(&alias->d_lock); > } > spin_unlock(&inode->i_lock); > @@ -2828,8 +2837,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target, > > /* unhash both */ > /* __d_drop does write_seqcount_barrier, but they're OK to nest. */ > - __d_drop(dentry); > - __d_drop(target); > + ___d_drop(dentry, true); > + ___d_drop(target, exchange); > > /* Switch the names.. 
*/ > if (exchange) > @@ -2838,9 +2847,9 @@ static void __d_move(struct dentry *dentry, struct dentry *target, > copy_name(dentry, target); > > /* rehash in new place(s) */ > - __d_rehash(dentry); > + __d_rehash(dentry, true); > if (exchange) > - __d_rehash(target); > + __d_rehash(target, true); > > /* ... and switch them in the tree */ > if (IS_ROOT(dentry)) { > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 11:41 ` Nikolay Borisov @ 2017-11-09 13:08 ` Matthew Wilcox 2017-11-09 16:02 ` Nikolay Borisov 0 siblings, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2017-11-09 13:08 UTC (permalink / raw) To: Nikolay Borisov Cc: NeilBrown, Al Viro, linux-fsdevel, Linus Torvalds, linux-kernel On Thu, Nov 09, 2017 at 01:41:24PM +0200, Nikolay Borisov wrote: > On 9.11.2017 05:22, NeilBrown wrote: > > @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry) > > } else > > hlist_bl_lock(b); > > __hlist_bl_del(&dentry->d_hash); > > - dentry->d_hash.pprev = NULL; > > + if (likely(!moving)) > > + dentry->d_hash.pprev = NULL; > > nit: isn't a bit more explicit if (unlikely(moving)). I suspect the end > result is the same, however it's easy to miss the !. It's not a big deal > but just wondering. umm ... you just suggested the exact opposite of what the patch is intended to do. likely()/unlikely() only hint to the compiler the probabilities of the branch; they don't change the meaning of the condition. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 13:08 ` Matthew Wilcox @ 2017-11-09 16:02 ` Nikolay Borisov 0 siblings, 0 replies; 21+ messages in thread From: Nikolay Borisov @ 2017-11-09 16:02 UTC (permalink / raw) To: Matthew Wilcox Cc: NeilBrown, Al Viro, linux-fsdevel, Linus Torvalds, linux-kernel On 9.11.2017 15:08, Matthew Wilcox wrote: > On Thu, Nov 09, 2017 at 01:41:24PM +0200, Nikolay Borisov wrote: >> On 9.11.2017 05:22, NeilBrown wrote: >>> @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry) >>> } else >>> hlist_bl_lock(b); >>> __hlist_bl_del(&dentry->d_hash); >>> - dentry->d_hash.pprev = NULL; >>> + if (likely(!moving)) >>> + dentry->d_hash.pprev = NULL; >> >> nit: isn't a bit more explicit if (unlikely(moving)). I suspect the end >> result is the same, however it's easy to miss the !. It's not a big deal >> but just wondering. > > umm ... you just suggested the exact opposite of what the patch is > intended to do. likely()/unlikely() only hint to the compiler the > probabilities of the branch; they don't change the meaning of the > condition. brainfart, disregard my comment doh.... > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown 2017-11-09 11:41 ` Nikolay Borisov @ 2017-11-09 20:23 ` Linus Torvalds 2017-11-09 22:14 ` NeilBrown 1 sibling, 1 reply; 21+ messages in thread From: Linus Torvalds @ 2017-11-09 20:23 UTC (permalink / raw) To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote: > d_move() will call __d_drop() and then __d_rehash() > on the dentry being moved. This creates a small window > when the dentry appears to be unhashed. Many tests > of d_unhashed() are made under ->d_lock and so are safe > from racing with this window, but some aren't. > In particular, getcwd() calls d_unlinked() (which calls > d_unhashed()) without d_lock protection, so it can race. Hmm. I see what you're doing, but I don't necessarily agree. I would actually almost prefer that we simply change __d_move() itself. The problem is that __d_move() really wants to move the hashed things atomically, but instead of doing that it does an "unhash and then rehash". How nasty would it be to just expand the calls to __d_drop/__d_rehash into __d_move itself, and take both hash list locks at the same time (with the usual ordering and checking if it's the same list, of course). Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 20:23 ` Linus Torvalds @ 2017-11-09 22:14 ` NeilBrown 2017-11-10 1:40 ` Linus Torvalds 0 siblings, 1 reply; 21+ messages in thread From: NeilBrown @ 2017-11-09 22:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 4578 bytes --] On Thu, Nov 09 2017, Linus Torvalds wrote: > On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote: >> d_move() will call __d_drop() and then __d_rehash() >> on the dentry being moved. This creates a small window >> when the dentry appears to be unhashed. Many tests >> of d_unhashed() are made under ->d_lock and so are safe >> from racing with this window, but some aren't. >> In particular, getcwd() calls d_unlinked() (which calls >> d_unhashed()) without d_lock protection, so it can race. > > Hmm. > > I see what you're doing, but I don't necessarily agree. > > I would actually almost prefer that we simply change __d_move() itself. > > The problem is that __d_move() really wants to move the hashes things > atomically, but instead of doing that it does a "unhash and then > rehash". > > How nasty would it be to just expand the calls to __d_drop/__d_rehash > into __d_move itself, and take both has list locks at the same time > (with the usual ordering and checking if it's the same list, of > course). > > Linus something like this? I can probably do better than "b1" and "b2". I assume target must always be hashed ?? Do you like it enough for me to make it into a real patch? I'd probably move hlist_bl_lock_two() into list_bl.h. I'm not generally keen on open-coding subtle code, but maybe it is justified here. 
Thanks, NeilBrown diff --git a/fs/dcache.c b/fs/dcache.c index f90141387f01..1a329fedf23c 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -472,6 +472,9 @@ static void dentry_lru_add(struct dentry *dentry) */ void __d_drop(struct dentry *dentry) { + /* WARNING: any changes here should be reflected in __d_move() + * which open-codes some of this functionality. + */ if (!d_unhashed(dentry)) { struct hlist_bl_head *b; /* @@ -2380,6 +2383,9 @@ EXPORT_SYMBOL(d_delete); static void __d_rehash(struct dentry *entry) { + /* WARNING: any changes here should be reflected in __d_move() + * which open-codes some of this functionality. + */ struct hlist_bl_head *b = d_hash(entry->d_name.hash); BUG_ON(!d_unhashed(entry)); hlist_bl_lock(b); @@ -2796,11 +2802,23 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target) * rename_lock, the i_mutex of the source and target directories, * and the sb->s_vfs_rename_mutex if they differ. See lock_rename(). */ +static void hlist_bl_lock_two(struct hlist_bl_head *b1, struct hlist_bl_head *b2) +{ + if (b1 && b1 < b2) + hlist_bl_lock(b1); + if (b2) + hlist_bl_lock(b2); + if (b1 > b2) + hlist_bl_lock(b1); +} + static void __d_move(struct dentry *dentry, struct dentry *target, bool exchange) { struct inode *dir = NULL; unsigned n; + struct hlist_bl_head *b1, *b2; + if (!dentry->d_inode) printk(KERN_WARNING "VFS: moving negative dcache entry\n"); @@ -2817,10 +2835,24 @@ static void __d_move(struct dentry *dentry, struct dentry *target, write_seqcount_begin(&dentry->d_seq); write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED); - /* unhash both */ - /* __d_drop does write_seqcount_barrier, but they're OK to nest. */ - __d_drop(dentry); - __d_drop(target); + /* We want to unhash both, change names, then rehash one or both. + * If we use __d_drop() and __d_rehash() there will be a window + * when dentry appears to be d_unhashed() which can race with lockless + * checking. 
So instead we open-code the important parts of __d_drop() + * and __d_rehash(). + * @target must already be hashed, @dentry must be if @exchange. + */ + BUG_ON(d_unhashed(dentry) && exchange); + BUG_ON(d_unhashed(target)); + + b1 = d_unhashed(dentry) ? NULL : d_hash(dentry->d_name.hash); + b2 = d_hash(target->d_name.hash); + hlist_bl_lock_two(b1, b2); + if (b1) + __hlist_bl_del(&dentry->d_hash); + __hlist_bl_del(&target->d_hash); + write_seqcount_invalidate(&dentry->d_seq); + write_seqcount_invalidate(&target->d_seq); /* Switch the names.. */ if (exchange) @@ -2829,9 +2861,14 @@ static void __d_move(struct dentry *dentry, struct dentry *target, copy_name(dentry, target); /* rehash in new place(s) */ - __d_rehash(dentry); + hlist_bl_add_head_rcu(&dentry->d_hash, b2); if (exchange) - __d_rehash(target); + hlist_bl_add_head_rcu(&target->d_hash, b1); + else + target->d_hash.pprev = NULL; + if (b1 && b1 != b2) + hlist_bl_unlock(b1); + hlist_bl_unlock(b2); /* ... and switch them in the tree */ if (IS_ROOT(dentry)) { [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-09 22:14 ` NeilBrown @ 2017-11-10 1:40 ` Linus Torvalds 2017-11-10 4:45 ` NeilBrown 0 siblings, 1 reply; 21+ messages in thread From: Linus Torvalds @ 2017-11-10 1:40 UTC (permalink / raw) To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List On Thu, Nov 9, 2017 at 2:14 PM, NeilBrown <neilb@suse.com> wrote: > On Thu, Nov 09 2017, Linus Torvalds wrote: >> >> How nasty would it be to just expand the calls to __d_drop/__d_rehash >> into __d_move itself, and take both hash list locks at the same time >> (with the usual ordering and checking if it's the same list, of >> course). > > something like this? Yes. This looks nicer to me. Partly because I hate those "pass flags to functions that modify their behavior" kinds of patches. I'd rather see just straight-line unconditional code with some possible duplication. That said, your conversion isn't the obvious "no semantic changes" one. The BUG_ON() conditions you added are a bit odd. You've taken the BUG_ON() from __d_rehash() that no longer makes any sense (because we just explicitly unhashed it), and replaced it with a BUG_ON() that didn't exist before, and is also not the condition that __d_drop actually had (or the condition that means that the hash list might be different - ie the whole IS_ROOT() case). So the patch looks conceptually good, but I worry about the details. They may be right, but that odd IS_ROOT case in __d_drop really worries me. Can we rename one of those disconnected entries? Of course, that then ties into the other thread, where those disconnected entries are a bit too special anyway. I also do wonder if we can avoid all the unhash/rehash games entirely (and avoid the hash list locking) if it turns out that the dentry and target hash lists are the same. 
I'm not claiming that as an optimization (it's an unusual case), I'm more thinking that it might fall out fairly naturally from the "lock both" case, since that one needs to check for the same list anyway. > Do you like it enough for me to make it into a real patch? Let's see if Al hates this approach, but I definitely prefer it assuming my worries are groundless. > I'd probably move hlist_bl_lock_two() into list_bl.h. Maybe not - see the above issue where maybe the "same hash" might actually merit different codepath. Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-10 1:40 ` Linus Torvalds @ 2017-11-10 4:45 ` NeilBrown 2017-11-10 19:52 ` Linus Torvalds 2017-11-10 20:53 ` Al Viro 0 siblings, 2 replies; 21+ messages in thread From: NeilBrown @ 2017-11-10 4:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 5358 bytes --] On Thu, Nov 09 2017, Linus Torvalds wrote: > On Thu, Nov 9, 2017 at 2:14 PM, NeilBrown <neilb@suse.com> wrote: >> On Thu, Nov 09 2017, Linus Torvalds wrote: >>> >>> How nasty would it be to just expand the calls to __d_drop/__d_rehash >>> into __d_move itself, and take both has list locks at the same time >>> (with the usual ordering and checking if it's the same list, of >>> course). >> >> something like this? > > Yes. > > This looks nicer to me. Partly because I hate those "pass flags to > functions that modify their behavior" kinds of patches. I'd rather see > just straight-line unconditional code with some possible duplication. ... > > I also do wonder if we can avoid all the unhash/rehash games entirely > (and avoid the hash list locking) if it turns out that the dentry and > target hash lists are the same. I'm not convinced. I haven't actually tried it, but the matrix of possibilities seems a little large. The source dentry may or may not be hashed (not in the "disconnected IS_ROOT" case), and the target may or may not want to be rehashed afterwards (depending on 'exchange'). We could skip the lock for an exchange if they both had the same hash, but not for a simple move. However your description of what it was that you didn't like gave me an idea - I can take the same approach as my original, but not pass flags around. I quite like how this turned out. Dropping the BUG_ON() in d_rehash() isn't ideal, maybe we could add ___d_rehash() without the BUG_ON() and call that from __d_rehash? 
Thanks, NeilBrown From: NeilBrown <neilb@suse.com> Date: Fri, 10 Nov 2017 15:20:06 +1100 Subject: [PATCH] VFS: close race between getcwd() and d_move() d_move() will call __d_drop() and then __d_rehash() on the dentry being moved. This creates a small window when the dentry appears to be unhashed. Many tests of d_unhashed() are made under ->d_lock and so are safe from racing with this window, but some aren't. In particular, getcwd() calls d_unlinked() (which calls d_unhashed()) without d_lock protection, so it can race. This race has been seen in practice with lustre, which uses d_move() as part of name lookup. See: https://jira.hpdd.intel.com/browse/LU-9735 It could race with a regular rename(), and result in ENOENT instead of either the 'before' or 'after' name. The race can be demonstrated with a simple program which has two threads, one renaming a directory back and forth while another calls getcwd() within that directory: it should never fail, but does. See: https://patchwork.kernel.org/patch/9455345/ We could fix this race by taking d_lock and rechecking when d_unhashed() reports true. Alternately we can remove the window, which is the approach this patch takes. ___d_drop() is introduced, which does *not* clear d_hash.pprev so the dentry still appears to be hashed. __d_drop() calls ___d_drop(), then clears d_hash.pprev. __d_move() now uses ___d_drop() and only clears d_hash.pprev when not rehashing. Signed-off-by: NeilBrown <neilb@suse.com> --- fs/dcache.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index f90141387f01..8c83543f5065 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -468,9 +468,11 @@ static void dentry_lru_add(struct dentry *dentry) * d_drop() is used mainly for stuff that wants to invalidate a dentry for some * reason (NFS timeouts or autofs deletes). * - * __d_drop requires dentry->d_lock. 
+ * __d_drop requires dentry->d_lock + * ___d_drop doesn't mark dentry as "unhashed" + * (dentry->d_hash.pprev will be LIST_POISON2, not NULL). */ -void __d_drop(struct dentry *dentry) +static void ___d_drop(struct dentry *dentry) { if (!d_unhashed(dentry)) { struct hlist_bl_head *b; @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry) hlist_bl_lock(b); __hlist_bl_del(&dentry->d_hash); - dentry->d_hash.pprev = NULL; hlist_bl_unlock(b); /* After this call, in-progress rcu-walk path lookup will fail. */ write_seqcount_invalidate(&dentry->d_seq); } } +void __d_drop(struct dentry *dentry) { + ___d_drop(dentry); + dentry->d_hash.pprev = NULL; +} EXPORT_SYMBOL(__d_drop); void d_drop(struct dentry *dentry) @@ -2381,7 +2386,7 @@ EXPORT_SYMBOL(d_delete); static void __d_rehash(struct dentry *entry) { struct hlist_bl_head *b = d_hash(entry->d_name.hash); - BUG_ON(!d_unhashed(entry)); + hlist_bl_lock(b); hlist_bl_add_head_rcu(&entry->d_hash, b); hlist_bl_unlock(b); @@ -2818,9 +2823,9 @@ static void __d_move(struct dentry *dentry, struct dentry *target, write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED); /* unhash both */ - /* __d_drop does write_seqcount_barrier, but they're OK to nest. */ - __d_drop(dentry); - __d_drop(target); + /* ___d_drop does write_seqcount_barrier, but they're OK to nest. */ + ___d_drop(dentry); + ___d_drop(target); /* Switch the names.. */ if (exchange) @@ -2832,6 +2837,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target, __d_rehash(dentry); if (exchange) __d_rehash(target); + else + target->d_hash.pprev = NULL; /* ... and switch them in the tree */ if (IS_ROOT(dentry)) { -- 2.14.0.rc0.dirty [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-10 4:45 ` NeilBrown @ 2017-11-10 19:52 ` Linus Torvalds 2017-11-10 20:53 ` Al Viro 1 sibling, 0 replies; 21+ messages in thread From: Linus Torvalds @ 2017-11-10 19:52 UTC (permalink / raw) To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List On Thu, Nov 9, 2017 at 8:45 PM, NeilBrown <neilb@suse.com> wrote: > > However your description of what it was that you didn't like gave me an > idea - I can take the same approach as my original, but not pass flags > around. > I quite like how this turned out. > Dropping the BUG_ON() in d_rehash() isn't ideal, maybe we could add > ___d_rehash() without the BUG_ON() and call that from __d_rehash? This looks more palatable to me, yes. It still worries me a bit that we end up having that dangling pprev pointer despite having dropped the hash list lock, but as long as nobody uses it for anything but that "is it hashed" test without taking the dentry lock, it all *should* be safe. Please fix the whitespace, though. This is not how we do function definitions: void __d_drop(struct dentry *dentry) { but otherwise I think this patch is acceptable. Still want commentary from Al (and preferably going through his vfs tree for 4.15 or whatever). Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-10 4:45 ` NeilBrown 2017-11-10 19:52 ` Linus Torvalds @ 2017-11-10 20:53 ` Al Viro 2017-11-21 23:50 ` Al Viro 1 sibling, 1 reply; 21+ messages in thread From: Al Viro @ 2017-11-10 20:53 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List On Fri, Nov 10, 2017 at 03:45:41PM +1100, NeilBrown wrote: > -void __d_drop(struct dentry *dentry) > +static void ___d_drop(struct dentry *dentry) > { > if (!d_unhashed(dentry)) { > struct hlist_bl_head *b; > @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry) > > hlist_bl_lock(b); > __hlist_bl_del(&dentry->d_hash); > - dentry->d_hash.pprev = NULL; > hlist_bl_unlock(b); > /* After this call, in-progress rcu-walk path lookup will fail. */ > write_seqcount_invalidate(&dentry->d_seq); > } > } > +void __d_drop(struct dentry *dentry) { > + ___d_drop(dentry); > + dentry->d_hash.pprev = NULL; Umm... That reordering (unhashed vs. ->d_seq) might be a problem on the RCU side. I'm not sure it is, we might get away with that, actually, but I want to finish digging through the pathwalk-related code. Cursing it for being too subtle for its own good, as usual... ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-10 20:53 ` Al Viro @ 2017-11-21 23:50 ` Al Viro 2017-11-22 1:31 ` NeilBrown 0 siblings, 1 reply; 21+ messages in thread From: Al Viro @ 2017-11-21 23:50 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List On Fri, Nov 10, 2017 at 08:53:28PM +0000, Al Viro wrote: > On Fri, Nov 10, 2017 at 03:45:41PM +1100, NeilBrown wrote: > > -void __d_drop(struct dentry *dentry) > > +static void ___d_drop(struct dentry *dentry) > > { > > if (!d_unhashed(dentry)) { > > struct hlist_bl_head *b; > > @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry) > > > > hlist_bl_lock(b); > > __hlist_bl_del(&dentry->d_hash); > > - dentry->d_hash.pprev = NULL; > > hlist_bl_unlock(b); > > /* After this call, in-progress rcu-walk path lookup will fail. */ > > write_seqcount_invalidate(&dentry->d_seq); > > } > > } > > +void __d_drop(struct dentry *dentry) { > > + ___d_drop(dentry); > > + dentry->d_hash.pprev = NULL; > > Umm... That reordering (unhashed vs. ->d_seq) might be a problem > on the RCU side. I'm not sure it is, we might get away with that, > actually, but I want to finish digging through the pathwalk-related > code. Cursing it for being too subtle for its own good, as usual... OK, I believe that it's survivable, but I'd prefer to keep in -next for a while and give it more testing. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move() 2017-11-21 23:50 ` Al Viro @ 2017-11-22 1:31 ` NeilBrown 0 siblings, 0 replies; 21+ messages in thread From: NeilBrown @ 2017-11-22 1:31 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 1379 bytes --] On Tue, Nov 21 2017, Al Viro wrote: > On Fri, Nov 10, 2017 at 08:53:28PM +0000, Al Viro wrote: >> On Fri, Nov 10, 2017 at 03:45:41PM +1100, NeilBrown wrote: >> > -void __d_drop(struct dentry *dentry) >> > +static void ___d_drop(struct dentry *dentry) >> > { >> > if (!d_unhashed(dentry)) { >> > struct hlist_bl_head *b; >> > @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry) >> > >> > hlist_bl_lock(b); >> > __hlist_bl_del(&dentry->d_hash); >> > - dentry->d_hash.pprev = NULL; >> > hlist_bl_unlock(b); >> > /* After this call, in-progress rcu-walk path lookup will fail. */ >> > write_seqcount_invalidate(&dentry->d_seq); >> > } >> > } >> > +void __d_drop(struct dentry *dentry) { >> > + ___d_drop(dentry); >> > + dentry->d_hash.pprev = NULL; >> >> Umm... That reordering (unhashed vs. ->d_seq) might be a problem >> on the RCU side. I'm not sure it is, we might get away with that, >> actually, but I want to finish digging through the pathwalk-related >> code. Cursing it for being too subtle for its own good, as usual... > > OK, I believe that it's survivable, but I'd prefer to keep in -next > for a while and give it more testing. Great, thanks. I assume you will fix the silly '{' at the end of the line when defining __d_drop(). Let me know if you would rather I resend. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2017-11-22 1:32 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-11-09 3:22 [PATCH 0/3] Three VFS patch resends NeilBrown 2017-11-09 3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown 2017-11-09 3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown 2017-11-09 19:52 ` Linus Torvalds 2017-11-09 20:50 ` Al Viro 2017-11-09 23:09 ` NeilBrown 2017-11-09 23:19 ` Al Viro 2017-11-10 0:02 ` Linus Torvalds 2017-11-10 8:50 ` Christoph Hellwig 2017-11-09 3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown 2017-11-09 11:41 ` Nikolay Borisov 2017-11-09 13:08 ` Matthew Wilcox 2017-11-09 16:02 ` Nikolay Borisov 2017-11-09 20:23 ` Linus Torvalds 2017-11-09 22:14 ` NeilBrown 2017-11-10 1:40 ` Linus Torvalds 2017-11-10 4:45 ` NeilBrown 2017-11-10 19:52 ` Linus Torvalds 2017-11-10 20:53 ` Al Viro 2017-11-21 23:50 ` Al Viro 2017-11-22 1:31 ` NeilBrown