linux-fsdevel.vger.kernel.org archive mirror
* Possible FS race condition between iterate_dir and d_alloc_parallel
@ 2019-09-03 14:44 zhengbin (A)
  2019-09-03 15:40 ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: zhengbin (A) @ 2019-09-03 14:44 UTC (permalink / raw)
  To: jack, Al Viro, akpm, linux-fsdevel; +Cc: zhangyi (F), zhengbin13

We recently encountered an oops (the filesystem is tmpfs):
crash> bt
PID: 108367  TASK: ffff8020d28eda00  CPU: 123  COMMAND: "du"
 #0 [ffff0000ae77b7e0] machine_kexec at ffff00006709d674
 #1 [ffff0000ae77b830] __crash_kexec at ffff000067150354
 #2 [ffff0000ae77b9c0] panic at ffff0000670a9358
 #3 [ffff0000ae77baa0] die at ffff00006708ec98
 #4 [ffff0000ae77bae0] die_kernel_fault at ffff0000670a1c6c
 #5 [ffff0000ae77bb10] __do_kernel_fault at ffff0000670a1924
 #6 [ffff0000ae77bb40] do_translation_fault at ffff0000676bb754
 #7 [ffff0000ae77bb50] do_mem_abort at ffff0000670812e0
 #8 [ffff0000ae77bd50] el1_ia at ffff000067083214
     PC: ffff0000672954c0  [dcache_readdir+216]
     LR: ffff0000672954f8  [dcache_readdir+272]
     SP: ffff0000ae77bd60  PSTATE: 60400009
    X29: ffff0000ae77bd60  X28: ffff8020d28eda00  X27: 0000000000000000
    X26: 0000000000000000  X25: 0000000056000000  X24: ffff80215c854000
    X23: 0000000000000001  X22: ffff8021f2f03290  X21: ffff803f74359698
    X20: ffff803f74359960  X19: ffff0000ae77be30  X18: 0000000000000000
    X17: 0000000000000000  X16: 0000000000000000  X15: 0000000000000000
    X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000000000
    X11: 0000000000000000  X10: ffff8020fee99b18   X9: ffff8020fee99878
     X8: 0000000000a1f3aa   X7: 0000000000000000   X6: ffff00006727d760
     X5: ffffffffffff0073   X4: 0000000315d1d1c6   X3: 000000000000001b
     X2: 00000000ffff803f   X1: 656d616e00676f6c   X0: ffff0000ae77be30
 #9 [ffff0000ae77bd60] dcache_readdir at ffff0000672954bc

The reason is as follows:
Process 1 cats "test", which does not exist in directory A; process 2 also cats "test" in directory A.
Process 3 creates a new file in directory B; process 4 runs ls on directory A.

process 1 (dirA)                 |process 2 (dirA)                     |process 3 (dirB)                        |process 4 (dirA)
do_last                          |do_last                              |do_last                                 |iterate_dir
  inode_lock_shared              |  inode_lock_shared                  |  inode_lock(dirB)                      |  inode_lock_shared
  lookup_open                    |  lookup_open                        |  lookup_open                           |
    d_alloc_parallel             |    d_alloc_parallel                 |    d_alloc_parallel                    |
      d_alloc(add dtry1 to dirA) |                                     |                                        |
      hlist_bl_lock              |      d_alloc(add dtry2 to dirA)     |                                        |
      hlist_bl_add_head_rcu      |                                     |                                        |  dcache_readdir
      hlist_bl_unlock            |                                     |                                        |    p = &dentry->d_subdirs
                                 |      hlist_bl_lock                  |                                        |    next_positive(dentry, p, 1)
                                 |      hlist_bl_for_each_entry        |                                        |      p = from->next (p is dtry2)
                                 |      hlist_bl_unlock                |                                        |
                                 |      dput                           |                                        |
                                 |        retain_dentry(dentry) false  |                                        |
                                 |        dentry_kill                  |                                        |
                                 |          spin_trylock(&parent)      |                                        |
                                 |            __dentry_kill            |                                        |
                                 |              dentry_unlist          |                                        |
                                 |              dentry_free(dtry2)     |                                        |
                                 |                                     |      d_alloc(add dtry2 to dirB)        |
                                 |                                     |      hlist_bl_add_head_rcu             |
                                 |                                     |    dir_inode->i_op->create(new inode)  |
                                 |                                     |                                        |      d = list_entry(p, struct dentry, d_child)
                                 |                                     |                                        |      if (!simple_positive(d)) ---> d belongs to dirB now
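
For reference, the simple_positive() check used in the last step (include/linux/fs.h, as of that era):

static inline bool simple_positive(struct dentry *dentry)
{
        /* has an inode and is still hashed (i.e. not deleted) */
        return d_really_is_positive(dentry) && !d_unhashed(dentry);
}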

In tmpfs, lookup_open --> d_in_lookup --> simple_lookup (shmem_dir_inode_operations) sets dentry->d_op = &simple_dentry_operations:
const struct dentry_operations simple_dentry_operations = {
	.d_delete = always_delete_dentry,
};
so retain_dentry() will return false.
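
For reference, the branch in retain_dentry() that this hits (fs/dcache.c, abridged to the relevant check):

static inline bool retain_dentry(struct dentry *dentry)
{
        if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
                /* always_delete_dentry() returns 1, so we do not retain */
                if (dentry->d_op->d_delete(dentry))
                        return false;
        }
        /* ... other checks elided ... */
        return true;
}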


Should we use spin_lock(&parent->d_lock) in next_positive()?  Commit ebaaa80e8f20 ("lockless next_positive()") removed that spin_lock - was that purely a performance optimization?

Or, where a dentry can be dput() concurrently, should we take inode_lock instead of inode_lock_shared?




* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-03 14:44 Possible FS race condition between iterate_dir and d_alloc_parallel zhengbin (A)
@ 2019-09-03 15:40 ` Al Viro
  2019-09-03 15:41   ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-03 15:40 UTC (permalink / raw)
  To: zhengbin (A); +Cc: jack, akpm, linux-fsdevel, zhangyi (F)

On Tue, Sep 03, 2019 at 10:44:32PM +0800, zhengbin (A) wrote:
> We recently encountered an oops (the filesystem is tmpfs):
> crash> bt
>  #9 [ffff0000ae77bd60] dcache_readdir at ffff0000672954bc
> 
> The reason is as follows:
> Process 1 cats "test", which does not exist in directory A; process 2 also cats "test" in directory A.
> Process 3 creates a new file in directory B; process 4 runs ls on directory A.


good grief, what screen width do you have to make the table below readable?

What I do not understand is how the hell does your dtry2 manage to get actually
freed and reused without an RCU delay between its removal from parent's
->d_subdirs and freeing its memory.  What should've happened in that
scenario is
	* process 4, in next_positive() grabs rcu_read_lock().
	* it walks into your dtry2, which might very well be
just a chunk of memory waiting to be freed; it sure as hell is
not positive.  skipped is set to true, 'i' is not decremented.
Note that ->d_child.next points to the next non-cursor sibling
(if any) or to the ->d_subdirs of parent, so we can keep walking.
	* we keep walking for a while; eventually we run out of
counter and leave the loop.

Only after that we do rcu_read_unlock() and only then anything
observed in that loop might be freed and reused.
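
For reference, the lockless loop in question (fs/libfs.c at the time; it also appears as the removed side of the diff later in this thread):

        rcu_read_lock();
        for (p = from->next; p != &parent->d_subdirs; p = p->next) {
                struct dentry *d = list_entry(p, struct dentry, d_child);
                if (!simple_positive(d)) {
                        skipped = true;         /* negative or unhashed: skip */
                } else if (!--i) {
                        res = d;                /* i-th positive child found */
                        break;
                }
        }
        rcu_read_unlock();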

Confused...  OTOH, I might be misreading that table of yours -
it's about 30% wider than the widest xterm I can get while still
being able to read the font...


* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-03 15:40 ` Al Viro
@ 2019-09-03 15:41   ` Al Viro
  2019-09-04  6:15     ` zhengbin (A)
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-03 15:41 UTC (permalink / raw)
  To: zhengbin (A); +Cc: jack, akpm, linux-fsdevel, zhangyi (F)

On Tue, Sep 03, 2019 at 04:40:07PM +0100, Al Viro wrote:
> [...]

Incidentally, which kernel was that on?


* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-03 15:41   ` Al Viro
@ 2019-09-04  6:15     ` zhengbin (A)
  2019-09-05 17:47       ` Al Viro
  2019-09-09 14:10       ` zhengbin (A)
  0 siblings, 2 replies; 49+ messages in thread
From: zhengbin (A) @ 2019-09-04  6:15 UTC (permalink / raw)
  To: Al Viro; +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1

On 2019/9/3 23:41, Al Viro wrote:

> On Tue, Sep 03, 2019 at 04:40:07PM +0100, Al Viro wrote:
>> On Tue, Sep 03, 2019 at 10:44:32PM +0800, zhengbin (A) wrote:
>>> We recently encountered an oops (the filesystem is tmpfs):
>>> crash> bt
>>>  #9 [ffff0000ae77bd60] dcache_readdir at ffff0000672954bc
>>>
>>> The reason is as follows:
>>> Process 1 cats "test", which does not exist in directory A; process 2 also cats "test" in directory A.
>>> Process 3 creates a new file in directory B; process 4 runs ls on directory A.
>>
>> good grief, what screen width do you have to make the table below readable?
>>
>> What I do not understand is how the hell does your dtry2 manage to get actually
>> freed and reused without an RCU delay between its removal from parent's
>> ->d_subdirs and freeing its memory.  What should've happened in that
>> scenario is
>> 	* process 4, in next_positive() grabs rcu_read_lock().
>> 	* it walks into your dtry2, which might very well be
>> just a chunk of memory waiting to be freed; it sure as hell is
>> not positive.  skipped is set to true, 'i' is not decremented.
>> Note that ->d_child.next points to the next non-cursor sibling
>> (if any) or to the ->d_subdirs of parent, so we can keep walking.
>> 	* we keep walking for a while; eventually we run out of
>> counter and leave the loop.
>>
>> Only after that we do rcu_read_unlock() and only then anything
>> observed in that loop might be freed and reused.
You are right, I missed this.
>>
>> Confused...  OTOH, I might be misreading that table of yours -
>> it's about 30% wider than the widest xterm I can get while still
>> being able to read the font...

The table is my guess; this oops happens occasionally.

(We have one vmcore; the others only have logs, but their backtraces match
the vmcore's, so the cause should be the same.)

Unfortunately, we do not know how to reproduce it. The vmcore shows the
following pattern:

1. dirA has 177 files, and its d_subdirs list looks fine.

2. dirB has 25 files, and its d_subdirs list looks fine.

3. When we ls dirA, it begins with ".", "..", then dirB's first file, second file, ..., last file; and the last file's ->next = &(dirB->d_subdirs).

The vmcore supports this:

crash> struct dir_context ffff0000ae77be30    ---> dcache_readdir's ctx
struct dir_context {
  actor = 0xffff00006727d760 <filldir64>,
  pos = 27    ---> 27 = . + .. + 25 files
}

and in next_positive():

  for (p = from->next; p != &parent->d_subdirs; p = p->next)
      ---> parent is dirA, but p is walking dirB's list, so the
           termination check never matches and the loop keeps going

This should be a bug; I think it is related to locking, especially to
commit ebaaa80e8f20 ("lockless next_positive()").

However, so far I have not found the root cause.  Any suggestions?
> Incidentally, which kernel was that on?
4.19-stable; the code of iterate_dir and d_alloc_parallel is the same as in master.



* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-04  6:15     ` zhengbin (A)
@ 2019-09-05 17:47       ` Al Viro
  2019-09-06  0:55         ` Jun Li
  2019-09-06  2:32         ` zhengbin (A)
  2019-09-09 14:10       ` zhengbin (A)
  1 sibling, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-09-05 17:47 UTC (permalink / raw)
  To: zhengbin (A); +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1, Li Jun

On Wed, Sep 04, 2019 at 02:15:58PM +0800, zhengbin (A) wrote:
> >>
> >> Confused...  OTOH, I might be misreading that table of yours -
> >> it's about 30% wider than the widest xterm I can get while still
> >> being able to read the font...
> 
> The table is my guess; this oops happens occasionally.
> 
> (We have one vmcore; the others only have logs, but their backtraces match the vmcore's, so the cause should be the same.)
> 
> Unfortunately, we do not know how to reproduce it. The vmcore shows the following pattern:
> 
> 1. dirA has 177 files, and its d_subdirs list looks fine.
> 
> 2. dirB has 25 files, and its d_subdirs list looks fine.
> 
> 3. When we ls dirA, it begins with ".", "..", then dirB's first file, second file, ..., last file; and the last file's ->next = &(dirB->d_subdirs).

Hmm...  Now, that is interesting.  I'm not sure it has anything to do
with that bug, but lockless loops over d_subdirs can run into trouble.

Look: dentry_unlist() leaves the ->d_child.next pointing to the next
non-cursor list element (or parent's ->d_subdirs, if there's nothing
else left).  It works in tandem with d_walk(): there we have
                struct dentry *child = this_parent;
                this_parent = child->d_parent;

                spin_unlock(&child->d_lock);
                spin_lock(&this_parent->d_lock);

                /* might go back up the wrong parent if we have had a rename. */
                if (need_seqretry(&rename_lock, seq))
                        goto rename_retry;
                /* go into the first sibling still alive */
                do {
                        next = child->d_child.next;
                        if (next == &this_parent->d_subdirs)
                                goto ascend;
                        child = list_entry(next, struct dentry, d_child);
                } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
                rcu_read_unlock();

Note the recheck of rename_lock there - it does guarantee that even if
child has been killed off between unlocking it and locking this_parent,
whatever it has ended up with in its ->d_child.next has *not* been
moved elsewhere.  It might, in turn, have been killed off.  In that
case its ->d_child.next points to the next surviving non-cursor, also
guaranteed to remain in the same directory, etc.
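
For reference, the dentry_unlist() side of that pairing (fs/dcache.c, abridged; called under the parent's ->d_lock):

static void dentry_unlist(struct dentry *dentry, struct dentry *parent)
{
        struct dentry *next;

        dentry->d_flags |= DCACHE_DENTRY_KILLED;
        if (unlikely(list_empty(&dentry->d_child)))
                return;
        __list_del_entry(&dentry->d_child);
        /* leave ->d_child.next pointing past any cursors, so that a
         * later "skip forward" lands on something that won't move */
        while (dentry->d_child.next != &parent->d_subdirs) {
                next = list_entry(dentry->d_child.next, struct dentry, d_child);
                if (likely(!(next->d_flags & DCACHE_DENTRY_CURSOR)))
                        break;
                dentry->d_child.next = next->d_child.next;
        }
}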

However, lose that rename_lock recheck and we'd get screwed, unless
there's some other d_move() prevention in effect.

Note that all libfs.c users (next_positive(), move_cursor(),
dcache_dir_lseek(), dcache_readdir(), simple_empty()) should be
safe - dcache_readdir() is called with directory locked at least
shared, uses in dcache_dir_lseek() are surrounded by the same,
move_cursor() and simple_empty() hold ->d_lock on parent,
next_positive() is called only under the lock on directory's
inode (at least shared).  Any of those should prevent any
kind of cross-directory moves - both into and out of.

<greps for d_subdirs/d_child users>

Huh?
In drivers/usb/typec/tcpm/tcpm.c:
static void tcpm_debugfs_exit(struct tcpm_port *port)
{
        int i;

        mutex_lock(&port->logbuffer_lock);
        for (i = 0; i < LOG_BUFFER_ENTRIES; i++) {
                kfree(port->logbuffer[i]);
                port->logbuffer[i] = NULL;
        }
        mutex_unlock(&port->logbuffer_lock);

        debugfs_remove(port->dentry);
        if (list_empty(&rootdir->d_subdirs)) {
                debugfs_remove(rootdir);
                rootdir = NULL;
        }
}

Unrelated, but obviously broken.  Not only is the locking
deeply suspect, but it's trivially confused by open() on
the damn directory.  It will definitely have ->d_subdirs
non-empty.

Came in "usb: typec: tcpm: remove tcpm dir if no children",
author Cc'd...  Why not remove the directory on rmmod?
And create on insmod, initially empty...
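
A sketch of that suggestion, with hypothetical init/exit hooks (tcpm_module_init/tcpm_module_exit are made-up names; the point is that the directory's lifetime matches the module's, so no emptiness check is needed):

#include <linux/debugfs.h>
#include <linux/module.h>

static struct dentry *rootdir;

static int __init tcpm_module_init(void)
{
        /* create the shared directory once, at module load */
        rootdir = debugfs_create_dir("tcpm", NULL);
        return 0;
}

static void __exit tcpm_module_exit(void)
{
        /* remove it unconditionally at unload - never mid-flight */
        debugfs_remove(rootdir);
        rootdir = NULL;
}

module_init(tcpm_module_init);
module_exit(tcpm_module_exit);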

fs/nfsd/nfsctl.c:
static void nfsdfs_remove_files(struct dentry *root)
{
        struct dentry *dentry, *tmp;

        list_for_each_entry_safe(dentry, tmp, &root->d_subdirs, d_child) {
                if (!simple_positive(dentry)) {
                        WARN_ON_ONCE(1); /* I think this can't happen? */
                        continue;
It can happen - again, just have it opened and it bloody well will.
Locking is OK, though - parent's inode is locked, so we are
safe from d_move() playing silly buggers there.

fs/autofs/root.c:
static void autofs_clear_leaf_automount_flags(struct dentry *dentry)
{
...
        /* Set parent managed if it's becoming empty */
        if (d_child->next == &parent->d_subdirs &&
            d_child->prev == &parent->d_subdirs)
                managed_dentry_set_managed(parent);

Same bogosity regarding the check for emptiness (that one might've been my
fault).  Locking is safe...  Not sure if all places in autofs/expire.c
are careful enough...

So it doesn't look like this theory holds.  Which filesystem had that
been on and what about ->d_parent of dentries in dirA and dirB
->d_subdirs?


* RE: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-05 17:47       ` Al Viro
@ 2019-09-06  0:55         ` Jun Li
  2019-09-06  2:00           ` Al Viro
  2019-09-06  2:32         ` zhengbin (A)
  1 sibling, 1 reply; 49+ messages in thread
From: Jun Li @ 2019-09-06  0:55 UTC (permalink / raw)
  To: Al Viro, zhengbin (A); +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1

Hi,
> -----Original Message-----
> From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
> Sent: September 6, 2019 1:48
> To: zhengbin (A) <zhengbin13@huawei.com>
> Cc: jack@suse.cz; akpm@linux-foundation.org; linux-fsdevel@vger.kernel.org; zhangyi
> (F) <yi.zhang@huawei.com>; renxudong1@huawei.com; Jun Li <jun.li@nxp.com>
> Subject: Re: Possible FS race condition between iterate_dir and d_alloc_parallel
> 
> [...]
> 
> <greps for d_subdirs/d_child users>
> 
> Huh?
> In drivers/usb/typec/tcpm/tcpm.c:
> static void tcpm_debugfs_exit(struct tcpm_port *port) {
>         int i;
> 
>         mutex_lock(&port->logbuffer_lock);
>         for (i = 0; i < LOG_BUFFER_ENTRIES; i++) {
>                 kfree(port->logbuffer[i]);
>                 port->logbuffer[i] = NULL;
>         }
>         mutex_unlock(&port->logbuffer_lock);
> 
>         debugfs_remove(port->dentry);
>         if (list_empty(&rootdir->d_subdirs)) {
>                 debugfs_remove(rootdir);
>                 rootdir = NULL;
>         }
> }
> 
> Unrelated, but obviously broken.  Not only is the locking deeply suspect, but it's trivially
> confused by open() on the damn directory.  It will definitely have ->d_subdirs non-empty.
> 
> Came in "usb: typec: tcpm: remove tcpm dir if no children", author Cc'd...  Why not
> remove the directory on rmmod?

That's because tcpm is a utility driver and there may be multiple instances
created under the directory; each instance's removal calls tcpm_debugfs_exit(),
but only the last one should remove the directory.

The patch below changed this by using a dedicated dir for each instance:

https://www.spinics.net/lists/linux-usb/msg183965.html

Li Jun


* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-06  0:55         ` Jun Li
@ 2019-09-06  2:00           ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-06  2:00 UTC (permalink / raw)
  To: Jun Li; +Cc: zhengbin (A), jack, akpm, linux-fsdevel, zhangyi (F), renxudong1

On Fri, Sep 06, 2019 at 12:55:22AM +0000, Jun Li wrote:
> > [...]
> 
> That's because tcpm is a utility driver and there may be multiple instances
> created under the directory; each instance's removal calls tcpm_debugfs_exit(),
> but only the last one should remove the directory.

Er...  So why not have the directory present for as long as the module is in,
removing it on rmmod?

> The patch below changed this by using a dedicated dir for each instance:
> 
> https://www.spinics.net/lists/linux-usb/msg183965.html

*shrug*

Up to you; the variant in mainline is obviously broken (open the debugfs
directory and you'll confuse the hell out of that check).  My preference
in fixing those would've been to make mkdir and rmdir of the parent
unconditional, happening on module_init() and module_exit() resp., not
bothering with "is that the last one" checks, but I'm (a) not a user of that
code and (b) currently not quite sober, so I'll just leave that to you guys.


* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-05 17:47       ` Al Viro
  2019-09-06  0:55         ` Jun Li
@ 2019-09-06  2:32         ` zhengbin (A)
  1 sibling, 0 replies; 49+ messages in thread
From: zhengbin (A) @ 2019-09-06  2:32 UTC (permalink / raw)
  To: Al Viro; +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1, Li Jun

On 2019/9/6 1:47, Al Viro wrote:

> On Wed, Sep 04, 2019 at 02:15:58PM +0800, zhengbin (A) wrote:
> [...]
> So it doesn't look like this theory holds.  Which filesystem had that
> been on and what about ->d_parent of dentries in dirA and dirB
> ->d_subdirs?

The filesystem is tmpfs.  The ->d_parent of all dentries in dirA's list is dirA, and of all dentries in dirB's list is dirB.

I still think it is a use-after-free bug: d_move() needs the parent's inode
lock, while dcache_readdir() is called with the directory's inode locked
only shared.




* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-04  6:15     ` zhengbin (A)
  2019-09-05 17:47       ` Al Viro
@ 2019-09-09 14:10       ` zhengbin (A)
  2019-09-09 14:59         ` Al Viro
  1 sibling, 1 reply; 49+ messages in thread
From: zhengbin (A) @ 2019-09-09 14:10 UTC (permalink / raw)
  To: Al Viro; +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1, Hou Tao


On 2019/9/4 14:15, zhengbin (A) wrote:
> [...]
> 1. dirA has 177 files, and its d_subdirs list looks fine.
> 
> 2. dirB has 25 files, and its d_subdirs list looks fine.
> 
> 3. When we ls dirA, it begins with ".", "..", then dirB's first file, second file, ..., last file; and the last file's ->next = &(dirB->d_subdirs).
>
> [...]
>
> This should be a bug; I think it is related to locking, especially to commit ebaaa80e8f20 ("lockless next_positive()").
>
> However, so far I have not found the root cause.  Any suggestions?

There would have to be a timing like the following:

1. Insert a negative dentryB1 into dirB; dentryB1->next = dirB's first
   positive dentry (say fileB).                 [d_alloc_parallel --> d_alloc]

2. Insert a negative dentryB2 into dirB; dentryB2->next = dentryB1.
                                                [d_alloc_parallel --> d_alloc]

3. Remove dentryB1 from dirB; dentryB1->next still points to fileB.
                                                [d_alloc_parallel --> dput(new)]

4. The freed dentryB1 is then reallocated as a dentry in dirA, so dirA's
   d_subdirs->next becomes dentryB1.


process 1 (ls dirA)            |process 2 (reallocate dentryB1 in dirA: d_alloc_parallel --> d_alloc)
dcache_readdir                 |d_alloc
  p = &dentry->d_subdirs       |
  next_positive                |
                               |  __d_alloc --> INIT_LIST_HEAD(&dentry->d_child)
                               |  list_add(&dentry->d_child, &parent->d_subdirs)
                               |    ---> the CPU may execute the stores out of order,
                               |         making parent->d_subdirs->next = dentryB1 visible
                               |         before dentryB1->next has been rewritten
    p = from->next             |
      ---> p is dentryB1, and dentryB1->next still points to fileB


We can solve it in two ways (a sketch of option 1 follows below):

1. Add an smp_wmb() between __d_alloc() and list_add(&dentry->d_child, &parent->d_subdirs).
2. Revert commit ebaaa80e8f20 ("lockless next_positive()").
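
A minimal sketch of option 1, with d_alloc() abridged from fs/dcache.c of that kernel; the smp_wmb() is the hypothetical addition (and, as Al notes below, the lockless reader would still need matching ordering):

struct dentry *d_alloc(struct dentry *parent, const struct qstr *name)
{
        struct dentry *dentry = __d_alloc(parent->d_sb, name);
        if (!dentry)
                return NULL;
        dentry->d_flags |= DCACHE_RCUACCESS;
        spin_lock(&parent->d_lock);
        __dget_dlock(parent);
        dentry->d_parent = parent;
        /*
         * Hypothetical fix: order the initialization done in __d_alloc()
         * (including INIT_LIST_HEAD(&dentry->d_child)) before the insertion
         * below publishes the dentry to lockless ->d_subdirs walkers.
         */
        smp_wmb();
        list_add(&dentry->d_child, &parent->d_subdirs);
        spin_unlock(&parent->d_lock);
        return dentry;
}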

>> Incidentally, which kernel was that on?
4.19-stable; the code of iterate_dir and d_alloc_parallel is the same as in master.



* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-09 14:10       ` zhengbin (A)
@ 2019-09-09 14:59         ` Al Viro
  2019-09-09 15:10           ` zhengbin (A)
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-09 14:59 UTC (permalink / raw)
  To: zhengbin (A); +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1, Hou Tao

On Mon, Sep 09, 2019 at 10:10:00PM +0800, zhengbin (A) wrote:

Hmm...  So your theory is that what you are seeing is the insertion
into the list done by list_add() exposing an earlier ->next pointer
to those who might be doing lockless walk through the list.
Potentially up to the last barrier done before the list_add()...
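
For reference, the insertion reduces to this (include/linux/list.h, debug checks omitted); nothing orders the store publishing the entry against the earlier store to new->next, so a lockless reader on a weakly ordered CPU (arm64 here) can follow prev->next into the new entry and still observe its stale, pre-reuse ->next:

static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;               /* may become visible after... */
        new->prev = prev;
        WRITE_ONCE(prev->next, new);    /* ...the entry is published */
}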

> We can solve it in two ways:
> 
> 1. Add an smp_wmb() between __d_alloc() and list_add(&dentry->d_child, &parent->d_subdirs).
> 2. Revert commit ebaaa80e8f20 ("lockless next_positive()").

I want to take another look at the ->d_subdirs/->d_child readers...
I agree that the above sounds plausible, but I really want to be
sure about the exclusion we have for those accesses.

I'm not sure that smp_wmb() alone would suffice, BTW - the reader side
loop would need to be careful as well.

Which architecture was it on, again?  arm64?


* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-09 14:59         ` Al Viro
@ 2019-09-09 15:10           ` zhengbin (A)
       [not found]             ` <7e32cda5-dc89-719d-9651-cf2bd06ae728@huawei.com>
  0 siblings, 1 reply; 49+ messages in thread
From: zhengbin (A) @ 2019-09-09 15:10 UTC (permalink / raw)
  To: Al Viro; +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1, Hou Tao

On 2019/9/9 22:59, Al Viro wrote:

> [...]
>
> Which architecture was it on, again?  arm64?

arm64




* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
       [not found]             ` <7e32cda5-dc89-719d-9651-cf2bd06ae728@huawei.com>
@ 2019-09-10 21:53               ` Al Viro
  2019-09-10 22:17                 ` Al Viro
  2019-09-14 16:16                 ` [PATCH] " Al Viro
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-09-10 21:53 UTC (permalink / raw)
  To: zhengbin (A); +Cc: jack, akpm, linux-fsdevel, zhangyi (F), renxudong1, Hou Tao

On Tue, Sep 10, 2019 at 11:05:16PM +0800, zhengbin (A) wrote:

> Now we can reproduce it about every 10 minutes, using the following script:
> 
> (run ./go.sh; if that does not reproduce it, killall go.sh open.bin and run ./go.sh again)
> 
> (If we set a poison value for dentry->d_child.next in __d_free, we can reproduce it within 30 seconds:
> 
> @@ -254,6 +255,8 @@ static void __d_free(struct rcu_head *head)
>  {
>         struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
> 
> +       dentry->d_child.next = (void *)0x1234567812345678;
> +
>         kmem_cache_free(dentry_cache, dentry);
>  }
> )
 
OK, that pretty much demonstrates that what's happening there is next_positive()
observing uninitialized ->next after INIT_LIST_HEAD + list_add.

*grumble*

First of all, reverting that commit would close _that_ race; no questions
about that.  However, I would rather avoid it - users generally do
have write access to a dcache_dir_lseek()-using filesystem (tmpfs, that is),
and it's not hard to create a bunch of empty files/links to the same file
in the same directory.  The ability to have ->d_lock held for a long time (just
open the directory and lseek to a huge offset) is not a good thing.
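
A hypothetical illustration of that concern (userspace sketch; the /dev/shm path and directory name are assumptions):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* /dev/shm is tmpfs, so dcache_dir_lseek() serves this lseek();
         * with a fully locked next_positive(), counting entries up to
         * the requested offset would pin the directory's ->d_lock. */
        int fd = open("/dev/shm/bigdir", O_RDONLY | O_DIRECTORY);

        if (fd >= 0)
                lseek(fd, 1L << 30, SEEK_SET);
        return 0;
}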

Said that, the bug is real and certainly does need to be fixed.  Looking at
the users of those lists, we have

Individual filesystems:

	* spufs_prune_dir(): exclusive lock on directory, attempt to do rm -rf,
list_for_each_entry_safe() used to iterate.  Removals are actually done by
shrink_dcache_parent(), after we unpin the children.

	* tcpm_debugfs_exit(): bogus check for the list being empty.  Bogus
locking, wrong semantics.

	* afs_dynroot_depopulate(): exclusive lock on directory, manually
unpins some children, uses list_for_each_entry_safe() to iterate through them.
Actual removal from the list comes later, when the caller gets around to
shrinking dentry tree (it's a part of ->kill_sb()).  Actually, that area
(esp. afs_dynroot_rmdir()) looks fishy in terms of races with fs shutdown.
Maybe correct, maybe not - needs to be looked into.

	* autofs get_next_positive_subdir()/get_next_positive_dentry().
Tree-walkers, traverse these lists under ->d_lock on parent.  Pure readers.
[note: they are somewhat simplified in vfs.git#work.autofs, but for our
purposes it just takes the duplicate logics from these two functions into
a new helper; locking conditions are unchanged]

	* autofs_d_manage(), RCU case: quick check for list emptiness before
even bothering with simple_empty().  Lockless, false negatives are fine,
since simple_empty() will grab ->d_lock and do proper checks.  This is just
the fast case.

	* autofs_clear_leaf_automount_flags(): tries to check if rmdir
victim's removal will make the parent empty.  Exclusive lock on directory,
but the check itself is bogus - have it (the parent) opened and the cursor
will result in false negative.  Real bug.

	* ceph drop_negative_children(): checks if all children are negative,
->d_lock held, list_for_each_entry for iteration through the list.  Pure reader.

	* coda_flag_children(): loop through all children, call
coda_flag_inode() for inodes of positive ones.  ->d_lock held,
list_for_each_entry as iterator, pure reader.

	* debugfs_remove_recursive(): looks for the first positive hashed
child, looping under ->d_lock on parent.  Pure reader until that point.
Exclusive lock on directory.  It also checks the child's list for emptiness -
lockless, false negatives being OK.

	* tracefs_remove_recursive(): same as debugfs analogue (a copy of it,
actually).

	* trace events' remove_event_file_dir(): goes through positive children
of directory, zeroes ->i_private for their inodes.  ->d_lock on directory,
list_for_each_entry for iterator, pure reader.  Further exclusion needs to be
looked into to tell if parent dentry can get freed under it.  _VERY_ likely
to be fishy in terms of the effect on opened files in that directory.
ftrace_event_release() will oops if inode->i_private has been zeroed while
the file had been opened.

	* __fsnotify_update_child_dentry_flags(): iterate through positive
children, modifying their ->d_flags.  ->d_lock held, list_for_each_entry for
iterator, pure reader.

	* nfsdfs_remove_files(): exclusive lock on directory, iterate through
children, calling nfsdfs_remove_file() on positive ones.  And screaming on
negatives, including the cursors.  User-triggerable WARN_ON(), how nice...
list_for_each_entry_safe for iterator.  IOW, yet another rm -rf from the kernel.

Core dcache logics:

	* __d_move(): moves from one parent to another or inserts
a previously parentless one into a new parent's list.  rename_lock is held,
so's ->s_vfs_rename_mutex.  Parent(s) are locked at least shared; ->d_lock
is held on their dentries.  Insertion into the list is always at the head.
Note that "at least shared" part is actually "exclusive except for the case
of d_splice_alias() from ->lookup() finding a preexisting alias of subdir's
inode".

	* d_alloc(): inserts new child into the parent's list.  ->d_lock on
parent held, child is negative at that point.  NOTE: most of the callers
are either with parent locked at least shared, or in situation when
the entire tree is not accessible to anyone else (mount-time, basically).
HOWEVER, there are exceptions and these are potential headache.  The easiest
one to hit is probably devpts_pty_new().  IOW, assuming that directory
locked exclusive will be enough to exclude d_alloc() is not safe on
an arbitrary filesystem.  What's more, the same devpts has file *removals*
done without such exclusion, which can get even uglier - dcache_readdir()
(and it is a dcache_readdir() user) does not expect dentry to get freed
under it.  On devpts it can happen.

	* __d_alloc(): constructor, sets the list empty.

	* dentry_unlist(): called by __dentry_kill(), after dentry is doomed
to get killed.  Parent's ->d_lock is held.  Removed from the list, ->next
is left pointing to the next non-cursor (if any).  Note that ->next is set
either to parent's ->d_subdirs *OR* to something that still hadn't been
passed to __dentry_kill().

	* d_walk(): walks the tree; accesses to the lists are under parent's
->d_lock.  As a side note, when we ascend into the parent we might end up with
locked parent and looking at the already doomed child.  In that case we skip
forward by ->d_child.next until we find a non-doomed one.  That's paired with
dentry_unlist() and AFAICS the flag used to mark the doomed ones is redundant -
__dentry_kill() starts with marking ->d_lockref dead, then, still holding
parent's ->d_lock, gets around to dentry_unlist() which marks it with
DCACHE_DENTRY_KILLED and sets ->d_child.next.  All before dropping parent's
->d_lock and lockref remains marked dead ever after.
IOW, d_walk() might as well have used __lockref_is_dead(&child->d_lockref)
(or open-coded it as child->d_count < 0).  If we do that, DCACHE_DENTRY_KILLED
could be killed off.  Anyway, that's a side story.
	FWIW, the main consideration for d_walk() analysis is the following:
if we have grabbed ->d_lock on a live dentry, we are guaranteed that everything
in its ->d_parent chain will have positive refcount and won't get physically
freed until an RCU delay starting after we drop ->d_lock.  That's what makes
the 'ascend' loop in there safe.

	A large part of headache in dcache.c lifetime rules (and they are
rather subtle) is in that area; it needs to be documented, and I've got some
bits and pieces of that, but it really needs to be turned into a coherent
text.

Assorted helpers in VFS:

	* simple_empty(): checks (under ->d_lock) whether there are any
positive hashed children.  Pure readers.

	* move_cursor(): directory locked at least shared.  Moves the cursor
(with directory's ->d_lock held).  Position where we take the cursor comes
from a positive hashed child, which shouldn't be able to go away in such
conditions.  The reasons why it (mostly) works are somewhat brittle:
	+ shared lock on the directory excludes most of the __d_move()
callers.  The only exception would be d_splice_alias() from ->lookup()
picking an existing alias for directory inode, and none of the dcache_readdir()
users do that.
	+ the same lock excludes all directory-modifying syscalls
	+ in-kernel file removals are mostly excluded by the same lock.
However, there are exceptions, devpts being the most obvious one.

	* next_positive(): the problematic one.  It walks the list
with only shared lock guaranteed on directory.  No ->d_lock, retries
on ->i_dir_seq bumps.  Relies upon the lack of moves to/from directory.
Unfortunately, the lack of ->d_lock means lacking a barrier ;-/

	* dcache_readdir(): directory locked at least shared.  Iterates
through the directory using next_positive() starting at cursor position
or beginning of directory, moves cursor to wherever it stops with
move_cursor().  Note that the cursor is inserted into the list of children
only if we ever move past the beginning.  We have a problem with
in-kernel file removals without directory locked exclusive - both for
move_cursor() and for dir_emit(); the latter is much worse, since we
can't hold any spinlocks over that (it might very well call copy_to_user()).

	* dcache_dir_lseek(): calls next_positive() and move_cursor()
under shared lock on directory.

	* umount_check(): somewhat fishy; it tries to check for presence of
(busy) children, so that warnings about something being busy on umount would
be produced only for leaves.  The check is not quite right - for a busy empty
directory with cursors in it we get no warnings.  Not critical - that thing
should never trigger in the first place.  And arguably we might want to drop
that logics entirely - warn about all ancestors of something busy.  Check is
done under ->d_lock.

	So...

* all modifications of the lists happen under parent's ->d_lock (thankfully).
* all readers that walk those under ->d_lock are safe.
* modifying under both ->d_lock and exclusive lock on directory is safe.

* walking the list with just the exclusive lock on directory is *NOT*
safe without something like rcu_read_lock().  Something like stat(2) on
inexistent file done just as we hit spufs_prune_dir() will end up with
dentry allocated and then dropped.  Have the final dput() there come
while spufs_prune_dir() is walking the list, have spufs_prune_dir() lose
the timeslice at just the right moment and we are screwed.  That has
nothing to do with readdir and lseek, BTW - those are excluded.
The same goes for nfsdfs_remove_files().  In both cases I'd prefer to
grab ->d_lock - directories are not large.  And nfsdfs_remove_files()
needs to get a clue - simple_positive() being false is trivially possible
there.
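
A minimal adjustment along those lines might look like this (a sketch, not a tested fix; it drops the WARN and simply skips negative children and cursors - doing the whole walk under ->d_lock would need more care, since nfsdfs_remove_file() presumably cannot run under a spinlock):

static void nfsdfs_remove_files(struct dentry *root)
{
        struct dentry *dentry, *tmp;

        list_for_each_entry_safe(dentry, tmp, &root->d_subdirs, d_child) {
                if (!simple_positive(dentry))
                        continue;       /* negative child or readdir cursor */
                nfsdfs_remove_file(dentry);
        }
}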

* we might need to grab dentry reference around dir_emit() in dcache_readdir().
As it is, devpts makes it very easy to fuck that one up.

* it might make sense to turn next_positive() into "try to walk that much,
return a pinned dentry, drop the original one, report how much we'd walked".
That would allow to bring ->d_lock back and short-term it might be the best
solution.  IOW,
int next_positive(parent, from, count, dentry)
	grab ->d_lock
	walk the list, decrementing count on hashed positive ones
		 if we see need_resched
			 break
	if we hadn't reached the end, grab whatever we'd reached
	drop ->d_lock
	dput(*dentry)
	if need_resched
		schedule
	*dentry = whatever we'd grabbed or NULL
	return count;

The callers would use that sucker in a loop - readdir would just need to
initialize next to NULL and do
        while (next_positive(dentry, p, 1, &next), next != NULL) {
in the loop, with dput(next) in the very end.  And lseek would do
				to = NULL;
				p = &dentry->d_subdirs;
				do {
					n = next_positive(dentry, p, n, &to);
					if (!to)
						break;
					p = &to->d_child;
				} while (n);
				move_cursor(cursor, to ? p : NULL);
				dput(to);
instead of
				to = next_positive(dentry, &dentry->d_subdirs, n);
				move_cursor(cursor, to ? &to->d_child : NULL);

Longer term I would really like to get rid of ->d_lock in that thing,
but it's much too late in this cycle for that.


* Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-10 21:53               ` Al Viro
@ 2019-09-10 22:17                 ` Al Viro
  2019-09-14 16:16                 ` [PATCH] " Al Viro
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-10 22:17 UTC (permalink / raw)
  To: zhengbin (A)
  Cc: jack, akpm, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, Linus Torvalds

On Tue, Sep 10, 2019 at 10:53:57PM +0100, Al Viro wrote:

> * we might need to grab dentry reference around dir_emit() in dcache_readdir().
> As it is, devpts makes it very easy to fuck that one up.

FWIW, that goes back to commit 8ead9dd54716 (devpts: more pty driver interface
cleanups) three years ago.  Rule of thumb: whenever you write "no actual
semantic changes" in commit message, you are summoning Murphy...

> [...]


* [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-10 21:53               ` Al Viro
  2019-09-10 22:17                 ` Al Viro
@ 2019-09-14 16:16                 ` Al Viro
  2019-09-14 16:49                   ` Linus Torvalds
       [not found]                   ` <20190916020434.tutzwipgs4f6o3di@inn2.lkp.intel.com>
  1 sibling, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-09-14 16:16 UTC (permalink / raw)
  To: zhengbin (A)
  Cc: jack, akpm, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, Linus Torvalds

	OK, folks, could you try the following?  It survives the local beating
so far.

	* replacement of next_positive() with different calling conventions:
it returns struct list_head * instead of struct dentry *; the latter is
passed in and out by reference, grabbing the result and dropping the original
value.
	* scan is under ->d_lock.  If we run out of timeslice, cursor is moved
after the last position we'd reached and we reschedule; then the scan continues
from that place.  To avoid livelocks between multiple lseek() (with cursors
getting moved past each other, never reaching the real entries) we always
skip the cursors, need_resched() or not.
	* returned list_head * is either ->d_child of dentry we'd found or
->d_subdirs of parent (if we got to the end of the list).
	* dcache_readdir() and dcache_dir_lseek() switched to new helper.
dcache_readdir() always holds a reference to dentry passed to dir_emit() now.
Cursor is moved to just before the entry where dir_emit() has failed or into
the very end of the list, if we'd run out.
	* move_cursor() eliminated - it had sucky calling conventions and
after fixing that it became simply list_move() (in lseek and scan_positives)
or list_move_tail() (in readdir).

	All operations with the list are under ->d_lock now, and we do not
depend upon having all file removals done with parent locked exclusive
anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/libfs.c b/fs/libfs.c
index c9b2850c0f7c..e0b262e5fb34 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -89,58 +89,42 @@ int dcache_dir_close(struct inode *inode, struct file *file)
 EXPORT_SYMBOL(dcache_dir_close);
 
 /* parent is locked at least shared */
-static struct dentry *next_positive(struct dentry *parent,
-				    struct list_head *from,
-				    int count)
+/*
+ * Returns an element of siblings' list.
+ * We are looking for <count>th positive after <p>; if
+ * found, dentry is grabbed and passed to caller via *<res>.
+ * If no such element exists, the anchor of list is returned
+ * and *<res> is set to NULL.
+ */
+static struct list_head *scan_positives(struct dentry *cursor,
+					struct list_head *p,
+					loff_t count,
+					struct dentry **res)
 {
-	unsigned *seq = &parent->d_inode->i_dir_seq, n;
-	struct dentry *res;
-	struct list_head *p;
-	bool skipped;
-	int i;
+	struct dentry *dentry = cursor->d_parent, *found = NULL;
 
-retry:
-	i = count;
-	skipped = false;
-	n = smp_load_acquire(seq) & ~1;
-	res = NULL;
-	rcu_read_lock();
-	for (p = from->next; p != &parent->d_subdirs; p = p->next) {
+	spin_lock(&dentry->d_lock);
+	while ((p = p->next) != &dentry->d_subdirs) {
 		struct dentry *d = list_entry(p, struct dentry, d_child);
-		if (!simple_positive(d)) {
-			skipped = true;
-		} else if (!--i) {
-			res = d;
+		// we must at least skip cursors, to avoid livelocks
+		if (d->d_flags & DCACHE_DENTRY_CURSOR)
+			continue;
+		if (simple_positive(d) && !--count) {
+			found = dget(d);
 			break;
 		}
+		if (need_resched()) {
+			list_move(&cursor->d_child, p);
+			p = &cursor->d_child;
+			spin_unlock(&dentry->d_lock);
+			cond_resched();
+			spin_lock(&dentry->d_lock);
+		}
 	}
-	rcu_read_unlock();
-	if (skipped) {
-		smp_rmb();
-		if (unlikely(*seq != n))
-			goto retry;
-	}
-	return res;
-}
-
-static void move_cursor(struct dentry *cursor, struct list_head *after)
-{
-	struct dentry *parent = cursor->d_parent;
-	unsigned n, *seq = &parent->d_inode->i_dir_seq;
-	spin_lock(&parent->d_lock);
-	for (;;) {
-		n = *seq;
-		if (!(n & 1) && cmpxchg(seq, n, n + 1) == n)
-			break;
-		cpu_relax();
-	}
-	__list_del(cursor->d_child.prev, cursor->d_child.next);
-	if (after)
-		list_add(&cursor->d_child, after);
-	else
-		list_add_tail(&cursor->d_child, &parent->d_subdirs);
-	smp_store_release(seq, n + 2);
-	spin_unlock(&parent->d_lock);
+	spin_unlock(&dentry->d_lock);
+	dput(*res);
+	*res = found;
+	return p;
 }
 
 loff_t dcache_dir_lseek(struct file *file, loff_t offset, int whence)
@@ -158,17 +142,28 @@ loff_t dcache_dir_lseek(struct file *file, loff_t offset, int whence)
 			return -EINVAL;
 	}
 	if (offset != file->f_pos) {
+		struct dentry *cursor = file->private_data;
+		struct dentry *to = NULL;
+		struct list_head *p;
+
 		file->f_pos = offset;
-		if (file->f_pos >= 2) {
-			struct dentry *cursor = file->private_data;
-			struct dentry *to;
-			loff_t n = file->f_pos - 2;
-
-			inode_lock_shared(dentry->d_inode);
-			to = next_positive(dentry, &dentry->d_subdirs, n);
-			move_cursor(cursor, to ? &to->d_child : NULL);
-			inode_unlock_shared(dentry->d_inode);
+		inode_lock_shared(dentry->d_inode);
+
+		if (file->f_pos > 2) {
+			p = scan_positives(cursor, &dentry->d_subdirs,
+					   file->f_pos - 2, &to);
+			spin_lock(&dentry->d_lock);
+			list_move(&cursor->d_child, p);
+			spin_unlock(&dentry->d_lock);
+		} else {
+			spin_lock(&dentry->d_lock);
+			list_del_init(&cursor->d_child);
+			spin_unlock(&dentry->d_lock);
 		}
+
+		dput(to);
+
+		inode_unlock_shared(dentry->d_inode);
 	}
 	return offset;
 }
@@ -190,25 +185,29 @@ int dcache_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file->f_path.dentry;
 	struct dentry *cursor = file->private_data;
-	struct list_head *p = &cursor->d_child;
-	struct dentry *next;
-	bool moved = false;
+	struct list_head *anchor = &dentry->d_subdirs;
+	struct dentry *next = NULL;
+	struct list_head *p;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
 	if (ctx->pos == 2)
-		p = &dentry->d_subdirs;
-	while ((next = next_positive(dentry, p, 1)) != NULL) {
+		p = anchor;
+	else
+		p = &cursor->d_child;
+
+	while ((p = scan_positives(cursor, p, 1, &next)) != anchor) {
 		if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
 			      d_inode(next)->i_ino, dt_type(d_inode(next))))
 			break;
-		moved = true;
-		p = &next->d_child;
 		ctx->pos++;
 	}
-	if (moved)
-		move_cursor(cursor, p);
+	spin_lock(&dentry->d_lock);
+	list_move_tail(&cursor->d_child, p);
+	spin_unlock(&dentry->d_lock);
+	dput(next);
+
 	return 0;
 }
 EXPORT_SYMBOL(dcache_readdir);

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 16:16                 ` [PATCH] " Al Viro
@ 2019-09-14 16:49                   ` Linus Torvalds
  2019-09-14 17:01                     ` Al Viro
  2019-09-22 21:29                     ` Al Viro
       [not found]                   ` <20190916020434.tutzwipgs4f6o3di@inn2.lkp.intel.com>
  1 sibling, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2019-09-14 16:49 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 9:16 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         OK, folks, could you try the following?  It survives the local beating
> so far.

This looks like the right solution to me. Keep the locking simple,
take the dentry refcount as long as we keep a ref to it in "*res".

However, the one thing that strikes me is that it looks to me like
this means that the "cursor" and the dentry in "*res" are basically
synonymous. Could we drop the cursor entirely, and just keep the ref
to the last dentry we showed _as_ the cursor?

Yes, this would mean that we'd keep a ref to the dentry across
readdir() calls, and maybe that's a horrible idea. But is that all
that different from keeping the ref to the dentry that is the
directory itself?

              Linus

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 16:49                   ` Linus Torvalds
@ 2019-09-14 17:01                     ` Al Viro
  2019-09-14 17:15                       ` Linus Torvalds
  2019-09-22 21:29                     ` Al Viro
  1 sibling, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-14 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 09:49:21AM -0700, Linus Torvalds wrote:
> On Sat, Sep 14, 2019 at 9:16 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         OK, folks, could you try the following?  It survives the local beating
> > so far.
> 
> This looks like the right solution to me. Keep the locking simple,
> take the dentry refcount as long as we keep a ref to it in "*res".
> 
> However, the one thing that strikes me is that it looks to me like
> this means that the "cursor" and the dentry in "*res" are basically
> synonymous. Could we drop the cursor entirely, and just keep the ref
> to the last dentry we showed _as_ the cursor?

I thought of that, but AFAICS rename(2) is a fatal problem for such
a scheme.

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 17:01                     ` Al Viro
@ 2019-09-14 17:15                       ` Linus Torvalds
  2019-09-14 20:04                         ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2019-09-14 17:15 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 10:01 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I thought of that, but AFAICS rename(2) is a fatal problem for such
> a scheme.

Duh. You're obviously correct, and I didn't think it through.

             Linus

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 17:15                       ` Linus Torvalds
@ 2019-09-14 20:04                         ` Al Viro
  2019-09-14 22:57                           ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-14 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 10:15:22AM -0700, Linus Torvalds wrote:
> On Sat, Sep 14, 2019 at 10:01 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > I thought of that, but AFAICS rename(2) is a fatal problem for such
> > a scheme.
> 
> Duh. You're obviously correct, and I didn't think it through.

Getting the cursors out of the list is obviously tempting; if we could make
simple_rename() take care of those it might be worth doing.  I played with
having them hashed by middle bits of target dentry address (we have enough
fields unused for cursor dentries), but the problem is that we'd need to
scan the full hash chain on simple_rename() for that to work.  Or keep the
chains ordered, but that's even more painful - moving cursors would become
costly.  That approach needs something to represent the following
data and primitives:
	non-intersecting sets Cursors(dentry) (empty for most dentries)
	by given cursor find dentry such that cursor \in Cursors(dentry)
	remove given cursor from the set it belongs to
	add given cursor to Cursors(dentry)
	move everything from Cursors(dentry1) into Cursors(dentry2)

An obvious approach would be to have them sit in the lists hanging off
dentries, with pointer to dentry in the cursor itself.  It's not hard
to do, with "move" costing O(#Cursors(dentry1)) and everything else
being O(1), but I hate adding a pointer to each struct dentry, when
it's completely useless for most of the filesystems *and* NULL for
most dentries on the dcache_readdir()-using ones.

We could try to be smart with ->d_fsdata, but at least autofs and debugfs
are already using that ;-/  Hell knows - in any case, that's far too
intrusive by that point in the cycle, so I decided to leave any work
in that direction for later.  I might be missing something obvious, though,
so any suggestions would be welcome.

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 20:04                         ` Al Viro
@ 2019-09-14 22:57                           ` Linus Torvalds
  2019-09-15  0:50                             ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2019-09-14 22:57 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 1:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> An obvious approach would be to have them sit in the lists hanging off
> dentries, with pointer to dentry in the cursor itself.  It's not hard
> to do, with "move" costing O(#Cursors(dentry1)) and everything else
> being O(1), but I hate adding a pointer to each struct dentry, when
> it's completely useless for most of the filesystems *and* NULL for
> most dentries on the dcache_readdir()-using ones.

Yeah, no, I think we should just do the straightforward proper locking
for now, and see if anything even cares.

Last time I think it was a microbenchmark that showed a regression,
not necessarily even a real load (reaim), and there wasn't a _ton_ of
debugging on exactly what triggered the higher system time. It was
also on a fairly unusual system that would show the lock contention
much more than 99+% of anything else.

So I suspect we might have other avenues of improvement than just the
cursor thing. Yes, it would be absolutely lovely to not have the
cursor, and avoid the resulting growth of the dentry child list, which
then makes everything else much more expensive.

But it is also possible that we could avoid some of that O(n**2)
behavior by simply not adding the cursor to the end of the dentry
child list at all. Right now your patch *always* sets the cursor at a
valid point - even if it's at the end of the directory. But we could
skip the "end of the directory" case entirely and just set a flag in
the file for "at eof" instead.

That way the cursor at least wouldn't exist for the common cases when
we return to user space (at the beginning of the readdir and at the
end). Which might make the cursors simply not be quite as common, even
when you have a _lot_ of concurrent readdir() users.

There may be other similar things we could do to minimize the pressure
on the parent dentry lock. For example, maybe we could insert a cursor
only _between_ readdir() calls, but then in the readdir() loop itself,
we know we hold the inode lock shared for the directory, so during
_that_ loop we can use the existing positive dentries that we keep a
refcount to as cursors.

Because if the only reason to not use existing dentry pointers as
cursors is the concurrent rename() issue, then _that_ is certainly
something that the parent inode shared lock should protect against,
even if the child dentry list can change in other ways.

So if the main 'reaim' problem was that the dentry child list itself
grew due to the cursor pressure (and that is likely) and that in turn
then caused the O(n**2) bad locking behavior due to having to walk
much longer dentry child chains, then I suspect that we can do a few
tweaks to simply not make that happen in practice.

Yes, I realize that I'm handwaving a bit on the two above suggestions,
but don't you think that sounds doable?

            Linus

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 22:57                           ` Linus Torvalds
@ 2019-09-15  0:50                             ` Al Viro
  2019-09-15  1:41                               ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-15  0:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 03:57:15PM -0700, Linus Torvalds wrote:

> But it is also possible that we could avoid some of that O(n**2)
> behavior by simply not adding the cursor to the end of the dentry
> child list at all. Right now your patch *always* sets the cursor at a
> valid point - even if it's at the end of the directory. But we could
> skip the "end of the directory" case entirely and just set a flag in
> the file for "at eof" instead.

Yeah.

> That way the cursor at least wouldn't exist for the common cases when
> we return to user space (at the beginning of the readdir and at the
> end). Which might make the cursors simply not be quite as common, even
> when you have a _lot_ of concurrent readdir() users.
> 
> There may be other similar things we could do to minimize the pressure
> on the parent dentry lock. For example, maybe we could insert a cursor
> only _between_ readdir() calls, but then in the readdir() loop itself,
> we know we hold the inode lock shared for the directory, so during
> _that_ loop we can use the existing positive dentries that we keep a
> refcount to as cursors.
> 
> Because if the only reason to not use existing dentry pointers as
> cursors is the concurrent rename() issue, then _that_ is certainly
> something that the parent inode shared lock should protect against,
> even if the child dentry list can change in other ways.
> 
> So if the main 'reaim' problem was that the dentry child list itself
> grew due to the cursor pressure (and that is likely) and that in turn
> then caused the O(n**2) bad locking behavior due to having to walk
> much longer dentry child chains, then I suspect that we can do a few
> tweaks to simply not make that happen in practice.
> 
> Yes, I realize that I'm handwaving a bit on the two above suggestions,
> but don't you think that sounds doable?

I think I have a stronger solution.  It includes the "cursors past
the EOF are marked", but that's not all.

Look: the obvious problem with adding an hlist of cursors anchored in
dentry is that we don't want to blow dentry size.  OK, but... we have
d_subdirs/d_child.  And the same crawl through the tree has shown
exactly one place where we do anything to the end of the list - right
in dcache_readdir().  So we could bloody well turn that thing into
hlist.  Voila - a pointer saved.

So, fuck using struct dentry for cursors.  What we need in there
is
	* pointer to struct dentry we would be about to read
	* hlist_node linking them together
	* indicator of pointing before / after the entire directory
contents.
	* some way to get to dentry of directory and/or struct file.
Which can be done in much smaller space than struct dentry; details
of representation really don't matter.
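
(For concreteness, a minimal sketch of such a cursor object, with the
fields listed above - the names are made up for illustration, nothing
like this was actually posted:)

struct dir_cursor {
	struct dentry	  *target;	/* dentry we are about to read;
					 * NULL when pointing before the
					 * start or past the EOF */
	struct hlist_node node;		/* link in target->d_cursors */
	struct file	  *file;	/* way back to the struct file and,
					 * via f_path.dentry, to the
					 * directory itself */
};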

d_subdirs/d_child become an hlist_head/hlist_node list; no cursors
in there at any time.

d_cursors is a new hlist_head, anchoring the set of cursors that
point to this sucker.  The list is protected by ->d_lock of
that dentry.

d_cursors being non-empty contributes 1 to d_count.

dcache_readdir()/dcache_dir_lseek() have exclusion with simple_rename()
on parent's inode.  On a per-cursor basis they have exclusion with
each other (already; ->f_pos_lock).

dcache_dir_close() locks directory shared, giving it exclusion with
simple_rename(); per-cursor exclusion is, of course, already there.
Under that rwsem it removes the cursor from the hlist it's on (if
any), using ->d_lock of whatever it's pointing at for protection.
If that was the last cursor in the hlist, drop the target after removal.
In any case, cursor gets freed.

In simple_rename():
	if there are cursors
		grab the next sibling (if any)
		take ->d_lock on source
		rip the list out of it
		drop ->d_lock on source
		if there's no sibling
			go through the list
				point cursors post-EOF, dissolving hlist
		else
			go through the list
				point cursors to the sibling
				find the last element of the list, while we are at it
			if the sibling's list hadn't been empty to start with
				drop our reference to sibling
			take ->d_lock on the sibling
			splice the list in
			drop ->d_lock on sibling
		drop the reference to source (from now-empty ->d_cursors)
Note that it's O(1) under any ->d_lock and O(cursors moved) without
holding any ->d_lock.

walk_cursor(cursor, count):
	while count && cursor is not post-EOF
		drop_old = NULL
		if cursor points to anything
			take ->d_lock on the target
			remove cursor from hlist
			if target's ->d_cursors is now empty
				drop_old = target
			drop ->d_lock on the target
			p = target's ->d_child
		else
			p = parent's ->d_subdirs
		take ->d_lock on directory
		drop_new = NULL
		eof = true
		for d in list, starting after p
			if d is positive && !--count || need_resched
				eof = false
				drop_new = dget(d)
				point cursor to d
				take ->d_lock on d
				if its ->d_cursors is empty
					drop_new = NULL
				insert cursor into its ->d_cursors
				drop ->d_lock on d
				break
		drop ->d_lock on directory
		dput(drop_new)
		dput(drop_old)
		if eof
			mark cursor post-EOF
		else if count
			cond_resched()

dcache_dir_lseek(), "changing position" case:
	walk_cursor(cursor, where)

dcache_readdir() loop:
	while cursor not post-EOF
		if dir_emit the cursor's target fails
			break
		walk_cursor(cursor, 1)
		ctx->pos++

dentry_unlist(): none of the cursor shit in the list, TYVM, and we can't
be called with cursors attached - d_count would've been positive.


What it should, AFAICS, give:
	* no loops under ->d_lock in dentry_unlist()
	* no cursors polluting the lists
	* dentry size unchanged
	* cursors considerably smaller than now
	* no looping for hell knows how long under ->d_lock
And it might be possible to be clever enough to get lockless walk_cursor()
(lockless in terms of directory ->d_lock, that is), but even without that
this scheme looks interesting, IMO.

I haven't even started trying to implement that, so I might very well have
missed obvious problems.  Comments?

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-15  0:50                             ` Al Viro
@ 2019-09-15  1:41                               ` Linus Torvalds
  2019-09-15 16:02                                 ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2019-09-15  1:41 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 5:51 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> d_subdirs/d_child become an hlist_head/hlist_node list; no cursors
> in there at any time.

Hmm. I like this.

I wonder if we could do that change independently first, and actually
shrink the dentry (or, more likely, just make the inline name longer).

I don't think that dcache_readdir() is actually stopping us from doing
that right now. Yes, we do that

    list_add_tail(&cursor->d_child, &parent->d_subdirs);

for the end case, but as mentioned, we could replace that with an EOF
flag, couldn't we?

Btw, if you do this change, we should take the opportunity to rename
those confusingly named things. "d_subdirs" are our children - which
aren't necessarily directories, and "d_child" are the nodes in there.

Your change would make that clearer wrt typing (good), but we could
make the naming clearer too (better).

So maybe rename "d_subdirs -> d_children" and "d_child -> d_sibling"
or something at the same time?

Wouldn't that clarify usage, particularly together with the
hlist_head/hlist_node typing?

Most of the users would have to change due to the type change anyway,
so changing the names shouldn't make the diff any worse, and might
make the diff easier to generate (simply because you can *grep* for
the places that need changing).

I wonder why we have that naming to begin with, but it's so old that I
can't remember the reason for that confusing naming. If there ever was
any, outside of "bad thinking".

> d_cursors is a new hlist_head, anchoring the set of cursors that
> point to this sucker.  The list is protected by ->d_lock of
> that dentry.
>
> d_cursors being non-empty contributes 1 to d_count.

My first reaction is that this sounds clever, but in a good way (ie not
"too subtle" clever, but simply a clever way to avoid bloating the
dentry).

But I'd like to see the patch (and hear that it works) before I'd say
that it's the right thing to do. Maybe there's some reason why the
above would be problematic.

            Linus

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-15  1:41                               ` Linus Torvalds
@ 2019-09-15 16:02                                 ` Al Viro
  2019-09-15 17:58                                   ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-15 16:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 06:41:41PM -0700, Linus Torvalds wrote:
> On Sat, Sep 14, 2019 at 5:51 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > d_subdirs/d_child become an hlist_head/hlist_node list; no cursors
> > in there at any time.
> 
> Hmm. I like this.
> 
> I wonder if we could do that change independently first, and actually
> shrink the dentry (or, more likely, just make the inline name longer).
> 
> I don't think that dcache_readdir() is actually stopping us from doing
> that right now. Yes, we do that
> 
>     list_add_tail(&cursor->d_child, &parent->d_subdirs);
> 
> for the end case, but as mentioned, we could replace that with an EOF
> flag, couldn't we?

Could be done, AFAICS.  I'm not even sure we need a flag per se - we
have two cases when the damn thing is not in the list and "before
everything" case doesn't really need to be distinguished from post-EOF
one.  dcache_dir_lseek() doesn't care where the cursor had been -
it goes by ->f_pos and recalculates the position from scratch.  And
dcache_readdir() is doing
        if (!dir_emit_dots(file, ctx))
                return 0;

        if (ctx->pos == 2)
                p = anchor;
        else
                p = &cursor->d_child;
IOW, if we used to be pre-list, we'll try to spit . and .. out and
either bugger off, or get ctx->pos == 2.  I.e. we only look at the
cursor's position in the list for the ctx->pos > 2 case.

So for the variant that has cursors still represented by dentries we can
replace "post-EOF" with "not in hlist" and be done with that.  For
"cursors are separate data structures" variant... I think pretty much
the same applies (i.e. "not refering to any dentry" both for post-EOF
and before-everything cases, with readdir and lseek logics taking care
of small-offset case by ->f_pos), but I'll need to try and see how
well does that work.

> Btw, if you do this change, we should take the opportunity to rename
> those confusingly named things. "d_subdirs" are our children - which
> aren't necessarily directories, and "d_child" are the nodes in there.
> 
> Your change would make that clearer wrt typing (good), but we could
> make the naming clearer too (better).
> 
> So maybe rename "d_subdirs -> d_children" and "d_child -> d_sibling"
> or something at the same time?
> 
> Wouldn't that clarify usage, particularly together with the
> hlist_head/hlist_node typing?
>
> Most of the users would have to change due to the type change anyway,
> so changing the names shouldn't make the diff any worse, and might
> make the diff easier to generate (simply because you can *grep* for
> the places that need changing).

Makes sense...

> I wonder why we have that naming to begin with, but it's so old that I
> can't remember the reason for that confusing naming. If there ever was
> any, outside of "bad thinking".

->d_subdirs/->d_child introduction was what, 2.1.63?  November 1997...
I'd been otherwise occupied, to put it mildly (first semester in
PennState grad program, complete with the move from spb.ru to US in
August).  I don't think I'd been following any lists at the time,
sorry.  And the only thing google finds is
http://lkml.iu.edu/hypermail/linux/kernel/9711.0/0250.html
with nothing public prior to that.  What has happened to Bill Hawes, BTW?

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-15 16:02                                 ` Al Viro
@ 2019-09-15 17:58                                   ` Linus Torvalds
  2019-09-21 14:07                                     ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2019-09-15 17:58 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sun, Sep 15, 2019 at 9:02 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Could be done, AFAICS.  I'm not even sure we need a flag per se - we
> have two cases when the damn thing is not in the list and "before
> everything" case doesn't really need to be distinguished from post-EOF
> one.

Agreed, it looks like we could just look at f_pos and use that
(together with whether we have a cursor or not) as the flag:

 - no cursor: f_pos < 2 means beginning, otherwise EOF

 - otherwise: cursor points to position
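
(As a rough sketch of that test - purely illustrative, the helper name
is made up here and "no cursor" is read as "cursor not linked into the
sibling list":)

static bool dcache_dir_at_eof(struct file *file)
{
	struct dentry *cursor = file->private_data;

	/* detached cursor: f_pos < 2 means we are at the very start,
	 * anything else means we already ran off the end */
	return list_empty(&cursor->d_child) && file->f_pos >= 2;
}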

> > I wonder why we have that naming to begin with, but it's so old that I
> > can't remember the reason for that confusing naming. If there ever was
> > any, outside of "bad thinking".
>
> ->d_subdirs/->d_child introduction was what, 2.1.63?  November 1997...

Heh. Your google-fu was better than mine.

> http://lkml.iu.edu/hypermail/linux/kernel/9711.0/0250.html
> with nothing public prior to that.  What has happened to Bill Hawes, BTW?

I think the original submission predates that by some time.

Afaik, the original dentry patches were for a PhD thesis or something
like that, and in its original form it was not used for caching and
lookup, but to generate filenames for logging.

.. and that may in fact be why it had the list of children being
called "d_subdirs".

Because the dentry patches originally were about tracking the changes
to the directory structure, and so only tracking subdirectories was
interesting.

As to Bill Hawes: "Now there's a name I've not heard in a long, long
time. A long time."

I don't find anything after 98.

                Linus

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
       [not found]                   ` <20190916020434.tutzwipgs4f6o3di@inn2.lkp.intel.com>
@ 2019-09-16  2:58                     ` Al Viro
  2019-09-16  3:03                       ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-16  2:58 UTC (permalink / raw)
  To: kernel test robot
  Cc: zhengbin (A), jack, akpm, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, Linus Torvalds, lkp

On Mon, Sep 16, 2019 at 10:04:34AM +0800, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 266a9a8b41803281e192151ae99779a7d50fc391 ("[PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel")
> url: https://github.com/0day-ci/linux/commits/Al-Viro/Re-Possible-FS-race-condition-between-iterate_dir-and-d_alloc_parallel/20190915-052109
> 
> 
> in testcase: rcutorture
> with following parameters:
> 
> 	runtime: 300s
> 	test: default
> 	torture_type: srcu
> 
> test-description: rcutorture is rcutorture kernel module load/unload test.
> test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

False positive; dget() on child while holding ->d_lock on parent is OK.
We could turn that into explicit spin_lock_nested() on child + increment of
->d_count under that, but this is a pointless pessimization.  Not sure
what's the best way to tell lockdep to STFU here, but in this case it
really ought to - locking order is correct.

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
  2019-09-16  2:58                     ` 266a9a8b41: WARNING:possible_recursive_locking_detected Al Viro
@ 2019-09-16  3:03                       ` Al Viro
  2019-09-16  3:44                         ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-16  3:03 UTC (permalink / raw)
  To: kernel test robot
  Cc: zhengbin (A), jack, akpm, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, Linus Torvalds, lkp

On Mon, Sep 16, 2019 at 03:58:27AM +0100, Al Viro wrote:
> On Mon, Sep 16, 2019 at 10:04:34AM +0800, kernel test robot wrote:
> > FYI, we noticed the following commit (built with gcc-7):
> > 
> > commit: 266a9a8b41803281e192151ae99779a7d50fc391 ("[PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel")
> > url: https://github.com/0day-ci/linux/commits/Al-Viro/Re-Possible-FS-race-condition-between-iterate_dir-and-d_alloc_parallel/20190915-052109
> > 
> > 
> > in testcase: rcutorture
> > with following parameters:
> > 
> > 	runtime: 300s
> > 	test: default
> > 	torture_type: srcu
> > 
> > test-description: rcutorture is rcutorture kernel module load/unload test.
> > test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
> > 
> > 
> > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
> > 
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> False positive; dget() on child while holding ->d_lock on parent is OK.
> We could turn that into explicit spin_lock_nested() on child + increment of
> ->d_count under that, but this is a pointless pessimization.  Not sure
> what's the best way to tell lockdep to STFU here, but in this case it
> really ought to - locking order is correct.

Perhaps lockref_get_nested(struct lockref *lockref, unsigned int subclass)?
With s/spin_lock/spin_lock_nested/ in the body...
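
(Roughly like this, presumably - a sketch only; it ignores lockref's
cmpxchg fast path, which the real helper would keep, and the
CONFIG_DEBUG_LOCK_ALLOC=n case collapsing into plain lockref_get():)

void lockref_get_nested(struct lockref *lockref, unsigned int subclass)
{
	/* annotate the lock as a different lockdep subclass, so taking
	 * it while holding the parent's ->d_lock doesn't warn */
	spin_lock_nested(&lockref->lock, subclass);
	lockref->count++;
	spin_unlock(&lockref->lock);
}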

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
  2019-09-16  3:03                       ` Al Viro
@ 2019-09-16  3:44                         ` Linus Torvalds
  2019-09-16 17:16                           ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2019-09-16  3:44 UTC (permalink / raw)
  To: Al Viro
  Cc: kernel test robot, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, LKP

On Sun, Sep 15, 2019 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Perhaps lockref_get_nested(struct lockref *lockref, unsigned int subclass)?
> With s/spin_lock/spin_lock_nested/ in the body...

Sure. Under the usual CONFIG_DEBUG_LOCK_ALLOC, with the non-debug case
just turning into a regular lockref_get().

Sounds fine to me.

             Linus

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
  2019-09-16  3:44                         ` Linus Torvalds
@ 2019-09-16 17:16                           ` Al Viro
  2019-09-16 17:29                             ` Al Viro
       [not found]                             ` <bd707e64-9650-e9ed-a820-e2cabd02eaf8@huawei.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-09-16 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, LKP

On Sun, Sep 15, 2019 at 08:44:05PM -0700, Linus Torvalds wrote:
> On Sun, Sep 15, 2019 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Perhaps lockref_get_nested(struct lockref *lockref, unsigned int subclass)?
> > With s/spin_lock/spin_lock_nested/ in the body...
> 
> Sure. Under the usual CONFIG_DEBUG_LOCK_ALLOC, with the non-debug case
> just turning into a regular lockref_get().
> 
> Sounds fine to me.

Done and force-pushed into vfs.git#fixes

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
  2019-09-16 17:16                           ` Al Viro
@ 2019-09-16 17:29                             ` Al Viro
       [not found]                             ` <bd707e64-9650-e9ed-a820-e2cabd02eaf8@huawei.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-16 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, LKP

On Mon, Sep 16, 2019 at 06:16:06PM +0100, Al Viro wrote:
> On Sun, Sep 15, 2019 at 08:44:05PM -0700, Linus Torvalds wrote:
> > On Sun, Sep 15, 2019 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Perhaps lockref_get_nested(struct lockref *lockref, unsigned int subclass)?
> > > With s/spin_lock/spin_lock_nested/ in the body...
> > 
> > Sure. Under the usual CONFIG_DEBUG_LOCK_ALLOC, with the non-debug case
> > just turning into a regular lockref_get().
> > 
> > Sounds fine to me.
> 
> Done and force-pushed into vfs.git#fixes

BTW, folks, I would really appreciate more people _testing_ that thing;
it survives the local beating, but...

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
       [not found]                             ` <bd707e64-9650-e9ed-a820-e2cabd02eaf8@huawei.com>
@ 2019-09-17 12:01                               ` Al Viro
  2019-09-19  3:36                                 ` zhengbin (A)
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-17 12:01 UTC (permalink / raw)
  To: zhengbin (A)
  Cc: Linus Torvalds, kernel test robot, Jan Kara, Andrew Morton,
	linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, LKP

On Tue, Sep 17, 2019 at 03:03:33PM +0800, zhengbin (A) wrote:
> 
> On 2019/9/17 1:16, Al Viro wrote:
> > On Sun, Sep 15, 2019 at 08:44:05PM -0700, Linus Torvalds wrote:
> >> On Sun, Sep 15, 2019 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>> Perhaps lockref_get_nested(struct lockref *lockref, unsigned int subclass)?
> >>> With s/spin_lock/spin_lock_nested/ in the body...
> >> Sure. Under the usual CONFIG_DEBUG_LOCK_ALLOC, with the non-debug case
> >> just turning into a regular lockref_get().
> >>
> >> Sounds fine to me.
> > Done and force-pushed into vfs.git#fixes
> +	if (file->f_pos > 2) {
> +		p = scan_positives(cursor, &dentry->d_subdirs,
> +				   file->f_pos - 2, &to);
> +		spin_lock(&dentry->d_lock);
> +		list_move(&cursor->d_child, p);
> +		spin_unlock(&dentry->d_lock);
> +	} else {
> +		spin_lock(&dentry->d_lock);
> +		list_del_init(&cursor->d_child);
> +		spin_unlock(&dentry->d_lock);
> 	}
> +
> +	dput(to);
> dput(to) should be inside the if (file->f_pos > 2) branch, since we dget(to) in scan_positives

dput(NULL) is a no-op

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
  2019-09-17 12:01                               ` Al Viro
@ 2019-09-19  3:36                                 ` zhengbin (A)
  2019-09-19  3:55                                   ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: zhengbin (A) @ 2019-09-19  3:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, kernel test robot, Jan Kara, Andrew Morton,
	linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, LKP


On 2019/9/17 20:01, Al Viro wrote:
> On Tue, Sep 17, 2019 at 03:03:33PM +0800, zhengbin (A) wrote:
>> On 2019/9/17 1:16, Al Viro wrote:
>>> On Sun, Sep 15, 2019 at 08:44:05PM -0700, Linus Torvalds wrote:
>>>> On Sun, Sep 15, 2019 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>>> Perhaps lockref_get_nested(struct lockref *lockref, unsigned int subclass)?
>>>>> With s/spin_lock/spin_lock_nested/ in the body...
>>>> Sure. Under the usual CONFIG_DEBUG_LOCK_ALLOC, with the non-debug case
>>>> just turning into a regular lockref_get().
>>>>
>>>> Sounds fine to me.
>>> Done and force-pushed into vfs.git#fixes
>> +	if (file->f_pos > 2) {
>> +		p = scan_positives(cursor, &dentry->d_subdirs,
>> +				   file->f_pos - 2, &to);
>> +		spin_lock(&dentry->d_lock);
>> +		list_move(&cursor->d_child, p);
>> +		spin_unlock(&dentry->d_lock);
>> +	} else {
>> +		spin_lock(&dentry->d_lock);
>> +		list_del_init(&cursor->d_child);
>> +		spin_unlock(&dentry->d_lock);
>> 	}
>> +
>> +	dput(to);
>> dput(to) should be inside the if (file->f_pos > 2) branch, since we dget(to) in scan_positives
> dput(NULL) is a no-op

+    spin_unlock(&dentry->d_lock);
+    dput(*res);
+    *res = found;
+    return p;

Should dput(*res) be removed?

* Re: 266a9a8b41: WARNING:possible_recursive_locking_detected
  2019-09-19  3:36                                 ` zhengbin (A)
@ 2019-09-19  3:55                                   ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-19  3:55 UTC (permalink / raw)
  To: zhengbin (A)
  Cc: Linus Torvalds, kernel test robot, Jan Kara, Andrew Morton,
	linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, LKP

On Thu, Sep 19, 2019 at 11:36:28AM +0800, zhengbin (A) wrote:

> >> + dput(to);
> >> dput(to) should be inside the if (file->f_pos > 2) branch, since we dget(to) in scan_positives
> > dput(NULL) is a no-op
> 
> +    spin_unlock(&dentry->d_lock);
> +    dput(*res);
> +    *res = found;
> +    return p;
> 
> Should dput(*res) be removed?

Huh?  Why would it?  We drop the original reference and replace it with the
new one; what's the problem?

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-15 17:58                                   ` Linus Torvalds
@ 2019-09-21 14:07                                     ` Al Viro
  2019-09-21 16:21                                       ` Linus Torvalds
                                                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Al Viro @ 2019-09-21 14:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sun, Sep 15, 2019 at 10:58:50AM -0700, Linus Torvalds wrote:
> On Sun, Sep 15, 2019 at 9:02 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Could be done, AFAICS.  I'm not even sure we need a flag per se - we
> > have two cases when the damn thing is not in the list and "before
> > everything" case doesn't really need to be distinguished from post-EOF
> > one.
> 
> Agreed, it looks like we could just look at f_pos and use that
> (together with whether we have a cursor or not) as the flag:
> 
>  - no cursor: f_pos < 2 means beginning, otherwise EOF
> 
>  - otherwise: cursor points to position

FWIW, #next.dcache has the straight conversion to hlist.  It definitely
wants at least nfsd, er... misconception dealt with, though: list_head
or hlist, this
static void nfsdfs_remove_files(struct dentry *root)
{
        struct dentry *dentry;
        struct hlist_node *n;

        hlist_for_each_entry_safe(dentry, n, &root->d_children, d_sibling) {
                if (!simple_positive(dentry)) {
                        WARN_ON_ONCE(1); /* I think this can't happen? */
                        continue;
                }
                nfsdfs_remove_file(d_inode(root), dentry);
        }
}
is wrong, for obvious reasons (have the victim directory opened before that
gets called and watch the fireworks)...

No "take cursors out of the list" parts yet.

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-21 14:07                                     ` Al Viro
@ 2019-09-21 16:21                                       ` Linus Torvalds
  2019-09-21 17:18                                         ` Al Viro
  2019-09-24  2:52                                       ` Al Viro
  2019-09-25 11:59                                       ` Amir Goldstein
  2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2019-09-21 16:21 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 21, 2019 at 7:07 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, #next.dcache has the straight conversion to hlist.  It definitely
> wants at least nfsd, er... misconception dealt with, though: list_head
> or hlist, this

Well, yeah. But is there really any downside except for the warning?

Looks like the code should just do

                if (!simple_positive(dentry))
                        continue;

and just ignore non-positive dentries - whether cursors or negative
ones (which may not happen, but still).
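
(I.e., presumably just the function quoted above with the warning
dropped - a sketch, not a posted patch:)

static void nfsdfs_remove_files(struct dentry *root)
{
	struct dentry *dentry;
	struct hlist_node *n;

	hlist_for_each_entry_safe(dentry, n, &root->d_children, d_sibling) {
		/* skip cursors and negative dentries alike */
		if (!simple_positive(dentry))
			continue;
		nfsdfs_remove_file(d_inode(root), dentry);
	}
}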

> No "take cursors out of the list" parts yet.

Looking at the commits, that "take it off the list" one seems very
nice on its own. It actually seems to simplify the logic regardless of
the whole "don't need to add it to the end" thing...

Only this:

    if (next)
        list_move_tail(&cursor->d_child, &next->d_child);
    else
        list_del_init(&cursor->d_child);

is a slight complication, and honestly, I think that should just have
its own helper function there ("dcache_update_cursor(cursor, next)" or
something).

That helper function would end up meaning one less change in the hlist
conversion too.
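
(Presumably something like this - the name is the one suggested above
and the body is just the two branches pulled into one place; the
caller would hold the parent's ->d_lock:)

static void dcache_update_cursor(struct dentry *cursor, struct dentry *next)
{
	if (next)
		list_move_tail(&cursor->d_child, &next->d_child);
	else
		list_del_init(&cursor->d_child);
}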

The hlist conversion looks straightforward except for the list_move()
conversions, which I didn't look at closely enough to make sure they are
identical, but the ones I did look at looked sane.

              Linus

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-21 16:21                                       ` Linus Torvalds
@ 2019-09-21 17:18                                         ` Al Viro
  2019-09-21 17:38                                           ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-21 17:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 21, 2019 at 09:21:46AM -0700, Linus Torvalds wrote:
> On Sat, Sep 21, 2019 at 7:07 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > FWIW, #next.dcache has the straight conversion to hlist.  It definitely
> > wants at least nfsd, er... misconception dealt with, though: list_head
> > or hlist, this
> 
> Well, yeah. But is there really any downside except for the warning?
> 
> Looks like the code should just do
> 
>                 if (!simple_positive(dentry))
>                         continue;
> 
> and just ignore non-positive dentries - whether cursors or negative
> ones (which may not happen, but still).

FWIW, I really want to do a unified helper for "rm -rf from kernel"
kind of thing.  We have too many places trying to do that and buggering
it up in all kinds of ways.  This is one of those places; I agree
that the first step is to get rid of that WARN_ONCE, and it's the
right thing to do so that the two series stay independent, but it
will need attention afterwards.

> > No "take cursors out of the list" parts yet.
> 
> Looking at the commits, that "take it off the list" one seems very
> nice on its own. It actually seems to simplify the logic regardless of
> the whole "don't need to add it to the end"..
> 
> Only this:
> 
>     if (next)
>         list_move_tail(&cursor->d_child, &next->d_child);
>     else
>         list_del_init(&cursor->d_child);
> 
> is a slight complication, and honestly, I think that should just have
> its own helper function there ("dcache_update_cursor(cursor, next)" or
> something).

I want to see what will fall out of switching cursors to a separate
type - the set of primitives, calling conventions for those, etc.
will become clearer once I have something to tweak.  And I would
rather use calling conventions here that are identical to the final ones...
 
> That helper function would end up meaning one less change in the hlist
> conversion too.
> 
> The hlist conversion looks straightforward except for the list_move()
> conversions that I didn't then look at more to make sure that they are
> identical, but the ones I looked at looked sane.

BTW, how much is the cost of smp_store_release() affected by a recent
smp_store_release() on the same CPU?  IOW, if we have
	smp_store_release(p, v1);
	<some assignments into the same cacheline>
	r = *q;			// different cacheline
	smp_store_release(q, v2);
how much overhead will the second smp_store_release() give?

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-21 17:18                                         ` Al Viro
@ 2019-09-21 17:38                                           ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2019-09-21 17:38 UTC (permalink / raw)
  To: Al Viro
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 21, 2019 at 10:19 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> BTW, how much is the cost of smp_store_release() affected by a recent
> smp_store_release() on the same CPU?

Depends on the architecture.

On x86, smp_store_release() is free - all stores are releases.

(Well, there's the cost of the compiler barrier and the fact that gcc
generates horrid code for volatiles, but that's it).

On modern arm64 it should be fairly cheap. A store-release is just
about the cheapest barrier you can find.

On ppc, it's an lwsync, which is again the cheapest barrier there is,
and is usually just a few cycles. Although on older powerpc I think it
becomes a 'sync' which is fairly expensive.

Other architectures are a mix of the above. It's usually not all that
expensive, but ..

> IOW, if we have
>         smp_store_release(p, v1);
>         <some assignments into the same cacheline>
>         r = *q;                 // different cacheline
>         smp_store_release(q, v2);
> how much overhead will the second smp_store_release() give?

It really should only order the store queue, and make sure that
earlier reads have completed by the time the store queue entry drains.

Which sounds like a big deal, but since you have to be all kinds of
silly to delay reads past later writes (reads are latency-important,
buffered writes are not), that really isn't a performance issue.

Except some architectures are stupid and lack the proper barrier
model, and then it can be a full memory barrier. But honestly, I don't
think we have any architecture where we really care.

               Linus

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-14 16:49                   ` Linus Torvalds
  2019-09-14 17:01                     ` Al Viro
@ 2019-09-22 21:29                     ` Al Viro
  2019-09-23  3:32                       ` zhengbin (A)
  1 sibling, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-22 21:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Sat, Sep 14, 2019 at 09:49:21AM -0700, Linus Torvalds wrote:
> On Sat, Sep 14, 2019 at 9:16 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         OK, folks, could you try the following?  It survives the local beating
> > so far.
> 
> This looks like the right solution to me. Keep the locking simple,
> take the dentry refcount as long as we keep a ref to it in "*res".

*grumble*

Example of subtleties in the whole mess: this is safe for mainline
now, but only because "devpts_pty_kill(): don't bother with d_delete()"
has already been merged.  Without that, we are risking the following fun:

scan_positives() on devpts: finds a dentry, sees it positive, decides
to grab the sucker.  Refcount is currently 1 (will become 2 after
we grab the reference).

devpts_pty_kill(): d_delete(dentry); on that sucker.  Refcount is
currently (still) 1, so we simply make it negative.

scan_positives(): grabs an extra reference to the now-negative dentry.

devpts_pty_kill(): dput() drops refcount to 1 (what if it got there
before scan_positives() grabbed a reference?  Nothing, really, since
scan_positives() is holding parent's ->d_lock; dput() wouldn't
have progressed through dentry_kill() until it managed to get
that, and it would've rechecked the refcount.  So that's not
a problem)

scan_positives(): returns a reference to the negative dentry to
dcache_readdir().  Which proceeds to oops on
                if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
                              d_inode(next)->i_ino, dt_type(d_inode(next))))
since d_inode(next) is NULL.

With the aforementioned commit it *is* safe, since the dentry remains
positive (and unhashed), so we simply act as if dcache_readdir() has
won the race and emitted a record for the sucker killed by devpts_pty_kill().

IOW, backports will either need to bring that commit in as well, or
they'll need to play silly buggers along the lines of

		if (simple_positive(d) && !--count) {
			spin_lock(&d->d_lock);
			if (likely(simple_positive(d)))
				found = dget_dlock(d);
			spin_unlock(&d->d_lock);
			if (found)
				break;
			count = 1;	// it's gone, keep scanning
                }
Probably the latter, since it's less dependent on some other place
doing what devpts used to do...

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-22 21:29                     ` Al Viro
@ 2019-09-23  3:32                       ` zhengbin (A)
  2019-09-23  5:08                         ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: zhengbin (A) @ 2019-09-23  3:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel,
	zhangyi (F),
	renxudong1, Hou Tao

On 2019/9/23 5:29, Al Viro wrote:

> On Sat, Sep 14, 2019 at 09:49:21AM -0700, Linus Torvalds wrote:
>> On Sat, Sep 14, 2019 at 9:16 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>         OK, folks, could you try the following?  It survives the local beating
>>> so far.
>> This looks like the right solution to me. Keep the locking simple,
>> take the dentry refcount as long as we keep a ref to it in "*res".
> *grumble*
>
> Example of subtleties in the whole mess: this is safe for mainline
> now, but only because "devpts_pty_kill(): don't bother with d_delete()"
> has already been merged.  Without that, we are risking the following fun:
>
> scan_positives() on devpts: finds a dentry, sees it positive, decides
> to grab the sucker.  Refcount is currently 1 (will become 2 after
> we grab the reference).
>
> devpts_pty_kill(): d_delete(dentry); on that sucker.  Refcount is
> currently (still) 1, so we simply make it negative.
>
> scan_positives(): grabs an extra reference to the now-negative dentry.
>
> devpts_pty_kill(): dput() drops refcount to 1 (what if it got there
> before scan_positives() grabbed a reference?  Nothing, really, since
> scan_positives() is holding parent's ->d_lock; dput() wouldn't
> have progressed through dentry_kill() until it managed to get
> that, and it would've rechecked the refcount.  So that's not
> a problem)
>
> scan_positives(): returns a reference to the negative dentry to
> dcache_readdir().  Which proceeds to oops on
>                 if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
>                               d_inode(next)->i_ino, dt_type(d_inode(next))))
> since d_inode(next) is NULL.
>
> With the aforementioned commit it *is* safe, since the dentry remains
> positive (and unhashed), so we simply act as if dcache_readdir() has
> won the race and emitted a record for the sucker killed by devpts_pty_kill().
>
> IOW, backports will either need to bring that commit in as well, or
> they'll need to play silly buggers along the lines of
>
> 		if (simple_positive(d) && !--count) {
> 			spin_lock(&d->d_lock);
> 			if (likely(simple_positive(d)))
> 				found = dget_dlock(d);
> 			spin_unlock(&d->d_lock);
> 			if (found)
> 				break;
> 			count = 1;	// it's gone, keep scanning
>                 }
> Probably the latter, since it's less dependent on some other place
> doing what devpts used to do...

Is it possible to trigger an ABBA deadlock? In next_positive, the lock order is
(parent, dentry), while in dput the lock order is (dentry, parent). Because dput
uses spin_trylock on the parent, the ABBA deadlock cannot trigger there:

dput
    fast_dput
        spin_lock(&dentry->d_lock)
    dentry_kill
        spin_trylock(&parent->d_lock)

Is there any other path like dput that takes the locks in (dentry, parent) order
but does not use spin_trylock? I have been looking through the code and so far
have not found one.

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-23  3:32                       ` zhengbin (A)
@ 2019-09-23  5:08                         ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-23  5:08 UTC (permalink / raw)
  To: zhengbin (A)
  Cc: Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel,
	zhangyi (F),
	renxudong1, Hou Tao

On Mon, Sep 23, 2019 at 11:32:43AM +0800, zhengbin (A) wrote:

> 
> Is it possible to trigger an ABBA deadlock? In next_positive, the lock order is
> (parent, dentry), while in dput the lock order is (dentry, parent). Because dput
> uses spin_trylock on the parent, the ABBA deadlock cannot trigger there:
>
> dput
>     fast_dput
>         spin_lock(&dentry->d_lock)
>     dentry_kill
>         spin_trylock(&parent->d_lock)
>
> Is there any other path like dput that takes the locks in (dentry, parent) order
> but does not use spin_trylock? I have been looking through the code and so far
> have not found one.

There should be none.  The order is parent, then child.  And
all changes of the tree topology are serialized by rename_lock.

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-21 14:07                                     ` Al Viro
  2019-09-21 16:21                                       ` Linus Torvalds
@ 2019-09-24  2:52                                       ` Al Viro
  2019-09-24 13:30                                         ` Josef Bacik
       [not found]                                         ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
  2019-09-25 11:59                                       ` Amir Goldstein
  2 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-09-24  2:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

[btrfs and cifs folks Cc'd]

On Sat, Sep 21, 2019 at 03:07:31PM +0100, Al Viro wrote:

> No "take cursors out of the list" parts yet.

Argh...  Things turned interesting.  The tricky part is
where we handle switching cursors away from something
that gets moved.

What I hoped for was "just do it in simple_rename()".  Which is
almost OK; there are 3 problematic cases.  One is shmem -
there we have a special ->rename(), which handles things
like RENAME_EXCHANGE et al.  Fair enough - some of that
might've been moved into simple_rename(), but some (whiteouts)
won't be that easy.  Fair enough - we can make kicking the
cursors out a helper called by simple_rename() and by shmem's
->rename().  The exchange case is going to cause a bit of headache
(the pathological case is when the entries being exchanged are
next to each other in the same directory), but it's not
that bad.
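
Roughly this, say (a sketch with invented names - it assumes the
representation where cursors hang off the dentry they are parked at,
on a hypothetical ->d_cursors hlist):

struct dir_cursor {			/* hypothetical cursor type */
	struct hlist_node node;		/* on ->d_cursors of a sibling */
	struct dentry *at;		/* the entry we are parked at */
};

/* caller holds the parent's inode lock; called before @victim is
 * moved away by rename/exchange */
static void kick_cursors_out(struct dentry *victim)
{
	struct dir_cursor *c;
	struct hlist_node *n;
	struct dentry *next = next_sibling(victim);	/* hypothetical */

	hlist_for_each_entry_safe(c, n, &victim->d_cursors, node) {
		hlist_del(&c->node);
		c->at = next;			/* NULL => past the end */
		if (next)
			hlist_add_head(&c->node, &next->d_cursors);
	}
}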

Two other cases, though, might be serious trouble.  Those are
btrfs new_simple_dir() and this in cifs_root_iget():
        if (rc && tcon->pipe) {
                cifs_dbg(FYI, "ipc connection - fake read inode\n");
                spin_lock(&inode->i_lock);
                inode->i_mode |= S_IFDIR;
                set_nlink(inode, 2);
                inode->i_op = &cifs_ipc_inode_ops;
                inode->i_fop = &simple_dir_operations;
                inode->i_uid = cifs_sb->mnt_uid;
                inode->i_gid = cifs_sb->mnt_gid;
                spin_unlock(&inode->i_lock);
	}
The trouble is, it looks like d_splice_alias() from a lookup elsewhere
might find an alias of some subdirectory in those.  And in that case
we'll end up with a child of those (dcache_readdir-using) directories
being ripped out and moved elsewhere.  With no calls of ->rename() in
sight, of course, *AND* with only a shared lock on the parent.  The
last part is really nasty.  And not just for hanging cursors off the
dentries they point to - it's a problem for dcache_readdir() itself
even in the mainline and with all the lockless crap reverted.

We pass next->d_name.name to dir_emit() (i.e. potentially to
copy_to_user()).  And we have no guarantee that it's not a long
(== separately allocated) name, that will be freed while
copy_to_user() is in progress.  Sure, it'll get an RCU delay
before freeing, but that doesn't help us at all.
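
(For reference, the usual way out of that is to copy the name while it
is stable - a sketch using the existing name-snapshot helpers and the
qstr-based struct name_snapshot; whether that is the right fix here is
a separate question:)

		struct name_snapshot n;
		bool emitted;

		/* next is known positive and pinned at this point */
		take_dentry_name_snapshot(&n, next);	/* stable local
							 * copy, safe across
							 * copy_to_user() */
		emitted = dir_emit(ctx, n.name.name, n.name.len,
				   d_inode(next)->i_ino,
				   dt_type(d_inode(next)));
		release_dentry_name_snapshot(&n);
		if (!emitted)
			break;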

I'm not familiar with those areas in btrfs or cifs; could somebody
explain what's going on there and can we indeed end up finding aliases
to those suckers?


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24  2:52                                       ` Al Viro
@ 2019-09-24 13:30                                         ` Josef Bacik
  2019-09-24 14:51                                           ` Al Viro
       [not found]                                         ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
  1 sibling, 1 reply; 49+ messages in thread
From: Josef Bacik @ 2019-09-24 13:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 03:52:15AM +0100, Al Viro wrote:
> [btrfs and cifs folks Cc'd]
> 
> On Sat, Sep 21, 2019 at 03:07:31PM +0100, Al Viro wrote:
> 
> > No "take cursors out of the list" parts yet.
> 
> Argh...  Things turned interesting.  The tricky part is
> where we handle switching cursors away from something
> that gets moved.
> 
> What I hoped for was "just do it in simple_rename()".  Which is
> almost OK; there are 3 problematic cases.  One is shmem -
> there we have a special ->rename(), which handles things
> like RENAME_EXCHANGE et al.  Fair enough - some of that
> might've been moved into simple_rename(), but some (whiteouts)
> won't be that easy.  Fair enough - we can make kicking the
> cursors out a helper called by simple_rename() and by shmem's
> ->rename().  The exchange case is going to cause a bit of headache
> (the pathological case is when the entries being exchanged are
> next to each other in the same directory), but it's not
> that bad.
> 
> Two other cases, though, might be serious trouble.  Those are
> btrfs new_simple_dir() and this in cifs_root_iget():
>         if (rc && tcon->pipe) {
>                 cifs_dbg(FYI, "ipc connection - fake read inode\n");
>                 spin_lock(&inode->i_lock);
>                 inode->i_mode |= S_IFDIR;
>                 set_nlink(inode, 2);
>                 inode->i_op = &cifs_ipc_inode_ops;
>                 inode->i_fop = &simple_dir_operations;
>                 inode->i_uid = cifs_sb->mnt_uid;
>                 inode->i_gid = cifs_sb->mnt_gid;
>                 spin_unlock(&inode->i_lock);
> 	}
> The trouble is, it looks like d_splice_alias() from a lookup elsewhere
> might find an alias of some subdirectory in those.  And in that case
> we'll end up with a child of those (dcache_readdir-using) directories
> being ripped out and moved elsewhere.  With no calls of ->rename() in
> sight, of course, *AND* with only a shared lock on the parent.  The
> last part is really nasty.  And not just for hanging cursors off the
> dentries they point to - it's a problem for dcache_readdir() itself
> even in the mainline and with all the lockless crap reverted.
> 
> We pass next->d_name.name to dir_emit() (i.e. potentially to
> copy_to_user()).  And we have no guarantee that it's not a long
> (== separately allocated) name, that will be freed while
> copy_to_user() is in progress.  Sure, it'll get an RCU delay
> before freeing, but that doesn't help us at all.
> 
> I'm not familiar with those areas in btrfs or cifs; could somebody
> explain what's going on there and can we indeed end up finding aliases
> to those suckers?

We can't for the btrfs case.  This is used for the case where we have a link to
a subvolume but the root has disappeared already, so we add in that dummy inode.
We completely drop the dcache from that root downwards when we drop the
subvolume, so we're not going to find aliases underneath those things.  Is that
what you're asking?  Thanks,

Josef


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 13:30                                         ` Josef Bacik
@ 2019-09-24 14:51                                           ` Al Viro
  2019-09-24 15:01                                             ` Josef Bacik
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-24 14:51 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 09:30:26AM -0400, Josef Bacik wrote:

> > We pass next->d_name.name to dir_emit() (i.e. potentially to
> > copy_to_user()).  And we have no guarantee that it's not a long
> > (== separately allocated) name, that will be freed while
> > copy_to_user() is in progress.  Sure, it'll get an RCU delay
> > before freeing, but that doesn't help us at all.
> > 
> > I'm not familiar with those areas in btrfs or cifs; could somebody
> > explain what's going on there and can we indeed end up finding aliases
> > to those suckers?
> 
> We can't for the btrfs case.  This is used for the case where we have a link to
> a subvolume but the root has disappeared already, so we add in that dummy inode.
> We completely drop the dcache from that root downwards when we drop the
> subvolume, so we're not going to find aliases underneath those things.  Is that
> what you're asking?  Thanks,

Umm...  Completely drop, you say?  What happens if something had been opened
in there at that time?

Could you give a bit more background?  How much of that subvolume remains
and what does btrfs_lookup() have to work with?


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 14:51                                           ` Al Viro
@ 2019-09-24 15:01                                             ` Josef Bacik
  2019-09-24 15:11                                               ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Josef Bacik @ 2019-09-24 15:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Josef Bacik, Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 03:51:04PM +0100, Al Viro wrote:
> On Tue, Sep 24, 2019 at 09:30:26AM -0400, Josef Bacik wrote:
> 
> > > We pass next->d_name.name to dir_emit() (i.e. potentially to
> > > copy_to_user()).  And we have no guarantee that it's not a long
> > > (== separately allocated) name, that will be freed while
> > > copy_to_user() is in progress.  Sure, it'll get an RCU delay
> > > before freeing, but that doesn't help us at all.
> > > 
> > > I'm not familiar with those areas in btrfs or cifs; could somebody
> > > explain what's going on there and can we indeed end up finding aliases
> > > to those suckers?
> > 
> > We can't for the btrfs case.  This is used for the case where we have a link to
> > a subvolume but the root has disappeared already, so we add in that dummy inode.
> > We completely drop the dcache from that root downwards when we drop the
> > subvolume, so we're not going to find aliases underneath those things.  Is that
> > what you're asking?  Thanks,
> 
> Umm...  Completely drop, you say?  What happens if something had been opened
> in there at that time?
> 
> Could you give a bit more background?  How much of that subvolume remains
> and what does btrfs_lookup() have to work with?

Sorry, I misread the code a little bit.  This is purely for the subvolume link
directories.  We haven't wandered down into this directory yet.  If the
subvolume is being deleted and we still have the fake directory entry for it,
then we just populate it with this dummy inode, and then we can't look up
anything underneath it.  Thanks,

Josef


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 15:01                                             ` Josef Bacik
@ 2019-09-24 15:11                                               ` Al Viro
  2019-09-24 15:26                                                 ` Josef Bacik
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-24 15:11 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 11:01:45AM -0400, Josef Bacik wrote:

> Sorry, I misread the code a little bit.  This is purely for the subvolume link
> directories.  We haven't wandered down into this directory yet.  If the
> subvolume is being deleted and we still have the fake directory entry for it,
> then we just populate it with this dummy inode, and then we can't look up
> anything underneath it.  Thanks,

Umm...  OK, I guess my question would be better stated a bit differently: we
have a directory inode, with btrfs_lookup() for lookups in it *and* with
dcache_readdir() called when you try to do getdents(2) on that thing.
How does that work?


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 15:11                                               ` Al Viro
@ 2019-09-24 15:26                                                 ` Josef Bacik
  2019-09-24 16:33                                                   ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Josef Bacik @ 2019-09-24 15:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Josef Bacik, Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 04:11:07PM +0100, Al Viro wrote:
> On Tue, Sep 24, 2019 at 11:01:45AM -0400, Josef Bacik wrote:
> 
> > Sorry, I misread the code a little bit.  This is purely for the subvolume link
> > directories.  We haven't wandered down into this directory yet.  If the
> > subvolume is being deleted and we still have the fake directory entry for it,
> > then we just populate it with this dummy inode, and then we can't look up
> > anything underneath it.  Thanks,
> 
> Umm...  OK, I guess my question would be better stated a bit differently: we
> have a directory inode, with btrfs_lookup() for lookups in it *and* with
> dcache_readdir() called when you try to do getdents(2) on that thing.
> How does that work?

Sorry I hadn't read through the context.  We won't end up with things under this
directory.  The lookup will try to look up into the subvolume, see that it's
empty, and just return nothing.  There should never be any entries that end up
under this dummy entry.  Thanks,

Josef


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 15:26                                                 ` Josef Bacik
@ 2019-09-24 16:33                                                   ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-24 16:33 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 11:26:28AM -0400, Josef Bacik wrote:
> On Tue, Sep 24, 2019 at 04:11:07PM +0100, Al Viro wrote:
> > On Tue, Sep 24, 2019 at 11:01:45AM -0400, Josef Bacik wrote:
> > 
> > > Sorry, I misread the code a little bit.  This is purely for the subvolume link
> > > directories.  We haven't wandered down into this directory yet.  If the
> > > subvolume is being deleted and we still have the fake directory entry for it,
> > > then we just populate it with this dummy inode, and then we can't look up
> > > anything underneath it.  Thanks,
> > 
> > Umm...  OK, I guess my question would be better stated a bit differently: we
> > have a directory inode, with btrfs_lookup() for lookups in it *and* with
> > dcache_readdir() called when you try to do getdents(2) on that thing.
> > How does that work?
> 
> Sorry I hadn't read through the context.  We won't end up with things under this
> directory.  The lookup will try to look up into the subvolume, see that it's
> empty, and just return nothing.  There should never be any entries that end up
> under this dummy entry.  Thanks,

Er...  Then why not use simple_lookup() in there?  Confused...


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-21 14:07                                     ` Al Viro
  2019-09-21 16:21                                       ` Linus Torvalds
  2019-09-24  2:52                                       ` Al Viro
@ 2019-09-25 11:59                                       ` Amir Goldstein
  2019-09-25 12:22                                         ` Al Viro
  2 siblings, 1 reply; 49+ messages in thread
From: Amir Goldstein @ 2019-09-25 11:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Mon, Sep 23, 2019 at 5:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, #next.dcache has the straight conversion to hlist.

Note that this:
@@ -108,8 +108,8 @@ struct dentry {
                struct list_head d_lru;         /* LRU list */
                wait_queue_head_t *d_wait;      /* in-lookup ones only */
        };
-       struct list_head d_child;       /* child of parent list */
-       struct list_head d_subdirs;     /* our children */
+       struct hlist_node d_sibling;    /* child of parent list */
+       struct hlist_head d_children;   /* our children */

Changes the 'standard' struct dentry size from 192 to 184.

Does that matter for cache line alignment?

Should struct dentry be ____cacheline_aligned?
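
(For reference, a build-time check along these lines would catch any
such drift - a sketch, with 64-byte lines and the old three-line
budget assumed:)

	/* e.g. in dcache_init(); both limits are assumptions here */
	BUILD_BUG_ON(sizeof(struct dentry) > 3 * 64);		/* budget: 192 */
	BUILD_BUG_ON(sizeof(struct dentry) % L1_CACHE_BYTES);	/* 184 trips this */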

Thanks,
Amir.


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-25 11:59                                       ` Amir Goldstein
@ 2019-09-25 12:22                                         ` Al Viro
  2019-09-25 12:34                                           ` Amir Goldstein
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-09-25 12:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Wed, Sep 25, 2019 at 02:59:47PM +0300, Amir Goldstein wrote:
> On Mon, Sep 23, 2019 at 5:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > FWIW, #next.dcache has the straight conversion to hlist.
> 
> Note that this:
> @@ -108,8 +108,8 @@ struct dentry {
>                 struct list_head d_lru;         /* LRU list */
>                 wait_queue_head_t *d_wait;      /* in-lookup ones only */
>         };
> -       struct list_head d_child;       /* child of parent list */
> -       struct list_head d_subdirs;     /* our children */
> +       struct hlist_node d_sibling;    /* child of parent list */
> +       struct hlist_head d_children;   /* our children */
> 
> Changes the 'standard' struct dentry size from 192 to 184.
> 
> Does that matter for cache line alignment?
> 
> Should struct dentry be ____cacheline_aligned?

*shrug*

DNAME_INLINE_LEN would need to be adjusted; it's just that I think
it would make a lot of sense to represent cursors not by dentries and
hang those on an additional hlist_head in dentry...
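
Something like this, say (names invented):

/* not a dentry anymore - just a position in the directory */
struct dir_cursor {
	struct hlist_node node;	/* on ->d_cursors of the entry below */
	struct dentry *at;	/* sibling we are parked at, or NULL */
};

/* with struct dentry growing
 *	struct hlist_head d_cursors;
 * so that rename/exchange can find and switch every cursor parked at
 * the entry it is about to move. */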


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-25 12:22                                         ` Al Viro
@ 2019-09-25 12:34                                           ` Amir Goldstein
  0 siblings, 0 replies; 49+ messages in thread
From: Amir Goldstein @ 2019-09-25 12:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao

On Wed, Sep 25, 2019 at 3:22 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Sep 25, 2019 at 02:59:47PM +0300, Amir Goldstein wrote:
> > On Mon, Sep 23, 2019 at 5:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > FWIW, #next.dcache has the straight conversion to hlist.
> >
> > Note that this:
> > @@ -108,8 +108,8 @@ struct dentry {
> >                 struct list_head d_lru;         /* LRU list */
> >                 wait_queue_head_t *d_wait;      /* in-lookup ones only */
> >         };
> > -       struct list_head d_child;       /* child of parent list */
> > -       struct list_head d_subdirs;     /* our children */
> > +       struct hlist_node d_sibling;    /* child of parent list */
> > +       struct hlist_head d_children;   /* our children */
> >
> > Changes the 'standard' struct dentry size from 192 to 184.
> >
> > Does that matter for cache line alignment?
> >
> > Should struct dentry be ____cacheline_aligned?
>
> *shrug*
>
> DNAME_INLINE_LEN would need to be adjusted;

OK. But increasing DNAME_INLINE_LEN will move d_fsdata
across a cache line boundary.
Not sure if that matters, but it seems to be documented as
part of the group of struct members touched by "Ref lookup".
It was intentionally moved by commit
44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk").

Anyway, if you end up adding another hlist_head that won't be a
problem.
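
(Same idea as above - it could be pinned down at build time; a sketch,
again with 64-byte lines assumed:)

	/* keep d_fsdata inside the second cache line, together with
	 * the other members that the "Ref lookup" comment groups */
	BUILD_BUG_ON(offsetof(struct dentry, d_fsdata) >= 2 * 64);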

Thanks,
Amir.


* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
       [not found]                                         ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
@ 2019-09-29  5:29                                           ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-09-29  5:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin, Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 09:55:06AM -0700, Linus Torvalds wrote:
> [ Sorry for html, I'm driving around ]
> 
> On Mon, Sep 23, 2019, 19:52 Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> >
> > Argh...  The things turned interesting.  The tricky part is
> > where do we handle switching cursors away from something
> > that gets moved.
> >
> 
> I forget why we do that. Remind me?
> 
> Anyway, I think to a first approximation, we should probably first just see
> if the "remove cursors from the end" change already effectively makes most
> of the regression go away.
> 
> Because with that change, if you just readdir() things with a sufficiently
> big buffer, you'll never have the cursor hang around over several system
> calls.
> 
> Before, you'd do a readdir, add the cursor, return to user space, then do
> another readdir just to see the EOF, and then do the close().
> 
> So it used to leave the cursor in place over those two final system calls,
> and now it's all gone.
> 
> I'm sure that on the big systems you can still trigger the whole d_lock
> contention on the parent, but I wonder if it's all that much of a problem
> any more in practice with your other change..

If nothing else, it's DoSable right now.  And getting rid of that would
reduce the memory footprint, while we are at it.

In any case, it looks like btrfs really wants an empty directory there,
i.e. the right thing to do would be simple_lookup() for ->lookup.
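
For instance (a sketch against new_simple_dir(), names invented and
untested):

static const struct inode_operations dummy_dir_iops = {
	/* always instantiates a negative dentry: "nothing in here" */
	.lookup		= simple_lookup,
};

/* and in new_simple_dir() (sketch):
 *	inode->i_op  = &dummy_dir_iops;
 *	inode->i_fop = &simple_dir_operations;
 * so lookups and getdents() agree that the directory is empty. */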

CIFS is potentially trickier.  AFAICS, what's going on is
	* Windows has a strange export, called IPC$.  Looks like it
was (still is?) the place where they export their named pipes.  From
what I'd been able to figure out, it's always there and allows for
some other uses - it can be used to get the list of exports.  Whether
the actual named pipes are seen there these days... no idea.
	* there seems to be nothing to prevent a server (modified
samba, for example) from exporting whatever it wants under that
name.
	* IF it can be non-empty, mounting it will end up with
a root directory where we can do lookups for whatever is there.
getdents() on that root will show what's currently in dcache
(== had been looked up and still has not been evicted by
memory pressure).  Mainline can get seriously buggered if
dcache_readdir() steps into something that is going away.  With the
patches in this series that's no longer a problem.  HOWEVER, if
lookup in one subdirectory picks an alias for another subdirectory
of the root, we will be really screwed - a shared lock on the parent
won't stop d_splice_alias() from moving the alias, and that can
bloody well lead to freeing the original name.  From under
copy_to_user()...  And grabbing a reference to the dentry obviously
doesn't prevent that - the dentry itself won't get freed, but the
external name very well might be.

Again, I don't know CIFS well enough to tell how IPC$ is really
used.  If it doesn't normally contain anything but named pipes,
we can simply have cifs_lookup() refuse to return subdirectories
on such mounts, which solves any problems with d_splice_alias().
If it's always empty - better yet.  If the impressions above
(and that's all those are) are correct, we might have a problem
in mainline with bogus/malicious servers.
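
Something along these lines, perhaps (a sketch only - I haven't
checked it against the cifs tree, and the variable names are assumed):

	/* in cifs_lookup(), once we know what the server gave us */
	if (tcon->pipe && newinode && S_ISDIR(newinode->i_mode)) {
		/* IPC$ should only contain pipes; never hand out a
		 * subdirectory that d_splice_alias() could later move */
		iput(newinode);
		return ERR_PTR(-EACCES);
	}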

That's independent of what we do with the cursors.  I asked
around a bit; no luck so far.  It would be really nice if
some CIFS folks could give a braindump on that thing...


end of thread

Thread overview: 49+ messages
2019-09-03 14:44 Possible FS race condition between iterate_dir and d_alloc_parallel zhengbin (A)
2019-09-03 15:40 ` Al Viro
2019-09-03 15:41   ` Al Viro
2019-09-04  6:15     ` zhengbin (A)
2019-09-05 17:47       ` Al Viro
2019-09-06  0:55         ` Jun Li
2019-09-06  2:00           ` Al Viro
2019-09-06  2:32         ` zhengbin (A)
2019-09-09 14:10       ` zhengbin (A)
2019-09-09 14:59         ` Al Viro
2019-09-09 15:10           ` zhengbin (A)
     [not found]             ` <7e32cda5-dc89-719d-9651-cf2bd06ae728@huawei.com>
2019-09-10 21:53               ` Al Viro
2019-09-10 22:17                 ` Al Viro
2019-09-14 16:16                 ` [PATCH] " Al Viro
2019-09-14 16:49                   ` Linus Torvalds
2019-09-14 17:01                     ` Al Viro
2019-09-14 17:15                       ` Linus Torvalds
2019-09-14 20:04                         ` Al Viro
2019-09-14 22:57                           ` Linus Torvalds
2019-09-15  0:50                             ` Al Viro
2019-09-15  1:41                               ` Linus Torvalds
2019-09-15 16:02                                 ` Al Viro
2019-09-15 17:58                                   ` Linus Torvalds
2019-09-21 14:07                                     ` Al Viro
2019-09-21 16:21                                       ` Linus Torvalds
2019-09-21 17:18                                         ` Al Viro
2019-09-21 17:38                                           ` Linus Torvalds
2019-09-24  2:52                                       ` Al Viro
2019-09-24 13:30                                         ` Josef Bacik
2019-09-24 14:51                                           ` Al Viro
2019-09-24 15:01                                             ` Josef Bacik
2019-09-24 15:11                                               ` Al Viro
2019-09-24 15:26                                                 ` Josef Bacik
2019-09-24 16:33                                                   ` Al Viro
     [not found]                                         ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
2019-09-29  5:29                                           ` Al Viro
2019-09-25 11:59                                       ` Amir Goldstein
2019-09-25 12:22                                         ` Al Viro
2019-09-25 12:34                                           ` Amir Goldstein
2019-09-22 21:29                     ` Al Viro
2019-09-23  3:32                       ` zhengbin (A)
2019-09-23  5:08                         ` Al Viro
     [not found]                   ` <20190916020434.tutzwipgs4f6o3di@inn2.lkp.intel.com>
2019-09-16  2:58                     ` 266a9a8b41: WARNING:possible_recursive_locking_detected Al Viro
2019-09-16  3:03                       ` Al Viro
2019-09-16  3:44                         ` Linus Torvalds
2019-09-16 17:16                           ` Al Viro
2019-09-16 17:29                             ` Al Viro
     [not found]                             ` <bd707e64-9650-e9ed-a820-e2cabd02eaf8@huawei.com>
2019-09-17 12:01                               ` Al Viro
2019-09-19  3:36                                 ` zhengbin (A)
2019-09-19  3:55                                   ` Al Viro
