* Files leak from nfsd in 4.7.1-rc1 (and more?)
@ 2016-06-07 15:37 ` Oleg Drokin
  0 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-07 15:37 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton
  Cc: linux-nfs, <linux-kernel@vger.kernel.org> Mailing List

Hello!

   I've been trying to better understand this problem I was having where sometimes
   a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).

   I finally traced it to a leaked file descriptor that was allocated from
   nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().

   Together with it we also see leaked credentials allocated along the same path from
   fh_verify(), and groups allocated from svcauth_unix_accept()->groups_alloc() that
   are presumably used by those credentials.

   Unfortunately I was not able to make total sense out of the state handling in nfsd,
   but it's clear that one of the file descriptors inside struct nfs4_file is
   lost. I added a patch like this (always a good idea, so I'm surprised it was not
   there already):
@@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 {
        struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
 
+       WARN_ON(fp->fi_fds[0]);
+       WARN_ON(fp->fi_fds[1]);
+       WARN_ON(fp->fi_fds[2]);
        kmem_cache_free(file_slab, fp);
 }

   And when the problem is hit, I also trigger this warning (always this one, which is fd[1]):
[ 3588.143002] ------------[ cut here ]------------
[ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
[ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
[ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
[ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
[ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
[ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
[ 3588.153834] Call Trace:
[ 3588.153839]  [<ffffffff814a11a5>] dump_stack+0x86/0xc1
[ 3588.153841]  [<ffffffff8108806b>] __warn+0xcb/0xf0
[ 3588.153852]  [<ffffffffc0110fc0>] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
[ 3588.153853]  [<ffffffff8108819d>] warn_slowpath_null+0x1d/0x20
[ 3588.153859]  [<ffffffffc0111025>] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
[ 3588.153861]  [<ffffffff81109c65>] rcu_nocb_kthread+0x335/0x510
[ 3588.153862]  [<ffffffff81109baf>] ? rcu_nocb_kthread+0x27f/0x510
[ 3588.153863]  [<ffffffff81109930>] ? rcu_cpu_notify+0x3e0/0x3e0
[ 3588.153866]  [<ffffffff810af391>] kthread+0x101/0x120
[ 3588.153868]  [<ffffffff810e6c84>] ? trace_hardirqs_on_caller+0xf4/0x1b0
[ 3588.153871]  [<ffffffff8188b6af>] ret_from_fork+0x1f/0x40
[ 3588.153872]  [<ffffffff810af290>] ? kthread_create_on_node+0x250/0x250


  release_all_access() seems to be doing the correct job of all that cleaning, so
  there must be some other path that I do not quite see.

  Hopefully you are more familiar with the code and can see the problem right away ;)

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-07 15:37 ` Oleg Drokin
@ 2016-06-07 17:10   ` Jeff Layton
  -1 siblings, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-07 17:10 UTC (permalink / raw)
  To: Oleg Drokin, J. Bruce Fields
  Cc: linux-nfs, <linux-kernel@vger.kernel.org> Mailing List

On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> Hello!
> 
>    I've been trying to better understand this problem I was having where sometimes
>    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
> 
>    I finally traced it to a leaked filedescriptor that was allocated from
>    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> 
>    Also together with it we see leaked credentials allocated along the same path from
>    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
>    are presumably used by the credentials.
> 
>    Unfortunately I was not able to make total sense out of the state handling in nfsd,
>    but it's clear that one of the file descriptors inside struct nfs4_file is
>    lost. I added a patch like this (always a good idea, so surprised it was not
>    there already):
> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>  {
>         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>  
> +       WARN_ON(fp->fi_fds[0]);
> +       WARN_ON(fp->fi_fds[1]);
> +       WARN_ON(fp->fi_fds[2]);
>         kmem_cache_free(file_slab, fp);
>  }
> 
>    And when the problem is hit, I am also triggering (Always this one which is fd[1])
> [ 3588.143002] ------------[ cut here ]------------
> [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
> [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
> [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
> [ 3588.153834] Call Trace:
> [ 3588.153839]  [] dump_stack+0x86/0xc1
> [ 3588.153841]  [] __warn+0xcb/0xf0
> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> [ 3588.153866]  [] kthread+0x101/0x120
> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> 
> 
>   release_all_access() seems to be doing correct job of all that cleaning, so
>   there must be some other path that I do not quite see.
> 
>   Hopefully you are more familiar with the code and can see the problem right away ;)
> 
> Bye,
>     Oleg

Hmm...well I'm not seeing it right offhand, and haven't been able to
reproduce the problem so far after a couple of attempts by hand. What
sort of workload are you running before you see that warning pop?

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-07 17:10   ` Jeff Layton
@ 2016-06-07 17:30     ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-07 17:30 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
>> Hello!
>> 
>>    I've been trying to better understand this problem I was having where sometimes
>>    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
>> 
>>    I finally traced it to a leaked filedescriptor that was allocated from
>>    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
>> 
>>    Also together with it we see leaked credentials allocated along the same path from
>>    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
>>    are presumably used by the credentials.
>> 
>>    Unfortunately I was not able to make total sense out of the state handling in nfsd,
>>    but it's clear that one of the file descriptors inside struct nfs4_file is
>>    lost. I added a patch like this (always a good idea, so surprised it was not
>>    there already):
>> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>  {
>>         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>>  
>> +       WARN_ON(fp->fi_fds[0]);
>> +       WARN_ON(fp->fi_fds[1]);
>> +       WARN_ON(fp->fi_fds[2]);
>>         kmem_cache_free(file_slab, fp);
>>  }
>> 
>>    And when the problem is hit, I am also triggering (Always this one which is fd[1])
>> [ 3588.143002] ------------[ cut here ]------------
>> [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
>> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
>> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
>> [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
>> [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
>> [ 3588.153834] Call Trace:
>> [ 3588.153839]  [] dump_stack+0x86/0xc1
>> [ 3588.153841]  [] __warn+0xcb/0xf0
>> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
>> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
>> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
>> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
>> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
>> [ 3588.153866]  [] kthread+0x101/0x120
>> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
>> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
>> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
>> 
>> 
>>   release_all_access() seems to be doing correct job of all that cleaning, so
>>   there must be some other path that I do not quite see.
>> 
>>   Hopefully you are more familiar with the code and can see the problem right away ;)
> 
> Hmm...well I'm not seeing it right offhand, and haven't been able to
> reproduce the problem so far after a couple of attempts by hand. What
> sort of workload are you running before you see that warning pop?

Well, the workload is a crazed, twisty set of conflicting operations in a very limited
namespace (called racer.sh,
http://git.whamcloud.com/fs/lustre-release.git/tree/refs/heads/master:/lustre/tests/racer )

It does all sorts of crazy, unimaginable stuff that is hard to predict, and typically I
run it on several mountpoints all pointing at the same spot
(now 3-way: one instance on the direct ext4 mount point, one on a mount -t nfs
mount, and one on a mount -t nfs4 mount).
It must be hitting some rare race or a race + operations combination that leads to
the leak.

To reproduce I run it like this until it stops (and it usually stops in
2 or 3 iterations, so it takes some time):
$ cat ~/stress.sh 
#!/bin/bash

set -x

cd /home/green/git/lustre-release/lustre/tests/racer
dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
mkfs.ext4 /tmp/loop
service rpcbind start
mount none /var/lib/nfs -t tmpfs
touch /var/lib/nfs/etab
service nfs-mountd start

while :; do

mount /tmp/loop /mnt/lustre -o loop || exit 1
mkdir /mnt/lustre/racer
service nfs-server start || exit 2
mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
DURATION=3600 sh racer.sh /mnt/nfs/racer &
DURATION=3600 sh racer.sh /mnt/nfs2/racer &
DURATION=3600 sh racer.sh /mnt/lustre/racer &
wait %1 %2 %3
umount /mnt/nfs || exit 5
umount /mnt/nfs2 || exit 6
service nfs-server stop || exit 7
umount /mnt/lustre || exit 8
e2fsck -n -f /tmp/loop || exit 9
done

Please ignore all the Lustre references; the test is generic. It's just that I am
running off an nfs-readonly-root setup, so I want to reuse the existing mountpoint
locations (there's no Lustre in the picture here).
With all the other problems from before now fixed, it typically stops either on
the leak discussed here, or on this other nfs4 problem I described here:
http://marc.info/?l=linux-nfs&m=146518324823160&w=2

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-07 17:30     ` Oleg Drokin
@ 2016-06-07 20:04       ` Jeff Layton
  -1 siblings, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-07 20:04 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List

On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
> 
> > 
> > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> > > 
> > > Hello!
> > > 
> > >    I've been trying to better understand this problem I was having where sometimes
> > >    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
> > > 
> > >    I finally traced it to a leaked filedescriptor that was allocated from
> > >    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> > > 
> > >    Also together with it we see leaked credentials allocated along the same path from
> > >    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
> > >    are presumably used by the credentials.
> > > 
> > >    Unfortunately I was not able to make total sense out of the state handling in nfsd,
> > >    but it's clear that one of the file descriptors inside struct nfs4_file is
> > >    lost. I added a patch like this (always a good idea, so surprised it was not
> > >    there already):
> > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> > >  {
> > >         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
> > >  
> > > +       WARN_ON(fp->fi_fds[0]);
> > > +       WARN_ON(fp->fi_fds[1]);
> > > +       WARN_ON(fp->fi_fds[2]);
> > >         kmem_cache_free(file_slab, fp);
> > >  }
> > > 
> > >    And when the problem is hit, I am also triggering (Always this one which is fd[1])
> > > [ 3588.143002] ------------[ cut here ]------------
> > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
> > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
> > > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
> > > [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
> > > [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
> > > [ 3588.153834] Call Trace:
> > > [ 3588.153839]  [] dump_stack+0x86/0xc1
> > > [ 3588.153841]  [] __warn+0xcb/0xf0
> > > [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> > > [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> > > [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> > > [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> > > [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> > > [ 3588.153866]  [] kthread+0x101/0x120
> > > [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> > > [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> > > [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> > > 
> > > 
> > >   release_all_access() seems to be doing correct job of all that cleaning, so
> > >   there must be some other path that I do not quite see.
> > > 
> > >   Hopefully you are more familiar with the code and can see the problem right away ;)
> > Hmm...well I'm not seeing it right offhand, and haven't been able to
> > reproduce the problem so far after a couple of attempts by hand. What
> > sort of workload are you running before you see that warning pop?
> Well, the workload is a crazed twisty conflicting operations set in a very limited
> namespace (called racer.sh,
> http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
> 
> It does all sorts of crazy unimaginable stuff that is hard to predict, and typically I
> run it on several mountpoints all pointing to the same spot
> (now 3-ways: one instance on direct mount point in ext4, one on nfs mount mounted as
> mount -o nfs and one on mount -o nfs4).
> It must be hitting some rare race or a race + operations combination that leads to
> the leak.
> 
> To reproduce I run it like this until it stops (and it usually stops in
> 2 or 3 iterations, so it takes some time):
> $ cat ~/stress.sh 
> #!/bin/bash
> 
> set -x
> 
> cd /home/green/git/lustre-release/lustre/tests/racer
> dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
> mkfs.ext4 /tmp/loop
> service rpcbind start
> mount none /var/lib/nfs -t tmpfs
> touch /var/lib/nfs/etab
> service nfs-mountd start
> 
> while :; do
> 
> mount /tmp/loop /mnt/lustre -o loop || exit 1
> mkdir /mnt/lustre/racer
> service nfs-server start || exit 2
> mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
> mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
> DURATION=3600 sh racer.sh /mnt/nfs/racer &
> DURATION=3600 sh racer.sh /mnt/nfs2/racer &
> DURATION=3600 sh racer.sh /mnt/lustre/racer &
> wait %1 %2 %3
> umount /mnt/nfs || exit 5
> umount /mnt/nfs2 || exit 6
> service nfs-server stop || exit 7
> umount /mnt/lustre || exit 8
> e2fsck -n -f /tmp/loop || exit 9
> done
> 
> Please ignore all the Lustre references, the test is generic, it's just I am running
> off nfs-readonly-root setup, so want to reuse existing mountpoint locations,
> there's no Lustre in picture here).
> With all the other problems from before fixed, now it typically stops either for
> this leak discussed here, or for this other problem in nfs4 I described here:
> http://marc.info/?l=linux-nfs&m=146518324823160&w=2

Thanks. So it seems likely we're looking at a race of some sort then...

So, the way this is supposed to work is that the stateids each hold a
reference to the nfs4_file. They also take an fi_access reference for
the open modes that they represent. This is how we know when it's OK to
close one of the files in the fd[] slots. It might be interesting to
also print out the fi_access counters when a slot hasn't been zeroed
out, if you feel up to it.
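
One untested way to sketch that suggestion, building on the WARN_ON hunk quoted
earlier in the thread (fi_access is the atomic_t pair in struct nfs4_file in
4.7-era nfs4state.c; the exact layout may differ in your tree):

@@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 {
        struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
 
+       if (WARN_ON(fp->fi_fds[O_WRONLY]))
+               pr_warn("fi_access: rd=%d wr=%d\n",
+                       atomic_read(&fp->fi_access[O_RDONLY]),
+                       atomic_read(&fp->fi_access[O_WRONLY]));
        kmem_cache_free(file_slab, fp);
 }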

I think though that the open upgrade accounting may be wrong.
__nfs4_file_get_access does an unconditional atomic_inc, so if the
client does an open for write and then opens again for read/write, then
we may end up with too many write fi_access references taken.

That said, this code is quite subtle. I'd need to look over it in more
detail before I offer up any fixes. I'd also appreciate it if anyone
else wants to sanity check my analysis there.

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-07 20:04       ` Jeff Layton
@ 2016-06-07 23:39         ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-07 23:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
>> 
>>> 
>>> On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
>>>> 
>>>> Hello!
>>>> 
>>>>    I've been trying to better understand this problem I was having where sometimes
>>>>    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
>>>> 
>>>>    I finally traced it to a leaked filedescriptor that was allocated from
>>>>    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
>>>> 
>>>>    Also together with it we see leaked credentials allocated along the same path from
>>>>    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
>>>>    are presumably used by the credentials.
>>>> 
>>>>    Unfortunately I was not able to make total sense out of the state handling in nfsd,
>>>>    but it's clear that one of the file descriptors inside struct nfs4_file is
>>>>    lost. I added a patch like this (always a good idea, so surprised it was not
>>>>    there already):
>>>> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>>>  {
>>>>         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>>>>  
>>>> +       WARN_ON(fp->fi_fds[0]);
>>>> +       WARN_ON(fp->fi_fds[1]);
>>>> +       WARN_ON(fp->fi_fds[2]);
>>>>         kmem_cache_free(file_slab, fp);
>>>>  }
>>>> 
>>>>    And when the problem is hit, I am also triggering (Always this one which is fd[1])
>>>> [ 3588.143002] ------------[ cut here ]------------
>>>> [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>>>> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
>>>> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
>>>> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>> [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
>>>> [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
>>>> [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
>>>> [ 3588.153834] Call Trace:
>>>> [ 3588.153839]  [] dump_stack+0x86/0xc1
>>>> [ 3588.153841]  [] __warn+0xcb/0xf0
>>>> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
>>>> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
>>>> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>>>> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
>>>> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
>>>> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
>>>> [ 3588.153866]  [] kthread+0x101/0x120
>>>> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
>>>> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
>>>> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
>>>> 
>>>> 
>>>>   release_all_access() seems to be doing the correct job of all that cleaning,
>>>>   so there must be some other path that I do not quite see.
>>>> 
>>>>   Hopefully you are more familiar with the code and can see the problem right away ;)
>>> Hmm...well I'm not seeing it right offhand, and haven't been able to
>>> reproduce the problem so far after a couple of attempts by hand. What
>>> sort of workload are you running before you see that warning pop?
>> Well, the workload is a crazed twisty conflicting operations set in a very limited
>> namespace (called racer.sh,
>> http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
>> 
>> It does all sorts of crazy unimaginable stuff that is hard to predict, and typically I
>> run it on several mountpoints all pointing to the same spot
>> (now 3-ways: one instance on direct mount point in ext4, one on nfs mount mounted as
>> mount -o nfs and one on mount -o nfs4).
>> It must be hitting some rare race or a race + operations combination that leads to
>> the leak.
>> 
>> To reproduce I run it like this until it stops (and it usually stops in
>> 2 or 3 iterations, so it takes some time):
>> $ cat ~/stress.sh 
>> #!/bin/bash
>> 
>> set -x
>> 
>> cd /home/green/git/lustre-release/lustre/tests/racer
>> dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
>> mkfs.ext4 /tmp/loop
>> service rpcbind start
>> mount none /var/lib/nfs -t tmpfs
>> touch /var/lib/nfs/etab
>> service nfs-mountd start
>> 
>> while :; do
>> 
>> mount /tmp/loop /mnt/lustre -o loop || exit 1
>> mkdir /mnt/lustre/racer
>> service nfs-server start || exit 2
>> mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
>> mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
>> DURATION=3600 sh racer.sh /mnt/nfs/racer &
>> DURATION=3600 sh racer.sh /mnt/nfs2/racer &
>> DURATION=3600 sh racer.sh /mnt/lustre/racer &
>> wait %1 %2 %3
>> umount /mnt/nfs || exit 5
>> umount /mnt/nfs2 || exit 6
>> service nfs-server stop || exit 7
>> umount /mnt/lustre || exit 8
>> e2fsck -n -f /tmp/loop || exit 9
>> done
>> 
>> Please ignore all the Lustre references; the test is generic. It's just that I am
>> running off an nfs-readonly-root setup, so I want to reuse existing mountpoint
>> locations (there's no Lustre in the picture here).
>> With all the other problems from before fixed, it now typically stops either on
>> the leak discussed here or on this other nfs4 problem I described here:
>> http://marc.info/?l=linux-nfs&m=146518324823160&w=2
> 
> Thanks. So it seems likely we're looking at a race of some sort then...
> 
> So, the way this is supposed to work is that the stateids each hold a
> reference to the nfs4_file. They also take an fi_access reference for
> the open modes that they represent. This is how we know when it's OK to
> close one of the files in the fd[] slots. It might be interesting to
> also print out the fi_access counters if the slot hasn't been zeroed
> out if you feel up to it.

Ok, I have primed that and will let you know next time it triggers.

> I think though that the open upgrade accounting may be wrong.
> __nfs4_file_get_access does an unconditional atomic_inc, so if the
> client does an open for write and then opens again for read/write, then
> we may end up with too many write fi_access references taken.


The read-write access is all fine, I think, because when dropping read or write
it also checks whether the read-write slot is filled?
Also, whoever did that atomic_inc in get_access would eventually do a put,
and that should free the file, no?
I was mostly thinking there's an error exit path somewhere that forgets to do
the put.
That would also explain the other leaked pieces that are not directly attached
to the nfs4_file.

> That said, this code is quite subtle. I'd need to look over it in more
> detail before I offer up any fixes. I'd also appreciate it if anyone
> else wants to sanity check my analysis there.
> 
> -- 
> Jeff Layton <jlayton@poochiereds.net>


* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-07 23:39         ` Oleg Drokin
@ 2016-06-08  0:03           ` Jeff Layton
  -1 siblings, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-08  0:03 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List

On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:
> 
> > 
> > On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
> > > 
> > > On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
> > > 
> > > > 
> > > > 
> > > > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> > > > > 
> > > > > 
> > > > > Hello!
> > > > > 
> > > > >    I've been trying to better understand this problem I was having where sometimes
> > > > >    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
> > > > > 
> > > > >    I finally traced it to a leaked filedescriptor that was allocated from
> > > > >    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> > > > > 
> > > > >    Also together with it we see leaked credentials allocated along the same path from
> > > > >    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
> > > > >    are presumably used by the credentials.
> > > > > 
> > > > >    Unfortunately I was not able to make total sense out of the state handling in nfsd,
> > > > >    but it's clear that one of the file descriptors inside struct nfs4_file is
> > > > >    lost. I added a patch like this (always a good idea, so surprised it was not
> > > > >    there already):
> > > > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> > > > >  {
> > > > >         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
> > > > >  
> > > > > +       WARN_ON(fp->fi_fds[0]);
> > > > > +       WARN_ON(fp->fi_fds[1]);
> > > > > +       WARN_ON(fp->fi_fds[2]);
> > > > >         kmem_cache_free(file_slab, fp);
> > > > >  }
> > > > > 
> > > > >    And when the problem is hit, I am also triggering (Always this one which is fd[1])
> > > > > [ 3588.143002] ------------[ cut here ]------------
> > > > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
> > > > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
> > > > > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > > > [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
> > > > > [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
> > > > > [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
> > > > > [ 3588.153834] Call Trace:
> > > > > [ 3588.153839]  [] dump_stack+0x86/0xc1
> > > > > [ 3588.153841]  [] __warn+0xcb/0xf0
> > > > > [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> > > > > [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> > > > > [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > > > [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> > > > > [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> > > > > [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> > > > > [ 3588.153866]  [] kthread+0x101/0x120
> > > > > [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> > > > > [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> > > > > [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> > > > > 
> > > > > 
> > > > >   release_all_access() seems to be doing the correct job of all that cleaning,
> > > > >   so there must be some other path that I do not quite see.
> > > > > 
> > > > >   Hopefully you are more familiar with the code and can see the problem right away ;)
> > > > Hmm...well I'm not seeing it right offhand, and haven't been able to
> > > > reproduce the problem so far after a couple of attempts by hand. What
> > > > sort of workload are you running before you see that warning pop?
> > > Well, the workload is a crazed twisty conflicting operations set in a very limited
> > > namespace (called racer.sh,
> > > http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
> > > 
> > > It does all sorts of crazy unimaginable stuff that is hard to predict, and typically I
> > > run it on several mountpoints all pointing to the same spot
> > > (now 3-ways: one instance on direct mount point in ext4, one on nfs mount mounted as
> > > mount -o nfs and one on mount -o nfs4).
> > > It must be hitting some rare race or a race + operations combination that leads to
> > > the leak.
> > > 
> > > To reproduce I run it like this until it stops (and it usually stops in
> > > 2 or 3 iterations, so it takes some time):
> > > $ cat ~/stress.sh 
> > > #!/bin/bash
> > > 
> > > set -x
> > > 
> > > cd /home/green/git/lustre-release/lustre/tests/racer
> > > dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
> > > mkfs.ext4 /tmp/loop
> > > service rpcbind start
> > > mount none /var/lib/nfs -t tmpfs
> > > touch /var/lib/nfs/etab
> > > service nfs-mountd start
> > > 
> > > while :; do
> > > 
> > > mount /tmp/loop /mnt/lustre -o loop || exit 1
> > > mkdir /mnt/lustre/racer
> > > service nfs-server start || exit 2
> > > mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
> > > mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
> > > DURATION=3600 sh racer.sh /mnt/nfs/racer &
> > > DURATION=3600 sh racer.sh /mnt/nfs2/racer &
> > > DURATION=3600 sh racer.sh /mnt/lustre/racer &
> > > wait %1 %2 %3
> > > umount /mnt/nfs || exit 5
> > > umount /mnt/nfs2 || exit 6
> > > service nfs-server stop || exit 7
> > > umount /mnt/lustre || exit 8
> > > e2fsck -n -f /tmp/loop || exit 9
> > > done
> > > 
> > > Please ignore all the Lustre references; the test is generic. It's just that I am
> > > running off an nfs-readonly-root setup, so I want to reuse existing mountpoint
> > > locations (there's no Lustre in the picture here).
> > > With all the other problems from before fixed, it now typically stops either on
> > > the leak discussed here or on this other nfs4 problem I described here:
> > > http://marc.info/?l=linux-nfs&m=146518324823160&w=2
> > Thanks. So it seems likely we're looking at a race of some sort then...
> > 
> > So, the way this is supposed to work is that the stateids each hold a
> > reference to the nfs4_file. They also take an fi_access reference for
> > the open modes that they represent. This is how we know when it's OK to
> > close one of the files in the fd[] slots. It might be interesting to
> > also print out the fi_access counters if the slot hasn't been zeroed
> > out if you feel up to it.
> Ok, I have primed that and will let you know next time it triggers.
> 

Thanks. Maybe that will give us some idea of where to look...

> > 
> > I think though that the open upgrade accounting may be wrong.
> > __nfs4_file_get_access does an unconditional atomic_inc, so if the
> > client does an open for write and then opens again for read/write, then
> > we may end up with too many write fi_access references taken.
> 
> The read-write access is all fine, I think, because when dropping read or write
> it also checks whether the read-write slot is filled?
> Also, whoever did that atomic_inc in get_access would eventually do a put,
> and that should free the file, no?
> I was mostly thinking there's an error exit path somewhere that forgets to do
> the put.
> That would also explain the other leaked pieces that are not directly attached
> to the nfs4_file.
> 
> > 
> > That said, this code is quite subtle. I'd need to look over it in more
> > detail before I offer up any fixes. I'd also appreciate it if anyone
> > else wants to sanity check my analysis there.
> > 

Yeah, I think you're right. It's fine, since r/w opens have a distinct
slot, even though the refcounting just tracks the number of read and
write references. So yeah, the leak is probably in an error path
somewhere, or maybe a race.

-- 
Jeff Layton <jlayton@poochiereds.net>


> > > > sort of workload are you running before you see that warning pop?
> > > Well, the workload is a crazed twisty conflicting operations set in a very limited
> > > namespace (called racer.sh,
> > > http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
> > > 
> > > It does all sorts of crazy unimaginable stuff that is hard to predict, and typically I
> > > run it on several mountpoints all pointing to the same spot
> > > (now 3-ways: one instance on direct mount point in ext4, one on nfs mount mounted as
> > > mount -o nfs and one on mount -o nfs4).
> > > It must be hitting some rare race or a race + operations combination that leads to
> > > the leak.
> > > 
> > > To reproduce I run it like this until it stops (and it usually stops in
> > > 2 or 3 iterations, so it takes some time):
> > > $ cat ~/stress.sh 
> > > #!/bin/bash
> > > 
> > > set -x
> > > 
> > > cd /home/green/git/lustre-release/lustre/tests/racer
> > > dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
> > > mkfs.ext4 /tmp/loop
> > > service rpcbind start
> > > mount none /var/lib/nfs -t tmpfs
> > > touch /var/lib/nfs/etab
> > > service nfs-mountd start
> > > 
> > > while :; do
> > > 
> > > mount /tmp/loop /mnt/lustre -o loop || exit 1
> > > mkdir /mnt/lustre/racer
> > > service nfs-server start || exit 2
> > > mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
> > > mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
> > > DURATION=3600 sh racer.sh /mnt/nfs/racer &
> > > DURATION=3600 sh racer.sh /mnt/nfs2/racer &
> > > DURATION=3600 sh racer.sh /mnt/lustre/racer &
> > > wait %1 %2 %3
> > > umount /mnt/nfs || exit 5
> > > umount /mnt/nfs2 || exit 6
> > > service nfs-server stop || exit 7
> > > umount /mnt/lustre || exit 8
> > > e2fsck -n -f /tmp/loop || exit 9
> > > done
> > > 
> > > Please ignore all the Lustre references; the test is generic. It's just that I am
> > > running off an nfs-readonly-root setup, so I want to reuse existing mountpoint
> > > locations; there's no Lustre in the picture here.
> > > With all the other problems from before fixed, now it typically stops either for
> > > this leak discussed here, or for this other problem in nfs4 I described here:
> > > http://marc.info/?l=linux-nfs&m=146518324823160&w=2
> > Thanks. So it seems likely we're looking at a race of some sort then...
> > 
> > So, the way this is supposed to work is that the stateids each hold a
> > reference to the nfs4_file. They also take an fi_access reference for
> > the open modes that they represent. This is how we know when it's OK to
> > close one of the files in the fd[] slots. It might be interesting to
> > also print out the fi_access counters if the slot hasn't been zeroed
> > out if you feel up to it.
> Ok, I have primed that and will let you know next time it triggers.
> 

Thanks. Maybe that will give us some idea of where to look...

> > 
> > I think though that the open upgrade accounting may be wrong.
> > __nfs4_file_get_access does an unconditional atomic_inc, so if the
> > client does a open for write and then opens again for read/write, then
> > we may end up with too many write fi_access references taken.
> 
> The read-write access is all fine I think, because when dropping read or write
> it also checks if the readwrite is filled?
> Also whoever did that atomic_inc for get_access would eventually do put,
> and that should free the file, no?
> I was mostly thinking there's an error exit path somewhere that forgets to do
> the put.
> This also would explain the other pieces leaked that are not directly attached
> into the nfs4_file.
> 
> > 
> > That said, this code is quite subtle. I'd need to look over it in more
> > detail before I offer up any fixes. I'd also appreciate it if anyone
> > else wants to sanity check my analysis there.
> > 

Yeah, I think you're right. It's fine since r/w opens have a distinct
slot, even though the refcounting just tracks the number of read and
write references. So yeah, the leak probably is in an error path
someplace, or maybe a race someplace.

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08  0:03           ` Jeff Layton
@ 2016-06-08  0:46             ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-08  0:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:
>>> So, the way this is supposed to work is that the stateids each hold a
>>> reference to the nfs4_file. They also take an fi_access reference for
>>> the open modes that they represent. This is how we know when it's OK to
>>> close one of the files in the fd[] slots. It might be interesting to
>>> also print out the fi_access counters if the slot hasn't been zeroed
>>> out if you feel up to it.
>> Ok, I have primed that and will let you know next time it triggers.
> Thanks. Maybe that will give us some idea of where to look…

Well, the counters are 0 1 0 at the warning site, which makes sense, really.

>>> I think though that the open upgrade accounting may be wrong.
>>> __nfs4_file_get_access does an unconditional atomic_inc, so if the
>>> client does a open for write and then opens again for read/write, then
>>> we may end up with too many write fi_access references taken.
>> 
>> The read-write access is all fine I think, because when dropping read or write
>> it also checks if the readwrite is filled?
>> Also whoever did that atomic_inc for get_access would eventually do put,
>> and that should free the file, no?
>> I was mostly thinking there's an error exit path somewhere that forgets to do
>> the put.
>> This also would explain the other pieces leaked that are not directly attached
>> into the nfs4_file.
>> 
>>> 
>>> That said, this code is quite subtle. I'd need to look over it in more
>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>> else wants to sanity check my analysis there.
>>> 
> 
> Yeah, I think you're right. It's fine since r/w opens have a distinct
> slot, even though the refcounting just tracks the number of read and
> write references. So yeah, the leak probably is in an error path
> someplace, or maybe a race someplace.

I guess I'll put some tracers in the code to try to match refcount takers
with refcount droppers, for lack of better ideas.

Hopefully that'll lead us to see who got the count that was not dropped.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08  0:03           ` Jeff Layton
@ 2016-06-08  2:22             ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-08  2:22 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:

>>> That said, this code is quite subtle. I'd need to look over it in more
>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>> else wants to sanity check my analysis there.
>>> 
> Yeah, I think you're right. It's fine since r/w opens have a distinct
> slot, even though the refcounting just tracks the number of read and
> write references. So yeah, the leak probably is in an error path
> someplace, or maybe a race someplace.

So I noticed that set_access is always called locked, but clear_access is not,
which does not sound right.

So I placed this strategic WARN_ON:
@@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
                        goto out_put_access;
                spin_lock(&fp->fi_lock);
                if (!fp->fi_fds[oflag]) {
+WARN_ON(!test_access(open->op_share_access, stp));
                        fp->fi_fds[oflag] = filp;
                        filp = NULL;
       
This is right in the place where nfsd has already set the access flag, discovered
that the file is not opened and went on to open it, yet some parallel thread
came in and cleared the flag by the time we got the file opened.
It did trigger (but there are 30 minutes left till the test finishes, so I don't
know yet whether this corresponds to the problem at hand, so below is speculation).

Now, at the exit from this function, the state will not have access for this file
set and the file would be leaked, since the matching call, probably in
release_all_access() does:
                if (test_access(i, stp))
                        nfs4_file_put_access(stp->st_stid.sc_file, i);
                clear_access(i, stp);

Meaning that the refcount is not dropped due to no access flag set.

I imagine we can just set the flag back right after setting fp->fi_fds[oflag]
and that particular race would be gone?

Would that be ok, or is it just wrong that somebody can come and drop
(delegation downgrade seems to be the most likely suspect) access bits like this
while they are in the middle of being acquired?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08  2:22             ` Oleg Drokin
@ 2016-06-08  3:55               ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-08  3:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 7, 2016, at 10:22 PM, Oleg Drokin wrote:

> 
> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
> 
>>>> That said, this code is quite subtle. I'd need to look over it in more
>>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>>> else wants to sanity check my analysis there.
>>>> 
>> Yeah, I think you're right. It's fine since r/w opens have a distinct
>> slot, even though the refcounting just tracks the number of read and
>> write references. So yeah, the leak probably is in an error path
>> someplace, or maybe a race someplace.
> 
> So I noticed that set_access is always called locked, but clear_access is not,
> this does not sound right.
> 
> So I placed this strategic WARN_ON:
> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>                        goto out_put_access;
>                spin_lock(&fp->fi_lock);
>                if (!fp->fi_fds[oflag]) {
> +WARN_ON(!test_access(open->op_share_access, stp));
>                        fp->fi_fds[oflag] = filp;
>                        filp = NULL;
> 
> This is right in the place where nfsd set the access flag already, discovered
> that the file is not opened and went on to open it, yet some parallel thread
> came in and cleared the flag by the time we got the file opened.
> It did trigger (but there are 30 minutes left till test finish, so I don't
> know yet if this will correspond to the problem at hand yet, so below is speculation).

Duh, I looked for a warning but did not cross-reference; it was not this one that
hit yet.

Though apparently I am hitting some of the "impossible" warnings, so you might want to
look into that anyway.

        status = nfsd4_process_open2(rqstp, resfh, open);
        WARN(status && open->op_created,
             "nfsd4_process_open2 failed to open newly-created file! status=%u\n",
             be32_to_cpu(status));

and

        filp = find_readable_file(fp);
        if (!filp) {
                /* We should always have a readable file here */
                WARN_ON_ONCE(1);
                locks_free_lock(fl);
                return -EBADF;
        }

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08  2:22             ` Oleg Drokin
@ 2016-06-08 10:58               ` Jeff Layton
  -1 siblings, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-08 10:58 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List

On Tue, 2016-06-07 at 22:22 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
> 
> > > > That said, this code is quite subtle. I'd need to look over it in more
> > > > detail before I offer up any fixes. I'd also appreciate it if anyone
> > > > else wants to sanity check my analysis there.
> > > > 
> > Yeah, I think you're right. It's fine since r/w opens have a distinct
> > slot, even though the refcounting just tracks the number of read and
> > write references. So yeah, the leak probably is in an error path
> > someplace, or maybe a race someplace.
> 
> So I noticed that set_access is always called locked, but clear_access is not,
> this does not sound right.
> 
> So I placed this strategic WARN_ON:
> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>                         goto out_put_access;
>                 spin_lock(&fp->fi_lock);
>                 if (!fp->fi_fds[oflag]) {
> +WARN_ON(!test_access(open->op_share_access, stp));
>                         fp->fi_fds[oflag] = filp;
>                         filp = NULL;
>        
> This is right in the place where nfsd set the access flag already, discovered
> that the file is not opened and went on to open it, yet some parallel thread
> came in and cleared the flag by the time we got the file opened.
> It did trigger (but there are 30 minutes left till test finish, so I don't
> know yet if this will correspond to the problem at hand yet, so below is speculation).
> 
> Now, at the exit from this function, the state will not have access for this file
> set and the file would be leaked, since the matching call, probably in
> release_all_access() does:
>                 if (test_access(i, stp))
>                         nfs4_file_put_access(stp->st_stid.sc_file, i);
>                 clear_access(i, stp);
> 
> Meaning that the refcount is not dropped due to no access flag set.
> 
> I imagine we can just set the flag back right after setting fp->fi_fds[oflag]
> and that particular race would be gone?
> 
> Would that be ok, or is it just wrong that somebody can come and drop
> (delegation downgrade seems to be the most likely suspect) access bits like this
> while they are in the middle of being acquired?
> 

I think you're on the right track...

Operations that morph the stateid's seqid should be serialized via the
st_rwsem. An OPEN_DOWNGRADE or CLOSE takes an exclusive lock on the
rwsem, but we do allow OPENs to run in parallel with each other, under
the assumption that on success you always get at least what you
requested. So, I don't think that OPEN can really race with an
OPEN_DOWNGRADE or CLOSE like you're thinking.

But...suppose we have two racing OPEN calls. They're both in
nfs4_get_vfs_file. One opens the file and succeeds and the other fails
and ends up in out_put_access. At that point, you could end up
clobbering the successful update to st_access_bmap from the other task,
and we'd end up not putting the file access references in
release_all_access.

A simple way to confirm that might be to convert all of the read locks
on the st_rwsem to write locks. That will serialize all of the open
operations and should prevent that particular race from occurring.

If that works, we'd probably want to fix it in a less heavy-handed way,
but I'd have to think about how best to do that.

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08 10:58               ` Jeff Layton
@ 2016-06-08 14:44                 ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-08 14:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 22:22 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
>> 
>>>>> That said, this code is quite subtle. I'd need to look over it in more
>>>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>>>> else wants to sanity check my analysis there.
>>>>> 
>>> Yeah, I think you're right. It's fine since r/w opens have a distinct
>>> slot, even though the refcounting just tracks the number of read and
>>> write references. So yeah, the leak probably is in an error path
>>> someplace, or maybe a race someplace.
>> 
>> So I noticed that set_access is always called locked, but clear_access is not,
>> this does not sound right.
>> 
>> So I placed this strategic WARN_ON:
>> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>>                         goto out_put_access;
>>                 spin_lock(&fp->fi_lock);
>>                 if (!fp->fi_fds[oflag]) {
>> +WARN_ON(!test_access(open->op_share_access, stp));
>>                         fp->fi_fds[oflag] = filp;
>>                         filp = NULL;
>>        
>> This is right in the place where nfsd set the access flag already, discovered
>> that the file is not opened and went on to open it, yet some parallel thread
>> came in and cleared the flag by the time we got the file opened.
>> It did trigger (but there are 30 minutes left till test finish, so I don't
>> know yet if this will correspond to the problem at hand yet, so below is speculation).
>> 
>> Now, at the exit from this function, the state will not have access for this file
>> set and the file would be leaked, since the matching call, probably in
>> release_all_access() does:
>>                 if (test_access(i, stp))
>>                         nfs4_file_put_access(stp->st_stid.sc_file, i);
>>                 clear_access(i, stp);
>> 
>> Meaning that the refcount is not dropped due to no access flag set.
>> 
>> I imagine we can just set the flag back right after setting fp->fi_fds[oflag]
>> and that particular race would be gone?
>> 
>> Would that be ok, or is it just wrong that somebody can come and drop
>> (delegation downgrade seems to be the most likely suspect) access bits like this
>> while they are in the middle of being acquired?
>> 
> 
> I think you're on the right track...
> 
> Operations that morph the stateid's seqid should be serialized via the
> st_rwsem. An OPEN_DOWNGRADE or CLOSE takes an exclusive lock on the
> rwsem, but we do allow OPENs to run in parallel with each other, under
> the assumption that on success you always get at least what you
> requested. So, I don't think that OPEN can really race with an
> OPEN_DOWNGRADE or CLOSE like you're thinking.
> 
> But...suppose we have two racing OPEN calls. They're both in
> nfs4_get_vfs_file. One opens the file and succeeds and the other fails
> and ends up in out_put_access. At that point, you could end up
> clobbering the successful update to st_access_bmap from the other task,
> and we'd end up not putting the file access references in
> release_all_access.

It looks like the race plays in reverse to what I thought.
My assertion at the point where fi_fds is set to non-NULL never fired, but I also had the
warning below, and that one DID fire, with 100% correlation to the later leak.

set_access(u32 access, struct nfs4_ol_stateid *stp)
{
        unsigned char mask = 1 << access;

        WARN_ON_ONCE(access > NFS4_SHARE_ACCESS_BOTH);
WARN_ON(stp->st_access_bmap & mask);
        stp->st_access_bmap |= mask;
}

So we are setting an access bit that is already set
(this comes from nfs4_get_vfs_file() via nfs4_upgrade_open() ), i.e. setting it twice;
then one of the callers drops theirs, and the later state does not drop its
file reference in release_all_access() because it no longer sees the access bit,
something like that.

Though I don't see why it cannot play like I originally envisioned too.

Hm…
So if this always hits when the set_access warning triggers, and there is no
error on either path, then there is no explicit call to nfs4_file_put_access(),
and that is a leak, since release_all_access() would only release the file once?

I wonder if, on the success exit path, we could instead check old_access_bmap, and
if our current access is also set there, release it?
That should be ok, right? If the other caller succeeded, they still keep their reference.
If the other caller failed, they drop theirs and restore their idea of the access bmap,
which presumably does not have the bit set, and drop their reference anyway.

The question is what happens if both callers fail, or only the other one fails.
If both fail and we win the race on restoring the old access, we might set the access wrongly,
because theirs was correct.
Similarly, if they fail and we do not, they can reset our access.

> A simple way to confirm that might be to convert all of the read locks
> on the st_rwsem to write locks. That will serialize all of the open
> operations and should prevent that particular race from occurring.
> 
> If that works, we'd probably want to fix it in a less heavy-handed way,
> but I'd have to think about how best to do that.

Indeed this sounds a bit too heavy-handed, but I'll give it a try just to make sure.

Thanks.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08 10:58               ` Jeff Layton
@ 2016-06-08 16:10                 ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-08 16:10 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:

> A simple way to confirm that might be to convert all of the read locks
> on the st_rwsem to write locks. That will serialize all of the open
> operations and should prevent that particular race from occurring.
> 
> If that works, we'd probably want to fix it in a less heavy-handed way,
> but I'd have to think about how best to do that.

So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:

after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);

If that returns true, just drop the spinlock and return -EAGAIN.

The call site in nfs4_upgrade_open() would handle that by re-testing the access map
and either coming back in or, more likely, reusing the now-updated stateid
(synchronised by fi_lock again).
We probably need to convert the whole access-map test there to run under
fi_lock.
Something like:
nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
{
        __be32 status;
        unsigned char old_deny_bmap = stp->st_deny_bmap;

again:
+        spin_lock(&fp->fi_lock);
        if (!test_access(open->op_share_access, stp)) {
+		spin_unlock(&fp->fi_lock);
+               status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
+		if (status == -EAGAIN)
+			goto again;
+		return status;
+	}

        /* test and set deny mode */
-        spin_lock(&fp->fi_lock);
        status = nfs4_file_check_deny(fp, open->op_share_deny);


The call in nfsd4_process_open2() cannot hit this condition, I think, right?
Should we add a WARN_ON there? A BUG_ON? Is there a more sensible approach?

Alternatively, we could probably always call nfs4_get_vfs_file() under this spinlock and
just have it drop the lock for the open and then reacquire it (which it already does); not as transparent, I guess.

Or fi_lock might be converted to, say, a mutex, so we can sleep with it held and
then hold it across the whole invocation of nfs4_get_vfs_file(), the access testing, and so on.


* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08 16:10                 ` Oleg Drokin
@ 2016-06-08 17:22                   ` Jeff Layton
  -1 siblings, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-08 17:22 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List

On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
> 
> > A simple way to confirm that might be to convert all of the read locks
> > on the st_rwsem to write locks. That will serialize all of the open
> > operations and should prevent that particular race from occurring.
> > 
> > If that works, we'd probably want to fix it in a less heavy-handed way,
> > but I'd have to think about how best to do that.
> 
> So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:
> 
> after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);
> 
> If that returns true - just drop the spinlock and return EAGAIN.
> 
> The callsite in nfs4_upgrade_open() would handle that by retesting the access map
> again and either coming back in or more likely reusing the now updated stateid
> (synchronised by the fi_lock again).
> We probably need to convert the whole access map testing there to be under
> fi_lock.
> Something like:
> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
> {
>         __be32 status;
>         unsigned char old_deny_bmap = stp->st_deny_bmap;
> 
> again:
> +        spin_lock(&fp->fi_lock);
>         if (!test_access(open->op_share_access, stp)) {
> +		spin_unlock(&fp->fi_lock);
> +               status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
> +		if (status == -EAGAIN)
> +			goto again;
> +		return status;
> +	}
> 
>         /* test and set deny mode */
> -        spin_lock(&fp->fi_lock);
>         status = nfs4_file_check_deny(fp, open->op_share_deny);
> 
> 
> The call in nfsd4_process_open2() I think cannot hit this condition, right?
> probably can add a WARN_ON there? BUG_ON? more sensible approach?
> 
> Alternatively we can probably always call nfs4_get_vfs_file() under this spinlock,
> just have it drop that for the open and then reobtain (already done), not as transparent I guess.
> 

Yeah, I think that might be best. It looks like things could change
after you drop the spinlock with the patch above. Since we have to
retake it anyway in nfs4_get_vfs_file, we can just do it there.

> Or the fi_lock might be converted to say a mutex, so we can sleep with it held and
> then we can hold it across whole invocation of nfs4_get_vfs_file() and access testing and stuff.

I think we'd be better off taking the st_rwsem for write (maybe just
turning it into a mutex). That would at least be per-stateid instead of
per-inode. That's a fine fix for now.

It might slightly slow down a client that is sending two stateid-morphing
operations on the same stateid in parallel, but operations on different stateids
shouldn't affect each other.
I'm liking that solution more and more here.
Longer term, I think we need to further simplify OPEN handling. It has
gotten better, but it's still really hard to follow currently (and is
obviously error-prone).

-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08 17:22                   ` Jeff Layton
@ 2016-06-08 17:37                     ` Oleg Drokin
  -1 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-08 17:37 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


On Jun 8, 2016, at 1:22 PM, Jeff Layton wrote:

> On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
>> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
>> 
>>> A simple way to confirm that might be to convert all of the read locks
>>> on the st_rwsem to write locks. That will serialize all of the open
>>> operations and should prevent that particular race from occurring.
>>> 
>>> If that works, we'd probably want to fix it in a less heavy-handed way,
>>> but I'd have to think about how best to do that.
>> 
>> So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:
>> 
>> after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);
>> 
>> If that returns true - just drop the spinlock and return EAGAIN.
>> 
>> The callsite in nfs4_upgrade_open() would handle that by retesting the access map
>> again and either coming back in or more likely reusing the now updated stateid
>> (synchronised by the fi_lock again).
>> We probably need to convert the whole access map testing there to be under
>> fi_lock.
>> Something like:
>> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
>> {
>>         __be32 status;
>>         unsigned char old_deny_bmap = stp->st_deny_bmap;
>> 
>> again:
>> +        spin_lock(&fp->fi_lock);
>>         if (!test_access(open->op_share_access, stp)) {
>> +		spin_unlock(&fp->fi_lock);
>> +               status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
>> +		if (status == -EAGAIN)
>> +			goto again;
>> +		return status;
>> +	}
>> 
>>         /* test and set deny mode */
>> -        spin_lock(&fp->fi_lock);
>>         status = nfs4_file_check_deny(fp, open->op_share_deny);
>> 
>> 
>> The call in nfsd4_process_open2() I think cannot hit this condition, right?
>> probably can add a WARN_ON there? BUG_ON? more sensible approach?
>> 
>> Alternatively we can probably always call nfs4_get_vfs_file() under this spinlock,
>> just have it drop that for the open and then reobtain (already done), not as transparent I guess.
>> 
> 
> Yeah, I think that might be best. It looks like things could change
> after you drop the spinlock with the patch above. Since we have to
> retake it anyway in nfs4_get_vfs_file, we can just do it there.
> 
>> Or the fi_lock might be converted to say a mutex, so we can sleep with it held and
>> then we can hold it across whole invocation of nfs4_get_vfs_file() and access testing and stuff.
> 
> I think we'd be better off taking the st_rwsem for write (maybe just
> turning it into a mutex). That would at least be per-stateid instead of
> per-inode. That's a fine fix for now.
> 
> It might slow down a client slightly that is sending two stateid
> morphing operations in parallel, but they shouldn't affect each other.
> I'm liking that solution more and more here.
> Longer term, I think we need to further simplify OPEN handling. It has
> gotten better, but it's still really hard to follow currently (and is
> obviously error-prone).

The conversion to always taking the lock exclusively is holding up nicely so far
(and no other WARNs have triggered yet).

I guess I'll do a patch converting it to a mutex, but also, separately, a patch that just
holds fi_lock longer; I'll test that other one and, if all is well, submit it too,
and let you choose which one you like the most ;)


* [PATCH] nfsd: Always lock state exclusively.
  2016-06-08 17:22                   ` Jeff Layton
@ 2016-06-09  2:55                   ` Oleg Drokin
  2016-06-09 10:13                     ` Jeff Layton
  -1 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-09  2:55 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

It used to be the case that the state had an rwsem that was locked for write
by downgrades, but for read by upgrades (opens). The problem is that
if there are two competing opens for the same state, they step on
each other's toes, potentially leaking file descriptors
from the state structure, since the access mode is a bitmap that is only set once.
This patch converts st_rwsem to st_mutex, so all users take it exclusively;
no more sharing.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
This holds up well in my testing.
I'll also try the other approach to see if it's any better.
 fs/nfsd/nfs4state.c | 40 ++++++++++++++++++++--------------------
 fs/nfsd/state.h     |  2 +-
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index f5f82e1..c927d36 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3502,7 +3502,7 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = 0;
 	stp->st_openstp = NULL;
-	init_rwsem(&stp->st_rwsem);
+	mutex_init(&stp->st_mutex);
 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
 	list_add(&stp->st_perfile, &fp->fi_stateids);
 
@@ -4335,10 +4335,10 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	 */
 	if (stp) {
 		/* Stateid was found, this is an OPEN upgrade */
-		down_read(&stp->st_rwsem);
+		mutex_lock(&stp->st_mutex);
 		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
 		if (status) {
-			up_read(&stp->st_rwsem);
+			mutex_unlock(&stp->st_mutex);
 			goto out;
 		}
 	} else {
@@ -4348,19 +4348,19 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 		if (swapstp) {
 			nfs4_put_stid(&stp->st_stid);
 			stp = swapstp;
-			down_read(&stp->st_rwsem);
+			mutex_lock(&stp->st_mutex);
 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
 						stp, open);
 			if (status) {
-				up_read(&stp->st_rwsem);
+				mutex_unlock(&stp->st_mutex);
 				goto out;
 			}
 			goto upgrade_out;
 		}
-		down_read(&stp->st_rwsem);
+		mutex_lock(&stp->st_mutex);
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
 		if (status) {
-			up_read(&stp->st_rwsem);
+			mutex_unlock(&stp->st_mutex);
 			release_open_stateid(stp);
 			goto out;
 		}
@@ -4372,7 +4372,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	}
 upgrade_out:
 	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
-	up_read(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 
 	if (nfsd4_has_session(&resp->cstate)) {
 		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
@@ -4977,12 +4977,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
 		 * revoked delegations are kept only for free_stateid.
 		 */
 		return nfserr_bad_stateid;
-	down_write(&stp->st_rwsem);
+	mutex_lock(&stp->st_mutex);
 	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
 	if (status == nfs_ok)
 		status = nfs4_check_fh(current_fh, &stp->st_stid);
 	if (status != nfs_ok)
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 	return status;
 }
 
@@ -5030,7 +5030,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
 		return status;
 	oo = openowner(stp->st_stateowner);
 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 		nfs4_put_stid(&stp->st_stid);
 		return nfserr_bad_stateid;
 	}
@@ -5062,12 +5062,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	oo = openowner(stp->st_stateowner);
 	status = nfserr_bad_stateid;
 	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 		goto put_stateid;
 	}
 	oo->oo_flags |= NFS4_OO_CONFIRMED;
 	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
 		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
 
@@ -5143,7 +5143,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
 	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
 	status = nfs_ok;
 put_stateid:
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
@@ -5196,7 +5196,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	if (status)
 		goto out; 
 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 
 	nfsd4_close_open_stateid(stp);
 
@@ -5422,7 +5422,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = open_stp->st_deny_bmap;
 	stp->st_openstp = open_stp;
-	init_rwsem(&stp->st_rwsem);
+	mutex_init(&stp->st_mutex);
 	list_add(&stp->st_locks, &open_stp->st_locks);
 	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
 	spin_lock(&fp->fi_lock);
@@ -5591,7 +5591,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 					&open_stp, nn);
 		if (status)
 			goto out;
-		up_write(&open_stp->st_rwsem);
+		mutex_unlock(&open_stp->st_mutex);
 		open_sop = openowner(open_stp->st_stateowner);
 		status = nfserr_bad_stateid;
 		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
@@ -5600,7 +5600,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		status = lookup_or_create_lock_state(cstate, open_stp, lock,
 							&lock_stp, &new);
 		if (status == nfs_ok)
-			down_write(&lock_stp->st_rwsem);
+			mutex_lock(&lock_stp->st_mutex);
 	} else {
 		status = nfs4_preprocess_seqid_op(cstate,
 				       lock->lk_old_lock_seqid,
@@ -5704,7 +5704,7 @@ out:
 		    seqid_mutating_err(ntohl(status)))
 			lock_sop->lo_owner.so_seqid++;
 
-		up_write(&lock_stp->st_rwsem);
+		mutex_unlock(&lock_stp->st_mutex);
 
 		/*
 		 * If this is a new, never-before-used stateid, and we are
@@ -5874,7 +5874,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 fput:
 	fput(filp);
 put_stateid:
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 986e51e..64053ea 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
 	unsigned char			st_access_bmap;
 	unsigned char			st_deny_bmap;
 	struct nfs4_ol_stateid		*st_openstp;
-	struct rw_semaphore		st_rwsem;
+	struct mutex			st_mutex;
 };
 
 static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Always lock state exclusively.
  2016-06-09  2:55                   ` [PATCH] nfsd: Always lock state exclusively Oleg Drokin
@ 2016-06-09 10:13                     ` Jeff Layton
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-09 10:13 UTC (permalink / raw)
  To: Oleg Drokin, J . Bruce Fields; +Cc: linux-nfs, linux-kernel

On Wed, 2016-06-08 at 22:55 -0400, Oleg Drokin wrote:
> It used to be the case that the state had an rwsem that was locked for write
> by downgrades, but for read by upgrades (opens). Well, the problem is that
> if there are two competing opens for the same state, they step on each
> other's toes, potentially leading to leaked file descriptors from the
> state structure, since the access mode is a bitmap that is only set once.
> This patch converts st_rwsem to st_mutex, and all users become exclusive;
> no sharing.
> 
> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> ---
> This holds up well in my testing.
> I'll also try the other approach to see if it's any better.
>  fs/nfsd/nfs4state.c | 40 ++++++++++++++++++++--------------------
>  fs/nfsd/state.h     |  2 +-
>  2 files changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f5f82e1..c927d36 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3502,7 +3502,7 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>  	stp->st_access_bmap = 0;
>  	stp->st_deny_bmap = 0;
>  	stp->st_openstp = NULL;
> -	init_rwsem(&stp->st_rwsem);
> +	mutex_init(&stp->st_mutex);
>  	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
>  	list_add(&stp->st_perfile, &fp->fi_stateids);
>  
> @@ -4335,10 +4335,10 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  	 */
>  	if (stp) {
>  		/* Stateid was found, this is an OPEN upgrade */
> -		down_read(&stp->st_rwsem);
> +		mutex_lock(&stp->st_mutex);
>  		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
>  		if (status) {
> -			up_read(&stp->st_rwsem);
> +			mutex_unlock(&stp->st_mutex);
>  			goto out;
>  		}
>  	} else {
> @@ -4348,19 +4348,19 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  		if (swapstp) {
>  			nfs4_put_stid(&stp->st_stid);
>  			stp = swapstp;
> -			down_read(&stp->st_rwsem);
> +			mutex_lock(&stp->st_mutex);
>  			status = nfs4_upgrade_open(rqstp, fp, current_fh,
>  						stp, open);
>  			if (status) {
> -				up_read(&stp->st_rwsem);
> +				mutex_unlock(&stp->st_mutex);
>  				goto out;
>  			}
>  			goto upgrade_out;
>  		}
> -		down_read(&stp->st_rwsem);
> +		mutex_lock(&stp->st_mutex);
>  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>  		if (status) {
> -			up_read(&stp->st_rwsem);
> +			mutex_unlock(&stp->st_mutex);
>  			release_open_stateid(stp);
>  			goto out;
>  		}
> @@ -4372,7 +4372,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  	}
>  upgrade_out:
>  	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
> -	up_read(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  
>  	if (nfsd4_has_session(&resp->cstate)) {
>  		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
> @@ -4977,12 +4977,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
>  		 * revoked delegations are kept only for free_stateid.
>  		 */
>  		return nfserr_bad_stateid;
> -	down_write(&stp->st_rwsem);
> +	mutex_lock(&stp->st_mutex);
>  	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
>  	if (status == nfs_ok)
>  		status = nfs4_check_fh(current_fh, &stp->st_stid);
>  	if (status != nfs_ok)
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  	return status;
>  }
>  
> @@ -5030,7 +5030,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
>  		return status;
>  	oo = openowner(stp->st_stateowner);
>  	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  		nfs4_put_stid(&stp->st_stid);
>  		return nfserr_bad_stateid;
>  	}
> @@ -5062,12 +5062,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	oo = openowner(stp->st_stateowner);
>  	status = nfserr_bad_stateid;
>  	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  		goto put_stateid;
>  	}
>  	oo->oo_flags |= NFS4_OO_CONFIRMED;
>  	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
>  		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
>  
> @@ -5143,7 +5143,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
>  	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
>  	status = nfs_ok;
>  put_stateid:
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	nfs4_put_stid(&stp->st_stid);
>  out:
>  	nfsd4_bump_seqid(cstate, status);
> @@ -5196,7 +5196,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	if (status)
>  		goto out; 
>  	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  
>  	nfsd4_close_open_stateid(stp);
>  
> @@ -5422,7 +5422,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
>  	stp->st_access_bmap = 0;
>  	stp->st_deny_bmap = open_stp->st_deny_bmap;
>  	stp->st_openstp = open_stp;
> -	init_rwsem(&stp->st_rwsem);
> +	mutex_init(&stp->st_mutex);
>  	list_add(&stp->st_locks, &open_stp->st_locks);
>  	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
>  	spin_lock(&fp->fi_lock);
> @@ -5591,7 +5591,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  					&open_stp, nn);
>  		if (status)
>  			goto out;
> -		up_write(&open_stp->st_rwsem);
> +		mutex_unlock(&open_stp->st_mutex);
>  		open_sop = openowner(open_stp->st_stateowner);
>  		status = nfserr_bad_stateid;
>  		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
> @@ -5600,7 +5600,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		status = lookup_or_create_lock_state(cstate, open_stp, lock,
>  							&lock_stp, &new);
>  		if (status == nfs_ok)
> -			down_write(&lock_stp->st_rwsem);
> +			mutex_lock(&lock_stp->st_mutex);
>  	} else {
>  		status = nfs4_preprocess_seqid_op(cstate,
>  				       lock->lk_old_lock_seqid,
> @@ -5704,7 +5704,7 @@ out:
>  		    seqid_mutating_err(ntohl(status)))
>  			lock_sop->lo_owner.so_seqid++;
>  
> -		up_write(&lock_stp->st_rwsem);
> +		mutex_unlock(&lock_stp->st_mutex);
>  
>  		/*
>  		 * If this is a new, never-before-used stateid, and we are
> @@ -5874,7 +5874,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  fput:
>  	fput(filp);
>  put_stateid:
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	nfs4_put_stid(&stp->st_stid);
>  out:
>  	nfsd4_bump_seqid(cstate, status);
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 986e51e..64053ea 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
>  	unsigned char			st_access_bmap;
>  	unsigned char			st_deny_bmap;
>  	struct nfs4_ol_stateid		*st_openstp;
> -	struct rw_semaphore		st_rwsem;
> +	struct mutex			st_mutex;
>  };
>  
>  static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)

Looks legit. Eventually we might want to turn this back into a rwsem by
fixing and clarifying the locking around the st_access_bmap (and the
deny bmap), but for now I think this is a reasonable fix for the
problem Oleg found.

There is the potential for a minor perf hit here when there are racing
OPEN calls for the same inode, but I think that's acceptable for now.
We may also want to go ahead and send this to stable as well.

Reviewed-by: Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
  2016-06-08 10:58               ` Jeff Layton
@ 2016-06-09 12:13                 ` Andrew W Elble
  0 siblings, 0 replies; 60+ messages in thread
From: Andrew W Elble @ 2016-06-09 12:13 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Oleg Drokin, J. Bruce Fields, linux-nfs,
	<linux-kernel@vger.kernel.org> Mailing List


> But...suppose we have two racing OPEN calls. They're both in
> nfs4_get_vfs_file. One opens the file and succeeds and the other fails
> and ends up in out_put_access. At that point, you could end up
> clobbering the successful update to st_access_bmap from the other task,
> and we'd end up not putting the file access references in
> release_all_access.


   Pondering further (with an assumption about the workload), isn't this
   path also likely to involve a lease break in progress on that file?

Thanks,

Andy

-- 
Andrew W. Elble
aweits@discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-08 17:22                   ` Jeff Layton
                                     ` (2 preceding siblings ...)
  (?)
@ 2016-06-09 21:01                   ` Oleg Drokin
  2016-06-10  4:18                     ` Oleg Drokin
  -1 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-09 21:01 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

Currently there's an unprotected access mode check in nfs4_upgrade_open
that then calls nfs4_get_vfs_file, which in turn assumes that whatever
access mode was present in the state is still valid, which is racy.
Two nfs4_get_vfs_file calls can enter the same path as a result and get two
references to the nfs4_file, but the later drop only happens once because
the access mode is denoted only by bits, so there is no refcounting.

Locking around the access mode testing is introduced to avoid this race.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---

This patch performs just as well as the st_rwsem -> mutex conversion,
but is a bit lighter-weight, I imagine.
For one, it seems to allow truncates in parallel if we ever want it.

 fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index f5f82e1..d4b9eba 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
 
 	spin_lock(&fp->fi_lock);
 
+	if (test_access(open->op_share_access, stp)) {
+		spin_unlock(&fp->fi_lock);
+		return nfserr_eagain;
+	}
+
 	/*
 	 * Are we trying to set a deny mode that would conflict with
 	 * current access?
@@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *c
 	__be32 status;
 	unsigned char old_deny_bmap = stp->st_deny_bmap;
 
-	if (!test_access(open->op_share_access, stp))
-		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
+again:
+	spin_lock(&fp->fi_lock);
+	if (!test_access(open->op_share_access, stp)) {
+		spin_unlock(&fp->fi_lock);
+		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
+		/*
+		 * Somebody won the race for access while we did not hold
+		 * the lock here
+		 */
+		if (status == nfserr_eagain)
+			goto again;
+		return status;
+	}
 
 	/* test and set deny mode */
-	spin_lock(&fp->fi_lock);
 	status = nfs4_file_check_deny(fp, open->op_share_deny);
 	if (status == nfs_ok) {
 		set_deny(open->op_share_deny, stp);
@@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
 		if (status) {
 			up_read(&stp->st_rwsem);
+			/*
+			 * EAGAIN is returned when there's a racing access,
+			 * this should never happen as we are the only user
+			 * of this new state, and since it's not yet hashed,
+			 * nobody can find it
+			 */
+			WARN_ON(status == nfserr_eagain);
 			release_open_stateid(stp);
 			goto out;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-09 21:01                   ` [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file Oleg Drokin
@ 2016-06-10  4:18                     ` Oleg Drokin
  2016-06-10 10:50                       ` Jeff Layton
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-10  4:18 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel


On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:

> Currently there's an unprotected access mode check in nfs4_upgrade_open
> that then calls nfs4_get_vfs_file which in turn assumes whatever
> access mode was present in the state is still valid which is racy.
> Two nfs4_get_vfs_file can enter the same path as a result and get two
> references to nfs4_file, but the later drop only happens once because
> access mode is only denoted by bits, so no refcounting.
> 
> The locking around access mode testing is introduced to avoid this race.
> 
> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> ---
> 
> This patch performs equally well to the st_rwsem -> mutex conversion,
> but is a bit lighter-weight I imagine.
> For one it seems to allow truncates in parallel if we ever want it.
> 
> fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
> 1 file changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f5f82e1..d4b9eba 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
> 
> 	spin_lock(&fp->fi_lock);
> 
> +	if (test_access(open->op_share_access, stp)) {
> +		spin_unlock(&fp->fi_lock);
> +		return nfserr_eagain;
> +	}
> +
> 	/*
> 	 * Are we trying to set a deny mode that would conflict with
> 	 * current access?
> @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *c
> 	__be32 status;
> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
> 
> -	if (!test_access(open->op_share_access, stp))
> -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
> +again:
> +	spin_lock(&fp->fi_lock);
> +	if (!test_access(open->op_share_access, stp)) {
> +		spin_unlock(&fp->fi_lock);
> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
> +		/*
> +		 * Somebody won the race for access while we did not hold
> +		 * the lock here
> +		 */
> +		if (status == nfserr_eagain)
> +			goto again;
> +		return status;
> +	}
> 
> 	/* test and set deny mode */
> -	spin_lock(&fp->fi_lock);
> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
> 	if (status == nfs_ok) {
> 		set_deny(open->op_share_deny, stp);
> @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> 		if (status) {
> 			up_read(&stp->st_rwsem);
> +			/*
> +			 * EAGAIN is returned when there's a racing access,
> +			 * this should never happen as we are the only user
> +			 * of this new state, and since it's not yet hashed,
> +			 * nobody can find it
> +			 */
> +			WARN_ON(status == nfserr_eagain);

Ok, some more testing shows that this CAN happen.
So this patch is inferior to the mutex one after all.

> 			release_open_stateid(stp);
> 			goto out;
> 		}
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-10  4:18                     ` Oleg Drokin
@ 2016-06-10 10:50                       ` Jeff Layton
  2016-06-10 20:55                         ` J . Bruce Fields
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff Layton @ 2016-06-10 10:50 UTC (permalink / raw)
  To: Oleg Drokin, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Andrew W Elble

On Fri, 2016-06-10 at 00:18 -0400, Oleg Drokin wrote:
> On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:
> 
> > Currently there's an unprotected access mode check in
> > nfs4_upgrade_open
> > that then calls nfs4_get_vfs_file which in turn assumes whatever
> > access mode was present in the state is still valid which is racy.
> > Two nfs4_get_vfs_file can enter the same path as a result and get two
> > references to nfs4_file, but the later drop only happens once
> > because
> > access mode is only denoted by bits, so no refcounting.
> > 
> > The locking around access mode testing is introduced to avoid this
> > race.
> > 
> > Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> > ---
> > 
> > This patch performs equally well to the st_rwsem -> mutex
> > conversion,
> > but is a bit lighter-weight I imagine.
> > For one it seems to allow truncates in parallel if we ever want it.
> > 
> > fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
> > 1 file changed, 25 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index f5f82e1..d4b9eba 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct
> > svc_rqst *rqstp, struct nfs4_file *fp,
> > 
> > 	spin_lock(&fp->fi_lock);
> > 
> > +	if (test_access(open->op_share_access, stp)) {
> > +		spin_unlock(&fp->fi_lock);
> > +		return nfserr_eagain;
> > +	}
> > +
> > 	/*
> > 	 * Are we trying to set a deny mode that would conflict with
> > 	 * current access?
> > @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp,
> > struct nfs4_file *fp, struct svc_fh *c
> > 	__be32 status;
> > 	unsigned char old_deny_bmap = stp->st_deny_bmap;
> > 
> > -	if (!test_access(open->op_share_access, stp))
> > -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > open);
> > +again:
> > +	spin_lock(&fp->fi_lock);
> > +	if (!test_access(open->op_share_access, stp)) {
> > +		spin_unlock(&fp->fi_lock);
> > +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > open);
> > +		/*
> > +		 * Somebody won the race for access while we did
> > not hold
> > +		 * the lock here
> > +		 */
> > +		if (status == nfserr_eagain)
> > +			goto again;
> > +		return status;
> > +	}
> > 
> > 	/* test and set deny mode */
> > -	spin_lock(&fp->fi_lock);
> > 	status = nfs4_file_check_deny(fp, open->op_share_deny);
> > 	if (status == nfs_ok) {
> > 		set_deny(open->op_share_deny, stp);
> > @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp,
> > struct svc_fh *current_fh, struct nf
> > 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp,
> > open);
> > 		if (status) {
> > 			up_read(&stp->st_rwsem);
> > +			/*
> > +			 * EAGAIN is returned when there's a
> > racing access,
> > +			 * this should never happen as we are the
> > only user
> > +			 * of this new state, and since it's not
> > yet hashed,
> > +			 * nobody can find it
> > +			 */
> > +			WARN_ON(status == nfserr_eagain);
> 
> Ok, some more testing shows that this CAN happen.
> So this patch is inferior to the mutex one after all.
> 

Yeah, that can happen for all sorts of reasons. As Andrew pointed out,
you can get this when there is a lease break in progress, and that may
be occurring for a completely different stateid (or because of samba,
etc...)

It may be possible to do something like this, but we'd need to audit
all of the handling of st_access_bmap (and the deny bmap) to ensure
that we get it right.

For now, I think just turning that rwsem into a mutex is the best
solution. That is a per-stateid mutex so any contention is going to be
due to the client sending racing OPEN calls for the same inode anyway.
Allowing those to run in parallel again could be useful in some cases,
but most use-cases won't be harmed by that serialization.

> > 			release_open_stateid(stp);
> > 			goto out;
> > 		}
> > -- 
> > 2.7.4
> 
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-10 10:50                       ` Jeff Layton
@ 2016-06-10 20:55                         ` J . Bruce Fields
  2016-06-11 15:41                           ` Oleg Drokin
  0 siblings, 1 reply; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-10 20:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Oleg Drokin, linux-nfs, linux-kernel, Andrew W Elble

On Fri, Jun 10, 2016 at 06:50:33AM -0400, Jeff Layton wrote:
> On Fri, 2016-06-10 at 00:18 -0400, Oleg Drokin wrote:
> > On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:
> > 
> > > Currently there's an unprotected access mode check in
> > > nfs4_upgrade_open
> > > that then calls nfs4_get_vfs_file which in turn assumes whatever
> > > access mode was present in the state is still valid which is racy.
> > > Two nfs4_get_vfs_file can enter the same path as a result and get two
> > > references to nfs4_file, but the later drop only happens once
> > > because
> > > access mode is only denoted by bits, so no refcounting.
> > > 
> > > The locking around access mode testing is introduced to avoid this
> > > race.
> > > 
> > > Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> > > ---
> > > 
> > > This patch performs equally well to the st_rwsem -> mutex
> > > conversion,
> > > but is a bit lighter-weight I imagine.
> > > For one it seems to allow truncates in parallel if we ever want it.
> > > 
> > > fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
> > > 1 file changed, 25 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index f5f82e1..d4b9eba 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct
> > > svc_rqst *rqstp, struct nfs4_file *fp,
> > > 
> > > 	spin_lock(&fp->fi_lock);
> > > 
> > > +	if (test_access(open->op_share_access, stp)) {
> > > +		spin_unlock(&fp->fi_lock);
> > > +		return nfserr_eagain;
> > > +	}
> > > +
> > > 	/*
> > > 	 * Are we trying to set a deny mode that would conflict with
> > > 	 * current access?
> > > @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp,
> > > struct nfs4_file *fp, struct svc_fh *c
> > > 	__be32 status;
> > > 	unsigned char old_deny_bmap = stp->st_deny_bmap;
> > > 
> > > -	if (!test_access(open->op_share_access, stp))
> > > -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > > open);
> > > +again:
> > > +	spin_lock(&fp->fi_lock);
> > > +	if (!test_access(open->op_share_access, stp)) {
> > > +		spin_unlock(&fp->fi_lock);
> > > +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > > open);
> > > +		/*
> > > +		 * Somebody won the race for access while we did
> > > not hold
> > > +		 * the lock here
> > > +		 */
> > > +		if (status == nfserr_eagain)
> > > +			goto again;
> > > +		return status;
> > > +	}
> > > 
> > > 	/* test and set deny mode */
> > > -	spin_lock(&fp->fi_lock);
> > > 	status = nfs4_file_check_deny(fp, open->op_share_deny);
> > > 	if (status == nfs_ok) {
> > > 		set_deny(open->op_share_deny, stp);
> > > @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp,
> > > struct svc_fh *current_fh, struct nf
> > > 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp,
> > > open);
> > > 		if (status) {
> > > 			up_read(&stp->st_rwsem);
> > > +			/*
> > > +			 * EAGAIN is returned when there's a
> > > racing access,
> > > +			 * this should never happen as we are the
> > > only user
> > > +			 * of this new state, and since it's not
> > > yet hashed,
> > > +			 * nobody can find it
> > > +			 */
> > > +			WARN_ON(status == nfserr_eagain);
> > 
> > Ok, some more testing shows that this CAN happen.
> > So this patch is inferior to the mutex one after all.
> > 
> 
> Yeah, that can happen for all sorts of reasons. As Andrew pointed out,
> you can get this when there is a lease break in progress, and that may
> be occurring for a completely different stateid (or because of samba,
> etc...)
> 
> It may be possible to do something like this, but we'd need to audit
> all of the handling of st_access_bmap (and the deny bmap) to ensure
> that we get it right.
> 
> For now, I think just turning that rwsem into a mutex is the best
> solution. That is a per-stateid mutex so any contention is going to be
> due to the client sending racing OPEN calls for the same inode anyway.
> Allowing those to run in parallel again could be useful in some cases,
> but most use-cases won't be harmed by that serialization.

OK, so for now my plan is to take "nfsd: Always lock state exclusively"
for 4.7.  Thanks to both of you for your work on this....

--b.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-10 20:55                         ` J . Bruce Fields
@ 2016-06-11 15:41                           ` Oleg Drokin
  2016-06-12  1:33                             ` Jeff Layton
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-11 15:41 UTC (permalink / raw)
  To: J . Bruce Fields; +Cc: Jeff Layton, linux-nfs, linux-kernel, Andrew W Elble


On Jun 10, 2016, at 4:55 PM, J . Bruce Fields wrote:

> On Fri, Jun 10, 2016 at 06:50:33AM -0400, Jeff Layton wrote:
>> On Fri, 2016-06-10 at 00:18 -0400, Oleg Drokin wrote:
>>> On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:
>>> 
>>>> Currently there's an unprotected access mode check in
>>>> nfs4_upgrade_open
>>>> that then calls nfs4_get_vfs_file which in turn assumes whatever
>>>> access mode was present in the state is still valid which is racy.
>>>> Two nfs4_get_vfs_file can enter the same path as a result and get two
>>>> references to nfs4_file, but the later drop only happens once
>>>> because
>>>> access mode is only denoted by bits, so no refcounting.
>>>> 
>>>> The locking around access mode testing is introduced to avoid this
>>>> race.
>>>> 
>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>> ---
>>>> 
>>>> This patch performs equally well to the st_rwsem -> mutex
>>>> conversion,
>>>> but is a bit lighter-weight I imagine.
>>>> For one it seems to allow truncates in parallel if we ever want it.
>>>> 
>>>> fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
>>>> 1 file changed, 25 insertions(+), 3 deletions(-)
>>>> 
>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>> index f5f82e1..d4b9eba 100644
>>>> --- a/fs/nfsd/nfs4state.c
>>>> +++ b/fs/nfsd/nfs4state.c
>>>> @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct
>>>> svc_rqst *rqstp, struct nfs4_file *fp,
>>>> 
>>>> 	spin_lock(&fp->fi_lock);
>>>> 
>>>> +	if (test_access(open->op_share_access, stp)) {
>>>> +		spin_unlock(&fp->fi_lock);
>>>> +		return nfserr_eagain;
>>>> +	}
>>>> +
>>>> 	/*
>>>> 	 * Are we trying to set a deny mode that would conflict with
>>>> 	 * current access?
>>>> @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp,
>>>> struct nfs4_file *fp, struct svc_fh *c
>>>> 	__be32 status;
>>>> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
>>>> 
>>>> -	if (!test_access(open->op_share_access, stp))
>>>> -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
>>>> open);
>>>> +again:
>>>> +	spin_lock(&fp->fi_lock);
>>>> +	if (!test_access(open->op_share_access, stp)) {
>>>> +		spin_unlock(&fp->fi_lock);
>>>> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
>>>> open);
>>>> +		/*
>>>> +		 * Somebody won the race for access while we did
>>>> not hold
>>>> +		 * the lock here
>>>> +		 */
>>>> +		if (status == nfserr_eagain)
>>>> +			goto again;
>>>> +		return status;
>>>> +	}
>>>> 
>>>> 	/* test and set deny mode */
>>>> -	spin_lock(&fp->fi_lock);
>>>> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
>>>> 	if (status == nfs_ok) {
>>>> 		set_deny(open->op_share_deny, stp);
>>>> @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp,
>>>> struct svc_fh *current_fh, struct nf
>>>> 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp,
>>>> open);
>>>> 		if (status) {
>>>> 			up_read(&stp->st_rwsem);
>>>> +			/*
>>>> +			 * EAGAIN is returned when there's a
>>>> racing access,
>>>> +			 * this should never happen as we are the
>>>> only user
>>>> +			 * of this new state, and since it's not
>>>> yet hashed,
>>>> +			 * nobody can find it
>>>> +			 */
>>>> +			WARN_ON(status == nfserr_eagain);
>>> 
>>> Ok, some more testing shows that this CAN happen.
>>> So this patch is inferior to the mutex one after all.
>>> 
>> 
>> Yeah, that can happen for all sorts of reasons. As Andrew pointed out,
>> you can get this when there is a lease break in progress, and that may
>> be occurring for a completely different stateid (or because of samba,
>> etc...)
>> 
>> It may be possible to do something like this, but we'd need to audit
>> all of the handling of st_access_bmap (and the deny bmap) to ensure
>> that we get it right.
>> 
>> For now, I think just turning that rwsem into a mutex is the best
>> solution. That is a per-stateid mutex so any contention is going to be
>> due to the client sending racing OPEN calls for the same inode anyway.
>> Allowing those to run in parallel again could be useful in some cases,
>> but most use-cases won't be harmed by that serialization.
> 
> OK, so for now my plan is to take "nfsd: Always lock state exclusively"
> for 4.7.  Thanks to both of you for your work on this….


FYI, I just hit this again with the "Always lock state exclusively" patch too.
I hate when that happens. But it is much harder to hit now.

The trace is also in the nfs4_get_vfs_file() that's called directly from
nfsd4_process_open2().

Otherwise the symptoms are pretty much the same - first I get the warning in set_access
that the flag is already set, then the nfsd4_free_file_rcu() one, and then the
unmount of the underlying fs fails.

What's strange is that I am not sure what else can set the flag.
Basically set_access is called from nfs4_get_vfs_file() - under the mutex via
nfsd4_process_open2() directly or via nfs4_upgrade_open()...,
or from get_lock_access() - without the mutex, but my workload does not do any
file locking, so that path should not really be getting hit, right?


Ah! I think I see it.
This patch has the same problem as the spinlock-moving one:
When we call nfs4_get_vfs_file() directly from nfsd4_process_open2(), there's no
check for anything, so suppose we take this direct calling path: there's a
mutex_lock() just before the call, but what's to stop another thread from finding
this stateid meanwhile, being first with the mutex, and then also setting the
access mode that our first thread no longer sets?
This makes it so that in fact we can never skip the access mode testing unless
testing and setting are done atomically under the same lock, which they are not now.
The patch that extended the coverage of the fi_lock got that right; I did not hit
any leaks there, just the "unhandled EAGAIN" WARN_ON, which is wrong in its own right.
The surprising part is that a state that has not yet been through
find_or_hash_clnt_odstate() could still be found by somebody else? Is it really
supposed to work like that?

So if we are to check the access mode at all times in nfs4_get_vfs_file()
and return EAGAIN, should we just check for that (all under the same
lock if we go with the mutex patch) and call into nfs4_upgrade_open() again in
that case?
Or I guess it's even better if we resurrect the fi_lock coverage extension patch and
do it there, as that would mean at least the check and set are atomic wrt
locking?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-11 15:41                           ` Oleg Drokin
@ 2016-06-12  1:33                             ` Jeff Layton
  2016-06-12  2:06                               ` Oleg Drokin
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff Layton @ 2016-06-12  1:33 UTC (permalink / raw)
  To: Oleg Drokin, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Andrew W Elble

On Sat, 2016-06-11 at 11:41 -0400, Oleg Drokin wrote:
> On Jun 10, 2016, at 4:55 PM, J . Bruce Fields wrote:
> 
> > On Fri, Jun 10, 2016 at 06:50:33AM -0400, Jeff Layton wrote:
> > > On Fri, 2016-06-10 at 00:18 -0400, Oleg Drokin wrote:
> > > > On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:
> > > > 
> > > > > Currently there's an unprotected access mode check in
> > > > > nfs4_upgrade_open
> > > > > that then calls nfs4_get_vfs_file which in turn assumes whatever
> > > > > access mode was present in the state is still valid which is racy.
> > > > > Two nfs4_get_vfs_file calls can enter the same path as a result and get two
> > > > > references to nfs4_file, but the later drop would only happen once
> > > > > because
> > > > > access mode is only denoted by bits, so no refcounting.
> > > > > 
> > > > > The locking around access mode testing is introduced to avoid this
> > > > > race.
> > > > > 
> > > > > Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> > > > > ---
> > > > > 
> > > > > This patch performs equally well to the st_rwsem -> mutex
> > > > > conversion,
> > > > > but is a bit lighter-weight I imagine.
> > > > > For one it seems to allow truncates in parallel if we ever want it.
> > > > > 
> > > > > fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
> > > > > 1 file changed, 25 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > index f5f82e1..d4b9eba 100644
> > > > > --- a/fs/nfsd/nfs4state.c
> > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct
> > > > > svc_rqst *rqstp, struct nfs4_file *fp,
> > > > > 
> > > > > 	spin_lock(&fp->fi_lock);
> > > > > 
> > > > > +	if (test_access(open->op_share_access, stp)) {
> > > > > +		spin_unlock(&fp->fi_lock);
> > > > > +		return nfserr_eagain;
> > > > > +	}
> > > > > +
> > > > > 	/*
> > > > > 	 * Are we trying to set a deny mode that would conflict with
> > > > > 	 * current access?
> > > > > @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp,
> > > > > struct nfs4_file *fp, struct svc_fh *c
> > > > > 	__be32 status;
> > > > > 	unsigned char old_deny_bmap = stp->st_deny_bmap;
> > > > > 
> > > > > -	if (!test_access(open->op_share_access, stp))
> > > > > -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > > > > open);
> > > > > +again:
> > > > > +	spin_lock(&fp->fi_lock);
> > > > > +	if (!test_access(open->op_share_access, stp)) {
> > > > > +		spin_unlock(&fp->fi_lock);
> > > > > +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > > > > open);
> > > > > +		/*
> > > > > +		 * Somebody won the race for access while we did
> > > > > not hold
> > > > > +		 * the lock here
> > > > > +		 */
> > > > > +		if (status == nfserr_eagain)
> > > > > +			goto again;
> > > > > +		return status;
> > > > > +	}
> > > > > 
> > > > > 	/* test and set deny mode */
> > > > > -	spin_lock(&fp->fi_lock);
> > > > > 	status = nfs4_file_check_deny(fp, open->op_share_deny);
> > > > > 	if (status == nfs_ok) {
> > > > > 		set_deny(open->op_share_deny, stp);
> > > > > @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp,
> > > > > struct svc_fh *current_fh, struct nf
> > > > > 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp,
> > > > > open);
> > > > > 		if (status) {
> > > > > 			up_read(&stp->st_rwsem);
> > > > > +			/*
> > > > > +			 * EAGAIN is returned when there's a
> > > > > racing access,
> > > > > +			 * this should never happen as we are the
> > > > > only user
> > > > > +			 * of this new state, and since it's not
> > > > > yet hashed,
> > > > > +			 * nobody can find it
> > > > > +			 */
> > > > > +			WARN_ON(status == nfserr_eagain);
> > > > 
> > > > Ok, some more testing shows that this CAN happen.
> > > > So this patch is inferior to the mutex one after all.
> > > > 
> > > 
> > > Yeah, that can happen for all sorts of reasons. As Andrew pointed out,
> > > you can get this when there is a lease break in progress, and that may
> > > be occurring for a completely different stateid (or because of samba,
> > > etc...)
> > > 
> > > It may be possible to do something like this, but we'd need to audit
> > > all of the handling of st_access_bmap (and the deny bmap) to ensure
> > > that we get it right.
> > > 
> > > For now, I think just turning that rwsem into a mutex is the best
> > > solution. That is a per-stateid mutex so any contention is going to be
> > > due to the client sending racing OPEN calls for the same inode anyway.
> > > Allowing those to run in parallel again could be useful in some cases,
> > > but most use-cases won't be harmed by that serialization.
> > 
> > OK, so for now my plan is to take "nfsd: Always lock state exclusively"
> > for 4.7.  Thanks to both of you for your work on this….
> 
> 
> FYI, I just hit this again with the "Always lock state exclusively" patch too.
> I hate when that happens. But it is much harder to hit now.
> 
> The trace is also in the nfs4_get_vfs_file() that's called directly from
> nfsd4_process_open2().
> 
> Otherwise the symptoms are pretty much the same - first I get the warning in set_access
> that the flag is already set and then the nfsd4_free_file_rcu() one and then
> unmount of underlying fs fails.
> 
> What's strange is I am not sure what else can set the flag.
> Basically set_access is called from nfs4_get_vfs_file() - under the mutex via
> nfsd4_process_open2() directly or via nfs4_upgrade_open()...,
> or from get_lock_access() - without the mutex, but my workload does not do any
> file locking, so it should not really be hitting, right?
> 
> 
> Ah! I think I see it.
> This patch has the same problem as the spinlock moving one:
> When we call nfs4_get_vfs_file() directly from nfsd4_process_open2(), there's no
> check for anything, so suppose we take this direct calling path, there's a
> mutex_lock() just before the call, but what's to stop another thread from finding
> this stateid meanwhile and being first with the mutex and then also setting the
> access mode that our first thread no longer sets?
> This makes it so that in fact we can never skip the access mode testing unless
> testing and setting is atomically done under the same lock which it is not now.
> The patch that extended the coverage of the fi_lock got that right, I did not hit
> any leaks there, just the "unhandled EAGAIN" WARN_ON, which is wrong in its own right.
> The surprising part is that the state that's not yet been through find_or_hash_clnt_odstate could still be found by somebody else? Is it really
> supposed to work like that? 
> 
> So if we are to check the access mode at all times in nfs4_get_vfs_file(),
> and return eagain, should we just check for that (all under the same
> lock if we go with the mutex patch) and call into nfs4_upgrade_open in that
> case again?
> Or I guess it's even better if we resurrect the fi_lock coverage extension patch and
> do it there, as that would mean at least the check and set are atomic wrt
> locking?

Good catch. Could we fix this by locking the mutex before hashing the
new stateid (and having init_open_stateid return with it locked if it
finds an existing one?).
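
Roughly, the idea might look like this (pseudocode only against the 4.7-era structures; the helper names and exact locking details are made up for illustration, not an actual patch):

```c
/* Pseudocode sketch of "hash the new stateid with its mutex held" */
static struct nfs4_ol_stateid *
init_open_stateid(struct nfs4_ol_stateid *new, struct nfs4_file *fp,
		  struct nfsd4_open *open)
{
	struct nfs4_ol_stateid *stp;

	spin_lock(&fp->fi_lock);
	stp = find_existing_open(fp, open);	/* hypothetical helper */
	if (stp) {
		spin_unlock(&fp->fi_lock);
		mutex_lock(&stp->st_mutex);	/* safe: fi_lock dropped */
		return stp;			/* returned locked */
	}
	mutex_lock(&new->st_mutex);	/* cannot block: not yet visible */
	hash_open_stateid(new, fp);	/* published already locked */
	spin_unlock(&fp->fi_lock);
	return new;
}
```

That way any other thread that finds the new stateid blocks on its mutex until the first opener has finished setting the access mode.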

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-12  1:33                             ` Jeff Layton
@ 2016-06-12  2:06                               ` Oleg Drokin
  2016-06-12  2:50                                 ` Jeff Layton
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-12  2:06 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J . Bruce Fields, linux-nfs, linux-kernel, Andrew W Elble


On Jun 11, 2016, at 9:33 PM, Jeff Layton wrote:

> On Sat, 2016-06-11 at 11:41 -0400, Oleg Drokin wrote:
>> On Jun 10, 2016, at 4:55 PM, J . Bruce Fields wrote:
>> 
>>> On Fri, Jun 10, 2016 at 06:50:33AM -0400, Jeff Layton wrote:
>>>> On Fri, 2016-06-10 at 00:18 -0400, Oleg Drokin wrote:
>>>>> On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:
>>>>> 
>>>>>> Currently there's an unprotected access mode check in
>>>>>> nfs4_upgrade_open
>>>>>> that then calls nfs4_get_vfs_file which in turn assumes whatever
>>>>>> access mode was present in the state is still valid which is racy.
>>>>>> Two nfs4_get_vfs_file calls can enter the same path as a result and get two
>>>>>> references to nfs4_file, but the later drop would only happen once
>>>>>> because
>>>>>> access mode is only denoted by bits, so no refcounting.
>>>>>> 
>>>>>> The locking around access mode testing is introduced to avoid this
>>>>>> race.
>>>>>> 
>>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>>>> ---
>>>>>> 
>>>>>> This patch performs equally well to the st_rwsem -> mutex
>>>>>> conversion,
>>>>>> but is a bit lighter-weight I imagine.
>>>>>> For one it seems to allow truncates in parallel if we ever want it.
>>>>>> 
>>>>>> fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
>>>>>> 1 file changed, 25 insertions(+), 3 deletions(-)
>>>>>> 
>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>> index f5f82e1..d4b9eba 100644
>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>> @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct
>>>>>> svc_rqst *rqstp, struct nfs4_file *fp,
>>>>>> 
>>>>>> 	spin_lock(&fp->fi_lock);
>>>>>> 
>>>>>> +	if (test_access(open->op_share_access, stp)) {
>>>>>> +		spin_unlock(&fp->fi_lock);
>>>>>> +		return nfserr_eagain;
>>>>>> +	}
>>>>>> +
>>>>>> 	/*
>>>>>> 	 * Are we trying to set a deny mode that would conflict with
>>>>>> 	 * current access?
>>>>>> @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp,
>>>>>> struct nfs4_file *fp, struct svc_fh *c
>>>>>> 	__be32 status;
>>>>>> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
>>>>>> 
>>>>>> -	if (!test_access(open->op_share_access, stp))
>>>>>> -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
>>>>>> open);
>>>>>> +again:
>>>>>> +	spin_lock(&fp->fi_lock);
>>>>>> +	if (!test_access(open->op_share_access, stp)) {
>>>>>> +		spin_unlock(&fp->fi_lock);
>>>>>> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
>>>>>> open);
>>>>>> +		/*
>>>>>> +		 * Somebody won the race for access while we did
>>>>>> not hold
>>>>>> +		 * the lock here
>>>>>> +		 */
>>>>>> +		if (status == nfserr_eagain)
>>>>>> +			goto again;
>>>>>> +		return status;
>>>>>> +	}
>>>>>> 
>>>>>> 	/* test and set deny mode */
>>>>>> -	spin_lock(&fp->fi_lock);
>>>>>> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
>>>>>> 	if (status == nfs_ok) {
>>>>>> 		set_deny(open->op_share_deny, stp);
>>>>>> @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp,
>>>>>> struct svc_fh *current_fh, struct nf
>>>>>> 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp,
>>>>>> open);
>>>>>> 		if (status) {
>>>>>> 			up_read(&stp->st_rwsem);
>>>>>> +			/*
>>>>>> +			 * EAGAIN is returned when there's a
>>>>>> racing access,
>>>>>> +			 * this should never happen as we are the
>>>>>> only user
>>>>>> +			 * of this new state, and since it's not
>>>>>> yet hashed,
>>>>>> +			 * nobody can find it
>>>>>> +			 */
>>>>>> +			WARN_ON(status == nfserr_eagain);
>>>>> 
>>>>> Ok, some more testing shows that this CAN happen.
>>>>> So this patch is inferior to the mutex one after all.
>>>>> 
>>>> 
>>>> Yeah, that can happen for all sorts of reasons. As Andrew pointed out,
>>>> you can get this when there is a lease break in progress, and that may
>>>> be occurring for a completely different stateid (or because of samba,
>>>> etc...)
>>>> 
>>>> It may be possible to do something like this, but we'd need to audit
>>>> all of the handling of st_access_bmap (and the deny bmap) to ensure
>>>> that we get it right.
>>>> 
>>>> For now, I think just turning that rwsem into a mutex is the best
>>>> solution. That is a per-stateid mutex so any contention is going to be
>>>> due to the client sending racing OPEN calls for the same inode anyway.
>>>> Allowing those to run in parallel again could be useful in some cases,
>>>> but most use-cases won't be harmed by that serialization.
>>> 
>>> OK, so for now my plan is to take "nfsd: Always lock state exclusively"
>>> for 4.7.  Thanks to both of you for your work on this….
>> 
>> 
>> FYI, I just hit this again with the "Always lock state exclusively" patch too.
>> I hate when that happens. But it is much harder to hit now.
>> 
>> The trace is also in the nfs4_get_vfs_file() that's called directly from
>> nfsd4_process_open2().
>> 
>> Otherwise the symptoms are pretty much the same - first I get the warning in set_access
>> that the flag is already set and then the nfsd4_free_file_rcu() one and then
>> unmount of underlying fs fails.
>> 
>> What's strange is I am not sure what else can set the flag.
>> Basically set_access is called from nfs4_get_vfs_file() - under the mutex via
>> nfsd4_process_open2() directly or via nfs4_upgrade_open()...,
>> or from get_lock_access() - without the mutex, but my workload does not do any
>> file locking, so it should not really be hitting, right?
>> 
>> 
>> Ah! I think I see it.
>> This patch has the same problem as the spinlock moving one:
>> When we call nfs4_get_vfs_file() directly from nfsd4_process_open2(), there's no
>> check for anything, so suppose we take this direct calling path, there's a
>> mutex_lock() just before the call, but what's to stop another thread from finding
>> this stateid meanwhile and being first with the mutex and then also setting the
>> access mode that our first thread no longer sets?
>> This makes it so that in fact we can never skip the access mode testing unless
>> testing and setting is atomically done under the same lock which it is not now.
>> The patch that extended the coverage of the fi_lock got that right, I did not hit
>> any leaks there, just the "unhandled EAGAIN" WARN_ON, which is wrong in its own right.
>> The surprising part is that the state that's not yet been through find_or_hash_clnt_odstate could still be found by somebody else? Is it really
>> supposed to work like that? 
>> 
>> So if we are to check the access mode at all times in nfs4_get_vfs_file(),
>> and return eagain, should we just check for that (all under the same
>> lock if we go with the mutex patch) and call into nfs4_upgrade_open in that
>> case again?
>> Or I guess it's even better if we resurrect the fi_lock coverage extension patch and
>> do it there, as that would mean at least the check and set are atomic wrt
>> locking?
> 
> Good catch. Could we fix this by locking the mutex before hashing the
> new stateid (and having init_open_stateid return with it locked if it
> finds an existing one?).

Hm. I am trying to lock the newly initialized one and that seems to be holding up
well (but I want 24 hours just to be extra sure).
Hm, I just noticed a bug in this, so that'll reset the clock back.

But I think we cannot return with a locked one if we found an existing one, due to
lock inversion?
I see that normally first we lock the state rwsem (now mutex) and then
lock the fi_lock.
Now if we make init_open_stateid() lock the new state mutex while the fi_lock
is held - that's probably ok, because we can do it before adding it to the list,
so nobody can find it.
But the existing state that we find we cannot really lock while holding the fi_lock,
because what if there's a parallel thread that already holds the mutex and now
wants the fi_lock?
And so it's probably best to return with the existing state unlocked and let the caller lock it?
Or do you think it's best to separately lock the found stp outside of spinlock
just for consistency?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-12  2:06                               ` Oleg Drokin
@ 2016-06-12  2:50                                 ` Jeff Layton
  2016-06-12  3:15                                   ` Oleg Drokin
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff Layton @ 2016-06-12  2:50 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: J . Bruce Fields, linux-nfs, linux-kernel, Andrew W Elble

On Sat, 2016-06-11 at 22:06 -0400, Oleg Drokin wrote:
> On Jun 11, 2016, at 9:33 PM, Jeff Layton wrote:
> 
> > On Sat, 2016-06-11 at 11:41 -0400, Oleg Drokin wrote:
> > > On Jun 10, 2016, at 4:55 PM, J . Bruce Fields wrote:
> > > 
> > > > On Fri, Jun 10, 2016 at 06:50:33AM -0400, Jeff Layton wrote:
> > > > > On Fri, 2016-06-10 at 00:18 -0400, Oleg Drokin wrote:
> > > > > > On Jun 9, 2016, at 5:01 PM, Oleg Drokin wrote:
> > > > > > 
> > > > > > > Currently there's an unprotected access mode check in
> > > > > > > nfs4_upgrade_open
> > > > > > > that then calls nfs4_get_vfs_file which in turn assumes whatever
> > > > > > > access mode was present in the state is still valid which is racy.
> > > > > > > Two nfs4_get_vfs_file calls can enter the same path as a result and get two
> > > > > > > references to nfs4_file, but the later drop would only happen once
> > > > > > > because
> > > > > > > access mode is only denoted by bits, so no refcounting.
> > > > > > > 
> > > > > > > The locking around access mode testing is introduced to avoid this
> > > > > > > race.
> > > > > > > 
> > > > > > > Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> > > > > > > ---
> > > > > > > 
> > > > > > > This patch performs equally well to the st_rwsem -> mutex
> > > > > > > conversion,
> > > > > > > but is a bit lighter-weight I imagine.
> > > > > > > For one it seems to allow truncates in parallel if we ever want it.
> > > > > > > 
> > > > > > > fs/nfsd/nfs4state.c | 28 +++++++++++++++++++++++++---
> > > > > > > 1 file changed, 25 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > index f5f82e1..d4b9eba 100644
> > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > @@ -3958,6 +3958,11 @@ static __be32 nfs4_get_vfs_file(struct
> > > > > > > svc_rqst *rqstp, struct nfs4_file *fp,
> > > > > > > 
> > > > > > > 	spin_lock(&fp->fi_lock);
> > > > > > > 
> > > > > > > +	if (test_access(open->op_share_access, stp)) {
> > > > > > > +		spin_unlock(&fp->fi_lock);
> > > > > > > +		return nfserr_eagain;
> > > > > > > +	}
> > > > > > > +
> > > > > > > 	/*
> > > > > > > 	 * Are we trying to set a deny mode that would conflict with
> > > > > > > 	 * current access?
> > > > > > > @@ -4017,11 +4022,21 @@ nfs4_upgrade_open(struct svc_rqst *rqstp,
> > > > > > > struct nfs4_file *fp, struct svc_fh *c
> > > > > > > 	__be32 status;
> > > > > > > 	unsigned char old_deny_bmap = stp->st_deny_bmap;
> > > > > > > 
> > > > > > > -	if (!test_access(open->op_share_access, stp))
> > > > > > > -		return nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > > > > > > open);
> > > > > > > +again:
> > > > > > > +	spin_lock(&fp->fi_lock);
> > > > > > > +	if (!test_access(open->op_share_access, stp)) {
> > > > > > > +		spin_unlock(&fp->fi_lock);
> > > > > > > +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp,
> > > > > > > open);
> > > > > > > +		/*
> > > > > > > +		 * Somebody won the race for access while we did
> > > > > > > not hold
> > > > > > > +		 * the lock here
> > > > > > > +		 */
> > > > > > > +		if (status == nfserr_eagain)
> > > > > > > +			goto again;
> > > > > > > +		return status;
> > > > > > > +	}
> > > > > > > 
> > > > > > > 	/* test and set deny mode */
> > > > > > > -	spin_lock(&fp->fi_lock);
> > > > > > > 	status = nfs4_file_check_deny(fp, open->op_share_deny);
> > > > > > > 	if (status == nfs_ok) {
> > > > > > > 		set_deny(open->op_share_deny, stp);
> > > > > > > @@ -4361,6 +4376,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp,
> > > > > > > struct svc_fh *current_fh, struct nf
> > > > > > > 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp,
> > > > > > > open);
> > > > > > > 		if (status) {
> > > > > > > 			up_read(&stp->st_rwsem);
> > > > > > > +			/*
> > > > > > > +			 * EAGAIN is returned when there's a
> > > > > > > racing access,
> > > > > > > +			 * this should never happen as we are the
> > > > > > > only user
> > > > > > > +			 * of this new state, and since it's not
> > > > > > > yet hashed,
> > > > > > > +			 * nobody can find it
> > > > > > > +			 */
> > > > > > > +			WARN_ON(status == nfserr_eagain);
> > > > > > 
> > > > > > Ok, some more testing shows that this CAN happen.
> > > > > > So this patch is inferior to the mutex one after all.
> > > > > > 
> > > > > 
> > > > > Yeah, that can happen for all sorts of reasons. As Andrew pointed out,
> > > > > you can get this when there is a lease break in progress, and that may
> > > > > be occurring for a completely different stateid (or because of samba,
> > > > > etc...)
> > > > > 
> > > > > It may be possible to do something like this, but we'd need to audit
> > > > > all of the handling of st_access_bmap (and the deny bmap) to ensure
> > > > > that we get it right.
> > > > > 
> > > > > For now, I think just turning that rwsem into a mutex is the best
> > > > > solution. That is a per-stateid mutex so any contention is going to be
> > > > > due to the client sending racing OPEN calls for the same inode anyway.
> > > > > Allowing those to run in parallel again could be useful in some cases,
> > > > > but most use-cases won't be harmed by that serialization.
> > > > 
> > > > OK, so for now my plan is to take "nfsd: Always lock state exclusively"
> > > > for 4.7.  Thanks to both of you for your work on this….
> > > 
> > > 
> > > FYI, I just hit this again with the "Always lock state exclusively" patch too.
> > > I hate when that happens. But it is much harder to hit now.
> > > 
> > > The trace is also in the nfs4_get_vfs_file() that's called directly from
> > > nfsd4_process_open2().
> > > 
> > > Otherwise the symptoms are pretty much the same - first I get the warning in set_access
> > > that the flag is already set and then the nfsd4_free_file_rcu() one and then
> > > unmount of underlying fs fails.
> > > 
> > > What's strange is I am not sure what else can set the flag.
> > > Basically set_access is called from nfs4_get_vfs_file() - under the mutex via
> > > nfsd4_process_open2() directly or via nfs4_upgrade_open()...,
> > > or from get_lock_access() - without the mutex, but my workload does not do any
> > > file locking, so it should not really be hitting, right?
> > > 
> > > 
> > > Ah! I think I see it.
> > > This patch has the same problem as the spinlock moving one:
> > > When we call nfs4_get_vfs_file() directly from nfsd4_process_open2(), there's no
> > > check for anything, so suppose we take this direct calling path, there's a
> > > mutex_lock() just before the call, but what's to stop another thread from finding
> > > this stateid meanwhile and being first with the mutex and then also setting the
> > > access mode that our first thread no longer sets?
> > > This makes it so that in fact we can never skip the access mode testing unless
> > > testing and setting is atomically done under the same lock which it is not now.
> > > The patch that extended the coverage of the fi_lock got that right, I did not hit
> > > any leaks there, just the "unhandled EAGAIN" WARN_ON, which is wrong in its own right.
> > > The surprising part is that the state that's not yet been through find_or_hash_clnt_odstate could still be found by somebody else? Is it really
> > > supposed to work like that? 
> > > 
> > > So if we are to check the access mode at all times in nfs4_get_vfs_file(),
> > > and return eagain, should we just check for that (all under the same
> > > lock if we go with the mutex patch) and call into nfs4_upgrade_open in that
> > > case again?
> > > Or I guess it's even better if we resurrect the fi_lock coverage extension patch and
> > > do it there, as that would mean at least the check and set are atomic wrt
> > > locking?
> > 
> > Good catch. Could we fix this by locking the mutex before hashing the
> > new stateid (and having init_open_stateid return with it locked if it
> > finds an existing one?).
> 
> Hm. I am trying to lock the newly initialized one and that seems to be holding up
> well (but I want 24 hours just to be extra sure).
> Hm, I just noticed a bug in this, so that'll reset the clock back.
> 
> But I think we cannot return with locked one if we found existing one due to lock
> inversion?
> I see that normally first we lock the state rwsem (now mutex) and then
> lock the fi_lock.
> Now if we make init_open_stateid() to lock the new state mutex while the fi_lock
> is locked - that's probably ok, because we can do it before adding it to the list,
> so nobody can find it.
> Now the existing state that we find, we cannot really lock while holding that fi_lock,
> because what if there's a parallel thread that already holds the mutex and now
> wants the fi_lock?
> And so it's probably best to return with existing state unlocked and let caller lock it?
> Or do you think it's best to separately lock the found stp outside of spinlock
> just for consistency?

I think we just have to ensure that if the new stateid is hashed that
its mutex is locked prior to being inserted into the hashtable. That
should prevent the race you mentioned.

If we find an existing one in the hashtable in init_open_stateid, then
we _can_ take the mutex after dropping the spinlocks, since we won't
call release_open_stateid in that case anyway.

We'll also need to consider what happens if nfs4_get_vfs_file fails
after we hashed the stateid, but then another task finds it while
processing another open. So we might have to have release_open_stateid
unlock the mutex after unhashing the stateid, but before putting the
reference, and then have init_open_stateid check to see if the thing is
still hashed after it gets the mutex.

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-12  2:50                                 ` Jeff Layton
@ 2016-06-12  3:15                                   ` Oleg Drokin
  2016-06-12 13:13                                     ` Jeff Layton
  2016-06-13  1:26                                     ` [PATCH v2] nfsd: Always lock state exclusively Oleg Drokin
  0 siblings, 2 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-12  3:15 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J . Bruce Fields, linux-nfs, linux-kernel, Andrew W Elble


On Jun 11, 2016, at 10:50 PM, Jeff Layton wrote:

> On Sat, 2016-06-11 at 22:06 -0400, Oleg Drokin wrote:
>> 
>> Hm. I am trying to lock the newly initialized one and that seems to be holding up
>> well (but I want 24 hours just to be extra sure).
>> Hm, I just noticed a bug in this, so that'll reset the clock back.
>> 
>> But I think we cannot return with locked one if we found existing one due to lock
>> inversion?
>> I see that normally first we lock the state rwsem (now mutex) and then
>> lock the fi_lock.
>> Now if we make init_open_stateid() to lock the new state mutex while the fi_lock
>> is locked - that's probably ok, because we can do it before adding it to the list,
>> so nobody can find it.
>> Now the existing state that we find, we cannot really lock while holding that fi_lock,
>> because what if there's a parallel thread that already holds the mutex and now
>> wants the fi_lock?
>> And so it's probably best to return with existing state unlocked and let caller lock it?
>> Or do you think it's best to separately lock the found stp outside of spinlock
>> just for consistency?
> 
> I think we just have to ensure that if the new stateid is hashed that
> its mutex is locked prior to being inserted into the hashtable. That
> should prevent the race you mentioned.
> 
> If we find an existing one in the hashtable in init_open_stateid, then
> we _can_ take the mutex after dropping the spinlocks, since we won't
> call release_open_stateid in that case anyway.

Yes.

> We'll also need to consider what happens if nfs4_get_vfs_file fails
> after we hashed the stateid, but then another task finds it while
> processing another open. So we might have to have release_open_stateid
> unlock the mutex after unhashing the stateid, but before putting the
> reference, and then have init_open_stateid check to see if the thing is
> still hashed after it gets the mutex.

Hm.
So what's going to go wrong if another user reuses the unhashed stateid?
As long as they drop it once they are done it'll be freed and all is fine, no?
Are there other implications?
Hm, it looks like free_ol_stateid_reaplist() just frees the thing without
looking at the mutexes at all?

Ok, so we get the mutex and check that the stateid is hashed; if it's not anymore
(actually, unhashing could be done without the mutex too, right? so just holding
the mutex is not going to protect us), then we need to drop the mutex and restart
the search from scratch (including all the relocking), I assume?
I guess I'll have it as a separate follow on patch.
We'll probably also need some fault-injection here to trigger this case, as triggering
it "naturally" will be a tough problem even on my mega-racy setup.

Something like:
                if (swapstp) {
…
		}
		if (FAULTINJECTION) {
			msleep(some_random_time);
			status = nfserr_eio;
		} else
			status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open); 

should increase the chance.
Ideally there'd be a way to trigger this case more deterministically.
How do I get two OPEN requests in parallel in NFS for the same file?
Can I just have two threads do it, so that it always results in two requests,
with no merging anywhere along the way that I need to be aware of?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file
  2016-06-12  3:15                                   ` Oleg Drokin
@ 2016-06-12 13:13                                     ` Jeff Layton
  2016-06-13  1:26                                     ` [PATCH v2] nfsd: Always lock state exclusively Oleg Drokin
  1 sibling, 0 replies; 60+ messages in thread
From: Jeff Layton @ 2016-06-12 13:13 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: J . Bruce Fields, linux-nfs, linux-kernel, Andrew W Elble

On Sat, 2016-06-11 at 23:15 -0400, Oleg Drokin wrote:
> On Jun 11, 2016, at 10:50 PM, Jeff Layton wrote:
> 
> > 
> > On Sat, 2016-06-11 at 22:06 -0400, Oleg Drokin wrote:
> > > 
> > > 
> > > Hm. I am trying to lock the newly initialized one and that seems to be holding up
> > > well (but I want 24 hours just to be extra sure).
> > > Hm, I just noticed a bug in this, so that'll reset the clock.
> > > 
> > > But I think we cannot return with the existing one locked, if we found one,
> > > due to lock inversion?
> > > I see that normally first we lock the state rwsem (now mutex) and then
> > > lock the fi_lock.
> > > Now if we make init_open_stateid() lock the new state mutex while the fi_lock
> > > is locked - that's probably ok, because we can do it before adding it to the list,
> > > so nobody can find it.
> > > Now the existing state that we find, we cannot really lock while holding that fi_lock,
> > > because what if there's a parallel thread that already holds the mutex and now
> > > wants the fi_lock?
> > > And so it's probably best to return with existing state unlocked and let caller lock it?
> > > Or do you think it's best to separately lock the found stp outside of spinlock
> > > just for consistency?
> > I think we just have to ensure that if the new stateid is hashed that
> > its mutex is locked prior to being inserted into the hashtable. That
> > should prevent the race you mentioned.
> > 
> > If we find an existing one in the hashtable in init_open_stateid, then
> > we _can_ take the mutex after dropping the spinlocks, since we won't
> > call release_open_stateid in that case anyway.
> Yes.
> 
> > 
> > We'll also need to consider what happens if nfs4_get_vfs_file fails
> > after we hashed the stateid, but then another task finds it while
> > processing another open. So we might have to have release_open_stateid
> > unlock the mutex after unhashing the stateid, but before putting the
> > reference, and then have init_open_stateid check to see if the thing is
> > still hashed after it gets the mutex.
> Hm.
> So what's going to go wrong if another user reuses the unhashed stateid?
> As long as they drop it once they are done it'll be freed and all is fine, no?
> Are there other implications?
> Hm, it looks like free_ol_stateid_reaplist() just frees the thing without
> looking at the mutexes at all?
> 

The problem there is that you're sending a soon-to-be-defunct stateid to
the client as valid, which means the client will end up in state
recovery (at best).

The initial open is somewhat of a special case: we are hashing a new
stateid. If the operation that hashes it fails, then we want to make it
as if it never happened at all. The client will never see it, so we
don't want to leave it hanging around on the server. We need to unhash
the stateid in that case.

So, yes, the mutex doesn't really prevent the thing from being
unhashed, but if you take the mutex and the thing is still hashed
afterward, then you know that the above situation didn't happen. The
initial incarnation of the stateid will have been sent to the client.

> Ok, so we get the mutex and check that the stateid is hashed; if it's not anymore
> (actually, unhashing could be done without the mutex too, right? so just holding
> the mutex is not going to protect us), then we need to drop the mutex and restart
> the search from scratch (including all the relocking), I assume?

Yeah, that's what I was thinking. It should be a fairly rare race so
taking an extra hit on the locking in that case shouldn't be too bad. I
think there's quite a bit of opportunity to simplify the open path by
reworking this code as well.

What we probably need to do is turn nfsd4_find_existing_open into
something that initializes and hashes the stateid if it doesn't find
one, and returns it with the mutex locked. It could also check whether
an existing stateid is still hashed after taking the mutex, and redrive
the search if it isn't. That would make the open code more resilient in
the face of OPEN vs. CLOSE races as well, which would be nice...

> I guess I'll have it as a separate follow on patch.
> We'll probably also need some fault-injection here to trigger this case, as triggering
> it "naturally" will be a tough problem even on my mega-racy setup.
> 
> Something like:
>                 if (swapstp) {
> …
> 		}
> 		if (FAULTINJECTION) {
> 			msleep(some_random_time);
> 			status = nfserr_eio;
> 		} else
> 			status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open); 
> 
> should increase the chance.

Sure. Look for CONFIG_NFSD_FAULT_INJECTION. You might be able to use
that framework (though it is a bit manky).

> Ideally there'd be a way to trigger this case more deterministically.
> How do I get two OPEN requests in parallel in NFS for the same file?
> Can I just have two threads do it, so that it always results in two requests,
> with no merging anywhere along the way that I need to be aware of?

Yeah, that's basically it. You need two racing OPEN calls for the same
stateid. Might be easiest to do with something like pynfs...

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-12  3:15                                   ` Oleg Drokin
  2016-06-12 13:13                                     ` Jeff Layton
@ 2016-06-13  1:26                                     ` Oleg Drokin
  2016-06-14 15:38                                       ` J . Bruce Fields
  2016-06-14 15:46                                       ` [PATCH v2] nfsd: Always lock state exclusively J . Bruce Fields
  1 sibling, 2 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-13  1:26 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

It used to be the case that the state had an rwsem that was locked for
write by downgrades, but for read by upgrades (opens). The problem is
that if there are two competing opens for the same state, they step on
each other's toes, potentially leading to leaked file descriptors
from the state structure, since the access mode is a bitmap that is
only set once.

Extend the locked region in nfsd4_process_open2() to avoid
racing entry into nfs4_get_vfs_file().
Make init_open_stateid() return with the stateid locked, to be unlocked
by the caller.

This version has held up well in my testing for 24 hours.
It still does not address the situation where one (the first?) of the
racing nfs4_get_vfs_file() calls returns an error. That is to be
addressed in a separate patch once there is a solid reproducer
(potentially using some fault injection).

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
 fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
 fs/nfsd/state.h     |  2 +-
 2 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index f5f82e1..fa5fb5a 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	struct nfs4_openowner *oo = open->op_openowner;
 	struct nfs4_ol_stateid *retstp = NULL;
 
+	/* We are moving these outside of the spinlocks to avoid the warnings */
+	mutex_init(&stp->st_mutex);
+	mutex_lock(&stp->st_mutex);
+
 	spin_lock(&oo->oo_owner.so_client->cl_lock);
 	spin_lock(&fp->fi_lock);
 
@@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = 0;
 	stp->st_openstp = NULL;
-	init_rwsem(&stp->st_rwsem);
 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
 	list_add(&stp->st_perfile, &fp->fi_stateids);
 
 out_unlock:
 	spin_unlock(&fp->fi_lock);
 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
+	if (retstp)
+		mutex_lock(&retstp->st_mutex);
 	return retstp;
 }
 
@@ -4335,32 +4340,34 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	 */
 	if (stp) {
 		/* Stateid was found, this is an OPEN upgrade */
-		down_read(&stp->st_rwsem);
+		mutex_lock(&stp->st_mutex);
 		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
 		if (status) {
-			up_read(&stp->st_rwsem);
+			mutex_unlock(&stp->st_mutex);
 			goto out;
 		}
 	} else {
 		stp = open->op_stp;
 		open->op_stp = NULL;
+		/*
+		 * init_open_stateid() either returns a locked stateid
+		 * it found, or initializes and locks the new one we passed in
+		 */
 		swapstp = init_open_stateid(stp, fp, open);
 		if (swapstp) {
 			nfs4_put_stid(&stp->st_stid);
 			stp = swapstp;
-			down_read(&stp->st_rwsem);
 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
 						stp, open);
 			if (status) {
-				up_read(&stp->st_rwsem);
+				mutex_unlock(&stp->st_mutex);
 				goto out;
 			}
 			goto upgrade_out;
 		}
-		down_read(&stp->st_rwsem);
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
 		if (status) {
-			up_read(&stp->st_rwsem);
+			mutex_unlock(&stp->st_mutex);
 			release_open_stateid(stp);
 			goto out;
 		}
@@ -4372,7 +4379,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	}
 upgrade_out:
 	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
-	up_read(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 
 	if (nfsd4_has_session(&resp->cstate)) {
 		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
@@ -4977,12 +4984,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
 		 * revoked delegations are kept only for free_stateid.
 		 */
 		return nfserr_bad_stateid;
-	down_write(&stp->st_rwsem);
+	mutex_lock(&stp->st_mutex);
 	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
 	if (status == nfs_ok)
 		status = nfs4_check_fh(current_fh, &stp->st_stid);
 	if (status != nfs_ok)
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 	return status;
 }
 
@@ -5030,7 +5037,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
 		return status;
 	oo = openowner(stp->st_stateowner);
 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 		nfs4_put_stid(&stp->st_stid);
 		return nfserr_bad_stateid;
 	}
@@ -5062,12 +5069,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	oo = openowner(stp->st_stateowner);
 	status = nfserr_bad_stateid;
 	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 		goto put_stateid;
 	}
 	oo->oo_flags |= NFS4_OO_CONFIRMED;
 	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
 		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
 
@@ -5143,7 +5150,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
 	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
 	status = nfs_ok;
 put_stateid:
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
@@ -5196,7 +5203,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	if (status)
 		goto out; 
 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 
 	nfsd4_close_open_stateid(stp);
 
@@ -5422,7 +5429,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = open_stp->st_deny_bmap;
 	stp->st_openstp = open_stp;
-	init_rwsem(&stp->st_rwsem);
+	mutex_init(&stp->st_mutex);
 	list_add(&stp->st_locks, &open_stp->st_locks);
 	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
 	spin_lock(&fp->fi_lock);
@@ -5591,7 +5598,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 					&open_stp, nn);
 		if (status)
 			goto out;
-		up_write(&open_stp->st_rwsem);
+		mutex_unlock(&open_stp->st_mutex);
 		open_sop = openowner(open_stp->st_stateowner);
 		status = nfserr_bad_stateid;
 		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
@@ -5600,7 +5607,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		status = lookup_or_create_lock_state(cstate, open_stp, lock,
 							&lock_stp, &new);
 		if (status == nfs_ok)
-			down_write(&lock_stp->st_rwsem);
+			mutex_lock(&lock_stp->st_mutex);
 	} else {
 		status = nfs4_preprocess_seqid_op(cstate,
 				       lock->lk_old_lock_seqid,
@@ -5704,7 +5711,7 @@ out:
 		    seqid_mutating_err(ntohl(status)))
 			lock_sop->lo_owner.so_seqid++;
 
-		up_write(&lock_stp->st_rwsem);
+		mutex_unlock(&lock_stp->st_mutex);
 
 		/*
 		 * If this is a new, never-before-used stateid, and we are
@@ -5874,7 +5881,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 fput:
 	fput(filp);
 put_stateid:
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 986e51e..64053ea 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
 	unsigned char			st_access_bmap;
 	unsigned char			st_deny_bmap;
 	struct nfs4_ol_stateid		*st_openstp;
-	struct rw_semaphore		st_rwsem;
+	struct mutex			st_mutex;
 };
 
 static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-13  1:26                                     ` [PATCH v2] nfsd: Always lock state exclusively Oleg Drokin
@ 2016-06-14 15:38                                       ` J . Bruce Fields
  2016-06-14 15:53                                         ` Oleg Drokin
  2016-06-14 15:46                                       ` [PATCH v2] nfsd: Always lock state exclusively J . Bruce Fields
  1 sibling, 1 reply; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-14 15:38 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Jeff Layton, linux-nfs, linux-kernel

On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
> It used to be the case that the state had an rwsem that was locked for
> write by downgrades, but for read by upgrades (opens). The problem is
> that if there are two competing opens for the same state, they step on
> each other's toes, potentially leading to leaked file descriptors
> from the state structure, since the access mode is a bitmap that is
> only set once.
> 
> Extend the locked region in nfsd4_process_open2() to avoid
> racing entry into nfs4_get_vfs_file().
> Make init_open_stateid() return with the stateid locked, to be unlocked
> by the caller.
> 
> This version has held up well in my testing for 24 hours.
> It still does not address the situation where one (the first?) of the
> racing nfs4_get_vfs_file() calls returns an error. That is to be
> addressed in a separate patch once there is a solid reproducer
> (potentially using some fault injection).
> 
> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> ---
>  fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
>  fs/nfsd/state.h     |  2 +-
>  2 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f5f82e1..fa5fb5a 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>  	struct nfs4_openowner *oo = open->op_openowner;
>  	struct nfs4_ol_stateid *retstp = NULL;
>  
> +	/* We are moving these outside of the spinlocks to avoid the warnings */
> +	mutex_init(&stp->st_mutex);
> +	mutex_lock(&stp->st_mutex);
> +
>  	spin_lock(&oo->oo_owner.so_client->cl_lock);
>  	spin_lock(&fp->fi_lock);
>  
> @@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>  	stp->st_access_bmap = 0;
>  	stp->st_deny_bmap = 0;
>  	stp->st_openstp = NULL;
> -	init_rwsem(&stp->st_rwsem);
>  	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
>  	list_add(&stp->st_perfile, &fp->fi_stateids);
>  
>  out_unlock:
>  	spin_unlock(&fp->fi_lock);
>  	spin_unlock(&oo->oo_owner.so_client->cl_lock);
> +	if (retstp)
> +		mutex_lock(&retstp->st_mutex);
>  	return retstp;

You're returning with both stp->st_mutex and retstp->st_mutex locked.
Did you mean to drop that first lock in the (retstp) case, or am I
missing something?

--b.

>  }
>  
> @@ -4335,32 +4340,34 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  	 */
>  	if (stp) {
>  		/* Stateid was found, this is an OPEN upgrade */
> -		down_read(&stp->st_rwsem);
> +		mutex_lock(&stp->st_mutex);
>  		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
>  		if (status) {
> -			up_read(&stp->st_rwsem);
> +			mutex_unlock(&stp->st_mutex);
>  			goto out;
>  		}
>  	} else {
>  		stp = open->op_stp;
>  		open->op_stp = NULL;
> +		/*
> +		 * init_open_stateid() either returns a locked stateid
> +		 * it found, or initializes and locks the new one we passed in
> +		 */
>  		swapstp = init_open_stateid(stp, fp, open);
>  		if (swapstp) {
>  			nfs4_put_stid(&stp->st_stid);
>  			stp = swapstp;
> -			down_read(&stp->st_rwsem);
>  			status = nfs4_upgrade_open(rqstp, fp, current_fh,
>  						stp, open);
>  			if (status) {
> -				up_read(&stp->st_rwsem);
> +				mutex_unlock(&stp->st_mutex);
>  				goto out;
>  			}
>  			goto upgrade_out;
>  		}
> -		down_read(&stp->st_rwsem);
>  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>  		if (status) {
> -			up_read(&stp->st_rwsem);
> +			mutex_unlock(&stp->st_mutex);
>  			release_open_stateid(stp);
>  			goto out;
>  		}
> @@ -4372,7 +4379,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  	}
>  upgrade_out:
>  	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
> -	up_read(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  
>  	if (nfsd4_has_session(&resp->cstate)) {
>  		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
> @@ -4977,12 +4984,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
>  		 * revoked delegations are kept only for free_stateid.
>  		 */
>  		return nfserr_bad_stateid;
> -	down_write(&stp->st_rwsem);
> +	mutex_lock(&stp->st_mutex);
>  	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
>  	if (status == nfs_ok)
>  		status = nfs4_check_fh(current_fh, &stp->st_stid);
>  	if (status != nfs_ok)
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  	return status;
>  }
>  
> @@ -5030,7 +5037,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
>  		return status;
>  	oo = openowner(stp->st_stateowner);
>  	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  		nfs4_put_stid(&stp->st_stid);
>  		return nfserr_bad_stateid;
>  	}
> @@ -5062,12 +5069,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	oo = openowner(stp->st_stateowner);
>  	status = nfserr_bad_stateid;
>  	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  		goto put_stateid;
>  	}
>  	oo->oo_flags |= NFS4_OO_CONFIRMED;
>  	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
>  		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
>  
> @@ -5143,7 +5150,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
>  	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
>  	status = nfs_ok;
>  put_stateid:
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	nfs4_put_stid(&stp->st_stid);
>  out:
>  	nfsd4_bump_seqid(cstate, status);
> @@ -5196,7 +5203,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	if (status)
>  		goto out; 
>  	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  
>  	nfsd4_close_open_stateid(stp);
>  
> @@ -5422,7 +5429,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
>  	stp->st_access_bmap = 0;
>  	stp->st_deny_bmap = open_stp->st_deny_bmap;
>  	stp->st_openstp = open_stp;
> -	init_rwsem(&stp->st_rwsem);
> +	mutex_init(&stp->st_mutex);
>  	list_add(&stp->st_locks, &open_stp->st_locks);
>  	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
>  	spin_lock(&fp->fi_lock);
> @@ -5591,7 +5598,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  					&open_stp, nn);
>  		if (status)
>  			goto out;
> -		up_write(&open_stp->st_rwsem);
> +		mutex_unlock(&open_stp->st_mutex);
>  		open_sop = openowner(open_stp->st_stateowner);
>  		status = nfserr_bad_stateid;
>  		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
> @@ -5600,7 +5607,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		status = lookup_or_create_lock_state(cstate, open_stp, lock,
>  							&lock_stp, &new);
>  		if (status == nfs_ok)
> -			down_write(&lock_stp->st_rwsem);
> +			mutex_lock(&lock_stp->st_mutex);
>  	} else {
>  		status = nfs4_preprocess_seqid_op(cstate,
>  				       lock->lk_old_lock_seqid,
> @@ -5704,7 +5711,7 @@ out:
>  		    seqid_mutating_err(ntohl(status)))
>  			lock_sop->lo_owner.so_seqid++;
>  
> -		up_write(&lock_stp->st_rwsem);
> +		mutex_unlock(&lock_stp->st_mutex);
>  
>  		/*
>  		 * If this is a new, never-before-used stateid, and we are
> @@ -5874,7 +5881,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  fput:
>  	fput(filp);
>  put_stateid:
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	nfs4_put_stid(&stp->st_stid);
>  out:
>  	nfsd4_bump_seqid(cstate, status);
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 986e51e..64053ea 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
>  	unsigned char			st_access_bmap;
>  	unsigned char			st_deny_bmap;
>  	struct nfs4_ol_stateid		*st_openstp;
> -	struct rw_semaphore		st_rwsem;
> +	struct mutex			st_mutex;
>  };
>  
>  static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-13  1:26                                     ` [PATCH v2] nfsd: Always lock state exclusively Oleg Drokin
  2016-06-14 15:38                                       ` J . Bruce Fields
@ 2016-06-14 15:46                                       ` J . Bruce Fields
  2016-06-14 15:56                                         ` Oleg Drokin
  1 sibling, 1 reply; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-14 15:46 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Jeff Layton, linux-nfs, linux-kernel

On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
> It used to be the case that the state had an rwsem that was locked for
> write by downgrades, but for read by upgrades (opens). The problem is
> that if there are two competing opens for the same state, they step on
> each other's toes, potentially leading to leaked file descriptors
> from the state structure, since the access mode is a bitmap that is
> only set once.
> 
> Extend the locked region in nfsd4_process_open2() to avoid
> racing entry into nfs4_get_vfs_file().
> Make init_open_stateid() return with the stateid locked, to be unlocked
> by the caller.
> 
> This version has held up well in my testing for 24 hours.
> It still does not address the situation where one (the first?) of the
> racing nfs4_get_vfs_file() calls returns an error. That is to be
> addressed in a separate patch once there is a solid reproducer
> (potentially using some fault injection).
> 
> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> ---
>  fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
>  fs/nfsd/state.h     |  2 +-
>  2 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f5f82e1..fa5fb5a 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>  	struct nfs4_openowner *oo = open->op_openowner;
>  	struct nfs4_ol_stateid *retstp = NULL;
>  
> +	/* We are moving these outside of the spinlocks to avoid the warnings */
> +	mutex_init(&stp->st_mutex);
> +	mutex_lock(&stp->st_mutex);

A mutex_init_locked() primitive might also be convenient here.

You could also take the two previous lines from the caller into this
function instead of passing in stp, that might simplify the code.
(Haven't checked.)

--b.

> +
>  	spin_lock(&oo->oo_owner.so_client->cl_lock);
>  	spin_lock(&fp->fi_lock);
>  
> @@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>  	stp->st_access_bmap = 0;
>  	stp->st_deny_bmap = 0;
>  	stp->st_openstp = NULL;
> -	init_rwsem(&stp->st_rwsem);
>  	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
>  	list_add(&stp->st_perfile, &fp->fi_stateids);
>  
>  out_unlock:
>  	spin_unlock(&fp->fi_lock);
>  	spin_unlock(&oo->oo_owner.so_client->cl_lock);
> +	if (retstp)
> +		mutex_lock(&retstp->st_mutex);
>  	return retstp;
>  }
>  
> @@ -4335,32 +4340,34 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  	 */
>  	if (stp) {
>  		/* Stateid was found, this is an OPEN upgrade */
> -		down_read(&stp->st_rwsem);
> +		mutex_lock(&stp->st_mutex);
>  		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
>  		if (status) {
> -			up_read(&stp->st_rwsem);
> +			mutex_unlock(&stp->st_mutex);
>  			goto out;
>  		}
>  	} else {
>  		stp = open->op_stp;
>  		open->op_stp = NULL;
> +		/*
> +		 * init_open_stateid() either returns a locked stateid
> +		 * it found, or initializes and locks the new one we passed in
> +		 */
>  		swapstp = init_open_stateid(stp, fp, open);
>  		if (swapstp) {
>  			nfs4_put_stid(&stp->st_stid);
>  			stp = swapstp;
> -			down_read(&stp->st_rwsem);
>  			status = nfs4_upgrade_open(rqstp, fp, current_fh,
>  						stp, open);
>  			if (status) {
> -				up_read(&stp->st_rwsem);
> +				mutex_unlock(&stp->st_mutex);
>  				goto out;
>  			}
>  			goto upgrade_out;
>  		}
> -		down_read(&stp->st_rwsem);
>  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>  		if (status) {
> -			up_read(&stp->st_rwsem);
> +			mutex_unlock(&stp->st_mutex);
>  			release_open_stateid(stp);
>  			goto out;
>  		}
> @@ -4372,7 +4379,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  	}
>  upgrade_out:
>  	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
> -	up_read(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  
>  	if (nfsd4_has_session(&resp->cstate)) {
>  		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
> @@ -4977,12 +4984,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
>  		 * revoked delegations are kept only for free_stateid.
>  		 */
>  		return nfserr_bad_stateid;
> -	down_write(&stp->st_rwsem);
> +	mutex_lock(&stp->st_mutex);
>  	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
>  	if (status == nfs_ok)
>  		status = nfs4_check_fh(current_fh, &stp->st_stid);
>  	if (status != nfs_ok)
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  	return status;
>  }
>  
> @@ -5030,7 +5037,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
>  		return status;
>  	oo = openowner(stp->st_stateowner);
>  	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  		nfs4_put_stid(&stp->st_stid);
>  		return nfserr_bad_stateid;
>  	}
> @@ -5062,12 +5069,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	oo = openowner(stp->st_stateowner);
>  	status = nfserr_bad_stateid;
>  	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
> -		up_write(&stp->st_rwsem);
> +		mutex_unlock(&stp->st_mutex);
>  		goto put_stateid;
>  	}
>  	oo->oo_flags |= NFS4_OO_CONFIRMED;
>  	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
>  		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
>  
> @@ -5143,7 +5150,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
>  	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
>  	status = nfs_ok;
>  put_stateid:
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	nfs4_put_stid(&stp->st_stid);
>  out:
>  	nfsd4_bump_seqid(cstate, status);
> @@ -5196,7 +5203,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	if (status)
>  		goto out; 
>  	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  
>  	nfsd4_close_open_stateid(stp);
>  
> @@ -5422,7 +5429,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
>  	stp->st_access_bmap = 0;
>  	stp->st_deny_bmap = open_stp->st_deny_bmap;
>  	stp->st_openstp = open_stp;
> -	init_rwsem(&stp->st_rwsem);
> +	mutex_init(&stp->st_mutex);
>  	list_add(&stp->st_locks, &open_stp->st_locks);
>  	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
>  	spin_lock(&fp->fi_lock);
> @@ -5591,7 +5598,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  					&open_stp, nn);
>  		if (status)
>  			goto out;
> -		up_write(&open_stp->st_rwsem);
> +		mutex_unlock(&open_stp->st_mutex);
>  		open_sop = openowner(open_stp->st_stateowner);
>  		status = nfserr_bad_stateid;
>  		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
> @@ -5600,7 +5607,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		status = lookup_or_create_lock_state(cstate, open_stp, lock,
>  							&lock_stp, &new);
>  		if (status == nfs_ok)
> -			down_write(&lock_stp->st_rwsem);
> +			mutex_lock(&lock_stp->st_mutex);
>  	} else {
>  		status = nfs4_preprocess_seqid_op(cstate,
>  				       lock->lk_old_lock_seqid,
> @@ -5704,7 +5711,7 @@ out:
>  		    seqid_mutating_err(ntohl(status)))
>  			lock_sop->lo_owner.so_seqid++;
>  
> -		up_write(&lock_stp->st_rwsem);
> +		mutex_unlock(&lock_stp->st_mutex);
>  
>  		/*
>  		 * If this is a new, never-before-used stateid, and we are
> @@ -5874,7 +5881,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  fput:
>  	fput(filp);
>  put_stateid:
> -	up_write(&stp->st_rwsem);
> +	mutex_unlock(&stp->st_mutex);
>  	nfs4_put_stid(&stp->st_stid);
>  out:
>  	nfsd4_bump_seqid(cstate, status);
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 986e51e..64053ea 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
>  	unsigned char			st_access_bmap;
>  	unsigned char			st_deny_bmap;
>  	struct nfs4_ol_stateid		*st_openstp;
> -	struct rw_semaphore		st_rwsem;
> +	struct mutex			st_mutex;
>  };
>  
>  static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 15:38                                       ` J . Bruce Fields
@ 2016-06-14 15:53                                         ` Oleg Drokin
  2016-06-14 18:50                                           ` J . Bruce Fields
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-14 15:53 UTC (permalink / raw)
  To: J . Bruce Fields; +Cc: Jeff Layton, linux-nfs, linux-kernel


On Jun 14, 2016, at 11:38 AM, J . Bruce Fields wrote:

> On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
>> It used to be the case that state had an rwlock that was locked for write
>> by downgrades, but for read for upgrades (opens). Well, the problem is
>> if there are two competing opens for the same state, they step on
>> each other toes potentially leading to leaking file descriptors
>> from the state structure, since access mode is a bitmap only set once.
>> 
>> Extend the holding region around in nfsd4_process_open2() to avoid
>> racing entry into nfs4_get_vfs_file().
>> Make init_open_stateid() return with locked stateid to be unlocked
>> by the caller.
>> 
>> Now this version held up pretty well in my testing for 24 hours.
>> It still does not address the situation if during one of the racing
>> nfs4_get_vfs_file() calls we are getting an error from one (first?)
>> of them. This is to be addressed in a separate patch after having a
>> solid reproducer (potentially using some fault injection).
>> 
>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>> ---
>> fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
>> fs/nfsd/state.h     |  2 +-
>> 2 files changed, 28 insertions(+), 21 deletions(-)
>> 
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index f5f82e1..fa5fb5a 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>> 	struct nfs4_openowner *oo = open->op_openowner;
>> 	struct nfs4_ol_stateid *retstp = NULL;
>> 
>> +	/* We are moving these outside of the spinlocks to avoid the warnings */
>> +	mutex_init(&stp->st_mutex);
>> +	mutex_lock(&stp->st_mutex);
>> +
>> 	spin_lock(&oo->oo_owner.so_client->cl_lock);
>> 	spin_lock(&fp->fi_lock);
>> 
>> @@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>> 	stp->st_access_bmap = 0;
>> 	stp->st_deny_bmap = 0;
>> 	stp->st_openstp = NULL;
>> -	init_rwsem(&stp->st_rwsem);
>> 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
>> 	list_add(&stp->st_perfile, &fp->fi_stateids);
>> 
>> out_unlock:
>> 	spin_unlock(&fp->fi_lock);
>> 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
>> +	if (retstp)
>> +		mutex_lock(&retstp->st_mutex);
>> 	return retstp;
> 
> You're returning with both stp->st_mutex and retstp->st_mutex locked.
> Did you mean to drop that first lock in the (retstp) case, or am I
> missing something?

Well, I think it's ok (perhaps worthy of a comment): if we matched a different
retstp state, then stp is not used and is either released right away or, even
if reused, reinitialized in another call to init_open_stateid(),
so it's fine?
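For what it's worth, the contract here — whatever stateid comes back, it comes back locked, and a speculatively locked new object may be quietly discarded — can be sketched in userspace. This is a hedged pthread illustration, not nfsd code; `struct stateid`, `find_or_insert_locked()` and the integer key are invented stand-ins for the real (openowner, file) lookup done under cl_lock/fi_lock:

```c
#include <assert.h>   /* used by the usage example below */
#include <pthread.h>
#include <stdlib.h>

/* Toy stand-in for struct nfs4_ol_stateid: only what the locking
 * pattern needs. */
struct stateid {
	pthread_mutex_t st_mutex;
	int st_key;             /* stands in for the (owner, file) match */
	struct stateid *next;
};

/* Mirrors init_open_stateid()'s contract: the returned stateid is
 * always locked.  The new object is initialized and locked up front,
 * before it is published, since a mutex cannot be created pre-locked. */
static struct stateid *
find_or_insert_locked(struct stateid **head, struct stateid *new)
{
	struct stateid *found = NULL, *cur;

	pthread_mutex_init(&new->st_mutex, NULL);
	pthread_mutex_lock(&new->st_mutex);

	/* In nfsd this walk happens under the cl_lock/fi_lock spinlocks. */
	for (cur = *head; cur; cur = cur->next) {
		if (cur->st_key == new->st_key) {
			found = cur;
			break;
		}
	}
	if (!found) {
		new->next = *head;
		*head = new;
		return new;             /* new stateid, returned locked */
	}
	/* The speculatively locked new object is unused.  nfsd frees it
	 * with st_mutex still held (the point questioned above); in
	 * userspace we unlock and destroy to stay within POSIX rules. */
	pthread_mutex_unlock(&new->st_mutex);
	pthread_mutex_destroy(&new->st_mutex);
	free(new);
	pthread_mutex_lock(&found->st_mutex);
	return found;                   /* existing stateid, returned locked */
}
```

Calling it twice with the same key hands back the first object both times; the second, unused allocation is dropped on the floor, which is exactly the case the double-lock question is about.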


> 
> --b.
> 
>> }
>> 
>> @@ -4335,32 +4340,34 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>> 	 */
>> 	if (stp) {
>> 		/* Stateid was found, this is an OPEN upgrade */
>> -		down_read(&stp->st_rwsem);
>> +		mutex_lock(&stp->st_mutex);
>> 		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
>> 		if (status) {
>> -			up_read(&stp->st_rwsem);
>> +			mutex_unlock(&stp->st_mutex);
>> 			goto out;
>> 		}
>> 	} else {
>> 		stp = open->op_stp;
>> 		open->op_stp = NULL;
>> +		/*
>> +		 * init_open_stateid() either returns a locked stateid
>> +		 * it found, or initializes and locks the new one we passed in
>> +		 */
>> 		swapstp = init_open_stateid(stp, fp, open);
>> 		if (swapstp) {
>> 			nfs4_put_stid(&stp->st_stid);
>> 			stp = swapstp;
>> -			down_read(&stp->st_rwsem);
>> 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
>> 						stp, open);
>> 			if (status) {
>> -				up_read(&stp->st_rwsem);
>> +				mutex_unlock(&stp->st_mutex);
>> 				goto out;
>> 			}
>> 			goto upgrade_out;
>> 		}
>> -		down_read(&stp->st_rwsem);
>> 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>> 		if (status) {
>> -			up_read(&stp->st_rwsem);
>> +			mutex_unlock(&stp->st_mutex);
>> 			release_open_stateid(stp);
>> 			goto out;
>> 		}
>> @@ -4372,7 +4379,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>> 	}
>> upgrade_out:
>> 	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
>> -	up_read(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 
>> 	if (nfsd4_has_session(&resp->cstate)) {
>> 		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
>> @@ -4977,12 +4984,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
>> 		 * revoked delegations are kept only for free_stateid.
>> 		 */
>> 		return nfserr_bad_stateid;
>> -	down_write(&stp->st_rwsem);
>> +	mutex_lock(&stp->st_mutex);
>> 	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
>> 	if (status == nfs_ok)
>> 		status = nfs4_check_fh(current_fh, &stp->st_stid);
>> 	if (status != nfs_ok)
>> -		up_write(&stp->st_rwsem);
>> +		mutex_unlock(&stp->st_mutex);
>> 	return status;
>> }
>> 
>> @@ -5030,7 +5037,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
>> 		return status;
>> 	oo = openowner(stp->st_stateowner);
>> 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
>> -		up_write(&stp->st_rwsem);
>> +		mutex_unlock(&stp->st_mutex);
>> 		nfs4_put_stid(&stp->st_stid);
>> 		return nfserr_bad_stateid;
>> 	}
>> @@ -5062,12 +5069,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 	oo = openowner(stp->st_stateowner);
>> 	status = nfserr_bad_stateid;
>> 	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
>> -		up_write(&stp->st_rwsem);
>> +		mutex_unlock(&stp->st_mutex);
>> 		goto put_stateid;
>> 	}
>> 	oo->oo_flags |= NFS4_OO_CONFIRMED;
>> 	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
>> 		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
>> 
>> @@ -5143,7 +5150,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
>> 	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
>> 	status = nfs_ok;
>> put_stateid:
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 	nfs4_put_stid(&stp->st_stid);
>> out:
>> 	nfsd4_bump_seqid(cstate, status);
>> @@ -5196,7 +5203,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 	if (status)
>> 		goto out; 
>> 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 
>> 	nfsd4_close_open_stateid(stp);
>> 
>> @@ -5422,7 +5429,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
>> 	stp->st_access_bmap = 0;
>> 	stp->st_deny_bmap = open_stp->st_deny_bmap;
>> 	stp->st_openstp = open_stp;
>> -	init_rwsem(&stp->st_rwsem);
>> +	mutex_init(&stp->st_mutex);
>> 	list_add(&stp->st_locks, &open_stp->st_locks);
>> 	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
>> 	spin_lock(&fp->fi_lock);
>> @@ -5591,7 +5598,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 					&open_stp, nn);
>> 		if (status)
>> 			goto out;
>> -		up_write(&open_stp->st_rwsem);
>> +		mutex_unlock(&open_stp->st_mutex);
>> 		open_sop = openowner(open_stp->st_stateowner);
>> 		status = nfserr_bad_stateid;
>> 		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
>> @@ -5600,7 +5607,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 		status = lookup_or_create_lock_state(cstate, open_stp, lock,
>> 							&lock_stp, &new);
>> 		if (status == nfs_ok)
>> -			down_write(&lock_stp->st_rwsem);
>> +			mutex_lock(&lock_stp->st_mutex);
>> 	} else {
>> 		status = nfs4_preprocess_seqid_op(cstate,
>> 				       lock->lk_old_lock_seqid,
>> @@ -5704,7 +5711,7 @@ out:
>> 		    seqid_mutating_err(ntohl(status)))
>> 			lock_sop->lo_owner.so_seqid++;
>> 
>> -		up_write(&lock_stp->st_rwsem);
>> +		mutex_unlock(&lock_stp->st_mutex);
>> 
>> 		/*
>> 		 * If this is a new, never-before-used stateid, and we are
>> @@ -5874,7 +5881,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> fput:
>> 	fput(filp);
>> put_stateid:
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 	nfs4_put_stid(&stp->st_stid);
>> out:
>> 	nfsd4_bump_seqid(cstate, status);
>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>> index 986e51e..64053ea 100644
>> --- a/fs/nfsd/state.h
>> +++ b/fs/nfsd/state.h
>> @@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
>> 	unsigned char			st_access_bmap;
>> 	unsigned char			st_deny_bmap;
>> 	struct nfs4_ol_stateid		*st_openstp;
>> -	struct rw_semaphore		st_rwsem;
>> +	struct mutex			st_mutex;
>> };
>> 
>> static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
>> -- 
>> 2.7.4


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 15:46                                       ` [PATCH v2] nfsd: Always lock state exclusively J . Bruce Fields
@ 2016-06-14 15:56                                         ` Oleg Drokin
  2016-06-14 18:46                                           ` J . Bruce Fields
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-14 15:56 UTC (permalink / raw)
  To: J . Bruce Fields; +Cc: Jeff Layton, linux-nfs, linux-kernel


On Jun 14, 2016, at 11:46 AM, J . Bruce Fields wrote:

> On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
>> It used to be the case that state had an rwlock that was locked for write
>> by downgrades, but for read for upgrades (opens). Well, the problem is
>> if there are two competing opens for the same state, they step on
>> each other toes potentially leading to leaking file descriptors
>> from the state structure, since access mode is a bitmap only set once.
>> 
>> Extend the holding region around in nfsd4_process_open2() to avoid
>> racing entry into nfs4_get_vfs_file().
>> Make init_open_stateid() return with locked stateid to be unlocked
>> by the caller.
>> 
>> Now this version held up pretty well in my testing for 24 hours.
>> It still does not address the situation if during one of the racing
>> nfs4_get_vfs_file() calls we are getting an error from one (first?)
>> of them. This is to be addressed in a separate patch after having a
>> solid reproducer (potentially using some fault injection).
>> 
>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>> ---
>> fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
>> fs/nfsd/state.h     |  2 +-
>> 2 files changed, 28 insertions(+), 21 deletions(-)
>> 
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index f5f82e1..fa5fb5a 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>> 	struct nfs4_openowner *oo = open->op_openowner;
>> 	struct nfs4_ol_stateid *retstp = NULL;
>> 
>> +	/* We are moving these outside of the spinlocks to avoid the warnings */
>> +	mutex_init(&stp->st_mutex);
>> +	mutex_lock(&stp->st_mutex);
> 
> A mutex_init_locked() primitive might also be convenient here.

I know! With one, I would be able to do it under the spinlock without moving this around too.

But alas, not only is there no such primitive, the mutex documentation states this is disallowed.
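As an aside, POSIX threads have the same gap: there is no way to create a mutex already in the locked state, so the portable idiom is the same two-step init-then-lock, done while the object is still private. A small sketch (invented names, nothing from nfsd):

```c
#include <assert.h>   /* used by the usage example below */
#include <errno.h>
#include <pthread.h>

struct stateid {
	pthread_mutex_t st_mutex;
	/* ... */
};

/* Two-step "init locked": safe only while stp is still private to the
 * caller, i.e. before it is linked into any shared list, so nobody can
 * observe the window between the two calls.  In nfsd both steps must
 * also happen before the spinlocks are taken, because mutex_lock()
 * may sleep. */
static void stateid_init_locked(struct stateid *stp)
{
	pthread_mutex_init(&stp->st_mutex, NULL);
	pthread_mutex_lock(&stp->st_mutex);
}
```

After the call the mutex is held by the caller, so a trylock from the same thread fails with EBUSY until it is unlocked.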

> You could also take the two previous lines from the caller into this
> function instead of passing in stp, that might simplify the code.
> (Haven't checked.)

I am not really sure what you mean here.
These lines were moved from further away in this function (well, just the init, anyway).

Having half of stp's initialisation here and half in the caller sounds kind of strange
to me.

> 
> --b.
> 
>> +
>> 	spin_lock(&oo->oo_owner.so_client->cl_lock);
>> 	spin_lock(&fp->fi_lock);
>> 
>> @@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>> 	stp->st_access_bmap = 0;
>> 	stp->st_deny_bmap = 0;
>> 	stp->st_openstp = NULL;
>> -	init_rwsem(&stp->st_rwsem);
>> 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
>> 	list_add(&stp->st_perfile, &fp->fi_stateids);
>> 
>> out_unlock:
>> 	spin_unlock(&fp->fi_lock);
>> 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
>> +	if (retstp)
>> +		mutex_lock(&retstp->st_mutex);
>> 	return retstp;
>> }
>> 
>> @@ -4335,32 +4340,34 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>> 	 */
>> 	if (stp) {
>> 		/* Stateid was found, this is an OPEN upgrade */
>> -		down_read(&stp->st_rwsem);
>> +		mutex_lock(&stp->st_mutex);
>> 		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
>> 		if (status) {
>> -			up_read(&stp->st_rwsem);
>> +			mutex_unlock(&stp->st_mutex);
>> 			goto out;
>> 		}
>> 	} else {
>> 		stp = open->op_stp;
>> 		open->op_stp = NULL;
>> +		/*
>> +		 * init_open_stateid() either returns a locked stateid
>> +		 * it found, or initializes and locks the new one we passed in
>> +		 */
>> 		swapstp = init_open_stateid(stp, fp, open);
>> 		if (swapstp) {
>> 			nfs4_put_stid(&stp->st_stid);
>> 			stp = swapstp;
>> -			down_read(&stp->st_rwsem);
>> 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
>> 						stp, open);
>> 			if (status) {
>> -				up_read(&stp->st_rwsem);
>> +				mutex_unlock(&stp->st_mutex);
>> 				goto out;
>> 			}
>> 			goto upgrade_out;
>> 		}
>> -		down_read(&stp->st_rwsem);
>> 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>> 		if (status) {
>> -			up_read(&stp->st_rwsem);
>> +			mutex_unlock(&stp->st_mutex);
>> 			release_open_stateid(stp);
>> 			goto out;
>> 		}
>> @@ -4372,7 +4379,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>> 	}
>> upgrade_out:
>> 	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
>> -	up_read(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 
>> 	if (nfsd4_has_session(&resp->cstate)) {
>> 		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
>> @@ -4977,12 +4984,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
>> 		 * revoked delegations are kept only for free_stateid.
>> 		 */
>> 		return nfserr_bad_stateid;
>> -	down_write(&stp->st_rwsem);
>> +	mutex_lock(&stp->st_mutex);
>> 	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
>> 	if (status == nfs_ok)
>> 		status = nfs4_check_fh(current_fh, &stp->st_stid);
>> 	if (status != nfs_ok)
>> -		up_write(&stp->st_rwsem);
>> +		mutex_unlock(&stp->st_mutex);
>> 	return status;
>> }
>> 
>> @@ -5030,7 +5037,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
>> 		return status;
>> 	oo = openowner(stp->st_stateowner);
>> 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
>> -		up_write(&stp->st_rwsem);
>> +		mutex_unlock(&stp->st_mutex);
>> 		nfs4_put_stid(&stp->st_stid);
>> 		return nfserr_bad_stateid;
>> 	}
>> @@ -5062,12 +5069,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 	oo = openowner(stp->st_stateowner);
>> 	status = nfserr_bad_stateid;
>> 	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
>> -		up_write(&stp->st_rwsem);
>> +		mutex_unlock(&stp->st_mutex);
>> 		goto put_stateid;
>> 	}
>> 	oo->oo_flags |= NFS4_OO_CONFIRMED;
>> 	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
>> 		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
>> 
>> @@ -5143,7 +5150,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
>> 	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
>> 	status = nfs_ok;
>> put_stateid:
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 	nfs4_put_stid(&stp->st_stid);
>> out:
>> 	nfsd4_bump_seqid(cstate, status);
>> @@ -5196,7 +5203,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 	if (status)
>> 		goto out; 
>> 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 
>> 	nfsd4_close_open_stateid(stp);
>> 
>> @@ -5422,7 +5429,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
>> 	stp->st_access_bmap = 0;
>> 	stp->st_deny_bmap = open_stp->st_deny_bmap;
>> 	stp->st_openstp = open_stp;
>> -	init_rwsem(&stp->st_rwsem);
>> +	mutex_init(&stp->st_mutex);
>> 	list_add(&stp->st_locks, &open_stp->st_locks);
>> 	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
>> 	spin_lock(&fp->fi_lock);
>> @@ -5591,7 +5598,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 					&open_stp, nn);
>> 		if (status)
>> 			goto out;
>> -		up_write(&open_stp->st_rwsem);
>> +		mutex_unlock(&open_stp->st_mutex);
>> 		open_sop = openowner(open_stp->st_stateowner);
>> 		status = nfserr_bad_stateid;
>> 		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
>> @@ -5600,7 +5607,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> 		status = lookup_or_create_lock_state(cstate, open_stp, lock,
>> 							&lock_stp, &new);
>> 		if (status == nfs_ok)
>> -			down_write(&lock_stp->st_rwsem);
>> +			mutex_lock(&lock_stp->st_mutex);
>> 	} else {
>> 		status = nfs4_preprocess_seqid_op(cstate,
>> 				       lock->lk_old_lock_seqid,
>> @@ -5704,7 +5711,7 @@ out:
>> 		    seqid_mutating_err(ntohl(status)))
>> 			lock_sop->lo_owner.so_seqid++;
>> 
>> -		up_write(&lock_stp->st_rwsem);
>> +		mutex_unlock(&lock_stp->st_mutex);
>> 
>> 		/*
>> 		 * If this is a new, never-before-used stateid, and we are
>> @@ -5874,7 +5881,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>> fput:
>> 	fput(filp);
>> put_stateid:
>> -	up_write(&stp->st_rwsem);
>> +	mutex_unlock(&stp->st_mutex);
>> 	nfs4_put_stid(&stp->st_stid);
>> out:
>> 	nfsd4_bump_seqid(cstate, status);
>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>> index 986e51e..64053ea 100644
>> --- a/fs/nfsd/state.h
>> +++ b/fs/nfsd/state.h
>> @@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
>> 	unsigned char			st_access_bmap;
>> 	unsigned char			st_deny_bmap;
>> 	struct nfs4_ol_stateid		*st_openstp;
>> -	struct rw_semaphore		st_rwsem;
>> +	struct mutex			st_mutex;
>> };
>> 
>> static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
>> -- 
>> 2.7.4


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 15:56                                         ` Oleg Drokin
@ 2016-06-14 18:46                                           ` J . Bruce Fields
  2016-06-15  2:19                                             ` Oleg Drokin
  0 siblings, 1 reply; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-14 18:46 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Jeff Layton, linux-nfs, linux-kernel

On Tue, Jun 14, 2016 at 11:56:20AM -0400, Oleg Drokin wrote:
> 
> On Jun 14, 2016, at 11:46 AM, J . Bruce Fields wrote:
> 
> > On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
> >> It used to be the case that state had an rwlock that was locked for write
> >> by downgrades, but for read for upgrades (opens). Well, the problem is
> >> if there are two competing opens for the same state, they step on
> >> each other toes potentially leading to leaking file descriptors
> >> from the state structure, since access mode is a bitmap only set once.
> >> 
> >> Extend the holding region around in nfsd4_process_open2() to avoid
> >> racing entry into nfs4_get_vfs_file().
> >> Make init_open_stateid() return with locked stateid to be unlocked
> >> by the caller.
> >> 
> >> Now this version held up pretty well in my testing for 24 hours.
> >> It still does not address the situation if during one of the racing
> >> nfs4_get_vfs_file() calls we are getting an error from one (first?)
> >> of them. This is to be addressed in a separate patch after having a
> >> solid reproducer (potentially using some fault injection).
> >> 
> >> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> >> ---
> >> fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
> >> fs/nfsd/state.h     |  2 +-
> >> 2 files changed, 28 insertions(+), 21 deletions(-)
> >> 
> >> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> >> index f5f82e1..fa5fb5a 100644
> >> --- a/fs/nfsd/nfs4state.c
> >> +++ b/fs/nfsd/nfs4state.c
> >> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> >> 	struct nfs4_openowner *oo = open->op_openowner;
> >> 	struct nfs4_ol_stateid *retstp = NULL;
> >> 
> >> +	/* We are moving these outside of the spinlocks to avoid the warnings */
> >> +	mutex_init(&stp->st_mutex);
> >> +	mutex_lock(&stp->st_mutex);
> > 
> > A mutex_init_locked() primitive might also be convenient here.
> 
> I know! I would be able to do it under spinlock then without moving this around too.
> 
> But alas, not only there is not one, mutex documentation states this is disallowed.

You're just talking about this comment?:

	 * It is not allowed to initialize an already locked mutex.

That's a weird comment.  You're probably right that what they meant was
something like "It is not allowed to initialize a mutex to locked
state".  But, I don't know, taken literally that comment doesn't make
sense (how could you even distinguish between an already-locked mutex
and an uninitialized mutex?), so maybe it'd be worth asking.

> > You could also take the two previous lines from the caller into this
> > function instead of passing in stp, that might simplify the code.
> > (Haven't checked.)
> 
> I am not really sure what do you mean here.
> These lines are moved from further away in this function )well, just the init, anyway).
> 
> Having half initialisation of stp here and half in the caller sounds kind of strange
> to me.

I was thinking of something like the following--so init_open_stateid
hides more of the details of the swapping.  Untested.  Does it look like
an improvement to you?

There's got to be a way to make this code a little less convoluted....

--b.

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index fa5fb5aa4847..41b59854c40f 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3480,13 +3480,15 @@ alloc_init_open_stateowner(unsigned int strhashval, struct nfsd4_open *open,
 }
 
 static struct nfs4_ol_stateid *
-init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
-		struct nfsd4_open *open)
+init_open_stateid(struct nfs4_file *fp, struct nfsd4_open *open)
 {
 
 	struct nfs4_openowner *oo = open->op_openowner;
 	struct nfs4_ol_stateid *retstp = NULL;
+	struct nfs4_ol_stateid *stp;
 
+	stp = open->op_stp;
+	open->op_stp = NULL;
 	/* We are moving these outside of the spinlocks to avoid the warnings */
 	mutex_init(&stp->st_mutex);
 	mutex_lock(&stp->st_mutex);
@@ -3512,9 +3514,12 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 out_unlock:
 	spin_unlock(&fp->fi_lock);
 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
-	if (retstp)
-		mutex_lock(&retstp->st_mutex);
-	return retstp;
+	if (retstp) {
+		nfs4_put_stid(&stp->st_stid);
+		stp = retstp;
+		mutex_lock(&stp->st_mutex);
+	}
+	return stp;
 }
 
 /*
@@ -4310,7 +4315,6 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	struct nfs4_client *cl = open->op_openowner->oo_owner.so_client;
 	struct nfs4_file *fp = NULL;
 	struct nfs4_ol_stateid *stp = NULL;
-	struct nfs4_ol_stateid *swapstp = NULL;
 	struct nfs4_delegation *dp = NULL;
 	__be32 status;
 
@@ -4347,16 +4351,9 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 			goto out;
 		}
 	} else {
-		stp = open->op_stp;
-		open->op_stp = NULL;
-		/*
-		 * init_open_stateid() either returns a locked stateid
-		 * it found, or initializes and locks the new one we passed in
-		 */
-		swapstp = init_open_stateid(stp, fp, open);
-		if (swapstp) {
-			nfs4_put_stid(&stp->st_stid);
-			stp = swapstp;
+		/* stp is returned locked: */
+		stp = init_open_stateid(fp, open);
+		if (stp->st_access_bmap == 0) {
 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
 						stp, open);
 			if (status) {


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 15:53                                         ` Oleg Drokin
@ 2016-06-14 18:50                                           ` J . Bruce Fields
  2016-06-14 22:52                                             ` Jeff Layton
  0 siblings, 1 reply; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-14 18:50 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Jeff Layton, linux-nfs, linux-kernel

On Tue, Jun 14, 2016 at 11:53:27AM -0400, Oleg Drokin wrote:
> 
> On Jun 14, 2016, at 11:38 AM, J . Bruce Fields wrote:
> 
> > On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
> >> It used to be the case that state had an rwlock that was locked for write
> >> by downgrades, but for read for upgrades (opens). Well, the problem is
> >> if there are two competing opens for the same state, they step on
> >> each other toes potentially leading to leaking file descriptors
> >> from the state structure, since access mode is a bitmap only set once.
> >> 
> >> Extend the holding region around in nfsd4_process_open2() to avoid
> >> racing entry into nfs4_get_vfs_file().
> >> Make init_open_stateid() return with locked stateid to be unlocked
> >> by the caller.
> >> 
> >> Now this version held up pretty well in my testing for 24 hours.
> >> It still does not address the situation if during one of the racing
> >> nfs4_get_vfs_file() calls we are getting an error from one (first?)
> >> of them. This is to be addressed in a separate patch after having a
> >> solid reproducer (potentially using some fault injection).
> >> 
> >> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> >> ---
> >> fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
> >> fs/nfsd/state.h     |  2 +-
> >> 2 files changed, 28 insertions(+), 21 deletions(-)
> >> 
> >> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> >> index f5f82e1..fa5fb5a 100644
> >> --- a/fs/nfsd/nfs4state.c
> >> +++ b/fs/nfsd/nfs4state.c
> >> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> >> 	struct nfs4_openowner *oo = open->op_openowner;
> >> 	struct nfs4_ol_stateid *retstp = NULL;
> >> 
> >> +	/* We are moving these outside of the spinlocks to avoid the warnings */
> >> +	mutex_init(&stp->st_mutex);
> >> +	mutex_lock(&stp->st_mutex);
> >> +
> >> 	spin_lock(&oo->oo_owner.so_client->cl_lock);
> >> 	spin_lock(&fp->fi_lock);
> >> 
> >> @@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> >> 	stp->st_access_bmap = 0;
> >> 	stp->st_deny_bmap = 0;
> >> 	stp->st_openstp = NULL;
> >> -	init_rwsem(&stp->st_rwsem);
> >> 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
> >> 	list_add(&stp->st_perfile, &fp->fi_stateids);
> >> 
> >> out_unlock:
> >> 	spin_unlock(&fp->fi_lock);
> >> 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
> >> +	if (retstp)
> >> +		mutex_lock(&retstp->st_mutex);
> >> 	return retstp;
> > 
> > You're returning with both stp->st_mutex and retstp->st_mutex locked.
> > Did you mean to drop that first lock in the (retstp) case, or am I
> > missing something?
> 
> Well, I think it's ok (perhaps worthy of a comment) it's that if we matched a different
> retstp state, then stp is not used and either released right away or even
> if reused, it would be reinitialized in another call to init_open_stateid(),
> so it's fine?

Oh, I see, you're right.

Though I wouldn't have been surprised if that triggered some kind of
warning--I guess it's OK here, but typically if I saw a structure freed
that had a locked lock in it I'd be a little suspicious that somebody
made a mistake.

--b.
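The "suspicious free" above can be turned into a cheap debug check, much like the WARN_ON patch earlier in the thread: before freeing, trylock the embedded lock and flag the object if it is still held. Again a userspace pthread sketch with invented names, not kernel code:

```c
#include <assert.h>   /* used by the usage example below */
#include <pthread.h>
#include <stdlib.h>

struct stateid {
	pthread_mutex_t st_mutex;
	/* ... */
};

/* Free a stateid, reporting whether it was still locked — the analogue
 * of a WARN_ON in a free path like nfsd4_free_file_rcu().  A nonzero
 * return usually means a bug, though init_open_stateid() shows a
 * benign case: the discarded object was never visible to anyone else. */
static int free_stateid_checked(struct stateid *stp)
{
	int still_locked = (pthread_mutex_trylock(&stp->st_mutex) != 0);

	if (!still_locked) {
		pthread_mutex_unlock(&stp->st_mutex);
		pthread_mutex_destroy(&stp->st_mutex);
	}
	/* Freed either way, mirroring kmem_cache_free(); destroying a
	 * still-locked pthread mutex is undefined, so we skip it. */
	free(stp);
	return still_locked;
}
```

The caller can then WARN (or assert, in a test harness) on a nonzero return.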


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 18:50                                           ` J . Bruce Fields
@ 2016-06-14 22:52                                             ` Jeff Layton
  2016-06-14 22:54                                               ` Oleg Drokin
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff Layton @ 2016-06-14 22:52 UTC (permalink / raw)
  To: J . Bruce Fields, Oleg Drokin; +Cc: linux-nfs, linux-kernel

On Tue, 2016-06-14 at 14:50 -0400, J . Bruce Fields wrote:
> On Tue, Jun 14, 2016 at 11:53:27AM -0400, Oleg Drokin wrote:
> > 
> > 
> > On Jun 14, 2016, at 11:38 AM, J . Bruce Fields wrote:
> > 
> > > 
> > > On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
> > > > 
> > > > It used to be the case that state had an rwlock that was locked for write
> > > > by downgrades, but for read for upgrades (opens). Well, the problem is
> > > > if there are two competing opens for the same state, they step on
> > > > each other toes potentially leading to leaking file descriptors
> > > > from the state structure, since access mode is a bitmap only set once.
> > > > 
> > > > Extend the holding region around in nfsd4_process_open2() to avoid
> > > > racing entry into nfs4_get_vfs_file().
> > > > Make init_open_stateid() return with locked stateid to be unlocked
> > > > by the caller.
> > > > 
> > > > Now this version held up pretty well in my testing for 24 hours.
> > > > It still does not address the situation if during one of the racing
> > > > nfs4_get_vfs_file() calls we are getting an error from one (first?)
> > > > of them. This is to be addressed in a separate patch after having a
> > > > solid reproducer (potentially using some fault injection).
> > > > 
> > > > Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> > > > ---
> > > > fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
> > > > fs/nfsd/state.h     |  2 +-
> > > > 2 files changed, 28 insertions(+), 21 deletions(-)
> > > > 
> > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > index f5f82e1..fa5fb5a 100644
> > > > --- a/fs/nfsd/nfs4state.c
> > > > +++ b/fs/nfsd/nfs4state.c
> > > > @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> > > > 	struct nfs4_openowner *oo = open->op_openowner;
> > > > 	struct nfs4_ol_stateid *retstp = NULL;
> > > > 
> > > > +	/* We are moving these outside of the spinlocks to avoid the warnings */
> > > > +	mutex_init(&stp->st_mutex);
> > > > +	mutex_lock(&stp->st_mutex);
> > > > +
> > > > 	spin_lock(&oo->oo_owner.so_client->cl_lock);
> > > > 	spin_lock(&fp->fi_lock);
> > > > 
> > > > @@ -3502,13 +3506,14 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> > > > 	stp->st_access_bmap = 0;
> > > > 	stp->st_deny_bmap = 0;
> > > > 	stp->st_openstp = NULL;
> > > > -	init_rwsem(&stp->st_rwsem);
> > > > 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
> > > > 	list_add(&stp->st_perfile, &fp->fi_stateids);
> > > > 
> > > > out_unlock:
> > > > 	spin_unlock(&fp->fi_lock);
> > > > 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
> > > > +	if (retstp)
> > > > +		mutex_lock(&retstp->st_mutex);
> > > > 	return retstp;
> > > You're returning with both stp->st_mutex and retstp->st_mutex locked.
> > > Did you mean to drop that first lock in the (retstp) case, or am I
> > > missing something?
> > Well, I think it's OK (perhaps worthy of a comment): if we matched a different
> > retstp state, then stp is not used; it is either released right away or,
> > if reused, reinitialized in another call to init_open_stateid(),
> > so it's fine?
> Oh, I see, you're right.
> 
> Though I wouldn't have been surprised if that triggered some kind of
> warning--I guess it's OK here, but typically if I saw a structure freed
> that had a locked lock in it I'd be a little suspicious that somebody
> made a mistake.
> 
> --b.

I think I'd still prefer to have it unlock the mutex in the event that
it's not going to use it after all. While that kind of thing is ok for
now, it's stuff like that that can turn into a subtle source of bugs
later.

Also, I think I'd be more comfortable with this being split into (at
least) two patches. Do one patch as a straight conversion from rwsem to
mutex, and then another that changes the code to take the mutex before
hashing the new stateid.

-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 22:52                                             ` Jeff Layton
@ 2016-06-14 22:54                                               ` Oleg Drokin
  2016-06-14 22:57                                                 ` Jeff Layton
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-14 22:54 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J . Bruce Fields, linux-nfs, linux-kernel


On Jun 14, 2016, at 6:52 PM, Jeff Layton wrote:
> I think I'd still prefer to have it unlock the mutex in the event that
> it's not going to use it after all. While that kind of thing is ok for
> now, it's stuff like that that can turn into a subtle source of bugs
> later.
> 
> Also, I think I'd be more comfortable with this being split into (at
> least) two patches. Do one patch as a straight conversion from rwsem to
> mutex, and then another that changes the code to take the mutex before
> hashing the new stateid.

Ok, I guess that could be arranged too.

And then there's Bruce's patch to pull more stuff into init_open_stateid().


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 22:54                                               ` Oleg Drokin
@ 2016-06-14 22:57                                                 ` Jeff Layton
  2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff Layton @ 2016-06-14 22:57 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: J . Bruce Fields, linux-nfs, linux-kernel

On Tue, 2016-06-14 at 18:54 -0400, Oleg Drokin wrote:
> On Jun 14, 2016, at 6:52 PM, Jeff Layton wrote:
> > 
> > I think I'd still prefer to have it unlock the mutex in the event that
> > it's not going to use it after all. While that kind of thing is ok for
> > now, it's stuff like that that can turn into a subtle source of bugs
> > later.
> > 
> > Also, I think I'd be more comfortable with this being split into (at
> > least) two patches. Do one patch as a straight conversion from rwsem to
> > mutex, and then another that changes the code to take the mutex before
> > hashing the new stateid.
> Ok, I guess that could be arranged too.
> 
> And then there's Bruce's patch to pull more stuff into init_open_stateid().

Yeah, that seems like a good idea. We _really_ need an effort to
simplify this code. OPEN handling is always messy, but the current code
is really much messier than it should be.

-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-14 18:46                                           ` J . Bruce Fields
@ 2016-06-15  2:19                                             ` Oleg Drokin
  2016-06-15 13:31                                               ` J . Bruce Fields
  0 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-15  2:19 UTC (permalink / raw)
  To: J . Bruce Fields; +Cc: Jeff Layton, linux-nfs, linux-kernel


On Jun 14, 2016, at 2:46 PM, J . Bruce Fields wrote:

> On Tue, Jun 14, 2016 at 11:56:20AM -0400, Oleg Drokin wrote:
>> 
>> On Jun 14, 2016, at 11:46 AM, J . Bruce Fields wrote:
>> 
>>> On Sun, Jun 12, 2016 at 09:26:27PM -0400, Oleg Drokin wrote:
>>>> It used to be the case that the state had an rwlock that was locked for write
>>>> by downgrades, but only for read by upgrades (opens). The problem is that
>>>> if there are two competing opens for the same state, they step on
>>>> each other's toes, potentially leading to leaked file descriptors
>>>> from the state structure, since the access mode is a bitmap only set once.
>>>> 
>>>> Extend the holding region around in nfsd4_process_open2() to avoid
>>>> racing entry into nfs4_get_vfs_file().
>>>> Make init_open_stateid() return with a locked stateid to be unlocked
>>>> by the caller.
>>>> 
>>>> Now this version held up pretty well in my testing for 24 hours.
>>>> It still does not address the situation if during one of the racing
>>>> nfs4_get_vfs_file() calls we are getting an error from one (first?)
>>>> of them. This is to be addressed in a separate patch after having a
>>>> solid reproducer (potentially using some fault injection).
>>>> 
>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>> ---
>>>> fs/nfsd/nfs4state.c | 47 +++++++++++++++++++++++++++--------------------
>>>> fs/nfsd/state.h     |  2 +-
>>>> 2 files changed, 28 insertions(+), 21 deletions(-)
>>>> 
>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>> index f5f82e1..fa5fb5a 100644
>>>> --- a/fs/nfsd/nfs4state.c
>>>> +++ b/fs/nfsd/nfs4state.c
>>>> @@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
>>>> 	struct nfs4_openowner *oo = open->op_openowner;
>>>> 	struct nfs4_ol_stateid *retstp = NULL;
>>>> 
>>>> +	/* We are moving these outside of the spinlocks to avoid the warnings */
>>>> +	mutex_init(&stp->st_mutex);
>>>> +	mutex_lock(&stp->st_mutex);
>>> 
>>> A mutex_init_locked() primitive might also be convenient here.
>> 
>> I know! I would then be able to do it under the spinlock, without moving this around.
>> 
>> But alas, not only is there not one, the mutex documentation states this is disallowed.
> 
> You're just talking about this comment?:
> 
> 	 * It is not allowed to initialize an already locked mutex.
> 
> That's a weird comment.  You're probably right that what they meant was
> something like "It is not allowed to initialize a mutex to locked
> state".  But, I don't know, taken literally that comment doesn't make
> sense (how could you even distinguish between an already-locked mutex
> and an uninitialized mutex?), so maybe it'd be worth asking.

I think this is because of the strict ownership tracking or something.
I guess I can ask.

>>> You could also take the two previous lines from the caller into this
>>> function instead of passing in stp, that might simplify the code.
>>> (Haven't checked.)
>> 
>> I am not really sure what you mean here.
>> These lines are moved from further away in this function (well, just the init, anyway).
>> 
>> Having half of the initialisation of stp here and half in the caller sounds kind of strange
>> to me.
> 
> I was thinking of something like the following--so init_open_stateid
> hides more of the details of the swapping.  Untested.  Does it look like
> an improvement to you?
> 
> There's got to be a way to make this code a little less convoluted....
> 
> --b.
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index fa5fb5aa4847..41b59854c40f 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3480,13 +3480,15 @@ alloc_init_open_stateowner(unsigned int strhashval, struct nfsd4_open *open,
> }
> 
> static struct nfs4_ol_stateid *
> -init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> -		struct nfsd4_open *open)
> +init_open_stateid(struct nfs4_file *fp, struct nfsd4_open *open)
> {
> 
> 	struct nfs4_openowner *oo = open->op_openowner;
> 	struct nfs4_ol_stateid *retstp = NULL;
> +	struct nfs4_ol_stateid *stp;
> 
> +	stp = open->op_stp;
> +	open->op_stp = NULL;
> 	/* We are moving these outside of the spinlocks to avoid the warnings */
> 	mutex_init(&stp->st_mutex);
> 	mutex_lock(&stp->st_mutex);
> @@ -3512,9 +3514,12 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> out_unlock:
> 	spin_unlock(&fp->fi_lock);
> 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
> -	if (retstp)
> -		mutex_lock(&retstp->st_mutex);
> -	return retstp;
> +	if (retstp) {
> +		nfs4_put_stid(&stp->st_stid);

So as I am trying to integrate this into my patchset,
do we really need this?
We don't if we take the other path and leave this one
hanging off the struct nfsd4_open (why do we need to
assign it NULL before the search?). I imagine we'd
save some free/realloc churn as well.

I assume struct nfsd4_open cannot be shared between threads?
Otherwise we have bigger problems at hand, like mutex_init() on a
mutex locked by another thread.

I'll try this theory I guess.


> +		stp = retstp;
> +		mutex_lock(&stp->st_mutex);
> +	}
> +	return stp;
> }
> 
> /*
> @@ -4310,7 +4315,6 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> 	struct nfs4_client *cl = open->op_openowner->oo_owner.so_client;
> 	struct nfs4_file *fp = NULL;
> 	struct nfs4_ol_stateid *stp = NULL;
> -	struct nfs4_ol_stateid *swapstp = NULL;
> 	struct nfs4_delegation *dp = NULL;
> 	__be32 status;
> 
> @@ -4347,16 +4351,9 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> 			goto out;
> 		}
> 	} else {
> -		stp = open->op_stp;
> -		open->op_stp = NULL;
> -		/*
> -		 * init_open_stateid() either returns a locked stateid
> -		 * it found, or initializes and locks the new one we passed in
> -		 */
> -		swapstp = init_open_stateid(stp, fp, open);
> -		if (swapstp) {
> -			nfs4_put_stid(&stp->st_stid);
> -			stp = swapstp;
> +		/* stp is returned locked: */
> +		stp = init_open_stateid(fp, open);
> +		if (stp->st_access_bmap == 0) {
> 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
> 						stp, open);
> 			if (status) {


* [PATCH 0/3] nfsd state handling fixes
  2016-06-14 22:57                                                 ` Jeff Layton
@ 2016-06-15  3:28                                                   ` Oleg Drokin
  2016-06-15  3:28                                                     ` [PATCH 1/3] nfsd: Always lock state exclusively Oleg Drokin
                                                                       ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-15  3:28 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

These three patches do the much discussed job of making nfsd state handling
more robust in the face of races where several opens arrive for the same file
at the same time from the same client.

This does not yet handle a case when one of those opens gets an error
and others don't.

Also this is undergoing testing ATM, so please only use it as a
discussion/review piece for now.

Oleg Drokin (3):
  nfsd: Always lock state exclusively.
  nfsd: Extend the mutex holding region around in nfsd4_process_open2()
  nfsd: Make init_open_stateid() a bit more whole

 fs/nfsd/nfs4state.c | 67 +++++++++++++++++++++++++++++------------------------
 fs/nfsd/state.h     |  2 +-
 2 files changed, 38 insertions(+), 31 deletions(-)

-- 
2.7.4


* [PATCH 1/3] nfsd: Always lock state exclusively.
  2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
@ 2016-06-15  3:28                                                     ` Oleg Drokin
  2016-06-15  3:28                                                     ` [PATCH 2/3] nfsd: Extend the mutex holding region around in nfsd4_process_open2() Oleg Drokin
                                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-15  3:28 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

It used to be the case that the state had an rwlock that was locked for write
by downgrades, but only for read by upgrades (opens). The problem is that
if there are two competing opens for the same state, they step on
each other's toes, potentially leading to leaked file descriptors
from the state structure, since the access mode is a bitmap only set once.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
 fs/nfsd/nfs4state.c | 40 ++++++++++++++++++++--------------------
 fs/nfsd/state.h     |  2 +-
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index f5f82e1..c927d36 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3502,7 +3502,7 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = 0;
 	stp->st_openstp = NULL;
-	init_rwsem(&stp->st_rwsem);
+	mutex_init(&stp->st_mutex);
 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
 	list_add(&stp->st_perfile, &fp->fi_stateids);
 
@@ -4335,10 +4335,10 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	 */
 	if (stp) {
 		/* Stateid was found, this is an OPEN upgrade */
-		down_read(&stp->st_rwsem);
+		mutex_lock(&stp->st_mutex);
 		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
 		if (status) {
-			up_read(&stp->st_rwsem);
+			mutex_unlock(&stp->st_mutex);
 			goto out;
 		}
 	} else {
@@ -4348,19 +4348,19 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 		if (swapstp) {
 			nfs4_put_stid(&stp->st_stid);
 			stp = swapstp;
-			down_read(&stp->st_rwsem);
+			mutex_lock(&stp->st_mutex);
 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
 						stp, open);
 			if (status) {
-				up_read(&stp->st_rwsem);
+				mutex_unlock(&stp->st_mutex);
 				goto out;
 			}
 			goto upgrade_out;
 		}
-		down_read(&stp->st_rwsem);
+		mutex_lock(&stp->st_mutex);
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
 		if (status) {
-			up_read(&stp->st_rwsem);
+			mutex_unlock(&stp->st_mutex);
 			release_open_stateid(stp);
 			goto out;
 		}
@@ -4372,7 +4372,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	}
 upgrade_out:
 	nfs4_inc_and_copy_stateid(&open->op_stateid, &stp->st_stid);
-	up_read(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 
 	if (nfsd4_has_session(&resp->cstate)) {
 		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
@@ -4977,12 +4977,12 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
 		 * revoked delegations are kept only for free_stateid.
 		 */
 		return nfserr_bad_stateid;
-	down_write(&stp->st_rwsem);
+	mutex_lock(&stp->st_mutex);
 	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
 	if (status == nfs_ok)
 		status = nfs4_check_fh(current_fh, &stp->st_stid);
 	if (status != nfs_ok)
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 	return status;
 }
 
@@ -5030,7 +5030,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
 		return status;
 	oo = openowner(stp->st_stateowner);
 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 		nfs4_put_stid(&stp->st_stid);
 		return nfserr_bad_stateid;
 	}
@@ -5062,12 +5062,12 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	oo = openowner(stp->st_stateowner);
 	status = nfserr_bad_stateid;
 	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
-		up_write(&stp->st_rwsem);
+		mutex_unlock(&stp->st_mutex);
 		goto put_stateid;
 	}
 	oo->oo_flags |= NFS4_OO_CONFIRMED;
 	nfs4_inc_and_copy_stateid(&oc->oc_resp_stateid, &stp->st_stid);
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
 		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
 
@@ -5143,7 +5143,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
 	nfs4_inc_and_copy_stateid(&od->od_stateid, &stp->st_stid);
 	status = nfs_ok;
 put_stateid:
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
@@ -5196,7 +5196,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	if (status)
 		goto out; 
 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 
 	nfsd4_close_open_stateid(stp);
 
@@ -5422,7 +5422,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = open_stp->st_deny_bmap;
 	stp->st_openstp = open_stp;
-	init_rwsem(&stp->st_rwsem);
+	mutex_init(&stp->st_mutex);
 	list_add(&stp->st_locks, &open_stp->st_locks);
 	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
 	spin_lock(&fp->fi_lock);
@@ -5591,7 +5591,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 					&open_stp, nn);
 		if (status)
 			goto out;
-		up_write(&open_stp->st_rwsem);
+		mutex_unlock(&open_stp->st_mutex);
 		open_sop = openowner(open_stp->st_stateowner);
 		status = nfserr_bad_stateid;
 		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
@@ -5600,7 +5600,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		status = lookup_or_create_lock_state(cstate, open_stp, lock,
 							&lock_stp, &new);
 		if (status == nfs_ok)
-			down_write(&lock_stp->st_rwsem);
+			mutex_lock(&lock_stp->st_mutex);
 	} else {
 		status = nfs4_preprocess_seqid_op(cstate,
 				       lock->lk_old_lock_seqid,
@@ -5704,7 +5704,7 @@ out:
 		    seqid_mutating_err(ntohl(status)))
 			lock_sop->lo_owner.so_seqid++;
 
-		up_write(&lock_stp->st_rwsem);
+		mutex_unlock(&lock_stp->st_mutex);
 
 		/*
 		 * If this is a new, never-before-used stateid, and we are
@@ -5874,7 +5874,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 fput:
 	fput(filp);
 put_stateid:
-	up_write(&stp->st_rwsem);
+	mutex_unlock(&stp->st_mutex);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 986e51e..64053ea 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -535,7 +535,7 @@ struct nfs4_ol_stateid {
 	unsigned char			st_access_bmap;
 	unsigned char			st_deny_bmap;
 	struct nfs4_ol_stateid		*st_openstp;
-	struct rw_semaphore		st_rwsem;
+	struct mutex			st_mutex;
 };
 
 static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
-- 
2.7.4


* [PATCH 2/3] nfsd: Extend the mutex holding region around in nfsd4_process_open2()
  2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
  2016-06-15  3:28                                                     ` [PATCH 1/3] nfsd: Always lock state exclusively Oleg Drokin
@ 2016-06-15  3:28                                                     ` Oleg Drokin
  2016-06-15  3:28                                                     ` [PATCH 3/3] nfsd: Make init_open_stateid() a bit more whole Oleg Drokin
  2016-06-16  1:54                                                     ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
  3 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-15  3:28 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

Extend the mutex holding region to avoid racing entry into
nfs4_get_vfs_file(), and make init_open_stateid() return with a locked
stateid to be unlocked by the caller.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
 fs/nfsd/nfs4state.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index c927d36..94854a0 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3487,6 +3487,10 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	struct nfs4_openowner *oo = open->op_openowner;
 	struct nfs4_ol_stateid *retstp = NULL;
 
+	/* We are moving these outside of the spinlocks to avoid the warnings */
+	mutex_init(&stp->st_mutex);
+	mutex_lock(&stp->st_mutex);
+
 	spin_lock(&oo->oo_owner.so_client->cl_lock);
 	spin_lock(&fp->fi_lock);
 
@@ -3502,13 +3506,17 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = 0;
 	stp->st_openstp = NULL;
-	mutex_init(&stp->st_mutex);
 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
 	list_add(&stp->st_perfile, &fp->fi_stateids);
 
 out_unlock:
 	spin_unlock(&fp->fi_lock);
 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
+	if (retstp) {
+		mutex_lock(&retstp->st_mutex);
+		/* Not that we need to, just for neatness */
+		mutex_unlock(&stp->st_mutex);
+	}
 	return retstp;
 }
 
@@ -4344,11 +4352,14 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	} else {
 		stp = open->op_stp;
 		open->op_stp = NULL;
+		/*
+		 * init_open_stateid() either returns a locked stateid
+		 * it found, or initializes and locks the new one we passed in
+		 */
 		swapstp = init_open_stateid(stp, fp, open);
 		if (swapstp) {
 			nfs4_put_stid(&stp->st_stid);
 			stp = swapstp;
-			mutex_lock(&stp->st_mutex);
 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
 						stp, open);
 			if (status) {
@@ -4357,7 +4368,6 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 			}
 			goto upgrade_out;
 		}
-		mutex_lock(&stp->st_mutex);
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
 		if (status) {
 			mutex_unlock(&stp->st_mutex);
-- 
2.7.4


* [PATCH 3/3] nfsd: Make init_open_stateid() a bit more whole
  2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
  2016-06-15  3:28                                                     ` [PATCH 1/3] nfsd: Always lock state exclusively Oleg Drokin
  2016-06-15  3:28                                                     ` [PATCH 2/3] nfsd: Extend the mutex holding region around in nfsd4_process_open2() Oleg Drokin
@ 2016-06-15  3:28                                                     ` Oleg Drokin
  2016-06-16  1:54                                                     ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
  3 siblings, 0 replies; 60+ messages in thread
From: Oleg Drokin @ 2016-06-15  3:28 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel, Oleg Drokin

Move the state selection logic from the caller into init_open_stateid(),
making it always return the correct stp to use.

Signed-off-by: J . Bruce Fields <bfields@fieldses.org>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
 fs/nfsd/nfs4state.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 94854a0..70d0b9b 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3480,13 +3480,14 @@ alloc_init_open_stateowner(unsigned int strhashval, struct nfsd4_open *open,
 }
 
 static struct nfs4_ol_stateid *
-init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
-		struct nfsd4_open *open)
+init_open_stateid(struct nfs4_file *fp, struct nfsd4_open *open)
 {
 
 	struct nfs4_openowner *oo = open->op_openowner;
 	struct nfs4_ol_stateid *retstp = NULL;
+	struct nfs4_ol_stateid *stp;
 
+	stp = open->op_stp;
 	/* We are moving these outside of the spinlocks to avoid the warnings */
 	mutex_init(&stp->st_mutex);
 	mutex_lock(&stp->st_mutex);
@@ -3497,6 +3498,8 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	retstp = nfsd4_find_existing_open(fp, open);
 	if (retstp)
 		goto out_unlock;
+
+	open->op_stp = NULL;
 	atomic_inc(&stp->st_stid.sc_count);
 	stp->st_stid.sc_type = NFS4_OPEN_STID;
 	INIT_LIST_HEAD(&stp->st_locks);
@@ -3514,10 +3517,11 @@ out_unlock:
 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
 	if (retstp) {
 		mutex_lock(&retstp->st_mutex);
-		/* Not that we need to, just for neatness */
+		/* To keep mutex tracking happy */
 		mutex_unlock(&stp->st_mutex);
+		stp = retstp;
 	}
-	return retstp;
+	return stp;
 }
 
 /*
@@ -4313,7 +4317,6 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	struct nfs4_client *cl = open->op_openowner->oo_owner.so_client;
 	struct nfs4_file *fp = NULL;
 	struct nfs4_ol_stateid *stp = NULL;
-	struct nfs4_ol_stateid *swapstp = NULL;
 	struct nfs4_delegation *dp = NULL;
 	__be32 status;
 
@@ -4350,16 +4353,10 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 			goto out;
 		}
 	} else {
-		stp = open->op_stp;
-		open->op_stp = NULL;
-		/*
-		 * init_open_stateid() either returns a locked stateid
-		 * it found, or initializes and locks the new one we passed in
-		 */
-		swapstp = init_open_stateid(stp, fp, open);
-		if (swapstp) {
-			nfs4_put_stid(&stp->st_stid);
-			stp = swapstp;
+		/* stp is returned locked. */
+		stp = init_open_stateid(fp, open);
+		/* See if we lost the race to some other thread */
+		if (stp->st_access_bmap != 0) {
 			status = nfs4_upgrade_open(rqstp, fp, current_fh,
 						stp, open);
 			if (status) {
-- 
2.7.4


* Re: [PATCH v2] nfsd: Always lock state exclusively.
  2016-06-15  2:19                                             ` Oleg Drokin
@ 2016-06-15 13:31                                               ` J . Bruce Fields
  0 siblings, 0 replies; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-15 13:31 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Jeff Layton, linux-nfs, linux-kernel

On Tue, Jun 14, 2016 at 10:19:49PM -0400, Oleg Drokin wrote:
> On Jun 14, 2016, at 2:46 PM, J . Bruce Fields wrote:
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index fa5fb5aa4847..41b59854c40f 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -3480,13 +3480,15 @@ alloc_init_open_stateowner(unsigned int strhashval, struct nfsd4_open *open,
> > }
> > 
> > static struct nfs4_ol_stateid *
> > -init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> > -		struct nfsd4_open *open)
> > +init_open_stateid(struct nfs4_file *fp, struct nfsd4_open *open)
> > {
> > 
> > 	struct nfs4_openowner *oo = open->op_openowner;
> > 	struct nfs4_ol_stateid *retstp = NULL;
> > +	struct nfs4_ol_stateid *stp;
> > 
> > +	stp = open->op_stp;
> > +	open->op_stp = NULL;
> > 	/* We are moving these outside of the spinlocks to avoid the warnings */
> > 	mutex_init(&stp->st_mutex);
> > 	mutex_lock(&stp->st_mutex);
> > @@ -3512,9 +3514,12 @@ init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> > out_unlock:
> > 	spin_unlock(&fp->fi_lock);
> > 	spin_unlock(&oo->oo_owner.so_client->cl_lock);
> > -	if (retstp)
> > -		mutex_lock(&retstp->st_mutex);
> > -	return retstp;
> > +	if (retstp) {
> > +		nfs4_put_stid(&stp->st_stid);
> 
> So as I am trying to integrate this into my patchset,
> do we really need this?
> We don't if we take the other path and leave this one
> hanging off the struct nfsd4_open (why do we need to
> assign it NULL before the search?). I imagine we'd
> save some free/realloc churn as well.

Yes, good idea.

> I assume struct nfsd4_open cannot be shared between threads?

Right.

> Otherwise we have bigger problems at hand, like mutex_init() on a
> mutex locked by another thread.
> 
> I'll try this theory I guess.

Sounds good!

--b.


* Re: [PATCH 0/3] nfsd state handling fixes
  2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
                                                                       ` (2 preceding siblings ...)
  2016-06-15  3:28                                                     ` [PATCH 3/3] nfsd: Make init_open_stateid() a bit more whole Oleg Drokin
@ 2016-06-16  1:54                                                     ` Oleg Drokin
  2016-06-16  2:07                                                       ` J . Bruce Fields
  3 siblings, 1 reply; 60+ messages in thread
From: Oleg Drokin @ 2016-06-16  1:54 UTC (permalink / raw)
  To: Jeff Layton, J . Bruce Fields; +Cc: linux-nfs, linux-kernel


On Jun 14, 2016, at 11:28 PM, Oleg Drokin wrote:

> These three patches do the much discussed job of making nfsd state handling
> more robust in the face of races where several opens arrive for the same file
> at the same time from the same client.
> 
> This does not yet handle a case when one of those opens gets an error
> and others don't.
> 
> Also this is undergoing testing ATM, so please only use it as a
> discussion/review piece for now.

With 24 hours in testing and no problems encountered, I guess
it's safe to declare this patchset good to go, if nobody has any
objections to it.

> 
> Oleg Drokin (3):
>  nfsd: Always lock state exclusively.
>  nfsd: Extend the mutex holding region around in nfsd4_process_open2()
>  nfsd: Make init_open_stateid() a bit more whole
> 
> fs/nfsd/nfs4state.c | 67 +++++++++++++++++++++++++++++------------------------
> fs/nfsd/state.h     |  2 +-
> 2 files changed, 38 insertions(+), 31 deletions(-)
> 
> -- 
> 2.7.4


* Re: [PATCH 0/3] nfsd state handling fixes
  2016-06-16  1:54                                                     ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
@ 2016-06-16  2:07                                                       ` J . Bruce Fields
  0 siblings, 0 replies; 60+ messages in thread
From: J . Bruce Fields @ 2016-06-16  2:07 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Jeff Layton, linux-nfs, linux-kernel

On Wed, Jun 15, 2016 at 09:54:28PM -0400, Oleg Drokin wrote:
> 
> On Jun 14, 2016, at 11:28 PM, Oleg Drokin wrote:
> 
> > These three patches do the much-discussed job of making nfsd state handling
> > more robust in the face of races where several opens for the same file
> > arrive at the same time from the same client.
> > 
> > This does not yet handle the case where one of those opens gets an error
> > and the others do not.
> > 
> > Also, this is still undergoing testing, so please treat it only as a
> > discussion/review piece for now.
> 
> With 24 hours of testing and no problems encountered, I guess
> it's safe to declare this patchset good to go if nobody has any
> objections.

Great, thanks for the testing, and all the work tracking this down;
committing for 4.7 (and the first two for stable).

--b.

> 
> > 
> > Oleg Drokin (3):
> >  nfsd: Always lock state exclusively.
> >  nfsd: Extend the mutex holding region around in nfsd4_process_open2()
> >  nfsd: Make init_open_stateid() a bit more whole
> > 
> > fs/nfsd/nfs4state.c | 67 +++++++++++++++++++++++++++++------------------------
> > fs/nfsd/state.h     |  2 +-
> > 2 files changed, 38 insertions(+), 31 deletions(-)
> > 
> > -- 
> > 2.7.4


end of thread, other threads:[~2016-06-16  2:07 UTC | newest]

Thread overview: 60+ messages
2016-06-07 15:37 Files leak from nfsd in 4.7.1-rc1 (and more?) Oleg Drokin
2016-06-07 17:10 ` Jeff Layton
2016-06-07 17:30   ` Oleg Drokin
2016-06-07 20:04     ` Jeff Layton
2016-06-07 23:39       ` Oleg Drokin
2016-06-08  0:03         ` Jeff Layton
2016-06-08  0:46           ` Oleg Drokin
2016-06-08  2:22           ` Oleg Drokin
2016-06-08  3:55             ` Oleg Drokin
2016-06-08 10:58             ` Jeff Layton
2016-06-08 14:44               ` Oleg Drokin
2016-06-08 16:10               ` Oleg Drokin
2016-06-08 17:22                 ` Jeff Layton
2016-06-08 17:37                   ` Oleg Drokin
2016-06-09  2:55                   ` [PATCH] nfsd: Always lock state exclusively Oleg Drokin
2016-06-09 10:13                     ` Jeff Layton
2016-06-09 21:01                   ` [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file Oleg Drokin
2016-06-10  4:18                     ` Oleg Drokin
2016-06-10 10:50                       ` Jeff Layton
2016-06-10 20:55                         ` J . Bruce Fields
2016-06-11 15:41                           ` Oleg Drokin
2016-06-12  1:33                             ` Jeff Layton
2016-06-12  2:06                               ` Oleg Drokin
2016-06-12  2:50                                 ` Jeff Layton
2016-06-12  3:15                                   ` Oleg Drokin
2016-06-12 13:13                                     ` Jeff Layton
2016-06-13  1:26                                     ` [PATCH v2] nfsd: Always lock state exclusively Oleg Drokin
2016-06-14 15:38                                       ` J . Bruce Fields
2016-06-14 15:53                                         ` Oleg Drokin
2016-06-14 18:50                                           ` J . Bruce Fields
2016-06-14 22:52                                             ` Jeff Layton
2016-06-14 22:54                                               ` Oleg Drokin
2016-06-14 22:57                                                 ` Jeff Layton
2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
2016-06-15  3:28                                                     ` [PATCH 1/3] nfsd: Always lock state exclusively Oleg Drokin
2016-06-15  3:28                                                     ` [PATCH 2/3] nfsd: Extend the mutex holding region around in nfsd4_process_open2() Oleg Drokin
2016-06-15  3:28                                                     ` [PATCH 3/3] nfsd: Make init_open_stateid() a bit more whole Oleg Drokin
2016-06-16  1:54                                                     ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
2016-06-16  2:07                                                       ` J . Bruce Fields
2016-06-14 15:46                                       ` [PATCH v2] nfsd: Always lock state exclusively J . Bruce Fields
2016-06-14 15:56                                         ` Oleg Drokin
2016-06-14 18:46                                           ` J . Bruce Fields
2016-06-15  2:19                                             ` Oleg Drokin
2016-06-15 13:31                                               ` J . Bruce Fields
2016-06-09 12:13               ` Files leak from nfsd in 4.7.1-rc1 (and more?) Andrew W Elble
