On Tue, Aug 21 2018, Jeff Layton wrote:

> On Tue, 2018-08-21 at 15:11 +1000, NeilBrown wrote:
>> On Thu, Aug 16 2018, NeilBrown wrote:
>> 
>> > On Wed, Aug 15 2018, Jeff Layton wrote:
>> > 
>> > > On Wed, 2018-08-15 at 14:28 +0200, Krzysztof Kozlowski wrote:
>> > > > Hi,
>> > > > 
>> > > > Bisect pointed to commit ce3147990450a68b3f549088b30f087742a08b5d
>> > > > ("fs/locks: allow a lock request to block other requests.") as the cause
>> > > > of boot failures with NFSv4 root on several boards.
>> > > > 
>> > > > Log is here:
>> > > > https://krzk.eu/#/builders/21/builds/836/steps/12/logs/serial0
>> > > > 
>> > > > With several errors:
>> > > > kernel BUG at ../fs/locks.c:336!
>> > > > Unable to handle kernel NULL pointer dereference at virtual address 00000004
>> > > > 
>> > > > Configuration:
>> > > > 1. exynos_defconfig
>> > > > 2. Arch ARM Linux
>> > > > 3. Boards:
>> > > > a. Odroid family (ARMv7, octa-core (Cortex-A7+A15), Exynos5422 SoC)
>> > > > b. Toradex Colibri VF50 (ARMv7, UP, Cortex-A5)
>> > > > 4. Systemd: v236, 238
>> > > > 5. All boards boot from TFTP with NFS root (NFSv4)
>> > > > 
>> > > > On Colibri VF50 I got slightly different errors:
>> > > > [   11.663204] Internal error: Oops - undefined instruction: 0 [#1] ARM
>> > > > [   12.455273] Unable to handle kernel NULL pointer dereference at
>> > > > virtual address 00000004
>> > > > and only with some specific GCC (v6.3) or with other conditions which
>> > > > I did not bisect yet. Maybe Colibri's failure is unrelated to that
>> > > > commit.
>> > > > 
>> > > > Best regards,
>> > > > Krzysztof
>> > 
>> > Thanks a lot for the report Krzysztof!!
>> > 
>> > > 
>> > > The BUG is due to a lock being freed when the fl_blocked list wasn't
>> > > empty (implying that there were still blocked locks waiting on it).
>> > > 
>> > > There are a number of calls to locks_delete_lock_ctx in posix_lock_inode
>> > > and I don't think the fl_blocked list is being handled properly with all
>> > > of them. It only transplants the blocked locks to a new lock when there
>> > > are surviving locks on the list, and that may not be the case when the
>> > > whole file is being unlocked.
>> > 
>> > locks_delete_lock_ctx() calls locks_unlink_lock_ctx() which calls
>> > locks_wake_up_block() which doesn't only wake up the blocks, but also
>> > detaches them. When that function completes, ->fl_blocked must be empty.
>> > 
>> > The trace shows the locks_free_lock() call at the end of fcntl_setlk64()
>> > as the problematic call.
>> > This suggests that do_lock_file_wait() exited with ->fl_blocked
>> > non-empty, which it shouldn't.
>> > 
>> > I think we need to insert a call to locks_wake_up_block() in
>> > do_lock_file_wait() before it returns.
>> > I cannot find a sequence that would make this necessary, but
>> > it isn't surprising that there might be one.
>> > 
>> > I'll dig through the code a bit more later and make sure I understand
>> > what is happening.
>> > 
>> 
>> I think this problem is fixed by the following.  It is probably
>> triggered when the owner already has a lock for part of the requested
>> range.  After waiting for some other lock, the pending request gets
>> merged with the existing lock, and blocked requests aren't moved across
>> in that case.
>> 
>> I still haven't done more testing, so this is just FYI, not a
>> submission.
>> 
>> Thanks,
>> NeilBrown
>> 
>> From: NeilBrown <neilb@suse.com>
>> Date: Tue, 21 Aug 2018 15:09:06 +1000
>> Subject: [PATCH] fs/locks: always delete_block after waiting.
>> 
>> Now that requests can block other requests, we
>> need to be careful to always clean up those blocked
>> requests.
>> Any time that we wait for a request, we might have
>> other requests attached, and when we stop waiting,
>> we must clean them up.
>> If the lock was granted, the requests might have been
>> moved to the new lock, though when merged with a
>> pre-existing lock, this might not happen.
>> In all cases we don't want blocked locks to remain
>> attached, so we remove them to be safe.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  fs/locks.c | 24 +++++++++---------------
>>  1 file changed, 9 insertions(+), 15 deletions(-)
>> 
>> diff --git a/fs/locks.c b/fs/locks.c
>> index de38bafb7f7b..6b310112cf3b 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -1276,12 +1276,10 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
>>  		if (error != FILE_LOCK_DEFERRED)
>>  			break;
>>  		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
>> -		if (!error)
>> -			continue;
>> -
>> -		locks_delete_block(fl);
>> -		break;
>> +		if (error)
>> +			break;
>>  	}
>> +	locks_delete_block(fl);
>>  	return error;
>>  }
>>  
>> @@ -1971,12 +1969,10 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
>>  		if (error != FILE_LOCK_DEFERRED)
>>  			break;
>>  		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
>> -		if (!error)
>> -			continue;
>> -
>> -		locks_delete_block(fl);
>> -		break;
>> +		if (error)
>> +			break;
>>  	}
>> +	locks_delete_block(fl);
>>  	return error;
>>  }
>>  
>> @@ -2250,12 +2246,10 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
>>  		if (error != FILE_LOCK_DEFERRED)
>>  			break;
>>  		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
>> -		if (!error)
>> -			continue;
>> -
>> -		locks_delete_block(fl);
>> -		break;
>> +		if (error)
>> +			break;
>>  	}
>> +	locks_delete_block(fl);
>>  
>>  	return error;
>>  }
>
> Thanks Neil.
>
> FWIW, I was able to reproduce something that looked a lot like what
> Krzysztof reported by running the cthon04 lock tests on a client running
> the kernel with the original set.
>
> I applied the above patch on top of that set, reran the test and got a
> different BUG (list corruption):
>
> [   85.117307] ------------[ cut here ]------------
> [   85.118130] kernel BUG at lib/list_debug.c:53!
> [   85.118684] invalid opcode: 0000 [#1] SMP NOPTI
> [   85.119800] CPU: 5 PID: 92 Comm: kworker/u16:1 Not tainted 4.18.0+ #46
> [   85.120845] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
> [   85.122350] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [   85.123242] RIP: 0010:__list_del_entry_valid.cold.1+0x34/0x4c
> [   85.124116] Code: 03 10 9b e8 08 07 ca ff 0f 0b 48 c7 c7 c8 03 10 9b e8 fa 06 ca ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 88 03 10 9b e8 e6 06 ca ff <0f> 0b 48 89 fe 48 c7 c7 50 03 10 9b e8 d5 06 ca ff 0f 0b 90 90 90
> [   85.126704] RSP: 0018:ffffa0fe0133bd90 EFLAGS: 00010246
> [   85.127382] RAX: 0000000000000054 RBX: ffff92bcf3a46ad8 RCX: 0000000000000000
> [   85.128322] RDX: 0000000000000000 RSI: ffff92bcffd56828 RDI: ffff92bcffd56828
> [   85.129251] RBP: ffff92bcf3a46b10 R08: 0000000000000000 R09: 0000000000aaaaaa
> [   85.130250] R10: 0000000000000000 R11: 0000000000000001 R12: ffff92bce2588618
> [   85.131230] R13: ffff92bcf3a45800 R14: ffffffffc06f9f60 R15: ffffffffc06f9f60
> [   85.132191] FS:  0000000000000000(0000) GS:ffff92bcffd40000(0000) knlGS:0000000000000000
> [   85.133296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   85.134088] CR2: 00007f16c1292008 CR3: 0000000138588000 CR4: 00000000000006e0
> [   85.134926] Call Trace:
> [   85.135251]  __locks_delete_block+0x3f/0x70
> [   85.135751]  locks_delete_block+0x25/0x30
> [   85.136259]  locks_lock_inode_wait+0x63/0x150
> [   85.136841]  ? nfs41_release_slot+0x98/0xd0 [nfsv4]
> [   85.137556]  nfs4_lock_done+0x1a2/0x1c0 [nfsv4]
> [   85.138272]  rpc_exit_task+0x2d/0x80 [sunrpc]
> [   85.138994]  __rpc_execute+0x7f/0x340 [sunrpc]
> [   85.139953]  process_one_work+0x1a1/0x350
> [   85.140678]  worker_thread+0x30/0x380
> [   85.141800]  ? wq_update_unbound_numa+0x1a0/0x1a0
> [   85.142904]  kthread+0x112/0x130
> [   85.143445]  ? kthread_create_worker_on_cpu+0x70/0x70
> [   85.144273]  ret_from_fork+0x22/0x40
> [   85.144859] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xt_conntrack nf_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_raw ip6table_security iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc i2c_piix4 edac_mce_amd joydev pcspkr virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover virtio_blk floppy qxl serio_raw drm_kms_helper ttm qemu_fw_cfg drm virtio_pci ata_generic pata_acpi virtio_rng virtio_ring virtio
> [   85.153426] ---[ end trace 63df06139208ee23 ]---
>

Oh dear.
nfs4_alloc_lockdata contains:
	memcpy(&p->fl, fl, sizeof(p->fl));

so any list_heads that are valid in fl will be invalid in p->fl.
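
To illustrate the memcpy() problem in isolation, here is a minimal userspace
sketch. It is not kernel code: struct list_head is re-declared locally, and
struct demo_lock, init_list_head(), list_is_empty() and copy_is_stale() are
hypothetical names made up for this example. It only shows that an empty,
self-linked list head stops being self-linked once it is copied byte for byte.

```c
#include <assert.h>
#include <string.h>

/* Minimal stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list head points at itself in both directions. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Hypothetical, cut-down lock: only the embedded list heads matter here. */
struct demo_lock {
	struct list_head blocked;	/* requests waiting on this lock */
	struct list_head block;		/* our link in a blocker's list */
};

/*
 * Copy a lock with memcpy() and check the copy's list head.
 * Returns 1 if the copy still points back at the original
 * (and therefore no longer looks empty), 0 otherwise.
 */
static int copy_is_stale(void)
{
	struct demo_lock orig, copy;

	init_list_head(&orig.blocked);
	init_list_head(&orig.block);
	assert(list_is_empty(&orig.blocked));

	memcpy(&copy, &orig, sizeof(copy));

	/* copy.blocked.next is &orig.blocked, not &copy.blocked. */
	return copy.blocked.next == &orig.blocked &&
	       !list_is_empty(&copy.blocked);
}
```

Calling copy_is_stale() returns 1: the copy's "empty" list looks non-empty,
and any list_del()-style operation on it would touch the original's memory.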

Maybe I should initialize the relevant list_heads at the start of the wait
functions.
I should look more closely at what filesystems do with locks though.
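
As a rough userspace sketch of the "initialize the list_heads" idea (not a
proposed patch): struct list_head is re-declared locally, and struct
demo_lock, demo_lock_reinit_lists() and copy_is_clean_after_reinit() are
hypothetical names used only here. It assumes nothing is actually linked on
the lists at the time of the copy; if something were, re-initializing would
silently drop it.

```c
#include <assert.h>
#include <string.h>

/* Minimal stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Hypothetical, cut-down lock: only the embedded list heads matter here. */
struct demo_lock {
	struct list_head blocked;	/* requests waiting on this lock */
	struct list_head block;		/* our link in a blocker's list */
};

/*
 * Hypothetical helper: make every embedded list head self-linked again.
 * Only safe when the lists are known to be logically empty.
 */
static void demo_lock_reinit_lists(struct demo_lock *l)
{
	init_list_head(&l->blocked);
	init_list_head(&l->block);
}

/*
 * Copy a lock with memcpy(), re-initialize the copy's list heads,
 * and return 1 if all of them now point at the copy itself.
 */
static int copy_is_clean_after_reinit(void)
{
	struct demo_lock orig, copy;

	init_list_head(&orig.blocked);
	init_list_head(&orig.block);

	memcpy(&copy, &orig, sizeof(copy));
	demo_lock_reinit_lists(&copy);

	return copy.blocked.next == &copy.blocked &&
	       copy.blocked.prev == &copy.blocked &&
	       copy.block.next == &copy.block &&
	       copy.block.prev == &copy.block;
}
```

Here copy_is_clean_after_reinit() returns 1. Whether the right place for
such a re-initialization is the wait functions or the code doing the copy
depends on what each filesystem does with its locks.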

Thanks,
NeilBrown