From: Jeff Layton <jlayton@kernel.org>
To: NeilBrown <neilb@suse.com>, Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	Martin Wilck <mwilck@suse.de>,
	linux-fsdevel@vger.kernel.org,
	Frank Filz <ffilzlnx@mindspring.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/9 v3] locks: avoid thundering-herd wake-ups
Date: Tue, 30 Oct 2018 08:04:57 -0400
Message-ID: <e07c6e910d36150be06a930b3ed8da99d197ea11.camel@kernel.org>
In-Reply-To: <0617e06e8d66170254a10cd922750c92d6ff04ff.camel@kernel.org>

On Mon, 2018-10-29 at 08:38 -0400, Jeff Layton wrote:
> On Mon, 2018-10-29 at 12:56 +1100, NeilBrown wrote:
> > On Fri, Oct 26 2018, Jeff Layton wrote:
> > 
> > > On Wed, 2018-10-24 at 09:43 +1100, NeilBrown wrote:
> > > > This took longer than I had wanted, due to various reasons - sorry.
> > > > And I'm now posting it in a merge window, which is not ideal.  I don't
> > > > expect it to be included in this merge window and I won't be at all
> > > > impatient for review, but I didn't want to delay it further.
> > > > 
> > > > Testing found some problems, particularly the need to use
> > > > locks_copy_lock in NFS.  And there was a small thinko in there that
> > > > effectively removed all the speed gains :-(
> > > > 
> > > > But this version:
> > > >  - shows excellent scalability with lots of threads on lots of CPUs
> > > >    contending on a single file
> > > >  - avoids the problems that Bruce found
> > > >  - seems to work.
> > > > 
> > > > Thanks,
> > > > NeilBrown
> > > > 
> > > > 
> > > > ---
> > > > 
> > > > NeilBrown (9):
> > > >       fs/locks: rename some lists and pointers.
> > > >       fs/locks: split out __locks_wake_up_blocks().
> > > >       NFS: use locks_copy_lock() to copy locks.
> > > >       fs/locks: allow a lock request to block other requests.
> > > >       fs/locks: always delete_block after waiting.
> > > >       fs/locks: change all *_conflict() functions to return bool.
> > > >       fs/locks: create a tree of dependent requests.
> > > >       locks: merge posix_unblock_lock() and locks_delete_block()
> > > >       VFS: locks: remove unnecessary white space.
> > > > 
> > > > 
> > > >  fs/cifs/file.c                  |    4 -
> > > >  fs/lockd/svclock.c              |    2 
> > > >  fs/locks.c                      |  231 ++++++++++++++++++++++-----------------
> > > >  fs/nfs/nfs4proc.c               |    6 +
> > > >  fs/nfsd/nfs4state.c             |    6 +
> > > >  include/linux/fs.h              |   11 +-
> > > >  include/trace/events/filelock.h |   16 +--
> > > >  7 files changed, 153 insertions(+), 123 deletions(-)
> > > > 
> > > > --
> > > > Signature
> > > > 
> > > 
> > > 
> > > I built a kernel with these patches and ran the cthon04 lock tests and
> > > got this on lock test 1 after a long hang (the test passed though):
> > 
> > I've been looking deeper into this, and I cannot see how this can
> > happen.
> > 
> > This is an unlock request, happening when a file is closed.
> > locks_delete_block() will only be called from locks_lock_inode_wait()
> > after posix_lock_inode() (or possibly flock_lock_inode()) returns
> > FILE_LOCK_DEFERRED.
> > 
> > But these only return FILE_LOCK_DEFERRED when fl_type != F_UNLCK.
> > (or possibly if FL_ACCESS and FL_SLEEP are both set - that would be
> >  weird).
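> > 
> > For reference, the deferral path in posix_lock_inode() looks roughly
> > like this (4.19-era code, simplified - the deadlock check and
> > blocked_lock_lock are omitted):
> > 
> > 	/* The conflict scan only runs for actual lock requests... */
> > 	if (request->fl_type != F_UNLCK) {
> > 		list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> > 			if (!posix_locks_conflict(request, fl))
> > 				continue;
> > 			error = -EAGAIN;
> > 			if (!(request->fl_flags & FL_SLEEP))
> > 				goto out;
> > 			/* ...so only they can ever be deferred. */
> > 			error = FILE_LOCK_DEFERRED;
> > 			__locks_insert_block(fl, request);
> > 			goto out;
> > 		}
> > 	}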
> > 
> > So this shouldn't happen - an unlock request should never result in
> > locks_delete_block() being called.
> > But if it can, I'll need to change do_flock() in gfs2/file.c to use a
> > properly initialized 'struct file_lock', rather than a manifest
> > constant.  Maybe I should do that anyway.
> > 
> > Any ideas? I'll try running connectathon myself later and see if I can
> > reproduce it.
> > 
> > Thanks,
> > NeilBrown
> > 
> > > 
> > > [ 1694.787367] BUG: unable to handle kernel NULL pointer dereference at 000000000000002c
> > > [ 1694.789546] PGD 118ff0067 P4D 118ff0067 PUD 135915067 PMD 0 
> > > [ 1694.790772] Oops: 0000 [#1] SMP NOPTI
> > > [ 1694.791749] CPU: 7 PID: 1514 Comm: tlocklfs Not tainted 4.19.0+ #56
> > > [ 1694.792876] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
> > > [ 1694.795179] RIP: 0010:__locks_delete_block+0x14/0x90
> > > [ 1694.796203] Code: 01 a1 e9 9f 4f d8 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 05 29 9d 58 01 55 53 48 89 fb 85 c0 75 5a <48> 8b 43 20 48 85 c0 74 20 48 8b 53 18 48 89 10 48 85 d2 74 04 48
> > > [ 1694.799277] RSP: 0018:ffff9d21c1f63cb8 EFLAGS: 00010202
> > > [ 1694.800374] RAX: 0000000000000001 RBX: 000000000000000c RCX: aaaaaaaaaaaaaaad
> > > [ 1694.801682] RDX: 0000000000000001 RSI: ffffffff9f7b0c38 RDI: 0000000000000246
> > > [ 1694.802996] RBP: 000000000000000c R08: 0000000000000000 R09: 0000000000000001
> > > [ 1694.804317] R10: 0000000000000001 R11: ffffffffa0bdc188 R12: ffff9d21c1f63dd8
> > > [ 1694.805633] R13: ffff9d21c1f63e00 R14: ffffffff9f3241a8 R15: ffff8d0b5aef72e0
> > > [ 1694.806941] FS:  00007efc8699c740(0000) GS:ffff8d0b7ba00000(0000) knlGS:0000000000000000
> > > [ 1694.808380] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 1694.809550] CR2: 000000000000002c CR3: 000000011e0d8000 CR4: 00000000000006e0
> > > [ 1694.810888] Call Trace:
> > > [ 1694.811692]  __locks_wake_up_blocks+0x2d/0x80
> > > [ 1694.812713]  locks_delete_block+0x1d/0x40
> > > [ 1694.813691]  locks_lock_inode_wait+0x9c/0x1c0
> > > [ 1694.814731]  nfs4_proc_lock+0x120/0x440 [nfsv4]
> > > [ 1694.815786]  ? nfs_put_lock_context+0x25/0x80 [nfs]
> > > [ 1694.816866]  ? do_unlk+0x98/0xe0 [nfs]
> > > [ 1694.817818]  locks_remove_posix+0xba/0x1d0
> > > [ 1694.818811]  ? _cond_resched+0x15/0x30
> > > [ 1694.819768]  ? wait_on_commit+0x38/0xb0 [nfs]
> > > [ 1694.820787]  ? process_echoes+0x60/0x60
> > > [ 1694.821752]  ? __nfs_commit_inode+0xc2/0x1c0 [nfs]
> > > [ 1694.822819]  filp_close+0x56/0x70
> > > [ 1694.823712]  __x64_sys_close+0x1e/0x50
> > > [ 1694.824661]  do_syscall_64+0x60/0x1f0
> > > [ 1694.825599]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > [ 1694.826731] RIP: 0033:0x7efc8616c0a4
> > > [ 1694.827673] Code: eb 89 e8 af f6 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 05 aa e7 2c 00 48 63 ff 85 c0 75 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 44 f3 c3 66 90 48 83 ec 18 48 89 7c 24 08 e8
> > > [ 1694.830929] RSP: 002b:00007ffc70beb7b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> > > [ 1694.832371] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efc8616c0a4
> > > [ 1694.833784] RDX: 0000000000000000 RSI: 00007efc864378a0 RDI: 0000000000000009
> > > [ 1694.835183] RBP: 00007ffc70beb7d0 R08: 00007efc864378a0 R09: 00007efc8699c740
> > > [ 1694.836560] R10: 00000000000006b4 R11: 0000000000000246 R12: 0000000000401000
> > > [ 1694.837941] R13: 00007ffc70beb990 R14: 0000000000000000 R15: 0000000000000000
> > > [ 1694.839322] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd auth_rpcgss nfs_acl lockd grace ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc joydev virtio_balloon i2c_piix4 pcspkr edac_mce_amd xfs libcrc32c serio_raw virtio_blk qxl drm_kms_helper virtio_net ttm net_failover virtio_console failover drm ata_generic pata_acpi floppy qemu_fw_cfg
> > > [ 1694.849736] CR2: 000000000000002c
> > > [ 1694.850813] ---[ end trace da2f469c62deb564 ]---
> > > 
> > > -- 
> > > Jeff Layton <jlayton@kernel.org>
> 
> Yes. It crashed here:
> 
> (gdb) list *(__locks_delete_block+0x14)
> 0xffffffff81391374 is in __locks_delete_block (./include/linux/list.h:693).
> 688		n->pprev = LIST_POISON2;
> 689	}
> 690	
> 691	static inline void hlist_del_init(struct hlist_node *n)
> 692	{
> 693		if (!hlist_unhashed(n)) {
> 694			__hlist_del(n);
> 695			INIT_HLIST_NODE(n);
> 696		}
> 697	}
> 
> ...and that should be the address of fl->fl_link.
> 
> I think the issue is probably in locks_remove_posix. It creates a lock
> request on the stack, and doesn't seem to initialize fl_link. That used
> to be ok, but patch #5 in the series changes that.
> 
> The following patch seems to fix the problem in a quick test. I'll plan
> to run some more tests later. It may be a day or two before I can get to
> it though.
> 
> -----------------------8<---------------------------
> 
> [PATCH] locks: initialize list heads in locks_remove_posix lock request
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/locks.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 6b9da320d9a0..60019e146839 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2535,6 +2535,7 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
>  	lock.fl_file = filp;
>  	lock.fl_ops = NULL;
>  	lock.fl_lmops = NULL;
> +	locks_init_lock_heads(&lock);
>  
>  	error = vfs_lock_file(filp, F_SETLK, &lock, NULL);
>  
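> 
> For reference, locks_init_lock_heads() is just a handful of list-head
> initializations - these are the 4.19 field names, which may differ
> slightly after the renames earlier in your series:
> 
> 	static inline void
> 	locks_init_lock_heads(struct file_lock *fl)
> 	{
> 		INIT_HLIST_NODE(&fl->fl_link);
> 		INIT_LIST_HEAD(&fl->fl_list);
> 		INIT_LIST_HEAD(&fl->fl_block);
> 		init_waitqueue_head(&fl->fl_wait);
> 	}
> 
> ...so the added cost is a few stores plus the waitqueue init.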

I ran a bunch of tests on top of your series + this patch and it seemed
to do fine. I'm guessing you just got "lucky" and your kernel happened
to end up with the fl_link value set in such a way that hlist_unhashed()
returned true there?
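
FWIW, hlist_unhashed() just tests ->pprev:

	static inline int hlist_unhashed(const struct hlist_node *h)
	{
		return !h->pprev;
	}

...so whether the stack garbage in lock.fl_link.pprev happens to be
NULL is what decides between hlist_del_init() being a no-op and it
chasing a wild pointer.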

In any case, this seems to fix the issue I was seeing. The only question
I have is whether this will be more expensive than doing something more
clever, like checking for FL_SLEEP in locks_delete_block().
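
The "more clever" option would look something like this - an untested
sketch, hand-waving over the exact signature locks_delete_block() ends
up with in your series:

	void locks_delete_block(struct file_lock *waiter)
	{
		/*
		 * A request that was never allowed to sleep can never
		 * have been inserted on a blocked list, so there is
		 * nothing to unlink and no need to take the lock.
		 */
		if (!(waiter->fl_flags & FL_SLEEP))
			return;

		spin_lock(&blocked_lock_lock);
		__locks_delete_block(waiter);
		spin_unlock(&blocked_lock_lock);
	}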

I've gone ahead and pushed the set to my locks-next branch to start
getting more testing. If all goes well, I'll plan to send Linus a PR for
v4.21 (or v5.1).

Thanks again for doing this!
-- 
Jeff Layton <jlayton@kernel.org>

