linux-kernel.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.de>
To: Jeff Layton <jlayton@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: yangerkun <yangerkun@huawei.com>,
	kernel test robot <rong.a.chen@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, Bruce Fields <bfields@fieldses.org>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [locks] 6d390e4b5d: will-it-scale.per_process_ops -96.6% regression
Date: Fri, 13 Mar 2020 09:19:24 +1100
Message-ID: <87ftedtdw3.fsf@notabene.neil.brown.name>
In-Reply-To: <5e5a109f2a8f64324c114f4f55b7cb7c21a8d8da.camel@kernel.org>


On Thu, Mar 12 2020, Jeff Layton wrote:

> On Thu, 2020-03-12 at 15:42 +1100, NeilBrown wrote:
>> On Wed, Mar 11 2020, Linus Torvalds wrote:
>> 
>> > On Wed, Mar 11, 2020 at 3:22 PM NeilBrown <neilb@suse.de> wrote:
>> > > We can combine the two ideas - move the list_del_init() later, and still
>> > > protect it with the wq locks.  This avoids holding the lock across the
>> > > callback, but provides clear atomicity guarantees.
>> > 
>> > Ugfh. Honestly, this is disgusting.
>> > 
>> > Now you re-take the same lock in immediate succession for the
>> > non-callback case.  It's just hidden.
>> > 
>> > And it's not like the list_del_init() _needs_ the lock (it's not
>> > currently called with the lock held).
>> > 
>> > So that "hold the lock over list_del_init()" seems to be horrendously
>> > bogus. It's only done as a serialization thing for that optimistic
>> > case.
>> > 
>> > And that optimistic case doesn't even *want* that kind of
>> > serialization. It really just wants a "I'm done" flag.
>> > 
>> > So no. Don't do this. It's mis-using the lock in several ways.
>> > 
>> >              Linus
>> 
>> It seems that test_and_set_bit_lock() is the preferred way to handle
>> flags when memory ordering is important, and I can't see how to use that
>> well with an "I'm done" flag.  I can make it look OK with a "I'm
>> detaching" flag.  Maybe this is better.
>> 
>> NeilBrown
>> 
>> From f46db25f328ddf37ca9fbd390c6eb5f50c4bd2e6 Mon Sep 17 00:00:00 2001
>> From: NeilBrown <neilb@suse.de>
>> Date: Wed, 11 Mar 2020 07:39:04 +1100
>> Subject: [PATCH] locks: restore locks_delete_lock optimization
>> 
>> A recent patch (see Fixes: below) removed an optimization which is
>> important as it avoids taking a lock in a common case.
>> 
>> The comment justifying the optimisation was correct as far as it went,
>> in that if the tests succeeded, then the values would remain stable and
>> the test result will remain valid even without a lock.
>> 
>> However after the test succeeds the lock can be freed while some other
>> thread might have only just set ->blocker to NULL (thus allowing the
>> test to succeed) but has not yet called wake_up() on the wq in the lock.
>> If the wake_up happens after the lock is freed, a use-after-free error occurs.
>> 
>> This patch restores the optimization and adds a flag to ensure this
>> use-after-free is avoided.  The use happens only when the flag is set, and
>> the free doesn't happen until the flag has been cleared, or we have
>> taken blocked_lock_lock.
>> 
>> Fixes: 6d390e4b5d48 ("locks: fix a potential use-after-free problem when wakeup a waiter")
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>>  fs/locks.c         | 44 ++++++++++++++++++++++++++++++++++++++------
>>  include/linux/fs.h |  3 ++-
>>  2 files changed, 40 insertions(+), 7 deletions(-)
>> 
>
> Just a note that I'm traveling at the moment, and won't be able to do much
> other than comment on this for a few days.
>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index 426b55d333d5..334473004c6c 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -283,7 +283,7 @@ locks_dump_ctx_list(struct list_head *list, char *list_type)
>>  	struct file_lock *fl;
>>  
>>  	list_for_each_entry(fl, list, fl_list) {
>> -		pr_warn("%s: fl_owner=%p fl_flags=0x%x fl_type=0x%x fl_pid=%u\n", list_type, fl->fl_owner, fl->fl_flags, fl->fl_type, fl->fl_pid);
>> +		pr_warn("%s: fl_owner=%p fl_flags=0x%lx fl_type=0x%x fl_pid=%u\n", list_type, fl->fl_owner, fl->fl_flags, fl->fl_type, fl->fl_pid);
>>  	}
>>  }
>>  
>> @@ -314,7 +314,7 @@ locks_check_ctx_file_list(struct file *filp, struct list_head *list,
>>  	list_for_each_entry(fl, list, fl_list)
>>  		if (fl->fl_file == filp)
>>  			pr_warn("Leaked %s lock on dev=0x%x:0x%x ino=0x%lx "
>> -				" fl_owner=%p fl_flags=0x%x fl_type=0x%x fl_pid=%u\n",
>> +				" fl_owner=%p fl_flags=0x%lx fl_type=0x%x fl_pid=%u\n",
>>  				list_type, MAJOR(inode->i_sb->s_dev),
>>  				MINOR(inode->i_sb->s_dev), inode->i_ino,
>>  				fl->fl_owner, fl->fl_flags, fl->fl_type, fl->fl_pid);
>> @@ -736,10 +736,13 @@ static void __locks_wake_up_blocks(struct file_lock *blocker)
>>  		waiter = list_first_entry(&blocker->fl_blocked_requests,
>>  					  struct file_lock, fl_blocked_member);
>>  		__locks_delete_block(waiter);
>> -		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
>> -			waiter->fl_lmops->lm_notify(waiter);
>> -		else
>> -			wake_up(&waiter->fl_wait);
>> +		if (!test_and_set_bit_lock(FL_DELETING, &waiter->fl_flags)) {
>> +			if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
>> +				waiter->fl_lmops->lm_notify(waiter);
>> +			else
>> +				wake_up(&waiter->fl_wait);
>> +			clear_bit_unlock(FL_DELETING, &waiter->fl_flags);
>> +		}
>
> I *think* this is probably safe.
>
> AIUI, when you use atomic bitops on a flag word like this, you should
> use them for all modifications to ensure that your changes don't get
> clobbered by another task racing in to do a read/modify/write cycle on
> the same word.
>
> I haven't gone over all of the places where fl_flags is changed, but I
> don't see any at first glance that do it on a blocked request.
>
>>  	}
>>  }
>>  
>> @@ -753,11 +756,40 @@ int locks_delete_block(struct file_lock *waiter)
>>  {
>>  	int status = -ENOENT;
>>  
>> +	/*
>> +	 * If fl_blocker is NULL, it won't be set again as this thread
>> +	 * "owns" the lock and is the only one that might try to claim
>> +	 * the lock.  So it is safe to test fl_blocker locklessly.
>> +	 * Also if fl_blocker is NULL, this waiter is not listed on
>> +	 * fl_blocked_requests for some lock, so no other request can
>> +	 * be added to the list of fl_blocked_requests for this
>> +	 * request.  So if fl_blocker is NULL, it is safe to
>> +	 * locklessly check if fl_blocked_requests is empty.  If both
>> +	 * of these checks succeed, there is no need to take the lock.
>> +	 *
>> +	 * We perform these checks only if we can set FL_DELETING.
>> +	 * This ensures that we don't race with __locks_wake_up_blocks()
>> +	 * in a way which leads it to call wake_up() *after* we return
>> +	 * and the file_lock is freed.
>> +	 */
>> +	if (!test_and_set_bit_lock(FL_DELETING, &waiter->fl_flags)) {
>> +		if (waiter->fl_blocker == NULL &&
>> +		    list_empty(&waiter->fl_blocked_requests)) {
>> +			/* Already fully unlinked */
>> +			clear_bit_unlock(FL_DELETING, &waiter->fl_flags);
>> +			return status;
>> +		}
>> +	}
>> +
>>  	spin_lock(&blocked_lock_lock);
>>  	if (waiter->fl_blocker)
>>  		status = 0;
>>  	__locks_wake_up_blocks(waiter);
>>  	__locks_delete_block(waiter);
>> +	/* This flag might not be set and it is largely irrelevant
>> +	 * now, but it seems cleaner to clear it.
>> +	 */
>> +	clear_bit(FL_DELETING, &waiter->fl_flags);
>>  	spin_unlock(&blocked_lock_lock);
>>  	return status;
>>  }
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 3cd4fe6b845e..4db514f29bca 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1012,6 +1012,7 @@ static inline struct file *get_file(struct file *f)
>>  #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
>>  #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>>  #define FL_LAYOUT	2048	/* outstanding pNFS layout */
>> +#define FL_DELETING	32768	/* lock is being disconnected */
>
> nit: Why the big gap?

No good reason - it seems like a conceptually different sort of flag so
I vaguely felt that it would help if it were numerically separate.
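
To make the "different sort of flag" point concrete: FL_DELETING is used
as a tiny try-lock, not as a descriptive attribute of the lock.  Below is
a rough user-space sketch of that pattern using C11 atomics.  It is not
kernel code, and every demo_* name is made up for illustration.  Note that
the kernel helpers take a bit number, whereas the existing FL_* constants
are masks, so the sketch keeps a separate index and mask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical values: a bit index and the mask derived from it. */
#define DEMO_DELETING_BIT	15
#define DEMO_DELETING_MASK	(1UL << DEMO_DELETING_BIT)

struct demo_lock {
	atomic_ulong flags;	/* stands in for fl_flags */
};

/*
 * Rough analogue of test_and_set_bit_lock(): atomically set the bit with
 * acquire ordering and return whether it was already set.
 */
static bool demo_test_and_set_bit_lock(struct demo_lock *l)
{
	unsigned long old = atomic_fetch_or_explicit(&l->flags,
						     DEMO_DELETING_MASK,
						     memory_order_acquire);
	return (old & DEMO_DELETING_MASK) != 0;
}

/*
 * Rough analogue of clear_bit_unlock(): clear the bit with release
 * ordering.  The caller must currently hold the bit.
 */
static void demo_clear_bit_unlock(struct demo_lock *l)
{
	unsigned long old = atomic_fetch_and_explicit(&l->flags,
						      ~DEMO_DELETING_MASK,
						      memory_order_release);
	assert(old & DEMO_DELETING_MASK);
}

/*
 * Run a short critical section only if we win the flag.
 * Returns 1 if the section ran, 0 if someone else held the flag.
 */
static int demo_try_section(struct demo_lock *l, int *counter)
{
	if (demo_test_and_set_bit_lock(l))
		return 0;
	(*counter)++;		/* e.g. the wake_up() or the lockless checks */
	demo_clear_bit_unlock(l);
	return 1;
}
```

Whoever loses the test-and-set simply skips the section; in the patch
the loser in locks_delete_block() then falls through to the
blocked_lock_lock path.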

>
>>  
>>  #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>>  
>> @@ -1087,7 +1088,7 @@ struct file_lock {
>>  						 * ->fl_blocker->fl_blocked_requests
>>  						 */
>>  	fl_owner_t fl_owner;
>> -	unsigned int fl_flags;
>> +	unsigned long fl_flags;
>
> This will break kABI, so backporting this to enterprise distro kernels
> won't be trivial. Not a showstopper, but it might be nice to avoid that
> if we can.
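
For anyone wondering why widening the field matters, here is a small
stand-alone sketch.  It uses a cut-down, hypothetical struct (only the
four neighbouring fields from the hunk above, not the real struct
file_lock) and reports how far the field after fl_flags moves when the
type changes.

```c
#include <assert.h>
#include <stddef.h>

typedef void *demo_owner_t;	/* stands in for fl_owner_t */

/* Cut-down layout before the change. */
struct demo_file_lock_old {
	demo_owner_t fl_owner;
	unsigned int fl_flags;
	unsigned char fl_type;
	unsigned int fl_pid;
};

/* Same fields with fl_flags widened, as the atomic bitops require. */
struct demo_file_lock_new {
	demo_owner_t fl_owner;
	unsigned long fl_flags;
	unsigned char fl_type;
	unsigned int fl_pid;
};

static_assert(sizeof(unsigned long) >= sizeof(unsigned int),
	      "unsigned long is never narrower than unsigned int");

/* Number of bytes by which fl_type moves when fl_flags is widened. */
static long demo_fl_type_shift(void)
{
	return (long)offsetof(struct demo_file_lock_new, fl_type) -
	       (long)offsetof(struct demo_file_lock_old, fl_type);
}
```

On a typical 64-bit Linux build the shift is non-zero, so any module
compiled against the old layout would read fl_type and later fields from
the wrong offsets.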
>
> While it's not quite as efficient, we could just do the FL_DELETING
> manipulation under the flc->flc_lock. That's per-inode, so it should be
> safe to do it that way.

If we are going to use a spinlock, I'd much rather not add a flag bit,
but instead use the blocked_member list_head.

I'm almost tempted to suggest adding
  smp_list_del_init_release() and smp_list_empty_careful_acquire()
so that list membership can be used as a barrier.  I'm not sure I'm game,
though.
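
Very roughly, and only as a user-space illustration with C11 atomics, the
idea would look something like the following.  Neither helper exists in
the kernel; the demo_* names and the list type are invented here, and the
sketch assumes that writers are still serialised by a lock, with only the
emptiness check done locklessly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_list_head {
	_Atomic(struct demo_list_head *) next;
	struct demo_list_head *prev;
};

static void demo_list_init(struct demo_list_head *h)
{
	atomic_init(&h->next, h);
	h->prev = h;
}

/*
 * Lockless check: the acquire load pairs with the release store in
 * demo_list_del_init_release(), so a reader that sees "empty" also sees
 * everything the deleter wrote before unlinking.
 */
static bool demo_list_empty_acquire(struct demo_list_head *entry)
{
	return atomic_load_explicit(&entry->next, memory_order_acquire) == entry;
}

/* Add at the front; the caller holds the list lock. */
static void demo_list_add(struct demo_list_head *new,
			  struct demo_list_head *head)
{
	struct demo_list_head *first =
		atomic_load_explicit(&head->next, memory_order_relaxed);

	assert(demo_list_empty_acquire(new));	/* must not already be linked */
	new->prev = head;
	atomic_store_explicit(&new->next, first, memory_order_relaxed);
	first->prev = new;
	atomic_store_explicit(&head->next, new, memory_order_relaxed);
}

/*
 * Unlink the entry and, as the very last step, publish the self-linked
 * state with a release store.  The caller holds the list lock.
 */
static void demo_list_del_init_release(struct demo_list_head *entry)
{
	struct demo_list_head *next =
		atomic_load_explicit(&entry->next, memory_order_relaxed);
	struct demo_list_head *prev = entry->prev;

	next->prev = prev;
	atomic_store_explicit(&prev->next, next, memory_order_relaxed);
	entry->prev = entry;
	atomic_store_explicit(&entry->next, entry, memory_order_release);
}
```

The lockless side would then only need demo_list_empty_acquire() on its
own entry.  Nothing after the release store may touch the entry, or the
use-after-free comes straight back.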

NeilBrown


>
>>  	unsigned char fl_type;
>>  	unsigned int fl_pid;
>>  	int fl_link_cpu;		/* what cpu's list is this on? */


