* [PATCH] fs: Optimized fget to improve performance
@ 2020-08-27 10:19 Shaokun Zhang
  2020-08-27 12:30 ` Matthew Wilcox
  2020-08-27 14:28 ` [NAK] " Al Viro
  0 siblings, 2 replies; 8+ messages in thread
From: Shaokun Zhang @ 2020-08-27 10:19 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: Yuqi Jin, kernel test robot, Will Deacon, Mark Rutland,
	Peter Zijlstra, Alexander Viro, Boqun Feng, Shaokun Zhang

From: Yuqi Jin <jinyuqi@huawei.com>

It is well known that the performance of atomic_add is better than that of
atomic_cmpxchg.
The initial value of @f_count is 1. While @f_count is increased by 1 in
__fget_files, it will go through three phases: > 0, < 0, and = 0. When the
fixed value 0 is used as the condition for terminating the increase of 1,
only atomic_cmpxchg can be used. When we use < 0 as the condition for
stopping plus 1, we can use atomic_add to obtain better performance.
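
A rough userspace sketch of the two primitives being contrasted (purely
illustrative, not the kernel helpers; GCC/Clang __atomic builtins, made-up
function names):

#include <stdbool.h>

/* cmpxchg style: roughly what atomic_long_add_unless(v, a, 0) has to do */
static bool bump_unless_zero(long *v, long a)
{
	long c = __atomic_load_n(v, __ATOMIC_RELAXED);

	do {
		if (c == 0)
			return false;
	} while (!__atomic_compare_exchange_n(v, &c, c + a, false,
					      __ATOMIC_SEQ_CST,
					      __ATOMIC_RELAXED));
	return true;
}

/* add style: unconditional add, success judged from the old value's sign;
 * whether dropping the atomic check is safe is what this patch is about.
 */
static bool bump_add_check_sign(long *v, long a)
{
	long old = __atomic_fetch_add(v, a, __ATOMIC_SEQ_CST);

	return old > 0;
}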

We tested the syscall case in UnixBench on a Huawei Kunpeng 920 (arm64) and
got a 132% performance boost.

with this patch and the patch [1]
System Call Overhead                        9516926.2 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    9516926.2   6344.6
                                                                   ========
System Benchmarks Index Score (Partial Only)                         6344.6

with this patch and without the patch [1]
System Call Overhead                        5290449.3 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    5290449.3   3527.0
                                                                   ========
System Benchmarks Index Score (Partial Only)                         3527.0

without any patch
System Call Overhead                        4102310.5 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    4102310.5   2734.9
                                                                   ========
System Benchmarks Index Score (Partial Only)                         2734.9

[1] https://lkml.org/lkml/2020/6/24/283

Cc: kernel test robot <rong.a.chen@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
Hi Rong,

Could you please test this patch on your platform, both individually and
together with [1]? [1] has already been tested on your platform [2].

[2] https://lkml.org/lkml/2020/7/8/227

 include/linux/fs.h | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e019ea2f1347..2a9c2a30dc58 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -972,8 +972,19 @@ static inline struct file *get_file(struct file *f)
 	atomic_long_inc(&f->f_count);
 	return f;
 }
+
+static inline bool get_file_unless_negative(atomic_long_t *v, long a)
+{
+	long c = atomic_long_read(v);
+
+	if (c <= 0)
+		return 0;
+
+	return atomic_long_add_return(a, v) - 1;
+}
+
 #define get_file_rcu_many(x, cnt)	\
-	atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+	get_file_unless_negative(&(x)->f_count, (cnt))
 #define get_file_rcu(x) get_file_rcu_many((x), 1)
 #define file_count(x)	atomic_long_read(&(x)->f_count)
 
-- 
2.7.4


* Re: [PATCH] fs: Optimized fget to improve performance
  2020-08-27 10:19 [PATCH] fs: Optimized fget to improve performance Shaokun Zhang
@ 2020-08-27 12:30 ` Matthew Wilcox
  2020-08-27 13:07   ` David Laight
  2020-08-27 14:28 ` [NAK] " Al Viro
  1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2020-08-27 12:30 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Yuqi Jin, Will Deacon, Mark Rutland

On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> From: Yuqi Jin <jinyuqi@huawei.com>
> 
> It is well known that the performance of atomic_add is better than that of
> atomic_cmpxchg.

I don't think that's well-known at all.

> +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
> +{
> +	long c = atomic_long_read(v);
> +
> +	if (c <= 0)
> +		return 0;
> +
> +	return atomic_long_add_return(a, v) - 1;
> +}
> +
>  #define get_file_rcu_many(x, cnt)	\
> -	atomic_long_add_unless(&(x)->f_count, (cnt), 0)
> +	get_file_unless_negative(&(x)->f_count, (cnt))
>  #define get_file_rcu(x) get_file_rcu_many((x), 1)
>  #define file_count(x)	atomic_long_read(&(x)->f_count)

I think you should be proposing a patch to fix atomic_long_add_unless()
on arm64 instead.

* RE: [PATCH] fs: Optimized fget to improve performance
  2020-08-27 12:30 ` Matthew Wilcox
@ 2020-08-27 13:07   ` David Laight
  0 siblings, 0 replies; 8+ messages in thread
From: David Laight @ 2020-08-27 13:07 UTC (permalink / raw)
  To: 'Matthew Wilcox', Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Yuqi Jin, Will Deacon, Mark Rutland

From: Matthew Wilcox
> Sent: 27 August 2020 13:31
> On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> > From: Yuqi Jin <jinyuqi@huawei.com>
> >
> > It is well known that the performance of atomic_add is better than that of
> > atomic_cmpxchg.
> 
> I don't think that's well-known at all.

Especially since on (probably) every architecture except x86
atomic_add() is implemented using atomic_cmpxchg().

	David

* [NAK] Re: [PATCH] fs: Optimized fget to improve performance
  2020-08-27 10:19 [PATCH] fs: Optimized fget to improve performance Shaokun Zhang
  2020-08-27 12:30 ` Matthew Wilcox
@ 2020-08-27 14:28 ` Al Viro
  2020-08-28 11:04   ` Will Deacon
  2020-08-31  1:43   ` Shaokun Zhang
  1 sibling, 2 replies; 8+ messages in thread
From: Al Viro @ 2020-08-27 14:28 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Yuqi Jin, kernel test robot,
	Will Deacon, Mark Rutland, Peter Zijlstra, Boqun Feng

On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> From: Yuqi Jin <jinyuqi@huawei.com>
> 
> It is well known that the performance of atomic_add is better than that of
> atomic_cmpxchg.
> The initial value of @f_count is 1. While @f_count is increased by 1 in
> __fget_files, it will go through three phases: > 0, < 0, and = 0. When the
> fixed value 0 is used as the condition for terminating the increase of 1,
> only atomic_cmpxchg can be used. When we use < 0 as the condition for
> stopping plus 1, we can use atomic_add to obtain better performance.

Suppose another thread has just removed it from the descriptor table.

> +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
> +{
> +	long c = atomic_long_read(v);
> +
> +	if (c <= 0)
> +		return 0;

Still 1.  Now the other thread has gotten to dropping the last reference,
decremented counter to zero and committed to freeing the struct file.

> +
> +	return atomic_long_add_return(a, v) - 1;

... and you increment that sucker back to 1.  Sure, you return 0, so the
caller does nothing to that struct file.  Which includes undoing the
changes to its refcount.

In the meanwhile, the third thread does fget on the same descriptor,
and there we end up bumping the refcount to 2 and succeeding.  Which
leaves the caller with reference to already doomed struct file...

	IOW, NAK - this is completely broken.  The whole point of
atomic_long_add_unless() is that the check and conditional increment
are atomic.  Together.  That's what your optimization takes out.
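
For reference, the generic fallback keeps that pairing inside one
try_cmpxchg() loop, so no window ever opens between the check and the
increment - roughly (long variant, from memory, details vary by tree):

static __always_inline long
atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
{
	long c = atomic_long_read(v);

	do {
		/* the limit is re-checked against the exact value we swap from */
		if (unlikely(c == u))
			break;
	} while (!atomic_long_try_cmpxchg(v, &c, c + a));

	return c;	/* != u means the add actually happened */
}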

* Re: [NAK] Re: [PATCH] fs: Optimized fget to improve performance
  2020-08-27 14:28 ` [NAK] " Al Viro
@ 2020-08-28 11:04   ` Will Deacon
  2020-08-31  1:43   ` Shaokun Zhang
  1 sibling, 0 replies; 8+ messages in thread
From: Will Deacon @ 2020-08-28 11:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Shaokun Zhang, linux-fsdevel, linux-kernel, Yuqi Jin,
	kernel test robot, Mark Rutland, Peter Zijlstra, Boqun Feng

On Thu, Aug 27, 2020 at 03:28:48PM +0100, Al Viro wrote:
> On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> > From: Yuqi Jin <jinyuqi@huawei.com>
> > 
> > It is well known that the performance of atomic_add is better than that of
> > atomic_cmpxchg.
> > The initial value of @f_count is 1. While @f_count is increased by 1 in
> > __fget_files, it will go through three phases: > 0, < 0, and = 0. When the
> > fixed value 0 is used as the condition for terminating the increase of 1,
> > only atomic_cmpxchg can be used. When we use < 0 as the condition for
> > stopping plus 1, we can use atomic_add to obtain better performance.
> 
> Suppose another thread has just removed it from the descriptor table.
> 
> > +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
> > +{
> > +	long c = atomic_long_read(v);
> > +
> > +	if (c <= 0)
> > +		return 0;
> 
> Still 1.  Now the other thread has gotten to dropping the last reference,
> decremented counter to zero and committed to freeing the struct file.
> 
> > +
> > +	return atomic_long_add_return(a, v) - 1;
> 
> ... and you increment that sucker back to 1.  Sure, you return 0, so the
> caller does nothing to that struct file.  Which includes undoing the
> changes to its refcount.
> 
> In the meanwhile, the third thread does fget on the same descriptor,
> and there we end up bumping the refcount to 2 and succeeding.  Which
> leaves the caller with reference to already doomed struct file...
> 
> 	IOW, NAK - this is completely broken.  The whole point of
> atomic_long_add_unless() is that the check and conditional increment
> are atomic.  Together.  That's what your optimization takes out.

Cheers Al, yes, this is fscked.

As an aside, I've previously toyed with the idea of implementing a form
of cmpxchg() using a pair of xchg() operations and a smp_cond_load_relaxed(),
where the thing would transition through a "reserved value", which might
perform better with the current trend of building hardware that doesn't
handle CAS failure so well.

But I've never had the time/motivation to hack it up, and it relies on that
reserved value which obviously doesn't always work (so it would have to be a
separate API).
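
Purely to sketch the idea (hypothetical, untested, memory ordering
hand-waved; RESERVED has to be a value the counter can never legitimately
hold):

#define RESERVED	LONG_MIN	/* sentinel the counter never takes */

static long xchg_based_cmpxchg(atomic_long_t *v, long old, long new)
{
	long seen;

	for (;;) {
		/* wait until nobody else has the location reserved */
		smp_cond_load_relaxed(&v->counter, VAL != RESERVED);
		/* first xchg: claim the location */
		seen = atomic_long_xchg(v, RESERVED);
		if (seen != RESERVED)
			break;
	}
	/* second xchg: publish the new value, or put the old one back */
	atomic_long_xchg(v, seen == old ? new : seen);

	return seen;	/* == old means the "cmpxchg" succeeded */
}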

Will

* Re: [NAK] Re: [PATCH] fs: Optimized fget to improve performance
  2020-08-27 14:28 ` [NAK] " Al Viro
  2020-08-28 11:04   ` Will Deacon
@ 2020-08-31  1:43   ` Shaokun Zhang
  2020-08-31  3:21     ` Al Viro
  1 sibling, 1 reply; 8+ messages in thread
From: Shaokun Zhang @ 2020-08-31  1:43 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel, Yuqi Jin, kernel test robot,
	Will Deacon, Mark Rutland, Peter Zijlstra, Boqun Feng

Hi Al,

On 2020/8/27 22:28, Al Viro wrote:
> On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
>> From: Yuqi Jin <jinyuqi@huawei.com>
>>
>> It is well known that the performance of atomic_add is better than that of
>> atomic_cmpxchg.
>> The initial value of @f_count is 1. While @f_count is increased by 1 in
>> __fget_files, it will go through three phases: > 0, < 0, and = 0. When the
>> fixed value 0 is used as the condition for terminating the increase of 1,
>> only atomic_cmpxchg can be used. When we use < 0 as the condition for
>> stopping plus 1, we can use atomic_add to obtain better performance.
> 
> Suppose another thread has just removed it from the descriptor table.
> 
>> +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
>> +{
>> +	long c = atomic_long_read(v);
>> +
>> +	if (c <= 0)
>> +		return 0;
> 
> Still 1.  Now the other thread has gotten to dropping the last reference,
> decremented counter to zero and committed to freeing the struct file.
> 

Apologies that I missed it.

>> +
>> +	return atomic_long_add_return(a, v) - 1;
> 
> ... and you increment that sucker back to 1.  Sure, you return 0, so the
> caller does nothing to that struct file.  Which includes undoing the
> changes to its refcount.
> 
> In the meanwhile, the third thread does fget on the same descriptor,
> and there we end up bumping the refcount to 2 and succeeding.  Which
> leaves the caller with reference to already doomed struct file...
> 
> 	IOW, NAK - this is completely broken.  The whole point of
> atomic_long_add_unless() is that the check and conditional increment
> are atomic.  Together.  That's what your optimization takes out.
> 

How about this? We try to replace atomic_cmpxchg with atomic_add to improve
performance. The atomic_add does not check the current f_count value.
Therefore, the number of online CPUs is reserved to prevent multi-core
competition.

+
+static inline bool get_file_unless(atomic_long_t *v, long a)
+{
+       long cpus = num_online_cpus();
+       long c = atomic_long_read(v);
+       long ret;
+
+       if (c > cpus || c < -cpus)
+               ret = atomic_long_add_return(a, v) - a;
+       else
+               ret = atomic_long_add_unless(v, a, 0);
+
+       return ret;
+}
+
 #define get_file_rcu_many(x, cnt)      \
-       atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+       get_file_unless(&(x)->f_count, (cnt))

Thanks,
Shaokun

> .
> 


* Re: [NAK] Re: [PATCH] fs: Optimized fget to improve performance
  2020-08-31  1:43   ` Shaokun Zhang
@ 2020-08-31  3:21     ` Al Viro
  2020-09-01  9:29       ` David Laight
  0 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2020-08-31  3:21 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Yuqi Jin, kernel test robot,
	Will Deacon, Mark Rutland, Peter Zijlstra, Boqun Feng

On Mon, Aug 31, 2020 at 09:43:31AM +0800, Shaokun Zhang wrote:

> How about this? We try to replace atomic_cmpxchg with atomic_add to improve
> performance. The atomic_add does not check the current f_count value.
> Therefore, the number of online CPUs is reserved to prevent multi-core
> competition.

No.  Really, really - no.  Not unless you can guarantee that process on another
CPU won't lose its timeslice, ending up with more than one increment happening on
the same CPU - done by different processes scheduled there, one after another.

If you have some change of atomic_long_add_unless(), do it there.  And get it
past the arm64 folks.  get_file_rcu() is nothing special in that respect *AND*
it has to cope with any architecture out there.

BTW, keep in mind that there's such thing as a KVM - race windows are much
wider there, since a thread representing a guest CPU might lose its timeslice
whenever the host feels like that.  At which point you get a single instruction
on a guest CPU taking longer than many thousands of instructions on another
CPU of the same guest.

AFAIK, arm64 does support KVM with SMP guests.

* RE: [NAK] Re: [PATCH] fs: Optimized fget to improve performance
  2020-08-31  3:21     ` Al Viro
@ 2020-09-01  9:29       ` David Laight
  0 siblings, 0 replies; 8+ messages in thread
From: David Laight @ 2020-09-01  9:29 UTC (permalink / raw)
  To: 'Al Viro', Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Yuqi Jin, kernel test robot,
	Will Deacon, Mark Rutland, Peter Zijlstra, Boqun Feng

From: Al Viro
> Sent: 31 August 2020 04:21
> 
> On Mon, Aug 31, 2020 at 09:43:31AM +0800, Shaokun Zhang wrote:
> 
> > How about this? We try to replace atomic_cmpxchg with atomic_add to improve
> > performance. The atomic_add does not check the current f_count value.
> > Therefore, the number of online CPUs is reserved to prevent multi-core
> > competition.
> 
> No.  Really, really - no.  Not unless you can guarantee that process on another
> CPU won't lose its timeslice, ending up with more than one increment happening on
> the same CPU - done by different processes scheduled there, one after another.
> 
> If you have some change of atomic_long_add_unless(), do it there.  And get it
> past the arm64 folks.  get_file_rcu() is nothing special in that respect *AND*
> it has to cope with any architecture out there.
> 
> BTW, keep in mind that there's such thing as a KVM - race windows are much
> wider there, since a thread representing a guest CPU might lose its timeslice
> whenever the host feels like that.  At which point you get a single instruction
> on a guest CPU taking longer than many thousands of instructions on another
> CPU of the same guest.

The same thing can happen if a hardware interrupt occurs.
Not only the delays for the interrupt itself, but all the softint
processing that happens as well.
That can take a long time - even milliseconds.

	David
