linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed
@ 2018-10-13  6:47 Gao Xiang
  2018-10-13  7:04 ` Greg Kroah-Hartman
  2018-10-30  6:04 ` [PATCH v2] " Gao Xiang
  0 siblings, 2 replies; 19+ messages in thread
From: Gao Xiang @ 2018-10-13  6:47 UTC (permalink / raw)
  To: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman
  Cc: linux-kernel, Miao Xie, Chao Yu, Gao Xiang

It is better to use smp_cond_load_relaxed instead
of busy waiting for bit_spinlock.

Signed-off-by: Gao Xiang <hsiangkao@aol.com>
---
 include/linux/bit_spinlock.h | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index bbc4730..713efc4 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -15,22 +15,17 @@
  */
 static inline void bit_spin_lock(int bitnum, unsigned long *addr)
 {
-	/*
-	 * Assuming the lock is uncontended, this never enters
-	 * the body of the outer loop. If it is contended, then
-	 * within the inner loop a non-atomic test is used to
-	 * busywait with less bus contention for a good time to
-	 * attempt to acquire the lock bit.
-	 */
-	preempt_disable();
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
-		preempt_enable();
-		do {
-			cpu_relax();
-		} while (test_bit(bitnum, addr));
+	while (1) {
+		smp_cond_load_relaxed(&addr[BIT_WORD(bitnum)],
+				      !(VAL >> (bitnum & (BITS_PER_LONG-1))));
 		preempt_disable();
+		if (!test_and_set_bit_lock(bitnum, addr))
+			break;
+		preempt_enable();
 	}
+#else
+	preempt_disable();
 #endif
 	__acquire(bitlock);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-13  6:47 [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed Gao Xiang
@ 2018-10-13  7:04 ` Greg Kroah-Hartman
  2018-10-13  7:22   ` Gao Xiang
  2018-10-30  6:04 ` [PATCH v2] " Gao Xiang
  1 sibling, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2018-10-13  7:04 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, linux-kernel,
	Miao Xie, Chao Yu

On Sat, Oct 13, 2018 at 02:47:29PM +0800, Gao Xiang wrote:
> It is better to use smp_cond_load_relaxed instead
> of busy waiting for bit_spinlock.

Why?  I think we need some kind of "proof" that this is true before
being able to accept a patch like this, don't you agree?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-13  7:04 ` Greg Kroah-Hartman
@ 2018-10-13  7:22   ` Gao Xiang
  2018-10-13  7:30     ` Greg Kroah-Hartman
  2018-10-13  7:30     ` Gao Xiang
  0 siblings, 2 replies; 19+ messages in thread
From: Gao Xiang @ 2018-10-13  7:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, linux-kernel,
	Miao Xie, Chao Yu

Hi Greg,

On 2018/10/13 15:04, Greg Kroah-Hartman wrote:
> On Sat, Oct 13, 2018 at 02:47:29PM +0800, Gao Xiang wrote:
>> It is better to use smp_cond_load_relaxed instead
>> of busy waiting for bit_spinlock.
> 
> Why?  I think we need some kind of "proof" that this is true before
> being able to accept a patch like this, don't you agree?

There are some materials which discuss smp_cond_load_* earlier.
https://patchwork.kernel.org/patch/10335991/
https://patchwork.kernel.org/patch/10325057/

On ARM64, a function called "cmpwait" is implemented, which uses
hardware instructions to wait for a value to change; I think it is more
energy efficient than an open-coded busy loop...

And it seems smp_cond_load_* is already used in the current kernel, such as:
./kernel/locking/mcs_spinlock.h
./kernel/locking/qspinlock.c
./kernel/sched/core.c
./kernel/smp.c

For other architectures like x86/arm64, I think they could implement
smp_cond_load_* later.
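
Until then, those architectures simply get the generic fallback, which is
roughly the following open-coded loop (a simplified sketch of the definition
in include/asm-generic/barrier.h, not the exact code):

#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
	typeof(ptr) __PTR = (ptr);				\
	typeof(*ptr) VAL;					\
	for (;;) {						\
		VAL = READ_ONCE(*__PTR);			\
		if (cond_expr)					\
			break;					\
		cpu_relax();					\
	}							\
	VAL;							\
})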


Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-13  7:22   ` Gao Xiang
@ 2018-10-13  7:30     ` Greg Kroah-Hartman
  2018-10-13  7:44       ` Gao Xiang
  2018-10-13  7:30     ` Gao Xiang
  1 sibling, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2018-10-13  7:30 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, linux-kernel,
	Miao Xie, Chao Yu

On Sat, Oct 13, 2018 at 03:22:08PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/10/13 15:04, Greg Kroah-Hartman wrote:
> > On Sat, Oct 13, 2018 at 02:47:29PM +0800, Gao Xiang wrote:
> >> It is better to use smp_cond_load_relaxed instead
> >> of busy waiting for bit_spinlock.
> > 
> > Why?  I think we need some kind of "proof" that this is true before
> > being able to accept a patch like this, don't you agree?
> 
> There are some materials which discuss smp_cond_load_* earlier.
> https://patchwork.kernel.org/patch/10335991/
> https://patchwork.kernel.org/patch/10325057/
> 
> On ARM64, a function called "cmpwait" is implemented, which uses
> hardware instructions to wait for a value to change; I think it is more
> energy efficient than an open-coded busy loop...
> 
> And it seems smp_cond_load_* is already used in the current kernel, such as:
> ./kernel/locking/mcs_spinlock.h
> ./kernel/locking/qspinlock.c
> ./kernel/sched/core.c
> ./kernel/smp.c
> 
> For other architectures like x86/arm64, I think they could implement
> smp_cond_load_* later.

And have you benchmarked this change to show that it provides any
benefit?

You need to do that...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-13  7:22   ` Gao Xiang
  2018-10-13  7:30     ` Greg Kroah-Hartman
@ 2018-10-13  7:30     ` Gao Xiang
  1 sibling, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2018-10-13  7:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, linux-kernel,
	Miao Xie, Chao Yu



On 2018/10/13 15:22, Gao Xiang wrote:
> For other architectures like x86/arm64, I think they could implement
> smp_cond_load_* later.

Sorry about that, I meant "amd64".

Actually I don't have performance numbers to prove that now. I think
it really depends on the detailed hardware implementation of each architecture.

In my opinion, it is simply better to wrap this up rather than
open-code it all over the place, e.g.
do {
  cpu_relax()
} while(...);
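
A minimal sketch of the wrapped-up form I have in mind (illustrative only;
the helper name is made up):

#include <linux/bitops.h>
#include <asm/barrier.h>

/*
 * Illustrative sketch only: wrap the open-coded wait above so that an
 * architecture-specific smp_cond_load_relaxed() (e.g. a wait-for-event
 * based one on arm64) is used instead of spinning on cpu_relax().
 * The helper name is hypothetical.
 */
static inline void wait_for_bit_clear(int bitnum, unsigned long *addr)
{
	smp_cond_load_relaxed(&addr[BIT_WORD(bitnum)],
			      !(VAL & BIT_MASK(bitnum)));
}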

I was just cleaning up the EROFS file system and saw this piece of code
(bit_spinlock) by chance, so I wrote a patch to get some feedback on
the idea...

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-13  7:30     ` Greg Kroah-Hartman
@ 2018-10-13  7:44       ` Gao Xiang
  0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2018-10-13  7:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, linux-kernel,
	Miao Xie, Chao Yu

Hi Greg,

On 2018/10/13 15:30, Greg Kroah-Hartman wrote:
> On Sat, Oct 13, 2018 at 03:22:08PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/10/13 15:04, Greg Kroah-Hartman wrote:
>>> On Sat, Oct 13, 2018 at 02:47:29PM +0800, Gao Xiang wrote:
>>>> It is better to use smp_cond_load_relaxed instead
>>>> of busy waiting for bit_spinlock.
>>>
>>> Why?  I think we need some kind of "proof" that this is true before
>>> being able to accept a patch like this, don't you agree?
>>
>> There are some materials which discuss smp_cond_load_* earlier.
>> https://patchwork.kernel.org/patch/10335991/
>> https://patchwork.kernel.org/patch/10325057/
>>
>> On ARM64, a function called "cmpwait" is implemented, which uses
>> hardware instructions to wait for a value to change; I think it is more
>> energy efficient than an open-coded busy loop...
>>
>> And it seems smp_cond_load_* is already used in the current kernel, such as:
>> ./kernel/locking/mcs_spinlock.h
>> ./kernel/locking/qspinlock.c
>> ./kernel/sched/core.c
>> ./kernel/smp.c
>>
>> For other architectures like x86/arm64, I think they could implement
>> smp_cond_load_* later.
> 
> And have you benchmarked this change to show that it provides any
> benefit?
> 
> You need to do that...

OK, it is indeed my responsibility to test this patch on ARM64.

I will test it on ARM64 later to see whether it makes any performance
difference, after I have handled the EROFS product landing work
(perhaps weeks later; there is a lot of urgent work for the current
product that I need to take care of...)

Or if some warm-hearted folks are interested in it, I'm very happy to see
other implementations or comments on it. :)

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-30  6:04 ` [PATCH v2] " Gao Xiang
@ 2018-10-30  5:52   ` Gao Xiang
  2018-11-05 17:11     ` Gao Xiang
  2018-11-05 22:49   ` Will Deacon
  1 sibling, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2018-10-30  5:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, Will Deacon,
	linux-kernel, Miao Xie, Chao Yu

Hi,

On 2018/10/30 14:04, Gao Xiang wrote:
> It is better to use wrapped smp_cond_load_relaxed
> instead of open-coded busy waiting for bit_spinlock.
> 
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
> 
> change log v2:
>  - fix the incorrect expression !(VAL >> (bitnum & (BITS_PER_LONG-1)))
>  - the test result is described in the following reply.
> 
> Thanks,
> Gao Xiang


Simple test script:
#include <linux/kthread.h>
#include <linux/bit_spinlock.h>
#include <linux/module.h>

unsigned long global_lock;

int test_thread(void *data)
{
	unsigned long thread_id = (unsigned long)data;
	int i;
	u64 start = ktime_get_ns();

	for (i = 0; i < 500000; ++i) {
		bit_spin_lock(0, &global_lock);
		__asm__("yield");
		bit_spin_unlock(0, &global_lock);
	}
	pr_err("Thread id: %lu time: %llu\n", thread_id, ktime_get_ns() - start);

	do_exit(0);
}


static int __init bitspinlock_test_module_init(void)
{
	int i;

	for (i = 0; i < 8; ++i) {
		if (IS_ERR(kthread_run(test_thread, (void *)(unsigned long)i, "thread-%d", i)))
			pr_err("fail to create thread %d\n", i);
	}

	return 0;
}

static void __exit bitspinlock_test_module_exit(void)
{
}

module_init(bitspinlock_test_module_init);
module_exit(bitspinlock_test_module_exit);
MODULE_LICENSE("GPL");
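
The module can be built out of tree with a standard kbuild Makefile along
these lines (the module source file name below is an assumption); the
per-thread timings are printed to the kernel log:

obj-m += bitspinlock_test.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean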


...and tested in the following ARM server environment:

Processor: HI1616 (https://en.wikichip.org/wiki/hisilicon/hi16xx/hi1616)
Board: HiSilicon D05 Development Board (http://open-estuary.org/d05/)
Memory: 512GB
Host OS: Ubuntu 18.04.1 LTS (Ubuntu 4.15.0-29.31-generic 4.15.18)
QEMU KVM OS: Linux 4.19 + buildroot
QEMU KVM cmdline: qemu-system-aarch64 -enable-kvm -cpu host -smp 4 -m 256M -kernel Image -M virt,kernel_irqchip=on -nographic -hda rootfs.ext2 -append 'root=/dev/vda console=ttyAMA0 earlycon=pl011,0x9000000' -serial mon:stdio -net none

Without this patch:
  Thread 0   Thread 1   Thread 2   Thread 3   Thread 4   Thread 5   Thread 6   Thread 7
1 1283709480 1271869280  454742480 1173673820 1145643640 1118846920  774616920 1144146140
2  643580180  625143860  576841700  322982340  649987880  585749000  529178880  373374780
3  672307220  847315000  880801860 1039502040  667086380 1033939940 1035381120 1046898300
4  568635580  440547020  737000380  910040880  804543740  712314280  868896880  867049000
5  749107320  726397720  776134480  611970100  756721040  753449440  711691300  609343300

With this patch:
  Thread 0   Thread 1   Thread 2   Thread 3   Thread 4   Thread 5   Thread 6   Thread 7
1  170327620  196322160  169434180   74723860  178145600  178873460  143843260   70998780
2  166415220  129649200  166161240  175241520  155474460  112811860  157003140  150087420
3  511420780  117655640  598641860  596213720  462888760  430838600  554346300  428035120 
4  174520240  156311800  120274280   87465380  172781400  136118620  163728340   63026360
5  153677940  202786860  183626500  140721300  150311360  161266840  168154340  107247460

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-13  6:47 [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed Gao Xiang
  2018-10-13  7:04 ` Greg Kroah-Hartman
@ 2018-10-30  6:04 ` Gao Xiang
  2018-10-30  5:52   ` Gao Xiang
  2018-11-05 22:49   ` Will Deacon
  1 sibling, 2 replies; 19+ messages in thread
From: Gao Xiang @ 2018-10-30  6:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, Will Deacon,
	linux-kernel, Miao Xie, Chao Yu, Gao Xiang

It is better to use wrapped smp_cond_load_relaxed
instead of open-coded busy waiting for bit_spinlock.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---

change log v2:
 - fix the incorrect expression !(VAL >> (bitnum & (BITS_PER_LONG-1)))
 - the test result is described in the following reply.

Thanks,
Gao Xiang

 include/linux/bit_spinlock.h | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index bbc4730a6505..d5f922b5ffd9 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -15,22 +15,19 @@
  */
 static inline void bit_spin_lock(int bitnum, unsigned long *addr)
 {
-	/*
-	 * Assuming the lock is uncontended, this never enters
-	 * the body of the outer loop. If it is contended, then
-	 * within the inner loop a non-atomic test is used to
-	 * busywait with less bus contention for a good time to
-	 * attempt to acquire the lock bit.
-	 */
-	preempt_disable();
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
-		preempt_enable();
-		do {
-			cpu_relax();
-		} while (test_bit(bitnum, addr));
+	const unsigned int bitshift = bitnum & (BITS_PER_LONG - 1);
+
+	while (1) {
+		smp_cond_load_relaxed(&addr[BIT_WORD(bitnum)],
+				      !((VAL >> bitshift) & 1));
 		preempt_disable();
+		if (!test_and_set_bit_lock(bitnum, addr))
+			break;
+		preempt_enable();
 	}
+#else
+	preempt_disable();
 #endif
 	__acquire(bitlock);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-30  5:52   ` Gao Xiang
@ 2018-11-05 17:11     ` Gao Xiang
  0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2018-11-05 17:11 UTC (permalink / raw)
  To: Gao Xiang, Greg Kroah-Hartman
  Cc: Philippe Ombredanne, Kate Stewart, Thomas Gleixner, Will Deacon,
	linux-kernel, Peter Zijlstra, Miao Xie, Chao Yu

Hi all,

ping... is anyone interested in this patch?

It seems bit_spinlock performs much better using smp_cond_load_relaxed
than an open-coded busy loop on arm64...

Thanks,
Gao Xiang

On 2018/10/30 13:52, Gao Xiang wrote:
> Hi,
> 
> On 2018/10/30 14:04, Gao Xiang wrote:
>> It is better to use wrapped smp_cond_load_relaxed
>> instead of open-coded busy waiting for bit_spinlock.
>>
>> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
>> ---
>>
>> change log v2:
>>  - fix the incorrect expression !(VAL >> (bitnum & (BITS_PER_LONG-1)))
>>  - the test result is described in the following reply.
>>
>> Thanks,
>> Gao Xiang
> 
> 
> Simple test script:
> #include <linux/kthread.h>
> #include <linux/bit_spinlock.h>
> #include <linux/module.h>
> 
> unsigned long global_lock;
> 
> int test_thread(void *data)
> {
> 	unsigned long thread_id = (unsigned long)data;
> 	int i;
> 	u64 start = ktime_get_ns();
> 
> 	for (i = 0; i < 500000; ++i) {
> 		bit_spin_lock(0, &global_lock);
> 		__asm__("yield");
> 		bit_spin_unlock(0, &global_lock);
> 	}
> 	pr_err("Thread id: %lu time: %llu\n", thread_id, ktime_get_ns() - start);
> 
> 	do_exit(0);
> }
> 
> 
> static int __init bitspinlock_test_module_init(void)
> {
> 	int i;
> 
> 	for (i = 0; i < 8; ++i) {
> 		if (IS_ERR(kthread_run(test_thread, (void *)(unsigned long)i, "thread-%d", i)))
> 			pr_err("fail to create thread %d\n", i);
> 	}
> 
> 	return 0;
> }
> 
> static void __exit bitspinlock_test_module_exit(void)
> {
> }
> 
> module_init(bitspinlock_test_module_init);
> module_exit(bitspinlock_test_module_exit);
> MODULE_LICENSE("GPL");
> 
> 
> ...and tested in the following ARM server environment:
> 
> Processor: HI1616 (https://en.wikichip.org/wiki/hisilicon/hi16xx/hi1616)
> Board: HiSilicon D05 Development Board (http://open-estuary.org/d05/)
> Memory: 512GB
> Host OS: Ubuntu 18.04.1 LTS (Ubuntu 4.15.0-29.31-generic 4.15.18)
> QEMU KVM OS: Linux 4.19 + buildroot
> QEMU KVM cmdline: qemu-system-aarch64 -enable-kvm -cpu host -smp 4 -m 256M -kernel Image -M virt,kernel_irqchip=on -nographic -hda rootfs.ext2 -append 'root=/dev/vda console=ttyAMA0 earlycon=pl011,0x9000000' -serial mon:stdio -net none
> 
> Without this patch:
>   Thread 0   Thread 1   Thread 2   Thread 3   Thread 4   Thread 5   Thread 6   Thread 7
> 1 1283709480 1271869280  454742480 1173673820 1145643640 1118846920  774616920 1144146140
> 2  643580180  625143860  576841700  322982340  649987880  585749000  529178880  373374780
> 3  672307220  847315000  880801860 1039502040  667086380 1033939940 1035381120 1046898300
> 4  568635580  440547020  737000380  910040880  804543740  712314280  868896880  867049000
> 5  749107320  726397720  776134480  611970100  756721040  753449440  711691300  609343300
> 
> With this patch:
>   Thread 0   Thread 1   Thread 2   Thread 3   Thread 4   Thread 5   Thread 6   Thread 7
> 1  170327620  196322160  169434180   74723860  178145600  178873460  143843260   70998780
> 2  166415220  129649200  166161240  175241520  155474460  112811860  157003140  150087420
> 3  511420780  117655640  598641860  596213720  462888760  430838600  554346300  428035120 
> 4  174520240  156311800  120274280   87465380  172781400  136118620  163728340   63026360
> 5  153677940  202786860  183626500  140721300  150311360  161266840  168154340  107247460
> 
> Thanks,
> Gao Xiang
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-10-30  6:04 ` [PATCH v2] " Gao Xiang
  2018-10-30  5:52   ` Gao Xiang
@ 2018-11-05 22:49   ` Will Deacon
  2018-11-06  1:45     ` Gao Xiang
  2018-11-06  9:06     ` Peter Zijlstra
  1 sibling, 2 replies; 19+ messages in thread
From: Will Deacon @ 2018-11-05 22:49 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Greg Kroah-Hartman, Philippe Ombredanne, Kate Stewart,
	Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu, peterz

[+PeterZ -- please include him on stuff like this]

Hi Gao,

On Tue, Oct 30, 2018 at 02:04:41PM +0800, Gao Xiang wrote:
> It is better to use wrapped smp_cond_load_relaxed
> instead of open-coded busy waiting for bit_spinlock.
> 
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
> 
> change log v2:
>  - fix the incorrect expression !(VAL >> (bitnum & (BITS_PER_LONG-1)))
>  - the test result is described in the following reply.

Please include the results in the commit message, so that this change is
justified.

> diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
> index bbc4730a6505..d5f922b5ffd9 100644
> --- a/include/linux/bit_spinlock.h
> +++ b/include/linux/bit_spinlock.h
> @@ -15,22 +15,19 @@
>   */
>  static inline void bit_spin_lock(int bitnum, unsigned long *addr)
>  {
> -	/*
> -	 * Assuming the lock is uncontended, this never enters
> -	 * the body of the outer loop. If it is contended, then
> -	 * within the inner loop a non-atomic test is used to
> -	 * busywait with less bus contention for a good time to
> -	 * attempt to acquire the lock bit.
> -	 */
> -	preempt_disable();
>  #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> -	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
> -		preempt_enable();
> -		do {
> -			cpu_relax();
> -		} while (test_bit(bitnum, addr));
> +	const unsigned int bitshift = bitnum & (BITS_PER_LONG - 1);
> +
> +	while (1) {
> +		smp_cond_load_relaxed(&addr[BIT_WORD(bitnum)],
> +				      !((VAL >> bitshift) & 1));
>  		preempt_disable();
> +		if (!test_and_set_bit_lock(bitnum, addr))
> +			break;
> +		preempt_enable();
>  	}
> +#else
> +	preempt_disable();

This appears to introduce a bunch of overhead for the uncontended fastpath.
How about the much-simpler-but-completely-untested (tm) patch below?

Will

--->8

diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
index 3ae021368f48..9de8d3544630 100644
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -6,6 +6,15 @@
 #include <linux/compiler.h>
 #include <asm/barrier.h>
 
+static inline void spin_until_bit_unlock(unsigned int nr,
+					 volatile unsigned long *p)
+{
+	unsigned long mask = BIT_MASK(nr);
+
+	p += BIT_WORD(nr);
+	smp_cond_load_relaxed(p, !(VAL & mask));
+}
+
 /**
  * test_and_set_bit_lock - Set a bit and return its old value, for lock
  * @nr: Bit to set
diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index bbc4730a6505..d711c62e718c 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -26,9 +26,7 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
 	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
 		preempt_enable();
-		do {
-			cpu_relax();
-		} while (test_bit(bitnum, addr));
+		spin_until_bit_unlock(bitnum, addr);
 		preempt_disable();
 	}
 #endif

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-05 22:49   ` Will Deacon
@ 2018-11-06  1:45     ` Gao Xiang
  2018-11-06  9:06     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2018-11-06  1:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Greg Kroah-Hartman, Philippe Ombredanne, Kate Stewart,
	Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu, peterz

Hi Will,

On 2018/11/6 6:49, Will Deacon wrote:
> Hi Gao,
> 
> On Tue, Oct 30, 2018 at 02:04:41PM +0800, Gao Xiang wrote:
>> It is better to use wrapped smp_cond_load_relaxed
>> instead of open-coded busy waiting for bit_spinlock.
>>
>> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
>> ---
>>
>> change log v2:
>>  - fix the incorrect expression !(VAL >> (bitnum & (BITS_PER_LONG-1)))
>>  - the test result is described in the following reply.
> Please include the results in the commit message, so that this change is
> justified.

Will add in the next version...

> 
> This appears to introduce a bunch of overhead for the uncontended fastpath.
> How about the much-simpler-but-completely-untested (tm) patch below?

Actually I thought about doing something like the following (much simpler indeed) at first...

But the current implementation of smp_cond_load_relaxed evaluates the condition
immediately, which seems unnecessary here (we get here right after a failed
test_and_set_bit_lock rather than after __cmpwait_relaxed...):
        for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		__cmpwait_relaxed(__PTR, VAL);				\
	}								\

p.s. I have no idea whether the original uncontended fastpath really works
effectively... any thoughts on this? Thanks in advance...


Thanks,
Gao Xiang


> 
> Will
> 
> --->8
> 
> diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
> index 3ae021368f48..9de8d3544630 100644
> --- a/include/asm-generic/bitops/lock.h
> +++ b/include/asm-generic/bitops/lock.h
> @@ -6,6 +6,15 @@
>  #include <linux/compiler.h>
>  #include <asm/barrier.h>
>  
> +static inline void spin_until_bit_unlock(unsigned int nr,
> +					 volatile unsigned long *p)
> +{
> +	unsigned long mask = BIT_MASK(nr);
> +
> +	p += BIT_WORD(nr);
> +	smp_cond_load_relaxed(p, !(VAL & mask));
> +}
> +
>  /**
>   * test_and_set_bit_lock - Set a bit and return its old value, for lock
>   * @nr: Bit to set
> diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
> index bbc4730a6505..d711c62e718c 100644
> --- a/include/linux/bit_spinlock.h
> +++ b/include/linux/bit_spinlock.h
> @@ -26,9 +26,7 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
>  #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
>  	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
>  		preempt_enable();
> -		do {
> -			cpu_relax();
> -		} while (test_bit(bitnum, addr));
> +		spin_until_bit_unlock(bitnum, addr);
>  		preempt_disable();
>  	}
>  #endif

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-05 22:49   ` Will Deacon
  2018-11-06  1:45     ` Gao Xiang
@ 2018-11-06  9:06     ` Peter Zijlstra
  2018-11-06 10:22       ` Gao Xiang
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2018-11-06  9:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Gao Xiang, Greg Kroah-Hartman, Philippe Ombredanne, Kate Stewart,
	Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

On Mon, Nov 05, 2018 at 10:49:21PM +0000, Will Deacon wrote:
> diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
> index 3ae021368f48..9de8d3544630 100644
> --- a/include/asm-generic/bitops/lock.h
> +++ b/include/asm-generic/bitops/lock.h
> @@ -6,6 +6,15 @@
>  #include <linux/compiler.h>
>  #include <asm/barrier.h>
>  
> +static inline void spin_until_bit_unlock(unsigned int nr,
> +					 volatile unsigned long *p)
> +{
> +	unsigned long mask = BIT_MASK(nr);
> +
> +	p += BIT_WORD(nr);
> +	smp_cond_load_relaxed(p, !(VAL & mask));
> +}
> +
>  /**
>   * test_and_set_bit_lock - Set a bit and return its old value, for lock
>   * @nr: Bit to set
> diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
> index bbc4730a6505..d711c62e718c 100644
> --- a/include/linux/bit_spinlock.h
> +++ b/include/linux/bit_spinlock.h
> @@ -26,9 +26,7 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
>  #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
>  	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
>  		preempt_enable();
> -		do {
> -			cpu_relax();
> -		} while (test_bit(bitnum, addr));
> +		spin_until_bit_unlock(bitnum, addr);
>  		preempt_disable();
>  	}
>  #endif

Yes, that's much better. Ideally though, we'd get rid of bit spinlocks
that have significant enough contention for this to matter.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06  9:06     ` Peter Zijlstra
@ 2018-11-06 10:22       ` Gao Xiang
  2018-11-06 11:00         ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2018-11-06 10:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

Hi Peter,

On 2018/11/6 17:06, Peter Zijlstra wrote:
> On Mon, Nov 05, 2018 at 10:49:21PM +0000, Will Deacon wrote:
>> diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
>> index 3ae021368f48..9de8d3544630 100644
>> --- a/include/asm-generic/bitops/lock.h
>> +++ b/include/asm-generic/bitops/lock.h
>> @@ -6,6 +6,15 @@
>>  #include <linux/compiler.h>
>>  #include <asm/barrier.h>
>>  
>> +static inline void spin_until_bit_unlock(unsigned int nr,
>> +					 volatile unsigned long *p)
>> +{
>> +	unsigned long mask = BIT_MASK(nr);
>> +
>> +	p += BIT_WORD(nr);
>> +	smp_cond_load_relaxed(p, !(VAL & mask));
>> +}
>> +
>>  /**
>>   * test_and_set_bit_lock - Set a bit and return its old value, for lock
>>   * @nr: Bit to set
>> diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
>> index bbc4730a6505..d711c62e718c 100644
>> --- a/include/linux/bit_spinlock.h
>> +++ b/include/linux/bit_spinlock.h
>> @@ -26,9 +26,7 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
>>  #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
>>  	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
>>  		preempt_enable();
>> -		do {
>> -			cpu_relax();
>> -		} while (test_bit(bitnum, addr));
>> +		spin_until_bit_unlock(bitnum, addr);
>>  		preempt_disable();
>>  	}
>>  #endif
> 
> Yes, that's much better. Ideally though, we'd get rid of bit spinlocks
> that have significant enough contention for this to matter.

OK, I will send v3 to fix like the above.

Thanks,
Gao Xiang

> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06 10:22       ` Gao Xiang
@ 2018-11-06 11:00         ` Peter Zijlstra
  2018-11-06 11:36           ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2018-11-06 11:00 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

On Tue, Nov 06, 2018 at 06:22:31PM +0800, Gao Xiang wrote:
> Hi Peter,
> 
> On 2018/11/6 17:06, Peter Zijlstra wrote:
> > On Mon, Nov 05, 2018 at 10:49:21PM +0000, Will Deacon wrote:
> >> diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
> >> index 3ae021368f48..9de8d3544630 100644
> >> --- a/include/asm-generic/bitops/lock.h
> >> +++ b/include/asm-generic/bitops/lock.h
> >> @@ -6,6 +6,15 @@
> >>  #include <linux/compiler.h>
> >>  #include <asm/barrier.h>
> >>  
> >> +static inline void spin_until_bit_unlock(unsigned int nr,
> >> +					 volatile unsigned long *p)
> >> +{
> >> +	unsigned long mask = BIT_MASK(nr);
> >> +
> >> +	p += BIT_WORD(nr);
> >> +	smp_cond_load_relaxed(p, !(VAL & mask));
> >> +}
> >> +
> >>  /**
> >>   * test_and_set_bit_lock - Set a bit and return its old value, for lock
> >>   * @nr: Bit to set
> >> diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
> >> index bbc4730a6505..d711c62e718c 100644
> >> --- a/include/linux/bit_spinlock.h
> >> +++ b/include/linux/bit_spinlock.h
> >> @@ -26,9 +26,7 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
> >>  #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> >>  	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
> >>  		preempt_enable();
> >> -		do {
> >> -			cpu_relax();
> >> -		} while (test_bit(bitnum, addr));
> >> +		spin_until_bit_unlock(bitnum, addr);
> >>  		preempt_disable();
> >>  	}
> >>  #endif
> > 
> > Yes, that's much better. Ideally though, we'd get rid of bit spinlocks
> > that have significant enough contention for this to matter.
> 
> OK, I will send v3 to fix like the above.

That's not answering the full question though. What bit spinlocks did
you find where this matters? And can't we convert them to proper
spinlocks instead?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06 11:00         ` Peter Zijlstra
@ 2018-11-06 11:36           ` Gao Xiang
  2018-11-06 12:27             ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2018-11-06 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

Hi Peter,

On 2018/11/6 19:00, Peter Zijlstra wrote:
>>> Yes, that's much better. Ideally though, we'd get rid of bit spinlocks
>>> that have significant enough contention for this to matter.
>> OK, I will send v3 to fix like the above.
> That's not answering the full question though. What bit spinlocks did
> you find where this matters? And can't we convert them to proper
> spinlocks instead?

I just misunderstood your question and I get your point now.

"What bit spinlocks did you find where this matters?" nope..I said the original
background to Greg before, that is I tried to use smp_cond_load_relaxed instead
of busy spining in the development of erofs file system and I saw this bit_spinlock
implementation by chance...This is not a big modification but since I raised the
question before and I want to trace to the end...

"And can't we convert them to proper spinlocks instead?" I think bit_spinlock is
sometime preferred since it requires little memory and can be integrated into some
fields(eg. flags)...It is selectable for spinlocks don't have to many users
at the same time, so I think it depends on the detailed real use scenerio...
It is just a tool for user code to select case by case... That is my personal idea...
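
For example, here is a purely illustrative sketch (the structure, field and bit
names are made up) of how a bit_spinlock can live inside an existing flags word,
whereas a proper spinlock needs its own field:

#include <linux/bit_spinlock.h>
#include <linux/spinlock.h>

#define OBJ_LOCK_BIT	0	/* hypothetical lock bit inside ->flags */

struct packed_obj {
	unsigned long flags;	/* bit 0 doubles as the lock, no extra field */
};

struct plain_obj {
	spinlock_t lock;	/* dedicated lock, costs extra space */
	unsigned long flags;
};

static void packed_obj_update(struct packed_obj *o)
{
	bit_spin_lock(OBJ_LOCK_BIT, &o->flags);
	/* ... update the other flag bits under the lock ... */
	bit_spin_unlock(OBJ_LOCK_BIT, &o->flags);
}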

IMO, using a wrapped-up function for this kind of scenario could be better than
open-coding it everywhere (e.g. do { cpu_relax(); } while (...)) since it could be
optimized even further for the specific architecture...

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06 11:36           ` Gao Xiang
@ 2018-11-06 12:27             ` Peter Zijlstra
  2018-11-06 12:33               ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2018-11-06 12:27 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

On Tue, Nov 06, 2018 at 07:36:41PM +0800, Gao Xiang wrote:
> IMO, using a wrapped-up function for this kind of scenario could be better than
> open-coding it everywhere (e.g. do { cpu_relax(); } while (...)) since it could be
> optimized even further for the specific architecture...

That's the whole point though; if this actually matters, you're doing it
wrong.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06 12:27             ` Peter Zijlstra
@ 2018-11-06 12:33               ` Gao Xiang
  2018-11-06 12:38                 ` Gao Xiang
  2018-11-06 12:43                 ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Gao Xiang @ 2018-11-06 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

Hi Peter,

On 2018/11/6 20:27, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 07:36:41PM +0800, Gao Xiang wrote:
>> IMO, using a wrapped-up function for this kind of scenario could be better than
>> open-coding it everywhere (e.g. do { cpu_relax(); } while (...)) since it could be
>> optimized even further for the specific architecture...
> That's the whole point though; if this actually matters, you're doing it
> wrong.

I cannot fully understand your point... sorry about my English...

To be clear, do you mean it is better to fix it as Will suggested before, or to
leave the matter as it is since the performance of bit_spinlock itself doesn't matter?

Thanks in advance.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06 12:33               ` Gao Xiang
@ 2018-11-06 12:38                 ` Gao Xiang
  2018-11-06 12:43                 ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2018-11-06 12:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

Hi Peter,

OK, I think I understand your point. If you think the performance of bit_spinlock
doesn't matter, let's keep the current status.

Thanks,
Gao Xiang

On 2018/11/6 20:33, Gao Xiang wrote:
> Hi Peter,
> 
> On 2018/11/6 20:27, Peter Zijlstra wrote:
>> On Tue, Nov 06, 2018 at 07:36:41PM +0800, Gao Xiang wrote:
>>> IMO, using a wrapped-up function for this kind of scenario could be better than
>>> open-coding it everywhere (e.g. do { cpu_relax(); } while (...)) since it could be
>>> optimized even further for the specific architecture...
>> That's the whole point though; if this actually matters, you're doing it
>> wrong.
> 
> I cannot fully understand your point... sorry about my English...
> 
> To be clear, do you mean it is better to fix it as Will suggested before, or to
> leave the matter as it is since the performance of bit_spinlock itself doesn't matter?
> 
> Thanks in advance.
> 
> Thanks,
> Gao Xiang
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] bit_spinlock: introduce smp_cond_load_relaxed
  2018-11-06 12:33               ` Gao Xiang
  2018-11-06 12:38                 ` Gao Xiang
@ 2018-11-06 12:43                 ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2018-11-06 12:43 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Will Deacon, Greg Kroah-Hartman, Philippe Ombredanne,
	Kate Stewart, Thomas Gleixner, linux-kernel, Miao Xie, Chao Yu

On Tue, Nov 06, 2018 at 08:33:56PM +0800, Gao Xiang wrote:
> Hi Peter,
> 
> On 2018/11/6 20:27, Peter Zijlstra wrote:
> > On Tue, Nov 06, 2018 at 07:36:41PM +0800, Gao Xiang wrote:
> >> IMO, using a wrapped-up function for this kind of scenario could be better than
> >> open-coding it everywhere (e.g. do { cpu_relax(); } while (...)) since it could be
> >> optimized even further for the specific architecture...
> > That's the whole point though; if this actually matters, you're doing it
> > wrong.
> 
> I cannot fully understand your point... sorry about my English...
> 
> To be clear, do you mean it is better to fix it as Will suggested before, or to
> leave the matter as it is since the performance of bit_spinlock itself doesn't matter?

Right, bit-spinlocks are terrible when contended. If the contended
behaviour of bit-spinlocks starts to matter, you've lost already.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-11-06 12:43 UTC | newest]

Thread overview: 19+ messages
2018-10-13  6:47 [RFC PATCH] bit_spinlock: introduce smp_cond_load_relaxed Gao Xiang
2018-10-13  7:04 ` Greg Kroah-Hartman
2018-10-13  7:22   ` Gao Xiang
2018-10-13  7:30     ` Greg Kroah-Hartman
2018-10-13  7:44       ` Gao Xiang
2018-10-13  7:30     ` Gao Xiang
2018-10-30  6:04 ` [PATCH v2] " Gao Xiang
2018-10-30  5:52   ` Gao Xiang
2018-11-05 17:11     ` Gao Xiang
2018-11-05 22:49   ` Will Deacon
2018-11-06  1:45     ` Gao Xiang
2018-11-06  9:06     ` Peter Zijlstra
2018-11-06 10:22       ` Gao Xiang
2018-11-06 11:00         ` Peter Zijlstra
2018-11-06 11:36           ` Gao Xiang
2018-11-06 12:27             ` Peter Zijlstra
2018-11-06 12:33               ` Gao Xiang
2018-11-06 12:38                 ` Gao Xiang
2018-11-06 12:43                 ` Peter Zijlstra
