linux-kernel.vger.kernel.org archive mirror
* Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
@ 2022-06-11 16:32 Zhangfei Gao
  2022-06-11 16:59 ` Paul E. McKenney
  2022-06-14  1:53 ` chenxiang (M)
  0 siblings, 2 replies; 37+ messages in thread
From: Zhangfei Gao @ 2022-06-11 16:32 UTC (permalink / raw)
  To: Paul E. McKenney, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi

Hi, Paul

When verifying qemu with the ACPI RMR feature on v5.19-rc1, the guest kernel
is stuck for several minutes.
On 5.18, there is no such problem.

After reverting this patch, the issue is solved.
Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers
from consuming CPU)


qemu cmd:
build/aarch64-softmmu/qemu-system-aarch64 -machine virt,gic-version=3,iommu=smmuv3 \
-enable-kvm -cpu host -m 1024 \
-kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
"rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
-bios QEMU_EFI.fd

log:
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
add-symbol-file 
/home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll 
0x75459000
Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
ProtectUefiImageCommon - 0x7AA4D040
   - 0x0000000075458000 - 0x000000000003F000
SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 
(0x0000000000004008)
SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 
(0x0000000000020008)
SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 
(0x0000000000004008)
InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728


Not sure whether this has already been reported or solved.

Thanks


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-11 16:32 Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow Zhangfei Gao
@ 2022-06-11 16:59 ` Paul E. McKenney
  2022-06-12  7:40   ` zhangfei.gao
  2022-06-14  1:53 ` chenxiang (M)
  1 sibling, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-11 16:59 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> Hi, Paul
> 
> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> stuck for several minutes.

Stuck for several minutes but then continues normally?  Or stuck for
several minutes before you kill qemu?

And I have to ask...  What happened without the ACPI RMR feature?

> And on 5.18, there is no such problem.
> 
> After revert this patch, the issue solved.
> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from 
> consuming CPU)
> 
> 
> qemu cmd:
> build/aarch64-softmmu/qemu-system-aarch64 -machine
> virt,gic-version=3,iommu=smmuv3 \
> -enable-kvm -cpu host -m 1024 \
> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> -bios QEMU_EFI.fd
> 
> log:
> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> 0x75459000
> Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> ProtectUefiImageCommon - 0x7AA4D040
>   - 0x0000000075458000 - 0x000000000003F000
> SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
> (0x0000000000004008)
> SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
> (0x0000000000020008)
> SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
> (0x0000000000004008)
> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> 
> 
> Not sure it is either reported or solved.

This is the first I have heard of it, so thank you for reporting it.

Do you have a way of collecting something like sysrq-t output?

							Thanx, Paul


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-11 16:59 ` Paul E. McKenney
@ 2022-06-12  7:40   ` zhangfei.gao
  2022-06-12 13:36     ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-12  7:40 UTC (permalink / raw)
  To: paulmck, Zhangfei Gao
  Cc: linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

Hi, Paul

On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
>> Hi, Paul
>>
>> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
>> stuck for several minutes.
> Stuck for several minutes but then continues normally?  Or stuck for
> several minutes before you kill qemu?
qemu boot is stuck for several minutes, then the guest can boot up normally,
just slower.
>
> And I have to ask...  What happened without the ACPI RMR feature?
Without ACPI, qemu boots quickly and does not get stuck.
build/aarch64-softmmu/qemu-system-aarch64 -machine virt,gic-version=3,iommu=smmuv3 \
-enable-kvm -cpu host -m 1024 \
-kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
"rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"

When adding acpi=force & -bios QEMU_EFI.fd, qemu boot is stuck for several minutes.


By the way, my hardware platform is aarch64.

Only the following change solves the stuck issue.

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
        unsigned long jbase = SRCU_INTERVAL;

+       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+               return 0;
+       return SRCU_INTERVAL;
+

>
>> And on 5.18, there is no such problem.
>>
>> After revert this patch, the issue solved.
>> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
>> consuming CPU)
>>
>>
>> qemu cmd:
>> build/aarch64-softmmu/qemu-system-aarch64 -machine
>> virt,gic-version=3,iommu=smmuv3 \
>> -enable-kvm -cpu host -m 1024 \
>> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
>> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
>> -bios QEMU_EFI.fd
>>
>> log:
>> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
>> add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
>> 0x75459000
>> Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
>> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
>> ProtectUefiImageCommon - 0x7AA4D040
>>    - 0x0000000075458000 - 0x000000000003F000
>> SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
>> (0x0000000000004008)
>> SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
>> (0x0000000000020008)
>> SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
>> (0x0000000000004008)
>> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
>> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
>> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
>> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
>> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
>> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
>> InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
>> InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
>> InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
>>
>>
>> Not sure it is either reported or solved.
> This is the first I have heard of it, so thank you for reporting it.
>
> Do you have a way of collecting something sysrq-t output?
Do you mean "echo t > /proc/sysrq-trigger"?
There is too much output and the kernel dump can not stop.

Thanks


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12  7:40   ` zhangfei.gao
@ 2022-06-12 13:36     ` Paul E. McKenney
  2022-06-12 14:59       ` zhangfei.gao
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 13:36 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote:
> Hi, Paul
> 
> On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > Hi, Paul
> > > 
> > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > stuck for several minutes.
> > Stuck for several minutes but then continues normally?  Or stuck for
> > several minutes before you kill qemu?
> 
> qemu boot stuck for several minutes, then guest can bootup normally, just
> slower.
> > 
> > And I have to ask...  What happened without the ACPI RMR feature?
> If no ACPI, qemu boot quickly without stuck.
> build/aarch64-softmmu/qemu-system-aarch64 -machine
> virt,gic-version=3,iommu=smmuv3 \
> -enable-kvm -cpu host -m 1024 \
> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> 
> Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
> 
> By the way, my hardware platform is aarch64.

Thank you for the information!  The problem is excessive delay rather
than a hang, and it is configuration-dependent.  Good to know!

> Only change this can solve the stuck issue.
> 
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct
> *ssp)
>  {
>         unsigned long jbase = SRCU_INTERVAL;
> 
> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> +               return 0;
> +       return SRCU_INTERVAL;

I am glad that you have a workaround for this problem, but this change
would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
expedited GPs and blocking readers from consuming CPU") was intended
to fix.  For one example, your change can prevent kernel live patching
from applying a patch.  So something else is needed.

Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
significantly?  (This is not a fix, either, but instead a debug check.)
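For concreteness, that debug experiment is just a one-line change (sketch
only, not a fix; SRCU_MAX_INTERVAL is defined just above srcu_get_delay()
in kernel/rcu/srcutree.c):

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
-#define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
+#define SRCU_MAX_INTERVAL	3	// Debug-only: lower cap to see whether the delay shrinks.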

Your change always returns zero if another SRCU grace period is needed.
Let's look at the callers of srcu_get_delay():

o	cleanup_srcu_struct() uses it to check whether there is an
	expedited grace period pending, leaking the srcu_struct if so.
	This should not affect boot delay.  (Unless you are invoking
	init_srcu_struct() and cleanup_srcu_struct() really really
	often.)

o	srcu_gp_end() uses it to determine whether or not to allow
	a one-jiffy delay before invoking callbacks at the end of
	a grace period.

o	srcu_funnel_gp_start() uses it to determine whether or not to
	allow a one-jiffy delay before starting the process of checking
	for the end of an SRCU grace period.

o	try_check_zero() uses it to add an additional short delay
	(instead of a long delay) between checks of reader state.

o	process_srcu() uses it to calculate the long delay between
	checks of reader state.

These add one-jiffy delays, except for process_srcu(), which adds a delay
of up to 10 jiffies.  Even given HZ=100 (as opposed to the HZ=1000 that
I normally use), this requires thousands of such delays to add up to the
several minutes that you are seeing.  (In theory, the delays could also
be due to SRCU readers, except that in that case adjusting timeouts in
the grace-period processing would not make things go faster.)
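(As a rough sanity check of that arithmetic, assuming HZ=100 so that one
jiffy is 10 ms: three minutes is 180,000 ms, which corresponds to roughly
18,000 one-jiffy delays, or about 1,800 of the up-to-ten-jiffy delays in
process_srcu().)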

So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods?  If so,
it would be very good to make sure that this code is using SRCU efficiently.
One way to check would be to put a printk() into synchronize_srcu(),
though maintaining a counter and printing (say) every 1000th invocation
might be easier on the console output.
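A minimal sketch of that instrumentation (debug-only, not for mainline; the
helper and counter names are made up, only __synchronize_srcu() itself is
real), dropped into kernel/rcu/srcutree.c and called at the top of
__synchronize_srcu():

/* Count __synchronize_srcu() invocations and print (plus a backtrace)
 * on every 1000th one. */
static atomic_t srcu_sync_dbg_count = ATOMIC_INIT(0);

static void srcu_sync_dbg(void)
{
	int n = atomic_inc_return(&srcu_sync_dbg_count);

	if (n % 1000 == 1) {
		pr_info("__synchronize_srcu loop=%d\n", n);
		dump_stack();
	}
}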

> > > And on 5.18, there is no such problem.
> > > 
> > > After revert this patch, the issue solved.
> > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > consuming CPU)
> > > 
> > > 
> > > qemu cmd:
> > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > virt,gic-version=3,iommu=smmuv3 \
> > > -enable-kvm -cpu host -m 1024 \
> > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > -bios QEMU_EFI.fd
> > > 
> > > log:
> > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > > 0x75459000
> > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> > > ProtectUefiImageCommon - 0x7AA4D040
> > >    - 0x0000000075458000 - 0x000000000003F000
> > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
> > > (0x0000000000004008)
> > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
> > > (0x0000000000020008)
> > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
> > > (0x0000000000004008)
> > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> > > 
> > > 
> > > Not sure it is either reported or solved.
> > This is the first I have heard of it, so thank you for reporting it.
> > 
> > Do you have a way of collecting something sysrq-t output?
> Do you mean "echo t > /proc/sysrq-trigger",
> There are too much output and kernel dump can not stop.

OK.  What other tools do you have to work out what is happening during
temporary hangs such as this one?

The question to be answered: "Is there usually at least one task waiting
in synchronize_srcu() during these hangs, and if so, which srcu_struct
is passed to those synchronize_srcu() calls?"

							Thanx, Paul


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 13:36     ` Paul E. McKenney
@ 2022-06-12 14:59       ` zhangfei.gao
  2022-06-12 16:20         ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-12 14:59 UTC (permalink / raw)
  To: paulmck
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

Hi, Paul

On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote:
>> Hi, Paul
>>
>> On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
>>> On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
>>>> Hi, Paul
>>>>
>>>> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
>>>> stuck for several minutes.
>>> Stuck for several minutes but then continues normally?  Or stuck for
>>> several minutes before you kill qemu?
>> qemu boot stuck for several minutes, then guest can bootup normally, just
>> slower.
>>> And I have to ask...  What happened without the ACPI RMR feature?
>> If no ACPI, qemu boot quickly without stuck.
>> build/aarch64-softmmu/qemu-system-aarch64 -machine
>> virt,gic-version=3,iommu=smmuv3 \
>> -enable-kvm -cpu host -m 1024 \
>> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
>> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
>>
>> Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
>>
>> By the way, my hardware platform is aarch64.
> Thank you for the information!  The problem is excessive delay rather
> than a hang, and it is configuration-dependent.  Good to know!
>
>> Only change this can solve the stuck issue.
>>
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct
>> *ssp)
>>   {
>>          unsigned long jbase = SRCU_INTERVAL;
>>
>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>> +               return 0;
>> +       return SRCU_INTERVAL;
> I am glad that you have a workaround for this problem, but this change
> would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> expedited GPs and blocking readers from consuming CPU") was intended
> to fix.  For one example, your change can prevent kernel live patching
> from applying a patch.  So something else is needed.
Understood, it was just to debug where the issue is.
>
> Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> significantly?  (This is not a fix, either, but instead a debug check.)
No, it did not help.
>
> Your change always returns zero if another SRCU grace period is needed.
> Let's look at the callers of srcu_get_delay():
>
> o	cleanup_srcu_struct() uses it to check whether there is an
> 	expedited grace period pending, leaking the srcu_struct if so.
> 	This should not affect boot delay.  (Unless you are invoking
> 	init_srcu_struct() and cleanup_srcu_struct() really really
> 	often.)
>
> o	srcu_gp_end() uses it to determine whether or not to allow
> 	a one-jiffy delay before invoking callbacks at the end of
> 	a grace period.
>
> o	srcu_funnel_gp_start() uses it to determine whether or not to
> 	allow a one-jiffy delay before starting the process of checking
> 	for the end of an SRCU grace period.
>
> o	try_check_zero() uses it to add an additional short delay
> 	(instead of a long delay) between checks of reader state.
>
> o	process_srcu() uses it to calculate the long delay between
> 	checks of reader state.
>
> These add one-jiffy delays, except for process_srcu(), which adds a delay
> of up to 10 jiffies.  Even given HZ=100 (as opposed to the HZ=1000 that
> I normally use), this requires thousands of such delays to add up to the
> several minutes that you are seeing.  (In theory, the delays could also
> be due to SRCU readers, except that in that case adjusting timeouts in
> the grace-period processing would not make things go faster.)
>
> So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods?  If so,
> it would be very good make sure that this code is using SRCU efficiently.
> One way to check would be to put a printk() into synchronize_srcu(),
> though maintaining a counter and printing (say) every 1000th invocation
> might be easier on the console output.
good idea.

>>>> And on 5.18, there is no such problem.
>>>>
>>>> After revert this patch, the issue solved.
>>>> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
>>>> consuming CPU)
>>>>
>>>>
>>>> qemu cmd:
>>>> build/aarch64-softmmu/qemu-system-aarch64 -machine
>>>> virt,gic-version=3,iommu=smmuv3 \
>>>> -enable-kvm -cpu host -m 1024 \
>>>> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
>>>> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
>>>> -bios QEMU_EFI.fd
>>>>
>>>> log:
>>>> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
>>>> add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
>>>> 0x75459000
>>>> Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
>>>> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
>>>> ProtectUefiImageCommon - 0x7AA4D040
>>>>     - 0x0000000075458000 - 0x000000000003F000
>>>> SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
>>>> (0x0000000000004008)
>>>> SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
>>>> (0x0000000000020008)
>>>> SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
>>>> (0x0000000000004008)
>>>> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
>>>> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
>>>> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
>>>> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
>>>> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
>>>> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
>>>> InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
>>>> InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
>>>> InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
>>>>
>>>>
>>>> Not sure it is either reported or solved.
>>> This is the first I have heard of it, so thank you for reporting it.
>>>
>>> Do you have a way of collecting something sysrq-t output?
>> Do you mean "echo t > /proc/sysrq-trigger",
>> There are too much output and kernel dump can not stop.
> OK.  What other tools do you have to work out what is happening during
> temporary hangs such as this one?
>
> The question to be answered: "Is there usually at least one task waiting
> in synchronize_srcu() during these hangs, and if so, which srcu_struct
> is passed to those synchronize_srcu() calls?"

As you suggested, I added a print in __synchronize_srcu, printing once every 1000 calls.

With acpi=force & -bios QEMU_EFI.fd

When qemu is stuck at
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
add-symbol-file 
/home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll 
0x75459000

the print counter in __synchronize_srcu ran from 0 to 9001.

[   94.271350] gzf __synchronize_srcu loop=1001
....

[  222.621659]  __synchronize_srcu loop=9001
[  222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted 
5.19.0-rc1-15071-g697f40b5235f-dirty #615
[  222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, 
BIOS 2280-V2 CS V5.B133.01 03/25/2021
[  222.621667] Call trace:
[  222.621668]  dump_backtrace+0xe4/0xf0
[  222.621670] show_stack+0x20/0x70
[  222.621672] dump_stack_lvl+0x8c/0xb8
[  222.621674]  dump_stack+0x18/0x34
[  222.621676] __synchronize_srcu+0x120/0x128
[  222.621678] synchronize_srcu_expedited+0x2c/0x40
[  222.621680] kvm_swap_active_memslots+0x130/0x198
[  222.621683] kvm_activate_memslot+0x40/0x68
[  222.621684]  kvm_set_memslot+0x184/0x3b0
[  222.621686]  __kvm_set_memory_region+0x288/0x438
[  222.621688] kvm_set_memory_region+0x3c/0x60
[  222.621689]  kvm_vm_ioctl+0x5a0/0x13e0
[  222.621691]  __arm64_sys_ioctl+0xb0/0xf8
[  222.621693]  invoke_syscall+0x4c/0x110
[  222.621696] el0_svc_common.constprop.0+0x68/0x128
[  222.621698]  do_el0_svc+0x34/0xc0
[  222.621701]  el0_svc+0x30/0x98
[  222.621704]  el0t_64_sync_handler+0xb8/0xc0
[  222.621706]  el0t_64_sync+0x18c/0x190


If acpi=force is not used, there is no print at all (printing once every 1000 calls).

Thanks


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 14:59       ` zhangfei.gao
@ 2022-06-12 16:20         ` Paul E. McKenney
  2022-06-12 16:40           ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 16:20 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

On Sun, Jun 12, 2022 at 10:59:30PM +0800, zhangfei.gao@foxmail.com wrote:
> Hi, Paul
> 
> On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote:
> > > Hi, Paul
> > > 
> > > On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > > > Hi, Paul
> > > > > 
> > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > > > stuck for several minutes.
> > > > Stuck for several minutes but then continues normally?  Or stuck for
> > > > several minutes before you kill qemu?
> > > qemu boot stuck for several minutes, then guest can bootup normally, just
> > > slower.
> > > > And I have to ask...  What happened without the ACPI RMR feature?
> > > If no ACPI, qemu boot quickly without stuck.
> > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > virt,gic-version=3,iommu=smmuv3 \
> > > -enable-kvm -cpu host -m 1024 \
> > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> > > 
> > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
> > > 
> > > By the way, my hardware platform is aarch64.
> > Thank you for the information!  The problem is excessive delay rather
> > than a hang, and it is configuration-dependent.  Good to know!
> > 
> > > Only change this can solve the stuck issue.
> > > 
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct
> > > *ssp)
> > >   {
> > >          unsigned long jbase = SRCU_INTERVAL;
> > > 
> > > +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > +               return 0;
> > > +       return SRCU_INTERVAL;
> > I am glad that you have a workaround for this problem, but this change
> > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> > expedited GPs and blocking readers from consuming CPU") was intended
> > to fix.  For one example, your change can prevent kernel live patching
> > from applying a patch.  So something else is needed.
> Understand, it is just debug where has issue.
> > 
> > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> > significantly?  (This is not a fix, either, but instead a debug check.)
> No use.

OK, that indicates that you have a very large number of invocations
of synchronize_srcu() or synchronize_srcu_expedited() instead of only
a few that take a very long time each.

> > Your change always returns zero if another SRCU grace period is needed.
> > Let's look at the callers of srcu_get_delay():
> > 
> > o	cleanup_srcu_struct() uses it to check whether there is an
> > 	expedited grace period pending, leaking the srcu_struct if so.
> > 	This should not affect boot delay.  (Unless you are invoking
> > 	init_srcu_struct() and cleanup_srcu_struct() really really
> > 	often.)
> > 
> > o	srcu_gp_end() uses it to determine whether or not to allow
> > 	a one-jiffy delay before invoking callbacks at the end of
> > 	a grace period.
> > 
> > o	srcu_funnel_gp_start() uses it to determine whether or not to
> > 	allow a one-jiffy delay before starting the process of checking
> > 	for the end of an SRCU grace period.
> > 
> > o	try_check_zero() uses it to add an additional short delay
> > 	(instead of a long delay) between checks of reader state.
> > 
> > o	process_srcu() uses it to calculate the long delay between
> > 	checks of reader state.
> > 
> > These add one-jiffy delays, except for process_srcu(), which adds a delay
> > of up to 10 jiffies.  Even given HZ=100 (as opposed to the HZ=1000 that
> > I normally use), this requires thousands of such delays to add up to the
> > several minutes that you are seeing.  (In theory, the delays could also
> > be due to SRCU readers, except that in that case adjusting timeouts in
> > the grace-period processing would not make things go faster.)
> > 
> > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods?  If so,
> > it would be very good make sure that this code is using SRCU efficiently.
> > One way to check would be to put a printk() into synchronize_srcu(),
> > though maintaining a counter and printing (say) every 1000th invocation
> > might be easier on the console output.
> 
> good idea.

Glad you like it.  ;-)

> > > > > And on 5.18, there is no such problem.
> > > > > 
> > > > > After revert this patch, the issue solved.
> > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > > > consuming CPU)
> > > > > 
> > > > > 
> > > > > qemu cmd:
> > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > > virt,gic-version=3,iommu=smmuv3 \
> > > > > -enable-kvm -cpu host -m 1024 \
> > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > > > -bios QEMU_EFI.fd
> > > > > 
> > > > > log:
> > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > > > > 0x75459000
> > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> > > > > ProtectUefiImageCommon - 0x7AA4D040
> > > > >     - 0x0000000075458000 - 0x000000000003F000
> > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
> > > > > (0x0000000000004008)
> > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
> > > > > (0x0000000000020008)
> > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
> > > > > (0x0000000000004008)
> > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> > > > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> > > > > 
> > > > > 
> > > > > Not sure it is either reported or solved.
> > > > This is the first I have heard of it, so thank you for reporting it.
> > > > 
> > > > Do you have a way of collecting something sysrq-t output?
> > > Do you mean "echo t > /proc/sysrq-trigger",
> > > There are too much output and kernel dump can not stop.
> > OK.  What other tools do you have to work out what is happening during
> > temporary hangs such as this one?
> > 
> > The question to be answered: "Is there usually at least one task waiting
> > in synchronize_srcu() during these hangs, and if so, which srcu_struct
> > is passed to those synchronize_srcu() calls?"
> 
> As you suggested, add print in __synchronize_srcu, 1000 times print once.
> 
> With acpi=force & -bios QEMU_EFI.fd
> 
> When qemu stuck in
> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> 0x75459000
> 
> The print in  __synchronize_srcu is print from 0 t0 9001.

Now that is what I call a large number of calls!

> [   94.271350] gzf __synchronize_srcu loop=1001
> ....
> 
> [  222.621659]  __synchronize_srcu loop=9001
> [  222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted
> 5.19.0-rc1-15071-g697f40b5235f-dirty #615
> [  222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS
> 2280-V2 CS V5.B133.01 03/25/2021
> [  222.621667] Call trace:
> [  222.621668]  dump_backtrace+0xe4/0xf0
> [  222.621670] show_stack+0x20/0x70
> [  222.621672] dump_stack_lvl+0x8c/0xb8
> [  222.621674]  dump_stack+0x18/0x34
> [  222.621676] __synchronize_srcu+0x120/0x128
> [  222.621678] synchronize_srcu_expedited+0x2c/0x40
> [  222.621680] kvm_swap_active_memslots+0x130/0x198
> [  222.621683] kvm_activate_memslot+0x40/0x68
> [  222.621684]  kvm_set_memslot+0x184/0x3b0
> [  222.621686]  __kvm_set_memory_region+0x288/0x438
> [  222.621688] kvm_set_memory_region+0x3c/0x60

This is KVM setting up one mapping in your IORT RMR memory, correct?
(As in I/O remapping table reserved memory regions.)

> [  222.621689]  kvm_vm_ioctl+0x5a0/0x13e0

And this ioctl() is qemu telling KVM to do so, correct?

> [  222.621691]  __arm64_sys_ioctl+0xb0/0xf8
> [  222.621693]  invoke_syscall+0x4c/0x110
> [  222.621696] el0_svc_common.constprop.0+0x68/0x128
> [  222.621698]  do_el0_svc+0x34/0xc0
> [  222.621701]  el0_svc+0x30/0x98
> [  222.621704]  el0t_64_sync_handler+0xb8/0xc0
> [  222.621706]  el0t_64_sync+0x18c/0x190
> 
> 
> If no acpi=force, no print at all, 1000 times one print.

OK, this certainly explains the slowdown, both from adding the IORT RMR
and from the commit that you bisected to.  The answers to a few questions
might help illuminate a path towards a fix:

Do these reserved memory regions really need to be allocated separately?
(For example, are they really all non-contiguous?  If not, that is, if
there are a lot of contiguous memory regions, could you sort the IORT
by address and do one ioctl() for each set of contiguous memory regions?)

Are all of these reserved memory regions set up before init is spawned?

Are all of these reserved memory regions set up while there is only a
single vCPU up and running?

Is the SRCU grace period really needed in this case?  (I freely confess
to not being all that familiar with KVM.)
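In case many of those regions do turn out to be contiguous, here is a rough
userspace-side sketch of the coalescing idea from the first question above
(illustrative only: the helper type and function are made up, QEMU's real
memory-region code is of course more involved, and only the existing
KVM_SET_USER_MEMORY_REGION interface is assumed):

#include <linux/kvm.h>
#include <stdlib.h>
#include <sys/ioctl.h>

struct rmr_region { __u64 gpa, size, uva; };	/* hypothetical helper type */

static int cmp_gpa(const void *a, const void *b)
{
	const struct rmr_region *x = a, *y = b;

	return x->gpa < y->gpa ? -1 : x->gpa > y->gpa;
}

/* Sort by guest physical address, merge adjacent regions, and issue one
 * KVM_SET_USER_MEMORY_REGION per merged range, so far fewer expedited
 * SRCU grace periods are needed on the KVM side. */
static int map_rmr_regions(int vm_fd, struct rmr_region *r, int n, __u32 first_slot)
{
	int i, out = 0;

	if (n <= 0)
		return 0;
	qsort(r, n, sizeof(*r), cmp_gpa);
	for (i = 1; i < n; i++) {
		if (r[out].gpa + r[out].size == r[i].gpa &&
		    r[out].uva + r[out].size == r[i].uva)
			r[out].size += r[i].size;	/* contiguous: merge */
		else
			r[++out] = r[i];
	}
	for (i = 0; i <= out; i++) {
		struct kvm_userspace_memory_region mr = {
			.slot = first_slot + i,
			.guest_phys_addr = r[i].gpa,
			.memory_size = r[i].size,
			.userspace_addr = r[i].uva,
		};

		if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mr) < 0)
			return -1;
	}
	return out + 1;	/* number of ioctls actually issued */
}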

							Thanx, Paul


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 16:20         ` Paul E. McKenney
@ 2022-06-12 16:40           ` Paul E. McKenney
  2022-06-12 17:29             ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 16:40 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi,
	mtosatti, pbonzini, sheng.yang

On Sun, Jun 12, 2022 at 09:20:29AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 10:59:30PM +0800, zhangfei.gao@foxmail.com wrote:
> > Hi, Paul
> > 
> > On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> > > On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote:
> > > > Hi, Paul
> > > > 
> > > > On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > > > > Hi, Paul
> > > > > > 
> > > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > > > > stuck for several minutes.
> > > > > Stuck for several minutes but then continues normally?  Or stuck for
> > > > > several minutes before you kill qemu?
> > > > qemu boot stuck for several minutes, then guest can bootup normally, just
> > > > slower.
> > > > > And I have to ask...  What happened without the ACPI RMR feature?
> > > > If no ACPI, qemu boot quickly without stuck.
> > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > virt,gic-version=3,iommu=smmuv3 \
> > > > -enable-kvm -cpu host -m 1024 \
> > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> > > > 
> > > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
> > > > 
> > > > By the way, my hardware platform is aarch64.
> > > Thank you for the information!  The problem is excessive delay rather
> > > than a hang, and it is configuration-dependent.  Good to know!
> > > 
> > > > Only change this can solve the stuck issue.
> > > > 
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct
> > > > *ssp)
> > > >   {
> > > >          unsigned long jbase = SRCU_INTERVAL;
> > > > 
> > > > +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > +               return 0;
> > > > +       return SRCU_INTERVAL;
> > > I am glad that you have a workaround for this problem, but this change
> > > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> > > expedited GPs and blocking readers from consuming CPU") was intended
> > > to fix.  For one example, your change can prevent kernel live patching
> > > from applying a patch.  So something else is needed.
> > Understand, it is just debug where has issue.
> > > 
> > > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> > > significantly?  (This is not a fix, either, but instead a debug check.)
> > No use.
> 
> OK, that indicates that you have a very large number of invocations
> of synchronize_srcu() or synchronize_srcu_expedited() instead of only
> a few that take a very long time each.
> 
> > > Your change always returns zero if another SRCU grace period is needed.
> > > Let's look at the callers of srcu_get_delay():
> > > 
> > > o	cleanup_srcu_struct() uses it to check whether there is an
> > > 	expedited grace period pending, leaking the srcu_struct if so.
> > > 	This should not affect boot delay.  (Unless you are invoking
> > > 	init_srcu_struct() and cleanup_srcu_struct() really really
> > > 	often.)
> > > 
> > > o	srcu_gp_end() uses it to determine whether or not to allow
> > > 	a one-jiffy delay before invoking callbacks at the end of
> > > 	a grace period.
> > > 
> > > o	srcu_funnel_gp_start() uses it to determine whether or not to
> > > 	allow a one-jiffy delay before starting the process of checking
> > > 	for the end of an SRCU grace period.
> > > 
> > > o	try_check_zero() uses it to add an additional short delay
> > > 	(instead of a long delay) between checks of reader state.
> > > 
> > > o	process_srcu() uses it to calculate the long delay between
> > > 	checks of reader state.
> > > 
> > > These add one-jiffy delays, except for process_srcu(), which adds a delay
> > > of up to 10 jiffies.  Even given HZ=100 (as opposed to the HZ=1000 that
> > > I normally use), this requires thousands of such delays to add up to the
> > > several minutes that you are seeing.  (In theory, the delays could also
> > > be due to SRCU readers, except that in that case adjusting timeouts in
> > > the grace-period processing would not make things go faster.)
> > > 
> > > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods?  If so,
> > > it would be very good make sure that this code is using SRCU efficiently.
> > > One way to check would be to put a printk() into synchronize_srcu(),
> > > though maintaining a counter and printing (say) every 1000th invocation
> > > might be easier on the console output.
> > 
> > good idea.
> 
> Glad you like it.  ;-)
> 
> > > > > > And on 5.18, there is no such problem.
> > > > > > 
> > > > > > After revert this patch, the issue solved.
> > > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > > > > consuming CPU)
> > > > > > 
> > > > > > 
> > > > > > qemu cmd:
> > > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > > > virt,gic-version=3,iommu=smmuv3 \
> > > > > > -enable-kvm -cpu host -m 1024 \
> > > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > > > > -bios QEMU_EFI.fd
> > > > > > 
> > > > > > log:
> > > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > > > > > 0x75459000
> > > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> > > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> > > > > > ProtectUefiImageCommon - 0x7AA4D040
> > > > > >     - 0x0000000075458000 - 0x000000000003F000
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
> > > > > > (0x0000000000004008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
> > > > > > (0x0000000000020008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
> > > > > > (0x0000000000004008)
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> > > > > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> > > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> > > > > > 
> > > > > > 
> > > > > > Not sure it is either reported or solved.
> > > > > This is the first I have heard of it, so thank you for reporting it.
> > > > > 
> > > > > Do you have a way of collecting something sysrq-t output?
> > > > Do you mean "echo t > /proc/sysrq-trigger",
> > > > There are too much output and kernel dump can not stop.
> > > OK.  What other tools do you have to work out what is happening during
> > > temporary hangs such as this one?
> > > 
> > > The question to be answered: "Is there usually at least one task waiting
> > > in synchronize_srcu() during these hangs, and if so, which srcu_struct
> > > is passed to those synchronize_srcu() calls?"
> > 
> > As you suggested, add print in __synchronize_srcu, 1000 times print once.
> > 
> > With acpi=force & -bios QEMU_EFI.fd
> > 
> > When qemu stuck in
> > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > 0x75459000
> > 
> > The print in  __synchronize_srcu is print from 0 t0 9001.
> 
> Now that is what I call a large number of calls!
> 
> > [   94.271350] gzf __synchronize_srcu loop=1001
> > ....
> > 
> > [  222.621659]  __synchronize_srcu loop=9001
> > [  222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted
> > 5.19.0-rc1-15071-g697f40b5235f-dirty #615
> > [  222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS
> > 2280-V2 CS V5.B133.01 03/25/2021
> > [  222.621667] Call trace:
> > [  222.621668]  dump_backtrace+0xe4/0xf0
> > [  222.621670] show_stack+0x20/0x70
> > [  222.621672] dump_stack_lvl+0x8c/0xb8
> > [  222.621674]  dump_stack+0x18/0x34
> > [  222.621676] __synchronize_srcu+0x120/0x128
> > [  222.621678] synchronize_srcu_expedited+0x2c/0x40
> > [  222.621680] kvm_swap_active_memslots+0x130/0x198
> > [  222.621683] kvm_activate_memslot+0x40/0x68
> > [  222.621684]  kvm_set_memslot+0x184/0x3b0
> > [  222.621686]  __kvm_set_memory_region+0x288/0x438
> > [  222.621688] kvm_set_memory_region+0x3c/0x60
> 
> This is KVM setting up one mapping in your IORT RMR memory, correct?
> (As in I/O remapping table reserved memory regions.)
> 
> > [  222.621689]  kvm_vm_ioctl+0x5a0/0x13e0
> 
> And this ioctl() is qemu telling KVM to do so, correct?
> 
> > [  222.621691]  __arm64_sys_ioctl+0xb0/0xf8
> > [  222.621693]  invoke_syscall+0x4c/0x110
> > [  222.621696] el0_svc_common.constprop.0+0x68/0x128
> > [  222.621698]  do_el0_svc+0x34/0xc0
> > [  222.621701]  el0_svc+0x30/0x98
> > [  222.621704]  el0t_64_sync_handler+0xb8/0xc0
> > [  222.621706]  el0t_64_sync+0x18c/0x190
> > 
> > 
> > If no acpi=force, no print at all, 1000 times one print.
> 
> OK, this certainly explains the slowdown, both from adding the IORT RMR
> and from the commit that you bisected to.  The answers to a few questions
> might help illuminate a path towards a fix:
> 
> Do these reserved memory regions really need to be allocated separately?
> (For example, are they really all non-contiguous?  If not, that is, if
> there are a lot of contiguous memory regions, could you sort the IORT
> by address and do one ioctl() for each set of contiguous memory regions?)
> 
> Are all of these reserved memory regions set up before init is spawned?
> 
> Are all of these reserved memory regions set up while there is only a
> single vCPU up and running?
> 
> Is the SRCU grace period really needed in this case?  (I freely confess
> to not being all that familiar with KVM.)

Oh, and there was a similar many-requests problem with networking many
years ago.  This was solved by adding a new syscall/ioctl()/whatever
that permitted many requests to be presented to the kernel with a single
system call.

Could a new ioctl() be introduced that requested a large number
of these memory regions in one go so as to make each call to
synchronize_srcu_expedited() cover a useful fraction of your 9000+
requests?  Adding a few of the KVM guys on CC for their thoughts.

							Thanx, Paul


* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 16:40           ` Paul E. McKenney
@ 2022-06-12 17:29             ` Paolo Bonzini
  2022-06-12 17:47               ` Paolo Bonzini
  2022-06-12 18:49               ` Paul E. McKenney
  0 siblings, 2 replies; 37+ messages in thread
From: Paolo Bonzini @ 2022-06-12 17:29 UTC (permalink / raw)
  To: paulmck, zhangfei.gao
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi,
	mtosatti, sheng.yang

On 6/12/22 18:40, Paul E. McKenney wrote:
>> Do these reserved memory regions really need to be allocated separately?
>> (For example, are they really all non-contiguous?  If not, that is, if
>> there are a lot of contiguous memory regions, could you sort the IORT
>> by address and do one ioctl() for each set of contiguous memory regions?)
>>
>> Are all of these reserved memory regions set up before init is spawned?
>>
>> Are all of these reserved memory regions set up while there is only a
>> single vCPU up and running?
>>
>> Is the SRCU grace period really needed in this case?  (I freely confess
>> to not being all that familiar with KVM.)
> 
> Oh, and there was a similar many-requests problem with networking many
> years ago.  This was solved by adding a new syscall/ioctl()/whatever
> that permitted many requests to be presented to the kernel with a single
> system call.
> 
> Could a new ioctl() be introduced that requested a large number
> of these memory regions in one go so as to make each call to
> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> requests?  Adding a few of the KVM guys on CC for their thoughts.

Unfortunately not.  Apart from this specific case, in general the calls
to KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers
in the guest, and those writes then map to an ioctl.  Typically the guest
sets up one device at a time, and each setup step causes a
synchronize_srcu()---and expedited at that.

KVM has two SRCUs:

1) kvm->irq_srcu hardly relies on the "sleepable" part; it has
readers that are very, very small, but it needs extremely fast detection
of grace periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
KVM_SET_GSI_ROUTING", 2014-05-05) which split it off from kvm->srcu.
Readers are not so frequent.

2) kvm->srcu is nastier because there are readers all the time.  The
read-side critical sections are still short-ish, but they need the
sleepable part because they access user memory.

Writers are not frequent per se; the problem is they come in very large 
bursts when a guest boots.  And while the whole boot path overall can be 
quadratic, O(n) expensive calls to synchronize_srcu() can have a larger 
impact on runtime than the O(n^2) parts, as demonstrated here.
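To make that cost model concrete, here is a rough sketch of the two sides
(heavily simplified from the real KVM code; locking, error handling and the
actual memslot lookup are elided):

#include <linux/kvm_host.h>

/* Reader side (runs all the time): a very short critical section, but it
 * must be sleepable because it may touch user memory. */
static void reader_sketch(struct kvm *kvm, int as_id)
{
	struct kvm_memslots *slots;
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	slots = srcu_dereference(kvm->memslots[as_id], &kvm->srcu);
	/* ... look up a memslot in "slots" and access guest/user memory ... */
	srcu_read_unlock(&kvm->srcu, idx);
}

/* Writer side (bursty at guest boot): publish the new memslots array, then
 * wait for all pre-existing readers before the old array can be retired --
 * this is the synchronize_srcu_expedited() in the backtrace above
 * (kvm_swap_active_memslots). */
static void writer_sketch(struct kvm *kvm, int as_id,
			  struct kvm_memslots *new_slots)
{
	rcu_assign_pointer(kvm->memslots[as_id], new_slots);
	synchronize_srcu_expedited(&kvm->srcu);
	/* the previous memslots array may now be freed */
}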

Therefore, we operated on the assumption that the callers of
synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest
code and the desire was to get past the booting phase as fast as
possible.  If the guest wants to eat host CPU it can "for(;;)" as much
as it wants; therefore, as long as expedited GPs didn't eat CPU
*throughout the whole system*, a preemptible busy wait in
synchronize_srcu_expedited() was not problematic.

These assumptions did match the SRCU code when kvm->srcu and
kvm->irq_srcu were introduced (in 2009 and 2014 respectively).  But
perhaps they do not hold anymore now that each SRCU is not as
independent as it used to be in those years, and uses workqueues
instead?

Thanks,

Paolo



* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 17:29             ` Paolo Bonzini
@ 2022-06-12 17:47               ` Paolo Bonzini
  2022-06-12 18:51                 ` Paul E. McKenney
  2022-06-12 18:49               ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2022-06-12 17:47 UTC (permalink / raw)
  To: paulmck, zhangfei.gao
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi,
	mtosatti, sheng.yang

On 6/12/22 19:29, Paolo Bonzini wrote:
> On 6/12/22 18:40, Paul E. McKenney wrote:
>>> Do these reserved memory regions really need to be allocated separately?
>>> (For example, are they really all non-contiguous?  If not, that is, if
>>> there are a lot of contiguous memory regions, could you sort the IORT
>>> by address and do one ioctl() for each set of contiguous memory 
>>> regions?)
>>>
>>> Are all of these reserved memory regions set up before init is spawned?
>>>
>>> Are all of these reserved memory regions set up while there is only a
>>> single vCPU up and running?
>>>
>>> Is the SRCU grace period really needed in this case?  (I freely confess
>>> to not being all that familiar with KVM.)
>>
>> Oh, and there was a similar many-requests problem with networking many
>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
>> that permitted many requests to be presented to the kernel with a single
>> system call.
>>
>> Could a new ioctl() be introduced that requested a large number
>> of these memory regions in one go so as to make each call to
>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>> requests?  Adding a few of the KVM guys on CC for their thoughts.

Another question: how far can call_srcu() callbacks pile up these days?
I've always been a bit wary of letting userspace do an arbitrary
number of allocations that can only be freed after a grace period, but
perhaps there's a way to query SRCU and apply some backpressure?
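For what it's worth, I am not aware of an exported count of pending
callbacks per srcu_struct; the blunt instrument available today is
srcu_barrier().  A hypothetical caller-side form of backpressure, sketch
only, with made-up names and an unsynchronized counter:

#include <linux/srcu.h>

#define EXAMPLE_BATCH_LIMIT 1000	/* illustrative threshold */

static unsigned long example_pending;

static void example_queue_free(struct srcu_struct *ssp, struct rcu_head *rhp,
			       rcu_callback_t func)
{
	call_srcu(ssp, rhp, func);
	if (++example_pending >= EXAMPLE_BATCH_LIMIT) {
		srcu_barrier(ssp);	/* wait for everything queued so far */
		example_pending = 0;
	}
}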

Paolo



* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 17:29             ` Paolo Bonzini
  2022-06-12 17:47               ` Paolo Bonzini
@ 2022-06-12 18:49               ` Paul E. McKenney
  2022-06-12 19:23                 ` Paolo Bonzini
  2022-06-13  3:04                 ` zhangfei.gao
  1 sibling, 2 replies; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 18:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, sheng.yang

On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> On 6/12/22 18:40, Paul E. McKenney wrote:
> > > Do these reserved memory regions really need to be allocated separately?
> > > (For example, are they really all non-contiguous?  If not, that is, if
> > > there are a lot of contiguous memory regions, could you sort the IORT
> > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > 
> > > Are all of these reserved memory regions set up before init is spawned?
> > > 
> > > Are all of these reserved memory regions set up while there is only a
> > > single vCPU up and running?
> > > 
> > > Is the SRCU grace period really needed in this case?  (I freely confess
> > > to not being all that familiar with KVM.)
> > 
> > Oh, and there was a similar many-requests problem with networking many
> > years ago.  This was solved by adding a new syscall/ioctl()/whatever
> > that permitted many requests to be presented to the kernel with a single
> > system call.
> > 
> > Could a new ioctl() be introduced that requested a large number
> > of these memory regions in one go so as to make each call to
> > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > requests?  Adding a few of the KVM guys on CC for their thoughts.
> 
> Unfortunately not.  Apart from this specific case, in general the calls to
> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> guest, and those writes then map to a ioctl.  Typically the guest sets up a
> device at a time, and each setup step causes a synchronize_srcu()---and
> expedited at that.

I was afraid of something like that...

> KVM has two SRCUs:
> 
> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> that are very very small, but it needs extremely fast detection of grace
> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> not so frequent.
> 
> 2) kvm->srcu is nastier because there are readers all the time.  The
> read-side critical section are still short-ish, but they need the sleepable
> part because they access user memory.

Which one of these two is in play in this case?

> Writers are not frequent per se; the problem is they come in very large
> bursts when a guest boots.  And while the whole boot path overall can be
> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> impact on runtime than the O(n^2) parts, as demonstrated here.
> 
> Therefore, we operated on the assumption that the callers of
> synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code
> and the desire was to get past the booting phase as fast as possible.  If
> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> system*, a preemptable busy wait in synchronize_srcu_expedited() was not
> problematic.
> 
> These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> were introduced (respectively in 2009 and 2014).  But perhaps they do not
> hold anymore now that each SRCU is not as independent as it used to be in
> those years, and instead uses workqueues?

The problem was not internal to SRCU, but rather due to the fact
that kernel live patching (KLP) had problems with the CPU-bound tasks
resulting from repeated synchronize_rcu_expedited() invocations.  So I
added heuristics to get the occasional sleep in there for KLP's benefit.
Perhaps these heuristics need to be less aggressive about adding sleep.

These heuristics have these aspects:

1.	The longer readers persist in an expedited SRCU grace period,
	the longer the wait between successive checks of the reader
	state.  Roughly speaking, we wait as long as the grace period
	has currently been in effect, capped at ten jiffies.

2.	SRCU grace periods have several phases.  We reset so that each
	phase starts by not waiting (new phase, new set of readers,
	so don't penalize this set for the sins of the previous set).
	But once we get to the point of adding delay, we add the
	delay based on the beginning of the full grace period.

Right now, the checking for grace-period length does not allow for the
possibility that a grace period might start just before the jiffies
counter gets incremented (because I didn't realize that anyone cared),
so that is one possible thing to change.  I can also allow more no-delay
checks per SRCU grace-period phase.

Zhangfei, does something like the patch shown below help?

Additional adjustments are likely needed to avoid re-breaking KLP,
but we have to start somewhere...

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019dea..6a354368ac1d1 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
 
 #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
 #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
 #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
 
 /*
@@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  */
 static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
+	unsigned long gpstart;
+	unsigned long j;
 	unsigned long jbase = SRCU_INTERVAL;
 
 	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
 		jbase = 0;
-	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
+	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+		j = jiffies - 1;
+		gpstart = READ_ONCE(ssp->srcu_gp_start);
+		if (time_after(j, gpstart))
+			jbase += j - gpstart;
+	}
 	if (!jbase) {
 		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
 		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 17:47               ` Paolo Bonzini
@ 2022-06-12 18:51                 ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 18:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, sheng.yang

On Sun, Jun 12, 2022 at 07:47:10PM +0200, Paolo Bonzini wrote:
> On 6/12/22 19:29, Paolo Bonzini wrote:
> > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > Do these reserved memory regions really need to be allocated separately?
> > > > (For example, are they really all non-contiguous?  If not, that is, if
> > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > by address and do one ioctl() for each set of contiguous memory
> > > > regions?)
> > > > 
> > > > Are all of these reserved memory regions set up before init is spawned?
> > > > 
> > > > Are all of these reserved memory regions set up while there is only a
> > > > single vCPU up and running?
> > > > 
> > > > Is the SRCU grace period really needed in this case?  (I freely confess
> > > > to not being all that familiar with KVM.)
> > > 
> > > Oh, and there was a similar many-requests problem with networking many
> > > years ago.  This was solved by adding a new syscall/ioctl()/whatever
> > > that permitted many requests to be presented to the kernel with a single
> > > system call.
> > > 
> > > Could a new ioctl() be introduced that requested a large number
> > > of these memory regions in one go so as to make each call to
> > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > requests?  Adding a few of the KVM guys on CC for their thoughts.
> 
> Another question: how many call_srcu() callbacks can pile up these days?
> I've always been a bit wary of letting userspace do an arbitrary number of
> allocations that can only be freed after a grace period, but perhaps there's
> a way to query SRCU and apply some backpressure?

They can pile up as much as ever, especially if you have long-duration
sleeping readers.

But you could do the occasional srcu_barrier() to wait for all the
preceding ones to get done.  Maybe every 1000th call_srcu() or similar?
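
For instance, a minimal and untested sketch of that sort of backpressure;
"struct old_memslots" and the counter are names made up purely for
illustration, not existing KVM fields:

#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/srcu.h>

struct old_memslots {
	struct rcu_head rcu;
	/* ... payload to be freed after an SRCU grace period ... */
};

/* Number of free callbacks queued but not yet invoked. */
static atomic_t nr_pending_srcu_cbs = ATOMIC_INIT(0);

static void old_memslots_free_cb(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct old_memslots, rcu));
	atomic_dec(&nr_pending_srcu_cbs);
}

static void old_memslots_queue_free(struct srcu_struct *ssp, struct old_memslots *p)
{
	atomic_inc(&nr_pending_srcu_cbs);
	call_srcu(ssp, &p->rcu, old_memslots_free_cb);

	/* Roughly every 1000th queued callback, wait for all of the
	 * previously queued callbacks to be invoked.
	 */
	if (atomic_read(&nr_pending_srcu_cbs) >= 1000)
		srcu_barrier(ssp);
}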

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 18:49               ` Paul E. McKenney
@ 2022-06-12 19:23                 ` Paolo Bonzini
  2022-06-12 20:09                   ` Paul E. McKenney
  2022-06-13  3:04                 ` zhangfei.gao
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2022-06-12 19:23 UTC (permalink / raw)
  To: paulmck
  Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, sheng.yang

On 6/12/22 20:49, Paul E. McKenney wrote:
>>
>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>> that are very very small, but it needs extremely fast detection of grace
>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
>> not so frequent.
>>
>> 2) kvm->srcu is nastier because there are readers all the time.  The
>> read-side critical section are still short-ish, but they need the sleepable
>> part because they access user memory.
> 
> Which one of these two is in play in this case?

The latter, kvm->srcu; though at boot time both are hammered on quite a 
bit (and then essentially not at all).

For the one involved here (kvm->srcu) it's still pretty rare for readers to
sleep, but it cannot be excluded.  Most critical sections are short, I'd
guess in the thousands of clock cycles, but I can add some instrumentation
tomorrow (or anyway before Tuesday).
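
The kind of instrumentation I have in mind is roughly the following
(untested sketch; the wrapper is made up and is not the eventual patch):

#include <linux/kvm_host.h>
#include <linux/timekeeping.h>

/* Measure how long one kvm->srcu read-side critical section stays open. */
static void kvm_timed_srcu_reader(struct kvm *kvm)
{
	u64 t0 = ktime_get_ns();
	int idx = srcu_read_lock(&kvm->srcu);

	/* ... the existing read-side work (memslot/user-memory access) ... */

	srcu_read_unlock(&kvm->srcu, idx);
	pr_info_ratelimited("kvm->srcu read side held for %llu ns\n",
			    ktime_get_ns() - t0);
}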

> The problem was not internal to SRCU, but rather due to the fact
> that kernel live patching (KLP) had problems with the CPU-bound tasks
> resulting from repeated synchronize_rcu_expedited() invocations.

I see.  Perhaps only add to the back-to-back counter if
synchronize_srcu_expedited() takes longer than a jiffy? This would
indirectly check whether synchronize_srcu_expedited() readers are actually
blocking.  KVM uses synchronize_srcu_expedited() because it expects it to
take very little time (again I'll get hard numbers asap).
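
Concretely, a caller-side check along these lines (again an untested sketch
with a made-up wrapper name) would show whether an expedited grace period
ever spans a jiffy:

#include <linux/jiffies.h>
#include <linux/srcu.h>

/* Time synchronize_srcu_expedited() and report when it crosses a jiffy. */
static void kvm_timed_synchronize_srcu_expedited(struct srcu_struct *ssp)
{
	unsigned long t0 = jiffies;

	synchronize_srcu_expedited(ssp);
	if (time_after(jiffies, t0))
		pr_info_ratelimited("expedited SRCU GP took %lu jiffies\n",
				    jiffies - t0);
}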

Paolo


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 19:23                 ` Paolo Bonzini
@ 2022-06-12 20:09                   ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 20:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, sheng.yang

On Sun, Jun 12, 2022 at 09:23:14PM +0200, Paolo Bonzini wrote:
> On 6/12/22 20:49, Paul E. McKenney wrote:
> > > 
> > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > that are very very small, but it needs extremely fast detection of grace
> > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> > > not so frequent.
> > > 
> > > 2) kvm->srcu is nastier because there are readers all the time.  The
> > > read-side critical section are still short-ish, but they need the sleepable
> > > part because they access user memory.
> > 
> > Which one of these two is in play in this case?
> 
> The latter, kvm->srcu; though at boot time both are hammered on quite a bit
> (and then essentially not at all).
> 
> For the one involved it's still pretty rare for readers to sleep, but it
> cannot be excluded.  Most critical sections are short, I'd guess in the
> thousands of clock cycles but I can add some instrumentation tomorrow (or
> anyway before Tuesday).

And in any case, readers can be preempted.

> > The problem was not internal to SRCU, but rather due to the fact
> > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > resulting from repeated synchronize_rcu_expedited() invocations.
> 
> I see.  Perhaps only add to the back-to-back counter if
> synchronize_srcu_expedited() takes longer than a jiffy? This would
> indirectly check whether synchronize_srcu_expedited() readers are actually
> blocking.  KVM uses synchronize_srcu_expedited() because it expects it to
> take very little time (again I'll get hard numbers asap).

This is in effect what the patch in my previous email does.  In current
mainline, it waits for up to a jiffy before switching to sleep mode,
but with the patch it waits for between one and two jiffies before making
that switch.

Using call_srcu() with the occasional srcu_barrier() would of course be
faster still, but perhaps more complex.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-12 18:49               ` Paul E. McKenney
  2022-06-12 19:23                 ` Paolo Bonzini
@ 2022-06-13  3:04                 ` zhangfei.gao
  2022-06-13  3:57                   ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-13  3:04 UTC (permalink / raw)
  To: paulmck, Paolo Bonzini
  Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi,
	mtosatti, sheng.yang

Hi, Paul

On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>> Do these reserved memory regions really need to be allocated separately?
>>>> (For example, are they really all non-contiguous?  If not, that is, if
>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>
>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>
>>>> Are all of these reserved memory regions set up while there is only a
>>>> single vCPU up and running?
>>>>
>>>> Is the SRCU grace period really needed in this case?  (I freely confess
>>>> to not being all that familiar with KVM.)
>>> Oh, and there was a similar many-requests problem with networking many
>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
>>> that permitted many requests to be presented to the kernel with a single
>>> system call.
>>>
>>> Could a new ioctl() be introduced that requested a large number
>>> of these memory regions in one go so as to make each call to
>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
>> Unfortunately not.  Apart from this specific case, in general the calls to
>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>> guest, and those writes then map to a ioctl.  Typically the guest sets up a
>> device at a time, and each setup step causes a synchronize_srcu()---and
>> expedited at that.
> I was afraid of something like that...
>
>> KVM has two SRCUs:
>>
>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>> that are very very small, but it needs extremely fast detection of grace
>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
>> not so frequent.
>>
>> 2) kvm->srcu is nastier because there are readers all the time.  The
>> read-side critical section are still short-ish, but they need the sleepable
>> part because they access user memory.
> Which one of these two is in play in this case?
>
>> Writers are not frequent per se; the problem is they come in very large
>> bursts when a guest boots.  And while the whole boot path overall can be
>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>
>> Therefore, we operated on the assumption that the callers of
>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
>> and the desire was to get past the booting phase as fast as possible.  If
>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
>> problematic.
>>
>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>> were was introduced (respectively in 2009 and 2014).  But perhaps they do
>> not hold anymore now that each SRCU is not as independent as it used to be
>> in those years, and instead they use workqueues instead?
> The problem was not internal to SRCU, but rather due to the fact
> that kernel live patching (KLP) had problems with the CPU-bound tasks
> resulting from repeated synchronize_rcu_expedited() invocations.  So I
> added heuristics to get the occasional sleep in there for KLP's benefit.
> Perhaps these heuristics need to be less aggressive about adding sleep.
>
> These heuristics have these aspects:
>
> 1.	The longer readers persist in an expedited SRCU grace period,
> 	the longer the wait between successive checks of the reader
> 	state.  Roughly speaking, we wait as long as the grace period
> 	has currently been in effect, capped at ten jiffies.
>
> 2.	SRCU grace periods have several phases.  We reset so that each
> 	phase starts by not waiting (new phase, new set of readers,
> 	so don't penalize this set for the sins of the previous set).
> 	But once we get to the point of adding delay, we add the
> 	delay based on the beginning of the full grace period.
>
> Right now, the checking for grace-period length does not allow for the
> possibility that a grace period might start just before the jiffies
> counter gets incremented (because I didn't realize that anyone cared),
> so that is one possible thing to change.  I can also allow more no-delay
> checks per SRCU grace-period phase.
>
> Zhangfei, does something like the patch shown below help?
>
> Additional adjustments are likely needed to avoid re-breaking KLP,
> but we have to start somewhere...
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019dea..6a354368ac1d1 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   
>   #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>   #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>   #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>   
>   /*
> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>    */
>   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>   {
> +	unsigned long gpstart;
> +	unsigned long j;
>   	unsigned long jbase = SRCU_INTERVAL;
>   
>   	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>   		jbase = 0;
> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +		j = jiffies - 1;
> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> +		if (time_after(j, gpstart))
> +			jbase += j - gpstart;
> +	}
>   	if (!jbase) {
>   		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>   		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
Unfortunately, this patch does not help.

I then re-added the debug info.

During the qemu boot
[  232.997667]  __synchronize_srcu loop=1000

[  361.094493]  __synchronize_srcu loop=9000
[  361.094501] Call trace:
[  361.094502]  dump_backtrace+0xe4/0xf0
[  361.094505]  show_stack+0x20/0x70
[  361.094507]  dump_stack_lvl+0x8c/0xb8
[  361.094509]  dump_stack+0x18/0x34
[  361.094511]  __synchronize_srcu+0x120/0x128
[  361.094514]  synchronize_srcu_expedited+0x2c/0x40
[  361.094515]  kvm_swap_active_memslots+0x130/0x198
[  361.094519]  kvm_activate_memslot+0x40/0x68
[  361.094520]  kvm_set_memslot+0x2f8/0x3b0
[  361.094523]  __kvm_set_memory_region+0x2e4/0x438
[  361.094524]  kvm_set_memory_region+0x78/0xb8
[  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
[  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
[  361.094530]  invoke_syscall+0x4c/0x110
[  361.094533]  el0_svc_common.constprop.0+0x68/0x128
[  361.094536]  do_el0_svc+0x34/0xc0
[  361.094538]  el0_svc+0x30/0x98
[  361.094541]  el0t_64_sync_handler+0xb8/0xc0
[  361.094544]  el0t_64_sync+0x18c/0x190
[  363.942817]  kvm_set_memory_region loop=6000


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13  3:04                 ` zhangfei.gao
@ 2022-06-13  3:57                   ` Paul E. McKenney
  2022-06-13  4:16                     ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-13  3:57 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, sheng.yang

On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
> Hi, Paul
> 
> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > (For example, are they really all non-contiguous?  If not, that is, if
> > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > 
> > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > 
> > > > > Are all of these reserved memory regions set up while there is only a
> > > > > single vCPU up and running?
> > > > > 
> > > > > Is the SRCU grace period really needed in this case?  (I freely confess
> > > > > to not being all that familiar with KVM.)
> > > > Oh, and there was a similar many-requests problem with networking many
> > > > years ago.  This was solved by adding a new syscall/ioctl()/whatever
> > > > that permitted many requests to be presented to the kernel with a single
> > > > system call.
> > > > 
> > > > Could a new ioctl() be introduced that requested a large number
> > > > of these memory regions in one go so as to make each call to
> > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > requests?  Adding a few of the KVM guys on CC for their thoughts.
> > > Unfortunately not.  Apart from this specific case, in general the calls to
> > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > guest, and those writes then map to a ioctl.  Typically the guest sets up a
> > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > expedited at that.
> > I was afraid of something like that...
> > 
> > > KVM has two SRCUs:
> > > 
> > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > that are very very small, but it needs extremely fast detection of grace
> > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> > > not so frequent.
> > > 
> > > 2) kvm->srcu is nastier because there are readers all the time.  The
> > > read-side critical section are still short-ish, but they need the sleepable
> > > part because they access user memory.
> > Which one of these two is in play in this case?
> > 
> > > Writers are not frequent per se; the problem is they come in very large
> > > bursts when a guest boots.  And while the whole boot path overall can be
> > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > 
> > > Therefore, we operated on the assumption that the callers of
> > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> > > and the desire was to get past the booting phase as fast as possible.  If
> > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> > > problematic.
> > > 
> > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > were was introduced (respectively in 2009 and 2014).  But perhaps they do
> > > not hold anymore now that each SRCU is not as independent as it used to be
> > > in those years, and instead they use workqueues instead?
> > The problem was not internal to SRCU, but rather due to the fact
> > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > resulting from repeated synchronize_rcu_expedited() invocations.  So I
> > added heuristics to get the occasional sleep in there for KLP's benefit.
> > Perhaps these heuristics need to be less aggressive about adding sleep.
> > 
> > These heuristics have these aspects:
> > 
> > 1.	The longer readers persist in an expedited SRCU grace period,
> > 	the longer the wait between successive checks of the reader
> > 	state.  Roughly speaking, we wait as long as the grace period
> > 	has currently been in effect, capped at ten jiffies.
> > 
> > 2.	SRCU grace periods have several phases.  We reset so that each
> > 	phase starts by not waiting (new phase, new set of readers,
> > 	so don't penalize this set for the sins of the previous set).
> > 	But once we get to the point of adding delay, we add the
> > 	delay based on the beginning of the full grace period.
> > 
> > Right now, the checking for grace-period length does not allow for the
> > possibility that a grace period might start just before the jiffies
> > counter gets incremented (because I didn't realize that anyone cared),
> > so that is one possible thing to change.  I can also allow more no-delay
> > checks per SRCU grace-period phase.
> > 
> > Zhangfei, does something like the patch shown below help?
> > 
> > Additional adjustments are likely needed to avoid re-breaking KLP,
> > but we have to start somewhere...
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 50ba70f019dea..6a354368ac1d1 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >   #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> >   #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> >   #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> >   /*
> > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >    */
> >   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> >   {
> > +	unsigned long gpstart;
> > +	unsigned long j;
> >   	unsigned long jbase = SRCU_INTERVAL;
> >   	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> >   		jbase = 0;
> > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > +		j = jiffies - 1;
> > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > +		if (time_after(j, gpstart))
> > +			jbase += j - gpstart;
> > +	}
> >   	if (!jbase) {
> >   		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> >   		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> Unfortunately, this patch does not help.
> 
> I then re-added the debug info.
> 
> During the qemu boot
> [  232.997667]  __synchronize_srcu loop=1000
> 
> [  361.094493]  __synchronize_srcu loop=9000
> [  361.094501] Call trace:
> [  361.094502]  dump_backtrace+0xe4/0xf0
> [  361.094505]  show_stack+0x20/0x70
> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> [  361.094509]  dump_stack+0x18/0x34
> [  361.094511]  __synchronize_srcu+0x120/0x128
> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> [  361.094519]  kvm_activate_memslot+0x40/0x68
> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> [  361.094530]  invoke_syscall+0x4c/0x110
> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> [  361.094536]  do_el0_svc+0x34/0xc0
> [  361.094538]  el0_svc+0x30/0x98
> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> [  361.094544]  el0t_64_sync+0x18c/0x190
> [  363.942817]  kvm_set_memory_region loop=6000

Huh.

One possibility is that the "if (!jbase)" block needs to be nested
within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.

One additional debug is to apply the patch below on top of the one you
just now kindly tested, then use whatever debug technique you wish to
work out what fraction of the time during that critical interval that
srcu_get_delay() returns non-zero.

Other thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13  3:57                   ` Paul E. McKenney
@ 2022-06-13  4:16                     ` Paul E. McKenney
  2022-06-13  6:55                       ` zhangfei.gao
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-13  4:16 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, sheng.yang

On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
> > Hi, Paul
> > 
> > On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > > (For example, are they really all non-contiguous?  If not, that is, if
> > > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > > 
> > > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > > 
> > > > > > Are all of these reserved memory regions set up while there is only a
> > > > > > single vCPU up and running?
> > > > > > 
> > > > > > Is the SRCU grace period really needed in this case?  (I freely confess
> > > > > > to not being all that familiar with KVM.)
> > > > > Oh, and there was a similar many-requests problem with networking many
> > > > > years ago.  This was solved by adding a new syscall/ioctl()/whatever
> > > > > that permitted many requests to be presented to the kernel with a single
> > > > > system call.
> > > > > 
> > > > > Could a new ioctl() be introduced that requested a large number
> > > > > of these memory regions in one go so as to make each call to
> > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > > requests?  Adding a few of the KVM guys on CC for their thoughts.
> > > > Unfortunately not.  Apart from this specific case, in general the calls to
> > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > > guest, and those writes then map to a ioctl.  Typically the guest sets up a
> > > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > > expedited at that.
> > > I was afraid of something like that...
> > > 
> > > > KVM has two SRCUs:
> > > > 
> > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > > that are very very small, but it needs extremely fast detection of grace
> > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> > > > not so frequent.
> > > > 
> > > > 2) kvm->srcu is nastier because there are readers all the time.  The
> > > > read-side critical section are still short-ish, but they need the sleepable
> > > > part because they access user memory.
> > > Which one of these two is in play in this case?
> > > 
> > > > Writers are not frequent per se; the problem is they come in very large
> > > > bursts when a guest boots.  And while the whole boot path overall can be
> > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > > 
> > > > Therefore, we operated on the assumption that the callers of
> > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> > > > and the desire was to get past the booting phase as fast as possible.  If
> > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> > > > problematic.
> > > > 
> > > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > > were was introduced (respectively in 2009 and 2014).  But perhaps they do
> > > > not hold anymore now that each SRCU is not as independent as it used to be
> > > > in those years, and instead they use workqueues instead?
> > > The problem was not internal to SRCU, but rather due to the fact
> > > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > > resulting from repeated synchronize_rcu_expedited() invocations.  So I
> > > added heuristics to get the occasional sleep in there for KLP's benefit.
> > > Perhaps these heuristics need to be less aggressive about adding sleep.
> > > 
> > > These heuristics have these aspects:
> > > 
> > > 1.	The longer readers persist in an expedited SRCU grace period,
> > > 	the longer the wait between successive checks of the reader
> > > 	state.  Roughly speaking, we wait as long as the grace period
> > > 	has currently been in effect, capped at ten jiffies.
> > > 
> > > 2.	SRCU grace periods have several phases.  We reset so that each
> > > 	phase starts by not waiting (new phase, new set of readers,
> > > 	so don't penalize this set for the sins of the previous set).
> > > 	But once we get to the point of adding delay, we add the
> > > 	delay based on the beginning of the full grace period.
> > > 
> > > Right now, the checking for grace-period length does not allow for the
> > > possibility that a grace period might start just before the jiffies
> > > counter gets incremented (because I didn't realize that anyone cared),
> > > so that is one possible thing to change.  I can also allow more no-delay
> > > checks per SRCU grace-period phase.
> > > 
> > > Zhangfei, does something like the patch shown below help?
> > > 
> > > Additional adjustments are likely needed to avoid re-breaking KLP,
> > > but we have to start somewhere...
> > > 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 50ba70f019dea..6a354368ac1d1 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > >   #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> > >   #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> > > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> > > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> > >   #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> > >   /*
> > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > >    */
> > >   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > >   {
> > > +	unsigned long gpstart;
> > > +	unsigned long j;
> > >   	unsigned long jbase = SRCU_INTERVAL;
> > >   	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > >   		jbase = 0;
> > > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > +		j = jiffies - 1;
> > > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > +		if (time_after(j, gpstart))
> > > +			jbase += j - gpstart;
> > > +	}
> > >   	if (!jbase) {
> > >   		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > >   		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > Unfortunately, this patch does not helpful.
> > 
> > Then re-add the debug info.
> > 
> > During the qemu boot
> > [  232.997667]  __synchronize_srcu loop=1000
> > 
> > [  361.094493]  __synchronize_srcu loop=9000
> > [  361.094501] Call trace:
> > [  361.094502]  dump_backtrace+0xe4/0xf0
> > [  361.094505]  show_stack+0x20/0x70
> > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > [  361.094509]  dump_stack+0x18/0x34
> > [  361.094511]  __synchronize_srcu+0x120/0x128
> > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > [  361.094530]  invoke_syscall+0x4c/0x110
> > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > [  361.094536]  do_el0_svc+0x34/0xc0
> > [  361.094538]  el0_svc+0x30/0x98
> > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > [  361.094544]  el0t_64_sync+0x18c/0x190
> > [  363.942817]  kvm_set_memory_region loop=6000
> 
> Huh.
> 
> One possibility is that the "if (!jbase)" block needs to be nested
> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.

And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
above change reduces the overhead by more than an order of magnitude.
Except that the overhead of the series is far less than one second,
not the several minutes that you are seeing.  So the per-call overhead
decreases from about 17 microseconds to a bit more than one microsecond.

I could imagine an extra order of magnitude if you are running HZ=100
instead of the HZ=1000 that I am running.  But that only gets up to a
few seconds.
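
Back-of-the-envelope: 10,000 calls at roughly 17 microseconds each is about
0.17 seconds at HZ=1000, and even an extra order of magnitude at HZ=100
only brings that to a couple of seconds, nowhere near the several minutes
being reported.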

> One additional debug is to apply the patch below on top of the one you
> just now kindly tested, then use whatever debug technique you wish to
> work out what fraction of the time during that critical interval that
> srcu_get_delay() returns non-zero.

So I am very interested in the above debug.  ;-)

Other thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13  4:16                     ` Paul E. McKenney
@ 2022-06-13  6:55                       ` zhangfei.gao
  2022-06-13 12:18                         ` Paul E. McKenney
                                           ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: zhangfei.gao @ 2022-06-13  6:55 UTC (permalink / raw)
  To: paulmck
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, Auger Eric

Hi, Paul

On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
>> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
>>> Hi, Paul
>>>
>>> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>>>>> Do these reserved memory regions really need to be allocated separately?
>>>>>>> (For example, are they really all non-contiguous?  If not, that is, if
>>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>>>>
>>>>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>>>>
>>>>>>> Are all of these reserved memory regions set up while there is only a
>>>>>>> single vCPU up and running?
>>>>>>>
>>>>>>> Is the SRCU grace period really needed in this case?  (I freely confess
>>>>>>> to not being all that familiar with KVM.)
>>>>>> Oh, and there was a similar many-requests problem with networking many
>>>>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
>>>>>> that permitted many requests to be presented to the kernel with a single
>>>>>> system call.
>>>>>>
>>>>>> Could a new ioctl() be introduced that requested a large number
>>>>>> of these memory regions in one go so as to make each call to
>>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>>>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
>>>>> Unfortunately not.  Apart from this specific case, in general the calls to
>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>>>>> guest, and those writes then map to a ioctl.  Typically the guest sets up a
>>>>> device at a time, and each setup step causes a synchronize_srcu()---and
>>>>> expedited at that.
>>>> I was afraid of something like that...
>>>>
>>>>> KVM has two SRCUs:
>>>>>
>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>>>>> that are very very small, but it needs extremely fast detection of grace
>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
>>>>> not so frequent.
>>>>>
>>>>> 2) kvm->srcu is nastier because there are readers all the time.  The
>>>>> read-side critical section are still short-ish, but they need the sleepable
>>>>> part because they access user memory.
>>>> Which one of these two is in play in this case?
>>>>
>>>>> Writers are not frequent per se; the problem is they come in very large
>>>>> bursts when a guest boots.  And while the whole boot path overall can be
>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>>>>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>>>>
>>>>> Therefore, we operated on the assumption that the callers of
>>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
>>>>> and the desire was to get past the booting phase as fast as possible.  If
>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
>>>>> problematic.
>>>>>
>>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>>>>> were was introduced (respectively in 2009 and 2014).  But perhaps they do
>>>>> not hold anymore now that each SRCU is not as independent as it used to be
>>>>> in those years, and instead they use workqueues instead?
>>>> The problem was not internal to SRCU, but rather due to the fact
>>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
>>>> resulting from repeated synchronize_rcu_expedited() invocations.  So I
>>>> added heuristics to get the occasional sleep in there for KLP's benefit.
>>>> Perhaps these heuristics need to be less aggressive about adding sleep.
>>>>
>>>> These heuristics have these aspects:
>>>>
>>>> 1.	The longer readers persist in an expedited SRCU grace period,
>>>> 	the longer the wait between successive checks of the reader
>>>> 	state.  Roughly speaking, we wait as long as the grace period
>>>> 	has currently been in effect, capped at ten jiffies.
>>>>
>>>> 2.	SRCU grace periods have several phases.  We reset so that each
>>>> 	phase starts by not waiting (new phase, new set of readers,
>>>> 	so don't penalize this set for the sins of the previous set).
>>>> 	But once we get to the point of adding delay, we add the
>>>> 	delay based on the beginning of the full grace period.
>>>>
>>>> Right now, the checking for grace-period length does not allow for the
>>>> possibility that a grace period might start just before the jiffies
>>>> counter gets incremented (because I didn't realize that anyone cared),
>>>> so that is one possible thing to change.  I can also allow more no-delay
>>>> checks per SRCU grace-period phase.
>>>>
>>>> Zhangfei, does something like the patch shown below help?
>>>>
>>>> Additional adjustments are likely needed to avoid re-breaking KLP,
>>>> but we have to start somewhere...
>>>>
>>>> 							Thanx, Paul
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>> index 50ba70f019dea..6a354368ac1d1 100644
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>>    #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>>>>    #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
>>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
>>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>>>>    #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>>>>    /*
>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>>     */
>>>>    static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>>>    {
>>>> +	unsigned long gpstart;
>>>> +	unsigned long j;
>>>>    	unsigned long jbase = SRCU_INTERVAL;
>>>>    	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>    		jbase = 0;
>>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>> +		j = jiffies - 1;
>>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>> +		if (time_after(j, gpstart))
>>>> +			jbase += j - gpstart;
>>>> +	}
>>>>    	if (!jbase) {
>>>>    		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>>    		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
>>> Unfortunately, this patch does not helpful.
>>>
>>> Then re-add the debug info.
>>>
>>> During the qemu boot
>>> [  232.997667]  __synchronize_srcu loop=1000
>>>
>>> [  361.094493]  __synchronize_srcu loop=9000
>>> [  361.094501] Call trace:
>>> [  361.094502]  dump_backtrace+0xe4/0xf0
>>> [  361.094505]  show_stack+0x20/0x70
>>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
>>> [  361.094509]  dump_stack+0x18/0x34
>>> [  361.094511]  __synchronize_srcu+0x120/0x128
>>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
>>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
>>> [  361.094519]  kvm_activate_memslot+0x40/0x68
>>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
>>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
>>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
>>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
>>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
>>> [  361.094530]  invoke_syscall+0x4c/0x110
>>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
>>> [  361.094536]  do_el0_svc+0x34/0xc0
>>> [  361.094538]  el0_svc+0x30/0x98
>>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
>>> [  361.094544]  el0t_64_sync+0x18c/0x190
>>> [  363.942817]  kvm_set_memory_region loop=6000
>> Huh.
>>
>> One possibility is that the "if (!jbase)" block needs to be nested
>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.

I tested this diff and it does not help either.

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019de..36286a4b74e6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
 
 #define SRCU_INTERVAL          1       // Base delay if no expedited GPs pending.
 #define SRCU_MAX_INTERVAL      10      // Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE 1       // Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE 3       // Maximum per-GP-phase consecutive no-delay instances.
 #define SRCU_MAX_NODELAY       100     // Maximum consecutive no-delay instances.
 
 /*
@@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  */
 static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
+       unsigned long gpstart;
+       unsigned long j;
        unsigned long jbase = SRCU_INTERVAL;
 
        if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
                jbase = 0;
-       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
-       if (!jbase) {
-               WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
-               if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
-                       jbase = 1;
+       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+               j = jiffies - 1;
+               gpstart = READ_ONCE(ssp->srcu_gp_start);
+               if (time_after(j, gpstart))
+                       jbase += j - gpstart;
+
+               if (!jbase) {
+                       WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
+                               jbase = 1;
+               }
        }

> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> above change reduces the overhead by more than an order of magnitude.
> Except that the overhead of the series is far less than one second,
> not the several minutes that you are seeing.  So the per-call overhead
> decreases from about 17 microseconds to a bit more than one microsecond.
>
> I could imagine an extra order of magnitude if you are running HZ=100
> instead of the HZ=1000 that I am running.  But that only gets up to a
> few seconds.
>
>> One additional debug is to apply the patch below on top of the one you
(Apply which patch below?)
>> just now kindly tested, then use whatever debug technique you wish to
>> work out what fraction of the time during that critical interval that
>> srcu_get_delay() returns non-zero.
Sorry, I am confused; there is no patch below, right?
Should I just measure how often srcu_get_delay() returns non-zero?


By the way, the issue appears to be related only to qemu ACPI, not to the
RMR feature.
Tested with: https://github.com/qemu/qemu/tree/stable-6.1

It looks like it is caused by too many kvm_region_add & kvm_region_del
calls when acpi=force.
If there is no ACPI, the kvm_region_add/del counters (which print once
every 1000 calls) are never printed.

With acpi=force, during qemu boot:
kvm_region_add region_add = 1000
kvm_region_del region_del = 1000
kvm_region_add region_add = 2000
kvm_region_del region_del = 2000
kvm_region_add region_add = 3000
kvm_region_del region_del = 3000
kvm_region_add region_add = 4000
kvm_region_del region_del = 4000
kvm_region_add region_add = 5000
kvm_region_del region_del = 5000
kvm_region_add region_add = 6000
kvm_region_del region_del = 6000

kvm_region_add/kvm_region_del ->
kvm_set_phys_mem->
kvm_set_user_memory_region->
kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
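
The counters above are just a debug hack in those qemu listener callbacks,
roughly like this (sketch only; the actual debug change may differ):

#include <stdio.h>

/* Called from qemu's kvm_region_add()/kvm_region_del() with a per-callback
 * static counter; prints once every 1000 calls, matching the log above.
 */
static void count_region_op(const char *name, unsigned long *counter)
{
    if (++(*counter) % 1000 == 0)
        fprintf(stderr, "%s = %lu\n", name, *counter);
}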

[  361.094493]  __synchronize_srcu loop=9000
[  361.094501] Call trace:
[  361.094502]  dump_backtrace+0xe4/0xf0
[  361.094505]  show_stack+0x20/0x70
[  361.094507]  dump_stack_lvl+0x8c/0xb8
[  361.094509]  dump_stack+0x18/0x34
[  361.094511]  __synchronize_srcu+0x120/0x128
[  361.094514]  synchronize_srcu_expedited+0x2c/0x40
[  361.094515]  kvm_swap_active_memslots+0x130/0x198
[  361.094519]  kvm_activate_memslot+0x40/0x68
[  361.094520]  kvm_set_memslot+0x2f8/0x3b0
[  361.094523]  __kvm_set_memory_region+0x2e4/0x438
[  361.094524]  kvm_set_memory_region+0x78/0xb8
[  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
[  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
[  361.094530]  invoke_syscall+0x4c/0x110
[  361.094533]  el0_svc_common.constprop.0+0x68/0x128
[  361.094536]  do_el0_svc+0x34/0xc0
[  361.094538]  el0_svc+0x30/0x98
[  361.094541]  el0t_64_sync_handler+0xb8/0xc0
[  361.094544]  el0t_64_sync+0x18c/0x190
[  363.942817]  kvm_set_memory_region loop=6000




^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13  6:55                       ` zhangfei.gao
@ 2022-06-13 12:18                         ` Paul E. McKenney
  2022-06-13 13:23                           ` zhangfei.gao
  2022-06-13 15:02                         ` Shameerali Kolothum Thodi
  2022-06-15  8:29                         ` Marc Zyngier
  2 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-13 12:18 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, Auger Eric

On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com wrote:
> Hi, Paul
> 
> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
> > > > Hi, Paul
> > > > 
> > > > On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > > > > (For example, are they really all non-contiguous?  If not, that is, if
> > > > > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > > > > 
> > > > > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > > > > 
> > > > > > > > Are all of these reserved memory regions set up while there is only a
> > > > > > > > single vCPU up and running?
> > > > > > > > 
> > > > > > > > Is the SRCU grace period really needed in this case?  (I freely confess
> > > > > > > > to not being all that familiar with KVM.)
> > > > > > > Oh, and there was a similar many-requests problem with networking many
> > > > > > > years ago.  This was solved by adding a new syscall/ioctl()/whatever
> > > > > > > that permitted many requests to be presented to the kernel with a single
> > > > > > > system call.
> > > > > > > 
> > > > > > > Could a new ioctl() be introduced that requested a large number
> > > > > > > of these memory regions in one go so as to make each call to
> > > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > > > > requests?  Adding a few of the KVM guys on CC for their thoughts.
> > > > > > Unfortunately not.  Apart from this specific case, in general the calls to
> > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > > > > guest, and those writes then map to a ioctl.  Typically the guest sets up a
> > > > > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > > > > expedited at that.
> > > > > I was afraid of something like that...
> > > > > 
> > > > > > KVM has two SRCUs:
> > > > > > 
> > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > > > > that are very very small, but it needs extremely fast detection of grace
> > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> > > > > > not so frequent.
> > > > > > 
> > > > > > 2) kvm->srcu is nastier because there are readers all the time.  The
> > > > > > read-side critical section are still short-ish, but they need the sleepable
> > > > > > part because they access user memory.
> > > > > Which one of these two is in play in this case?
> > > > > 
> > > > > > Writers are not frequent per se; the problem is they come in very large
> > > > > > bursts when a guest boots.  And while the whole boot path overall can be
> > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > > > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > > > > 
> > > > > > Therefore, we operated on the assumption that the callers of
> > > > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> > > > > > and the desire was to get past the booting phase as fast as possible.  If
> > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> > > > > > problematic.
> > > > > > 
> > > > > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > > > > were was introduced (respectively in 2009 and 2014).  But perhaps they do
> > > > > > not hold anymore now that each SRCU is not as independent as it used to be
> > > > > > in those years, and instead they use workqueues instead?
> > > > > The problem was not internal to SRCU, but rather due to the fact
> > > > > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > > > > resulting from repeated synchronize_rcu_expedited() invocations.  So I
> > > > > added heuristics to get the occasional sleep in there for KLP's benefit.
> > > > > Perhaps these heuristics need to be less aggressive about adding sleep.
> > > > > 
> > > > > These heuristics have these aspects:
> > > > > 
> > > > > 1.	The longer readers persist in an expedited SRCU grace period,
> > > > > 	the longer the wait between successive checks of the reader
> > > > > 	state.  Roughly speaking, we wait as long as the grace period
> > > > > 	has currently been in effect, capped at ten jiffies.
> > > > > 
> > > > > 2.	SRCU grace periods have several phases.  We reset so that each
> > > > > 	phase starts by not waiting (new phase, new set of readers,
> > > > > 	so don't penalize this set for the sins of the previous set).
> > > > > 	But once we get to the point of adding delay, we add the
> > > > > 	delay based on the beginning of the full grace period.
> > > > > 
> > > > > Right now, the checking for grace-period length does not allow for the
> > > > > possibility that a grace period might start just before the jiffies
> > > > > counter gets incremented (because I didn't realize that anyone cared),
> > > > > so that is one possible thing to change.  I can also allow more no-delay
> > > > > checks per SRCU grace-period phase.
> > > > > 
> > > > > Zhangfei, does something like the patch shown below help?
> > > > > 
> > > > > Additional adjustments are likely needed to avoid re-breaking KLP,
> > > > > but we have to start somewhere...
> > > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > > ------------------------------------------------------------------------
> > > > > 
> > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > > index 50ba70f019dea..6a354368ac1d1 100644
> > > > > --- a/kernel/rcu/srcutree.c
> > > > > +++ b/kernel/rcu/srcutree.c
> > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > >    #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> > > > >    #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> > > > > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> > > > > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> > > > >    #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> > > > >    /*
> > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > >     */
> > > > >    static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > > >    {
> > > > > +	unsigned long gpstart;
> > > > > +	unsigned long j;
> > > > >    	unsigned long jbase = SRCU_INTERVAL;
> > > > >    	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > >    		jbase = 0;
> > > > > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > > > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > > +		j = jiffies - 1;
> > > > > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > > +		if (time_after(j, gpstart))
> > > > > +			jbase += j - gpstart;
> > > > > +	}
> > > > >    	if (!jbase) {
> > > > >    		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > >    		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > > > Unfortunately, this patch does not helpful.
> > > > 
> > > > Then re-add the debug info.
> > > > 
> > > > During the qemu boot
> > > > [  232.997667]  __synchronize_srcu loop=1000
> > > > 
> > > > [  361.094493]  __synchronize_srcu loop=9000
> > > > [  361.094501] Call trace:
> > > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > > [  361.094505]  show_stack+0x20/0x70
> > > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > > [  361.094509]  dump_stack+0x18/0x34
> > > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > > [  361.094538]  el0_svc+0x30/0x98
> > > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > > [  363.942817]  kvm_set_memory_region loop=6000
> > > Huh.
> > > 
> > > One possibility is that the "if (!jbase)" block needs to be nested
> > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
> 
> I test this diff and NO helpful
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> 
>  #define SRCU_INTERVAL          1       // Base delay if no expedited GPs
> pending.
>  #define SRCU_MAX_INTERVAL      10      // Maximum incremental delay from
> slow readers.
> -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum per-GP-phase consecutive
> no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum per-GP-phase consecutive
> no-delay instances.
>  #define SRCU_MAX_NODELAY       100     // Maximum consecutive no-delay
> instances.
> 
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct
> *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +       unsigned long gpstart;
> +       unsigned long j;
>         unsigned long jbase = SRCU_INTERVAL;
> 
>         if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>                 jbase = 0;
> -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -       if (!jbase) {
> -               WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> SRCU_MAX_NODELAY_PHASE)
> -                       jbase = 1;
> +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +               j = jiffies - 1;
> +               gpstart = READ_ONCE(ssp->srcu_gp_start);
> +               if (time_after(j, gpstart))
> +                       jbase += j - gpstart;
> +
> +               if (!jbase) {
> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> SRCU_MAX_NODELAY_PHASE)
> +                               jbase = 1;
> +               }
>         }

That is in fact what I was intending you to test, thank you.  As you
say, unfortunately it did not help.

Could you please test removing the "if (!jbase)" block entirely?
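
For concreteness (untested on my end, and the trailing clamp written from
memory), with that block gone the function body would look roughly like this:

static unsigned long srcu_get_delay(struct srcu_struct *ssp)
{
	unsigned long gpstart;
	unsigned long j;
	unsigned long jbase = SRCU_INTERVAL;

	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
		jbase = 0;
	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
		j = jiffies - 1;
		gpstart = READ_ONCE(ssp->srcu_gp_start);
		if (time_after(j, gpstart))
			jbase += j - gpstart;
	}
	/* "if (!jbase)" block removed; trailing clamp/return left as is. */
	return jbase > SRCU_MAX_INTERVAL ? SRCU_MAX_INTERVAL : jbase;
}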

> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing.  So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> > 
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running.  But that only gets up to a
> > few seconds.

One possible reason for the difference would be if your code has
SRCU readers.

Could you please tell me the value of CONFIG_HZ on your system?
Also the value of CONFIG_PREEMPTION?

> > > One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> > > just now kindly tested, then use whatever debug technique you wish to
> > > work out what fraction of the time during that critical interval that
> > > srcu_get_delay() returns non-zero.
> Sorry, I am confused, no patch right?

Apologies, my omission.

> Just measure srcu_get_delay return to non-zero?

Exactly, please!
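
Something as crude as a pair of global counters right before the return in
srcu_get_delay() would be enough for that -- an untested sketch (racy, but
fine for a rough ratio), where "jbase" is the delay value about to be
returned:

	static unsigned long n_calls;	/* total srcu_get_delay() invocations */
	static unsigned long n_nonzero;	/* invocations returning non-zero */

	n_calls++;
	if (jbase)
		n_nonzero++;
	if (!(n_calls % 1000))
		pr_info("srcu_get_delay: %lu of %lu calls returned non-zero\n",
			n_nonzero, n_calls);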

> By the way, the issue should be only related with qemu apci. not related
> with rmr feature
> Test with: https://github.com/qemu/qemu/tree/stable-6.1
> 
> Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force,
> If no acpi, no print kvm_region_add/del (1000 times print once)
> 
> If with acpi=force,
> During qemu boot
> kvm_region_add region_add = 1000
> kvm_region_del region_del = 1000
> kvm_region_add region_add = 2000
> kvm_region_del region_del = 2000
> kvm_region_add region_add = 3000
> kvm_region_del region_del = 3000
> kvm_region_add region_add = 4000
> kvm_region_del region_del = 4000
> kvm_region_add region_add = 5000
> kvm_region_del region_del = 5000
> kvm_region_add region_add = 6000
> kvm_region_del region_del = 6000
> 
> kvm_region_add/kvm_region_del ->
> kvm_set_phys_mem->
> kvm_set_user_memory_region->
> kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
> 
> [  361.094493]  __synchronize_srcu loop=9000
> [  361.094501] Call trace:
> [  361.094502]  dump_backtrace+0xe4/0xf0
> [  361.094505]  show_stack+0x20/0x70
> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> [  361.094509]  dump_stack+0x18/0x34
> [  361.094511]  __synchronize_srcu+0x120/0x128
> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> [  361.094519]  kvm_activate_memslot+0x40/0x68
> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> [  361.094530]  invoke_syscall+0x4c/0x110
> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> [  361.094536]  do_el0_svc+0x34/0xc0
> [  361.094538]  el0_svc+0x30/0x98
> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> [  361.094544]  el0t_64_sync+0x18c/0x190
> [  363.942817]  kvm_set_memory_region loop=6000

Good to know, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13 12:18                         ` Paul E. McKenney
@ 2022-06-13 13:23                           ` zhangfei.gao
  2022-06-13 14:59                             ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-13 13:23 UTC (permalink / raw)
  To: paulmck
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, Auger Eric



On 2022/6/13 下午8:18, Paul E. McKenney wrote:
> On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com wrote:
>> Hi, Paul
>>
>> On 2022/6/13 下午12:16, Paul E. McKenney wrote:
>>> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
>>>> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
>>>>> Hi, Paul
>>>>>
>>>>> On 2022/6/13 上午2:49, Paul E. McKenney wrote:
>>>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>>>>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>>>>>>> Do these reserved memory regions really need to be allocated separately?
>>>>>>>>> (For example, are they really all non-contiguous?  If not, that is, if
>>>>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>>>>>>
>>>>>>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>>>>>>
>>>>>>>>> Are all of these reserved memory regions set up while there is only a
>>>>>>>>> single vCPU up and running?
>>>>>>>>>
>>>>>>>>> Is the SRCU grace period really needed in this case?  (I freely confess
>>>>>>>>> to not being all that familiar with KVM.)
>>>>>>>> Oh, and there was a similar many-requests problem with networking many
>>>>>>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
>>>>>>>> that permitted many requests to be presented to the kernel with a single
>>>>>>>> system call.
>>>>>>>>
>>>>>>>> Could a new ioctl() be introduced that requested a large number
>>>>>>>> of these memory regions in one go so as to make each call to
>>>>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>>>>>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
>>>>>>> Unfortunately not.  Apart from this specific case, in general the calls to
>>>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>>>>>>> guest, and those writes then map to a ioctl.  Typically the guest sets up a
>>>>>>> device at a time, and each setup step causes a synchronize_srcu()---and
>>>>>>> expedited at that.
>>>>>> I was afraid of something like that...
>>>>>>
>>>>>>> KVM has two SRCUs:
>>>>>>>
>>>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>>>>>>> that are very very small, but it needs extremely fast detection of grace
>>>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>>>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
>>>>>>> not so frequent.
>>>>>>>
>>>>>>> 2) kvm->srcu is nastier because there are readers all the time.  The
>>>>>>> read-side critical section are still short-ish, but they need the sleepable
>>>>>>> part because they access user memory.
>>>>>> Which one of these two is in play in this case?
>>>>>>
>>>>>>> Writers are not frequent per se; the problem is they come in very large
>>>>>>> bursts when a guest boots.  And while the whole boot path overall can be
>>>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>>>>>>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>>>>>>
>>>>>>> Therefore, we operated on the assumption that the callers of
>>>>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
>>>>>>> and the desire was to get past the booting phase as fast as possible.  If
>>>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>>>>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>>>>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
>>>>>>> problematic.
>>>>>>>
>>>>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>>>>>>> were was introduced (respectively in 2009 and 2014).  But perhaps they do
>>>>>>> not hold anymore now that each SRCU is not as independent as it used to be
>>>>>>> in those years, and instead they use workqueues instead?
>>>>>> The problem was not internal to SRCU, but rather due to the fact
>>>>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
>>>>>> resulting from repeated synchronize_rcu_expedited() invocations.  So I
>>>>>> added heuristics to get the occasional sleep in there for KLP's benefit.
>>>>>> Perhaps these heuristics need to be less aggressive about adding sleep.
>>>>>>
>>>>>> These heuristics have these aspects:
>>>>>>
>>>>>> 1.	The longer readers persist in an expedited SRCU grace period,
>>>>>> 	the longer the wait between successive checks of the reader
>>>>>> 	state.  Roughly speaking, we wait as long as the grace period
>>>>>> 	has currently been in effect, capped at ten jiffies.
>>>>>>
>>>>>> 2.	SRCU grace periods have several phases.  We reset so that each
>>>>>> 	phase starts by not waiting (new phase, new set of readers,
>>>>>> 	so don't penalize this set for the sins of the previous set).
>>>>>> 	But once we get to the point of adding delay, we add the
>>>>>> 	delay based on the beginning of the full grace period.
>>>>>>
>>>>>> Right now, the checking for grace-period length does not allow for the
>>>>>> possibility that a grace period might start just before the jiffies
>>>>>> counter gets incremented (because I didn't realize that anyone cared),
>>>>>> so that is one possible thing to change.  I can also allow more no-delay
>>>>>> checks per SRCU grace-period phase.
>>>>>>
>>>>>> Zhangfei, does something like the patch shown below help?
>>>>>>
>>>>>> Additional adjustments are likely needed to avoid re-breaking KLP,
>>>>>> but we have to start somewhere...
>>>>>>
>>>>>> 							Thanx, Paul
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>>>> index 50ba70f019dea..6a354368ac1d1 100644
>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>>>>     #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>>>>>>     #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
>>>>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
>>>>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>>>>>>     #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>>>>>>     /*
>>>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>>>>      */
>>>>>>     static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>>>>>     {
>>>>>> +	unsigned long gpstart;
>>>>>> +	unsigned long j;
>>>>>>     	unsigned long jbase = SRCU_INTERVAL;
>>>>>>     	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>>     		jbase = 0;
>>>>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>>>>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>>>> +		j = jiffies - 1;
>>>>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>>>> +		if (time_after(j, gpstart))
>>>>>> +			jbase += j - gpstart;
>>>>>> +	}
>>>>>>     	if (!jbase) {
>>>>>>     		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>>>>     		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
>>>>> Unfortunately, this patch does not helpful.
>>>>>
>>>>> Then re-add the debug info.
>>>>>
>>>>> During the qemu boot
>>>>> [  232.997667]  __synchronize_srcu loop=1000
>>>>>
>>>>> [  361.094493]  __synchronize_srcu loop=9000
>>>>> [  361.094501] Call trace:
>>>>> [  361.094502]  dump_backtrace+0xe4/0xf0
>>>>> [  361.094505]  show_stack+0x20/0x70
>>>>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
>>>>> [  361.094509]  dump_stack+0x18/0x34
>>>>> [  361.094511]  __synchronize_srcu+0x120/0x128
>>>>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
>>>>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
>>>>> [  361.094519]  kvm_activate_memslot+0x40/0x68
>>>>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
>>>>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
>>>>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
>>>>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
>>>>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
>>>>> [  361.094530]  invoke_syscall+0x4c/0x110
>>>>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
>>>>> [  361.094536]  do_el0_svc+0x34/0xc0
>>>>> [  361.094538]  el0_svc+0x30/0x98
>>>>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
>>>>> [  361.094544]  el0t_64_sync+0x18c/0x190
>>>>> [  363.942817]  kvm_set_memory_region loop=6000
>>>> Huh.
>>>>
>>>> One possibility is that the "if (!jbase)" block needs to be nested
>>>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
>> I test this diff and NO helpful
>>
>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>> index 50ba70f019de..36286a4b74e6 100644
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>
>>   #define SRCU_INTERVAL          1       // Base delay if no expedited GPs
>> pending.
>>   #define SRCU_MAX_INTERVAL      10      // Maximum incremental delay from
>> slow readers.
>> -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum per-GP-phase consecutive
>> no-delay instances.
>> +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum per-GP-phase consecutive
>> no-delay instances.
>>   #define SRCU_MAX_NODELAY       100     // Maximum consecutive no-delay
>> instances.
>>
>>   /*
>> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct
>> *ssp)
>>    */
>>   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>   {
>> +       unsigned long gpstart;
>> +       unsigned long j;
>>          unsigned long jbase = SRCU_INTERVAL;
>>
>>          if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>                  jbase = 0;
>> -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>> -               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>> -       if (!jbase) {
>> -               WRITE_ONCE(ssp->srcu_n_exp_nodelay,
>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>> -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
>> SRCU_MAX_NODELAY_PHASE)
>> -                       jbase = 1;
>> +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>> +               j = jiffies - 1;
>> +               gpstart = READ_ONCE(ssp->srcu_gp_start);
>> +               if (time_after(j, gpstart))
>> +                       jbase += j - gpstart;
>> +
>> +               if (!jbase) {
>> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay,
>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>> +                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
>> SRCU_MAX_NODELAY_PHASE)
>> +                               jbase = 1;
>> +               }
>>          }
> That is in fact what I was intending you to test, thank you.  As you
> say, unfortunately it did not help.
>
> Could you please test removing the "if (!jbase)" block entirely?
Remove "if (!jbase)" block is much faster,
not measure clearly, qemu (with debug version efi) boot seems normally.

From the log timestamps:
[  114.624713] __synchronize_srcu loop=1000
[  124.157011]  __synchronize_srcu loop=9000

For comparison, the timestamps for the several methods differ:

5.19-rc1
[   94.271350]  __synchronize_srcu loop=1001
[  222.621659]  __synchronize_srcu loop=9001


With your first diff:
[  232.997667]  __synchronize_srcu loop=1000
[  361.094493]  __synchronize_srcu loop=9000

Remove "if (!jbase)" block
[  114.624713] __synchronize_srcu loop=1000
[  124.157011]  __synchronize_srcu loop=9000


5.18 method
+       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+               return 0;
+       return SRCU_INTERVAL;

[   74.598480] __synchronize_srcu loop=9000
[   68.938297] __synchronize_srcu loop=1000

>
>>> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
>>> above change reduces the overhead by more than an order of magnitude.
>>> Except that the overhead of the series is far less than one second,
>>> not the several minutes that you are seeing.  So the per-call overhead
>>> decreases from about 17 microseconds to a bit more than one microsecond.
>>>
>>> I could imagine an extra order of magnitude if you are running HZ=100
>>> instead of the HZ=1000 that I am running.  But that only gets up to a
>>> few seconds.
> One possible reason for the difference would be if your code has
> SRCU readers.
>
> Could you please tell me the value of CONFIG_HZ on your system?
> Also the value of CONFIG_PREEMPTION?
I am using arch/arm64/configs/defconfig
make defconfig
CONFIG_PREEMPTION=y
CONFIG_HZ_250=y
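
For what it is worth, at HZ=250 one jiffy is 4 ms, so by my rough arithmetic
on the timestamps above (not a separate measurement): 5.19-rc1 works out to
about 128 s / 8000 calls, i.e. ~16 ms or ~4 jiffies per
synchronize_srcu_expedited(); with the "if (!jbase)" block removed it is
about 9.5 s / 8000 ~ 1.2 ms per call; and the 5.18 method about
5.7 s / 8000 ~ 0.7 ms per call.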

Thanks

>
>>>> One additional debug is to apply the patch below on top of the one you
>> apply the patch below?
>>>> just now kindly tested, then use whatever debug technique you wish to
>>>> work out what fraction of the time during that critical interval that
>>>> srcu_get_delay() returns non-zero.
>> Sorry, I am confused, no patch right?
> Apologies, my omission.
>
>> Just measure srcu_get_delay return to non-zero?
> Exactly, please!
>
>> By the way, the issue should be only related with qemu apci. not related
>> with rmr feature
>> Test with: https://github.com/qemu/qemu/tree/stable-6.1
>>
>> Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force,
>> If no acpi, no print kvm_region_add/del (1000 times print once)
>>
>> If with acpi=force,
>> During qemu boot
>> kvm_region_add region_add = 1000
>> kvm_region_del region_del = 1000
>> kvm_region_add region_add = 2000
>> kvm_region_del region_del = 2000
>> kvm_region_add region_add = 3000
>> kvm_region_del region_del = 3000
>> kvm_region_add region_add = 4000
>> kvm_region_del region_del = 4000
>> kvm_region_add region_add = 5000
>> kvm_region_del region_del = 5000
>> kvm_region_add region_add = 6000
>> kvm_region_del region_del = 6000
>>
>> kvm_region_add/kvm_region_del ->
>> kvm_set_phys_mem->
>> kvm_set_user_memory_region->
>> kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
>>
>> [  361.094493]  __synchronize_srcu loop=9000
>> [  361.094501] Call trace:
>> [  361.094502]  dump_backtrace+0xe4/0xf0
>> [  361.094505]  show_stack+0x20/0x70
>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
>> [  361.094509]  dump_stack+0x18/0x34
>> [  361.094511]  __synchronize_srcu+0x120/0x128
>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
>> [  361.094519]  kvm_activate_memslot+0x40/0x68
>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
>> [  361.094530]  invoke_syscall+0x4c/0x110
>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
>> [  361.094536]  do_el0_svc+0x34/0xc0
>> [  361.094538]  el0_svc+0x30/0x98
>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
>> [  361.094544]  el0t_64_sync+0x18c/0x190
>> [  363.942817]  kvm_set_memory_region loop=6000
> Good to know, thank you!
>
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13 13:23                           ` zhangfei.gao
@ 2022-06-13 14:59                             ` Paul E. McKenney
  2022-06-13 20:55                               ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-13 14:59 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, Auger Eric

On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/6/13 下午8:18, Paul E. McKenney wrote:
> > On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com wrote:
> > > Hi, Paul
> > > 
> > > On 2022/6/13 下午12:16, Paul E. McKenney wrote:
> > > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
> > > > > > Hi, Paul
> > > > > > 
> > > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> > > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > > > > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > > > > > > (For example, are they really all non-contiguous?  If not, that is, if
> > > > > > > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > > > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > > > > > > 
> > > > > > > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > > > > > > 
> > > > > > > > > > Are all of these reserved memory regions set up while there is only a
> > > > > > > > > > single vCPU up and running?
> > > > > > > > > > 
> > > > > > > > > > Is the SRCU grace period really needed in this case?  (I freely confess
> > > > > > > > > > to not being all that familiar with KVM.)
> > > > > > > > > Oh, and there was a similar many-requests problem with networking many
> > > > > > > > > years ago.  This was solved by adding a new syscall/ioctl()/whatever
> > > > > > > > > that permitted many requests to be presented to the kernel with a single
> > > > > > > > > system call.
> > > > > > > > > 
> > > > > > > > > Could a new ioctl() be introduced that requested a large number
> > > > > > > > > of these memory regions in one go so as to make each call to
> > > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > > > > > > requests?  Adding a few of the KVM guys on CC for their thoughts.
> > > > > > > > Unfortunately not.  Apart from this specific case, in general the calls to
> > > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > > > > > > guest, and those writes then map to a ioctl.  Typically the guest sets up a
> > > > > > > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > > > > > > expedited at that.
> > > > > > > I was afraid of something like that...
> > > > > > > 
> > > > > > > > KVM has two SRCUs:
> > > > > > > > 
> > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > > > > > > that are very very small, but it needs extremely fast detection of grace
> > > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> > > > > > > > not so frequent.
> > > > > > > > 
> > > > > > > > 2) kvm->srcu is nastier because there are readers all the time.  The
> > > > > > > > read-side critical section are still short-ish, but they need the sleepable
> > > > > > > > part because they access user memory.
> > > > > > > Which one of these two is in play in this case?
> > > > > > > 
> > > > > > > > Writers are not frequent per se; the problem is they come in very large
> > > > > > > > bursts when a guest boots.  And while the whole boot path overall can be
> > > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > > > > > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > > > > > > 
> > > > > > > > Therefore, we operated on the assumption that the callers of
> > > > > > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> > > > > > > > and the desire was to get past the booting phase as fast as possible.  If
> > > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > > > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > > > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> > > > > > > > problematic.
> > > > > > > > 
> > > > > > > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > > > > > > were was introduced (respectively in 2009 and 2014).  But perhaps they do
> > > > > > > > not hold anymore now that each SRCU is not as independent as it used to be
> > > > > > > > in those years, and instead they use workqueues instead?
> > > > > > > The problem was not internal to SRCU, but rather due to the fact
> > > > > > > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > > > > > > resulting from repeated synchronize_rcu_expedited() invocations.  So I
> > > > > > > added heuristics to get the occasional sleep in there for KLP's benefit.
> > > > > > > Perhaps these heuristics need to be less aggressive about adding sleep.
> > > > > > > 
> > > > > > > These heuristics have these aspects:
> > > > > > > 
> > > > > > > 1.	The longer readers persist in an expedited SRCU grace period,
> > > > > > > 	the longer the wait between successive checks of the reader
> > > > > > > 	state.  Roughly speaking, we wait as long as the grace period
> > > > > > > 	has currently been in effect, capped at ten jiffies.
> > > > > > > 
> > > > > > > 2.	SRCU grace periods have several phases.  We reset so that each
> > > > > > > 	phase starts by not waiting (new phase, new set of readers,
> > > > > > > 	so don't penalize this set for the sins of the previous set).
> > > > > > > 	But once we get to the point of adding delay, we add the
> > > > > > > 	delay based on the beginning of the full grace period.
> > > > > > > 
> > > > > > > Right now, the checking for grace-period length does not allow for the
> > > > > > > possibility that a grace period might start just before the jiffies
> > > > > > > counter gets incremented (because I didn't realize that anyone cared),
> > > > > > > so that is one possible thing to change.  I can also allow more no-delay
> > > > > > > checks per SRCU grace-period phase.
> > > > > > > 
> > > > > > > Zhangfei, does something like the patch shown below help?
> > > > > > > 
> > > > > > > Additional adjustments are likely needed to avoid re-breaking KLP,
> > > > > > > but we have to start somewhere...
> > > > > > > 
> > > > > > > 							Thanx, Paul
> > > > > > > 
> > > > > > > ------------------------------------------------------------------------
> > > > > > > 
> > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > > > > index 50ba70f019dea..6a354368ac1d1 100644
> > > > > > > --- a/kernel/rcu/srcutree.c
> > > > > > > +++ b/kernel/rcu/srcutree.c
> > > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > > > >     #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> > > > > > >     #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> > > > > > > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> > > > > > > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> > > > > > >     #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> > > > > > >     /*
> > > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > > > >      */
> > > > > > >     static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > > > > >     {
> > > > > > > +	unsigned long gpstart;
> > > > > > > +	unsigned long j;
> > > > > > >     	unsigned long jbase = SRCU_INTERVAL;
> > > > > > >     	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > > > >     		jbase = 0;
> > > > > > > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > > > > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > > > > > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > > > > +		j = jiffies - 1;
> > > > > > > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > > > > +		if (time_after(j, gpstart))
> > > > > > > +			jbase += j - gpstart;
> > > > > > > +	}
> > > > > > >     	if (!jbase) {
> > > > > > >     		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > > > >     		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > > > > > Unfortunately, this patch does not helpful.
> > > > > > 
> > > > > > Then re-add the debug info.
> > > > > > 
> > > > > > During the qemu boot
> > > > > > [  232.997667]  __synchronize_srcu loop=1000
> > > > > > 
> > > > > > [  361.094493]  __synchronize_srcu loop=9000
> > > > > > [  361.094501] Call trace:
> > > > > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > > > > [  361.094505]  show_stack+0x20/0x70
> > > > > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > > > > [  361.094509]  dump_stack+0x18/0x34
> > > > > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > > > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > > > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > > > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > > > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > > > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > > > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > > > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > > > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > > > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > > > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > > > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > > > > [  361.094538]  el0_svc+0x30/0x98
> > > > > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > > > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > > > > [  363.942817]  kvm_set_memory_region loop=6000
> > > > > Huh.
> > > > > 
> > > > > One possibility is that the "if (!jbase)" block needs to be nested
> > > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
> > > I test this diff and NO helpful
> > > 
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 50ba70f019de..36286a4b74e6 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > 
> > >   #define SRCU_INTERVAL          1       // Base delay if no expedited GPs
> > > pending.
> > >   #define SRCU_MAX_INTERVAL      10      // Maximum incremental delay from
> > > slow readers.
> > > -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum per-GP-phase consecutive
> > > no-delay instances.
> > > +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum per-GP-phase consecutive
> > > no-delay instances.
> > >   #define SRCU_MAX_NODELAY       100     // Maximum consecutive no-delay
> > > instances.
> > > 
> > >   /*
> > > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct
> > > *ssp)
> > >    */
> > >   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > >   {
> > > +       unsigned long gpstart;
> > > +       unsigned long j;
> > >          unsigned long jbase = SRCU_INTERVAL;
> > > 
> > >          if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > >                  jbase = 0;
> > > -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > -               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > -       if (!jbase) {
> > > -               WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> > > SRCU_MAX_NODELAY_PHASE)
> > > -                       jbase = 1;
> > > +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > +               j = jiffies - 1;
> > > +               gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > +               if (time_after(j, gpstart))
> > > +                       jbase += j - gpstart;
> > > +
> > > +               if (!jbase) {
> > > +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > +                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> > > SRCU_MAX_NODELAY_PHASE)
> > > +                               jbase = 1;
> > > +               }
> > >          }
> > That is in fact what I was intending you to test, thank you.  As you
> > say, unfortunately it did not help.
> > 
> > Could you please test removing the "if (!jbase)" block entirely?
> Remove "if (!jbase)" block is much faster,
> not measure clearly, qemu (with debug version efi) boot seems normally.
> 
> From the log timestamps:
> [  114.624713] __synchronize_srcu loop=1000
> [  124.157011]  __synchronize_srcu loop=9000
> 
> For comparison, the timestamps for the several methods differ:
> 
> 5.19-rc1
> [   94.271350]  __synchronize_srcu loop=1001
> [  222.621659]  __synchronize_srcu loop=9001
> 
> 
> With your first diff:
> [  232.997667]  __synchronize_srcu loop=1000
> [  361.094493]  __synchronize_srcu loop=9000
> 
> Remove "if (!jbase)" block
> [  114.624713] __synchronize_srcu loop=1000
> [  124.157011]  __synchronize_srcu loop=9000
> 
> 
> 5.18 method
> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> +               return 0;
> +       return SRCU_INTERVAL;
> 
> [   74.598480] __synchronize_srcu loop=9000
> [   68.938297] __synchronize_srcu loop=1000

Thank you for the information!

What happens if you keep that "if (!jbase)" block, but set the
value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?

This would be too large for KLP, but my hope is that there is a value
of SRCU_MAX_NODELAY_PHASE that works for everyone.  But first, does
this help at all?  ;-)
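
That is, on top of the earlier diff, just a one-line change along these
lines:

-#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE	1000000	// Maximum per-GP-phase consecutive no-delay instances.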

> > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > > > above change reduces the overhead by more than an order of magnitude.
> > > > Except that the overhead of the series is far less than one second,
> > > > not the several minutes that you are seeing.  So the per-call overhead
> > > > decreases from about 17 microseconds to a bit more than one microsecond.
> > > > 
> > > > I could imagine an extra order of magnitude if you are running HZ=100
> > > > instead of the HZ=1000 that I am running.  But that only gets up to a
> > > > few seconds.
> > One possible reason for the difference would be if your code has
> > SRCU readers.
> > 
> > Could you please tell me the value of CONFIG_HZ on your system?
> > Also the value of CONFIG_PREEMPTION?
> I am using arch/arm64/configs/defconfig
> make defconfig
> CONFIG_PREEMPTION=y
> CONFIG_HZ_250=y

Thank you again!

And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might depend
on HZ.  And who knows what all else...

							Thanx, Paul

> Thanks
> 
> > 
> > > > > One additional debug is to apply the patch below on top of the one you
> > > apply the patch below?
> > > > > just now kindly tested, then use whatever debug technique you wish to
> > > > > work out what fraction of the time during that critical interval that
> > > > > srcu_get_delay() returns non-zero.
> > > Sorry, I am confused, no patch right?
> > Apologies, my omission.
> > 
> > > Just measure srcu_get_delay return to non-zero?
> > Exactly, please!
> > 
> > > By the way, the issue should be only related with qemu apci. not related
> > > with rmr feature
> > > Test with: https://github.com/qemu/qemu/tree/stable-6.1
> > > 
> > > Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force,
> > > If no acpi, no print kvm_region_add/del (1000 times print once)
> > > 
> > > If with acpi=force,
> > > During qemu boot
> > > kvm_region_add region_add = 1000
> > > kvm_region_del region_del = 1000
> > > kvm_region_add region_add = 2000
> > > kvm_region_del region_del = 2000
> > > kvm_region_add region_add = 3000
> > > kvm_region_del region_del = 3000
> > > kvm_region_add region_add = 4000
> > > kvm_region_del region_del = 4000
> > > kvm_region_add region_add = 5000
> > > kvm_region_del region_del = 5000
> > > kvm_region_add region_add = 6000
> > > kvm_region_del region_del = 6000
> > > 
> > > kvm_region_add/kvm_region_del ->
> > > kvm_set_phys_mem->
> > > kvm_set_user_memory_region->
> > > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
> > > 
> > > [  361.094493]  __synchronize_srcu loop=9000
> > > [  361.094501] Call trace:
> > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > [  361.094505]  show_stack+0x20/0x70
> > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > [  361.094509]  dump_stack+0x18/0x34
> > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > [  361.094538]  el0_svc+0x30/0x98
> > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > [  363.942817]  kvm_set_memory_region loop=6000
> > Good to know, thank you!
> > 
> > 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13  6:55                       ` zhangfei.gao
  2022-06-13 12:18                         ` Paul E. McKenney
@ 2022-06-13 15:02                         ` Shameerali Kolothum Thodi
  2022-06-15  8:38                           ` Marc Zyngier
  2022-06-15  8:29                         ` Marc Zyngier
  2 siblings, 1 reply; 37+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-06-13 15:02 UTC (permalink / raw)
  To: zhangfei.gao, paulmck
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)



> -----Original Message-----
> From: zhangfei.gao@foxmail.com [mailto:zhangfei.gao@foxmail.com]
> Sent: 13 June 2022 07:56
> To: paulmck@kernel.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao
> <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org;
> rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett
> <josh@joshtriplett.org>; Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com>
> Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
> blocking readers from consuming CPU) cause qemu boot slow
> 
 
> By the way, the issue should be only related with qemu apci. not related
> with rmr feature
> Test with: https://github.com/qemu/qemu/tree/stable-6.1
> 
> Looks it caused by too many kvm_region_add & kvm_region_del if
> acpi=force,

Based on the setup I have, I think it has nothing to do with the Guest kernel booting with
ACPI per se (i.e., acpi=force on the Qemu kernel cmd line).
It is more to do with Qemu having the "-bios QEMU_EFI.fd", which sets up
pflash devices, resulting in a large number of pflash read/write calls (before the Guest kernel
even boots) which in turn seem to be triggering the kvm_region_add/del calls below.
 
Thanks,
Shameer

> If no acpi, no print kvm_region_add/del (1000 times print once)
> 
> If with acpi=force,
> During qemu boot
> kvm_region_add region_add = 1000
> kvm_region_del region_del = 1000
> kvm_region_add region_add = 2000
> kvm_region_del region_del = 2000
> kvm_region_add region_add = 3000
> kvm_region_del region_del = 3000
> kvm_region_add region_add = 4000
> kvm_region_del region_del = 4000
> kvm_region_add region_add = 5000
> kvm_region_del region_del = 5000
> kvm_region_add region_add = 6000
> kvm_region_del region_del = 6000
> 
> kvm_region_add/kvm_region_del ->
> kvm_set_phys_mem->
> kvm_set_user_memory_region->
> kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
> 
> [  361.094493]  __synchronize_srcu loop=9000
> [  361.094501] Call trace:
> [  361.094502]  dump_backtrace+0xe4/0xf0
> [  361.094505]  show_stack+0x20/0x70
> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> [  361.094509]  dump_stack+0x18/0x34
> [  361.094511]  __synchronize_srcu+0x120/0x128
> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> [  361.094519]  kvm_activate_memslot+0x40/0x68
> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> [  361.094530]  invoke_syscall+0x4c/0x110
> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> [  361.094536]  do_el0_svc+0x34/0xc0
> [  361.094538]  el0_svc+0x30/0x98
> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> [  361.094544]  el0t_64_sync+0x18c/0x190
> [  363.942817]  kvm_set_memory_region loop=6000
> 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13 14:59                             ` Paul E. McKenney
@ 2022-06-13 20:55                               ` Shameerali Kolothum Thodi
  2022-06-14 12:19                                 ` Neeraj Upadhyay
  0 siblings, 1 reply; 37+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-06-13 20:55 UTC (permalink / raw)
  To: paulmck, zhangfei.gao
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)



> -----Original Message-----
> From: Paul E. McKenney [mailto:paulmck@kernel.org]
> Sent: 13 June 2022 15:59
> To: zhangfei.gao@foxmail.com
> Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao
> <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org;
> rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett
> <josh@joshtriplett.org>; Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com>
> Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
> blocking readers from consuming CPU) cause qemu boot slow
> 
> On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@foxmail.com
> wrote:
> >
> >
> > On 2022/6/13 下午8:18, Paul E. McKenney wrote:
> > > On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com
> wrote:
> > > > Hi, Paul
> > > >
> > > > On 2022/6/13 下午12:16, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800,
> zhangfei.gao@foxmail.com wrote:
> > > > > > > Hi, Paul
> > > > > > >
> > > > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> > > > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini
> wrote:
> > > > > > > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > > > > > > Do these reserved memory regions really need to be
> allocated separately?
> > > > > > > > > > > (For example, are they really all non-contiguous?  If not,
> that is, if
> > > > > > > > > > > there are a lot of contiguous memory regions, could you
> sort the IORT
> > > > > > > > > > > by address and do one ioctl() for each set of contiguous
> memory regions?)
> > > > > > > > > > >
> > > > > > > > > > > Are all of these reserved memory regions set up before init
> is spawned?
> > > > > > > > > > >
> > > > > > > > > > > Are all of these reserved memory regions set up while
> there is only a
> > > > > > > > > > > single vCPU up and running?
> > > > > > > > > > >
> > > > > > > > > > > Is the SRCU grace period really needed in this case?  (I
> freely confess
> > > > > > > > > > > to not being all that familiar with KVM.)
> > > > > > > > > > Oh, and there was a similar many-requests problem with
> networking many
> > > > > > > > > > years ago.  This was solved by adding a new
> syscall/ioctl()/whatever
> > > > > > > > > > that permitted many requests to be presented to the kernel
> with a single
> > > > > > > > > > system call.
> > > > > > > > > >
> > > > > > > > > > Could a new ioctl() be introduced that requested a large
> number
> > > > > > > > > > of these memory regions in one go so as to make each call to
> > > > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your
> 9000+
> > > > > > > > > > requests?  Adding a few of the KVM guys on CC for their
> thoughts.
> > > > > > > > > Unfortunately not.  Apart from this specific case, in general
> the calls to
> > > > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to
> I/O registers in the
> > > > > > > > > guest, and those writes then map to a ioctl.  Typically the
> guest sets up a
> > > > > > > > > device at a time, and each setup step causes a
> synchronize_srcu()---and
> > > > > > > > > expedited at that.
> > > > > > > > I was afraid of something like that...
> > > > > > > >
> > > > > > > > > KVM has two SRCUs:
> > > > > > > > >
> > > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it
> has readers
> > > > > > > > > that are very very small, but it needs extremely fast detection
> of grace
> > > > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off
> kvm->srcu.  Readers are
> > > > > > > > > not so frequent.
> > > > > > > > >
> > > > > > > > > 2) kvm->srcu is nastier because there are readers all the time.
> The
> > > > > > > > > read-side critical section are still short-ish, but they need the
> sleepable
> > > > > > > > > part because they access user memory.
> > > > > > > > Which one of these two is in play in this case?
> > > > > > > >
> > > > > > > > > Writers are not frequent per se; the problem is they come in
> very large
> > > > > > > > > bursts when a guest boots.  And while the whole boot path
> overall can be
> > > > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have
> a larger
> > > > > > > > > impact on runtime than the O(n^2) parts, as demonstrated
> here.
> > > > > > > > >
> > > > > > > > > Therefore, we operated on the assumption that the callers of
> > > > > > > > > synchronized_srcu_expedited were _anyway_ busy running
> CPU-bound guest code
> > > > > > > > > and the desire was to get past the booting phase as fast as
> possible.  If
> > > > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it
> wants;
> > > > > > > > > therefore, as long as expedited GPs didn't eat CPU
> *throughout the whole
> > > > > > > > > system*, a preemptable busy wait in
> synchronize_srcu_expedited() were not
> > > > > > > > > problematic.
> > > > > > > > >
> > > > > > > > > This assumptions did match the SRCU code when kvm->srcu
> and kvm->irq_srcu
> > > > > > > > > were was introduced (respectively in 2009 and 2014).  But
> perhaps they do
> > > > > > > > > not hold anymore now that each SRCU is not as independent
> as it used to be
> > > > > > > > > in those years, and instead they use workqueues instead?
> > > > > > > > The problem was not internal to SRCU, but rather due to the fact
> > > > > > > > that kernel live patching (KLP) had problems with the
> CPU-bound tasks
> > > > > > > > resulting from repeated synchronize_rcu_expedited()
> invocations.  So I
> > > > > > > > added heuristics to get the occasional sleep in there for KLP's
> benefit.
> > > > > > > > Perhaps these heuristics need to be less aggressive about adding
> sleep.
> > > > > > > >
> > > > > > > > These heuristics have these aspects:
> > > > > > > >
> > > > > > > > 1.	The longer readers persist in an expedited SRCU grace period,
> > > > > > > > 	the longer the wait between successive checks of the reader
> > > > > > > > 	state.  Roughly speaking, we wait as long as the grace period
> > > > > > > > 	has currently been in effect, capped at ten jiffies.
> > > > > > > >
> > > > > > > > 2.	SRCU grace periods have several phases.  We reset so that
> each
> > > > > > > > 	phase starts by not waiting (new phase, new set of readers,
> > > > > > > > 	so don't penalize this set for the sins of the previous set).
> > > > > > > > 	But once we get to the point of adding delay, we add the
> > > > > > > > 	delay based on the beginning of the full grace period.
> > > > > > > >
> > > > > > > > Right now, the checking for grace-period length does not allow
> for the
> > > > > > > > possibility that a grace period might start just before the jiffies
> > > > > > > > counter gets incremented (because I didn't realize that anyone
> cared),
> > > > > > > > so that is one possible thing to change.  I can also allow more
> no-delay
> > > > > > > > checks per SRCU grace-period phase.
> > > > > > > >
> > > > > > > > Zhangfei, does something like the patch shown below help?
> > > > > > > >
> > > > > > > > Additional adjustments are likely needed to avoid re-breaking
> KLP,
> > > > > > > > but we have to start somewhere...
> > > > > > > >
> > > > > > > > 							Thanx, Paul
> > > > > > > >
> > > > > > > > ------------------------------------------------------------------------
> > > > > > > >
> > > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > > > > > index 50ba70f019dea..6a354368ac1d1 100644
> > > > > > > > --- a/kernel/rcu/srcutree.c
> > > > > > > > +++ b/kernel/rcu/srcutree.c
> > > > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
> srcu_struct *ssp)
> > > > > > > >     #define SRCU_INTERVAL		1	// Base delay if no
> expedited GPs pending.
> > > > > > > >     #define SRCU_MAX_INTERVAL	10	// Maximum
> incremental delay from slow readers.
> > > > > > > > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum
> per-GP-phase consecutive no-delay instances.
> > > > > > > > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum
> per-GP-phase consecutive no-delay instances.
> > > > > > > >     #define SRCU_MAX_NODELAY	100	// Maximum
> consecutive no-delay instances.
> > > > > > > >     /*
> > > > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct
> srcu_struct *ssp)
> > > > > > > >      */
> > > > > > > >     static unsigned long srcu_get_delay(struct srcu_struct
> *ssp)
> > > > > > > >     {
> > > > > > > > +	unsigned long gpstart;
> > > > > > > > +	unsigned long j;
> > > > > > > >     	unsigned long jbase = SRCU_INTERVAL;
> > > > > > > >     	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > > > > >     		jbase = 0;
> > > > > > > > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > > > > > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > > > > > > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > > > > > +		j = jiffies - 1;
> > > > > > > > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > > > > > +		if (time_after(j, gpstart))
> > > > > > > > +			jbase += j - gpstart;
> > > > > > > > +	}
> > > > > > > >     	if (!jbase) {
> > > > > > > >     		WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > > > > >     		if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> SRCU_MAX_NODELAY_PHASE)
> > > > > > > Unfortunately, this patch does not helpful.
> > > > > > >
> > > > > > > Then re-add the debug info.
> > > > > > >
> > > > > > > During the qemu boot
> > > > > > > [  232.997667]  __synchronize_srcu loop=1000
> > > > > > >
> > > > > > > [  361.094493]  __synchronize_srcu loop=9000
> > > > > > > [  361.094501] Call trace:
> > > > > > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > > > > > [  361.094505]  show_stack+0x20/0x70
> > > > > > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > > > > > [  361.094509]  dump_stack+0x18/0x34
> > > > > > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > > > > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > > > > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > > > > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > > > > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > > > > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > > > > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > > > > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > > > > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > > > > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > > > > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > > > > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > > > > > [  361.094538]  el0_svc+0x30/0x98
> > > > > > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > > > > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > > > > > [  363.942817]  kvm_set_memory_region loop=6000
> > > > > > Huh.
> > > > > >
> > > > > > One possibility is that the "if (!jbase)" block needs to be nested
> > > > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {"
> block.
> > > > I test this diff and NO helpful
> > > >
> > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > index 50ba70f019de..36286a4b74e6 100644
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
> srcu_struct *ssp)
> > > >
> > > >   #define SRCU_INTERVAL          1       // Base delay if no
> expedited GPs
> > > > pending.
> > > >   #define SRCU_MAX_INTERVAL      10      // Maximum
> incremental delay from
> > > > slow readers.
> > > > -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum
> per-GP-phase consecutive
> > > > no-delay instances.
> > > > +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum
> per-GP-phase consecutive
> > > > no-delay instances.
> > > >   #define SRCU_MAX_NODELAY       100     // Maximum
> consecutive no-delay
> > > > instances.
> > > >
> > > >   /*
> > > > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct
> srcu_struct
> > > > *ssp)
> > > >    */
> > > >   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > >   {
> > > > +       unsigned long gpstart;
> > > > +       unsigned long j;
> > > >          unsigned long jbase = SRCU_INTERVAL;
> > > >
> > > >          if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > >                  jbase = 0;
> > > > -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > -               jbase += jiffies -
> READ_ONCE(ssp->srcu_gp_start);
> > > > -       if (!jbase) {
> > > > -               WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> > > > SRCU_MAX_NODELAY_PHASE)
> > > > -                       jbase = 1;
> > > > +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > +               j = jiffies - 1;
> > > > +               gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > +               if (time_after(j, gpstart))
> > > > +                       jbase += j - gpstart;
> > > > +
> > > > +               if (!jbase) {
> > > >
> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > +                       if
> (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> > > > SRCU_MAX_NODELAY_PHASE)
> > > > +                               jbase = 1;
> > > > +               }
> > > >          }
> > > That is in fact what I was intending you to test, thank you.  As you
> > > say, unfortunately it did not help.
> > >
> > > Could you please test removing the "if (!jbase)" block entirely?
> > Remove "if (!jbase)" block is much faster,
> > not measure clearly, qemu (with debug version efi) boot seems normally.
> >
> > From log timestamp:
> > [  114.624713] __synchronize_srcu loop=1000
> > [  124.157011]  __synchronize_srcu loop=9000
> >
> > Several method: timestamps are different.
> >
> > 5.19-rc1
> > [   94.271350]  __synchronize_srcu loop=1001
> > [  222.621659]  __synchronize_srcu loop=9001
> >
> >
> > With your first diff:
> > [  232.997667]  __synchronize_srcu loop=1000
> > [  361.094493]  __synchronize_srcu loop=9000
> >
> > Remove "if (!jbase)" block
> > [  114.624713] __synchronize_srcu loop=1000
> > [  124.157011]  __synchronize_srcu loop=9000
> >
> >
> > 5.18 method
> > +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > +               return 0;
> > +       return SRCU_INTERVAL;
> >
> > [   74.598480] __synchronize_srcu loop=9000
> > [   68.938297] __synchronize_srcu loop=1000
> 
> Thank you for the information!
> 
> What happens if you keep the that "if (!jbase)" block", but set the
> value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?

From the setup I have, this behaves almost the same as the previous logic (the one
without the "if (!jbase)" block). In both cases, I think we are not close to 5.18, but
definitely much better compared to 5.19-rc1.

The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y),

                       Guest boot time(using 'time'):

5.18-rc4 based               ~8sec

5.19-rc1                     ~2m43sec

5.19-rc1+fix1                 ~19sec

5.19-rc1-fix2                 ~19sec

I will wait for Zhangfei to confirm this on his setup, especially the difference
compared to 5.18.

Thanks,
Shameer
 
> This would be too large for KLP, but my hope is that there is a value
> of SRCU_MAX_NODELAY_PHASE that works for everyone.  But first, does
> this help at all?  ;-)
> 
> > > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls,
> the
> > > > > above change reduces the overhead by more than an order of
> magnitude.
> > > > > Except that the overhead of the series is far less than one second,
> > > > > not the several minutes that you are seeing.  So the per-call
> overhead
> > > > > decreases from about 17 microseconds to a bit more than one
> microsecond.
> > > > >
> > > > > I could imagine an extra order of magnitude if you are running
> HZ=100
> > > > > instead of the HZ=1000 that I am running.  But that only gets up to a
> > > > > few seconds.
> > > One possible reason for the difference would be if your code has
> > > SRCU readers.
> > >
> > > Could you please tell me the value of CONFIG_HZ on your system?
> > > Also the value of CONFIG_PREEMPTION?
> > I am using arch/arm64/configs/defconfig
> > make defconfig
> > CONFIG_PREEMPTION=y
> > CONFIG_HZ_250=y
> 
> Thank you again!
> 
> And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might
> depend
> on HZ.  And who knows what all else...
> 
> 							Thanx, Paul
> 
> > Thanks
> >
> > >
> > > > > > One additional debug is to apply the patch below on top of the one
> you
> > > > apply the patch below?
> > > > > > just now kindly tested, then use whatever debug technique you wish
> to
> > > > > > work out what fraction of the time during that critical interval that
> > > > > > srcu_get_delay() returns non-zero.
> > > > Sorry, I am confused, no patch right?
> > > Apologies, my omission.
> > >
> > > > Just measure srcu_get_delay return to non-zero?
> > > Exactly, please!
> > >
> > > > By the way, the issue should be only related with qemu apci. not related
> > > > with rmr feature
> > > > Test with: https://github.com/qemu/qemu/tree/stable-6.1
> > > >
> > > > Looks it caused by too many kvm_region_add & kvm_region_del if
> acpi=force,
> > > > If no acpi, no print kvm_region_add/del (1000 times print once)
> > > >
> > > > If with acpi=force,
> > > > During qemu boot
> > > > kvm_region_add region_add = 1000
> > > > kvm_region_del region_del = 1000
> > > > kvm_region_add region_add = 2000
> > > > kvm_region_del region_del = 2000
> > > > kvm_region_add region_add = 3000
> > > > kvm_region_del region_del = 3000
> > > > kvm_region_add region_add = 4000
> > > > kvm_region_del region_del = 4000
> > > > kvm_region_add region_add = 5000
> > > > kvm_region_del region_del = 5000
> > > > kvm_region_add region_add = 6000
> > > > kvm_region_del region_del = 6000
> > > >
> > > > kvm_region_add/kvm_region_del ->
> > > > kvm_set_phys_mem->
> > > > kvm_set_user_memory_region->
> > > > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
> > > >
> > > > [  361.094493]  __synchronize_srcu loop=9000
> > > > [  361.094501] Call trace:
> > > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > > [  361.094505]  show_stack+0x20/0x70
> > > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > > [  361.094509]  dump_stack+0x18/0x34
> > > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > > [  361.094538]  el0_svc+0x30/0x98
> > > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > > [  363.942817]  kvm_set_memory_region loop=6000
> > > Good to know, thank you!
> > >
> > > 							Thanx, Paul
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-11 16:32 Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow Zhangfei Gao
  2022-06-11 16:59 ` Paul E. McKenney
@ 2022-06-14  1:53 ` chenxiang (M)
  1 sibling, 0 replies; 37+ messages in thread
From: chenxiang (M) @ 2022-06-14  1:53 UTC (permalink / raw)
  To: Zhangfei Gao, Paul E. McKenney, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi

Hi, I also encountered a similar issue, and I reported it 
here (https://www.spinics.net/lists/kernel/msg4396974.html).

I tried the change Paul provided in this thread, but the issue is still there.


On 2022/6/12 0:32, Zhangfei Gao wrote:
> Hi, Paul
>
> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest 
> kernel stuck for several minutes.
> And on 5.18, there is no such problem.
>
> After revert this patch, the issue solved.
> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers 
> from  consuming CPU)
>
>
> qemu cmd:
> build/aarch64-softmmu/qemu-system-aarch64 -machine 
> virt,gic-version=3,iommu=smmuv3 \
> -enable-kvm -cpu host -m 1024 \
> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off 
> acpi=force" \
> -bios QEMU_EFI.fd
>
> log:
> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> add-symbol-file 
> /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll 
> 0x75459000
> Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> ProtectUefiImageCommon - 0x7AA4D040
>   - 0x0000000075458000 - 0x000000000003F000
> SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 
> (0x0000000000004008)
> SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 
> (0x0000000000020008)
> SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 
> (0x0000000000004008)
> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
>
>
> Not sure it is either reported or solved.
>
> Thanks
>
>
>
>
>
>
>
> .
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13 20:55                               ` Shameerali Kolothum Thodi
@ 2022-06-14 12:19                                 ` Neeraj Upadhyay
  2022-06-14 14:03                                   ` zhangfei.gao
  0 siblings, 1 reply; 37+ messages in thread
From: Neeraj Upadhyay @ 2022-06-14 12:19 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, paulmck, zhangfei.gao
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)

Hi,


On 6/14/2022 2:25 AM, Shameerali Kolothum Thodi wrote:
> 
> 
>> -----Original Message-----
>> From: Paul E. McKenney [mailto:paulmck@kernel.org]
>> Sent: 13 June 2022 15:59
>> To: zhangfei.gao@foxmail.com
>> Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao
>> <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org;
>> rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett
>> <josh@joshtriplett.org>; Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com>
>> Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
>> blocking readers from consuming CPU) cause qemu boot slow
>>
>> On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@foxmail.com
>> wrote:
>>>
>>>
>>> On 2022/6/13 8:18 PM, Paul E. McKenney wrote:
>>>> On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com
>> wrote:
>>>>> Hi, Paul
>>>>>
>>>>>> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
>>>>>> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
>>>>>>> On Mon, Jun 13, 2022 at 11:04:39AM +0800,
>> zhangfei.gao@foxmail.com wrote:
>>>>>>>> Hi, Paul
>>>>>>>>
>>>>>>>> On 2022/6/13 上午2:49, Paul E. McKenney wrote:
>>>>>>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini
>> wrote:
>>>>>>>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>>>>>>>>>> Do these reserved memory regions really need to be
>> allocated separately?
>>>>>>>>>>>> (For example, are they really all non-contiguous?  If not,
>> that is, if
>>>>>>>>>>>> there are a lot of contiguous memory regions, could you
>> sort the IORT
>>>>>>>>>>>> by address and do one ioctl() for each set of contiguous
>> memory regions?)
>>>>>>>>>>>>
>>>>>>>>>>>> Are all of these reserved memory regions set up before init
>> is spawned?
>>>>>>>>>>>>
>>>>>>>>>>>> Are all of these reserved memory regions set up while
>> there is only a
>>>>>>>>>>>> single vCPU up and running?
>>>>>>>>>>>>
>>>>>>>>>>>> Is the SRCU grace period really needed in this case?  (I
>> freely confess
>>>>>>>>>>>> to not being all that familiar with KVM.)
>>>>>>>>>>> Oh, and there was a similar many-requests problem with
>> networking many
>>>>>>>>>>> years ago.  This was solved by adding a new
>> syscall/ioctl()/whatever
>>>>>>>>>>> that permitted many requests to be presented to the kernel
>> with a single
>>>>>>>>>>> system call.
>>>>>>>>>>>
>>>>>>>>>>> Could a new ioctl() be introduced that requested a large
>> number
>>>>>>>>>>> of these memory regions in one go so as to make each call to
>>>>>>>>>>> synchronize_rcu_expedited() cover a useful fraction of your
>> 9000+
>>>>>>>>>>> requests?  Adding a few of the KVM guys on CC for their
>> thoughts.
>>>>>>>>>> Unfortunately not.  Apart from this specific case, in general
>> the calls to
>>>>>>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to
>> I/O registers in the
>>>>>>>>>> guest, and those writes then map to a ioctl.  Typically the
>> guest sets up a
>>>>>>>>>> device at a time, and each setup step causes a
>> synchronize_srcu()---and
>>>>>>>>>> expedited at that.
>>>>>>>>> I was afraid of something like that...
>>>>>>>>>
>>>>>>>>>> KVM has two SRCUs:
>>>>>>>>>>
>>>>>>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it
>> has readers
>>>>>>>>>> that are very very small, but it needs extremely fast detection
>> of grace
>>>>>>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>>>>>>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off
>> kvm->srcu.  Readers are
>>>>>>>>>> not so frequent.
>>>>>>>>>>
>>>>>>>>>> 2) kvm->srcu is nastier because there are readers all the time.
>> The
>>>>>>>>>> read-side critical section are still short-ish, but they need the
>> sleepable
>>>>>>>>>> part because they access user memory.
>>>>>>>>> Which one of these two is in play in this case?
>>>>>>>>>
>>>>>>>>>> Writers are not frequent per se; the problem is they come in
>> very large
>>>>>>>>>> bursts when a guest boots.  And while the whole boot path
>> overall can be
>>>>>>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have
>> a larger
>>>>>>>>>> impact on runtime than the O(n^2) parts, as demonstrated
>> here.
>>>>>>>>>>
>>>>>>>>>> Therefore, we operated on the assumption that the callers of
>>>>>>>>>> synchronized_srcu_expedited were _anyway_ busy running
>> CPU-bound guest code
>>>>>>>>>> and the desire was to get past the booting phase as fast as
>> possible.  If
>>>>>>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it
>> wants;
>>>>>>>>>> therefore, as long as expedited GPs didn't eat CPU
>> *throughout the whole
>>>>>>>>>> system*, a preemptable busy wait in
>> synchronize_srcu_expedited() were not
>>>>>>>>>> problematic.
>>>>>>>>>>
>>>>>>>>>> This assumptions did match the SRCU code when kvm->srcu
>> and kvm->irq_srcu
>>>>>>>>>> were was introduced (respectively in 2009 and 2014).  But
>> perhaps they do
>>>>>>>>>> not hold anymore now that each SRCU is not as independent
>> as it used to be
>>>>>>>>>> in those years, and instead they use workqueues instead?
>>>>>>>>> The problem was not internal to SRCU, but rather due to the fact
>>>>>>>>> that kernel live patching (KLP) had problems with the
>> CPU-bound tasks
>>>>>>>>> resulting from repeated synchronize_rcu_expedited()
>> invocations.  So I
>>>>>>>>> added heuristics to get the occasional sleep in there for KLP's
>> benefit.
>>>>>>>>> Perhaps these heuristics need to be less aggressive about adding
>> sleep.
>>>>>>>>>
>>>>>>>>> These heuristics have these aspects:
>>>>>>>>>
>>>>>>>>> 1.	The longer readers persist in an expedited SRCU grace period,
>>>>>>>>> 	the longer the wait between successive checks of the reader
>>>>>>>>> 	state.  Roughly speaking, we wait as long as the grace period
>>>>>>>>> 	has currently been in effect, capped at ten jiffies.
>>>>>>>>>
>>>>>>>>> 2.	SRCU grace periods have several phases.  We reset so that
>> each
>>>>>>>>> 	phase starts by not waiting (new phase, new set of readers,
>>>>>>>>> 	so don't penalize this set for the sins of the previous set).
>>>>>>>>> 	But once we get to the point of adding delay, we add the
>>>>>>>>> 	delay based on the beginning of the full grace period.
>>>>>>>>>
>>>>>>>>> Right now, the checking for grace-period length does not allow
>> for the
>>>>>>>>> possibility that a grace period might start just before the jiffies
>>>>>>>>> counter gets incremented (because I didn't realize that anyone
>> cared),
>>>>>>>>> so that is one possible thing to change.  I can also allow more
>> no-delay
>>>>>>>>> checks per SRCU grace-period phase.
>>>>>>>>>
>>>>>>>>> Zhangfei, does something like the patch shown below help?
>>>>>>>>>
>>>>>>>>> Additional adjustments are likely needed to avoid re-breaking
>> KLP,
>>>>>>>>> but we have to start somewhere...
>>>>>>>>>
>>>>>>>>> 							Thanx, Paul
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>>>>>>> index 50ba70f019dea..6a354368ac1d1 100644
>>>>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
>> srcu_struct *ssp)
>>>>>>>>>      #define SRCU_INTERVAL		1	// Base delay if no
>> expedited GPs pending.
>>>>>>>>>      #define SRCU_MAX_INTERVAL	10	// Maximum
>> incremental delay from slow readers.
>>>>>>>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum
>> per-GP-phase consecutive no-delay instances.
>>>>>>>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum
>> per-GP-phase consecutive no-delay instances.
>>>>>>>>>      #define SRCU_MAX_NODELAY	100	// Maximum
>> consecutive no-delay instances.
>>>>>>>>>      /*
>>>>>>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct
>> srcu_struct *ssp)
>>>>>>>>>       */
>>>>>>>>>      static unsigned long srcu_get_delay(struct srcu_struct
>> *ssp)
>>>>>>>>>      {
>>>>>>>>> +	unsigned long gpstart;
>>>>>>>>> +	unsigned long j;
>>>>>>>>>      	unsigned long jbase = SRCU_INTERVAL;
>>>>>>>>>      	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>>>>>      		jbase = 0;
>>>>>>>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>>>>>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>>>>>>>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>>>>>>> +		j = jiffies - 1;
>>>>>>>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>>>>>>> +		if (time_after(j, gpstart))
>>>>>>>>> +			jbase += j - gpstart;
>>>>>>>>> +	}
>>>>>>>>>      	if (!jbase) {
>>>>>>>>>      		WRITE_ONCE(ssp->srcu_n_exp_nodelay,
>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>>>>>>>      		if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
>> SRCU_MAX_NODELAY_PHASE)
>>>>>>>> Unfortunately, this patch does not helpful.
>>>>>>>>
>>>>>>>> Then re-add the debug info.
>>>>>>>>
>>>>>>>> During the qemu boot
>>>>>>>> [  232.997667]  __synchronize_srcu loop=1000
>>>>>>>>
>>>>>>>> [  361.094493]  __synchronize_srcu loop=9000
>>>>>>>> [  361.094501] Call trace:
>>>>>>>> [  361.094502]  dump_backtrace+0xe4/0xf0
>>>>>>>> [  361.094505]  show_stack+0x20/0x70
>>>>>>>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
>>>>>>>> [  361.094509]  dump_stack+0x18/0x34
>>>>>>>> [  361.094511]  __synchronize_srcu+0x120/0x128
>>>>>>>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
>>>>>>>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
>>>>>>>> [  361.094519]  kvm_activate_memslot+0x40/0x68
>>>>>>>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
>>>>>>>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
>>>>>>>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
>>>>>>>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
>>>>>>>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
>>>>>>>> [  361.094530]  invoke_syscall+0x4c/0x110
>>>>>>>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
>>>>>>>> [  361.094536]  do_el0_svc+0x34/0xc0
>>>>>>>> [  361.094538]  el0_svc+0x30/0x98
>>>>>>>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
>>>>>>>> [  361.094544]  el0t_64_sync+0x18c/0x190
>>>>>>>> [  363.942817]  kvm_set_memory_region loop=6000
>>>>>>> Huh.
>>>>>>>
>>>>>>> One possibility is that the "if (!jbase)" block needs to be nested
>>>>>>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {"
>> block.
>>>>> I test this diff and NO helpful
>>>>>
>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>>> index 50ba70f019de..36286a4b74e6 100644
>>>>> --- a/kernel/rcu/srcutree.c
>>>>> +++ b/kernel/rcu/srcutree.c
>>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
>> srcu_struct *ssp)
>>>>>
>>>>>    #define SRCU_INTERVAL          1       // Base delay if no
>> expedited GPs
>>>>> pending.
>>>>>    #define SRCU_MAX_INTERVAL      10      // Maximum
>> incremental delay from
>>>>> slow readers.
>>>>> -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum
>> per-GP-phase consecutive
>>>>> no-delay instances.
>>>>> +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum
>> per-GP-phase consecutive
>>>>> no-delay instances.
>>>>>    #define SRCU_MAX_NODELAY       100     // Maximum
>> consecutive no-delay
>>>>> instances.
>>>>>
>>>>>    /*
>>>>> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct
>> srcu_struct
>>>>> *ssp)
>>>>>     */
>>>>>    static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>>>>    {
>>>>> +       unsigned long gpstart;
>>>>> +       unsigned long j;
>>>>>           unsigned long jbase = SRCU_INTERVAL;
>>>>>
>>>>>           if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>                   jbase = 0;
>>>>> -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>>> -               jbase += jiffies -
>> READ_ONCE(ssp->srcu_gp_start);
>>>>> -       if (!jbase) {
>>>>> -               WRITE_ONCE(ssp->srcu_n_exp_nodelay,
>>>>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>>> -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
>>>>> SRCU_MAX_NODELAY_PHASE)
>>>>> -                       jbase = 1;
>>>>> +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>>> +               j = jiffies - 1;
>>>>> +               gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>>> +               if (time_after(j, gpstart))
>>>>> +                       jbase += j - gpstart;
>>>>> +
>>>>> +               if (!jbase) {
>>>>>
>> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay,
>>>>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>>> +                       if
>> (READ_ONCE(ssp->srcu_n_exp_nodelay) >
>>>>> SRCU_MAX_NODELAY_PHASE)
>>>>> +                               jbase = 1;
>>>>> +               }
>>>>>           }
>>>> That is in fact what I was intending you to test, thank you.  As you
>>>> say, unfortunately it did not help.
>>>>
>>>> Could you please test removing the "if (!jbase)" block entirely?
>>> Remove "if (!jbase)" block is much faster,
>>> not measure clearly, qemu (with debug version efi) boot seems normally.
>>>
>>>  From log timestamp:
>>> [  114.624713] __synchronize_srcu loop=1000
>>> [  124.157011]  __synchronize_srcu loop=9000
>>>
>>> Several method: timestamps are different.
>>>
>>> 5.19-rc1
>>> [   94.271350]  __synchronize_srcu loop=1001
>>> [  222.621659]  __synchronize_srcu loop=9001
>>>
>>>
>>> With your first diff:
>>> [  232.997667]  __synchronize_srcu loop=1000
>>> [  361.094493]  __synchronize_srcu loop=9000
>>>
>>> Remove "if (!jbase)" block
>>> [  114.624713] __synchronize_srcu loop=1000
>>> [  124.157011]  __synchronize_srcu loop=9000
>>>
>>>
>>> 5.18 method
>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>> +               return 0;
>>> +       return SRCU_INTERVAL;
>>>
>>> [   74.598480] __synchronize_srcu loop=9000
>>> [   68.938297] __synchronize_srcu loop=1000
>>
>> Thank you for the information!
>>
>> What happens if you keep the that "if (!jbase)" block", but set the
>> value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?
> 
>  From the setup I have, this behaves almost the same as the previous logic (the one
> without the "if (!jbase)" block). In both cases, I think we are not close to 5.18, but
> definitely much better compared to 5.19-rc1.
> 
> The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y),
> 
>                         Guest boot time(using 'time'):
> 
> 5.18-rc4 based               ~8sec
> 
> 5.19-rc1                     ~2m43sec
> 
> 5.19-rc1+fix1                 ~19sec
> 
> 5.19-rc1-fix2                 ~19sec
> 

If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ; 
does it show any difference in boot time?

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct 
*ssp, struct srcu_node *snp
   */
  static void srcu_gp_end(struct srcu_struct *ssp)
  {
-       unsigned long cbdelay;
+       unsigned long cbdelay = 1;
         bool cbs;
         bool last_lvl;
         int cpu;
@@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
         spin_lock_irq_rcu_node(ssp);
         idx = rcu_seq_state(ssp->srcu_gp_seq);
         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
-       cbdelay = !!srcu_get_delay(ssp);
+       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), 
READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+               cbdelay = 0;
+
         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
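
In effect, the callback-delay decision in srcu_gp_end() then no longer consults
srcu_get_delay() at all and depends only on whether an expedited grace period has
been requested. A rough sketch of the result (unrelated lines elided; only the
diff above is authoritative):

static void srcu_gp_end(struct srcu_struct *ssp)
{
        unsigned long cbdelay = 1;      /* default: one-jiffy callback delay */
        ...
        /* Skip the delay only when an expedited grace period is pending. */
        if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
                cbdelay = 0;

        WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
        ...
}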



Thanks
Neeraj


> I will wait for Zhangfei to confirm this on his setup, especially the difference
> compared to 5.18.
> 
> Thanks,
> Shameer
>   
>> This would be too large for KLP, but my hope is that there is a value
>> of SRCU_MAX_NODELAY_PHASE that works for everyone.  But first, does
>> this help at all?  ;-)
>>
>>>>>> And when I run 10,000 consecutive synchronize_rcu_expedited() calls,
>> the
>>>>>> above change reduces the overhead by more than an order of
>> magnitude.
>>>>>> Except that the overhead of the series is far less than one second,
>>>>>> not the several minutes that you are seeing.  So the per-call
>> overhead
>>>>>> decreases from about 17 microseconds to a bit more than one
>> microsecond.
>>>>>>
>>>>>> I could imagine an extra order of magnitude if you are running
>> HZ=100
>>>>>> instead of the HZ=1000 that I am running.  But that only gets up to a
>>>>>> few seconds.
>>>> One possible reason for the difference would be if your code has
>>>> SRCU readers.
>>>>
>>>> Could you please tell me the value of CONFIG_HZ on your system?
>>>> Also the value of CONFIG_PREEMPTION?
>>> I am using arch/arm64/configs/defconfig
>>> make defconfig
>>> CONFIG_PREEMPTION=y
>>> CONFIG_HZ_250=y
>>
>> Thank you again!
>>
>> And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might
>> depend
>> on HZ.  And who knows what all else...
>>
>> 							Thanx, Paul
>>
>>> Thanks
>>>
>>>>
>>>>>>> One additional debug is to apply the patch below on top of the one
>> you
>>>>> apply the patch below?
>>>>>>> just now kindly tested, then use whatever debug technique you wish
>> to
>>>>>>> work out what fraction of the time during that critical interval that
>>>>>>> srcu_get_delay() returns non-zero.
>>>>> Sorry, I am confused, no patch right?
>>>> Apologies, my omission.
>>>>
>>>>> Just measure srcu_get_delay return to non-zero?
>>>> Exactly, please!
>>>>
>>>>> By the way, the issue should be only related with qemu apci. not related
>>>>> with rmr feature
>>>>> Test with: https://github.com/qemu/qemu/tree/stable-6.1
>>>>>
>>>>> Looks it caused by too many kvm_region_add & kvm_region_del if
>> acpi=force,
>>>>> If no acpi, no print kvm_region_add/del (1000 times print once)
>>>>>
>>>>> If with acpi=force,
>>>>> During qemu boot
>>>>> kvm_region_add region_add = 1000
>>>>> kvm_region_del region_del = 1000
>>>>> kvm_region_add region_add = 2000
>>>>> kvm_region_del region_del = 2000
>>>>> kvm_region_add region_add = 3000
>>>>> kvm_region_del region_del = 3000
>>>>> kvm_region_add region_add = 4000
>>>>> kvm_region_del region_del = 4000
>>>>> kvm_region_add region_add = 5000
>>>>> kvm_region_del region_del = 5000
>>>>> kvm_region_add region_add = 6000
>>>>> kvm_region_del region_del = 6000
>>>>>
>>>>> kvm_region_add/kvm_region_del ->
>>>>> kvm_set_phys_mem->
>>>>> kvm_set_user_memory_region->
>>>>> kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
>>>>>
>>>>> [  361.094493]  __synchronize_srcu loop=9000
>>>>> [  361.094501] Call trace:
>>>>> [  361.094502]  dump_backtrace+0xe4/0xf0
>>>>> [  361.094505]  show_stack+0x20/0x70
>>>>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
>>>>> [  361.094509]  dump_stack+0x18/0x34
>>>>> [  361.094511]  __synchronize_srcu+0x120/0x128
>>>>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
>>>>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
>>>>> [  361.094519]  kvm_activate_memslot+0x40/0x68
>>>>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
>>>>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
>>>>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
>>>>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
>>>>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
>>>>> [  361.094530]  invoke_syscall+0x4c/0x110
>>>>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
>>>>> [  361.094536]  do_el0_svc+0x34/0xc0
>>>>> [  361.094538]  el0_svc+0x30/0x98
>>>>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
>>>>> [  361.094544]  el0t_64_sync+0x18c/0x190
>>>>> [  363.942817]  kvm_set_memory_region loop=6000
>>>> Good to know, thank you!
>>>>
>>>> 							Thanx, Paul
>>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-14 12:19                                 ` Neeraj Upadhyay
@ 2022-06-14 14:03                                   ` zhangfei.gao
  2022-06-14 14:14                                     ` Neeraj Upadhyay
  2022-06-14 14:17                                     ` Paul E. McKenney
  0 siblings, 2 replies; 37+ messages in thread
From: zhangfei.gao @ 2022-06-14 14:03 UTC (permalink / raw)
  To: Neeraj Upadhyay, Shameerali Kolothum Thodi, paulmck
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)



On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>
>>
>> 5.18-rc4 based               ~8sec
>>
>> 5.19-rc1                     ~2m43sec
>>
>> 5.19-rc1+fix1                 ~19sec
>>
>> 5.19-rc1-fix2                 ~19sec
>>
>
> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 
> ; does it show any difference in boot time?
>
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
> srcu_struct *ssp, struct srcu_node *snp
>   */
>  static void srcu_gp_end(struct srcu_struct *ssp)
>  {
> -       unsigned long cbdelay;
> +       unsigned long cbdelay = 1;
>         bool cbs;
>         bool last_lvl;
>         int cpu;
> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>         spin_lock_irq_rcu_node(ssp);
>         idx = rcu_seq_state(ssp->srcu_gp_seq);
>         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
> -       cbdelay = !!srcu_get_delay(ssp);
> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), 
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> +               cbdelay = 0;
> +
>         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>
Test here:
qemu: https://github.com/qemu/qemu/tree/stable-6.1
kernel: 
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
(in case the test patches are not clear, they are pushed in this git tree)

Hardware: aarch64

1. 5.18-rc6
real    0m8.402s
user    0m3.015s
sys     0m1.102s

2. 5.19-rc1
real    2m41.433s
user    0m3.097s
sys     0m1.177s

3. 5.19-rc1 + fix1 from Paul
real    2m43.404s
user    0m2.880s
sys     0m1.214s

4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
real    0m15.262s
user    0m3.003s
sys     0m1.033s

When a kernel build is running at the same time, the load time becomes longer.

5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
real    0m15.215s
user    0m2.942s
sys    0m1.172s

6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
real    1m23.936s
user    0m2.969s
sys    0m1.181s
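
For clarity, "Remove "if (!jbase)" block" (fix2 above) leaves srcu_get_delay() looking
roughly like the sketch below. This is only reconstructed from fix1 plus that removal;
the trailing clamp to SRCU_MAX_INTERVAL is assumed from the surrounding 5.19 code and
is not quoted in this thread.

static unsigned long srcu_get_delay(struct srcu_struct *ssp)
{
        unsigned long gpstart;
        unsigned long j;
        unsigned long jbase = SRCU_INTERVAL;

        /* Start from a zero base delay if an expedited GP has been requested. */
        if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
                jbase = 0;
        /* If a grace period is in flight, add its current duration to the delay. */
        if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
                j = jiffies - 1;
                gpstart = READ_ONCE(ssp->srcu_gp_start);
                if (time_after(j, gpstart))
                        jbase += j - gpstart;
        }
        /* The "if (!jbase)" no-delay accounting block is removed entirely. */
        return jbase > SRCU_MAX_INTERVAL ? SRCU_MAX_INTERVAL : jbase;
}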

More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5

Thanks


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-14 14:03                                   ` zhangfei.gao
@ 2022-06-14 14:14                                     ` Neeraj Upadhyay
  2022-06-14 14:57                                       ` zhangfei.gao
  2022-06-14 14:17                                     ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: Neeraj Upadhyay @ 2022-06-14 14:14 UTC (permalink / raw)
  To: zhangfei.gao, Shameerali Kolothum Thodi, paulmck
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)



On 6/14/2022 7:33 PM, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>
>>>
>>> 5.18-rc4 based               ~8sec
>>>
>>> 5.19-rc1                     ~2m43sec
>>>
>>> 5.19-rc1+fix1                 ~19sec
>>>
>>> 5.19-rc1-fix2                 ~19sec
>>>
>>
>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 
>> ; does it show any difference in boot time?
>>
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
>> srcu_struct *ssp, struct srcu_node *snp
>>   */
>>  static void srcu_gp_end(struct srcu_struct *ssp)
>>  {
>> -       unsigned long cbdelay;
>> +       unsigned long cbdelay = 1;
>>         bool cbs;
>>         bool last_lvl;
>>         int cpu;
>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>         spin_lock_irq_rcu_node(ssp);
>>         idx = rcu_seq_state(ssp->srcu_gp_seq);
>>         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>> -       cbdelay = !!srcu_get_delay(ssp);
>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), 
>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>> +               cbdelay = 0;
>> +
>>         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>
> Test here:
> qemu: https://github.com/qemu/qemu/tree/stable-6.1
> kernel: 
> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
> (in case the test patches are not clear, they are pushed in this git tree)
> 
> Hardware: aarch64
> 
> 1. 5.18-rc6
> real    0m8.402s
> user    0m3.015s
> sys     0m1.102s
> 
> 2. 5.19-rc1
> real    2m41.433s
> user    0m3.097s
> sys     0m1.177s
> 
> 3. 5.19-rc1 + fix1 from Paul
> real    2m43.404s
> user    0m2.880s
> sys     0m1.214s
> 
> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
> real    0m15.262s
> user    0m3.003s
> sys     0m1.033s
> 
> When a kernel build is running at the same time, the load time becomes longer.
> 
> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
> real    0m15.215s
> user    0m2.942s
> sys    0m1.172s
> 
> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
> real    1m23.936s
> user    0m2.969s
> sys    0m1.181s
> 

Thanks for this data. Can you please share the below test combos also?

7. 5.19-rc1 + fix5: fix2 + Neeraj's change of srcu_gp_end


8. 5.19-rc1 + fix6: fix3 + Neeraj's change of srcu_gp_end


Thanks
Neeraj

> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-14 14:03                                   ` zhangfei.gao
  2022-06-14 14:14                                     ` Neeraj Upadhyay
@ 2022-06-14 14:17                                     ` Paul E. McKenney
  2022-06-15  9:03                                       ` zhangfei.gao
  1 sibling, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-14 14:17 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Neeraj Upadhyay, Shameerali Kolothum Thodi, Paolo Bonzini,
	Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric,
	chenxiang (M)

On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com wrote:
> 
> 
> > On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
> > 
> > > 
> > > 5.18-rc4 based               ~8sec
> > > 
> > > 5.19-rc1                     ~2m43sec
> > > 
> > > 5.19-rc1+fix1                 ~19sec
> > > 
> > > 5.19-rc1-fix2                 ~19sec
> > > 
> > 
> > If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ;
> > does it show any difference in boot time?
> > 
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct
> > *ssp, struct srcu_node *snp
> >   */
> >  static void srcu_gp_end(struct srcu_struct *ssp)
> >  {
> > -       unsigned long cbdelay;
> > +       unsigned long cbdelay = 1;
> >         bool cbs;
> >         bool last_lvl;
> >         int cpu;
> > @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
> >         spin_lock_irq_rcu_node(ssp);
> >         idx = rcu_seq_state(ssp->srcu_gp_seq);
> >         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
> > -       cbdelay = !!srcu_get_delay(ssp);
> > +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > +               cbdelay = 0;
> > +
> >         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());

Thank you both for the testing and the proposed fix!

> Test here:
> qemu: https://github.com/qemu/qemu/tree/stable-6.1
> kernel:
> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
> (in case the test patches are not clear, they are pushed in this git tree)
> 
> Hardware: aarch64
> 
> 1. 5.18-rc6
> real    0m8.402s
> user    0m3.015s
> sys     0m1.102s
> 
> 2. 5.19-rc1
> real    2m41.433s
> user    0m3.097s
> sys     0m1.177s
> 
> 3. 5.19-rc1 + fix1 from Paul
> real    2m43.404s
> user    0m2.880s
> sys     0m1.214s
> 
> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
> real    0m15.262s
> user    0m3.003s
> sys     0m1.033s
> 
> When a kernel build is running at the same time, the load time becomes longer.
> 
> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
> real    0m15.215s
> user    0m2.942s
> sys    0m1.172s
> 
> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
> real    1m23.936s
> user    0m2.969s
> sys    0m1.181s

And thank you for the testing!

Could you please try fix3 + Neeraj's change of srcu_gp_end?

That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
srcu_gp_end.

Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
times start rising?  This is probably best done by starting with
SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
until boot time becomes slow, followed by a binary search between the
last two values.  (The idea is to bias the search so that fast boot
times are the common case.)
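
(For example, if 100000, 10000, and 1000 all boot quickly but 100 is slow, the
next runs would try values between 100 and 1000, say 500, then 250 or 750, to
find roughly where the boot time starts rising.)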

> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5

And thank you for these details.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-14 14:14                                     ` Neeraj Upadhyay
@ 2022-06-14 14:57                                       ` zhangfei.gao
  0 siblings, 0 replies; 37+ messages in thread
From: zhangfei.gao @ 2022-06-14 14:57 UTC (permalink / raw)
  To: Neeraj Upadhyay, Shameerali Kolothum Thodi, paulmck
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)



On 2022/6/14 10:14 PM, Neeraj Upadhyay wrote:
>
>
> On 6/14/2022 7:33 PM, zhangfei.gao@foxmail.com wrote:
>>
>>
>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>
>>>>
>>>> 5.18-rc4 based               ~8sec
>>>>
>>>> 5.19-rc1                     ~2m43sec
>>>>
>>>> 5.19-rc1+fix1                 ~19sec
>>>>
>>>> 5.19-rc1-fix2                 ~19sec
>>>>
>>>
>>> If you try below diff on top of either 5.19-rc1+fix1 or 
>>> 5.19-rc1-fix2 ; does it show any difference in boot time?
>>>
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
>>> srcu_struct *ssp, struct srcu_node *snp
>>>   */
>>>  static void srcu_gp_end(struct srcu_struct *ssp)
>>>  {
>>> -       unsigned long cbdelay;
>>> +       unsigned long cbdelay = 1;
>>>         bool cbs;
>>>         bool last_lvl;
>>>         int cpu;
>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>         spin_lock_irq_rcu_node(ssp);
>>>         idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>> -       cbdelay = !!srcu_get_delay(ssp);
>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), 
>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>> +               cbdelay = 0;
>>> +
>>>         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>>
>> Test here:
>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>> kernel: 
>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>> (in case test patch not clear, push in git tree)
>>
>> Hardware: aarch64
>>
>> 1. 5.18-rc6
>> real    0m8.402s
>> user    0m3.015s
>> sys     0m1.102s
>>
>> 2. 5.19-rc1
>> real    2m41.433s
>> user    0m3.097s
>> sys     0m1.177s
>>
>> 3. 5.19-rc1 + fix1 from Paul
>> real    2m43.404s
>> user    0m2.880s
>> sys     0m1.214s
>>
>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>> real    0m15.262s
>> user    0m3.003s
>> sys     0m1.033s
>>
>> When build kernel in the meantime, load time become longer.
>>
>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>> real    0m15.215s
>> user    0m2.942s
>> sys    0m1.172s
>>
>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>> real    1m23.936s
>> user    0m2.969s
>> sys    0m1.181s
>>
>
7. 5.19-rc1 + fix5: fix4 + Remove "if (!jbase)" block
real    0m11.418s
user    0m3.031s
sys     0m1.067s

8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
real    0m11.154s (~12s)
user    0m2.919s
sys     0m1.064s


Thanks

> Thanks for this data. Can you please share the below test combos also?
>
> 7. 5.19-rc1 + fix5: fix2 + Neeraj's change of srcu_gp_end
>
>
> 8. 5.19-rc1 + fix6: fix3 + Neeraj's change of srcu_gp_end
>
>
> Thanks
> Neeraj
>
>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>
>> Thanks
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13  6:55                       ` zhangfei.gao
  2022-06-13 12:18                         ` Paul E. McKenney
  2022-06-13 15:02                         ` Shameerali Kolothum Thodi
@ 2022-06-15  8:29                         ` Marc Zyngier
  2 siblings, 0 replies; 37+ messages in thread
From: Marc Zyngier @ 2022-06-15  8:29 UTC (permalink / raw)
  To: zhangfei.gao, Paul E. McKenney
  Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	Shameerali Kolothum Thodi, mtosatti, Auger Eric

On Mon, 13 Jun 2022 07:55:47 +0100,
"zhangfei.gao@foxmail.com" <zhangfei.gao@foxmail.com> wrote:
> 
> Hi, Paul
> 
> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
> >>> Hi, Paul
> >>> 
> >>> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> >>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
> >>>>>>> Do these reserved memory regions really need to be allocated separately?
> >>>>>>> (For example, are they really all non-contiguous?  If not, that is, if
> >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
> >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
> >>>>>>> 
> >>>>>>> Are all of these reserved memory regions set up before init is spawned?
> >>>>>>> 
> >>>>>>> Are all of these reserved memory regions set up while there is only a
> >>>>>>> single vCPU up and running?
> >>>>>>> 
> >>>>>>> Is the SRCU grace period really needed in this case?  (I freely confess
> >>>>>>> to not being all that familiar with KVM.)
> >>>>>> Oh, and there was a similar many-requests problem with networking many
> >>>>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
> >>>>>> that permitted many requests to be presented to the kernel with a single
> >>>>>> system call.
> >>>>>> 
> >>>>>> Could a new ioctl() be introduced that requested a large number
> >>>>>> of these memory regions in one go so as to make each call to
> >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> >>>>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
> >>>>> Unfortunately not.  Apart from this specific case, in general the calls to
> >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> >>>>> guest, and those writes then map to a ioctl.  Typically the guest sets up a
> >>>>> device at a time, and each setup step causes a synchronize_srcu()---and
> >>>>> expedited at that.
> >>>> I was afraid of something like that...
> >>>> 
> >>>>> KVM has two SRCUs:
> >>>>> 
> >>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> >>>>> that are very very small, but it needs extremely fast detection of grace
> >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> >>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
> >>>>> not so frequent.
> >>>>> 
> >>>>> 2) kvm->srcu is nastier because there are readers all the time.  The
> >>>>> read-side critical section are still short-ish, but they need the sleepable
> >>>>> part because they access user memory.
> >>>> Which one of these two is in play in this case?
> >>>> 
> >>>>> Writers are not frequent per se; the problem is they come in very large
> >>>>> bursts when a guest boots.  And while the whole boot path overall can be
> >>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> >>>>> impact on runtime than the O(n^2) parts, as demonstrated here.
> >>>>> 
> >>>>> Therefore, we operated on the assumption that the callers of
> >>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> >>>>> and the desire was to get past the booting phase as fast as possible.  If
> >>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> >>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> >>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> >>>>> problematic.
> >>>>> 
> >>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> >>>>> were was introduced (respectively in 2009 and 2014).  But perhaps they do
> >>>>> not hold anymore now that each SRCU is not as independent as it used to be
> >>>>> in those years, and instead they use workqueues instead?
> >>>> The problem was not internal to SRCU, but rather due to the fact
> >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
> >>>> resulting from repeated synchronize_rcu_expedited() invocations.  So I
> >>>> added heuristics to get the occasional sleep in there for KLP's benefit.
> >>>> Perhaps these heuristics need to be less aggressive about adding sleep.
> >>>> 
> >>>> These heuristics have these aspects:
> >>>> 
> >>>> 1.	The longer readers persist in an expedited SRCU grace period,
> >>>> 	the longer the wait between successive checks of the reader
> >>>> 	state.  Roughly speaking, we wait as long as the grace period
> >>>> 	has currently been in effect, capped at ten jiffies.
> >>>> 
> >>>> 2.	SRCU grace periods have several phases.  We reset so that each
> >>>> 	phase starts by not waiting (new phase, new set of readers,
> >>>> 	so don't penalize this set for the sins of the previous set).
> >>>> 	But once we get to the point of adding delay, we add the
> >>>> 	delay based on the beginning of the full grace period.
> >>>> 
> >>>> Right now, the checking for grace-period length does not allow for the
> >>>> possibility that a grace period might start just before the jiffies
> >>>> counter gets incremented (because I didn't realize that anyone cared),
> >>>> so that is one possible thing to change.  I can also allow more no-delay
> >>>> checks per SRCU grace-period phase.
> >>>> 
> >>>> Zhangfei, does something like the patch shown below help?
> >>>> 
> >>>> Additional adjustments are likely needed to avoid re-breaking KLP,
> >>>> but we have to start somewhere...
> >>>> 
> >>>> 							Thanx, Paul
> >>>> 
> >>>> ------------------------------------------------------------------------
> >>>> 
> >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> >>>> index 50ba70f019dea..6a354368ac1d1 100644
> >>>> --- a/kernel/rcu/srcutree.c
> >>>> +++ b/kernel/rcu/srcutree.c
> >>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>    #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> >>>>    #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> >>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> >>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> >>>>    #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> >>>>    /*
> >>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>     */
> >>>>    static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> >>>>    {
> >>>> +	unsigned long gpstart;
> >>>> +	unsigned long j;
> >>>>    	unsigned long jbase = SRCU_INTERVAL;
> >>>>    	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> >>>>    		jbase = 0;
> >>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> >>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> >>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> >>>> +		j = jiffies - 1;
> >>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> >>>> +		if (time_after(j, gpstart))
> >>>> +			jbase += j - gpstart;
> >>>> +	}
> >>>>    	if (!jbase) {
> >>>>    		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> >>>>    		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> >>> Unfortunately, this patch does not helpful.
> >>> 
> >>> Then re-add the debug info.
> >>> 
> >>> During the qemu boot
> >>> [  232.997667]  __synchronize_srcu loop=1000
> >>> 
> >>> [  361.094493]  __synchronize_srcu loop=9000
> >>> [  361.094501] Call trace:
> >>> [  361.094502]  dump_backtrace+0xe4/0xf0
> >>> [  361.094505]  show_stack+0x20/0x70
> >>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> >>> [  361.094509]  dump_stack+0x18/0x34
> >>> [  361.094511]  __synchronize_srcu+0x120/0x128
> >>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> >>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> >>> [  361.094519]  kvm_activate_memslot+0x40/0x68
> >>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> >>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> >>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> >>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> >>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> >>> [  361.094530]  invoke_syscall+0x4c/0x110
> >>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> >>> [  361.094536]  do_el0_svc+0x34/0xc0
> >>> [  361.094538]  el0_svc+0x30/0x98
> >>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> >>> [  361.094544]  el0t_64_sync+0x18c/0x190
> >>> [  363.942817]  kvm_set_memory_region loop=6000
> >> Huh.
> >> 
> >> One possibility is that the "if (!jbase)" block needs to be nested
> >> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
> 
> I tested this diff and it did NOT help.
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> 
>  #define SRCU_INTERVAL          1       // Base delay if no expedited GPs pending.
>  #define SRCU_MAX_INTERVAL      10      // Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum per-GP-phase consecutive no-delay instances.
>  #define SRCU_MAX_NODELAY       100     // Maximum consecutive no-delay instances.
> 
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +       unsigned long gpstart;
> +       unsigned long j;
>         unsigned long jbase = SRCU_INTERVAL;
> 
>         if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>                 jbase = 0;
> -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -       if (!jbase) {
> -               WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> -                       jbase = 1;
> +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +               j = jiffies - 1;
> +               gpstart = READ_ONCE(ssp->srcu_gp_start);
> +               if (time_after(j, gpstart))
> +                       jbase += j - gpstart;
> +
> +               if (!jbase) {
> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> +                               jbase = 1;
> +               }
>         }
> 
> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing.  So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> > 
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running.  But that only gets up to a
> > few seconds.
> > 
> >> One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> >> just now kindly tested, then use whatever debug technique you wish to
> >> work out what fraction of the time during that critical interval that
> >> srcu_get_delay() returns non-zero.
> Sorry, I am confused: there is no patch this time, right?
> Just measure how often srcu_get_delay() returns non-zero?
> 
> 
> By the way, the issue seems to be related only to QEMU ACPI, not to the
> RMR feature.

No, this also occurs if you supply the guest's EFI with an empty set
of persistent variables. EFI goes and zeroes it, which results in a
read-only memslot write access being taken to userspace, the memslot
being unmapped from the guest, QEMU doing a little dance, and
eventually restoring the memslot back to the guest. Rinse, repeat.

Do that one byte at a time over 64MB, and your boot time for EFI only
goes from 39s to 3m50s (that's on a speed-challenged Synquacer box),
which completely kills the "deploy a new VM" use case.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
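
For the question above about how to measure what fraction of the time
srcu_get_delay() returns non-zero during the slow interval, a minimal
sketch of ad-hoc instrumentation is shown below. The counter names and
the ratelimited print are illustrative assumptions, not from any posted
patch; the existing body of srcu_get_delay() is left unchanged.

static atomic_long_t srcu_delay_calls;		/* total srcu_get_delay() calls */
static atomic_long_t srcu_delay_nonzero;	/* calls that returned non-zero */

static unsigned long srcu_get_delay(struct srcu_struct *ssp)
{
	unsigned long jbase = SRCU_INTERVAL;

	/* ... existing computation of jbase, unchanged ... */

	atomic_long_inc(&srcu_delay_calls);
	if (jbase)
		atomic_long_inc(&srcu_delay_nonzero);
	pr_info_ratelimited("srcu_get_delay: %ld/%ld calls returned non-zero\n",
			    atomic_long_read(&srcu_delay_nonzero),
			    atomic_long_read(&srcu_delay_calls));
	return jbase;
}

The ratio of the two counters while boot is stalled gives the fraction
asked about above.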

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-13 15:02                         ` Shameerali Kolothum Thodi
@ 2022-06-15  8:38                           ` Marc Zyngier
  0 siblings, 0 replies; 37+ messages in thread
From: Marc Zyngier @ 2022-06-15  8:38 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: zhangfei.gao, paulmck, Paolo Bonzini, Zhangfei Gao, linux-kernel,
	rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers,
	Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M)

On Mon, 13 Jun 2022 16:02:30 +0100,
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> 
> 
> > -----Original Message-----
> > From: zhangfei.gao@foxmail.com [mailto:zhangfei.gao@foxmail.com]
> > Sent: 13 June 2022 07:56
> > To: paulmck@kernel.org
> > Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao
> > <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org;
> > rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett
> > <josh@joshtriplett.org>; Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>;
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com>
> > Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
> > blocking readers from consuming CPU) cause qemu boot slow
> > 
>  
> > By the way, the issue seems to be related only to QEMU ACPI, not to the
> > RMR feature.
> > Test with: https://github.com/qemu/qemu/tree/stable-6.1
> > 
> > It looks like it is caused by too many kvm_region_add & kvm_region_del
> > calls when acpi=force,
> 
> Based on the setup I have, I think it has nothing to do with the Guest
> kernel booting with ACPI per se (ie, acpi=force in the Qemu kernel cmd
> line).  It is more to do with Qemu having the "-bios QEMU_EFI.fd"
> option, which sets up pflash devices, resulting in a large number of
> pflash read/write calls (before the Guest kernel even boots) which in
> turn seem to be triggering the below kvm_region_add/del calls.

Indeed, this is all about memslots being added/removed at an alarming
rate (well, not more alarming today than yesterday, but QEMU's flash
emulation is... interesting).

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-14 14:17                                     ` Paul E. McKenney
@ 2022-06-15  9:03                                       ` zhangfei.gao
  2022-06-15 10:40                                         ` Neeraj Upadhyay
  0 siblings, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-15  9:03 UTC (permalink / raw)
  To: paulmck, zhangfei
  Cc: Neeraj Upadhyay, Shameerali Kolothum Thodi, Paolo Bonzini,
	linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
	Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric,
	chenxiang (M)



On 2022/6/14 下午10:17, Paul E. McKenney wrote:
> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com wrote:
>>
>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote:
>>>> 5.18-rc4 based               ~8sec
>>>>
>>>> 5.19-rc1                     ~2m43sec
>>>>
>>>> 5.19-rc1+fix1                 ~19sec
>>>>
>>>> 5.19-rc1-fix2                 ~19sec
>>>>
>>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ;
>>> does it show any difference in boot time?
>>>
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct
>>> *ssp, struct srcu_node *snp
>>>    */
>>>   static void srcu_gp_end(struct srcu_struct *ssp)
>>>   {
>>> -       unsigned long cbdelay;
>>> +       unsigned long cbdelay = 1;
>>>          bool cbs;
>>>          bool last_lvl;
>>>          int cpu;
>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>          spin_lock_irq_rcu_node(ssp);
>>>          idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>          WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>> -       cbdelay = !!srcu_get_delay(ssp);
>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>> +               cbdelay = 0;
>>> +
>>>          WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
> Thank you both for the testing and the proposed fix!
>
>> Test here:
>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>> kernel:
>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>> (in case test patch not clear, push in git tree)
>>
>> Hardware: aarch64
>>
>> 1. 5.18-rc6
>> real    0m8.402s
>> user    0m3.015s
>> sys     0m1.102s
>>
>> 2. 5.19-rc1
>> real    2m41.433s
>> user    0m3.097s
>> sys     0m1.177s
>>
>> 3. 5.19-rc1 + fix1 from Paul
>> real    2m43.404s
>> user    0m2.880s
>> sys     0m1.214s
>>
>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>> real    0m15.262s
>> user    0m3.003s
>> sys     0m1.033s
>>
>> When building a kernel at the same time, the load time becomes longer.
>>
>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>> real    0m15.215s
>> user    0m2.942s
>> sys    0m1.172s
>>
>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>> real    1m23.936s
>> user    0m2.969s
>> sys    0m1.181s
> And thank you for the testing!
>
> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>
> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
> srcu_gp_end.
>
> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
> times start rising?  This is probably best done by starting with
> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
> until boot time becomes slow, followed by a binary search between the
> last two values.  (The idea is to bias the search so that fast boot
> times are the common case.)

With SRCU_MAX_NODELAY_PHASE 100, boot becomes slower.


8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000

real 0m11.154s ~12s

user 0m2.919s

sys 0m1.064s



9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000

real 0m11.258s

user 0m3.113s

sys 0m1.073s



10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100

real 0m30.053s ~ 32s

user 0m2.827s

sys 0m1.161s



By the way, if a kernel build is running on the board at the same time
(using memory), the time becomes much longer.

real 1m2.763s



11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000

real 0m11.443s

user 0m3.022s

sys 0m1.052s


Thanks

>
>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
> And thank you for these details.
>
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-15  9:03                                       ` zhangfei.gao
@ 2022-06-15 10:40                                         ` Neeraj Upadhyay
  2022-06-15 10:50                                           ` Paolo Bonzini
  2022-06-18  3:07                                           ` zhangfei.gao
  0 siblings, 2 replies; 37+ messages in thread
From: Neeraj Upadhyay @ 2022-06-15 10:40 UTC (permalink / raw)
  To: zhangfei.gao, paulmck, zhangfei
  Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu,
	Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	mtosatti, Auger Eric, chenxiang (M)

Hi,

On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/6/14 下午10:17, Paul E. McKenney wrote:
>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com wrote:
>>>
>>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote:
>>>>> 5.18-rc4 based               ~8sec
>>>>>
>>>>> 5.19-rc1                     ~2m43sec
>>>>>
>>>>> 5.19-rc1+fix1                 ~19sec
>>>>>
>>>>> 5.19-rc1-fix2                 ~19sec
>>>>>
>>>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ;
>>>> does it show any difference in boot time?
>>>>
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
>>>> srcu_struct
>>>> *ssp, struct srcu_node *snp
>>>>    */
>>>>   static void srcu_gp_end(struct srcu_struct *ssp)
>>>>   {
>>>> -       unsigned long cbdelay;
>>>> +       unsigned long cbdelay = 1;
>>>>          bool cbs;
>>>>          bool last_lvl;
>>>>          int cpu;
>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>          spin_lock_irq_rcu_node(ssp);
>>>>          idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>          WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>> -       cbdelay = !!srcu_get_delay(ssp);
>>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>> +               cbdelay = 0;
>>>> +
>>>>          WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>> Thank you both for the testing and the proposed fix!
>>
>>> Test here:
>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>> kernel:
>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test 
>>>
>>> (in case test patch not clear, push in git tree)
>>>
>>> Hardware: aarch64
>>>
>>> 1. 5.18-rc6
>>> real    0m8.402s
>>> user    0m3.015s
>>> sys     0m1.102s
>>>
>>> 2. 5.19-rc1
>>> real    2m41.433s
>>> user    0m3.097s
>>> sys     0m1.177s
>>>
>>> 3. 5.19-rc1 + fix1 from Paul
>>> real    2m43.404s
>>> user    0m2.880s
>>> sys     0m1.214s
>>>
>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>> real    0m15.262s
>>> user    0m3.003s
>>> sys     0m1.033s
>>>
>>> When build kernel in the meantime, load time become longer.
>>>
>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>> real    0m15.215s
>>> user    0m2.942s
>>> sys    0m1.172s
>>>
>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>>> real    1m23.936s
>>> user    0m2.969s
>>> sys    0m1.181s
>> And thank you for the testing!
>>
>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>
>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>> srcu_gp_end.
>>
>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>> times start rising?  This is probably best done by starting with
>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>> until boot time becomes slow, followed by a binary search between the
>> last two values.  (The idea is to bias the search so that fast boot
>> times are the common case.)
> 
> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
> 
> 
> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
> 
> real 0m11.154s ~12s
> 
> user 0m2.919s
> 
> sys 0m1.064s
> 
> 
> 
> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
> 
> real 0m11.258s
> 
> user 0m3.113s
> 
> sys 0m1.073s
> 
> 
> 
> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
> 
> real 0m30.053s ~ 32s
> 
> user 0m2.827s
> 
> sys 0m1.161s
> 
> 
> 
> By the way, if build kernel on the board in the meantime (using memory), 
> time become much longer.
> 
> real 1m2.763s
> 
> 
> 
> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
> 
> real 0m11.443s
> 
> user 0m3.022s
> 
> sys 0m1.052s
> 
> 

This is useful data, thanks! Did you get a chance to check between 100 and
1000, to narrow down further from which point (it does not need to be an
exact value) between 100 and 1000 you start seeing degradation, for
example 250, 500, ...?

Is it also possible to try experiments 10 and 11 with the diff below?
What I have done in the diff is call srcu_get_delay() only once in
try_check_zero() (and not on every loop iteration), and also retry with a
different delay for the extra iteration that is done when
srcu_get_delay(ssp) returns 0.

Once we have this data, can you also try changing
SRCU_RETRY_CHECK_LONG_DELAY to 100 on top of the diff below.

#define SRCU_RETRY_CHECK_LONG_DELAY  100

-------------------------------------------------------------------------
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 6a354368ac1d..3aff2f3e99ab 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
   * we repeatedly block for 1-millisecond time periods.
   */
  #define SRCU_RETRY_CHECK_DELAY         5
+#define SRCU_RETRY_CHECK_LONG_DELAY            5

  /*
   * Start an SRCU grace period.
@@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
   */
  static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
  {
+       unsigned long curdelay;
+       curdelay = !srcu_get_delay(ssp);
         for (;;) {
                 if (srcu_readers_active_idx_check(ssp, idx))
                         return true;
-               if (--trycount + !srcu_get_delay(ssp) <= 0)
+               if (--trycount + curdelay <= 0)
                         return false;
-               udelay(SRCU_RETRY_CHECK_DELAY);
+               if (trycount)
+                       udelay(SRCU_RETRY_CHECK_DELAY);
+               else
+                       udelay(SRCU_RETRY_CHECK_LONG_DELAY);
         }
  }


Thanks
Neeraj

> Thanks
> 
>>
>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>> And thank you for these details.
>>
>>                             Thanx, Paul
> 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-15 10:40                                         ` Neeraj Upadhyay
@ 2022-06-15 10:50                                           ` Paolo Bonzini
  2022-06-15 11:04                                             ` Neeraj Upadhyay
  2022-06-18  3:07                                           ` zhangfei.gao
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2022-06-15 10:50 UTC (permalink / raw)
  To: Neeraj Upadhyay, zhangfei.gao, paulmck, zhangfei
  Cc: Shameerali Kolothum Thodi, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)

On 6/15/22 12:40, Neeraj Upadhyay wrote:
> 
> This is useful data, thanks! Did you get chance to check between 100 and 
> 1000, to narrow down further, from which point (does need to be exact 
> value) between 100 and 1000,  you start seeing degradation at, for ex. 
> 250, 500 , ...?
> 
> Is it also possible to try experiment 10 and 11 with below diff.
> What I have done in below diff is, call srcu_get_delay() only once
> in try_check_zero() (and not for every loop iteration); also
> retry with a different delay for the extra iteration which is done
> when srcu_get_delay(ssp) returns 0.
> 
> Once we have this data, can you also try by changing 
> SRCU_RETRY_CHECK_LONG_DELAY   to 100, on top of below diff.
> 
> #define SRCU_RETRY_CHECK_LONG_DELAY  100

Is there any data that you would like me to gather from the KVM side,
for example with respect to how long it takes to do synchronize_srcu()
on an unpatched kernel, or the duration of the read-sides?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-15 10:50                                           ` Paolo Bonzini
@ 2022-06-15 11:04                                             ` Neeraj Upadhyay
  0 siblings, 0 replies; 37+ messages in thread
From: Neeraj Upadhyay @ 2022-06-15 11:04 UTC (permalink / raw)
  To: Paolo Bonzini, zhangfei.gao, paulmck, zhangfei
  Cc: Shameerali Kolothum Thodi, linux-kernel, rcu, Lai Jiangshan,
	Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti,
	Auger Eric, chenxiang (M)



On 6/15/2022 4:20 PM, Paolo Bonzini wrote:
> On 6/15/22 12:40, Neeraj Upadhyay wrote:
>>
>> This is useful data, thanks! Did you get chance to check between 100 
>> and 1000, to narrow down further, from which point (does need to be 
>> exact value) between 100 and 1000,  you start seeing degradation at, 
>> for ex. 250, 500 , ...?
>>
>> Is it also possible to try experiment 10 and 11 with below diff.
>> What I have done in below diff is, call srcu_get_delay() only once
>> in try_check_zero() (and not for every loop iteration); also
>> retry with a different delay for the extra iteration which is done
>> when srcu_get_delay(ssp) returns 0.
>>
>> Once we have this data, can you also try by changing 
>> SRCU_RETRY_CHECK_LONG_DELAY   to 100, on top of below diff.
>>
>> #define SRCU_RETRY_CHECK_LONG_DELAY  100
> 
> Is there any data that you would like me to gather from the KVM side, 
> for example with respect to how much it takes to do synchronize_srcu() 
> on an unpatched kernel, or the duration of the read-sides?
> 

Hi Paolo,

Thanks! I think the synchronize_srcu() time and the read-side duration
will help. Given that changing the SRCU_MAX_NODELAY_PHASE value has an
impact on the boot time (and that impact depends on the duration of the
SRCU read-side critical sections), this information will help in getting
a better understanding of the scenario.
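
A minimal sketch of how those two numbers could be collected is shown
below. The helper names, the trace_printk() reporting, and the exact call
sites are assumptions for illustration only: the writer-side timing would
wrap the synchronize_srcu_expedited() call in kvm_swap_active_memslots(),
and the reader-side timing would wrap a kvm->srcu read-side section of
interest.

#include <linux/ktime.h>
#include <linux/kernel.h>

/* Illustrative timing helpers; names and reporting are assumptions. */
static inline ktime_t kvm_lat_start(void)
{
	return ktime_get();
}

static inline void kvm_lat_end(ktime_t t0, const char *what)
{
	/* trace_printk() is cheap and lands in the ftrace buffer. */
	trace_printk("%s: %lld us\n", what,
		     ktime_us_delta(ktime_get(), t0));
}

/*
 * Illustrative usage:
 *
 *	t0 = kvm_lat_start();
 *	synchronize_srcu_expedited(&kvm->srcu);	// in kvm_swap_active_memslots()
 *	kvm_lat_end(t0, "memslot swap SRCU GP");
 *
 *	idx = srcu_read_lock(&kvm->srcu);	// around a read-side section
 *	t0 = kvm_lat_start();
 *	...
 *	kvm_lat_end(t0, "kvm->srcu reader");
 *	srcu_read_unlock(&kvm->srcu, idx);
 */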


Thanks
Neeraj

> Thanks,
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-15 10:40                                         ` Neeraj Upadhyay
  2022-06-15 10:50                                           ` Paolo Bonzini
@ 2022-06-18  3:07                                           ` zhangfei.gao
  2022-06-20  7:50                                             ` Neeraj Upadhyay
  1 sibling, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-18  3:07 UTC (permalink / raw)
  To: Neeraj Upadhyay, paulmck, zhangfei
  Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu,
	Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	mtosatti, Auger Eric, chenxiang (M)



On 2022/6/15 下午6:40, Neeraj Upadhyay wrote:
> Hi,
>
> On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote:
>>
>>
>> On 2022/6/14 下午10:17, Paul E. McKenney wrote:
>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com 
>>> wrote:
>>>>
>>>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote:
>>>>>> 5.18-rc4 based               ~8sec
>>>>>>
>>>>>> 5.19-rc1                     ~2m43sec
>>>>>>
>>>>>> 5.19-rc1+fix1                 ~19sec
>>>>>>
>>>>>> 5.19-rc1-fix2                 ~19sec
>>>>>>
>>>>> If you try below diff on top of either 5.19-rc1+fix1 or 
>>>>> 5.19-rc1-fix2 ;
>>>>> does it show any difference in boot time?
>>>>>
>>>>> --- a/kernel/rcu/srcutree.c
>>>>> +++ b/kernel/rcu/srcutree.c
>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
>>>>> srcu_struct
>>>>> *ssp, struct srcu_node *snp
>>>>>    */
>>>>>   static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>   {
>>>>> -       unsigned long cbdelay;
>>>>> +       unsigned long cbdelay = 1;
>>>>>          bool cbs;
>>>>>          bool last_lvl;
>>>>>          int cpu;
>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>          spin_lock_irq_rcu_node(ssp);
>>>>>          idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>          WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>> -       cbdelay = !!srcu_get_delay(ssp);
>>>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>> +               cbdelay = 0;
>>>>> +
>>>>>          WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>> Thank you both for the testing and the proposed fix!
>>>
>>>> Test here:
>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>> kernel:
>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test 
>>>>
>>>> (in case test patch not clear, push in git tree)
>>>>
>>>> Hardware: aarch64
>>>>
>>>> 1. 5.18-rc6
>>>> real    0m8.402s
>>>> user    0m3.015s
>>>> sys     0m1.102s
>>>>
>>>> 2. 5.19-rc1
>>>> real    2m41.433s
>>>> user    0m3.097s
>>>> sys     0m1.177s
>>>>
>>>> 3. 5.19-rc1 + fix1 from Paul
>>>> real    2m43.404s
>>>> user    0m2.880s
>>>> sys     0m1.214s
>>>>
>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>> real    0m15.262s
>>>> user    0m3.003s
>>>> sys     0m1.033s
>>>>
>>>> When build kernel in the meantime, load time become longer.
>>>>
>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>> real    0m15.215s
>>>> user    0m2.942s
>>>> sys    0m1.172s
>>>>
>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>>>> real    1m23.936s
>>>> user    0m2.969s
>>>> sys    0m1.181s
>>> And thank you for the testing!
>>>
>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>
>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>> srcu_gp_end.
>>>
>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>> times start rising?  This is probably best done by starting with
>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>> until boot time becomes slow, followed by a binary search between the
>>> last two values.  (The idea is to bias the search so that fast boot
>>> times are the common case.)
>>
>> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
>>
>>
>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>
>> real 0m11.154s ~12s
>>
>> user 0m2.919s
>>
>> sys 0m1.064s
>>
>>
>>
>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>
>> real 0m11.258s
>>
>> user 0m3.113s
>>
>> sys 0m1.073s
>>
>>
>>
>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>
>> real 0m30.053s ~ 32s
>>
>> user 0m2.827s
>>
>> sys 0m1.161s
>>
>>
>>
>> By the way, if build kernel on the board in the meantime (using 
>> memory), time become much longer.
>>
>> real 1m2.763s
>>
>>
>>
>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>
>> real 0m11.443s
>>
>> user 0m3.022s
>>
>> sys 0m1.052s
>>
>>
>
> This is useful data, thanks! Did you get chance to check between 100 
> and 1000, to narrow down further, from which point (does need to be 
> exact value) between 100 and 1000,  you start seeing degradation at, 
> for ex. 250, 500 , ...?
>
> Is it also possible to try experiment 10 and 11 with below diff.
> What I have done in below diff is, call srcu_get_delay() only once
> in try_check_zero() (and not for every loop iteration); also
> retry with a different delay for the extra iteration which is done
> when srcu_get_delay(ssp) returns 0.
>
> Once we have this data, can you also try by changing 
> SRCU_RETRY_CHECK_LONG_DELAY   to 100, on top of below diff.
>
> #define SRCU_RETRY_CHECK_LONG_DELAY  100
>
> -------------------------------------------------------------------------
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 6a354368ac1d..3aff2f3e99ab 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   * we repeatedly block for 1-millisecond time periods.
>   */
>  #define SRCU_RETRY_CHECK_DELAY         5
> +#define SRCU_RETRY_CHECK_LONG_DELAY            5
>
>  /*
>   * Start an SRCU grace period.
> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct 
> srcu_struct *ssp, struct srcu_data *sdp,
>   */
>  static bool try_check_zero(struct srcu_struct *ssp, int idx, int 
> trycount)
>  {
> +       unsigned long curdelay;
> +       curdelay = !srcu_get_delay(ssp);
>         for (;;) {
>                 if (srcu_readers_active_idx_check(ssp, idx))
>                         return true;
> -               if (--trycount + !srcu_get_delay(ssp) <= 0)
> +               if (--trycount + curdelay <= 0)
>                         return false;
> -               udelay(SRCU_RETRY_CHECK_DELAY);
> +               if (trycount)
> +                       udelay(SRCU_RETRY_CHECK_DELAY);
> +               else
> +                       udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>         }
>  }
>

11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
     real    0m11.443s
     user    0m3.022s
     sys     0m1.052s

fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500

real    0m11.401s
user    0m2.798s
sys     0m1.328s


fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250

     real    0m15.748s
     user    0m2.781s
     sys     0m1.294s


fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200

     real    0m20.704s  21
     user    0m2.954s
     sys     0m1.226s

fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150

     real    0m25.151s
     user    0m2.980s
     sys     0m1.256s


fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
real    0m30.053s ~ 32s
  user    0m2.827s
  sys    0m1.161s


fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5

     real    0m19.263s
     user    0m3.018s
     sys     0m1.211s



     fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 100

     real    0m9.347s
     user    0m3.132s
     sys     0m1.041s


And Shameer suggests the following method, which decreases the number of
region_add/del calls from 6000+ to 200+; it also works on 5.19-rc1.

Make the EFI flash image file
$ dd if=/dev/zero of=flash0.img bs=1M count=64
$ dd if=./QEMU_EFI-2022.fd of=flash0.img conv=notrunc
$ dd if=/dev/zero of=flash1.img bs=1M count=64
Include the below line instead of "-bios QEMU_EFI.fd" in Qemu cmd line.
-pflash flash0.img -pflash flash1.img \



Thanks

>
> Thanks
> Neeraj
>
>> Thanks
>>
>>>
>>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>> And thank you for these details.
>>>
>>>                             Thanx, Paul
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-18  3:07                                           ` zhangfei.gao
@ 2022-06-20  7:50                                             ` Neeraj Upadhyay
  2022-06-24 15:30                                               ` zhangfei.gao
  0 siblings, 1 reply; 37+ messages in thread
From: Neeraj Upadhyay @ 2022-06-20  7:50 UTC (permalink / raw)
  To: zhangfei.gao, paulmck, zhangfei
  Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu,
	Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	mtosatti, Auger Eric, chenxiang (M)

Hi,


On 6/18/2022 8:37 AM, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/6/15 下午6:40, Neeraj Upadhyay wrote:
>> Hi,
>>
>> On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote:
>>>
>>>
>>> On 2022/6/14 下午10:17, Paul E. McKenney wrote:
>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com 
>>>> wrote:
>>>>>
>>>>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote:
>>>>>>> 5.18-rc4 based               ~8sec
>>>>>>>
>>>>>>> 5.19-rc1                     ~2m43sec
>>>>>>>
>>>>>>> 5.19-rc1+fix1                 ~19sec
>>>>>>>
>>>>>>> 5.19-rc1-fix2                 ~19sec
>>>>>>>
>>>>>> If you try below diff on top of either 5.19-rc1+fix1 or 
>>>>>> 5.19-rc1-fix2 ;
>>>>>> does it show any difference in boot time?
>>>>>>
>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
>>>>>> srcu_struct
>>>>>> *ssp, struct srcu_node *snp
>>>>>>    */
>>>>>>   static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>   {
>>>>>> -       unsigned long cbdelay;
>>>>>> +       unsigned long cbdelay = 1;
>>>>>>          bool cbs;
>>>>>>          bool last_lvl;
>>>>>>          int cpu;
>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>          spin_lock_irq_rcu_node(ssp);
>>>>>>          idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>>          WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>>> -       cbdelay = !!srcu_get_delay(ssp);
>>>>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>> +               cbdelay = 0;
>>>>>> +
>>>>>>          WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>>> Thank you both for the testing and the proposed fix!
>>>>
>>>>> Test here:
>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>>> kernel:
>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test 
>>>>>
>>>>> (in case test patch not clear, push in git tree)
>>>>>
>>>>> Hardware: aarch64
>>>>>
>>>>> 1. 5.18-rc6
>>>>> real    0m8.402s
>>>>> user    0m3.015s
>>>>> sys     0m1.102s
>>>>>
>>>>> 2. 5.19-rc1
>>>>> real    2m41.433s
>>>>> user    0m3.097s
>>>>> sys     0m1.177s
>>>>>
>>>>> 3. 5.19-rc1 + fix1 from Paul
>>>>> real    2m43.404s
>>>>> user    0m2.880s
>>>>> sys     0m1.214s
>>>>>
>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>>> real    0m15.262s
>>>>> user    0m3.003s
>>>>> sys     0m1.033s
>>>>>
>>>>> When build kernel in the meantime, load time become longer.
>>>>>
>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>> real    0m15.215s
>>>>> user    0m2.942s
>>>>> sys    0m1.172s
>>>>>
>>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>>>>> real    1m23.936s
>>>>> user    0m2.969s
>>>>> sys    0m1.181s
>>>> And thank you for the testing!
>>>>
>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>>
>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>>> srcu_gp_end.
>>>>
>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>>> times start rising?  This is probably best done by starting with
>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>>> until boot time becomes slow, followed by a binary search between the
>>>> last two values.  (The idea is to bias the search so that fast boot
>>>> times are the common case.)
>>>
>>> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
>>>
>>>
>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>>
>>> real 0m11.154s ~12s
>>>
>>> user 0m2.919s
>>>
>>> sys 0m1.064s
>>>
>>>
>>>
>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>>
>>> real 0m11.258s
>>>
>>> user 0m3.113s
>>>
>>> sys 0m1.073s
>>>
>>>
>>>
>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>>
>>> real 0m30.053s ~ 32s
>>>
>>> user 0m2.827s
>>>
>>> sys 0m1.161s
>>>
>>>
>>>
>>> By the way, if build kernel on the board in the meantime (using 
>>> memory), time become much longer.
>>>
>>> real 1m2.763s
>>>
>>>
>>>
>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>>
>>> real 0m11.443s
>>>
>>> user 0m3.022s
>>>
>>> sys 0m1.052s
>>>
>>>
>>
>> This is useful data, thanks! Did you get chance to check between 100 
>> and 1000, to narrow down further, from which point (does need to be 
>> exact value) between 100 and 1000,  you start seeing degradation at, 
>> for ex. 250, 500 , ...?
>>
>> Is it also possible to try experiment 10 and 11 with below diff.
>> What I have done in below diff is, call srcu_get_delay() only once
>> in try_check_zero() (and not for every loop iteration); also
>> retry with a different delay for the extra iteration which is done
>> when srcu_get_delay(ssp) returns 0.
>>
>> Once we have this data, can you also try by changing 
>> SRCU_RETRY_CHECK_LONG_DELAY   to 100, on top of below diff.
>>
>> #define SRCU_RETRY_CHECK_LONG_DELAY  100
>>
>> -------------------------------------------------------------------------
>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>> index 6a354368ac1d..3aff2f3e99ab 100644
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>   * we repeatedly block for 1-millisecond time periods.
>>   */
>>  #define SRCU_RETRY_CHECK_DELAY         5
>> +#define SRCU_RETRY_CHECK_LONG_DELAY            5
>>
>>  /*
>>   * Start an SRCU grace period.
>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct 
>> srcu_struct *ssp, struct srcu_data *sdp,
>>   */
>>  static bool try_check_zero(struct srcu_struct *ssp, int idx, int 
>> trycount)
>>  {
>> +       unsigned long curdelay;
>> +       curdelay = !srcu_get_delay(ssp);
>>         for (;;) {
>>                 if (srcu_readers_active_idx_check(ssp, idx))
>>                         return true;
>> -               if (--trycount + !srcu_get_delay(ssp) <= 0)
>> +               if (--trycount + curdelay <= 0)
>>                         return false;
>> -               udelay(SRCU_RETRY_CHECK_DELAY);
>> +               if (trycount)
>> +                       udelay(SRCU_RETRY_CHECK_DELAY);
>> +               else
>> +                       udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>>         }
>>  }
>>
> 
> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>   real    0m11.443
> s user    0m3.022
> s sys    0m1.052s
> 
> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
> 
> real    0m11.401s
> user    0m2.798s
> sys     0m1.328s
> 
> 
> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
> 
>      real    0m15.748s
>      user    0m2.781s
>      sys     0m1.294s
> 
> 
> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
> 
>      real    0m20.704s  21
>      user    0m2.954s
>      sys     0m1.226s
> 
> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
> 
>      real    0m25.151s
>      user    0m2.980s
>      sys     0m1.256s
> 
> 
> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
> real    0m30.053s ~ 32s
>   user    0m2.827s
>   sys    0m1.161s
> 
> 
> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
> 
>      real    0m19.263s
>      user    0m3.018s
>      sys     0m1.211s
> 
> 
> 
>      fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + 
> SRCU_RETRY_CHECK_LONG_DELAY 100
> 
>      real    0m9.347s
>      user    0m3.132s
>      sys     0m1.041s
> 
> 

Thanks. From the data and experiments done, it looks to me that we get
timings comparable to 5.18-rc4 when we retry without sleeping for a
duration close to 4-5 ms, which could be close to one tick at the
configured HZ (as it is 250). Is it possible to try the configurations
below on top of fix15? If possible, can you try with both HZ_1000 and
HZ_250? As multiple fixes are getting combined in the experiments, for
clarity please also share the diff of srcutree.c (on top of the baseline)
for all experiments.

16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + 
(long delay scaled to 1 jiffy)


#define SRCU_MAX_NODELAY_TRY_CHECK_PHASE       10
#define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
#define SRCU_RETRY_CHECK_LONG_DELAY           \
   (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)


17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + 
(long delay scaled to 2 jiffy)

#define SRCU_RETRY_CHECK_LONG_DELAY        \
      (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)

18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + 
(long delay scaled to 1/2 jiffy)

#define SRCU_RETRY_CHECK_LONG_DELAY      \
        (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)
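
As a rough sanity check of that scaling (illustrative arithmetic only,
assuming roughly one long-delay retry per try_check_zero() call as in the
earlier diff):

/*
 * With SRCU_MAX_NODELAY_TRY_CHECK_PHASE = 10:
 *
 *   HZ = 250:  1 jiffy = USEC_PER_SEC / HZ = 4000 us
 *              SRCU_RETRY_CHECK_LONG_DELAY = 4000 / 10 = 400 us
 *              ~10 long-delay retries per phase ~= 4 ms ~= 1 jiffy
 *
 *   HZ = 1000: 1 jiffy = 1000 us
 *              SRCU_RETRY_CHECK_LONG_DELAY = 1000 / 10 = 100 us
 *              ~10 long-delay retries per phase ~= 1 ms ~= 1 jiffy
 *
 * Experiments 17 and 18 scale that per-phase busy-wait budget to roughly
 * 2 jiffies and half a jiffy, respectively.
 */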



Thanks
Neeraj

> And  Shameer suggests this method, to decrease region_add/del time from 
> 6000+ to 200+, also works on 5.19-rc1
> 
> Make the EFI flash image file
> $ dd if=/dev/zero of=flash0.img bs=1M count=64
> $ dd if=./QEMU_EFI-2022.fd of=flash0.img conv=notrunc
> $ dd if=/dev/zero of=flash1.img bs=1M count=64
> Include the below line instead of "-bios QEMU_EFI.fd" in Qemu cmd line.
> -pflash flash0.img -pflash flash1.img \
> 

Thanks for sharing this workaround info!
> 
> 
> Thanks
> 
>>
>> Thanks
>> Neeraj
>>
>>> Thanks
>>>
>>>>
>>>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>>> And thank you for these details.
>>>>
>>>>                             Thanx, Paul
>>>
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
  2022-06-20  7:50                                             ` Neeraj Upadhyay
@ 2022-06-24 15:30                                               ` zhangfei.gao
  0 siblings, 0 replies; 37+ messages in thread
From: zhangfei.gao @ 2022-06-24 15:30 UTC (permalink / raw)
  To: Neeraj Upadhyay, paulmck
  Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu,
	Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox,
	mtosatti, Auger Eric, chenxiang (M)



On 2022/6/20 下午3:50, Neeraj Upadhyay wrote:
> Hi,
>
>
> On 6/18/2022 8:37 AM, zhangfei.gao@foxmail.com wrote:
>>
>>
>> On 2022/6/15 下午6:40, Neeraj Upadhyay wrote:
>>> Hi,
>>>
>>> On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote:
>>>>
>>>>
>>>> On 2022/6/14 下午10:17, Paul E. McKenney wrote:
>>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com 
>>>>> wrote:
>>>>>>
>>>>>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote:
>>>>>>>> 5.18-rc4 based ~8sec
>>>>>>>>
>>>>>>>> 5.19-rc1                     ~2m43sec
>>>>>>>>
>>>>>>>> 5.19-rc1+fix1                 ~19sec
>>>>>>>>
>>>>>>>> 5.19-rc1-fix2                 ~19sec
>>>>>>>>
>>>>>>> If you try below diff on top of either 5.19-rc1+fix1 or 
>>>>>>> 5.19-rc1-fix2 ;
>>>>>>> does it show any difference in boot time?
>>>>>>>
>>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct 
>>>>>>> srcu_struct
>>>>>>> *ssp, struct srcu_node *snp
>>>>>>>    */
>>>>>>>   static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>>   {
>>>>>>> -       unsigned long cbdelay;
>>>>>>> +       unsigned long cbdelay = 1;
>>>>>>>          bool cbs;
>>>>>>>          bool last_lvl;
>>>>>>>          int cpu;
>>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct 
>>>>>>> *ssp)
>>>>>>>          spin_lock_irq_rcu_node(ssp);
>>>>>>>          idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>>>          WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>>>> -       cbdelay = !!srcu_get_delay(ssp);
>>>>>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>>> +               cbdelay = 0;
>>>>>>> +
>>>>>>>          WRITE_ONCE(ssp->srcu_last_gp_end, 
>>>>>>> ktime_get_mono_fast_ns());
>>>>> Thank you both for the testing and the proposed fix!
>>>>>
>>>>>> Test here:
>>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>>>> kernel:
>>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test 
>>>>>>
>>>>>> (in case test patch not clear, push in git tree)
>>>>>>
>>>>>> Hardware: aarch64
>>>>>>
>>>>>> 1. 5.18-rc6
>>>>>> real    0m8.402s
>>>>>> user    0m3.015s
>>>>>> sys     0m1.102s
>>>>>>
>>>>>> 2. 5.19-rc1
>>>>>> real    2m41.433s
>>>>>> user    0m3.097s
>>>>>> sys     0m1.177s
>>>>>>
>>>>>> 3. 5.19-rc1 + fix1 from Paul
>>>>>> real    2m43.404s
>>>>>> user    0m2.880s
>>>>>> sys     0m1.214s
>>>>>>
>>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>>>> real    0m15.262s
>>>>>> user    0m3.003s
>>>>>> sys     0m1.033s
>>>>>>
>>>>>> When build kernel in the meantime, load time become longer.
>>>>>>
>>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>>> real    0m15.215s
>>>>>> user    0m2.942s
>>>>>> sys    0m1.172s
>>>>>>
>>>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>>>>>> real    1m23.936s
>>>>>> user    0m2.969s
>>>>>> sys    0m1.181s
>>>>> And thank you for the testing!
>>>>>
>>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>>>
>>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>>>> srcu_gp_end.
>>>>>
>>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>>>> times start rising?  This is probably best done by starting with
>>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>>>> until boot time becomes slow, followed by a binary search between the
>>>>> last two values.  (The idea is to bias the search so that fast boot
>>>>> times are the common case.)
>>>>
>>>> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
>>>>
>>>>
>>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>
>>>> real 0m11.154s ~12s
>>>>
>>>> user 0m2.919s
>>>>
>>>> sys 0m1.064s
>>>>
>>>>
>>>>
>>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>>>
>>>> real 0m11.258s
>>>>
>>>> user 0m3.113s
>>>>
>>>> sys 0m1.073s
>>>>
>>>>
>>>>
>>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>>>
>>>> real 0m30.053s ~ 32s
>>>>
>>>> user 0m2.827s
>>>>
>>>> sys 0m1.161s
>>>>
>>>>
>>>>
>>>> By the way, if build kernel on the board in the meantime (using 
>>>> memory), time become much longer.
>>>>
>>>> real 1m2.763s
>>>>
>>>>
>>>>
>>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>>>
>>>> real 0m11.443s
>>>>
>>>> user 0m3.022s
>>>>
>>>> sys 0m1.052s
>>>>
>>>>
>>>
>>> This is useful data, thanks! Did you get chance to check between 100 
>>> and 1000, to narrow down further, from which point (does need to be 
>>> exact value) between 100 and 1000,  you start seeing degradation at, 
>>> for ex. 250, 500 , ...?
>>>
>>> Is it also possible to try experiment 10 and 11 with below diff.
>>> What I have done in below diff is, call srcu_get_delay() only once
>>> in try_check_zero() (and not for every loop iteration); also
>>> retry with a different delay for the extra iteration which is done
>>> when srcu_get_delay(ssp) returns 0.
>>>
>>> Once we have this data, can you also try by changing 
>>> SRCU_RETRY_CHECK_LONG_DELAY   to 100, on top of below diff.
>>>
>>> #define SRCU_RETRY_CHECK_LONG_DELAY  100
>>>
>>> ------------------------------------------------------------------------- 
>>>
>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>> index 6a354368ac1d..3aff2f3e99ab 100644
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>>   * we repeatedly block for 1-millisecond time periods.
>>>   */
>>>  #define SRCU_RETRY_CHECK_DELAY         5
>>> +#define SRCU_RETRY_CHECK_LONG_DELAY            5
>>>
>>>  /*
>>>   * Start an SRCU grace period.
>>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct 
>>> srcu_struct *ssp, struct srcu_data *sdp,
>>>   */
>>>  static bool try_check_zero(struct srcu_struct *ssp, int idx, int 
>>> trycount)
>>>  {
>>> +       unsigned long curdelay;
>>> +       curdelay = !srcu_get_delay(ssp);
>>>         for (;;) {
>>>                 if (srcu_readers_active_idx_check(ssp, idx))
>>>                         return true;
>>> -               if (--trycount + !srcu_get_delay(ssp) <= 0)
>>> +               if (--trycount + curdelay <= 0)
>>>                         return false;
>>> -               udelay(SRCU_RETRY_CHECK_DELAY);
>>> +               if (trycount)
>>> +                       udelay(SRCU_RETRY_CHECK_DELAY);
>>> +               else
>>> +                       udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>>>         }
>>>  }
>>>
>>
>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>   real    0m11.443
>> s user    0m3.022
>> s sys    0m1.052s
>>
>> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
>>
>> real    0m11.401s
>> user    0m2.798s
>> sys     0m1.328s
>>
>>
>> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
>>
>>      real    0m15.748s
>>      user    0m2.781s
>>      sys     0m1.294s
>>
>>
>> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
>>
>>      real    0m20.704s  21
>>      user    0m2.954s
>>      sys     0m1.226s
>>
>> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
>>
>>      real    0m25.151s
>>      user    0m2.980s
>>      sys     0m1.256s
>>
>>
>> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>> real    0m30.053s ~ 32s
>>   user    0m2.827s
>>   sys    0m1.161s
>>
>>
>> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
>>
>>      real    0m19.263s
>>      user    0m3.018s
>>      sys     0m1.211s
>>
>>
>>
>>      fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + 
>> SRCU_RETRY_CHECK_LONG_DELAY 100
>>
>>      real    0m9.347s
>>      user    0m3.132s
>>      sys     0m1.041s
>>
>>
>
> Thanks. From the data and experiments done, looks to me that we get
> comparable (to 5.18-rc4 ) timings, when we retry without sleep for 
> time duration close to 4-5 ms, which could be closer to the configured 
> HZ (as it is 250)? Is it possible to try below configuration on top 
> of  fix15?
> If possible can you try with both HZ_1000 and HZ_250?
> As multiple fixes are getting combined in experiments, for clarity, 
> please also share the diff of srcutree.c (on top of baseline) for all 
> experiments.
>
> 16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + 
> (long delay scaled to 1 jiffy)
>
>
> #define SRCU_MAX_NODELAY_TRY_CHECK_PHASE       10
> #define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
> #define SRCU_RETRY_CHECK_LONG_DELAY           \
>   (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
>
> 17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + 
> (long delay scaled to 2 jiffy)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY        \
>      (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
> 18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + 
> (long delay scaled to 1/2 jiffy)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY      \
>        (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)

     fix16: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() 
calls) + (long delay scaled to 1 jiffy)

     real    0m10.120s
     user    0m3.885s
     sys     0m1.040s


     fix17: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() 
calls) + (long delay scaled to 2 jiffy)

     real    0m9.851s
     user    0m3.886s
     sys     0m1.011s


     fix18: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() 
calls) + (long delay scaled to 1/2 jiffy)

     real    0m9.741s
     user    0m3.837s
     sys     0m1.060s

Code pushed to
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-06-24 15:32 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-11 16:32 Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow Zhangfei Gao
2022-06-11 16:59 ` Paul E. McKenney
2022-06-12  7:40   ` zhangfei.gao
2022-06-12 13:36     ` Paul E. McKenney
2022-06-12 14:59       ` zhangfei.gao
2022-06-12 16:20         ` Paul E. McKenney
2022-06-12 16:40           ` Paul E. McKenney
2022-06-12 17:29             ` Paolo Bonzini
2022-06-12 17:47               ` Paolo Bonzini
2022-06-12 18:51                 ` Paul E. McKenney
2022-06-12 18:49               ` Paul E. McKenney
2022-06-12 19:23                 ` Paolo Bonzini
2022-06-12 20:09                   ` Paul E. McKenney
2022-06-13  3:04                 ` zhangfei.gao
2022-06-13  3:57                   ` Paul E. McKenney
2022-06-13  4:16                     ` Paul E. McKenney
2022-06-13  6:55                       ` zhangfei.gao
2022-06-13 12:18                         ` Paul E. McKenney
2022-06-13 13:23                           ` zhangfei.gao
2022-06-13 14:59                             ` Paul E. McKenney
2022-06-13 20:55                               ` Shameerali Kolothum Thodi
2022-06-14 12:19                                 ` Neeraj Upadhyay
2022-06-14 14:03                                   ` zhangfei.gao
2022-06-14 14:14                                     ` Neeraj Upadhyay
2022-06-14 14:57                                       ` zhangfei.gao
2022-06-14 14:17                                     ` Paul E. McKenney
2022-06-15  9:03                                       ` zhangfei.gao
2022-06-15 10:40                                         ` Neeraj Upadhyay
2022-06-15 10:50                                           ` Paolo Bonzini
2022-06-15 11:04                                             ` Neeraj Upadhyay
2022-06-18  3:07                                           ` zhangfei.gao
2022-06-20  7:50                                             ` Neeraj Upadhyay
2022-06-24 15:30                                               ` zhangfei.gao
2022-06-13 15:02                         ` Shameerali Kolothum Thodi
2022-06-15  8:38                           ` Marc Zyngier
2022-06-15  8:29                         ` Marc Zyngier
2022-06-14  1:53 ` chenxiang (M)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).