* Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot
@ 2022-06-11 16:32 Zhangfei Gao
  2022-06-11 16:59 ` Paul E. McKenney
  2022-06-14  1:53 ` chenxiang (M)
  0 siblings, 2 replies; 37+ messages in thread
From: Zhangfei Gao @ 2022-06-11 16:32 UTC (permalink / raw)
To: Paul E. McKenney, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
    Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

Hi, Paul

When verifying qemu with the ACPI RMR feature on v5.19-rc1, the guest
kernel is stuck for several minutes.  On 5.18 there is no such problem,
and after reverting this patch the issue is solved:

Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers
from consuming CPU)

qemu cmd:
build/aarch64-softmmu/qemu-system-aarch64 -machine virt,gic-version=3,iommu=smmuv3 \
-enable-kvm -cpu host -m 1024 \
-kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
"rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
-bios QEMU_EFI.fd

log:
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll 0x75459000
Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
ProtectUefiImageCommon - 0x7AA4D040
  - 0x0000000075458000 - 0x000000000003F000
SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 (0x0000000000004008)
SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 (0x0000000000020008)
SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 (0x0000000000004008)
InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728

I am not sure whether this has already been reported or fixed.

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot
  2022-06-11 16:32 Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot Zhangfei Gao
@ 2022-06-11 16:59 ` Paul E. McKenney
  2022-06-12  7:40   ` zhangfei.gao
  2022-06-14  1:53 ` chenxiang (M)
  1 sibling, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-11 16:59 UTC (permalink / raw)
To: Zhangfei Gao
Cc: linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers,
    Matthew Wilcox, Shameerali Kolothum Thodi

On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> Hi, Paul
>
> When verifying qemu with the ACPI RMR feature on v5.19-rc1, the guest
> kernel is stuck for several minutes.

Stuck for several minutes but then continues normally?  Or stuck for
several minutes before you kill qemu?

And I have to ask...  What happened without the ACPI RMR feature?

> And on 5.18, there is no such problem.
>
> After reverting this patch, the issue is solved:
>
> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers
> from consuming CPU)
>
> qemu cmd:
> [...]
>
> log:
> [...]
>
> I am not sure whether this has already been reported or fixed.

This is the first I have heard of it, so thank you for reporting it.

Do you have a way of collecting sysrq-t output?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot
  2022-06-11 16:59 ` Paul E. McKenney
@ 2022-06-12  7:40   ` zhangfei.gao
  2022-06-12 13:36     ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-12  7:40 UTC (permalink / raw)
To: paulmck, Zhangfei Gao
Cc: linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers,
    Matthew Wilcox, Shameerali Kolothum Thodi

Hi, Paul

On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
>> When verifying qemu with the ACPI RMR feature on v5.19-rc1, the guest
>> kernel is stuck for several minutes.
> Stuck for several minutes but then continues normally?  Or stuck for
> several minutes before you kill qemu?

qemu boot is stuck for several minutes, then the guest boots up
normally, just more slowly.

> And I have to ask...  What happened without the ACPI RMR feature?

Without ACPI, qemu boots quickly with no delay:

build/aarch64-softmmu/qemu-system-aarch64 -machine virt,gic-version=3,iommu=smmuv3 \
-enable-kvm -cpu host -m 1024 \
-kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
"rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"

Adding acpi=force & -bios QEMU_EFI.fd, qemu boot is stuck for several
minutes.

By the way, my hardware platform is aarch64.

Only this change resolves the delay:

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
 	unsigned long jbase = SRCU_INTERVAL;

+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
+			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		return 0;
+	return SRCU_INTERVAL;

> Do you have a way of collecting sysrq-t output?

Do you mean "echo t > /proc/sysrq-trigger"?  There is too much output
and the kernel dump does not stop.

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot
  2022-06-12  7:40 ` zhangfei.gao
@ 2022-06-12 13:36   ` Paul E. McKenney
  2022-06-12 14:59     ` zhangfei.gao
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2022-06-12 13:36 UTC (permalink / raw)
To: zhangfei.gao
Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
    Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote:
> Hi, Paul
>
> qemu boot is stuck for several minutes, then the guest boots up
> normally, just more slowly.  Without ACPI, qemu boots quickly with no
> delay.  Adding acpi=force & -bios QEMU_EFI.fd, qemu boot is stuck for
> several minutes.
>
> By the way, my hardware platform is aarch64.

Thank you for the information!  The problem is excessive delay rather
than a hang, and it is configuration-dependent.  Good to know!

> Only this change resolves the delay:
>
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
>  	unsigned long jbase = SRCU_INTERVAL;
>
> +	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> +			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> +		return 0;
> +	return SRCU_INTERVAL;

I am glad that you have a workaround for this problem, but this change
would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
expedited GPs and blocking readers from consuming CPU") was intended
to fix.  For one example, your change can prevent kernel live patching
from applying a patch.  So something else is needed.

Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the
delay significantly?  (This is not a fix, either, but instead a debug
check.)

Your change always returns zero if another SRCU grace period is needed.
Let's look at the callers of srcu_get_delay():

o	cleanup_srcu_struct() uses it to check whether there is an
	expedited grace period pending, leaking the srcu_struct if so.
	This should not affect boot delay.  (Unless you are invoking
	init_srcu_struct() and cleanup_srcu_struct() really really
	often.)

o	srcu_gp_end() uses it to determine whether or not to allow
	a one-jiffy delay before invoking callbacks at the end of
	a grace period.

o	srcu_funnel_gp_start() uses it to determine whether or not to
	allow a one-jiffy delay before starting the process of checking
	for the end of an SRCU grace period.

o	try_check_zero() uses it to add an additional short delay
	(instead of a long delay) between checks of reader state.

o	process_srcu() uses it to calculate the long delay between
	checks of reader state.

These add one-jiffy delays, except for process_srcu(), which adds a
delay of up to 10 jiffies.  Even given HZ=100 (as opposed to the HZ=1000
that I normally use), this requires thousands of such delays to add up
to the several minutes that you are seeing.  (In theory, the delays
could also be due to SRCU readers, except that in that case adjusting
timeouts in the grace-period processing would not make things go
faster.)

So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods?  If so,
it would be very good to make sure that this code is using SRCU
efficiently.  One way to check would be to put a printk() into
synchronize_srcu(), though maintaining a counter and printing (say)
every 1000th invocation might be easier on the console output.

> > Do you have a way of collecting sysrq-t output?
>
> Do you mean "echo t > /proc/sysrq-trigger"?  There is too much output
> and the kernel dump does not stop.

OK.  What other tools do you have to work out what is happening during
temporary hangs such as this one?

The question to be answered: "Is there usually at least one task waiting
in synchronize_srcu() during these hangs, and if so, which srcu_struct
is passed to those synchronize_srcu() calls?"

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot
  2022-06-12 13:36 ` Paul E. McKenney
@ 2022-06-12 14:59   ` zhangfei.gao
  2022-06-12 16:20     ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: zhangfei.gao @ 2022-06-12 14:59 UTC (permalink / raw)
To: paulmck
Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett,
    Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi

Hi, Paul

On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> I am glad that you have a workaround for this problem, but this change
> would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> expedited GPs and blocking readers from consuming CPU") was intended
> to fix.  For one example, your change can prevent kernel live patching
> from applying a patch.  So something else is needed.

Understood; the change was only to narrow down where the issue is.

> Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the
> delay significantly?  (This is not a fix, either, but instead a debug
> check.)

No, that does not help.

> One way to check would be to put a printk() into synchronize_srcu(),
> though maintaining a counter and printing (say) every 1000th invocation
> might be easier on the console output.

Good idea.

> The question to be answered: "Is there usually at least one task waiting
> in synchronize_srcu() during these hangs, and if so, which srcu_struct
> is passed to those synchronize_srcu() calls?"

As you suggested, I added a print in __synchronize_srcu(), printing once
every 1000 invocations.

With acpi=force & -bios QEMU_EFI.fd, while qemu is stuck in

InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll 0x75459000

the print in __synchronize_srcu() counts from 0 to 9001:

[ 94.271350] gzf __synchronize_srcu loop=1001
....
[ 222.621659] __synchronize_srcu loop=9001
[ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted 5.19.0-rc1-15071-g697f40b5235f-dirty #615
[ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 2280-V2 CS V5.B133.01 03/25/2021
[ 222.621667] Call trace:
[ 222.621668]  dump_backtrace+0xe4/0xf0
[ 222.621670]  show_stack+0x20/0x70
[ 222.621672]  dump_stack_lvl+0x8c/0xb8
[ 222.621674]  dump_stack+0x18/0x34
[ 222.621676]  __synchronize_srcu+0x120/0x128
[ 222.621678]  synchronize_srcu_expedited+0x2c/0x40
[ 222.621680]  kvm_swap_active_memslots+0x130/0x198
[ 222.621683]  kvm_activate_memslot+0x40/0x68
[ 222.621684]  kvm_set_memslot+0x184/0x3b0
[ 222.621686]  __kvm_set_memory_region+0x288/0x438
[ 222.621688]  kvm_set_memory_region+0x3c/0x60
[ 222.621689]  kvm_vm_ioctl+0x5a0/0x13e0
[ 222.621691]  __arm64_sys_ioctl+0xb0/0xf8
[ 222.621693]  invoke_syscall+0x4c/0x110
[ 222.621696]  el0_svc_common.constprop.0+0x68/0x128
[ 222.621698]  do_el0_svc+0x34/0xc0
[ 222.621701]  el0_svc+0x30/0x98
[ 222.621704]  el0t_64_sync_handler+0xb8/0xc0
[ 222.621706]  el0t_64_sync+0x18c/0x190

Without acpi=force, there is no print at all (one print per 1000
invocations).

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 14:59 ` zhangfei.gao @ 2022-06-12 16:20 ` Paul E. McKenney 2022-06-12 16:40 ` Paul E. McKenney 0 siblings, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-12 16:20 UTC (permalink / raw) To: zhangfei.gao Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi On Sun, Jun 12, 2022 at 10:59:30PM +0800, zhangfei.gao@foxmail.com wrote: > Hi, Paul > > On 2022/6/12 下午9:36, Paul E. McKenney wrote: > > On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote: > > > Hi, Paul > > > > > > On 2022/6/12 上午12:59, Paul E. McKenney wrote: > > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote: > > > > > Hi, Paul > > > > > > > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel > > > > > stuck for several minutes. > > > > Stuck for several minutes but then continues normally? Or stuck for > > > > several minutes before you kill qemu? > > > qemu boot stuck for several minutes, then guest can bootup normally, just > > > slower. > > > > And I have to ask... What happened without the ACPI RMR feature? > > > If no ACPI, qemu boot quickly without stuck. > > > build/aarch64-softmmu/qemu-system-aarch64 -machine > > > virt,gic-version=3,iommu=smmuv3 \ > > > -enable-kvm -cpu host -m 1024 \ > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \ > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off" > > > > > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes. > > > > > > By the way, my hardware platform is aarch64. > > Thank you for the information! The problem is excessive delay rather > > than a hang, and it is configuration-dependent. Good to know! > > > > > Only change this can solve the stuck issue. 
> > > > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct > > > *ssp) > > > { > > > unsigned long jbase = SRCU_INTERVAL; > > > > > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > + return 0; > > > + return SRCU_INTERVAL; > > I am glad that you have a workaround for this problem, but this change > > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent > > expedited GPs and blocking readers from consuming CPU") was intended > > to fix. For one example, your change can prevent kernel live patching > > from applying a patch. So something else is needed. > Understand, it is just debug where has issue. > > > > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay > > significantly? (This is not a fix, either, but instead a debug check.) > No use. OK, that indicates that you have a very large number of invocations of synchronize_srcu() or synchronize_srcu_expedited() instead of only a few that take a very long time each. > > Your change always returns zero if another SRCU grace period is needed. > > Let's look at the callers of srcu_get_delay(): > > > > o cleanup_srcu_struct() uses it to check whether there is an > > expedited grace period pending, leaking the srcu_struct if so. > > This should not affect boot delay. (Unless you are invoking > > init_srcu_struct() and cleanup_srcu_struct() really really > > often.) > > > > o srcu_gp_end() uses it to determine whether or not to allow > > a one-jiffy delay before invoking callbacks at the end of > > a grace period. > > > > o srcu_funnel_gp_start() uses it to determine whether or not to > > allow a one-jiffy delay before starting the process of checking > > for the end of an SRCU grace period. > > > > o try_check_zero() uses it to add an additional short delay > > (instead of a long delay) between checks of reader state. 
> > > > o process_srcu() uses it to calculate the long delay between > > checks of reader state. > > > > These add one-jiffy delays, except for process_srcu(), which adds a delay > > of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that > > I normally use), this requires thousands of such delays to add up to the > > several minutes that you are seeing. (In theory, the delays could also > > be due to SRCU readers, except that in that case adjusting timeouts in > > the grace-period processing would not make things go faster.) > > > > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so, > > it would be very good make sure that this code is using SRCU efficiently. > > One way to check would be to put a printk() into synchronize_srcu(), > > though maintaining a counter and printing (say) every 1000th invocation > > might be easier on the console output. > > good idea. Glad you like it. ;-) > > > > > And on 5.18, there is no such problem. > > > > > > > > > > After revert this patch, the issue solved. 
> > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from > > > > > consuming CPU) > > > > > > > > > > > > > > > qemu cmd: > > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine > > > > > virt,gic-version=3,iommu=smmuv3 \ > > > > > -enable-kvm -cpu host -m 1024 \ > > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \ > > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \ > > > > > -bios QEMU_EFI.fd > > > > > > > > > > log: > > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040 > > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll > > > > > 0x75459000 > > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi > > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98 > > > > > ProtectUefiImageCommon - 0x7AA4D040 > > > > > - 0x0000000075458000 - 0x000000000003F000 > > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 > > > > > (0x0000000000004008) > > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 > > > > > (0x0000000000020008) > > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 > > > > > (0x0000000000004008) > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8 > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358 > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370 > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8 > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358 > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370 > > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348 > > > > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 
754953E8 > > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728 > > > > > > > > > > Not sure it is either reported or solved. > > > > This is the first I have heard of it, so thank you for reporting it. > > > > > > > > Do you have a way of collecting some sysrq-t output? > > > Do you mean "echo t > /proc/sysrq-trigger", > > > There is too much output and the kernel dump cannot stop. > > OK. What other tools do you have to work out what is happening during > > temporary hangs such as this one? > > > > The question to be answered: "Is there usually at least one task waiting > > in synchronize_srcu() during these hangs, and if so, which srcu_struct > > is passed to those synchronize_srcu() calls?" > > As you suggested, I added a print in __synchronize_srcu, printing once every 1000 calls. > > With acpi=force & -bios QEMU_EFI.fd > > When qemu is stuck at > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040 > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll > 0x75459000 > > The print in __synchronize_srcu counts from 0 to 9001. Now that is what I call a large number of calls! > [ 94.271350] gzf __synchronize_srcu loop=1001 > ....
> > [ 222.621659] __synchronize_srcu loop=9001 > [ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted > 5.19.0-rc1-15071-g697f40b5235f-dirty #615 > [ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS > 2280-V2 CS V5.B133.01 03/25/2021 > [ 222.621667] Call trace: > [ 222.621668] dump_backtrace+0xe4/0xf0 > [ 222.621670] show_stack+0x20/0x70 > [ 222.621672] dump_stack_lvl+0x8c/0xb8 > [ 222.621674] dump_stack+0x18/0x34 > [ 222.621676] __synchronize_srcu+0x120/0x128 > [ 222.621678] synchronize_srcu_expedited+0x2c/0x40 > [ 222.621680] kvm_swap_active_memslots+0x130/0x198 > [ 222.621683] kvm_activate_memslot+0x40/0x68 > [ 222.621684] kvm_set_memslot+0x184/0x3b0 > [ 222.621686] __kvm_set_memory_region+0x288/0x438 > [ 222.621688] kvm_set_memory_region+0x3c/0x60 This is KVM setting up one mapping in your IORT RMR memory, correct? (As in I/O remapping table reserved memory regions.) > [ 222.621689] kvm_vm_ioctl+0x5a0/0x13e0 And this ioctl() is qemu telling KVM to do so, correct? > [ 222.621691] __arm64_sys_ioctl+0xb0/0xf8 > [ 222.621693] invoke_syscall+0x4c/0x110 > [ 222.621696] el0_svc_common.constprop.0+0x68/0x128 > [ 222.621698] do_el0_svc+0x34/0xc0 > [ 222.621701] el0_svc+0x30/0x98 > [ 222.621704] el0t_64_sync_handler+0xb8/0xc0 > [ 222.621706] el0t_64_sync+0x18c/0x190 > > > If no acpi=force, no print at all, 1000 times one print. OK, this certainly explains the slowdown, both from adding the IORT RMR and from the commit that you bisected to. The answers to a few questions might help illuminate a path towards a fix: Do these reserved memory regions really need to be allocated separately? (For example, are they really all non-contiguous? If not, that is, if there are a lot of contiguous memory regions, could you sort the IORT by address and do one ioctl() for each set of contiguous memory regions?) Are all of these reserved memory regions set up before init is spawned? 
Are all of these reserved memory regions set up while there is only a single vCPU up and running? Is the SRCU grace period really needed in this case? (I freely confess to not being all that familiar with KVM.) Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 16:20 ` Paul E. McKenney @ 2022-06-12 16:40 ` Paul E. McKenney 2022-06-12 17:29 ` Paolo Bonzini 0 siblings, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-12 16:40 UTC (permalink / raw) To: zhangfei.gao Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, pbonzini, sheng.yang On Sun, Jun 12, 2022 at 09:20:29AM -0700, Paul E. McKenney wrote: > On Sun, Jun 12, 2022 at 10:59:30PM +0800, zhangfei.gao@foxmail.com wrote: > > Hi, Paul > > > > On 2022/6/12 下午9:36, Paul E. McKenney wrote: > > > On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@foxmail.com wrote: > > > > Hi, Paul > > > > > > > > On 2022/6/12 上午12:59, Paul E. McKenney wrote: > > > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote: > > > > > > Hi, Paul > > > > > > > > > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel > > > > > > stuck for several minutes. > > > > > Stuck for several minutes but then continues normally? Or stuck for > > > > > several minutes before you kill qemu? > > > > qemu boot stuck for several minutes, then guest can bootup normally, just > > > > slower. > > > > > And I have to ask... What happened without the ACPI RMR feature? > > > > If no ACPI, qemu boot quickly without stuck. > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine > > > > virt,gic-version=3,iommu=smmuv3 \ > > > > -enable-kvm -cpu host -m 1024 \ > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \ > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off" > > > > > > > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes. > > > > > > > > By the way, my hardware platform is aarch64. > > > Thank you for the information! 
The problem is excessive delay rather > > > than a hang, and it is configuration-dependent. Good to know! > > > > > > > Only change this can solve the stuck issue. > > > > > > > > --- a/kernel/rcu/srcutree.c > > > > +++ b/kernel/rcu/srcutree.c > > > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct > > > > *ssp) > > > > { > > > > unsigned long jbase = SRCU_INTERVAL; > > > > > > > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > > + return 0; > > > > + return SRCU_INTERVAL; > > > I am glad that you have a workaround for this problem, but this change > > > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent > > > expedited GPs and blocking readers from consuming CPU") was intended > > > to fix. For one example, your change can prevent kernel live patching > > > from applying a patch. So something else is needed. > > Understand, it is just debug where has issue. > > > > > > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay > > > significantly? (This is not a fix, either, but instead a debug check.) > > No use. > > OK, that indicates that you have a very large number of invocations > of synchronize_srcu() or synchronize_srcu_expedited() instead of only > a few that take a very long time each. > > > > Your change always returns zero if another SRCU grace period is needed. > > > Let's look at the callers of srcu_get_delay(): > > > > > > o cleanup_srcu_struct() uses it to check whether there is an > > > expedited grace period pending, leaking the srcu_struct if so. > > > This should not affect boot delay. (Unless you are invoking > > > init_srcu_struct() and cleanup_srcu_struct() really really > > > often.) > > > > > > o srcu_gp_end() uses it to determine whether or not to allow > > > a one-jiffy delay before invoking callbacks at the end of > > > a grace period. 
> > > > > > o srcu_funnel_gp_start() uses it to determine whether or not to > > > allow a one-jiffy delay before starting the process of checking > > > for the end of an SRCU grace period. > > > > > > o try_check_zero() uses it to add an additional short delay > > > (instead of a long delay) between checks of reader state. > > > > > > o process_srcu() uses it to calculate the long delay between > > > checks of reader state. > > > > > > These add one-jiffy delays, except for process_srcu(), which adds a delay > > > of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that > > > I normally use), this requires thousands of such delays to add up to the > > > several minutes that you are seeing. (In theory, the delays could also > > > be due to SRCU readers, except that in that case adjusting timeouts in > > > the grace-period processing would not make things go faster.) > > > > > > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so, > > > it would be very good make sure that this code is using SRCU efficiently. > > > One way to check would be to put a printk() into synchronize_srcu(), > > > though maintaining a counter and printing (say) every 1000th invocation > > > might be easier on the console output. > > > > good idea. > > Glad you like it. ;-) > > > > > > > And on 5.18, there is no such problem. > > > > > > > > > > > > After revert this patch, the issue solved. 
> > > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from > > > > > > consuming CPU) > > > > > > > > > > > > > > > > > > qemu cmd: > > > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine > > > > > > virt,gic-version=3,iommu=smmuv3 \ > > > > > > -enable-kvm -cpu host -m 1024 \ > > > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \ > > > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \ > > > > > > -bios QEMU_EFI.fd > > > > > > > > > > > > log: > > > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040 > > > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll > > > > > > 0x75459000 > > > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi > > > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98 > > > > > > ProtectUefiImageCommon - 0x7AA4D040 > > > > > > - 0x0000000075458000 - 0x000000000003F000 > > > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 > > > > > > (0x0000000000004008) > > > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 > > > > > > (0x0000000000020008) > > > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 > > > > > > (0x0000000000004008) > > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8 > > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358 > > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370 > > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8 > > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358 > > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370 > > > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348 > > > > 
> > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8 > > > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728 > > > > > > > > > > > > > > > > > > Not sure it is either reported or solved. > > > > > This is the first I have heard of it, so thank you for reporting it. > > > > > > > > > > Do you have a way of collecting something sysrq-t output? > > > > Do you mean "echo t > /proc/sysrq-trigger", > > > > There are too much output and kernel dump can not stop. > > > OK. What other tools do you have to work out what is happening during > > > temporary hangs such as this one? > > > > > > The question to be answered: "Is there usually at least one task waiting > > > in synchronize_srcu() during these hangs, and if so, which srcu_struct > > > is passed to those synchronize_srcu() calls?" > > > > As you suggested, add print in __synchronize_srcu, 1000 times print once. > > > > With acpi=force & -bios QEMU_EFI.fd > > > > When qemu stuck in > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040 > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll > > 0x75459000 > > > > The print in __synchronize_srcu is print from 0 t0 9001. > > Now that is what I call a large number of calls! > > > [ 94.271350] gzf __synchronize_srcu loop=1001 > > .... 
> > > > [ 222.621659] __synchronize_srcu loop=9001 > > [ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted > > 5.19.0-rc1-15071-g697f40b5235f-dirty #615 > > [ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS > > 2280-V2 CS V5.B133.01 03/25/2021 > > [ 222.621667] Call trace: > > [ 222.621668] dump_backtrace+0xe4/0xf0 > > [ 222.621670] show_stack+0x20/0x70 > > [ 222.621672] dump_stack_lvl+0x8c/0xb8 > > [ 222.621674] dump_stack+0x18/0x34 > > [ 222.621676] __synchronize_srcu+0x120/0x128 > > [ 222.621678] synchronize_srcu_expedited+0x2c/0x40 > > [ 222.621680] kvm_swap_active_memslots+0x130/0x198 > > [ 222.621683] kvm_activate_memslot+0x40/0x68 > > [ 222.621684] kvm_set_memslot+0x184/0x3b0 > > [ 222.621686] __kvm_set_memory_region+0x288/0x438 > > [ 222.621688] kvm_set_memory_region+0x3c/0x60 > > This is KVM setting up one mapping in your IORT RMR memory, correct? > (As in I/O remapping table reserved memory regions.) > > > [ 222.621689] kvm_vm_ioctl+0x5a0/0x13e0 > > And this ioctl() is qemu telling KVM to do so, correct? > > > [ 222.621691] __arm64_sys_ioctl+0xb0/0xf8 > > [ 222.621693] invoke_syscall+0x4c/0x110 > > [ 222.621696] el0_svc_common.constprop.0+0x68/0x128 > > [ 222.621698] do_el0_svc+0x34/0xc0 > > [ 222.621701] el0_svc+0x30/0x98 > > [ 222.621704] el0t_64_sync_handler+0xb8/0xc0 > > [ 222.621706] el0t_64_sync+0x18c/0x190 > > > > > > If no acpi=force, no print at all, 1000 times one print. > > OK, this certainly explains the slowdown, both from adding the IORT RMR > and from the commit that you bisected to. The answers to a few questions > might help illuminate a path towards a fix: > > Do these reserved memory regions really need to be allocated separately? > (For example, are they really all non-contiguous? If not, that is, if > there are a lot of contiguous memory regions, could you sort the IORT > by address and do one ioctl() for each set of contiguous memory regions?) 
> > Are all of these reserved memory regions set up before init is spawned? > > Are all of these reserved memory regions set up while there is only a > single vCPU up and running? > > Is the SRCU grace period really needed in this case? (I freely confess > to not being all that familiar with KVM.) Oh, and there was a similar many-requests problem with networking many years ago. This was solved by adding a new syscall/ioctl()/whatever that permitted many requests to be presented to the kernel with a single system call. Could a new ioctl() be introduced that requested a large number of these memory regions in one go so as to make each call to synchronize_rcu_expedited() cover a useful fraction of your 9000+ requests? Adding a few of the KVM guys on CC for their thoughts. Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 16:40 ` Paul E. McKenney @ 2022-06-12 17:29 ` Paolo Bonzini 2022-06-12 17:47 ` Paolo Bonzini 2022-06-12 18:49 ` Paul E. McKenney 0 siblings, 2 replies; 37+ messages in thread From: Paolo Bonzini @ 2022-06-12 17:29 UTC (permalink / raw) To: paulmck, zhangfei.gao Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On 6/12/22 18:40, Paul E. McKenney wrote: >> Do these reserved memory regions really need to be allocated separately? >> (For example, are they really all non-contiguous? If not, that is, if >> there are a lot of contiguous memory regions, could you sort the IORT >> by address and do one ioctl() for each set of contiguous memory regions?) >> >> Are all of these reserved memory regions set up before init is spawned? >> >> Are all of these reserved memory regions set up while there is only a >> single vCPU up and running? >> >> Is the SRCU grace period really needed in this case? (I freely confess >> to not being all that familiar with KVM.) > > Oh, and there was a similar many-requests problem with networking many > years ago. This was solved by adding a new syscall/ioctl()/whatever > that permitted many requests to be presented to the kernel with a single > system call. > > Could a new ioctl() be introduced that requested a large number > of these memory regions in one go so as to make each call to > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > requests? Adding a few of the KVM guys on CC for their thoughts. Unfortunately not. Apart from this specific case, in general the calls to KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the guest, and those writes then map to a ioctl. 
Typically the guest sets up a device at a time, and each setup step causes a synchronize_srcu()---and expedited at that. KVM has two SRCUs: 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers that are very very small, but it needs extremely fast detection of grace periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are not so frequent. 2) kvm->srcu is nastier because there are readers all the time. The read-side critical sections are still short-ish, but they need the sleepable part because they access user memory. Writers are not frequent per se; the problem is they come in very large bursts when a guest boots. And while the whole boot path overall can be quadratic, O(n) expensive calls to synchronize_srcu() can have a larger impact on runtime than the O(n^2) parts, as demonstrated here. Therefore, we operated on the assumption that the callers of synchronize_srcu_expedited were _anyway_ busy running CPU-bound guest code and the desire was to get past the booting phase as fast as possible. If the guest wants to eat host CPU it can "for(;;)" as much as it wants; therefore, as long as expedited GPs didn't eat CPU *throughout the whole system*, a preemptable busy wait in synchronize_srcu_expedited() was not problematic. These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu were introduced (respectively in 2009 and 2014). But perhaps they do not hold anymore now that each SRCU is not as independent as it used to be in those years, and they use workqueues instead? Thanks, Paolo ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 17:29 ` Paolo Bonzini @ 2022-06-12 17:47 ` Paolo Bonzini 2022-06-12 18:51 ` Paul E. McKenney 2022-06-12 18:49 ` Paul E. McKenney 1 sibling, 1 reply; 37+ messages in thread From: Paolo Bonzini @ 2022-06-12 17:47 UTC (permalink / raw) To: paulmck, zhangfei.gao Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On 6/12/22 19:29, Paolo Bonzini wrote: > On 6/12/22 18:40, Paul E. McKenney wrote: >>> Do these reserved memory regions really need to be allocated separately? >>> (For example, are they really all non-contiguous? If not, that is, if >>> there are a lot of contiguous memory regions, could you sort the IORT >>> by address and do one ioctl() for each set of contiguous memory >>> regions?) >>> >>> Are all of these reserved memory regions set up before init is spawned? >>> >>> Are all of these reserved memory regions set up while there is only a >>> single vCPU up and running? >>> >>> Is the SRCU grace period really needed in this case? (I freely confess >>> to not being all that familiar with KVM.) >> >> Oh, and there was a similar many-requests problem with networking many >> years ago. This was solved by adding a new syscall/ioctl()/whatever >> that permitted many requests to be presented to the kernel with a single >> system call. >> >> Could a new ioctl() be introduced that requested a large number >> of these memory regions in one go so as to make each call to >> synchronize_rcu_expedited() cover a useful fraction of your 9000+ >> requests? Adding a few of the KVM guys on CC for their thoughts. Another question: how much can call_srcu() callbacks pile up these days? 
I've always been a bit wary of letting userspace do an arbitrary number of allocations that can only be freed after a grace period, but perhaps there's a way to query SRCU and apply some backpressure? Paolo ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 17:47 ` Paolo Bonzini @ 2022-06-12 18:51 ` Paul E. McKenney 0 siblings, 0 replies; 37+ messages in thread From: Paul E. McKenney @ 2022-06-12 18:51 UTC (permalink / raw) To: Paolo Bonzini Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On Sun, Jun 12, 2022 at 07:47:10PM +0200, Paolo Bonzini wrote: > On 6/12/22 19:29, Paolo Bonzini wrote: > > On 6/12/22 18:40, Paul E. McKenney wrote: > > > > Do these reserved memory regions really need to be allocated separately? > > > > (For example, are they really all non-contiguous? If not, that is, if > > > > there are a lot of contiguous memory regions, could you sort the IORT > > > > by address and do one ioctl() for each set of contiguous memory > > > > regions?) > > > > > > > > Are all of these reserved memory regions set up before init is spawned? > > > > > > > > Are all of these reserved memory regions set up while there is only a > > > > single vCPU up and running? > > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess > > > > to not being all that familiar with KVM.) > > > > > > Oh, and there was a similar many-requests problem with networking many > > > years ago. This was solved by adding a new syscall/ioctl()/whatever > > > that permitted many requests to be presented to the kernel with a single > > > system call. > > > > > > Could a new ioctl() be introduced that requested a large number > > > of these memory regions in one go so as to make each call to > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > > > requests? Adding a few of the KVM guys on CC for their thoughts. > > Another question: how much can call_srcu() callbacks pile up these days? 
> I've always been a bit wary of letting userspace do an arbitrary number of > allocations that can only be freed after a grace period, but perhaps there's > a way to query SRCU and apply some backpressure? They can pile up as much as ever, especially if you have long-duration sleeping readers. But you could do the occasional srcu_barrier() to wait for all the preceding ones to get done. Maybe every 1000th call_srcu() or similar? Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 17:29 ` Paolo Bonzini 2022-06-12 17:47 ` Paolo Bonzini @ 2022-06-12 18:49 ` Paul E. McKenney 2022-06-12 19:23 ` Paolo Bonzini 2022-06-13 3:04 ` zhangfei.gao 1 sibling, 2 replies; 37+ messages in thread From: Paul E. McKenney @ 2022-06-12 18:49 UTC (permalink / raw) To: Paolo Bonzini Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: > On 6/12/22 18:40, Paul E. McKenney wrote: > > > Do these reserved memory regions really need to be allocated separately? > > > (For example, are they really all non-contiguous? If not, that is, if > > > there are a lot of contiguous memory regions, could you sort the IORT > > > by address and do one ioctl() for each set of contiguous memory regions?) > > > > > > Are all of these reserved memory regions set up before init is spawned? > > > > > > Are all of these reserved memory regions set up while there is only a > > > single vCPU up and running? > > > > > > Is the SRCU grace period really needed in this case? (I freely confess > > > to not being all that familiar with KVM.) > > > > Oh, and there was a similar many-requests problem with networking many > > years ago. This was solved by adding a new syscall/ioctl()/whatever > > that permitted many requests to be presented to the kernel with a single > > system call. > > > > Could a new ioctl() be introduced that requested a large number > > of these memory regions in one go so as to make each call to > > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > > requests? Adding a few of the KVM guys on CC for their thoughts. > > Unfortunately not. 
Apart from this specific case, in general the calls to > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the > guest, and those writes then map to a ioctl. Typically the guest sets up a > device at a time, and each setup step causes a synchronize_srcu()---and > expedited at that. I was afraid of something like that... > KVM has two SRCUs: > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > that are very very small, but it needs extremely fast detection of grace > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > not so frequent. > > 2) kvm->srcu is nastier because there are readers all the time. The > read-side critical section are still short-ish, but they need the sleepable > part because they access user memory. Which one of these two is in play in this case? > Writers are not frequent per se; the problem is they come in very large > bursts when a guest boots. And while the whole boot path overall can be > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger > impact on runtime than the O(n^2) parts, as demonstrated here. > > Therefore, we operated on the assumption that the callers of > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code > and the desire was to get past the booting phase as fast as possible. If > the guest wants to eat host CPU it can "for(;;)" as much as it wants; > therefore, as long as expedited GPs didn't eat CPU *throughout the whole > system*, a preemptable busy wait in synchronize_srcu_expedited() were not > problematic. > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu > were was introduced (respectively in 2009 and 2014). But perhaps they do > not hold anymore now that each SRCU is not as independent as it used to be > in those years, and instead they use workqueues instead? 
The problem was not internal to SRCU, but rather due to the fact that kernel live patching (KLP) had problems with the CPU-bound tasks resulting from repeated synchronize_rcu_expedited() invocations. So I added heuristics to get the occasional sleep in there for KLP's benefit. Perhaps these heuristics need to be less aggressive about adding sleep. These heuristics have these aspects: 1. The longer readers persist in an expedited SRCU grace period, the longer the wait between successive checks of the reader state. Roughly speaking, we wait as long as the grace period has currently been in effect, capped at ten jiffies. 2. SRCU grace periods have several phases. We reset so that each phase starts by not waiting (new phase, new set of readers, so don't penalize this set for the sins of the previous set). But once we get to the point of adding delay, we add the delay based on the beginning of the full grace period. Right now, the checking for grace-period length does not allow for the possibility that a grace period might start just before the jiffies counter gets incremented (because I didn't realize that anyone cared), so that is one possible thing to change. I can also allow more no-delay checks per SRCU grace-period phase. Zhangfei, does something like the patch shown below help? Additional adjustments are likely needed to avoid re-breaking KLP, but we have to start somewhere... Thanx, Paul ------------------------------------------------------------------------ diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 50ba70f019dea..6a354368ac1d1 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. 
+#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. /* @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) */ static unsigned long srcu_get_delay(struct srcu_struct *ssp) { + unsigned long gpstart; + unsigned long j; unsigned long jbase = SRCU_INTERVAL; if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) jbase = 0; - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { + j = jiffies - 1; + gpstart = READ_ONCE(ssp->srcu_gp_start); + if (time_after(j, gpstart)) + jbase += j - gpstart; + } if (!jbase) { WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 18:49 ` Paul E. McKenney @ 2022-06-12 19:23 ` Paolo Bonzini 2022-06-12 20:09 ` Paul E. McKenney 0 siblings, 1 reply; 37+ messages in thread From: Paolo Bonzini @ 2022-06-12 19:23 UTC (permalink / raw) To: paulmck Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On 6/12/22 20:49, Paul E. McKenney wrote: >> >> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers >> that are very very small, but it needs extremely fast detection of grace >> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up >> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are >> not so frequent. >> >> 2) kvm->srcu is nastier because there are readers all the time. The >> read-side critical section are still short-ish, but they need the sleepable >> part because they access user memory. > > Which one of these two is in play in this case? The latter, kvm->srcu; though at boot time both are hammered on quite a bit (and then essentially not at all). For the one involved it's still pretty rare for readers to sleep, but it cannot be excluded. Most critical sections are short, I'd guess in the thousands of clock cycles but I can add some instrumentation tomorrow (or anyway before Tuesday). > The problem was not internal to SRCU, but rather due to the fact > that kernel live patching (KLP) had problems with the CPU-bound tasks > resulting from repeated synchronize_rcu_expedited() invocations. I see. Perhaps only add to the back-to-back counter if the synchronize_srcu_expedited() takes longer than a jiffy? This would indirectly check if synchronize_srcu_expedited() readers are actually blocking.
KVM uses synchronize_srcu_expedited() because it expects it to take very little time (again I'll get hard numbers asap). Paolo ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 19:23 ` Paolo Bonzini @ 2022-06-12 20:09 ` Paul E. McKenney 0 siblings, 0 replies; 37+ messages in thread From: Paul E. McKenney @ 2022-06-12 20:09 UTC (permalink / raw) To: Paolo Bonzini Cc: zhangfei.gao, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On Sun, Jun 12, 2022 at 09:23:14PM +0200, Paolo Bonzini wrote: > On 6/12/22 20:49, Paul E. McKenney wrote: > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > > > that are very very small, but it needs extremely fast detection of grace > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > > > not so frequent. > > > > > > 2) kvm->srcu is nastier because there are readers all the time. The > > > read-side critical section are still short-ish, but they need the sleepable > > > part because they access user memory. > > > > Which one of these two is in play in this case? > > The latter, kvm->srcu; though at boot time both are hammered on quite a bit > (and then essentially not at all). > > For the one involved it's still pretty rare for readers to sleep, but it > cannot be excluded. Most critical sections are short, I'd guess in the > thousands of clock cycles but I can add some instrumentation tomorrow (or > anyway before Tuesday). And in any case, readers can be preempted. > > The problem was not internal to SRCU, but rather due to the fact > > that kernel live patching (KLP) had problems with the CPU-bound tasks > > resulting from repeated synchronize_rcu_expedited() invocations. > > I see. Perhaps only add to the back-to-back counter if the > synchronize_srcu_expedited() takes longer than a jiffy? 
This would > indirectly check if synchronize_srcu_expedited() readers are actually > blocking. KVM uses synchronize_srcu_expedited() because it expects it to > take very little (again I'll get hard numbers asap). This is in effect what the patch in my previous email does. In current mainline, it waits for up to a jiffy before switching to sleep mode, but with the patch it waits for between one and two jiffies before making that switch. Using call_srcu() with the occasional srcu_barrier() would of course be faster still, but perhaps more complex. Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-12 18:49 ` Paul E. McKenney 2022-06-12 19:23 ` Paolo Bonzini @ 2022-06-13 3:04 ` zhangfei.gao 2022-06-13 3:57 ` Paul E. McKenney 1 sibling, 1 reply; 37+ messages in thread From: zhangfei.gao @ 2022-06-13 3:04 UTC (permalink / raw) To: paulmck, Paolo Bonzini Cc: Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang Hi, Paul On 2022/6/13 上午2:49, Paul E. McKenney wrote: > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: >> On 6/12/22 18:40, Paul E. McKenney wrote: >>>> Do these reserved memory regions really need to be allocated separately? >>>> (For example, are they really all non-contiguous? If not, that is, if >>>> there are a lot of contiguous memory regions, could you sort the IORT >>>> by address and do one ioctl() for each set of contiguous memory regions?) >>>> >>>> Are all of these reserved memory regions set up before init is spawned? >>>> >>>> Are all of these reserved memory regions set up while there is only a >>>> single vCPU up and running? >>>> >>>> Is the SRCU grace period really needed in this case? (I freely confess >>>> to not being all that familiar with KVM.) >>> Oh, and there was a similar many-requests problem with networking many >>> years ago. This was solved by adding a new syscall/ioctl()/whatever >>> that permitted many requests to be presented to the kernel with a single >>> system call. >>> >>> Could a new ioctl() be introduced that requested a large number >>> of these memory regions in one go so as to make each call to >>> synchronize_rcu_expedited() cover a useful fraction of your 9000+ >>> requests? Adding a few of the KVM guys on CC for their thoughts. >> Unfortunately not. 
Apart from this specific case, in general the calls to >> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the >> guest, and those writes then map to an ioctl. Typically the guest sets up a >> device at a time, and each setup step causes a synchronize_srcu()---and >> expedited at that. > I was afraid of something like that... > >> KVM has two SRCUs: >> >> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers >> that are very very small, but it needs extremely fast detection of grace >> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up >> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are >> not so frequent. >> >> 2) kvm->srcu is nastier because there are readers all the time. The >> read-side critical sections are still short-ish, but they need the sleepable >> part because they access user memory. > Which one of these two is in play in this case? > >> Writers are not frequent per se; the problem is they come in very large >> bursts when a guest boots. And while the whole boot path overall can be >> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger >> impact on runtime than the O(n^2) parts, as demonstrated here. >> >> Therefore, we operated on the assumption that the callers of >> synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code >> and the desire was to get past the booting phase as fast as possible. If >> the guest wants to eat host CPU it can "for(;;)" as much as it wants; >> therefore, as long as expedited GPs didn't eat CPU *throughout the whole >> system*, a preemptable busy wait in synchronize_srcu_expedited() was not >> problematic. >> >> These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu >> were introduced (respectively in 2009 and 2014). But perhaps they do >> not hold anymore now that each SRCU is not as independent as it used to be >> in those years, and they use workqueues instead? 
> The problem was not internal to SRCU, but rather due to the fact > that kernel live patching (KLP) had problems with the CPU-bound tasks > resulting from repeated synchronize_rcu_expedited() invocations. So I > added heuristics to get the occasional sleep in there for KLP's benefit. > Perhaps these heuristics need to be less aggressive about adding sleep. > > These heuristics have these aspects: > > 1. The longer readers persist in an expedited SRCU grace period, > the longer the wait between successive checks of the reader > state. Roughly speaking, we wait as long as the grace period > has currently been in effect, capped at ten jiffies. > > 2. SRCU grace periods have several phases. We reset so that each > phase starts by not waiting (new phase, new set of readers, > so don't penalize this set for the sins of the previous set). > But once we get to the point of adding delay, we add the > delay based on the beginning of the full grace period. > > Right now, the checking for grace-period length does not allow for the > possibility that a grace period might start just before the jiffies > counter gets incremented (because I didn't realize that anyone cared), > so that is one possible thing to change. I can also allow more no-delay > checks per SRCU grace-period phase. > > Zhangfei, does something like the patch shown below help? > > Additional adjustments are likely needed to avoid re-breaking KLP, > but we have to start somewhere... > > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 50ba70f019dea..6a354368ac1d1 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. 
> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > > /* > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > */ > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > { > + unsigned long gpstart; > + unsigned long j; > unsigned long jbase = SRCU_INTERVAL; > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > jbase = 0; > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > + j = jiffies - 1; > + gpstart = READ_ONCE(ssp->srcu_gp_start); > + if (time_after(j, gpstart)) > + jbase += j - gpstart; > + } > if (!jbase) { > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) Unfortunately, this patch does not help. I then re-added the debug info. 
During the qemu boot [ 232.997667] __synchronize_srcu loop=1000 [ 361.094493] __synchronize_srcu loop=9000 [ 361.094501] Call trace: [ 361.094502] dump_backtrace+0xe4/0xf0 [ 361.094505] show_stack+0x20/0x70 [ 361.094507] dump_stack_lvl+0x8c/0xb8 [ 361.094509] dump_stack+0x18/0x34 [ 361.094511] __synchronize_srcu+0x120/0x128 [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 [ 361.094515] kvm_swap_active_memslots+0x130/0x198 [ 361.094519] kvm_activate_memslot+0x40/0x68 [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 [ 361.094524] kvm_set_memory_region+0x78/0xb8 [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 [ 361.094530] invoke_syscall+0x4c/0x110 [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 [ 361.094536] do_el0_svc+0x34/0xc0 [ 361.094538] el0_svc+0x30/0x98 [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 [ 361.094544] el0t_64_sync+0x18c/0x190 [ 363.942817] kvm_set_memory_region loop=6000 ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 3:04 ` zhangfei.gao @ 2022-06-13 3:57 ` Paul E. McKenney 2022-06-13 4:16 ` Paul E. McKenney 0 siblings, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-13 3:57 UTC (permalink / raw) To: zhangfei.gao Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: > Hi, Paul > > On 2022/6/13 上午2:49, Paul E. McKenney wrote: > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: > > > On 6/12/22 18:40, Paul E. McKenney wrote: > > > > > Do these reserved memory regions really need to be allocated separately? > > > > > (For example, are they really all non-contiguous? If not, that is, if > > > > > there are a lot of contiguous memory regions, could you sort the IORT > > > > > by address and do one ioctl() for each set of contiguous memory regions?) > > > > > > > > > > Are all of these reserved memory regions set up before init is spawned? > > > > > > > > > > Are all of these reserved memory regions set up while there is only a > > > > > single vCPU up and running? > > > > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess > > > > > to not being all that familiar with KVM.) > > > > Oh, and there was a similar many-requests problem with networking many > > > > years ago. This was solved by adding a new syscall/ioctl()/whatever > > > > that permitted many requests to be presented to the kernel with a single > > > > system call. > > > > > > > > Could a new ioctl() be introduced that requested a large number > > > > of these memory regions in one go so as to make each call to > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > > > > requests? 
Adding a few of the KVM guys on CC for their thoughts. > > > Unfortunately not. Apart from this specific case, in general the calls to > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the > > > guest, and those writes then map to a ioctl. Typically the guest sets up a > > > device at a time, and each setup step causes a synchronize_srcu()---and > > > expedited at that. > > I was afraid of something like that... > > > > > KVM has two SRCUs: > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > > > that are very very small, but it needs extremely fast detection of grace > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > > > not so frequent. > > > > > > 2) kvm->srcu is nastier because there are readers all the time. The > > > read-side critical section are still short-ish, but they need the sleepable > > > part because they access user memory. > > Which one of these two is in play in this case? > > > > > Writers are not frequent per se; the problem is they come in very large > > > bursts when a guest boots. And while the whole boot path overall can be > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger > > > impact on runtime than the O(n^2) parts, as demonstrated here. > > > > > > Therefore, we operated on the assumption that the callers of > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code > > > and the desire was to get past the booting phase as fast as possible. If > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants; > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole > > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not > > > problematic. 
> > > > > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu > > > were was introduced (respectively in 2009 and 2014). But perhaps they do > > > not hold anymore now that each SRCU is not as independent as it used to be > > > in those years, and instead they use workqueues instead? > > The problem was not internal to SRCU, but rather due to the fact > > that kernel live patching (KLP) had problems with the CPU-bound tasks > > resulting from repeated synchronize_rcu_expedited() invocations. So I > > added heuristics to get the occasional sleep in there for KLP's benefit. > > Perhaps these heuristics need to be less aggressive about adding sleep. > > > > These heuristics have these aspects: > > > > 1. The longer readers persist in an expedited SRCU grace period, > > the longer the wait between successive checks of the reader > > state. Roughly speaking, we wait as long as the grace period > > has currently been in effect, capped at ten jiffies. > > > > 2. SRCU grace periods have several phases. We reset so that each > > phase starts by not waiting (new phase, new set of readers, > > so don't penalize this set for the sins of the previous set). > > But once we get to the point of adding delay, we add the > > delay based on the beginning of the full grace period. > > > > Right now, the checking for grace-period length does not allow for the > > possibility that a grace period might start just before the jiffies > > counter gets incremented (because I didn't realize that anyone cared), > > so that is one possible thing to change. I can also allow more no-delay > > checks per SRCU grace-period phase. > > > > Zhangfei, does something like the patch shown below help? > > > > Additional adjustments are likely needed to avoid re-breaking KLP, > > but we have to start somewhere... 
> > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 50ba70f019dea..6a354368ac1d1 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > > /* > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > */ > > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > { > > + unsigned long gpstart; > > + unsigned long j; > > unsigned long jbase = SRCU_INTERVAL; > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > jbase = 0; > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > + j = jiffies - 1; > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > + if (time_after(j, gpstart)) > > + jbase += j - gpstart; > > + } > > if (!jbase) { > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > Unfortunately, this patch does not helpful. > > Then re-add the debug info. 
> > During the qemu boot > [ 232.997667] __synchronize_srcu loop=1000 > > [ 361.094493] __synchronize_srcu loop=9000 > [ 361.094501] Call trace: > [ 361.094502] dump_backtrace+0xe4/0xf0 > [ 361.094505] show_stack+0x20/0x70 > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > [ 361.094509] dump_stack+0x18/0x34 > [ 361.094511] __synchronize_srcu+0x120/0x128 > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > [ 361.094519] kvm_activate_memslot+0x40/0x68 > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > [ 361.094530] invoke_syscall+0x4c/0x110 > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > [ 361.094536] do_el0_svc+0x34/0xc0 > [ 361.094538] el0_svc+0x30/0x98 > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > [ 361.094544] el0t_64_sync+0x18c/0x190 > [ 363.942817] kvm_set_memory_region loop=6000 Huh. One possibility is that the "if (!jbase)" block needs to be nested within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block. One additional debug is to apply the patch below on top of the one you just now kindly tested, then use whatever debug technique you wish to work out what fraction of the time during that critical interval that srcu_get_delay() returns non-zero. Other thoughts? Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 3:57 ` Paul E. McKenney @ 2022-06-13 4:16 ` Paul E. McKenney 2022-06-13 6:55 ` zhangfei.gao 0 siblings, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-13 4:16 UTC (permalink / raw) To: zhangfei.gao Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, sheng.yang On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: > On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: > > Hi, Paul > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote: > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: > > > > On 6/12/22 18:40, Paul E. McKenney wrote: > > > > > > Do these reserved memory regions really need to be allocated separately? > > > > > > (For example, are they really all non-contiguous? If not, that is, if > > > > > > there are a lot of contiguous memory regions, could you sort the IORT > > > > > > by address and do one ioctl() for each set of contiguous memory regions?) > > > > > > > > > > > > Are all of these reserved memory regions set up before init is spawned? > > > > > > > > > > > > Are all of these reserved memory regions set up while there is only a > > > > > > single vCPU up and running? > > > > > > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess > > > > > > to not being all that familiar with KVM.) > > > > > Oh, and there was a similar many-requests problem with networking many > > > > > years ago. This was solved by adding a new syscall/ioctl()/whatever > > > > > that permitted many requests to be presented to the kernel with a single > > > > > system call. 
> > > > > > > > > > Could a new ioctl() be introduced that requested a large number > > > > > of these memory regions in one go so as to make each call to > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > > > > > requests? Adding a few of the KVM guys on CC for their thoughts. > > > > Unfortunately not. Apart from this specific case, in general the calls to > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the > > > > guest, and those writes then map to a ioctl. Typically the guest sets up a > > > > device at a time, and each setup step causes a synchronize_srcu()---and > > > > expedited at that. > > > I was afraid of something like that... > > > > > > > KVM has two SRCUs: > > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > > > > that are very very small, but it needs extremely fast detection of grace > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > > > > not so frequent. > > > > > > > > 2) kvm->srcu is nastier because there are readers all the time. The > > > > read-side critical section are still short-ish, but they need the sleepable > > > > part because they access user memory. > > > Which one of these two is in play in this case? > > > > > > > Writers are not frequent per se; the problem is they come in very large > > > > bursts when a guest boots. And while the whole boot path overall can be > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger > > > > impact on runtime than the O(n^2) parts, as demonstrated here. > > > > > > > > Therefore, we operated on the assumption that the callers of > > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code > > > > and the desire was to get past the booting phase as fast as possible. 
If > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants; > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not > > > > problematic. > > > > > > > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu > > > > were was introduced (respectively in 2009 and 2014). But perhaps they do > > > > not hold anymore now that each SRCU is not as independent as it used to be > > > > in those years, and instead they use workqueues instead? > > > The problem was not internal to SRCU, but rather due to the fact > > > that kernel live patching (KLP) had problems with the CPU-bound tasks > > > resulting from repeated synchronize_rcu_expedited() invocations. So I > > > added heuristics to get the occasional sleep in there for KLP's benefit. > > > Perhaps these heuristics need to be less aggressive about adding sleep. > > > > > > These heuristics have these aspects: > > > > > > 1. The longer readers persist in an expedited SRCU grace period, > > > the longer the wait between successive checks of the reader > > > state. Roughly speaking, we wait as long as the grace period > > > has currently been in effect, capped at ten jiffies. > > > > > > 2. SRCU grace periods have several phases. We reset so that each > > > phase starts by not waiting (new phase, new set of readers, > > > so don't penalize this set for the sins of the previous set). > > > But once we get to the point of adding delay, we add the > > > delay based on the beginning of the full grace period. > > > > > > Right now, the checking for grace-period length does not allow for the > > > possibility that a grace period might start just before the jiffies > > > counter gets incremented (because I didn't realize that anyone cared), > > > so that is one possible thing to change. I can also allow more no-delay > > > checks per SRCU grace-period phase. 
> > > > > > Zhangfei, does something like the patch shown below help? > > > > > > Additional adjustments are likely needed to avoid re-breaking KLP, > > > but we have to start somewhere... > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 50ba70f019dea..6a354368ac1d1 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > > > /* > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > */ > > > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > > { > > > + unsigned long gpstart; > > > + unsigned long j; > > > unsigned long jbase = SRCU_INTERVAL; > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > jbase = 0; > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > > + j = jiffies - 1; > > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > > + if (time_after(j, gpstart)) > > > + jbase += j - gpstart; > > > + } > > > if (!jbase) { > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > > Unfortunately, this patch does not helpful. > > > > Then re-add the debug info. 
> > > > During the qemu boot > > [ 232.997667] __synchronize_srcu loop=1000 > > > > [ 361.094493] __synchronize_srcu loop=9000 > > [ 361.094501] Call trace: > > [ 361.094502] dump_backtrace+0xe4/0xf0 > > [ 361.094505] show_stack+0x20/0x70 > > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > > [ 361.094509] dump_stack+0x18/0x34 > > [ 361.094511] __synchronize_srcu+0x120/0x128 > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > > [ 361.094519] kvm_activate_memslot+0x40/0x68 > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > > [ 361.094530] invoke_syscall+0x4c/0x110 > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > > [ 361.094536] do_el0_svc+0x34/0xc0 > > [ 361.094538] el0_svc+0x30/0x98 > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > > [ 361.094544] el0t_64_sync+0x18c/0x190 > > [ 363.942817] kvm_set_memory_region loop=6000 > > Huh. > > One possibility is that the "if (!jbase)" block needs to be nested > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block. And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the above change reduces the overhead by more than an order of magnitude. Except that the overhead of the series is far less than one second, not the several minutes that you are seeing. So the per-call overhead decreases from about 17 microseconds to a bit more than one microsecond. I could imagine an extra order of magnitude if you are running HZ=100 instead of the HZ=1000 that I am running. But that only gets up to a few seconds. > One additional debug is to apply the patch below on top of the one you > just now kindly tested, then use whatever debug technique you wish to > work out what fraction of the time during that critical interval that > srcu_get_delay() returns non-zero. 
So I am very interested in the above debug. ;-) Other thoughts? Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 4:16 ` Paul E. McKenney @ 2022-06-13 6:55 ` zhangfei.gao 2022-06-13 12:18 ` Paul E. McKenney ` (2 more replies) 0 siblings, 3 replies; 37+ messages in thread From: zhangfei.gao @ 2022-06-13 6:55 UTC (permalink / raw) To: paulmck Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, Auger Eric Hi, Paul On 2022/6/13 下午12:16, Paul E. McKenney wrote: > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: >>> Hi, Paul >>> >>> On 2022/6/13 上午2:49, Paul E. McKenney wrote: >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: >>>>> On 6/12/22 18:40, Paul E. McKenney wrote: >>>>>>> Do these reserved memory regions really need to be allocated separately? >>>>>>> (For example, are they really all non-contiguous? If not, that is, if >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?) >>>>>>> >>>>>>> Are all of these reserved memory regions set up before init is spawned? >>>>>>> >>>>>>> Are all of these reserved memory regions set up while there is only a >>>>>>> single vCPU up and running? >>>>>>> >>>>>>> Is the SRCU grace period really needed in this case? (I freely confess >>>>>>> to not being all that familiar with KVM.) >>>>>> Oh, and there was a similar many-requests problem with networking many >>>>>> years ago. This was solved by adding a new syscall/ioctl()/whatever >>>>>> that permitted many requests to be presented to the kernel with a single >>>>>> system call. 
>>>>>> >>>>>> Could a new ioctl() be introduced that requested a large number >>>>>> of these memory regions in one go so as to make each call to >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+ >>>>>> requests? Adding a few of the KVM guys on CC for their thoughts. >>>>> Unfortunately not. Apart from this specific case, in general the calls to >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the >>>>> guest, and those writes then map to a ioctl. Typically the guest sets up a >>>>> device at a time, and each setup step causes a synchronize_srcu()---and >>>>> expedited at that. >>>> I was afraid of something like that... >>>> >>>>> KVM has two SRCUs: >>>>> >>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers >>>>> that are very very small, but it needs extremely fast detection of grace >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up >>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are >>>>> not so frequent. >>>>> >>>>> 2) kvm->srcu is nastier because there are readers all the time. The >>>>> read-side critical section are still short-ish, but they need the sleepable >>>>> part because they access user memory. >>>> Which one of these two is in play in this case? >>>> >>>>> Writers are not frequent per se; the problem is they come in very large >>>>> bursts when a guest boots. And while the whole boot path overall can be >>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger >>>>> impact on runtime than the O(n^2) parts, as demonstrated here. >>>>> >>>>> Therefore, we operated on the assumption that the callers of >>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code >>>>> and the desire was to get past the booting phase as fast as possible. 
If >>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants; >>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole >>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not >>>>> problematic. >>>>> >>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu >>>>> were was introduced (respectively in 2009 and 2014). But perhaps they do >>>>> not hold anymore now that each SRCU is not as independent as it used to be >>>>> in those years, and instead they use workqueues instead? >>>> The problem was not internal to SRCU, but rather due to the fact >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks >>>> resulting from repeated synchronize_rcu_expedited() invocations. So I >>>> added heuristics to get the occasional sleep in there for KLP's benefit. >>>> Perhaps these heuristics need to be less aggressive about adding sleep. >>>> >>>> These heuristics have these aspects: >>>> >>>> 1. The longer readers persist in an expedited SRCU grace period, >>>> the longer the wait between successive checks of the reader >>>> state. Roughly speaking, we wait as long as the grace period >>>> has currently been in effect, capped at ten jiffies. >>>> >>>> 2. SRCU grace periods have several phases. We reset so that each >>>> phase starts by not waiting (new phase, new set of readers, >>>> so don't penalize this set for the sins of the previous set). >>>> But once we get to the point of adding delay, we add the >>>> delay based on the beginning of the full grace period. >>>> >>>> Right now, the checking for grace-period length does not allow for the >>>> possibility that a grace period might start just before the jiffies >>>> counter gets incremented (because I didn't realize that anyone cared), >>>> so that is one possible thing to change. I can also allow more no-delay >>>> checks per SRCU grace-period phase. >>>> >>>> Zhangfei, does something like the patch shown below help? 
>>>>
>>>> Additional adjustments are likely needed to avoid re-breaking KLP,
>>>> but we have to start somewhere...
>>>>
>>>> Thanx, Paul
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>> index 50ba70f019dea..6a354368ac1d1 100644
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>>  #define SRCU_INTERVAL          1        // Base delay if no expedited GPs pending.
>>>>  #define SRCU_MAX_INTERVAL      10       // Maximum incremental delay from slow readers.
>>>> -#define SRCU_MAX_NODELAY_PHASE 1        // Maximum per-GP-phase consecutive no-delay instances.
>>>> +#define SRCU_MAX_NODELAY_PHASE 3        // Maximum per-GP-phase consecutive no-delay instances.
>>>>  #define SRCU_MAX_NODELAY       100      // Maximum consecutive no-delay instances.
>>>>  /*
>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>>   */
>>>>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>>>  {
>>>> +       unsigned long gpstart;
>>>> +       unsigned long j;
>>>>         unsigned long jbase = SRCU_INTERVAL;
>>>>         if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>                 jbase = 0;
>>>> -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>> -               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>>>> +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>> +               j = jiffies - 1;
>>>> +               gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>> +               if (time_after(j, gpstart))
>>>> +                       jbase += j - gpstart;
>>>> +       }
>>>>         if (!jbase) {
>>>>                 WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>>                 if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
>>> Unfortunately, this patch does not help.
>>>
>>> Then re-add the debug info.
>>>
>>> During the qemu boot
>>> [ 232.997667] __synchronize_srcu loop=1000
>>>
>>> [ 361.094493] __synchronize_srcu loop=9000
>>> [ 361.094501] Call trace:
>>> [ 361.094502] dump_backtrace+0xe4/0xf0
>>> [ 361.094505] show_stack+0x20/0x70
>>> [ 361.094507] dump_stack_lvl+0x8c/0xb8
>>> [ 361.094509] dump_stack+0x18/0x34
>>> [ 361.094511] __synchronize_srcu+0x120/0x128
>>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
>>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198
>>> [ 361.094519] kvm_activate_memslot+0x40/0x68
>>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
>>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
>>> [ 361.094524] kvm_set_memory_region+0x78/0xb8
>>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
>>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
>>> [ 361.094530] invoke_syscall+0x4c/0x110
>>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
>>> [ 361.094536] do_el0_svc+0x34/0xc0
>>> [ 361.094538] el0_svc+0x30/0x98
>>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
>>> [ 361.094544] el0t_64_sync+0x18c/0x190
>>> [ 363.942817] kvm_set_memory_region loop=6000
>> Huh.
>>
>> One possibility is that the "if (!jbase)" block needs to be nested
>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
I tested this diff and it did NOT help.

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019de..36286a4b74e6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)

 #define SRCU_INTERVAL          1        // Base delay if no expedited GPs pending.
 #define SRCU_MAX_INTERVAL      10       // Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE 1        // Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE 3        // Maximum per-GP-phase consecutive no-delay instances.
 #define SRCU_MAX_NODELAY       100      // Maximum consecutive no-delay instances.
 /*
@@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  */
 static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
+       unsigned long gpstart;
+       unsigned long j;
        unsigned long jbase = SRCU_INTERVAL;

        if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
                jbase = 0;
-       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
-       if (!jbase) {
-               WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
-               if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
-                       jbase = 1;
+       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+               j = jiffies - 1;
+               gpstart = READ_ONCE(ssp->srcu_gp_start);
+               if (time_after(j, gpstart))
+                       jbase += j - gpstart;
+
+               if (!jbase) {
+                       WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
+                               jbase = 1;
+               }
        }

> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> above change reduces the overhead by more than an order of magnitude.
> Except that the overhead of the series is far less than one second,
> not the several minutes that you are seeing. So the per-call overhead
> decreases from about 17 microseconds to a bit more than one microsecond.
>
> I could imagine an extra order of magnitude if you are running HZ=100
> instead of the HZ=1000 that I am running. But that only gets up to a
> few seconds.
>
>> One additional debug is to apply the patch below on top of the one you
apply the patch below?
>> just now kindly tested, then use whatever debug technique you wish to
>> work out what fraction of the time during that critical interval that
>> srcu_get_delay() returns non-zero.
Sorry, I am confused, there is no patch, right?
Just measure when srcu_get_delay() returns non-zero?

By the way, the issue should be related only to qemu acpi.
It is not related to the rmr feature.
Test with: https://github.com/qemu/qemu/tree/stable-6.1

It looks like it is caused by too many kvm_region_add & kvm_region_del calls if acpi=force.
If there is no ACPI, kvm_region_add/del is not printed (printed once per 1000 times).

If with acpi=force, during qemu boot:
kvm_region_add region_add = 1000
kvm_region_del region_del = 1000
kvm_region_add region_add = 2000
kvm_region_del region_del = 2000
kvm_region_add region_add = 3000
kvm_region_del region_del = 3000
kvm_region_add region_add = 4000
kvm_region_del region_del = 4000
kvm_region_add region_add = 5000
kvm_region_del region_del = 5000
kvm_region_add region_add = 6000
kvm_region_del region_del = 6000

kvm_region_add/kvm_region_del ->
    kvm_set_phys_mem ->
        kvm_set_user_memory_region ->
            kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)

[ 361.094493] __synchronize_srcu loop=9000
[ 361.094501] Call trace:
[ 361.094502] dump_backtrace+0xe4/0xf0
[ 361.094505] show_stack+0x20/0x70
[ 361.094507] dump_stack_lvl+0x8c/0xb8
[ 361.094509] dump_stack+0x18/0x34
[ 361.094511] __synchronize_srcu+0x120/0x128
[ 361.094514] synchronize_srcu_expedited+0x2c/0x40
[ 361.094515] kvm_swap_active_memslots+0x130/0x198
[ 361.094519] kvm_activate_memslot+0x40/0x68
[ 361.094520] kvm_set_memslot+0x2f8/0x3b0
[ 361.094523] __kvm_set_memory_region+0x2e4/0x438
[ 361.094524] kvm_set_memory_region+0x78/0xb8
[ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
[ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
[ 361.094530] invoke_syscall+0x4c/0x110
[ 361.094533] el0_svc_common.constprop.0+0x68/0x128
[ 361.094536] do_el0_svc+0x34/0xc0
[ 361.094538] el0_svc+0x30/0x98
[ 361.094541] el0t_64_sync_handler+0xb8/0xc0
[ 361.094544] el0t_64_sync+0x18c/0x190
[ 363.942817] kvm_set_memory_region loop=6000
^ permalink raw reply related	[flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 6:55 ` zhangfei.gao @ 2022-06-13 12:18 ` Paul E. McKenney 2022-06-13 13:23 ` zhangfei.gao 2022-06-13 15:02 ` Shameerali Kolothum Thodi 2022-06-15 8:29 ` Marc Zyngier 2 siblings, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-13 12:18 UTC (permalink / raw) To: zhangfei.gao Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, Auger Eric On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com wrote: > Hi, Paul > > On 2022/6/13 下午12:16, Paul E. McKenney wrote: > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: > > > > Hi, Paul > > > > > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote: > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: > > > > > > On 6/12/22 18:40, Paul E. McKenney wrote: > > > > > > > > Do these reserved memory regions really need to be allocated separately? > > > > > > > > (For example, are they really all non-contiguous? If not, that is, if > > > > > > > > there are a lot of contiguous memory regions, could you sort the IORT > > > > > > > > by address and do one ioctl() for each set of contiguous memory regions?) > > > > > > > > > > > > > > > > Are all of these reserved memory regions set up before init is spawned? > > > > > > > > > > > > > > > > Are all of these reserved memory regions set up while there is only a > > > > > > > > single vCPU up and running? > > > > > > > > > > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess > > > > > > > > to not being all that familiar with KVM.) > > > > > > > Oh, and there was a similar many-requests problem with networking many > > > > > > > years ago. 
This was solved by adding a new syscall/ioctl()/whatever > > > > > > > that permitted many requests to be presented to the kernel with a single > > > > > > > system call. > > > > > > > > > > > > > > Could a new ioctl() be introduced that requested a large number > > > > > > > of these memory regions in one go so as to make each call to > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > > > > > > > requests? Adding a few of the KVM guys on CC for their thoughts. > > > > > > Unfortunately not. Apart from this specific case, in general the calls to > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the > > > > > > guest, and those writes then map to a ioctl. Typically the guest sets up a > > > > > > device at a time, and each setup step causes a synchronize_srcu()---and > > > > > > expedited at that. > > > > > I was afraid of something like that... > > > > > > > > > > > KVM has two SRCUs: > > > > > > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > > > > > > that are very very small, but it needs extremely fast detection of grace > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > > > > > > not so frequent. > > > > > > > > > > > > 2) kvm->srcu is nastier because there are readers all the time. The > > > > > > read-side critical section are still short-ish, but they need the sleepable > > > > > > part because they access user memory. > > > > > Which one of these two is in play in this case? > > > > > > > > > > > Writers are not frequent per se; the problem is they come in very large > > > > > > bursts when a guest boots. And while the whole boot path overall can be > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger > > > > > > impact on runtime than the O(n^2) parts, as demonstrated here. 
> > > > > > > > > > > > Therefore, we operated on the assumption that the callers of > > > > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code > > > > > > and the desire was to get past the booting phase as fast as possible. If > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants; > > > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole > > > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() were not > > > > > > problematic. > > > > > > > > > > > > This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu > > > > > > were was introduced (respectively in 2009 and 2014). But perhaps they do > > > > > > not hold anymore now that each SRCU is not as independent as it used to be > > > > > > in those years, and instead they use workqueues instead? > > > > > The problem was not internal to SRCU, but rather due to the fact > > > > > that kernel live patching (KLP) had problems with the CPU-bound tasks > > > > > resulting from repeated synchronize_rcu_expedited() invocations. So I > > > > > added heuristics to get the occasional sleep in there for KLP's benefit. > > > > > Perhaps these heuristics need to be less aggressive about adding sleep. > > > > > > > > > > These heuristics have these aspects: > > > > > > > > > > 1. The longer readers persist in an expedited SRCU grace period, > > > > > the longer the wait between successive checks of the reader > > > > > state. Roughly speaking, we wait as long as the grace period > > > > > has currently been in effect, capped at ten jiffies. > > > > > > > > > > 2. SRCU grace periods have several phases. We reset so that each > > > > > phase starts by not waiting (new phase, new set of readers, > > > > > so don't penalize this set for the sins of the previous set). > > > > > But once we get to the point of adding delay, we add the > > > > > delay based on the beginning of the full grace period. 
> > > > > > > > > > Right now, the checking for grace-period length does not allow for the > > > > > possibility that a grace period might start just before the jiffies > > > > > counter gets incremented (because I didn't realize that anyone cared), > > > > > so that is one possible thing to change. I can also allow more no-delay > > > > > checks per SRCU grace-period phase. > > > > > > > > > > Zhangfei, does something like the patch shown below help? > > > > > > > > > > Additional adjustments are likely needed to avoid re-breaking KLP, > > > > > but we have to start somewhere... > > > > > > > > > > Thanx, Paul > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > > index 50ba70f019dea..6a354368ac1d1 100644 > > > > > --- a/kernel/rcu/srcutree.c > > > > > +++ b/kernel/rcu/srcutree.c > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > > > > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > > > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. > > > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > > > > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. 
> > > > > /* > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > > > */ > > > > > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > > > > { > > > > > + unsigned long gpstart; > > > > > + unsigned long j; > > > > > unsigned long jbase = SRCU_INTERVAL; > > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > > > jbase = 0; > > > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > > > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > > > > + j = jiffies - 1; > > > > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > > > > + if (time_after(j, gpstart)) > > > > > + jbase += j - gpstart; > > > > > + } > > > > > if (!jbase) { > > > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > > > > Unfortunately, this patch does not helpful. > > > > > > > > Then re-add the debug info. 
> > > > > > > > During the qemu boot > > > > [ 232.997667] __synchronize_srcu loop=1000 > > > > > > > > [ 361.094493] __synchronize_srcu loop=9000 > > > > [ 361.094501] Call trace: > > > > [ 361.094502] dump_backtrace+0xe4/0xf0 > > > > [ 361.094505] show_stack+0x20/0x70 > > > > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > > > > [ 361.094509] dump_stack+0x18/0x34 > > > > [ 361.094511] __synchronize_srcu+0x120/0x128 > > > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > > > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > > > > [ 361.094519] kvm_activate_memslot+0x40/0x68 > > > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > > > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > > > > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > > > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > > > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > > > > [ 361.094530] invoke_syscall+0x4c/0x110 > > > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > > > > [ 361.094536] do_el0_svc+0x34/0xc0 > > > > [ 361.094538] el0_svc+0x30/0x98 > > > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > > > > [ 361.094544] el0t_64_sync+0x18c/0x190 > > > > [ 363.942817] kvm_set_memory_region loop=6000 > > > Huh. > > > > > > One possibility is that the "if (!jbase)" block needs to be nested > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block. > > I test this diff and NO helpful > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 50ba70f019de..36286a4b74e6 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs > pending. > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from > slow readers. > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive > no-delay instances. > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive > no-delay instances. 
> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay > instances. > > /* > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct > *ssp) > */ > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > { > + unsigned long gpstart; > + unsigned long j; > unsigned long jbase = SRCU_INTERVAL; > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > jbase = 0; > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > - if (!jbase) { > - WRITE_ONCE(ssp->srcu_n_exp_nodelay, > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > SRCU_MAX_NODELAY_PHASE) > - jbase = 1; > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > + j = jiffies - 1; > + gpstart = READ_ONCE(ssp->srcu_gp_start); > + if (time_after(j, gpstart)) > + jbase += j - gpstart; > + > + if (!jbase) { > + WRITE_ONCE(ssp->srcu_n_exp_nodelay, > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > SRCU_MAX_NODELAY_PHASE) > + jbase = 1; > + } > } That is in fact what I was intending you to test, thank you. As you say, unfortunately it did not help. Could you please test removing the "if (!jbase)" block entirely? > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the > > above change reduces the overhead by more than an order of magnitude. > > Except that the overhead of the series is far less than one second, > > not the several minutes that you are seeing. So the per-call overhead > > decreases from about 17 microseconds to a bit more than one microsecond. > > > > I could imagine an extra order of magnitude if you are running HZ=100 > > instead of the HZ=1000 that I am running. But that only gets up to a > > few seconds. One possible reason for the difference would be if your code has SRCU readers. Could you please tell me the value of CONFIG_HZ on your system? Also the value of CONFIG_PREEMPTION? 
> > > One additional debug is to apply the patch below on top of the one you > apply the patch below? > > > just now kindly tested, then use whatever debug technique you wish to > > > work out what fraction of the time during that critical interval that > > > srcu_get_delay() returns non-zero. > Sorry, I am confused, no patch right? Apologies, my omission. > Just measure srcu_get_delay return to non-zero? Exactly, please! > By the way, the issue should be only related with qemu apci. not related > with rmr feature > Test with: https://github.com/qemu/qemu/tree/stable-6.1 > > Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force, > If no acpi, no print kvm_region_add/del (1000 times print once) > > If with acpi=force, > During qemu boot > kvm_region_add region_add = 1000 > kvm_region_del region_del = 1000 > kvm_region_add region_add = 2000 > kvm_region_del region_del = 2000 > kvm_region_add region_add = 3000 > kvm_region_del region_del = 3000 > kvm_region_add region_add = 4000 > kvm_region_del region_del = 4000 > kvm_region_add region_add = 5000 > kvm_region_del region_del = 5000 > kvm_region_add region_add = 6000 > kvm_region_del region_del = 6000 > > kvm_region_add/kvm_region_del -> > kvm_set_phys_mem-> > kvm_set_user_memory_region-> > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) > > [ 361.094493] __synchronize_srcu loop=9000 > [ 361.094501] Call trace: > [ 361.094502] dump_backtrace+0xe4/0xf0 > [ 361.094505] show_stack+0x20/0x70 > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > [ 361.094509] dump_stack+0x18/0x34 > [ 361.094511] __synchronize_srcu+0x120/0x128 > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > [ 361.094519] kvm_activate_memslot+0x40/0x68 > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > [ 
361.094530] invoke_syscall+0x4c/0x110 > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > [ 361.094536] do_el0_svc+0x34/0xc0 > [ 361.094538] el0_svc+0x30/0x98 > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > [ 361.094544] el0t_64_sync+0x18c/0x190 > [ 363.942817] kvm_set_memory_region loop=6000 Good to know, thank you! Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 12:18 ` Paul E. McKenney @ 2022-06-13 13:23 ` zhangfei.gao 2022-06-13 14:59 ` Paul E. McKenney 0 siblings, 1 reply; 37+ messages in thread From: zhangfei.gao @ 2022-06-13 13:23 UTC (permalink / raw) To: paulmck Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, Auger Eric On 2022/6/13 下午8:18, Paul E. McKenney wrote: > On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com wrote: >> Hi, Paul >> >> On 2022/6/13 下午12:16, Paul E. McKenney wrote: >>> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: >>>> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: >>>>> Hi, Paul >>>>> >>>>> On 2022/6/13 上午2:49, Paul E. McKenney wrote: >>>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: >>>>>>> On 6/12/22 18:40, Paul E. McKenney wrote: >>>>>>>>> Do these reserved memory regions really need to be allocated separately? >>>>>>>>> (For example, are they really all non-contiguous? If not, that is, if >>>>>>>>> there are a lot of contiguous memory regions, could you sort the IORT >>>>>>>>> by address and do one ioctl() for each set of contiguous memory regions?) >>>>>>>>> >>>>>>>>> Are all of these reserved memory regions set up before init is spawned? >>>>>>>>> >>>>>>>>> Are all of these reserved memory regions set up while there is only a >>>>>>>>> single vCPU up and running? >>>>>>>>> >>>>>>>>> Is the SRCU grace period really needed in this case? (I freely confess >>>>>>>>> to not being all that familiar with KVM.) >>>>>>>> Oh, and there was a similar many-requests problem with networking many >>>>>>>> years ago. This was solved by adding a new syscall/ioctl()/whatever >>>>>>>> that permitted many requests to be presented to the kernel with a single >>>>>>>> system call. 
>>>>>>>> >>>>>>>> Could a new ioctl() be introduced that requested a large number >>>>>>>> of these memory regions in one go so as to make each call to >>>>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+ >>>>>>>> requests? Adding a few of the KVM guys on CC for their thoughts. >>>>>>> Unfortunately not. Apart from this specific case, in general the calls to >>>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the >>>>>>> guest, and those writes then map to a ioctl. Typically the guest sets up a >>>>>>> device at a time, and each setup step causes a synchronize_srcu()---and >>>>>>> expedited at that. >>>>>> I was afraid of something like that... >>>>>> >>>>>>> KVM has two SRCUs: >>>>>>> >>>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers >>>>>>> that are very very small, but it needs extremely fast detection of grace >>>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up >>>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are >>>>>>> not so frequent. >>>>>>> >>>>>>> 2) kvm->srcu is nastier because there are readers all the time. The >>>>>>> read-side critical section are still short-ish, but they need the sleepable >>>>>>> part because they access user memory. >>>>>> Which one of these two is in play in this case? >>>>>> >>>>>>> Writers are not frequent per se; the problem is they come in very large >>>>>>> bursts when a guest boots. And while the whole boot path overall can be >>>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger >>>>>>> impact on runtime than the O(n^2) parts, as demonstrated here. >>>>>>> >>>>>>> Therefore, we operated on the assumption that the callers of >>>>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code >>>>>>> and the desire was to get past the booting phase as fast as possible. 
If >>>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants; >>>>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole >>>>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not >>>>>>> problematic. >>>>>>> >>>>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu >>>>>>> were was introduced (respectively in 2009 and 2014). But perhaps they do >>>>>>> not hold anymore now that each SRCU is not as independent as it used to be >>>>>>> in those years, and instead they use workqueues instead? >>>>>> The problem was not internal to SRCU, but rather due to the fact >>>>>> that kernel live patching (KLP) had problems with the CPU-bound tasks >>>>>> resulting from repeated synchronize_rcu_expedited() invocations. So I >>>>>> added heuristics to get the occasional sleep in there for KLP's benefit. >>>>>> Perhaps these heuristics need to be less aggressive about adding sleep. >>>>>> >>>>>> These heuristics have these aspects: >>>>>> >>>>>> 1. The longer readers persist in an expedited SRCU grace period, >>>>>> the longer the wait between successive checks of the reader >>>>>> state. Roughly speaking, we wait as long as the grace period >>>>>> has currently been in effect, capped at ten jiffies. >>>>>> >>>>>> 2. SRCU grace periods have several phases. We reset so that each >>>>>> phase starts by not waiting (new phase, new set of readers, >>>>>> so don't penalize this set for the sins of the previous set). >>>>>> But once we get to the point of adding delay, we add the >>>>>> delay based on the beginning of the full grace period. >>>>>> >>>>>> Right now, the checking for grace-period length does not allow for the >>>>>> possibility that a grace period might start just before the jiffies >>>>>> counter gets incremented (because I didn't realize that anyone cared), >>>>>> so that is one possible thing to change. I can also allow more no-delay >>>>>> checks per SRCU grace-period phase. 
>>>>>> >>>>>> Zhangfei, does something like the patch shown below help? >>>>>> >>>>>> Additional adjustments are likely needed to avoid re-breaking KLP, >>>>>> but we have to start somewhere... >>>>>> >>>>>> Thanx, Paul >>>>>> >>>>>> ------------------------------------------------------------------------ >>>>>> >>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>>>>> index 50ba70f019dea..6a354368ac1d1 100644 >>>>>> --- a/kernel/rcu/srcutree.c >>>>>> +++ b/kernel/rcu/srcutree.c >>>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) >>>>>> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. >>>>>> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. >>>>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. >>>>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. >>>>>> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. >>>>>> /* >>>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) >>>>>> */ >>>>>> static unsigned long srcu_get_delay(struct srcu_struct *ssp) >>>>>> { >>>>>> + unsigned long gpstart; >>>>>> + unsigned long j; >>>>>> unsigned long jbase = SRCU_INTERVAL; >>>>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>>>> jbase = 0; >>>>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) >>>>>> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); >>>>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { >>>>>> + j = jiffies - 1; >>>>>> + gpstart = READ_ONCE(ssp->srcu_gp_start); >>>>>> + if (time_after(j, gpstart)) >>>>>> + jbase += j - gpstart; >>>>>> + } >>>>>> if (!jbase) { >>>>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>>>>> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) >>>>> Unfortunately, this patch does not helpful. 
>>>>> >>>>> Then re-add the debug info. >>>>> >>>>> During the qemu boot >>>>> [ 232.997667] __synchronize_srcu loop=1000 >>>>> >>>>> [ 361.094493] __synchronize_srcu loop=9000 >>>>> [ 361.094501] Call trace: >>>>> [ 361.094502] dump_backtrace+0xe4/0xf0 >>>>> [ 361.094505] show_stack+0x20/0x70 >>>>> [ 361.094507] dump_stack_lvl+0x8c/0xb8 >>>>> [ 361.094509] dump_stack+0x18/0x34 >>>>> [ 361.094511] __synchronize_srcu+0x120/0x128 >>>>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 >>>>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198 >>>>> [ 361.094519] kvm_activate_memslot+0x40/0x68 >>>>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 >>>>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 >>>>> [ 361.094524] kvm_set_memory_region+0x78/0xb8 >>>>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 >>>>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 >>>>> [ 361.094530] invoke_syscall+0x4c/0x110 >>>>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 >>>>> [ 361.094536] do_el0_svc+0x34/0xc0 >>>>> [ 361.094538] el0_svc+0x30/0x98 >>>>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 >>>>> [ 361.094544] el0t_64_sync+0x18c/0x190 >>>>> [ 363.942817] kvm_set_memory_region loop=6000 >>>> Huh. >>>> >>>> One possibility is that the "if (!jbase)" block needs to be nested >>>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block. >> I test this diff and NO helpful >> >> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >> index 50ba70f019de..36286a4b74e6 100644 >> --- a/kernel/rcu/srcutree.c >> +++ b/kernel/rcu/srcutree.c >> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) >> >> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs >> pending. >> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from >> slow readers. >> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive >> no-delay instances. >> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive >> no-delay instances. 
>> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay >> instances. >> >> /* >> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct >> *ssp) >> */ >> static unsigned long srcu_get_delay(struct srcu_struct *ssp) >> { >> + unsigned long gpstart; >> + unsigned long j; >> unsigned long jbase = SRCU_INTERVAL; >> >> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >> jbase = 0; >> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) >> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); >> - if (!jbase) { >> - WRITE_ONCE(ssp->srcu_n_exp_nodelay, >> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >> SRCU_MAX_NODELAY_PHASE) >> - jbase = 1; >> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { >> + j = jiffies - 1; >> + gpstart = READ_ONCE(ssp->srcu_gp_start); >> + if (time_after(j, gpstart)) >> + jbase += j - gpstart; >> + >> + if (!jbase) { >> + WRITE_ONCE(ssp->srcu_n_exp_nodelay, >> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >> + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >> SRCU_MAX_NODELAY_PHASE) >> + jbase = 1; >> + } >> } > That is in fact what I was intending you to test, thank you. As you > say, unfortunately it did not help. > > Could you please test removing the "if (!jbase)" block entirely? Removing the "if (!jbase)" block is much faster; I did not measure it precisely, but qemu (with the debug-build EFI) seems to boot normally. From the log timestamps: [ 114.624713] __synchronize_srcu loop=1000 [ 124.157011] __synchronize_srcu loop=9000 The timestamps for the several methods differ:
5.19-rc1 [ 94.271350] __synchronize_srcu loop=1001 [ 222.621659] __synchronize_srcu loop=9001 With your first diff: [ 232.997667] __synchronize_srcu loop=1000 [ 361.094493] __synchronize_srcu loop=9000 Remove "if (!jbase)" block [ 114.624713] __synchronize_srcu loop=1000 [ 124.157011] __synchronize_srcu loop=9000 5.18 method + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) + return 0; + return SRCU_INTERVAL; [ 74.598480] __synchronize_srcu loop=9000 [ 68.938297] __synchronize_srcu loop=1000 > >>> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the >>> above change reduces the overhead by more than an order of magnitude. >>> Except that the overhead of the series is far less than one second, >>> not the several minutes that you are seeing. So the per-call overhead >>> decreases from about 17 microseconds to a bit more than one microsecond. >>> >>> I could imagine an extra order of magnitude if you are running HZ=100 >>> instead of the HZ=1000 that I am running. But that only gets up to a >>> few seconds. > One possible reason for the difference would be if your code has > SRCU readers. > > Could you please tell me the value of CONFIG_HZ on your system? > Also the value of CONFIG_PREEMPTION? I am using arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y CONFIG_HZ_250=y Thanks > >>>> One additional debug is to apply the patch below on top of the one you >> apply the patch below? >>>> just now kindly tested, then use whatever debug technique you wish to >>>> work out what fraction of the time during that critical interval that >>>> srcu_get_delay() returns non-zero. >> Sorry, I am confused, no patch right? > Apologies, my omission. > >> Just measure srcu_get_delay return to non-zero? > Exactly, please! > >> By the way, the issue should be only related with qemu apci. 
not related >> with rmr feature >> Test with: https://github.com/qemu/qemu/tree/stable-6.1 >> >> Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force, >> If no acpi, no print kvm_region_add/del (1000 times print once) >> >> If with acpi=force, >> During qemu boot >> kvm_region_add region_add = 1000 >> kvm_region_del region_del = 1000 >> kvm_region_add region_add = 2000 >> kvm_region_del region_del = 2000 >> kvm_region_add region_add = 3000 >> kvm_region_del region_del = 3000 >> kvm_region_add region_add = 4000 >> kvm_region_del region_del = 4000 >> kvm_region_add region_add = 5000 >> kvm_region_del region_del = 5000 >> kvm_region_add region_add = 6000 >> kvm_region_del region_del = 6000 >> >> kvm_region_add/kvm_region_del -> >> kvm_set_phys_mem-> >> kvm_set_user_memory_region-> >> kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) >> >> [ 361.094493] __synchronize_srcu loop=9000 >> [ 361.094501] Call trace: >> [ 361.094502] dump_backtrace+0xe4/0xf0 >> [ 361.094505] show_stack+0x20/0x70 >> [ 361.094507] dump_stack_lvl+0x8c/0xb8 >> [ 361.094509] dump_stack+0x18/0x34 >> [ 361.094511] __synchronize_srcu+0x120/0x128 >> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 >> [ 361.094515] kvm_swap_active_memslots+0x130/0x198 >> [ 361.094519] kvm_activate_memslot+0x40/0x68 >> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 >> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 >> [ 361.094524] kvm_set_memory_region+0x78/0xb8 >> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 >> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 >> [ 361.094530] invoke_syscall+0x4c/0x110 >> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 >> [ 361.094536] do_el0_svc+0x34/0xc0 >> [ 361.094538] el0_svc+0x30/0x98 >> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 >> [ 361.094544] el0t_64_sync+0x18c/0x190 >> [ 363.942817] kvm_set_memory_region loop=6000 > Good to know, thank you! > > Thanx, Paul
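The call path in the trace above (kvm_vm_ioctl -> ... -> kvm_swap_active_memslots -> synchronize_srcu_expedited) is an instance of the usual publish-then-wait pattern, which is why every KVM_SET_USER_MEMORY_REGION ioctl pays for a full expedited SRCU grace period. A deliberately simplified, single-threaded user-space sketch follows; the names and the bare reader counter are stand-ins, not the real KVM or SRCU code.

```c
#include <assert.h>

/* Stand-in for KVM's memslot array; only its identity matters here. */
struct memslots { int generation; };

static struct memslots *active;  /* what vCPU readers dereference       */
static int readers;              /* stand-in for SRCU's reader tracking */

/* ~ srcu_read_lock(&kvm->srcu) / srcu_read_unlock(&kvm->srcu) */
void reader_enter(void) { readers++; }
void reader_exit(void)  { readers--; }

/* ~ kvm_swap_active_memslots(): publish the new array, then wait until
 * no pre-existing reader can still hold a pointer to the old one.  The
 * wait is the synchronize_srcu_expedited() call this thread is about,
 * and it runs once per memory-region update -- 9000+ times during the
 * slow acpi=force boot. */
struct memslots *swap_active(struct memslots *new_slots)
{
	struct memslots *old = active;

	active = new_slots;   /* rcu_assign_pointer() in the real code     */
	while (readers > 0)   /* synchronize_srcu_expedited() in real code */
		;             /* model only: the real code sleeps here     */
	return old;           /* the old array is now safe to free/reuse   */
}

/* Demo: with no readers in flight the swap completes immediately and
 * hands back the previously active array. */
int demo_swap(void)
{
	static struct memslots a = { 1 }, b = { 2 };

	active = &a;
	return swap_active(&b)->generation;
}
```

The point of the sketch is the ordering, not the mechanism: the writer cannot recycle the old array until every reader that might have seen it is done, so the cost of one grace period is paid on every single region update.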
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 13:23 ` zhangfei.gao @ 2022-06-13 14:59 ` Paul E. McKenney 2022-06-13 20:55 ` Shameerali Kolothum Thodi 0 siblings, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-13 14:59 UTC (permalink / raw) To: zhangfei.gao Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, Auger Eric On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@foxmail.com wrote: > > > On 2022/6/13 下午8:18, Paul E. McKenney wrote: > > On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com wrote: > > > Hi, Paul > > > > > > On 2022/6/13 下午12:16, Paul E. McKenney wrote: > > > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: > > > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: > > > > > > Hi, Paul > > > > > > > > > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote: > > > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: > > > > > > > > On 6/12/22 18:40, Paul E. McKenney wrote: > > > > > > > > > > Do these reserved memory regions really need to be allocated separately? > > > > > > > > > > (For example, are they really all non-contiguous? If not, that is, if > > > > > > > > > > there are a lot of contiguous memory regions, could you sort the IORT > > > > > > > > > > by address and do one ioctl() for each set of contiguous memory regions?) > > > > > > > > > > > > > > > > > > > > Are all of these reserved memory regions set up before init is spawned? > > > > > > > > > > > > > > > > > > > > Are all of these reserved memory regions set up while there is only a > > > > > > > > > > single vCPU up and running? > > > > > > > > > > > > > > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess > > > > > > > > > > to not being all that familiar with KVM.) 
> > > > > > > > > Oh, and there was a similar many-requests problem with networking many > > > > > > > > > years ago. This was solved by adding a new syscall/ioctl()/whatever > > > > > > > > > that permitted many requests to be presented to the kernel with a single > > > > > > > > > system call. > > > > > > > > > > > > > > > > > > Could a new ioctl() be introduced that requested a large number > > > > > > > > > of these memory regions in one go so as to make each call to > > > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+ > > > > > > > > > requests? Adding a few of the KVM guys on CC for their thoughts. > > > > > > > > Unfortunately not. Apart from this specific case, in general the calls to > > > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the > > > > > > > > guest, and those writes then map to a ioctl. Typically the guest sets up a > > > > > > > > device at a time, and each setup step causes a synchronize_srcu()---and > > > > > > > > expedited at that. > > > > > > > I was afraid of something like that... > > > > > > > > > > > > > > > KVM has two SRCUs: > > > > > > > > > > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > > > > > > > > that are very very small, but it needs extremely fast detection of grace > > > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > > > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > > > > > > > > not so frequent. > > > > > > > > > > > > > > > > 2) kvm->srcu is nastier because there are readers all the time. The > > > > > > > > read-side critical section are still short-ish, but they need the sleepable > > > > > > > > part because they access user memory. > > > > > > > Which one of these two is in play in this case? > > > > > > > > > > > > > > > Writers are not frequent per se; the problem is they come in very large > > > > > > > > bursts when a guest boots. 
And while the whole boot path overall can be > > > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger > > > > > > > > impact on runtime than the O(n^2) parts, as demonstrated here. > > > > > > > > > > > > > > > > Therefore, we operated on the assumption that the callers of > > > > > > > > synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code > > > > > > > > and the desire was to get past the booting phase as fast as possible. If > > > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants; > > > > > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole > > > > > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() was not > > > > > > > > problematic. > > > > > > > > > > > > > > > > These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu > > > > > > > > were introduced (respectively in 2009 and 2014). But perhaps they do > > > > > > > > not hold anymore now that each SRCU is not as independent as it used to be > > > > > > > > in those years, and they use workqueues instead? > > > > > > > The problem was not internal to SRCU, but rather due to the fact > > > > > > > that kernel live patching (KLP) had problems with the CPU-bound tasks > > > > > > > resulting from repeated synchronize_rcu_expedited() invocations. So I > > > > > > > added heuristics to get the occasional sleep in there for KLP's benefit. > > > > > > > Perhaps these heuristics need to be less aggressive about adding sleep. > > > > > > > > > > > > > > These heuristics have these aspects: > > > > > > > > > > > > > > 1. The longer readers persist in an expedited SRCU grace period, > > > > > > > the longer the wait between successive checks of the reader > > > > > > > state. Roughly speaking, we wait as long as the grace period > > > > > > > has currently been in effect, capped at ten jiffies. > > > > > > > > > > > > > > 2.
SRCU grace periods have several phases. We reset so that each > > > > > > > phase starts by not waiting (new phase, new set of readers, > > > > > > > so don't penalize this set for the sins of the previous set). > > > > > > > But once we get to the point of adding delay, we add the > > > > > > > delay based on the beginning of the full grace period. > > > > > > > > > > > > > > Right now, the checking for grace-period length does not allow for the > > > > > > > possibility that a grace period might start just before the jiffies > > > > > > > counter gets incremented (because I didn't realize that anyone cared), > > > > > > > so that is one possible thing to change. I can also allow more no-delay > > > > > > > checks per SRCU grace-period phase. > > > > > > > > > > > > > > Zhangfei, does something like the patch shown below help? > > > > > > > > > > > > > > Additional adjustments are likely needed to avoid re-breaking KLP, > > > > > > > but we have to start somewhere... > > > > > > > > > > > > > > Thanx, Paul > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > > > > index 50ba70f019dea..6a354368ac1d1 100644 > > > > > > > --- a/kernel/rcu/srcutree.c > > > > > > > +++ b/kernel/rcu/srcutree.c > > > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > > > > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > > > > > > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > > > > > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. > > > > > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > > > > > > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. 
> > > > > > > /* > > > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > > > > > */ > > > > > > > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > > > > > > { > > > > > > > + unsigned long gpstart; > > > > > > > + unsigned long j; > > > > > > > unsigned long jbase = SRCU_INTERVAL; > > > > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > > > > > jbase = 0; > > > > > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > > > > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > > > > > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > > > > > > + j = jiffies - 1; > > > > > > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > > > > > > + if (time_after(j, gpstart)) > > > > > > > + jbase += j - gpstart; > > > > > > > + } > > > > > > > if (!jbase) { > > > > > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > > > > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > > > > > > Unfortunately, this patch does not helpful. > > > > > > > > > > > > Then re-add the debug info. 
> > > > > > > > > > > > During the qemu boot > > > > > > [ 232.997667] __synchronize_srcu loop=1000 > > > > > > > > > > > > [ 361.094493] __synchronize_srcu loop=9000 > > > > > > [ 361.094501] Call trace: > > > > > > [ 361.094502] dump_backtrace+0xe4/0xf0 > > > > > > [ 361.094505] show_stack+0x20/0x70 > > > > > > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > > > > > > [ 361.094509] dump_stack+0x18/0x34 > > > > > > [ 361.094511] __synchronize_srcu+0x120/0x128 > > > > > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > > > > > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > > > > > > [ 361.094519] kvm_activate_memslot+0x40/0x68 > > > > > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > > > > > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > > > > > > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > > > > > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > > > > > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > > > > > > [ 361.094530] invoke_syscall+0x4c/0x110 > > > > > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > > > > > > [ 361.094536] do_el0_svc+0x34/0xc0 > > > > > > [ 361.094538] el0_svc+0x30/0x98 > > > > > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > > > > > > [ 361.094544] el0t_64_sync+0x18c/0x190 > > > > > > [ 363.942817] kvm_set_memory_region loop=6000 > > > > > Huh. > > > > > > > > > > One possibility is that the "if (!jbase)" block needs to be nested > > > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block. > > > I test this diff and NO helpful > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 50ba70f019de..36286a4b74e6 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > > > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs > > > pending. > > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from > > > slow readers. 
> > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive > > > no-delay instances. > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive > > > no-delay instances. > > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay > > > instances. > > > > > > /* > > > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct > > > *ssp) > > > */ > > > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > > { > > > + unsigned long gpstart; > > > + unsigned long j; > > > unsigned long jbase = SRCU_INTERVAL; > > > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > jbase = 0; > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > > > - if (!jbase) { > > > - WRITE_ONCE(ssp->srcu_n_exp_nodelay, > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > > > SRCU_MAX_NODELAY_PHASE) > > > - jbase = 1; > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > > + j = jiffies - 1; > > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > > + if (time_after(j, gpstart)) > > > + jbase += j - gpstart; > > > + > > > + if (!jbase) { > > > + WRITE_ONCE(ssp->srcu_n_exp_nodelay, > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > > > SRCU_MAX_NODELAY_PHASE) > > > + jbase = 1; > > > + } > > > } > > That is in fact what I was intending you to test, thank you. As you > > say, unfortunately it did not help. > > > > Could you please test removing the "if (!jbase)" block entirely? > Remove "if (!jbase)" block is much faster, > not measure clearly, qemu (with debug version efi) boot seems normally. > > From log timestamp: > [ 114.624713] __synchronize_srcu loop=1000 > [ 124.157011] __synchronize_srcu loop=9000 > > Several method: timestamps are different. 
> > 5.19-rc1 > [ 94.271350] __synchronize_srcu loop=1001 > [ 222.621659] __synchronize_srcu loop=9001 > > > With your first diff: > [ 232.997667] __synchronize_srcu loop=1000 > [ 361.094493] __synchronize_srcu loop=9000 > > Remove "if (!jbase)" block > [ 114.624713] __synchronize_srcu loop=1000 > [ 124.157011] __synchronize_srcu loop=9000 > > > 5.18 method > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > + return 0; > + return SRCU_INTERVAL; > > [ 74.598480] __synchronize_srcu loop=9000 > [ 68.938297] __synchronize_srcu loop=1000 Thank you for the information! What happens if you keep that "if (!jbase)" block, but set the value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000? This would be too large for KLP, but my hope is that there is a value of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does this help at all? ;-) > > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the > > > > above change reduces the overhead by more than an order of magnitude. > > > > Except that the overhead of the series is far less than one second, > > > > not the several minutes that you are seeing. So the per-call overhead > > > > decreases from about 17 microseconds to a bit more than one microsecond. > > > > > > > > I could imagine an extra order of magnitude if you are running HZ=100 > > > > instead of the HZ=1000 that I am running. But that only gets up to a > > > > few seconds. > > One possible reason for the difference would be if your code has > > SRCU readers. > > > > Could you please tell me the value of CONFIG_HZ on your system? > > Also the value of CONFIG_PREEMPTION? > I am using arch/arm64/configs/defconfig > make defconfig > CONFIG_PREEMPTION=y > CONFIG_HZ_250=y Thank you again! And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might depend on HZ. And who knows what all else...
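For what it is worth, a back-of-envelope check (my own arithmetic, using only the 5.19-rc1 timestamps and the CONFIG_HZ_250 value reported above) puts the per-call cost at roughly four jiffies, which fits the theory that each expedited grace period is taking a handful of jiffy-granularity sleeps:

```c
#include <assert.h>
#include <math.h>

/* Constants below are the timestamps from the 5.19-rc1 log quoted in
 * this thread: loop=1001 at 94.271350s, loop=9001 at 222.621659s,
 * i.e. 8000 expedited grace periods in ~128.35s. */
double per_call_ms(void)
{
	return (222.621659 - 94.271350) / 8000.0 * 1000.0;  /* ~16 ms */
}

/* At CONFIG_HZ_250 a jiffy is 4 ms, so ~16 ms per call is about four
 * jiffies per synchronize_srcu_expedited().  The same jiffy count at
 * HZ=1000 would cost only ~4 ms per call, which may be part of why the
 * runs mentioned above on an HZ=1000 system look so much faster. */
double jiffies_per_call(int hz)
{
	return per_call_ms() / (1000.0 / (double)hz);
}
```

This is only consistent with the jiffy-delay theory, not proof of it: the estimate assumes essentially all of the 128 seconds is spent inside the expedited grace periods.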
Thanx, Paul > Thanks > > > > > > > > One additional debug is to apply the patch below on top of the one you > > > apply the patch below? > > > > > just now kindly tested, then use whatever debug technique you wish to > > > > > work out what fraction of the time during that critical interval that > > > > > srcu_get_delay() returns non-zero. > > > Sorry, I am confused, no patch right? > > Apologies, my omission. > > > > > Just measure srcu_get_delay return to non-zero? > > Exactly, please! > > > > > By the way, the issue should be only related with qemu apci. not related > > > with rmr feature > > > Test with: https://github.com/qemu/qemu/tree/stable-6.1 > > > > > > Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force, > > > If no acpi, no print kvm_region_add/del (1000 times print once) > > > > > > If with acpi=force, > > > During qemu boot > > > kvm_region_add region_add = 1000 > > > kvm_region_del region_del = 1000 > > > kvm_region_add region_add = 2000 > > > kvm_region_del region_del = 2000 > > > kvm_region_add region_add = 3000 > > > kvm_region_del region_del = 3000 > > > kvm_region_add region_add = 4000 > > > kvm_region_del region_del = 4000 > > > kvm_region_add region_add = 5000 > > > kvm_region_del region_del = 5000 > > > kvm_region_add region_add = 6000 > > > kvm_region_del region_del = 6000 > > > > > > kvm_region_add/kvm_region_del -> > > > kvm_set_phys_mem-> > > > kvm_set_user_memory_region-> > > > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) > > > > > > [ 361.094493] __synchronize_srcu loop=9000 > > > [ 361.094501] Call trace: > > > [ 361.094502] dump_backtrace+0xe4/0xf0 > > > [ 361.094505] show_stack+0x20/0x70 > > > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > > > [ 361.094509] dump_stack+0x18/0x34 > > > [ 361.094511] __synchronize_srcu+0x120/0x128 > > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > > > [ 361.094519] kvm_activate_memslot+0x40/0x68 > > > [ 361.094520] 
kvm_set_memslot+0x2f8/0x3b0 > > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > > > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > > > [ 361.094530] invoke_syscall+0x4c/0x110 > > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > > > [ 361.094536] do_el0_svc+0x34/0xc0 > > > [ 361.094538] el0_svc+0x30/0x98 > > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > > > [ 361.094544] el0t_64_sync+0x18c/0x190 > > > [ 363.942817] kvm_set_memory_region loop=6000 > > Good to know, thank you! > > > > Thanx, Paul
* RE: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 14:59 ` Paul E. McKenney @ 2022-06-13 20:55 ` Shameerali Kolothum Thodi 2022-06-14 12:19 ` Neeraj Upadhyay 0 siblings, 1 reply; 37+ messages in thread From: Shameerali Kolothum Thodi @ 2022-06-13 20:55 UTC (permalink / raw) To: paulmck, zhangfei.gao Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) > -----Original Message----- > From: Paul E. McKenney [mailto:paulmck@kernel.org] > Sent: 13 June 2022 15:59 > To: zhangfei.gao@foxmail.com > Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao > <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org; > rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett > <josh@joshtriplett.org>; Mathieu Desnoyers > <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>; > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com> > Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and > blocking readers from consuming CPU) cause qemu boot slow > > On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@foxmail.com > wrote: > > > > > > On 2022/6/13 下午8:18, Paul E. McKenney wrote: > > > On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com > wrote: > > > > Hi, Paul > > > > > > > > On 2022/6/13 下午12:16, Paul E. McKenney wrote: > > > > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: > > > > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, > zhangfei.gao@foxmail.com wrote: > > > > > > > Hi, Paul > > > > > > > > > > > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote: > > > > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini > wrote: > > > > > > > > > On 6/12/22 18:40, Paul E. 
McKenney wrote: > > > > > > > > > > > Do these reserved memory regions really need to be > allocated separately? > > > > > > > > > > > (For example, are they really all non-contiguous? If not, > that is, if > > > > > > > > > > > there are a lot of contiguous memory regions, could you > sort the IORT > > > > > > > > > > > by address and do one ioctl() for each set of contiguous > memory regions?) > > > > > > > > > > > > > > > > > > > > > > Are all of these reserved memory regions set up before init > is spawned? > > > > > > > > > > > > > > > > > > > > > > Are all of these reserved memory regions set up while > there is only a > > > > > > > > > > > single vCPU up and running? > > > > > > > > > > > > > > > > > > > > > > Is the SRCU grace period really needed in this case? (I > freely confess > > > > > > > > > > > to not being all that familiar with KVM.) > > > > > > > > > > Oh, and there was a similar many-requests problem with > networking many > > > > > > > > > > years ago. This was solved by adding a new > syscall/ioctl()/whatever > > > > > > > > > > that permitted many requests to be presented to the kernel > with a single > > > > > > > > > > system call. > > > > > > > > > > > > > > > > > > > > Could a new ioctl() be introduced that requested a large > number > > > > > > > > > > of these memory regions in one go so as to make each call to > > > > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your > 9000+ > > > > > > > > > > requests? Adding a few of the KVM guys on CC for their > thoughts. > > > > > > > > > Unfortunately not. Apart from this specific case, in general > the calls to > > > > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to > I/O registers in the > > > > > > > > > guest, and those writes then map to a ioctl. Typically the > guest sets up a > > > > > > > > > device at a time, and each setup step causes a > synchronize_srcu()---and > > > > > > > > > expedited at that. 
> > > > > > > > I was afraid of something like that... > > > > > > > > > > > > > > > > > KVM has two SRCUs: > > > > > > > > > > > > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it > has readers > > > > > > > > > that are very very small, but it needs extremely fast detection > of grace > > > > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > > > > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off > kvm->srcu. Readers are > > > > > > > > > not so frequent. > > > > > > > > > > > > > > > > > > 2) kvm->srcu is nastier because there are readers all the time. > The > > > > > > > > > read-side critical section are still short-ish, but they need the > sleepable > > > > > > > > > part because they access user memory. > > > > > > > > Which one of these two is in play in this case? > > > > > > > > > > > > > > > > > Writers are not frequent per se; the problem is they come in > very large > > > > > > > > > bursts when a guest boots. And while the whole boot path > overall can be > > > > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have > a larger > > > > > > > > > impact on runtime than the O(n^2) parts, as demonstrated > here. > > > > > > > > > > > > > > > > > > Therefore, we operated on the assumption that the callers of > > > > > > > > > synchronized_srcu_expedited were _anyway_ busy running > CPU-bound guest code > > > > > > > > > and the desire was to get past the booting phase as fast as > possible. If > > > > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it > wants; > > > > > > > > > therefore, as long as expedited GPs didn't eat CPU > *throughout the whole > > > > > > > > > system*, a preemptable busy wait in > synchronize_srcu_expedited() were not > > > > > > > > > problematic. > > > > > > > > > > > > > > > > > > This assumptions did match the SRCU code when kvm->srcu > and kvm->irq_srcu > > > > > > > > > were was introduced (respectively in 2009 and 2014). 
But > perhaps they do > > > > > > > > > not hold anymore now that each SRCU is not as independent > as it used to be > > > > > > > > > in those years, and instead they use workqueues instead? > > > > > > > > The problem was not internal to SRCU, but rather due to the fact > > > > > > > > that kernel live patching (KLP) had problems with the > CPU-bound tasks > > > > > > > > resulting from repeated synchronize_rcu_expedited() > invocations. So I > > > > > > > > added heuristics to get the occasional sleep in there for KLP's > benefit. > > > > > > > > Perhaps these heuristics need to be less aggressive about adding > sleep. > > > > > > > > > > > > > > > > These heuristics have these aspects: > > > > > > > > > > > > > > > > 1. The longer readers persist in an expedited SRCU grace period, > > > > > > > > the longer the wait between successive checks of the reader > > > > > > > > state. Roughly speaking, we wait as long as the grace period > > > > > > > > has currently been in effect, capped at ten jiffies. > > > > > > > > > > > > > > > > 2. SRCU grace periods have several phases. We reset so that > each > > > > > > > > phase starts by not waiting (new phase, new set of readers, > > > > > > > > so don't penalize this set for the sins of the previous set). > > > > > > > > But once we get to the point of adding delay, we add the > > > > > > > > delay based on the beginning of the full grace period. > > > > > > > > > > > > > > > > Right now, the checking for grace-period length does not allow > for the > > > > > > > > possibility that a grace period might start just before the jiffies > > > > > > > > counter gets incremented (because I didn't realize that anyone > cared), > > > > > > > > so that is one possible thing to change. I can also allow more > no-delay > > > > > > > > checks per SRCU grace-period phase. > > > > > > > > > > > > > > > > Zhangfei, does something like the patch shown below help? 
> > > > > > > > > > > > > > > > Additional adjustments are likely needed to avoid re-breaking > KLP, > > > > > > > > but we have to start somewhere... > > > > > > > > > > > > > > > > Thanx, Paul > > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > > > > > index 50ba70f019dea..6a354368ac1d1 100644 > > > > > > > > --- a/kernel/rcu/srcutree.c > > > > > > > > +++ b/kernel/rcu/srcutree.c > > > > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct > srcu_struct *ssp) > > > > > > > > #define SRCU_INTERVAL 1 // Base delay if no > expedited GPs pending. > > > > > > > > #define SRCU_MAX_INTERVAL 10 // Maximum > incremental delay from slow readers. > > > > > > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum > per-GP-phase consecutive no-delay instances. > > > > > > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum > per-GP-phase consecutive no-delay instances. > > > > > > > > #define SRCU_MAX_NODELAY 100 // Maximum > consecutive no-delay instances. 
> > > > > > > > /* > > > > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct > srcu_struct *ssp) > > > > > > > > */ > > > > > > > > static unsigned long srcu_get_delay(struct srcu_struct > *ssp) > > > > > > > > { > > > > > > > > + unsigned long gpstart; > > > > > > > > + unsigned long j; > > > > > > > > unsigned long jbase = SRCU_INTERVAL; > > > > > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > > > > > > jbase = 0; > > > > > > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > > > > > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > > > > > > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > > > > > > > + j = jiffies - 1; > > > > > > > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > > > > > > > + if (time_after(j, gpstart)) > > > > > > > > + jbase += j - gpstart; > > > > > > > > + } > > > > > > > > if (!jbase) { > > > > > > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > > > > > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > SRCU_MAX_NODELAY_PHASE) > > > > > > > Unfortunately, this patch does not helpful. > > > > > > > > > > > > > > Then re-add the debug info. 
> > > > > > > > > > > > > > During the qemu boot > > > > > > > [ 232.997667] __synchronize_srcu loop=1000 > > > > > > > > > > > > > > [ 361.094493] __synchronize_srcu loop=9000 > > > > > > > [ 361.094501] Call trace: > > > > > > > [ 361.094502] dump_backtrace+0xe4/0xf0 > > > > > > > [ 361.094505] show_stack+0x20/0x70 > > > > > > > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > > > > > > > [ 361.094509] dump_stack+0x18/0x34 > > > > > > > [ 361.094511] __synchronize_srcu+0x120/0x128 > > > > > > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > > > > > > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > > > > > > > [ 361.094519] kvm_activate_memslot+0x40/0x68 > > > > > > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > > > > > > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > > > > > > > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > > > > > > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > > > > > > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > > > > > > > [ 361.094530] invoke_syscall+0x4c/0x110 > > > > > > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > > > > > > > [ 361.094536] do_el0_svc+0x34/0xc0 > > > > > > > [ 361.094538] el0_svc+0x30/0x98 > > > > > > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > > > > > > > [ 361.094544] el0t_64_sync+0x18c/0x190 > > > > > > > [ 363.942817] kvm_set_memory_region loop=6000 > > > > > > Huh. > > > > > > > > > > > > One possibility is that the "if (!jbase)" block needs to be nested > > > > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" > block. > > > > I test this diff and NO helpful > > > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > index 50ba70f019de..36286a4b74e6 100644 > > > > --- a/kernel/rcu/srcutree.c > > > > +++ b/kernel/rcu/srcutree.c > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct > srcu_struct *ssp) > > > > > > > > #define SRCU_INTERVAL 1 // Base delay if no > expedited GPs > > > > pending. 
> > > > #define SRCU_MAX_INTERVAL 10 // Maximum > incremental delay from > > > > slow readers. > > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum > per-GP-phase consecutive > > > > no-delay instances. > > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum > per-GP-phase consecutive > > > > no-delay instances. > > > > #define SRCU_MAX_NODELAY 100 // Maximum > consecutive no-delay > > > > instances. > > > > > > > > /* > > > > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct > srcu_struct > > > > *ssp) > > > > */ > > > > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > > > { > > > > + unsigned long gpstart; > > > > + unsigned long j; > > > > unsigned long jbase = SRCU_INTERVAL; > > > > > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > > jbase = 0; > > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > > > > - jbase += jiffies - > READ_ONCE(ssp->srcu_gp_start); > > > > - if (!jbase) { > > > > - WRITE_ONCE(ssp->srcu_n_exp_nodelay, > > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > > > > SRCU_MAX_NODELAY_PHASE) > > > > - jbase = 1; > > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > > > > + j = jiffies - 1; > > > > + gpstart = READ_ONCE(ssp->srcu_gp_start); > > > > + if (time_after(j, gpstart)) > > > > + jbase += j - gpstart; > > > > + > > > > + if (!jbase) { > > > > > + WRITE_ONCE(ssp->srcu_n_exp_nodelay, > > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > > + if > (READ_ONCE(ssp->srcu_n_exp_nodelay) > > > > > SRCU_MAX_NODELAY_PHASE) > > > > + jbase = 1; > > > > + } > > > > } > > > That is in fact what I was intending you to test, thank you. As you > > > say, unfortunately it did not help. > > > > > > Could you please test removing the "if (!jbase)" block entirely? > > Remove "if (!jbase)" block is much faster, > > not measure clearly, qemu (with debug version efi) boot seems normally. 
> > > > From log timestamp: > > [ 114.624713] __synchronize_srcu loop=1000 > > [ 124.157011] __synchronize_srcu loop=9000 > > > > Several method: timestamps are different. > > > > 5.19-rc1 > > [ 94.271350] __synchronize_srcu loop=1001 > > [ 222.621659] __synchronize_srcu loop=9001 > > > > > > With your first diff: > > [ 232.997667] __synchronize_srcu loop=1000 > > [ 361.094493] __synchronize_srcu loop=9000 > > > > Remove "if (!jbase)" block > > [ 114.624713] __synchronize_srcu loop=1000 > > [ 124.157011] __synchronize_srcu loop=9000 > > > > > > 5.18 method > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > + return 0; > > + return SRCU_INTERVAL; > > > > [ 74.598480] __synchronize_srcu loop=9000 > > [ 68.938297] __synchronize_srcu loop=1000 > > Thank you for the information! > > What happens if you keep the that "if (!jbase)" block", but set the > value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000? From the setup I have, this is almost similar to that of the previous logic(without the "if(!jbase)"). In both cases, I think we are not close to 5.18, but definitely much better compared to 5.19-rc1. The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y), Guest boot time(using 'time'): 5.18-rc4 based ~8sec 5.19-rc1 ~2m43sec 5.19-rc1+fix1 ~19sec 5.19-rc1-fix2 ~19sec I will wait for Zhangfei to confirm this on his setup, especially the difference compared to 5.18. Thanks, Shameer > This would be too large for KLP, but my hope is that there is a value > of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does > this help at all? ;-) > > > > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, > the > > > > > above change reduces the overhead by more than an order of > magnitude. > > > > > Except that the overhead of the series is far less than one second, > > > > > not the several minutes that you are seeing. 
So the per-call > overhead > > > > > decreases from about 17 microseconds to a bit more than one > microsecond. > > > > > > > > > > I could imagine an extra order of magnitude if you are running > HZ=100 > > > > > instead of the HZ=1000 that I am running. But that only gets up to a > > > > > few seconds. > > > One possible reason for the difference would be if your code has > > > SRCU readers. > > > > > > Could you please tell me the value of CONFIG_HZ on your system? > > > Also the value of CONFIG_PREEMPTION? > > I am using arch/arm64/configs/defconfig > > make defconfig > > CONFIG_PREEMPTION=y > > CONFIG_HZ_250=y > > Thank you again! > > And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might > depend > on HZ. And who knows what all else... > > Thanx, Paul > > > Thanks > > > > > > > > > > > One additional debug is to apply the patch below on top of the one > you > > > > apply the patch below? > > > > > > just now kindly tested, then use whatever debug technique you wish > to > > > > > > work out what fraction of the time during that critical interval that > > > > > > srcu_get_delay() returns non-zero. > > > > Sorry, I am confused, no patch right? > > > Apologies, my omission. > > > > > > > Just measure srcu_get_delay return to non-zero? > > > Exactly, please! > > > > > > > By the way, the issue should be only related with qemu apci. 
not related > > > > with rmr feature > > > > Test with: https://github.com/qemu/qemu/tree/stable-6.1 > > > > > > > > Looks it caused by too many kvm_region_add & kvm_region_del if > acpi=force, > > > > If no acpi, no print kvm_region_add/del (1000 times print once) > > > > > > > > If with acpi=force, > > > > During qemu boot > > > > kvm_region_add region_add = 1000 > > > > kvm_region_del region_del = 1000 > > > > kvm_region_add region_add = 2000 > > > > kvm_region_del region_del = 2000 > > > > kvm_region_add region_add = 3000 > > > > kvm_region_del region_del = 3000 > > > > kvm_region_add region_add = 4000 > > > > kvm_region_del region_del = 4000 > > > > kvm_region_add region_add = 5000 > > > > kvm_region_del region_del = 5000 > > > > kvm_region_add region_add = 6000 > > > > kvm_region_del region_del = 6000 > > > > > > > > kvm_region_add/kvm_region_del -> > > > > kvm_set_phys_mem-> > > > > kvm_set_user_memory_region-> > > > > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) > > > > > > > > [ 361.094493] __synchronize_srcu loop=9000 > > > > [ 361.094501] Call trace: > > > > [ 361.094502] dump_backtrace+0xe4/0xf0 > > > > [ 361.094505] show_stack+0x20/0x70 > > > > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > > > > [ 361.094509] dump_stack+0x18/0x34 > > > > [ 361.094511] __synchronize_srcu+0x120/0x128 > > > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > > > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > > > > [ 361.094519] kvm_activate_memslot+0x40/0x68 > > > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > > > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > > > > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > > > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > > > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > > > > [ 361.094530] invoke_syscall+0x4c/0x110 > > > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > > > > [ 361.094536] do_el0_svc+0x34/0xc0 > > > > [ 361.094538] el0_svc+0x30/0x98 > > > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > > > > 
[ 361.094544] el0t_64_sync+0x18c/0x190 > > > > [ 363.942817] kvm_set_memory_region loop=6000 > > > Good to know, thank you! > > > > > > Thanx, Paul > >
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 20:55 ` Shameerali Kolothum Thodi @ 2022-06-14 12:19 ` Neeraj Upadhyay 2022-06-14 14:03 ` zhangfei.gao 0 siblings, 1 reply; 37+ messages in thread From: Neeraj Upadhyay @ 2022-06-14 12:19 UTC (permalink / raw) To: Shameerali Kolothum Thodi, paulmck, zhangfei.gao Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) Hi, On 6/14/2022 2:25 AM, Shameerali Kolothum Thodi wrote: > > >> -----Original Message----- >> From: Paul E. McKenney [mailto:paulmck@kernel.org] >> Sent: 13 June 2022 15:59 >> To: zhangfei.gao@foxmail.com >> Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao >> <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org; >> rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett >> <josh@joshtriplett.org>; Mathieu Desnoyers >> <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>; >> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; >> mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com> >> Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and >> blocking readers from consuming CPU) cause qemu boot slow >> >> On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@foxmail.com >> wrote: >>> >>> >>> On 2022/6/13 下午8:18, Paul E. McKenney wrote: >>>> On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@foxmail.com >> wrote: >>>>> Hi, Paul >>>>> >>>>> On 2022/6/13 下午12:16, Paul E. McKenney wrote: >>>>>> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: >>>>>>> On Mon, Jun 13, 2022 at 11:04:39AM +0800, >> zhangfei.gao@foxmail.com wrote: >>>>>>>> Hi, Paul >>>>>>>> >>>>>>>> On 2022/6/13 上午2:49, Paul E. McKenney wrote: >>>>>>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini >> wrote: >>>>>>>>>> On 6/12/22 18:40, Paul E. 
McKenney wrote: >>>>>>>>>>>> Do these reserved memory regions really need to be >> allocated separately? >>>>>>>>>>>> (For example, are they really all non-contiguous? If not, >> that is, if >>>>>>>>>>>> there are a lot of contiguous memory regions, could you >> sort the IORT >>>>>>>>>>>> by address and do one ioctl() for each set of contiguous >> memory regions?) >>>>>>>>>>>> >>>>>>>>>>>> Are all of these reserved memory regions set up before init >> is spawned? >>>>>>>>>>>> >>>>>>>>>>>> Are all of these reserved memory regions set up while >> there is only a >>>>>>>>>>>> single vCPU up and running? >>>>>>>>>>>> >>>>>>>>>>>> Is the SRCU grace period really needed in this case? (I >> freely confess >>>>>>>>>>>> to not being all that familiar with KVM.) >>>>>>>>>>> Oh, and there was a similar many-requests problem with >> networking many >>>>>>>>>>> years ago. This was solved by adding a new >> syscall/ioctl()/whatever >>>>>>>>>>> that permitted many requests to be presented to the kernel >> with a single >>>>>>>>>>> system call. >>>>>>>>>>> >>>>>>>>>>> Could a new ioctl() be introduced that requested a large >> number >>>>>>>>>>> of these memory regions in one go so as to make each call to >>>>>>>>>>> synchronize_rcu_expedited() cover a useful fraction of your >> 9000+ >>>>>>>>>>> requests? Adding a few of the KVM guys on CC for their >> thoughts. >>>>>>>>>> Unfortunately not. Apart from this specific case, in general >> the calls to >>>>>>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to >> I/O registers in the >>>>>>>>>> guest, and those writes then map to a ioctl. Typically the >> guest sets up a >>>>>>>>>> device at a time, and each setup step causes a >> synchronize_srcu()---and >>>>>>>>>> expedited at that. >>>>>>>>> I was afraid of something like that... 
>>>>>>>>> >>>>>>>>>> KVM has two SRCUs: >>>>>>>>>> >>>>>>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it >> has readers >>>>>>>>>> that are very very small, but it needs extremely fast detection >> of grace >>>>>>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up >>>>>>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off >> kvm->srcu. Readers are >>>>>>>>>> not so frequent. >>>>>>>>>> >>>>>>>>>> 2) kvm->srcu is nastier because there are readers all the time. >> The >>>>>>>>>> read-side critical section are still short-ish, but they need the >> sleepable >>>>>>>>>> part because they access user memory. >>>>>>>>> Which one of these two is in play in this case? >>>>>>>>> >>>>>>>>>> Writers are not frequent per se; the problem is they come in >> very large >>>>>>>>>> bursts when a guest boots. And while the whole boot path >> overall can be >>>>>>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have >> a larger >>>>>>>>>> impact on runtime than the O(n^2) parts, as demonstrated >> here. >>>>>>>>>> >>>>>>>>>> Therefore, we operated on the assumption that the callers of >>>>>>>>>> synchronized_srcu_expedited were _anyway_ busy running >> CPU-bound guest code >>>>>>>>>> and the desire was to get past the booting phase as fast as >> possible. If >>>>>>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it >> wants; >>>>>>>>>> therefore, as long as expedited GPs didn't eat CPU >> *throughout the whole >>>>>>>>>> system*, a preemptable busy wait in >> synchronize_srcu_expedited() were not >>>>>>>>>> problematic. >>>>>>>>>> >>>>>>>>>> This assumptions did match the SRCU code when kvm->srcu >> and kvm->irq_srcu >>>>>>>>>> were was introduced (respectively in 2009 and 2014). But >> perhaps they do >>>>>>>>>> not hold anymore now that each SRCU is not as independent >> as it used to be >>>>>>>>>> in those years, and instead they use workqueues instead? 
>>>>>>>>> The problem was not internal to SRCU, but rather due to the fact >>>>>>>>> that kernel live patching (KLP) had problems with the >> CPU-bound tasks >>>>>>>>> resulting from repeated synchronize_rcu_expedited() >> invocations. So I >>>>>>>>> added heuristics to get the occasional sleep in there for KLP's >> benefit. >>>>>>>>> Perhaps these heuristics need to be less aggressive about adding >> sleep. >>>>>>>>> >>>>>>>>> These heuristics have these aspects: >>>>>>>>> >>>>>>>>> 1. The longer readers persist in an expedited SRCU grace period, >>>>>>>>> the longer the wait between successive checks of the reader >>>>>>>>> state. Roughly speaking, we wait as long as the grace period >>>>>>>>> has currently been in effect, capped at ten jiffies. >>>>>>>>> >>>>>>>>> 2. SRCU grace periods have several phases. We reset so that >> each >>>>>>>>> phase starts by not waiting (new phase, new set of readers, >>>>>>>>> so don't penalize this set for the sins of the previous set). >>>>>>>>> But once we get to the point of adding delay, we add the >>>>>>>>> delay based on the beginning of the full grace period. >>>>>>>>> >>>>>>>>> Right now, the checking for grace-period length does not allow >> for the >>>>>>>>> possibility that a grace period might start just before the jiffies >>>>>>>>> counter gets incremented (because I didn't realize that anyone >> cared), >>>>>>>>> so that is one possible thing to change. I can also allow more >> no-delay >>>>>>>>> checks per SRCU grace-period phase. >>>>>>>>> >>>>>>>>> Zhangfei, does something like the patch shown below help? >>>>>>>>> >>>>>>>>> Additional adjustments are likely needed to avoid re-breaking >> KLP, >>>>>>>>> but we have to start somewhere... 
>>>>>>>>> >>>>>>>>> Thanx, Paul >>>>>>>>> >>>>>>>>> ------------------------------------------------------------------------ >>>>>>>>> >>>>>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>>>>>>>> index 50ba70f019dea..6a354368ac1d1 100644 >>>>>>>>> --- a/kernel/rcu/srcutree.c >>>>>>>>> +++ b/kernel/rcu/srcutree.c >>>>>>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct >> srcu_struct *ssp) >>>>>>>>> #define SRCU_INTERVAL 1 // Base delay if no >> expedited GPs pending. >>>>>>>>> #define SRCU_MAX_INTERVAL 10 // Maximum >> incremental delay from slow readers. >>>>>>>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum >> per-GP-phase consecutive no-delay instances. >>>>>>>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum >> per-GP-phase consecutive no-delay instances. >>>>>>>>> #define SRCU_MAX_NODELAY 100 // Maximum >> consecutive no-delay instances. >>>>>>>>> /* >>>>>>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct >> srcu_struct *ssp) >>>>>>>>> */ >>>>>>>>> static unsigned long srcu_get_delay(struct srcu_struct >> *ssp) >>>>>>>>> { >>>>>>>>> + unsigned long gpstart; >>>>>>>>> + unsigned long j; >>>>>>>>> unsigned long jbase = SRCU_INTERVAL; >>>>>>>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>>>>>>> jbase = 0; >>>>>>>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) >>>>>>>>> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); >>>>>>>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { >>>>>>>>> + j = jiffies - 1; >>>>>>>>> + gpstart = READ_ONCE(ssp->srcu_gp_start); >>>>>>>>> + if (time_after(j, gpstart)) >>>>>>>>> + jbase += j - gpstart; >>>>>>>>> + } >>>>>>>>> if (!jbase) { >>>>>>>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, >> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>>>>>>>> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >> SRCU_MAX_NODELAY_PHASE) >>>>>>>> Unfortunately, this patch does not helpful. >>>>>>>> >>>>>>>> Then re-add the debug info. 
>>>>>>>> >>>>>>>> During the qemu boot >>>>>>>> [ 232.997667] __synchronize_srcu loop=1000 >>>>>>>> >>>>>>>> [ 361.094493] __synchronize_srcu loop=9000 >>>>>>>> [ 361.094501] Call trace: >>>>>>>> [ 361.094502] dump_backtrace+0xe4/0xf0 >>>>>>>> [ 361.094505] show_stack+0x20/0x70 >>>>>>>> [ 361.094507] dump_stack_lvl+0x8c/0xb8 >>>>>>>> [ 361.094509] dump_stack+0x18/0x34 >>>>>>>> [ 361.094511] __synchronize_srcu+0x120/0x128 >>>>>>>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 >>>>>>>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198 >>>>>>>> [ 361.094519] kvm_activate_memslot+0x40/0x68 >>>>>>>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 >>>>>>>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 >>>>>>>> [ 361.094524] kvm_set_memory_region+0x78/0xb8 >>>>>>>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 >>>>>>>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 >>>>>>>> [ 361.094530] invoke_syscall+0x4c/0x110 >>>>>>>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 >>>>>>>> [ 361.094536] do_el0_svc+0x34/0xc0 >>>>>>>> [ 361.094538] el0_svc+0x30/0x98 >>>>>>>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 >>>>>>>> [ 361.094544] el0t_64_sync+0x18c/0x190 >>>>>>>> [ 363.942817] kvm_set_memory_region loop=6000 >>>>>>> Huh. >>>>>>> >>>>>>> One possibility is that the "if (!jbase)" block needs to be nested >>>>>>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" >> block. >>>>> I test this diff and NO helpful >>>>> >>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>>>> index 50ba70f019de..36286a4b74e6 100644 >>>>> --- a/kernel/rcu/srcutree.c >>>>> +++ b/kernel/rcu/srcutree.c >>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct >> srcu_struct *ssp) >>>>> >>>>> #define SRCU_INTERVAL 1 // Base delay if no >> expedited GPs >>>>> pending. >>>>> #define SRCU_MAX_INTERVAL 10 // Maximum >> incremental delay from >>>>> slow readers. >>>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum >> per-GP-phase consecutive >>>>> no-delay instances. 
>>>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum >> per-GP-phase consecutive >>>>> no-delay instances. >>>>> #define SRCU_MAX_NODELAY 100 // Maximum >> consecutive no-delay >>>>> instances. >>>>> >>>>> /* >>>>> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct >> srcu_struct >>>>> *ssp) >>>>> */ >>>>> static unsigned long srcu_get_delay(struct srcu_struct *ssp) >>>>> { >>>>> + unsigned long gpstart; >>>>> + unsigned long j; >>>>> unsigned long jbase = SRCU_INTERVAL; >>>>> >>>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>>> jbase = 0; >>>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) >>>>> - jbase += jiffies - >> READ_ONCE(ssp->srcu_gp_start); >>>>> - if (!jbase) { >>>>> - WRITE_ONCE(ssp->srcu_n_exp_nodelay, >>>>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>>>> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>>>> SRCU_MAX_NODELAY_PHASE) >>>>> - jbase = 1; >>>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { >>>>> + j = jiffies - 1; >>>>> + gpstart = READ_ONCE(ssp->srcu_gp_start); >>>>> + if (time_after(j, gpstart)) >>>>> + jbase += j - gpstart; >>>>> + >>>>> + if (!jbase) { >>>>> >> + WRITE_ONCE(ssp->srcu_n_exp_nodelay, >>>>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>>>> + if >> (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>>>> SRCU_MAX_NODELAY_PHASE) >>>>> + jbase = 1; >>>>> + } >>>>> } >>>> That is in fact what I was intending you to test, thank you. As you >>>> say, unfortunately it did not help. >>>> >>>> Could you please test removing the "if (!jbase)" block entirely? >>> Remove "if (!jbase)" block is much faster, >>> not measure clearly, qemu (with debug version efi) boot seems normally. >>> >>> From log timestamp: >>> [ 114.624713] __synchronize_srcu loop=1000 >>> [ 124.157011] __synchronize_srcu loop=9000 >>> >>> Several method: timestamps are different. 
>>> >>> 5.19-rc1 >>> [ 94.271350] __synchronize_srcu loop=1001 >>> [ 222.621659] __synchronize_srcu loop=9001 >>> >>> >>> With your first diff: >>> [ 232.997667] __synchronize_srcu loop=1000 >>> [ 361.094493] __synchronize_srcu loop=9000 >>> >>> Remove "if (!jbase)" block >>> [ 114.624713] __synchronize_srcu loop=1000 >>> [ 124.157011] __synchronize_srcu loop=9000 >>> >>> >>> 5.18 method >>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>> + return 0; >>> + return SRCU_INTERVAL; >>> >>> [ 74.598480] __synchronize_srcu loop=9000 >>> [ 68.938297] __synchronize_srcu loop=1000 >> >> Thank you for the information! >> >> What happens if you keep the that "if (!jbase)" block", but set the >> value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000? > > From the setup I have, this is almost similar to that of the previous logic(without > the "if(!jbase)"). In both cases, I think we are not close to 5.18, but definitely much > better compared to 5.19-rc1. > > The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y), > > Guest boot time(using 'time'): > > 5.18-rc4 based ~8sec > > 5.19-rc1 ~2m43sec > > 5.19-rc1+fix1 ~19sec > > 5.19-rc1-fix2 ~19sec > If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ; does it show any difference in boot time? 
--- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp */ static void srcu_gp_end(struct srcu_struct *ssp) { - unsigned long cbdelay; + unsigned long cbdelay = 1; bool cbs; bool last_lvl; int cpu; @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) spin_lock_irq_rcu_node(ssp); idx = rcu_seq_state(ssp->srcu_gp_seq); WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); - cbdelay = !!srcu_get_delay(ssp); + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) + cbdelay = 0; + WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); Thanks Neeraj > I will wait for Zhangfei to confirm this on his setup, especially the difference > compared to 5.18. > > Thanks, > Shameer > >> This would be too large for KLP, but my hope is that there is a value >> of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does >> this help at all? ;-) >> >>>>>> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, >> the >>>>>> above change reduces the overhead by more than an order of >> magnitude. >>>>>> Except that the overhead of the series is far less than one second, >>>>>> not the several minutes that you are seeing. So the per-call >> overhead >>>>>> decreases from about 17 microseconds to a bit more than one >> microsecond. >>>>>> >>>>>> I could imagine an extra order of magnitude if you are running >> HZ=100 >>>>>> instead of the HZ=1000 that I am running. But that only gets up to a >>>>>> few seconds. >>>> One possible reason for the difference would be if your code has >>>> SRCU readers. >>>> >>>> Could you please tell me the value of CONFIG_HZ on your system? >>>> Also the value of CONFIG_PREEMPTION? >>> I am using arch/arm64/configs/defconfig >>> make defconfig >>> CONFIG_PREEMPTION=y >>> CONFIG_HZ_250=y >> >> Thank you again! >> >> And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might >> depend >> on HZ. 
And who knows what all else... >> >> Thanx, Paul >> >>> Thanks >>> >>>> >>>>>>> One additional debug is to apply the patch below on top of the one >> you >>>>> apply the patch below? >>>>>>> just now kindly tested, then use whatever debug technique you wish >> to >>>>>>> work out what fraction of the time during that critical interval that >>>>>>> srcu_get_delay() returns non-zero. >>>>> Sorry, I am confused, no patch right? >>>> Apologies, my omission. >>>> >>>>> Just measure srcu_get_delay return to non-zero? >>>> Exactly, please! >>>> >>>>> By the way, the issue should be only related with qemu apci. not related >>>>> with rmr feature >>>>> Test with: https://github.com/qemu/qemu/tree/stable-6.1 >>>>> >>>>> Looks it caused by too many kvm_region_add & kvm_region_del if >> acpi=force, >>>>> If no acpi, no print kvm_region_add/del (1000 times print once) >>>>> >>>>> If with acpi=force, >>>>> During qemu boot >>>>> kvm_region_add region_add = 1000 >>>>> kvm_region_del region_del = 1000 >>>>> kvm_region_add region_add = 2000 >>>>> kvm_region_del region_del = 2000 >>>>> kvm_region_add region_add = 3000 >>>>> kvm_region_del region_del = 3000 >>>>> kvm_region_add region_add = 4000 >>>>> kvm_region_del region_del = 4000 >>>>> kvm_region_add region_add = 5000 >>>>> kvm_region_del region_del = 5000 >>>>> kvm_region_add region_add = 6000 >>>>> kvm_region_del region_del = 6000 >>>>> >>>>> kvm_region_add/kvm_region_del -> >>>>> kvm_set_phys_mem-> >>>>> kvm_set_user_memory_region-> >>>>> kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) >>>>> >>>>> [ 361.094493] __synchronize_srcu loop=9000 >>>>> [ 361.094501] Call trace: >>>>> [ 361.094502] dump_backtrace+0xe4/0xf0 >>>>> [ 361.094505] show_stack+0x20/0x70 >>>>> [ 361.094507] dump_stack_lvl+0x8c/0xb8 >>>>> [ 361.094509] dump_stack+0x18/0x34 >>>>> [ 361.094511] __synchronize_srcu+0x120/0x128 >>>>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 >>>>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198 >>>>> [ 361.094519] 
kvm_activate_memslot+0x40/0x68 >>>>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 >>>>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 >>>>> [ 361.094524] kvm_set_memory_region+0x78/0xb8 >>>>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 >>>>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 >>>>> [ 361.094530] invoke_syscall+0x4c/0x110 >>>>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 >>>>> [ 361.094536] do_el0_svc+0x34/0xc0 >>>>> [ 361.094538] el0_svc+0x30/0x98 >>>>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 >>>>> [ 361.094544] el0t_64_sync+0x18c/0x190 >>>>> [ 363.942817] kvm_set_memory_region loop=6000 >>>> Good to know, thank you! >>>> >>>> Thanx, Paul >>>
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-14 12:19 ` Neeraj Upadhyay @ 2022-06-14 14:03 ` zhangfei.gao 2022-06-14 14:14 ` Neeraj Upadhyay 2022-06-14 14:17 ` Paul E. McKenney 0 siblings, 2 replies; 37+ messages in thread From: zhangfei.gao @ 2022-06-14 14:03 UTC (permalink / raw) To: Neeraj Upadhyay, Shameerali Kolothum Thodi, paulmck Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: > >> >> 5.18-rc4 based ~8sec >> >> 5.19-rc1 ~2m43sec >> >> 5.19-rc1+fix1 ~19sec >> >> 5.19-rc1-fix2 ~19sec >> > > If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 > ; does it show any difference in boot time? > > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct > srcu_struct *ssp, struct srcu_node *snp > */ > static void srcu_gp_end(struct srcu_struct *ssp) > { > - unsigned long cbdelay; > + unsigned long cbdelay = 1; > bool cbs; > bool last_lvl; > int cpu; > @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) > spin_lock_irq_rcu_node(ssp); > idx = rcu_seq_state(ssp->srcu_gp_seq); > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); > - cbdelay = !!srcu_get_delay(ssp); > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > + cbdelay = 0; > + > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); > Test here: qemu: https://github.com/qemu/qemu/tree/stable-6.1 kernel: https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test (in case test patch not clear, push in git tree) Hardware: aarch64 1. 5.18-rc6 real 0m8.402s user 0m3.015s sys 0m1.102s 2. 5.19-rc1 real 2m41.433s user 0m3.097s sys 0m1.177s 3. 5.19-rc1 + fix1 from Paul real 2m43.404s user 0m2.880s sys 0m1.214s 4. 
5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block real 0m15.262s user 0m3.003s sys 0m1.033s When building a kernel at the same time, the load time becomes longer. 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 real 0m15.215s user 0m2.942s sys 0m1.172s 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end real 1m23.936s user 0m2.969s sys 0m1.181s More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5 Thanks
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-14 14:03 ` zhangfei.gao @ 2022-06-14 14:14 ` Neeraj Upadhyay 2022-06-14 14:57 ` zhangfei.gao 2022-06-14 14:17 ` Paul E. McKenney 1 sibling, 1 reply; 37+ messages in thread From: Neeraj Upadhyay @ 2022-06-14 14:14 UTC (permalink / raw) To: zhangfei.gao, Shameerali Kolothum Thodi, paulmck Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 6/14/2022 7:33 PM, zhangfei.gao@foxmail.com wrote: > > > On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: >> >>> >>> 5.18-rc4 based ~8sec >>> >>> 5.19-rc1 ~2m43sec >>> >>> 5.19-rc1+fix1 ~19sec >>> >>> 5.19-rc1-fix2 ~19sec >>> >> >> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 >> ; does it show any difference in boot time? >> >> --- a/kernel/rcu/srcutree.c >> +++ b/kernel/rcu/srcutree.c >> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct >> srcu_struct *ssp, struct srcu_node *snp >> */ >> static void srcu_gp_end(struct srcu_struct *ssp) >> { >> - unsigned long cbdelay; >> + unsigned long cbdelay = 1; >> bool cbs; >> bool last_lvl; >> int cpu; >> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >> spin_lock_irq_rcu_node(ssp); >> idx = rcu_seq_state(ssp->srcu_gp_seq); >> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >> - cbdelay = !!srcu_get_delay(ssp); >> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >> + cbdelay = 0; >> + >> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >> > Test here: > qemu: https://github.com/qemu/qemu/tree/stable-6.1 > kernel: > https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test > (in case test patch not clear, push in git tree) > > Hardware: aarch64 > > 1. 5.18-rc6 > real 0m8.402s > user 0m3.015s > sys 0m1.102s > > 2. 
5.19-rc1 > real 2m41.433s > user 0m3.097s > sys 0m1.177s > > 3. 5.19-rc1 + fix1 from Paul > real 2m43.404s > user 0m2.880s > sys 0m1.214s > > 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block > real 0m15.262s > user 0m3.003s > sys 0m1.033s > > When build kernel in the meantime, load time become longer. > > 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 > real 0m15.215s > user 0m2.942s > sys 0m1.172s > > 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end > real 1m23.936s > user 0m2.969s > sys 0m1.181s > Thanks for this data. Can you please share below test combo also? 7. 5.19-rc1 + fix5: fix2 + Neeraj's change of srcu_gp_end 8. 5.19-rc1 + fix6: fix3 + Neeraj's change of srcu_gp_end Thanks Neeraj > More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5 > > Thanks > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-14 14:14 ` Neeraj Upadhyay @ 2022-06-14 14:57 ` zhangfei.gao 0 siblings, 0 replies; 37+ messages in thread From: zhangfei.gao @ 2022-06-14 14:57 UTC (permalink / raw) To: Neeraj Upadhyay, Shameerali Kolothum Thodi, paulmck Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 2022/6/14 下午10:14, Neeraj Upadhyay wrote: > > > On 6/14/2022 7:33 PM, zhangfei.gao@foxmail.com wrote: >> >> >> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: >>> >>>> >>>> 5.18-rc4 based ~8sec >>>> >>>> 5.19-rc1 ~2m43sec >>>> >>>> 5.19-rc1+fix1 ~19sec >>>> >>>> 5.19-rc1-fix2 ~19sec >>>> >>> >>> If you try below diff on top of either 5.19-rc1+fix1 or >>> 5.19-rc1-fix2 ; does it show any difference in boot time? >>> >>> --- a/kernel/rcu/srcutree.c >>> +++ b/kernel/rcu/srcutree.c >>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct >>> srcu_struct *ssp, struct srcu_node *snp >>> */ >>> static void srcu_gp_end(struct srcu_struct *ssp) >>> { >>> - unsigned long cbdelay; >>> + unsigned long cbdelay = 1; >>> bool cbs; >>> bool last_lvl; >>> int cpu; >>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>> spin_lock_irq_rcu_node(ssp); >>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>> - cbdelay = !!srcu_get_delay(ssp); >>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>> + cbdelay = 0; >>> + >>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >>> >> Test here: >> qemu: https://github.com/qemu/qemu/tree/stable-6.1 >> kernel: >> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test >> (in case test patch not clear, push in git tree) >> >> Hardware: aarch64 >> >> 1. 5.18-rc6 >> real 0m8.402s >> user 0m3.015s >> sys 0m1.102s >> >> 2. 
5.19-rc1 >> real 2m41.433s >> user 0m3.097s >> sys 0m1.177s >> >> 3. 5.19-rc1 + fix1 from Paul >> real 2m43.404s >> user 0m2.880s >> sys 0m1.214s >> >> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block >> real 0m15.262s >> user 0m3.003s >> sys 0m1.033s >> >> When build kernel in the meantime, load time become longer. >> >> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 >> real 0m15.215s >> user 0m2.942s >> sys 0m1.172s >> >> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end >> real 1m23.936s >> user 0m2.969s >> sys 0m1.181s >> > 7. 5.19-rc1 + fix5: fix4 + Remove "if (!jbase)" block real 0m11.418s user 0m3.031s sys 0m1.067s 8. 5.19-rc1 + fix 6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000 real 0m11.154s ~12s user 0m2.919s sys 0m1.064s Thanks > Thanks for this data. Can you please share below test combo also? > > 7. 5.19-rc1 + fix5: fix2 + Neeraj's change of srcu_gp_end > > > 8. 5.19-rc1 + fix6: fix3 + Neeraj's change of srcu_gp_end > > > Thanks > Neeraj > >> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5 >> >> Thanks >> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-14 14:03 ` zhangfei.gao 2022-06-14 14:14 ` Neeraj Upadhyay @ 2022-06-14 14:17 ` Paul E. McKenney 2022-06-15 9:03 ` zhangfei.gao 1 sibling, 1 reply; 37+ messages in thread From: Paul E. McKenney @ 2022-06-14 14:17 UTC (permalink / raw) To: zhangfei.gao Cc: Neeraj Upadhyay, Shameerali Kolothum Thodi, Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com wrote: > > > On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: > > > > > > > > 5.18-rc4 based ~8sec > > > > > > 5.19-rc1 ~2m43sec > > > > > > 5.19-rc1+fix1 ~19sec > > > > > > 5.19-rc1-fix2 ~19sec > > > > > > > If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ; > > does it show any difference in boot time? > > > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct > > *ssp, struct srcu_node *snp > > */ > > static void srcu_gp_end(struct srcu_struct *ssp) > > { > > - unsigned long cbdelay; > > + unsigned long cbdelay = 1; > > bool cbs; > > bool last_lvl; > > int cpu; > > @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) > > spin_lock_irq_rcu_node(ssp); > > idx = rcu_seq_state(ssp->srcu_gp_seq); > > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); > > - cbdelay = !!srcu_get_delay(ssp); > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > + cbdelay = 0; > > + > > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); Thank you both for the testing and the proposed fix! 
> Test here: > qemu: https://github.com/qemu/qemu/tree/stable-6.1 > kernel: > https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test > (in case test patch not clear, push in git tree) > > Hardware: aarch64 > > 1. 5.18-rc6 > real 0m8.402s > user 0m3.015s > sys 0m1.102s > > 2. 5.19-rc1 > real 2m41.433s > user 0m3.097s > sys 0m1.177s > > 3. 5.19-rc1 + fix1 from Paul > real 2m43.404s > user 0m2.880s > sys 0m1.214s > > 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block > real 0m15.262s > user 0m3.003s > sys 0m1.033s > > When build kernel in the meantime, load time become longer. > > 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 > real 0m15.215s > user 0m2.942s > sys 0m1.172s > > 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end > real 1m23.936s > user 0m2.969s > sys 0m1.181s And thank you for the testing! Could you please try fix3 + Neeraj's change of srcu_gp_end? That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of srcu_gp_end. Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot times start rising? This is probably best done by starting with SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run until boot time becomes slow, followed by a binary search between the last two values. (The idea is to bias the search so that fast boot times are the common case.) > More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5 And thank you for these details. Thanx, Paul ^ permalink raw reply [flat|nested] 37+ messages in thread
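[Editor's note] Paul's suggested procedure (divide the candidate SRCU_MAX_NODELAY_PHASE by ten while boots stay fast, then binary-search between the last two values) can be written down as a small driver. This is only an illustrative sketch: `is_fast` is a made-up stand-in for "rebuild the kernel with this value and time the QEMU boot", which in practice is a manual step:

```c
#include <stdbool.h>

/*
 * Biased search for the smallest "still fast" SRCU_MAX_NODELAY_PHASE:
 * divide by ten while boots stay fast, then binary-search between the
 * last fast value and the first slow one.
 */
static long find_threshold(long start, bool (*is_fast)(long))
{
	long hi = start;	/* last value known (or assumed) fast */
	long lo = start;

	while (lo >= 10 && is_fast(lo)) {
		hi = lo;
		lo /= 10;	/* bias the search toward fast boots */
	}
	if (is_fast(lo))
		return lo;	/* fast all the way down */

	/* invariant: lo boots slow, hi boots fast */
	while (hi - lo > 1) {
		long mid = lo + (hi - lo) / 2;

		if (is_fast(mid))
			hi = mid;
		else
			lo = mid;
	}
	return hi;		/* smallest value that still boots fast */
}

/* purely illustrative stand-in: pretend values >= 400 boot fast */
static bool fake_is_fast(long v)
{
	return v >= 400;
}
```

The bias matters because each "fast" probe costs one quick boot, whereas each "slow" probe costs minutes; the divide-by-ten phase keeps most probes on the fast side.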
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-14 14:17 ` Paul E. McKenney @ 2022-06-15 9:03 ` zhangfei.gao 2022-06-15 10:40 ` Neeraj Upadhyay 0 siblings, 1 reply; 37+ messages in thread From: zhangfei.gao @ 2022-06-15 9:03 UTC (permalink / raw) To: paulmck, zhangfei Cc: Neeraj Upadhyay, Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 2022/6/14 下午10:17, Paul E. McKenney wrote: > On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com wrote: >> >> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: >>>> 5.18-rc4 based ~8sec >>>> >>>> 5.19-rc1 ~2m43sec >>>> >>>> 5.19-rc1+fix1 ~19sec >>>> >>>> 5.19-rc1-fix2 ~19sec >>>> >>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ; >>> does it show any difference in boot time? >>> >>> --- a/kernel/rcu/srcutree.c >>> +++ b/kernel/rcu/srcutree.c >>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct >>> *ssp, struct srcu_node *snp >>> */ >>> static void srcu_gp_end(struct srcu_struct *ssp) >>> { >>> - unsigned long cbdelay; >>> + unsigned long cbdelay = 1; >>> bool cbs; >>> bool last_lvl; >>> int cpu; >>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>> spin_lock_irq_rcu_node(ssp); >>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>> - cbdelay = !!srcu_get_delay(ssp); >>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>> + cbdelay = 0; >>> + >>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); > Thank you both for the testing and the proposed fix! 
> >> Test here: >> qemu: https://github.com/qemu/qemu/tree/stable-6.1 >> kernel: >> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test >> (in case test patch not clear, push in git tree) >> >> Hardware: aarch64 >> >> 1. 5.18-rc6 >> real 0m8.402s >> user 0m3.015s >> sys 0m1.102s >> >> 2. 5.19-rc1 >> real 2m41.433s >> user 0m3.097s >> sys 0m1.177s >> >> 3. 5.19-rc1 + fix1 from Paul >> real 2m43.404s >> user 0m2.880s >> sys 0m1.214s >> >> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block >> real 0m15.262s >> user 0m3.003s >> sys 0m1.033s >> >> When build kernel in the meantime, load time become longer. >> >> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 >> real 0m15.215s >> user 0m2.942s >> sys 0m1.172s >> >> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end >> real 1m23.936s >> user 0m2.969s >> sys 0m1.181s > And thank you for the testing! > > Could you please try fix3 + Neeraj's change of srcu_gp_end? > > That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of > srcu_gp_end. > > Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot > times start rising? This is probably best done by starting with > SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run > until boot time becomes slow, followed by a binary search between the > last two values. (The idea is to bias the search so that fast boot > times are the common case.) SRCU_MAX_NODELAY_PHASE 100 becomes slower. 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000 real 0m11.154s ~12s user 0m2.919s sys 0m1.064s 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000 real 0m11.258s user 0m3.113s sys 0m1.073s 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 real 0m30.053s ~ 32s user 0m2.827s sys 0m1.161s By the way, if build kernel on the board in the meantime (using memory), time become much longer. real 1m2.763s 11. 
5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
real 0m11.443s
user 0m3.022s
sys 0m1.052s

Thanks

>
>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
> And thank you for these details.
>
> Thanx, Paul
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-15 9:03 ` zhangfei.gao @ 2022-06-15 10:40 ` Neeraj Upadhyay 2022-06-15 10:50 ` Paolo Bonzini 2022-06-18 3:07 ` zhangfei.gao 0 siblings, 2 replies; 37+ messages in thread From: Neeraj Upadhyay @ 2022-06-15 10:40 UTC (permalink / raw) To: zhangfei.gao, paulmck, zhangfei Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) Hi, On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote: > > > On 2022/6/14 下午10:17, Paul E. McKenney wrote: >> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com wrote: >>> >>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: >>>>> 5.18-rc4 based ~8sec >>>>> >>>>> 5.19-rc1 ~2m43sec >>>>> >>>>> 5.19-rc1+fix1 ~19sec >>>>> >>>>> 5.19-rc1-fix2 ~19sec >>>>> >>>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ; >>>> does it show any difference in boot time? >>>> >>>> --- a/kernel/rcu/srcutree.c >>>> +++ b/kernel/rcu/srcutree.c >>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct >>>> srcu_struct >>>> *ssp, struct srcu_node *snp >>>> */ >>>> static void srcu_gp_end(struct srcu_struct *ssp) >>>> { >>>> - unsigned long cbdelay; >>>> + unsigned long cbdelay = 1; >>>> bool cbs; >>>> bool last_lvl; >>>> int cpu; >>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>>> spin_lock_irq_rcu_node(ssp); >>>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>>> - cbdelay = !!srcu_get_delay(ssp); >>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>> + cbdelay = 0; >>>> + >>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >> Thank you both for the testing and the proposed fix! 
>> >>> Test here: >>> qemu: https://github.com/qemu/qemu/tree/stable-6.1 >>> kernel: >>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test >>> >>> (in case test patch not clear, push in git tree) >>> >>> Hardware: aarch64 >>> >>> 1. 5.18-rc6 >>> real 0m8.402s >>> user 0m3.015s >>> sys 0m1.102s >>> >>> 2. 5.19-rc1 >>> real 2m41.433s >>> user 0m3.097s >>> sys 0m1.177s >>> >>> 3. 5.19-rc1 + fix1 from Paul >>> real 2m43.404s >>> user 0m2.880s >>> sys 0m1.214s >>> >>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block >>> real 0m15.262s >>> user 0m3.003s >>> sys 0m1.033s >>> >>> When build kernel in the meantime, load time become longer. >>> >>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 >>> real 0m15.215s >>> user 0m2.942s >>> sys 0m1.172s >>> >>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end >>> real 1m23.936s >>> user 0m2.969s >>> sys 0m1.181s >> And thank you for the testing! >> >> Could you please try fix3 + Neeraj's change of srcu_gp_end? >> >> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of >> srcu_gp_end. >> >> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot >> times start rising? This is probably best done by starting with >> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run >> until boot time becomes slow, followed by a binary search between the >> last two values. (The idea is to bias the search so that fast boot >> times are the common case.) > > SRCU_MAX_NODELAY_PHASE 100 becomes slower. > > > 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000 > > real 0m11.154s ~12s > > user 0m2.919s > > sys 0m1.064s > > > > 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000 > > real 0m11.258s > > user 0m3.113s > > sys 0m1.073s > > > > 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 > > real 0m30.053s ~ 32s > > user 0m2.827s > > sys 0m1.161s > > > > By the way, if build kernel on the board in the meantime (using memory), > time become much longer. 
> > real 1m2.763s > > > > 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000 > > real 0m11.443s > > user 0m3.022s > > sys 0m1.052s > > This is useful data, thanks! Did you get chance to check between 100 and 1000, to narrow down further, from which point (does need to be exact value) between 100 and 1000, you start seeing degradation at, for ex. 250, 500 , ...? Is it also possible to try experiment 10 and 11 with below diff. What I have done in below diff is, call srcu_get_delay() only once in try_check_zero() (and not for every loop iteration); also retry with a different delay for the extra iteration which is done when srcu_get_delay(ssp) returns 0. Once we have this data, can you also try by changing SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff. #define SRCU_RETRY_CHECK_LONG_DELAY 100 ------------------------------------------------------------------------- diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 6a354368ac1d..3aff2f3e99ab 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); * we repeatedly block for 1-millisecond time periods. */ #define SRCU_RETRY_CHECK_DELAY 5 +#define SRCU_RETRY_CHECK_LONG_DELAY 5 /* * Start an SRCU grace period. @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, */ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) { + unsigned long curdelay; + curdelay = !srcu_get_delay(ssp); for (;;) { if (srcu_readers_active_idx_check(ssp, idx)) return true; - if (--trycount + !srcu_get_delay(ssp) <= 0) + if (--trycount + curdelay <= 0) return false; - udelay(SRCU_RETRY_CHECK_DELAY); + if (trycount) + udelay(SRCU_RETRY_CHECK_DELAY); + else + udelay(SRCU_RETRY_CHECK_LONG_DELAY); } } Thanks Neeraj > Thanks > >> >>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5 >> And thank you for these details. 
>> >> Thanx, Paul > ^ permalink raw reply related [flat|nested] 37+ messages in thread
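[Editor's note] To see what Neeraj's try_check_zero() change does to the worst-case spin time, it helps to model the loop's delay budget in userspace. The sketch below is a model of the diff above, not kernel code: it assumes srcu_readers_active_idx_check() never succeeds, counts the microseconds the loop would pass to udelay(), and plugs in the SRCU_RETRY_CHECK_LONG_DELAY value of 100 that Neeraj proposed for the follow-up experiment:

```c
/* delay constants from the diff; 100 is the proposed experiment value */
#define SRCU_RETRY_CHECK_DELAY		5	/* microseconds */
#define SRCU_RETRY_CHECK_LONG_DELAY	100	/* microseconds */

/*
 * Worst-case microseconds the patched try_check_zero() busy-waits
 * before giving up, assuming the reader check keeps failing.
 * curdelay mirrors "!srcu_get_delay(ssp)": 1 means no extra delay is
 * configured (e.g. an expedited GP is pending), which buys one extra
 * retry, and that extra retry uses the long delay.
 */
static long worst_case_spin_us(int trycount, int curdelay)
{
	long total_us = 0;

	for (;;) {
		if (--trycount + curdelay <= 0)
			return total_us;
		total_us += trycount ? SRCU_RETRY_CHECK_DELAY
				     : SRCU_RETRY_CHECK_LONG_DELAY;
	}
}
```

For trycount=2, the expedited path (curdelay=1) spins 5 + 100 = 105 µs before giving up, versus 5 µs otherwise; the long delay is only ever paid once per call, on the final retry.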
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-15 10:40 ` Neeraj Upadhyay @ 2022-06-15 10:50 ` Paolo Bonzini 2022-06-15 11:04 ` Neeraj Upadhyay 2022-06-18 3:07 ` zhangfei.gao 1 sibling, 1 reply; 37+ messages in thread From: Paolo Bonzini @ 2022-06-15 10:50 UTC (permalink / raw) To: Neeraj Upadhyay, zhangfei.gao, paulmck, zhangfei Cc: Shameerali Kolothum Thodi, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 6/15/22 12:40, Neeraj Upadhyay wrote: > > This is useful data, thanks! Did you get chance to check between 100 and > 1000, to narrow down further, from which point (does need to be exact > value) between 100 and 1000, you start seeing degradation at, for ex. > 250, 500 , ...? > > Is it also possible to try experiment 10 and 11 with below diff. > What I have done in below diff is, call srcu_get_delay() only once > in try_check_zero() (and not for every loop iteration); also > retry with a different delay for the extra iteration which is done > when srcu_get_delay(ssp) returns 0. > > Once we have this data, can you also try by changing > SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff. > > #define SRCU_RETRY_CHECK_LONG_DELAY 100 Is there any data that you would like me to gather from the KVM side, for example with respect to how much it takes to do synchronize_srcu() on an unpatched kernel, or the duration of the read-sides? Thanks, Paolo ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-15 10:50 ` Paolo Bonzini @ 2022-06-15 11:04 ` Neeraj Upadhyay 0 siblings, 0 replies; 37+ messages in thread From: Neeraj Upadhyay @ 2022-06-15 11:04 UTC (permalink / raw) To: Paolo Bonzini, zhangfei.gao, paulmck, zhangfei Cc: Shameerali Kolothum Thodi, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 6/15/2022 4:20 PM, Paolo Bonzini wrote: > On 6/15/22 12:40, Neeraj Upadhyay wrote: >> >> This is useful data, thanks! Did you get chance to check between 100 >> and 1000, to narrow down further, from which point (does need to be >> exact value) between 100 and 1000, you start seeing degradation at, >> for ex. 250, 500 , ...? >> >> Is it also possible to try experiment 10 and 11 with below diff. >> What I have done in below diff is, call srcu_get_delay() only once >> in try_check_zero() (and not for every loop iteration); also >> retry with a different delay for the extra iteration which is done >> when srcu_get_delay(ssp) returns 0. >> >> Once we have this data, can you also try by changing >> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff. >> >> #define SRCU_RETRY_CHECK_LONG_DELAY 100 > > Is there any data that you would like me to gather from the KVM side, > for example with respect to how much it takes to do synchronize_srcu() > on an unpatched kernel, or the duration of the read-sides? > Hi Paolo, Thanks! I think synchronize_srcu() time and read side duration will help. Given that changing SRCU_MAX_NODELAY_PHASE value has impact on the boot time (and SRCU_MAX_NODELAY_PHASE impact is dependent on duration of SRCU read side duration), this information will be helpful to get more understanding of the scenario. Thanks Neeraj > Thanks, > > Paolo > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-15 10:40 ` Neeraj Upadhyay 2022-06-15 10:50 ` Paolo Bonzini @ 2022-06-18 3:07 ` zhangfei.gao 2022-06-20 7:50 ` Neeraj Upadhyay 1 sibling, 1 reply; 37+ messages in thread From: zhangfei.gao @ 2022-06-18 3:07 UTC (permalink / raw) To: Neeraj Upadhyay, paulmck, zhangfei Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 2022/6/15 下午6:40, Neeraj Upadhyay wrote: > Hi, > > On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote: >> >> >> On 2022/6/14 下午10:17, Paul E. McKenney wrote: >>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com >>> wrote: >>>> >>>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: >>>>>> 5.18-rc4 based ~8sec >>>>>> >>>>>> 5.19-rc1 ~2m43sec >>>>>> >>>>>> 5.19-rc1+fix1 ~19sec >>>>>> >>>>>> 5.19-rc1-fix2 ~19sec >>>>>> >>>>> If you try below diff on top of either 5.19-rc1+fix1 or >>>>> 5.19-rc1-fix2 ; >>>>> does it show any difference in boot time? 
>>>>> >>>>> --- a/kernel/rcu/srcutree.c >>>>> +++ b/kernel/rcu/srcutree.c >>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct >>>>> srcu_struct >>>>> *ssp, struct srcu_node *snp >>>>> */ >>>>> static void srcu_gp_end(struct srcu_struct *ssp) >>>>> { >>>>> - unsigned long cbdelay; >>>>> + unsigned long cbdelay = 1; >>>>> bool cbs; >>>>> bool last_lvl; >>>>> int cpu; >>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>>>> spin_lock_irq_rcu_node(ssp); >>>>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>>>> - cbdelay = !!srcu_get_delay(ssp); >>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>>> + cbdelay = 0; >>>>> + >>>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >>> Thank you both for the testing and the proposed fix! >>> >>>> Test here: >>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1 >>>> kernel: >>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test >>>> >>>> (in case test patch not clear, push in git tree) >>>> >>>> Hardware: aarch64 >>>> >>>> 1. 5.18-rc6 >>>> real 0m8.402s >>>> user 0m3.015s >>>> sys 0m1.102s >>>> >>>> 2. 5.19-rc1 >>>> real 2m41.433s >>>> user 0m3.097s >>>> sys 0m1.177s >>>> >>>> 3. 5.19-rc1 + fix1 from Paul >>>> real 2m43.404s >>>> user 0m2.880s >>>> sys 0m1.214s >>>> >>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block >>>> real 0m15.262s >>>> user 0m3.003s >>>> sys 0m1.033s >>>> >>>> When build kernel in the meantime, load time become longer. >>>> >>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 >>>> real 0m15.215s >>>> user 0m2.942s >>>> sys 0m1.172s >>>> >>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end >>>> real 1m23.936s >>>> user 0m2.969s >>>> sys 0m1.181s >>> And thank you for the testing! >>> >>> Could you please try fix3 + Neeraj's change of srcu_gp_end? 
>>> >>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of >>> srcu_gp_end. >>> >>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot >>> times start rising? This is probably best done by starting with >>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run >>> until boot time becomes slow, followed by a binary search between the >>> last two values. (The idea is to bias the search so that fast boot >>> times are the common case.) >> >> SRCU_MAX_NODELAY_PHASE 100 becomes slower. >> >> >> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000 >> >> real 0m11.154s ~12s >> >> user 0m2.919s >> >> sys 0m1.064s >> >> >> >> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000 >> >> real 0m11.258s >> >> user 0m3.113s >> >> sys 0m1.073s >> >> >> >> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 >> >> real 0m30.053s ~ 32s >> >> user 0m2.827s >> >> sys 0m1.161s >> >> >> >> By the way, if build kernel on the board in the meantime (using >> memory), time become much longer. >> >> real 1m2.763s >> >> >> >> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000 >> >> real 0m11.443s >> >> user 0m3.022s >> >> sys 0m1.052s >> >> > > This is useful data, thanks! Did you get chance to check between 100 > and 1000, to narrow down further, from which point (does need to be > exact value) between 100 and 1000, you start seeing degradation at, > for ex. 250, 500 , ...? > > Is it also possible to try experiment 10 and 11 with below diff. > What I have done in below diff is, call srcu_get_delay() only once > in try_check_zero() (and not for every loop iteration); also > retry with a different delay for the extra iteration which is done > when srcu_get_delay(ssp) returns 0. > > Once we have this data, can you also try by changing > SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff. 
> > #define SRCU_RETRY_CHECK_LONG_DELAY 100 > > ------------------------------------------------------------------------- > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 6a354368ac1d..3aff2f3e99ab 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); > * we repeatedly block for 1-millisecond time periods. > */ > #define SRCU_RETRY_CHECK_DELAY 5 > +#define SRCU_RETRY_CHECK_LONG_DELAY 5 > > /* > * Start an SRCU grace period. > @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct > srcu_struct *ssp, struct srcu_data *sdp, > */ > static bool try_check_zero(struct srcu_struct *ssp, int idx, int > trycount) > { > + unsigned long curdelay; > + curdelay = !srcu_get_delay(ssp); > for (;;) { > if (srcu_readers_active_idx_check(ssp, idx)) > return true; > - if (--trycount + !srcu_get_delay(ssp) <= 0) > + if (--trycount + curdelay <= 0) > return false; > - udelay(SRCU_RETRY_CHECK_DELAY); > + if (trycount) > + udelay(SRCU_RETRY_CHECK_DELAY); > + else > + udelay(SRCU_RETRY_CHECK_LONG_DELAY); > } > } > 11. 
5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
real 0m11.443s
user 0m3.022s
sys 0m1.052s

fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
real 0m11.401s
user 0m2.798s
sys 0m1.328s

fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
real 0m15.748s
user 0m2.781s
sys 0m1.294s

fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
real 0m20.704s ~21s
user 0m2.954s
sys 0m1.226s

fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
real 0m25.151s
user 0m2.980s
sys 0m1.256s

fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
real 0m30.053s ~32s
user 0m2.827s
sys 0m1.161s

fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
real 0m19.263s
user 0m3.018s
sys 0m1.211s

fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 100
real 0m9.347s
user 0m3.132s
sys 0m1.041s

And Shameer suggests this method, to decrease region_add/del time from 6000+ to 200+, which also works on 5.19-rc1:

Make the EFI flash image file:
$ dd if=/dev/zero of=flash0.img bs=1M count=64
$ dd if=./QEMU_EFI-2022.fd of=flash0.img conv=notrunc
$ dd if=/dev/zero of=flash1.img bs=1M count=64

Include the below line instead of "-bios QEMU_EFI.fd" in the qemu cmd line:
-pflash flash0.img -pflash flash1.img \

Thanks

>
> Thanks
> Neeraj
>
>> Thanks
>>
>>>
>>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>> And thank you for these details.
>>>
>>> Thanx, Paul
>>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-18 3:07 ` zhangfei.gao @ 2022-06-20 7:50 ` Neeraj Upadhyay 2022-06-24 15:30 ` zhangfei.gao 0 siblings, 1 reply; 37+ messages in thread From: Neeraj Upadhyay @ 2022-06-20 7:50 UTC (permalink / raw) To: zhangfei.gao, paulmck, zhangfei Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) Hi, On 6/18/2022 8:37 AM, zhangfei.gao@foxmail.com wrote: > > > On 2022/6/15 下午6:40, Neeraj Upadhyay wrote: >> Hi, >> >> On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote: >>> >>> >>> On 2022/6/14 下午10:17, Paul E. McKenney wrote: >>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com >>>> wrote: >>>>> >>>>> On 2022/6/14 下午8:19, Neeraj Upadhyay wrote: >>>>>>> 5.18-rc4 based ~8sec >>>>>>> >>>>>>> 5.19-rc1 ~2m43sec >>>>>>> >>>>>>> 5.19-rc1+fix1 ~19sec >>>>>>> >>>>>>> 5.19-rc1-fix2 ~19sec >>>>>>> >>>>>> If you try below diff on top of either 5.19-rc1+fix1 or >>>>>> 5.19-rc1-fix2 ; >>>>>> does it show any difference in boot time? 
>>>>>> >>>>>> --- a/kernel/rcu/srcutree.c >>>>>> +++ b/kernel/rcu/srcutree.c >>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct >>>>>> srcu_struct >>>>>> *ssp, struct srcu_node *snp >>>>>> */ >>>>>> static void srcu_gp_end(struct srcu_struct *ssp) >>>>>> { >>>>>> - unsigned long cbdelay; >>>>>> + unsigned long cbdelay = 1; >>>>>> bool cbs; >>>>>> bool last_lvl; >>>>>> int cpu; >>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>>>>> spin_lock_irq_rcu_node(ssp); >>>>>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>>>>> - cbdelay = !!srcu_get_delay(ssp); >>>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>>>> + cbdelay = 0; >>>>>> + >>>>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >>>> Thank you both for the testing and the proposed fix! >>>> >>>>> Test here: >>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1 >>>>> kernel: >>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test >>>>> >>>>> (in case test patch not clear, push in git tree) >>>>> >>>>> Hardware: aarch64 >>>>> >>>>> 1. 5.18-rc6 >>>>> real 0m8.402s >>>>> user 0m3.015s >>>>> sys 0m1.102s >>>>> >>>>> 2. 5.19-rc1 >>>>> real 2m41.433s >>>>> user 0m3.097s >>>>> sys 0m1.177s >>>>> >>>>> 3. 5.19-rc1 + fix1 from Paul >>>>> real 2m43.404s >>>>> user 0m2.880s >>>>> sys 0m1.214s >>>>> >>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block >>>>> real 0m15.262s >>>>> user 0m3.003s >>>>> sys 0m1.033s >>>>> >>>>> When build kernel in the meantime, load time become longer. >>>>> >>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 >>>>> real 0m15.215s >>>>> user 0m2.942s >>>>> sys 0m1.172s >>>>> >>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end >>>>> real 1m23.936s >>>>> user 0m2.969s >>>>> sys 0m1.181s >>>> And thank you for the testing! 
>>>> >>>> Could you please try fix3 + Neeraj's change of srcu_gp_end? >>>> >>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of >>>> srcu_gp_end. >>>> >>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot >>>> times start rising? This is probably best done by starting with >>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run >>>> until boot time becomes slow, followed by a binary search between the >>>> last two values. (The idea is to bias the search so that fast boot >>>> times are the common case.) >>> >>> SRCU_MAX_NODELAY_PHASE 100 becomes slower. >>> >>> >>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000 >>> >>> real 0m11.154s ~12s >>> >>> user 0m2.919s >>> >>> sys 0m1.064s >>> >>> >>> >>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000 >>> >>> real 0m11.258s >>> >>> user 0m3.113s >>> >>> sys 0m1.073s >>> >>> >>> >>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 >>> >>> real 0m30.053s ~ 32s >>> >>> user 0m2.827s >>> >>> sys 0m1.161s >>> >>> >>> >>> By the way, if build kernel on the board in the meantime (using >>> memory), time become much longer. >>> >>> real 1m2.763s >>> >>> >>> >>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000 >>> >>> real 0m11.443s >>> >>> user 0m3.022s >>> >>> sys 0m1.052s >>> >>> >> >> This is useful data, thanks! Did you get chance to check between 100 >> and 1000, to narrow down further, from which point (does need to be >> exact value) between 100 and 1000, you start seeing degradation at, >> for ex. 250, 500 , ...? >> >> Is it also possible to try experiment 10 and 11 with below diff. >> What I have done in below diff is, call srcu_get_delay() only once >> in try_check_zero() (and not for every loop iteration); also >> retry with a different delay for the extra iteration which is done >> when srcu_get_delay(ssp) returns 0. >> >> Once we have this data, can you also try by changing >> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff. 
>> >> #define SRCU_RETRY_CHECK_LONG_DELAY 100 >> >> ------------------------------------------------------------------------- >> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >> index 6a354368ac1d..3aff2f3e99ab 100644 >> --- a/kernel/rcu/srcutree.c >> +++ b/kernel/rcu/srcutree.c >> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); >> * we repeatedly block for 1-millisecond time periods. >> */ >> #define SRCU_RETRY_CHECK_DELAY 5 >> +#define SRCU_RETRY_CHECK_LONG_DELAY 5 >> >> /* >> * Start an SRCU grace period. >> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct >> srcu_struct *ssp, struct srcu_data *sdp, >> */ >> static bool try_check_zero(struct srcu_struct *ssp, int idx, int >> trycount) >> { >> + unsigned long curdelay; >> + curdelay = !srcu_get_delay(ssp); >> for (;;) { >> if (srcu_readers_active_idx_check(ssp, idx)) >> return true; >> - if (--trycount + !srcu_get_delay(ssp) <= 0) >> + if (--trycount + curdelay <= 0) >> return false; >> - udelay(SRCU_RETRY_CHECK_DELAY); >> + if (trycount) >> + udelay(SRCU_RETRY_CHECK_DELAY); >> + else >> + udelay(SRCU_RETRY_CHECK_LONG_DELAY); >> } >> } >> > > 11. 
5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000 > real 0m11.443s > user 0m3.022s > sys 0m1.052s > > fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500 > > real 0m11.401s > user 0m2.798s > sys 0m1.328s > > > fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250 > > real 0m15.748s > user 0m2.781s > sys 0m1.294s > > > fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200 > > real 0m20.704s ~21s > user 0m2.954s > sys 0m1.226s > > fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150 > > real 0m25.151s > user 0m2.980s > sys 0m1.256s > > > fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 > real 0m30.053s ~32s > user 0m2.827s > sys 0m1.161s > > > fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5 > > real 0m19.263s > user 0m3.018s > sys 0m1.211s > > > > fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + > SRCU_RETRY_CHECK_LONG_DELAY 100 > > real 0m9.347s > user 0m3.132s > sys 0m1.041s > > Thanks. From the data and experiments done, it looks to me that we get timings comparable to 5.18-rc4 when we retry without sleeping for a duration close to 4-5 ms, which could be related to the configured HZ (as it is 250). Is it possible to try the below configuration on top of fix15? If possible, can you try with both HZ_1000 and HZ_250? As multiple fixes are getting combined in the experiments, for clarity, please also share the diff of srcutree.c (on top of baseline) for all experiments. 16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1 jiffy) #define SRCU_MAX_NODELAY_TRY_CHECK_PHASE 10 #define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2) #define SRCU_RETRY_CHECK_LONG_DELAY \ (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE) 17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 2 jiffies) #define SRCU_RETRY_CHECK_LONG_DELAY \ (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE) 18. 
fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1/2 jiffy) #define SRCU_RETRY_CHECK_LONG_DELAY \ (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2) Thanks Neeraj > And Shameer suggests this method, to decrease region_add/del time from > 6000+ to 200+, also works on 5.19-rc1 > > Make the EFI flash image file > $ dd if=/dev/zero of=flash0.img bs=1M count=64 > $ dd if=./QEMU_EFI-2022.fd of=flash0.img conv=notrunc > $ dd if=/dev/zero of=flash1.img bs=1M count=64 > Include the below line instead of "-bios QEMU_EFI.fd" in Qemu cmd line. > -pflash flash0.img -pflash flash1.img \ > Thanks for sharing this wa info! > > > Thanks > >> >> Thanks >> Neeraj >> >>> Thanks >>> >>>> >>>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5 >>>> And thank you for these details. >>>> >>>> Thanx, Paul >>> > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-20 7:50 ` Neeraj Upadhyay @ 2022-06-24 15:30 ` zhangfei.gao 0 siblings, 0 replies; 37+ messages in thread From: zhangfei.gao @ 2022-06-24 15:30 UTC (permalink / raw) To: Neeraj Upadhyay, paulmck Cc: Shameerali Kolothum Thodi, Paolo Bonzini, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On 2022/6/20 3:50 PM, Neeraj Upadhyay wrote: > Hi, > > > On 6/18/2022 8:37 AM, zhangfei.gao@foxmail.com wrote: >> >> >> On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote: >>> Hi, >>> >>> On 6/15/2022 2:33 PM, zhangfei.gao@foxmail.com wrote: >>>> >>>> >>>> On 2022/6/14 10:17 PM, Paul E. McKenney wrote: >>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@foxmail.com >>>>> wrote: >>>>>> >>>>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote: >>>>>>>> 5.18-rc4 based ~8sec >>>>>>>> >>>>>>>> 5.19-rc1 ~2m43sec >>>>>>>> >>>>>>>> 5.19-rc1+fix1 ~19sec >>>>>>>> >>>>>>>> 5.19-rc1-fix2 ~19sec >>>>>>>> >>>>>>> If you try the below diff on top of either 5.19-rc1+fix1 or >>>>>>> 5.19-rc1-fix2, >>>>>>> does it show any difference in boot time? 
>>>>>>> >>>>>>> --- a/kernel/rcu/srcutree.c >>>>>>> +++ b/kernel/rcu/srcutree.c >>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct >>>>>>> srcu_struct >>>>>>> *ssp, struct srcu_node *snp >>>>>>> */ >>>>>>> static void srcu_gp_end(struct srcu_struct *ssp) >>>>>>> { >>>>>>> - unsigned long cbdelay; >>>>>>> + unsigned long cbdelay = 1; >>>>>>> bool cbs; >>>>>>> bool last_lvl; >>>>>>> int cpu; >>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct >>>>>>> *ssp) >>>>>>> spin_lock_irq_rcu_node(ssp); >>>>>>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>>>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>>>>>> - cbdelay = !!srcu_get_delay(ssp); >>>>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>>>>> + cbdelay = 0; >>>>>>> + >>>>>>> WRITE_ONCE(ssp->srcu_last_gp_end, >>>>>>> ktime_get_mono_fast_ns()); >>>>> Thank you both for the testing and the proposed fix! >>>>> >>>>>> Test here: >>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1 >>>>>> kernel: >>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test >>>>>> >>>>>> (in case test patch not clear, push in git tree) >>>>>> >>>>>> Hardware: aarch64 >>>>>> >>>>>> 1. 5.18-rc6 >>>>>> real 0m8.402s >>>>>> user 0m3.015s >>>>>> sys 0m1.102s >>>>>> >>>>>> 2. 5.19-rc1 >>>>>> real 2m41.433s >>>>>> user 0m3.097s >>>>>> sys 0m1.177s >>>>>> >>>>>> 3. 5.19-rc1 + fix1 from Paul >>>>>> real 2m43.404s >>>>>> user 0m2.880s >>>>>> sys 0m1.214s >>>>>> >>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block >>>>>> real 0m15.262s >>>>>> user 0m3.003s >>>>>> sys 0m1.033s >>>>>> >>>>>> When build kernel in the meantime, load time become longer. >>>>>> >>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000 >>>>>> real 0m15.215s >>>>>> user 0m2.942s >>>>>> sys 0m1.172s >>>>>> >>>>>> 6. 
5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end >>>>>> real 1m23.936s >>>>>> user 0m2.969s >>>>>> sys 0m1.181s >>>>> And thank you for the testing! >>>>> >>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end? >>>>> >>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of >>>>> srcu_gp_end. >>>>> >>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot >>>>> times start rising? This is probably best done by starting with >>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run >>>>> until boot time becomes slow, followed by a binary search between the >>>>> last two values. (The idea is to bias the search so that fast boot >>>>> times are the common case.) >>>> >>>> SRCU_MAX_NODELAY_PHASE 100 becomes slower. >>>> >>>> >>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000 >>>> >>>> real 0m11.154s ~12s >>>> >>>> user 0m2.919s >>>> >>>> sys 0m1.064s >>>> >>>> >>>> >>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000 >>>> >>>> real 0m11.258s >>>> >>>> user 0m3.113s >>>> >>>> sys 0m1.073s >>>> >>>> >>>> >>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 >>>> >>>> real 0m30.053s ~ 32s >>>> >>>> user 0m2.827s >>>> >>>> sys 0m1.161s >>>> >>>> >>>> >>>> By the way, if build kernel on the board in the meantime (using >>>> memory), time become much longer. >>>> >>>> real 1m2.763s >>>> >>>> >>>> >>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000 >>>> >>>> real 0m11.443s >>>> >>>> user 0m3.022s >>>> >>>> sys 0m1.052s >>>> >>>> >>> >>> This is useful data, thanks! Did you get chance to check between 100 >>> and 1000, to narrow down further, from which point (does need to be >>> exact value) between 100 and 1000, you start seeing degradation at, >>> for ex. 250, 500 , ...? >>> >>> Is it also possible to try experiment 10 and 11 with below diff. 
>>> What I have done in below diff is, call srcu_get_delay() only once >>> in try_check_zero() (and not for every loop iteration); also >>> retry with a different delay for the extra iteration which is done >>> when srcu_get_delay(ssp) returns 0. >>> >>> Once we have this data, can you also try by changing >>> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff. >>> >>> #define SRCU_RETRY_CHECK_LONG_DELAY 100 >>> >>> ------------------------------------------------------------------------- >>> >>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>> index 6a354368ac1d..3aff2f3e99ab 100644 >>> --- a/kernel/rcu/srcutree.c >>> +++ b/kernel/rcu/srcutree.c >>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); >>> * we repeatedly block for 1-millisecond time periods. >>> */ >>> #define SRCU_RETRY_CHECK_DELAY 5 >>> +#define SRCU_RETRY_CHECK_LONG_DELAY 5 >>> >>> /* >>> * Start an SRCU grace period. >>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct >>> srcu_struct *ssp, struct srcu_data *sdp, >>> */ >>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int >>> trycount) >>> { >>> + unsigned long curdelay; >>> + curdelay = !srcu_get_delay(ssp); >>> for (;;) { >>> if (srcu_readers_active_idx_check(ssp, idx)) >>> return true; >>> - if (--trycount + !srcu_get_delay(ssp) <= 0) >>> + if (--trycount + curdelay <= 0) >>> return false; >>> - udelay(SRCU_RETRY_CHECK_DELAY); >>> + if (trycount) >>> + udelay(SRCU_RETRY_CHECK_DELAY); >>> + else >>> + udelay(SRCU_RETRY_CHECK_LONG_DELAY); >>> } >>> } >>> >> >> 11. 
5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000 >> real 0m11.443s >> user 0m3.022s >> sys 0m1.052s >> >> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500 >> >> real 0m11.401s >> user 0m2.798s >> sys 0m1.328s >> >> >> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250 >> >> real 0m15.748s >> user 0m2.781s >> sys 0m1.294s >> >> >> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200 >> >> real 0m20.704s ~21s >> user 0m2.954s >> sys 0m1.226s >> >> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150 >> >> real 0m25.151s >> user 0m2.980s >> sys 0m1.256s >> >> >> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100 >> real 0m30.053s ~32s >> user 0m2.827s >> sys 0m1.161s >> >> >> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5 >> >> real 0m19.263s >> user 0m3.018s >> sys 0m1.211s >> >> >> >> fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + >> SRCU_RETRY_CHECK_LONG_DELAY 100 >> >> real 0m9.347s >> user 0m3.132s >> sys 0m1.041s >> >> > > Thanks. From the data and experiments done, it looks to me that we get > timings comparable to 5.18-rc4 when we retry without sleeping for > a duration close to 4-5 ms, which could be related to the configured > HZ (as it is 250). Is it possible to try the below configuration on top > of fix15? > If possible, can you try with both HZ_1000 and HZ_250? > As multiple fixes are getting combined in the experiments, for clarity, > please also share the diff of srcutree.c (on top of baseline) for all > experiments. > > 16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + > (long delay scaled to 1 jiffy) > > > #define SRCU_MAX_NODELAY_TRY_CHECK_PHASE 10 > #define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2) > #define SRCU_RETRY_CHECK_LONG_DELAY \ > (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE) > > > 17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + > (long delay scaled to 2 jiffies) > > #define SRCU_RETRY_CHECK_LONG_DELAY \ > (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE) > > 18. 
fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + > (long delay scaled to 1/2 jiffy) > > #define SRCU_RETRY_CHECK_LONG_DELAY \ > (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2) fix16: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1 jiffy) real 0m10.120s user 0m3.885s sys 0m1.040s fix17: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 2 jiffy) real 0m9.851s user 0m3.886s sys 0m1.011s fix18: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1/2 jiffy) real 0m9.741s user 0m3.837s sys 0m1.060s code push to https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test ^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 6:55 ` zhangfei.gao 2022-06-13 12:18 ` Paul E. McKenney @ 2022-06-13 15:02 ` Shameerali Kolothum Thodi 2022-06-15 8:38 ` Marc Zyngier 2022-06-15 8:29 ` Marc Zyngier 2 siblings, 1 reply; 37+ messages in thread From: Shameerali Kolothum Thodi @ 2022-06-13 15:02 UTC (permalink / raw) To: zhangfei.gao, paulmck Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) > -----Original Message----- > From: zhangfei.gao@foxmail.com [mailto:zhangfei.gao@foxmail.com] > Sent: 13 June 2022 07:56 > To: paulmck@kernel.org > Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao > <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org; > rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett > <josh@joshtriplett.org>; Mathieu Desnoyers > <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>; > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com> > Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and > blocking readers from consuming CPU) cause qemu boot slow > > By the way, the issue should only be related to qemu acpi, not related > to the rmr feature. > Test with: https://github.com/qemu/qemu/tree/stable-6.1 > > Looks like it is caused by too many kvm_region_add & kvm_region_del if > acpi=force, Based on the setup I have, I think it has nothing to do with the Guest kernel booting with ACPI per se (i.e., acpi=force in the Qemu kernel cmd line). It is more to do with Qemu having the "-bios QEMU_EFI.fd", which sets up pflash devices, resulting in a large number of pflash read/write calls (before the Guest kernel even boots), which in turn seems to be triggering the below kvm_region_add/del calls. 
Thanks, Shameer > If no acpi, no print kvm_region_add/del (1000 times print once) > > If with acpi=force, > During qemu boot > kvm_region_add region_add = 1000 > kvm_region_del region_del = 1000 > kvm_region_add region_add = 2000 > kvm_region_del region_del = 2000 > kvm_region_add region_add = 3000 > kvm_region_del region_del = 3000 > kvm_region_add region_add = 4000 > kvm_region_del region_del = 4000 > kvm_region_add region_add = 5000 > kvm_region_del region_del = 5000 > kvm_region_add region_add = 6000 > kvm_region_del region_del = 6000 > > kvm_region_add/kvm_region_del -> > kvm_set_phys_mem-> > kvm_set_user_memory_region-> > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) > > [ 361.094493] __synchronize_srcu loop=9000 > [ 361.094501] Call trace: > [ 361.094502] dump_backtrace+0xe4/0xf0 > [ 361.094505] show_stack+0x20/0x70 > [ 361.094507] dump_stack_lvl+0x8c/0xb8 > [ 361.094509] dump_stack+0x18/0x34 > [ 361.094511] __synchronize_srcu+0x120/0x128 > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > [ 361.094519] kvm_activate_memslot+0x40/0x68 > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > [ 361.094524] kvm_set_memory_region+0x78/0xb8 > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > [ 361.094530] invoke_syscall+0x4c/0x110 > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > [ 361.094536] do_el0_svc+0x34/0xc0 > [ 361.094538] el0_svc+0x30/0x98 > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0 > [ 361.094544] el0t_64_sync+0x18c/0x190 > [ 363.942817] kvm_set_memory_region loop=6000 > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 15:02 ` Shameerali Kolothum Thodi @ 2022-06-15 8:38 ` Marc Zyngier 0 siblings, 0 replies; 37+ messages in thread From: Marc Zyngier @ 2022-06-15 8:38 UTC (permalink / raw) To: Shameerali Kolothum Thodi Cc: zhangfei.gao, paulmck, Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, mtosatti, Auger Eric, chenxiang (M) On Mon, 13 Jun 2022 16:02:30 +0100, Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: > > > > > -----Original Message----- > > From: zhangfei.gao@foxmail.com [mailto:zhangfei.gao@foxmail.com] > > Sent: 13 June 2022 07:56 > > To: paulmck@kernel.org > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Zhangfei Gao > > <zhangfei.gao@linaro.org>; linux-kernel@vger.kernel.org; > > rcu@vger.kernel.org; Lai Jiangshan <jiangshanlai@gmail.com>; Josh Triplett > > <josh@joshtriplett.org>; Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com>; Matthew Wilcox <willy@infradead.org>; > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > > mtosatti@redhat.com; Auger Eric <eric.auger@redhat.com> > > Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and > > blocking readers from consuming CPU) cause qemu boot slow > > > > > By the way, the issue should be only related with qemu apci. not related > > with rmr feature > > Test with: https://github.com/qemu/qemu/tree/stable-6.1 > > > > Looks it caused by too many kvm_region_add & kvm_region_del if > > acpi=force, > > Based on the setup I have, I think it has nothing to do with Guest > kernel booting with ACPI per se(ie, acpi=force in Qemu kernel cmd > line). 
It is more to do with Qemu having the "-bios QEMU_EFI.fd" > which sets up pflash devices resulting in large number of pflash > read/write calls(before Guest kernel even boots) which in turn seems > to be triggering the below kvm_region_add/del calls. Indeed, this is all about memslots being added/removed at an alarming rate (well, not more alarming today than yesterday, but QEMU's flash emulation is... interesting). M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-13 6:55 ` zhangfei.gao 2022-06-13 12:18 ` Paul E. McKenney 2022-06-13 15:02 ` Shameerali Kolothum Thodi @ 2022-06-15 8:29 ` Marc Zyngier 2 siblings, 0 replies; 37+ messages in thread From: Marc Zyngier @ 2022-06-15 8:29 UTC (permalink / raw) To: zhangfei.gao, Paul E. McKenney Cc: Paolo Bonzini, Zhangfei Gao, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi, mtosatti, Auger Eric On Mon, 13 Jun 2022 07:55:47 +0100, "zhangfei.gao@foxmail.com" <zhangfei.gao@foxmail.com> wrote: > > Hi, Paul > > On 2022/6/13 下午12:16, Paul E. McKenney wrote: > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote: > >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote: > >>> Hi, Paul > >>> > >>> On 2022/6/13 上午2:49, Paul E. McKenney wrote: > >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote: > >>>>> On 6/12/22 18:40, Paul E. McKenney wrote: > >>>>>>> Do these reserved memory regions really need to be allocated separately? > >>>>>>> (For example, are they really all non-contiguous? If not, that is, if > >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT > >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?) > >>>>>>> > >>>>>>> Are all of these reserved memory regions set up before init is spawned? > >>>>>>> > >>>>>>> Are all of these reserved memory regions set up while there is only a > >>>>>>> single vCPU up and running? > >>>>>>> > >>>>>>> Is the SRCU grace period really needed in this case? (I freely confess > >>>>>>> to not being all that familiar with KVM.) > >>>>>> Oh, and there was a similar many-requests problem with networking many > >>>>>> years ago. 
This was solved by adding a new syscall/ioctl()/whatever > >>>>>> that permitted many requests to be presented to the kernel with a single > >>>>>> system call. > >>>>>> > >>>>>> Could a new ioctl() be introduced that requested a large number > >>>>>> of these memory regions in one go so as to make each call to > >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+ > >>>>>> requests? Adding a few of the KVM guys on CC for their thoughts. > >>>>> Unfortunately not. Apart from this specific case, in general the calls to > >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the > >>>>> guest, and those writes then map to a ioctl. Typically the guest sets up a > >>>>> device at a time, and each setup step causes a synchronize_srcu()---and > >>>>> expedited at that. > >>>> I was afraid of something like that... > >>>> > >>>>> KVM has two SRCUs: > >>>>> > >>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers > >>>>> that are very very small, but it needs extremely fast detection of grace > >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up > >>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are > >>>>> not so frequent. > >>>>> > >>>>> 2) kvm->srcu is nastier because there are readers all the time. The > >>>>> read-side critical section are still short-ish, but they need the sleepable > >>>>> part because they access user memory. > >>>> Which one of these two is in play in this case? > >>>> > >>>>> Writers are not frequent per se; the problem is they come in very large > >>>>> bursts when a guest boots. And while the whole boot path overall can be > >>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger > >>>>> impact on runtime than the O(n^2) parts, as demonstrated here. 
> >>>>> > >>>>> Therefore, we operated on the assumption that the callers of > >>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code > >>>>> and the desire was to get past the booting phase as fast as possible. If > >>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants; > >>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole > >>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not > >>>>> problematic. > >>>>> > >>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu > >>>>> were was introduced (respectively in 2009 and 2014). But perhaps they do > >>>>> not hold anymore now that each SRCU is not as independent as it used to be > >>>>> in those years, and instead they use workqueues instead? > >>>> The problem was not internal to SRCU, but rather due to the fact > >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks > >>>> resulting from repeated synchronize_rcu_expedited() invocations. So I > >>>> added heuristics to get the occasional sleep in there for KLP's benefit. > >>>> Perhaps these heuristics need to be less aggressive about adding sleep. > >>>> > >>>> These heuristics have these aspects: > >>>> > >>>> 1. The longer readers persist in an expedited SRCU grace period, > >>>> the longer the wait between successive checks of the reader > >>>> state. Roughly speaking, we wait as long as the grace period > >>>> has currently been in effect, capped at ten jiffies. > >>>> > >>>> 2. SRCU grace periods have several phases. We reset so that each > >>>> phase starts by not waiting (new phase, new set of readers, > >>>> so don't penalize this set for the sins of the previous set). > >>>> But once we get to the point of adding delay, we add the > >>>> delay based on the beginning of the full grace period. 
> >>>> > >>>> Right now, the checking for grace-period length does not allow for the > >>>> possibility that a grace period might start just before the jiffies > >>>> counter gets incremented (because I didn't realize that anyone cared), > >>>> so that is one possible thing to change. I can also allow more no-delay > >>>> checks per SRCU grace-period phase. > >>>> > >>>> Zhangfei, does something like the patch shown below help? > >>>> > >>>> Additional adjustments are likely needed to avoid re-breaking KLP, > >>>> but we have to start somewhere... > >>>> > >>>> Thanx, Paul > >>>> > >>>> ------------------------------------------------------------------------ > >>>> > >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > >>>> index 50ba70f019dea..6a354368ac1d1 100644 > >>>> --- a/kernel/rcu/srcutree.c > >>>> +++ b/kernel/rcu/srcutree.c > >>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > >>>> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > >>>> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > >>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. > >>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > >>>> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. 
> >>>> /* > >>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > >>>> */ > >>>> static unsigned long srcu_get_delay(struct srcu_struct *ssp) > >>>> { > >>>> + unsigned long gpstart; > >>>> + unsigned long j; > >>>> unsigned long jbase = SRCU_INTERVAL; > >>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > >>>> jbase = 0; > >>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > >>>> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > >>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > >>>> + j = jiffies - 1; > >>>> + gpstart = READ_ONCE(ssp->srcu_gp_start); > >>>> + if (time_after(j, gpstart)) > >>>> + jbase += j - gpstart; > >>>> + } > >>>> if (!jbase) { > >>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > >>>> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > >>> Unfortunately, this patch does not helpful. > >>> > >>> Then re-add the debug info. > >>> > >>> During the qemu boot > >>> [ 232.997667] __synchronize_srcu loop=1000 > >>> > >>> [ 361.094493] __synchronize_srcu loop=9000 > >>> [ 361.094501] Call trace: > >>> [ 361.094502] dump_backtrace+0xe4/0xf0 > >>> [ 361.094505] show_stack+0x20/0x70 > >>> [ 361.094507] dump_stack_lvl+0x8c/0xb8 > >>> [ 361.094509] dump_stack+0x18/0x34 > >>> [ 361.094511] __synchronize_srcu+0x120/0x128 > >>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40 > >>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198 > >>> [ 361.094519] kvm_activate_memslot+0x40/0x68 > >>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0 > >>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438 > >>> [ 361.094524] kvm_set_memory_region+0x78/0xb8 > >>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0 > >>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8 > >>> [ 361.094530] invoke_syscall+0x4c/0x110 > >>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128 > >>> [ 361.094536] do_el0_svc+0x34/0xc0 > >>> [ 361.094538] el0_svc+0x30/0x98 > >>> [ 
361.094541] el0t_64_sync_handler+0xb8/0xc0 > >>> [ 361.094544] el0t_64_sync+0x18c/0x190 > >>> [ 363.942817] kvm_set_memory_region loop=6000 > >> Huh. > >> > >> One possibility is that the "if (!jbase)" block needs to be nested > >> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block. > > I test this diff and NO helpful > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 50ba70f019de..36286a4b74e6 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > #define SRCU_INTERVAL 1 // Base delay if no expedited > GPs pending. > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay > from slow readers. > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase > consecutive no-delay instances. > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase > consecutive no-delay instances. > #define SRCU_MAX_NODELAY 100 // Maximum consecutive > no-delay instances. > > /* > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct > srcu_struct *ssp) > */ > static unsigned long srcu_get_delay(struct srcu_struct *ssp) > { > + unsigned long gpstart; > + unsigned long j; > unsigned long jbase = SRCU_INTERVAL; > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > jbase = 0; > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); > - if (!jbase) { > - WRITE_ONCE(ssp->srcu_n_exp_nodelay, > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > SRCU_MAX_NODELAY_PHASE) > - jbase = 1; > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { > + j = jiffies - 1; > + gpstart = READ_ONCE(ssp->srcu_gp_start); > + if (time_after(j, gpstart)) > + jbase += j - gpstart; > + > + if (!jbase) { > + WRITE_ONCE(ssp->srcu_n_exp_nodelay, > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > 
SRCU_MAX_NODELAY_PHASE) > + jbase = 1; > + } > } > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the > > above change reduces the overhead by more than an order of magnitude. > > Except that the overhead of the series is far less than one second, > > not the several minutes that you are seeing. So the per-call overhead > > decreases from about 17 microseconds to a bit more than one microsecond. > > > > I could imagine an extra order of magnitude if you are running HZ=100 > > instead of the HZ=1000 that I am running. But that only gets up to a > > few seconds. > > > >> One additional debug is to apply the patch below on top of the one you > apply the patch below? > >> just now kindly tested, then use whatever debug technique you wish to > >> work out what fraction of the time during that critical interval that > >> srcu_get_delay() returns non-zero. > Sorry, I am confused, no patch right? > Just measure srcu_get_delay return to non-zero? > > > By the way, the issue should be only related with qemu apci. not > related with rmr feature No, this also occurs if you supply the guest's EFI with an empty set of persistent variables. EFI goes and zeroes it, which results in a read-only memslot write access being taken to userspace, the memslot being unmapped from the guest, QEMU doing a little dance, and eventually restoring the memslot back to the guest. Rince, repeat. Do that one byte at a time over 64MB, and your boot time for EFI only goes from 39s to 3m50s (that's on a speed-challenged Synquacer box), which completely kills the "deploy a new VM" use case. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow 2022-06-11 16:32 Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow Zhangfei Gao 2022-06-11 16:59 ` Paul E. McKenney @ 2022-06-14 1:53 ` chenxiang (M) 1 sibling, 0 replies; 37+ messages in thread From: chenxiang (M) @ 2022-06-14 1:53 UTC (permalink / raw) To: Zhangfei Gao, Paul E. McKenney, linux-kernel, rcu, Lai Jiangshan, Josh Triplett, Mathieu Desnoyers, Matthew Wilcox, Shameerali Kolothum Thodi Hi, I also encountered a similar issue, and I reported it here (https://www.spinics.net/lists/kernel/msg4396974.html). I tried the change Paul provided in the thread, but the issue is still there. On 2022/6/12 0:32, Zhangfei Gao wrote: > Hi, Paul > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest > kernel stuck for several minutes. > And on 5.18, there is no such problem. > > After revert this patch, the issue solved. 
> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers
> from consuming CPU)
>
> qemu cmd:
> build/aarch64-softmmu/qemu-system-aarch64 -machine virt,gic-version=3,iommu=smmuv3 \
> -enable-kvm -cpu host -m 1024 \
> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> -bios QEMU_EFI.fd
>
> log:
> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll 0x75459000
> Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> ProtectUefiImageCommon - 0x7AA4D040
>   - 0x0000000075458000 - 0x000000000003F000
> SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 (0x0000000000004008)
> SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 (0x0000000000020008)
> SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 (0x0000000000004008)
> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
>
> Not sure it is either reported or solved.
>
> Thanks
end of thread, other threads:[~2022-06-24 15:32 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-11 16:32 Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow Zhangfei Gao
2022-06-11 16:59 ` Paul E. McKenney
2022-06-12  7:40 ` zhangfei.gao
2022-06-12 13:36 ` Paul E. McKenney
2022-06-12 14:59 ` zhangfei.gao
2022-06-12 16:20 ` Paul E. McKenney
2022-06-12 16:40 ` Paul E. McKenney
2022-06-12 17:29 ` Paolo Bonzini
2022-06-12 17:47 ` Paolo Bonzini
2022-06-12 18:51 ` Paul E. McKenney
2022-06-12 18:49 ` Paul E. McKenney
2022-06-12 19:23 ` Paolo Bonzini
2022-06-12 20:09 ` Paul E. McKenney
2022-06-13  3:04 ` zhangfei.gao
2022-06-13  3:57 ` Paul E. McKenney
2022-06-13  4:16 ` Paul E. McKenney
2022-06-13  6:55 ` zhangfei.gao
2022-06-13 12:18 ` Paul E. McKenney
2022-06-13 13:23 ` zhangfei.gao
2022-06-13 14:59 ` Paul E. McKenney
2022-06-13 20:55 ` Shameerali Kolothum Thodi
2022-06-14 12:19 ` Neeraj Upadhyay
2022-06-14 14:03 ` zhangfei.gao
2022-06-14 14:14 ` Neeraj Upadhyay
2022-06-14 14:57 ` zhangfei.gao
2022-06-14 14:17 ` Paul E. McKenney
2022-06-15  9:03 ` zhangfei.gao
2022-06-15 10:40 ` Neeraj Upadhyay
2022-06-15 10:50 ` Paolo Bonzini
2022-06-15 11:04 ` Neeraj Upadhyay
2022-06-18  3:07 ` zhangfei.gao
2022-06-20  7:50 ` Neeraj Upadhyay
2022-06-24 15:30 ` zhangfei.gao
2022-06-13 15:02 ` Shameerali Kolothum Thodi
2022-06-15  8:38 ` Marc Zyngier
2022-06-15  8:29 ` Marc Zyngier
2022-06-14  1:53 ` chenxiang (M)