* incoming
@ 2022-04-15  2:12 Andrew Morton
From: Andrew Morton @ 2022-04-15  2:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches

14 patches, based on 115acbb56978941bb7537a97dfc303da286106c1.

Subsystems affected by this patch series:

  MAINTAINERS
  mm/tmpfs
  mm/secretmem
  mm/kasan
  mm/kfence
  mm/pagealloc
  mm/zram
  mm/compaction
  mm/hugetlb
  binfmt
  mm/vmalloc
  mm/kmemleak

Subsystem: MAINTAINERS

    Joe Perches <joe@perches.com>:
      MAINTAINERS: Broadcom internal lists aren't maintainers

Subsystem: mm/tmpfs

    Hugh Dickins <hughd@google.com>:
      tmpfs: fix regressions from wider use of ZERO_PAGE

Subsystem: mm/secretmem

    Axel Rasmussen <axelrasmussen@google.com>:
      mm/secretmem: fix panic when growing a memfd_secret

Subsystem: mm/kasan

    Zqiang <qiang1.zhang@intel.com>:
      irq_work: use kasan_record_aux_stack_noalloc() record callstack

    Vincenzo Frascino <vincenzo.frascino@arm.com>:
      kasan: fix hw tags enablement when KUNIT tests are disabled

Subsystem: mm/kfence

    Marco Elver <elver@google.com>:
      mm, kfence: support kmem_dump_obj() for KFENCE objects

Subsystem: mm/pagealloc

    Juergen Gross <jgross@suse.com>:
      mm, page_alloc: fix build_zonerefs_node()

Subsystem: mm/zram

    Minchan Kim <minchan@kernel.org>:
      mm: fix unexpected zeroed page mapping with zram swap

Subsystem: mm/compaction

    Charan Teja Kalla <quic_charante@quicinc.com>:
      mm: compaction: fix compiler warning when CONFIG_COMPACTION=n

Subsystem: mm/hugetlb

    Mike Kravetz <mike.kravetz@oracle.com>:
      hugetlb: do not demote poisoned hugetlb pages

Subsystem: binfmt

    Andrew Morton <akpm@linux-foundation.org>:
      revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
      revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"

Subsystem: mm/vmalloc

    Omar Sandoval <osandov@fb.com>:
      mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore

Subsystem: mm/kmemleak

    Patrick Wang <patrick.wang.shcn@gmail.com>:
      mm: kmemleak: take a full lowmem check in kmemleak_*_phys()

 MAINTAINERS                     |   64 ++++++++++++++++++++--------------------
 arch/x86/include/asm/io.h       |    2 -
 arch/x86/kernel/crash_dump_64.c |    1 
 fs/binfmt_elf.c                 |    6 +--
 include/linux/kfence.h          |   24 +++++++++++++++
 kernel/irq_work.c               |    2 -
 mm/compaction.c                 |   10 +++---
 mm/filemap.c                    |    6 ---
 mm/hugetlb.c                    |   17 ++++++----
 mm/kasan/hw_tags.c              |    5 +--
 mm/kasan/kasan.h                |   10 +++---
 mm/kfence/core.c                |   21 -------------
 mm/kfence/kfence.h              |   21 +++++++++++++
 mm/kfence/report.c              |   47 +++++++++++++++++++++++++++++
 mm/kmemleak.c                   |    8 ++---
 mm/page_alloc.c                 |    2 -
 mm/page_io.c                    |   54 ---------------------------------
 mm/secretmem.c                  |   17 ++++++++++
 mm/shmem.c                      |   31 ++++++++++++-------
 mm/slab.c                       |    2 -
 mm/slab.h                       |    2 -
 mm/slab_common.c                |    9 +++++
 mm/slob.c                       |    2 -
 mm/slub.c                       |    2 -
 mm/vmalloc.c                    |   11 ------
 25 files changed, 207 insertions(+), 169 deletions(-)


* [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: joe, akpm, patches, linux-mm, mm-commits, torvalds, akpm


From: Joe Perches <joe@perches.com>
Subject: MAINTAINERS: Broadcom internal lists aren't maintainers

Convert the broadcom internal list M: and L: entries to R: as
exploder email addresses are neither maintainers nor mailing lists.

Reorder the entries as necessary.
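
For readers less used to the MAINTAINERS format: M: names a maintainer,
R: a designated reviewer who should be Cc'ed on patches, and L: a mailing
list.  A converted section therefore ends up shaped like the made-up entry
below (illustration only, not one of the entries touched by this patch):

    BROADCOM EXAMPLE DRIVER
    M:	Jane Doe <jane.doe@example.com>
    R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
    L:	netdev@vger.kernel.org
    S:	Maintained
    F:	drivers/net/ethernet/broadcom/example.c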

Link: https://lkml.kernel.org/r/04eb301f5b3adbefdd78e76657eff0acb3e3d87f.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |   64 +++++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

--- a/MAINTAINERS~maintainers-broadcom-internal-lists-arent-maintainers
+++ a/MAINTAINERS
@@ -3743,7 +3743,7 @@ F:	include/linux/platform_data/b53.h
 
 BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE
 M:	Nicolas Saenz Julienne <nsaenz@kernel.org>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-rpi-kernel@lists.infradead.org (moderated for non-subscribers)
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
@@ -3758,7 +3758,7 @@ BROADCOM BCM281XX/BCM11XXX/BCM216XX ARM
 M:	Florian Fainelli <f.fainelli@gmail.com>
 M:	Ray Jui <rjui@broadcom.com>
 M:	Scott Branden <sbranden@broadcom.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 S:	Maintained
 T:	git git://github.com/broadcom/mach-bcm
 F:	arch/arm/mach-bcm/
@@ -3778,7 +3778,7 @@ F:	arch/mips/include/asm/mach-bcm47xx/*
 
 BROADCOM BCM4908 ETHERNET DRIVER
 M:	Rafał Miłecki <rafal@milecki.pl>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/net/brcm,bcm4908-enet.yaml
@@ -3787,7 +3787,7 @@ F:	drivers/net/ethernet/broadcom/unimac.
 
 BROADCOM BCM4908 PINMUX DRIVER
 M:	Rafał Miłecki <rafal@milecki.pl>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-gpio@vger.kernel.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/pinctrl/brcm,bcm4908-pinctrl.yaml
@@ -3797,7 +3797,7 @@ BROADCOM BCM5301X ARM ARCHITECTURE
 M:	Florian Fainelli <f.fainelli@gmail.com>
 M:	Hauke Mehrtens <hauke@hauke-m.de>
 M:	Rafał Miłecki <zajec5@gmail.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	arch/arm/boot/dts/bcm470*
@@ -3808,7 +3808,7 @@ F:	arch/arm/mach-bcm/bcm_5301x.c
 BROADCOM BCM53573 ARM ARCHITECTURE
 M:	Florian Fainelli <f.fainelli@gmail.com>
 M:	Rafał Miłecki <rafal@milecki.pl>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	arch/arm/boot/dts/bcm47189*
@@ -3816,7 +3816,7 @@ F:	arch/arm/boot/dts/bcm53573*
 
 BROADCOM BCM63XX ARM ARCHITECTURE
 M:	Florian Fainelli <f.fainelli@gmail.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 T:	git git://github.com/broadcom/stblinux.git
@@ -3830,7 +3830,7 @@ F:	drivers/usb/gadget/udc/bcm63xx_udc.*
 
 BROADCOM BCM7XXX ARM ARCHITECTURE
 M:	Florian Fainelli <f.fainelli@gmail.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 T:	git git://github.com/broadcom/stblinux.git
@@ -3848,21 +3848,21 @@ N:	bcm7120
 BROADCOM BDC DRIVER
 M:	Al Cooper <alcooperx@gmail.com>
 L:	linux-usb@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 S:	Maintained
 F:	Documentation/devicetree/bindings/usb/brcm,bdc.yaml
 F:	drivers/usb/gadget/udc/bdc/
 
 BROADCOM BMIPS CPUFREQ DRIVER
 M:	Markus Mayer <mmayer@broadcom.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-pm@vger.kernel.org
 S:	Maintained
 F:	drivers/cpufreq/bmips-cpufreq.c
 
 BROADCOM BMIPS MIPS ARCHITECTURE
 M:	Florian Fainelli <f.fainelli@gmail.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-mips@vger.kernel.org
 S:	Maintained
 T:	git git://github.com/broadcom/stblinux.git
@@ -3928,53 +3928,53 @@ F:	drivers/net/wireless/broadcom/brcm802
 BROADCOM BRCMSTB GPIO DRIVER
 M:	Doug Berger <opendmb@gmail.com>
 M:	Florian Fainelli <f.fainelli@gmail.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 S:	Supported
 F:	Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.yaml
 F:	drivers/gpio/gpio-brcmstb.c
 
 BROADCOM BRCMSTB I2C DRIVER
 M:	Kamal Dasu <kdasu.kdev@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-i2c@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Supported
 F:	Documentation/devicetree/bindings/i2c/brcm,brcmstb-i2c.yaml
 F:	drivers/i2c/busses/i2c-brcmstb.c
 
 BROADCOM BRCMSTB UART DRIVER
 M:	Al Cooper <alcooperx@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-serial@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Maintained
 F:	Documentation/devicetree/bindings/serial/brcm,bcm7271-uart.yaml
 F:	drivers/tty/serial/8250/8250_bcm7271.c
 
 BROADCOM BRCMSTB USB EHCI DRIVER
 M:	Al Cooper <alcooperx@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-usb@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Maintained
 F:	Documentation/devicetree/bindings/usb/brcm,bcm7445-ehci.yaml
 F:	drivers/usb/host/ehci-brcm.*
 
 BROADCOM BRCMSTB USB PIN MAP DRIVER
 M:	Al Cooper <alcooperx@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-usb@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Maintained
 F:	Documentation/devicetree/bindings/usb/brcm,usb-pinmap.yaml
 F:	drivers/usb/misc/brcmstb-usb-pinmap.c
 
 BROADCOM BRCMSTB USB2 and USB3 PHY DRIVER
 M:	Al Cooper <alcooperx@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-kernel@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Maintained
 F:	drivers/phy/broadcom/phy-brcm-usb*
 
 BROADCOM ETHERNET PHY DRIVERS
 M:	Florian Fainelli <f.fainelli@gmail.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	netdev@vger.kernel.org
 S:	Supported
 F:	Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt
@@ -3985,7 +3985,7 @@ F:	include/linux/brcmphy.h
 BROADCOM GENET ETHERNET DRIVER
 M:	Doug Berger <opendmb@gmail.com>
 M:	Florian Fainelli <f.fainelli@gmail.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	netdev@vger.kernel.org
 S:	Supported
 F:	Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml
@@ -3999,7 +3999,7 @@ F:	include/linux/platform_data/mdio-bcm-
 BROADCOM IPROC ARM ARCHITECTURE
 M:	Ray Jui <rjui@broadcom.com>
 M:	Scott Branden <sbranden@broadcom.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 T:	git git://github.com/broadcom/stblinux.git
@@ -4027,7 +4027,7 @@ N:	stingray
 
 BROADCOM IPROC GBIT ETHERNET DRIVER
 M:	Rafał Miłecki <rafal@milecki.pl>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/net/brcm,amac.yaml
@@ -4036,7 +4036,7 @@ F:	drivers/net/ethernet/broadcom/unimac.
 
 BROADCOM KONA GPIO DRIVER
 M:	Ray Jui <rjui@broadcom.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 S:	Supported
 F:	Documentation/devicetree/bindings/gpio/brcm,kona-gpio.txt
 F:	drivers/gpio/gpio-bcm-kona.c
@@ -4069,7 +4069,7 @@ F:	drivers/firmware/broadcom/*
 BROADCOM PMB (POWER MANAGEMENT BUS) DRIVER
 M:	Rafał Miłecki <rafal@milecki.pl>
 M:	Florian Fainelli <f.fainelli@gmail.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-pm@vger.kernel.org
 S:	Maintained
 T:	git git://github.com/broadcom/stblinux.git
@@ -4085,7 +4085,7 @@ F:	include/linux/bcma/
 
 BROADCOM SPI DRIVER
 M:	Kamal Dasu <kdasu.kdev@gmail.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 S:	Maintained
 F:	Documentation/devicetree/bindings/spi/brcm,spi-bcm-qspi.yaml
 F:	drivers/spi/spi-bcm-qspi.*
@@ -4094,7 +4094,7 @@ F:	drivers/spi/spi-iproc-qspi.c
 
 BROADCOM STB AVS CPUFREQ DRIVER
 M:	Markus Mayer <mmayer@broadcom.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-pm@vger.kernel.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt
@@ -4102,7 +4102,7 @@ F:	drivers/cpufreq/brcmstb*
 
 BROADCOM STB AVS TMON DRIVER
 M:	Markus Mayer <mmayer@broadcom.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-pm@vger.kernel.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/thermal/brcm,avs-tmon.yaml
@@ -4110,7 +4110,7 @@ F:	drivers/thermal/broadcom/brcmstb*
 
 BROADCOM STB DPFE DRIVER
 M:	Markus Mayer <mmayer@broadcom.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	Documentation/devicetree/bindings/memory-controllers/brcm,dpfe-cpu.yaml
@@ -4119,8 +4119,8 @@ F:	drivers/memory/brcmstb_dpfe.c
 BROADCOM STB NAND FLASH DRIVER
 M:	Brian Norris <computersforpeace@gmail.com>
 M:	Kamal Dasu <kdasu.kdev@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-mtd@lists.infradead.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Maintained
 F:	drivers/mtd/nand/raw/brcmnand/
 F:	include/linux/platform_data/brcmnand.h
@@ -4129,7 +4129,7 @@ BROADCOM STB PCIE DRIVER
 M:	Jim Quinlan <jim2101024@gmail.com>
 M:	Nicolas Saenz Julienne <nsaenz@kernel.org>
 M:	Florian Fainelli <f.fainelli@gmail.com>
-M:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-pci@vger.kernel.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/pci/brcm,stb-pcie.yaml
@@ -4137,7 +4137,7 @@ F:	drivers/pci/controller/pcie-brcmstb.c
 
 BROADCOM SYSTEMPORT ETHERNET DRIVER
 M:	Florian Fainelli <f.fainelli@gmail.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	netdev@vger.kernel.org
 S:	Supported
 F:	drivers/net/ethernet/broadcom/bcmsysport.*
@@ -4154,7 +4154,7 @@ F:	drivers/net/ethernet/broadcom/tg3.*
 
 BROADCOM VK DRIVER
 M:	Scott Branden <scott.branden@broadcom.com>
-L:	bcm-kernel-feedback-list@broadcom.com
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 S:	Supported
 F:	drivers/misc/bcm-vk/
 F:	include/uapi/linux/misc/bcm_vk.h
@@ -17648,8 +17648,8 @@ K:	\bTIF_SECCOMP\b
 
 SECURE DIGITAL HOST CONTROLLER INTERFACE (SDHCI) Broadcom BRCMSTB DRIVER
 M:	Al Cooper <alcooperx@gmail.com>
+R:	Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
 L:	linux-mmc@vger.kernel.org
-L:	bcm-kernel-feedback-list@broadcom.com
 S:	Maintained
 F:	drivers/mmc/host/sdhci-brcmstb*
 
_

* [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: patrice.chotard, mpatocka, markhemm, lczerner, hch, djwong,
	chuck.lever, hughd, akpm, patches, linux-mm, mm-commits,
	torvalds, akpm

From: Hugh Dickins <hughd@google.com>
Subject: tmpfs: fix regressions from wider use of ZERO_PAGE

Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing
when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.

Whilst nfsd_splice_action() does contain some questionable handling of
repeated pages, and Chuck was able to work around there, history from
Mark Hemment makes clear that there might be similar dangers elsewhere:
it was not a good idea for me to pass ZERO_PAGE down to unknown actors.

Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
but the x86 clear_user() is not nearly so well optimized as copy to user
(dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
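
As a rough userspace illustration of the two cases (not part of the patch;
the path and sizes are arbitrary, and /dev/shm is assumed to be tmpfs): a
plain read(2) of a hole reaches shmem_file_read_iter() with an iovec
iterator and keeps the ZERO_PAGE copy, while splice(2) of the same hole
into a pipe is not an iovec iterator and is now zeroed with iov_iter_zero():

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            int pipefd[2];
            loff_t off = 0;
            int fd = open("/dev/shm/zero-page-demo",
                          O_RDWR | O_CREAT | O_TRUNC, 0600);

            if (fd < 0 || pipe(pipefd) < 0)
                    return 1;
            if (ftruncate(fd, 1 << 20) < 0)  /* 1MiB hole, no pages allocated */
                    return 1;

            /* iovec path: iter_is_iovec() true, hole copied from ZERO_PAGE */
            printf("read() of hole: %zd bytes\n", read(fd, buf, sizeof(buf)));

            /* pipe path: not an iovec, hole now zeroed via iov_iter_zero() */
            printf("splice() of hole into pipe: %zd bytes\n",
                   splice(fd, &off, pipefd[1], NULL, sizeof(buf), 0));

            close(fd);
            unlink("/dev/shm/zero-page-demo");
            return 0;
    }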

And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)),
which had caused boot failure on arm noMMU STM32F7 and STM32H7 boards.

Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
Fixes: 56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever III <chuck.lever@oracle.com>
Cc: Mark Hemment <markhemm@googlemail.com>
Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |    6 ------
 mm/shmem.c   |   31 ++++++++++++++++++++-----------
 2 files changed, 20 insertions(+), 17 deletions(-)

--- a/mm/filemap.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/filemap.c
@@ -1063,12 +1063,6 @@ void __init pagecache_init(void)
 		init_waitqueue_head(&folio_wait_table[i]);
 
 	page_writeback_init();
-
-	/*
-	 * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date,
-	 * and splice's page_cache_pipe_buf_confirm() needs to see that.
-	 */
-	SetPageUptodate(ZERO_PAGE(0));
 }
 
 /*
--- a/mm/shmem.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/shmem.c
@@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(stru
 		pgoff_t end_index;
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
-		bool got_page;
 
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
@@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(stru
 			 */
 			if (!offset)
 				mark_page_accessed(page);
-			got_page = true;
+			/*
+			 * Ok, we have the page, and it's up-to-date, so
+			 * now we can copy it to user space...
+			 */
+			ret = copy_page_to_iter(page, offset, nr, to);
+			put_page(page);
+
+		} else if (iter_is_iovec(to)) {
+			/*
+			 * Copy to user tends to be so well optimized, but
+			 * clear_user() not so much, that it is noticeably
+			 * faster to copy the zero page instead of clearing.
+			 */
+			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
 		} else {
-			page = ZERO_PAGE(0);
-			got_page = false;
+			/*
+			 * But submitting the same page twice in a row to
+			 * splice() - or others? - can result in confusion:
+			 * so don't attempt that optimization on pipes etc.
+			 */
+			ret = iov_iter_zero(nr, to);
 		}
 
-		/*
-		 * Ok, we have the page, and it's up-to-date, so
-		 * now we can copy it to user space...
-		 */
-		ret = copy_page_to_iter(page, offset, nr, to);
 		retval += ret;
 		offset += ret;
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
 
-		if (got_page)
-			put_page(page);
 		if (!iov_iter_count(to))
 			break;
 		if (ret < nr) {
_

* [patch 03/14] mm/secretmem: fix panic when growing a memfd_secret
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: willy, stable, rppt, lkp, axelrasmussen, akpm, patches, linux-mm,
	mm-commits, torvalds, akpm

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: mm/secretmem: fix panic when growing a memfd_secret

When one tries to grow an existing memfd_secret with ftruncate, one gets a
panic [1].  For example, doing the following reliably induces the panic:

    fd = memfd_secret();

    ftruncate(fd, 10);
    ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(ptr, "123456789");

    munmap(ptr, 10);
    ftruncate(fd, 20);
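
Fleshed out, a compilable version of that reproducer looks roughly like the
sketch below (assuming libc headers that define SYS_memfd_secret and a
kernel with secretmem enabled; with this patch applied, the final
ftruncate() fails with EINVAL instead of panicking):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            /* no glibc wrapper for memfd_secret(2), use the raw syscall */
            int fd = syscall(SYS_memfd_secret, 0);

            if (fd < 0) {
                    perror("memfd_secret");
                    return 1;
            }
            ftruncate(fd, 10);      /* sizing the empty file works fine */

            char *ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
            if (ptr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* fault the page in; it is then removed from the direct map */
            strcpy(ptr, "123456789");
            munmap(ptr, 10);

            /* growing the now-populated file: panic before, -EINVAL after */
            if (ftruncate(fd, 20) < 0)
                    perror("ftruncate");
            return 0;
    }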

The basic reason for this is, when we grow with ftruncate, we call down
into simple_setattr, and then truncate_inode_pages_range, and eventually
we try to zero part of the memory.  The normal truncation code does this
via the direct map (i.e., it calls page_address() and hands that to
memset()).

For memfd_secret though, we specifically don't map our pages via the
direct map (i.e.  we call set_direct_map_invalid_noflush() on every
fault).  So the address returned by page_address() isn't useful, and when
we try to memset() with it we panic.

This patch avoids the panic by implementing a custom setattr for
memfd_secret, which detects resizes specifically (setting the size for the
first time works just fine, since there are no existing pages to try to
zero), and rejects them with EINVAL.

One could argue growing should be supported, but I think that will require
a significantly more lengthy change.  So, I propose a minimal fix for the
benefit of stable kernels, and then perhaps to extend memfd_secret to
support growing in a separate patch.

[1]:

[  774.320433] BUG: unable to handle page fault for address: ffffa0a889277028
[  774.322297] #PF: supervisor write access in kernel mode
[  774.323306] #PF: error_code(0x0002) - not-present page
[  774.324296] PGD afa01067 P4D afa01067 PUD 83f909067 PMD 83f8bf067 PTE 800ffffef6d88060
[  774.325841] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[  774.326934] CPU: 0 PID: 281 Comm: repro Not tainted 5.17.0-dbg-DEV #1
[  774.328074] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[  774.329732] RIP: 0010:memset_erms+0x9/0x10
[  774.330474] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
[  774.333543] RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
[  774.334404] RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
[  774.335545] RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
[  774.336685] RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
[  774.337929] R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
[  774.339236] R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
[  774.340356] FS:  00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
[  774.341635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  774.342535] CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
[  774.343651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  774.344780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  774.345938] Call Trace:
[  774.346334]  <TASK>
[  774.346671]  ? zero_user_segments+0x82/0x190
[  774.347346]  truncate_inode_partial_folio+0xd4/0x2a0
[  774.348128]  truncate_inode_pages_range+0x380/0x830
[  774.348904]  truncate_setsize+0x63/0x80
[  774.349530]  simple_setattr+0x37/0x60
[  774.350102]  notify_change+0x3d8/0x4d0
[  774.350681]  do_sys_ftruncate+0x162/0x1d0
[  774.351302]  __x64_sys_ftruncate+0x1c/0x20
[  774.351936]  do_syscall_64+0x44/0xa0
[  774.352486]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  774.353284] RIP: 0033:0x7f72947c392b
[  774.354001] Code: 77 05 c3 0f 1f 40 00 48 8b 15 41 85 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 11 85 0c 00 f7 d8
[  774.357938] RSP: 002b:00007ffcad62a1a8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
[  774.359116] RAX: ffffffffffffffda RBX: 000055f47662b440 RCX: 00007f72947c392b
[  774.360186] RDX: 0000000000000028 RSI: 0000000000000028 RDI: 0000000000000003
[  774.361246] RBP: 00007ffcad62a1c0 R08: 0000000000000003 R09: 0000000000000000
[  774.362324] R10: 00007f72946dc230 R11: 0000000000000202 R12: 000055f47662b0e0
[  774.363393] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  774.364470]  </TASK>
[  774.364807] Modules linked in: xhci_pci xhci_hcd virtio_net net_failover failover virtio_blk virtio_balloon uhci_hcd ohci_pci ohci_hcd evdev ehci_pci ehci_hcd 9pnet_virtio 9p netfs 9pnet
[  774.367325] CR2: ffffa0a889277028
[  774.367838] ---[ end trace 0000000000000000 ]---
[  774.368543] RIP: 0010:memset_erms+0x9/0x10
[  774.369187] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
[  774.372282] RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
[  774.373372] RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
[  774.374814] RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
[  774.376248] RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
[  774.377687] R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
[  774.379135] R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
[  774.380550] FS:  00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
[  774.382177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  774.383329] CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
[  774.384763] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  774.386229] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  774.387664] Kernel panic - not syncing: Fatal exception
[  774.388863] Kernel Offset: 0x8000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  774.391014] ---[ end Kernel panic - not syncing: Fatal exception ]---

[lkp@intel.com: secretmem_iops can be static]
  Signed-off-by: kernel test robot <lkp@intel.com>
[axelrasmussen@google.com: return EINVAL]
  Link: https://lkml.kernel.org/r/20220412193023.279320-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220324210909.1843814-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/secretmem.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

--- a/mm/secretmem.c~mm-secretmem-fix-panic-when-growing-a-memfd_secret
+++ a/mm/secretmem.c
@@ -158,6 +158,22 @@ const struct address_space_operations se
 	.isolate_page	= secretmem_isolate_page,
 };
 
+static int secretmem_setattr(struct user_namespace *mnt_userns,
+			     struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = d_inode(dentry);
+	unsigned int ia_valid = iattr->ia_valid;
+
+	if ((ia_valid & ATTR_SIZE) && inode->i_size)
+		return -EINVAL;
+
+	return simple_setattr(mnt_userns, dentry, iattr);
+}
+
+static const struct inode_operations secretmem_iops = {
+	.setattr = secretmem_setattr,
+};
+
 static struct vfsmount *secretmem_mnt;
 
 static struct file *secretmem_file_create(unsigned long flags)
@@ -177,6 +193,7 @@ static struct file *secretmem_file_creat
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
 	mapping_set_unevictable(inode->i_mapping);
 
+	inode->i_op = &secretmem_iops;
 	inode->i_mapping->a_ops = &secretmem_aops;
 
 	/* pretend we are a normal file with zero size */
_

* [patch 04/14] irq_work: use kasan_record_aux_stack_noalloc() record callstack
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: ryabinin.a.a, glider, dvyukov, andreyknvl, qiang1.zhang, akpm,
	patches, linux-mm, mm-commits, torvalds, akpm

From: Zqiang <qiang1.zhang@intel.com>
Subject: irq_work: use kasan_record_aux_stack_noalloc() record callstack

[    4.113128] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
[    4.113132] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd
[    4.113149] Preemption disabled at:
[    4.113149] [<ffffffffbab1a531>] rt_mutex_slowunlock+0xa1/0x4e0
[    4.113154] CPU: 3 PID: 239 Comm: bootlogd Tainted: G        W
5.17.1-rt17-yocto-preempt-rt+ #105
[    4.113157] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[    4.113159] Call Trace:
[    4.113160]  <TASK>
[    4.113161]  dump_stack_lvl+0x60/0x8c
[    4.113165]  dump_stack+0x10/0x12
[    4.113167]  __might_resched.cold+0x13b/0x173
[    4.113172]  rt_spin_lock+0x5b/0xf0
[    4.113179]  get_page_from_freelist+0x20c/0x1610
[    4.113208]  __alloc_pages+0x25e/0x5e0
[    4.113222]  __stack_depot_save+0x3c0/0x4a0
[    4.113228]  kasan_save_stack+0x3a/0x50
[    4.113322]  __kasan_record_aux_stack+0xb6/0xc0
[    4.113326]  kasan_record_aux_stack+0xe/0x10
[    4.113329]  irq_work_queue_on+0x6a/0x1c0
[    4.113333]  pull_rt_task+0x631/0x6b0
[    4.113343]  do_balance_callbacks+0x56/0x80
[    4.113346]  __balance_callbacks+0x63/0x90
[    4.113350]  rt_mutex_setprio+0x349/0x880
[    4.113366]  rt_mutex_slowunlock+0x22a/0x4e0
[    4.113377]  rt_spin_unlock+0x49/0x80
[    4.113380]  uart_write+0x186/0x2b0
[    4.113385]  do_output_char+0x2e9/0x3a0
[    4.113389]  n_tty_write+0x306/0x800
[    4.113413]  file_tty_write.isra.0+0x2af/0x450
[    4.113422]  tty_write+0x22/0x30
[    4.113425]  new_sync_write+0x27c/0x3a0
[    4.113446]  vfs_write+0x3f7/0x5d0
[    4.113451]  ksys_write+0xd9/0x180
[    4.113463]  __x64_sys_write+0x43/0x50
[    4.113466]  do_syscall_64+0x44/0x90
[    4.113469]  entry_SYSCALL_64_after_hwframe+0x44/0xae

On a PREEMPT_RT kernel with KASAN enabled, kasan_record_aux_stack() may
call alloc_pages(), which acquires an rt-spinlock; if this happens while
already in atomic context, it triggers the warning above.  Fix it by using
kasan_record_aux_stack_noalloc(), which avoids calling alloc_pages().

Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/irq_work.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/irq_work.c~irq_work-use-kasan_record_aux_stack_noalloc-record-callstack
+++ a/kernel/irq_work.c
@@ -137,7 +137,7 @@ bool irq_work_queue_on(struct irq_work *
 	if (!irq_work_claim(work))
 		return false;
 
-	kasan_record_aux_stack(work);
+	kasan_record_aux_stack_noalloc(work);
 
 	preempt_disable();
 	if (cpu != smp_processor_id()) {
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 05/14] kasan: fix hw tags enablement when KUNIT tests are disabled
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: will, ryabinin.a.a, glider, dvyukov, catalin.marinas, andreyknvl,
	vincenzo.frascino, akpm, patches, linux-mm, mm-commits, torvalds,
	akpm

From: Vincenzo Frascino <vincenzo.frascino@arm.com>
Subject: kasan: fix hw tags enablement when KUNIT tests are disabled

KASAN enables hw tags via kasan_enable_tagging(), which selects the
correct hw backend based on the mode passed on the kernel command line.
kasan_enable_tagging() is meant to be invoked indirectly via the cpu
features framework of the architectures that support these backends.
Currently the invocation of this function is guarded by
CONFIG_KASAN_KUNIT_TEST, which allows the correct backend to be enabled
only when KUNIT tests are enabled in the kernel.

This inconsistency was introduced in commit:

  ed6d74446cbf ("kasan: test: support async (again) and asymm modes for HW_TAGS")

... and prevents MTE from being enabled on arm64 when the KUNIT tests for
kasan hw_tags are disabled.

Fix the issue by making sure that the CONFIG_KASAN_KUNIT_TEST guard does
not prevent the correct invocation of kasan_enable_tagging().
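
To illustrate the breakage, with the old guard a CONFIG_KASAN_HW_TAGS=y
build without CONFIG_KASAN_KUNIT_TEST only ever saw the empty stub, so
the hw backend was never switched on.  Simplified from the kasan.h hunk
below:

	#if defined(CONFIG_KASAN_HW_TAGS) && IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
	void kasan_enable_tagging(void);		/* real backend selection */
	#else
	static inline void kasan_enable_tagging(void) { }	/* empty stub, MTE stays off */
	#endif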

Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446cbf ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kasan/hw_tags.c |    5 +++--
 mm/kasan/kasan.h   |   10 ++++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

--- a/mm/kasan/hw_tags.c~kasan-fix-hw-tags-enablement-when-kunit-tests-are-disabled
+++ a/mm/kasan/hw_tags.c
@@ -336,8 +336,6 @@ void __kasan_poison_vmalloc(const void *
 
 #endif
 
-#if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
-
 void kasan_enable_tagging(void)
 {
 	if (kasan_arg_mode == KASAN_ARG_MODE_ASYNC)
@@ -347,6 +345,9 @@ void kasan_enable_tagging(void)
 	else
 		hw_enable_tagging_sync();
 }
+
+#if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
+
 EXPORT_SYMBOL_GPL(kasan_enable_tagging);
 
 void kasan_force_async_fault(void)
--- a/mm/kasan/kasan.h~kasan-fix-hw-tags-enablement-when-kunit-tests-are-disabled
+++ a/mm/kasan/kasan.h
@@ -355,25 +355,27 @@ static inline const void *arch_kasan_set
 #define hw_set_mem_tag_range(addr, size, tag, init) \
 			arch_set_mem_tag_range((addr), (size), (tag), (init))
 
+void kasan_enable_tagging(void);
+
 #else /* CONFIG_KASAN_HW_TAGS */
 
 #define hw_enable_tagging_sync()
 #define hw_enable_tagging_async()
 #define hw_enable_tagging_asymm()
 
+static inline void kasan_enable_tagging(void) { }
+
 #endif /* CONFIG_KASAN_HW_TAGS */
 
 #if defined(CONFIG_KASAN_HW_TAGS) && IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
 
-void kasan_enable_tagging(void);
 void kasan_force_async_fault(void);
 
-#else /* CONFIG_KASAN_HW_TAGS || CONFIG_KASAN_KUNIT_TEST */
+#else /* CONFIG_KASAN_HW_TAGS && CONFIG_KASAN_KUNIT_TEST */
 
-static inline void kasan_enable_tagging(void) { }
 static inline void kasan_force_async_fault(void) { }
 
-#endif /* CONFIG_KASAN_HW_TAGS || CONFIG_KASAN_KUNIT_TEST */
+#endif /* CONFIG_KASAN_HW_TAGS && CONFIG_KASAN_KUNIT_TEST */
 
 #ifdef CONFIG_KASAN_SW_TAGS
 u8 kasan_random_tag(void);
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 06/14] mm, kfence: support kmem_dump_obj() for KFENCE objects
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: vbabka, oliver.sang, 42.hyeyoo, elver, akpm, patches, linux-mm,
	mm-commits, torvalds, akpm

From: Marco Elver <elver@google.com>
Subject: mm, kfence: support kmem_dump_obj() for KFENCE objects

Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
producing garbage data due to the object not actually being maintained by
SLAB or SLUB.

Fix this by implementing __kfence_obj_info() that copies relevant
information to struct kmem_obj_info when the object was allocated by
KFENCE; this is called by a common kmem_obj_info(), which also calls the
slab/slub/slob specific variant now called __kmem_obj_info().

For completeness, kmem_dump_obj() now displays whether the object was
allocated by KFENCE.
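
As a usage note, a hypothetical debugging snippet (not part of this
patch) that benefits from the change:

	void *p = kmalloc(32, GFP_KERNEL);	/* may be served from the KFENCE pool */

	if (kmem_valid_obj(p))
		kmem_dump_obj(p);	/* now prints "(kfence)" and sane provenance for KFENCE objects */
	kfree(p);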

Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB")
Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>	[slab]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/kfence.h |   24 +++++++++++++++++++
 mm/kfence/core.c       |   21 -----------------
 mm/kfence/kfence.h     |   21 +++++++++++++++++
 mm/kfence/report.c     |   47 +++++++++++++++++++++++++++++++++++++++
 mm/slab.c              |    2 -
 mm/slab.h              |    2 -
 mm/slab_common.c       |    9 +++++++
 mm/slob.c              |    2 -
 mm/slub.c              |    2 -
 9 files changed, 105 insertions(+), 25 deletions(-)

--- a/include/linux/kfence.h~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/include/linux/kfence.h
@@ -204,6 +204,22 @@ static __always_inline __must_check bool
  */
 bool __must_check kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs);
 
+#ifdef CONFIG_PRINTK
+struct kmem_obj_info;
+/**
+ * __kfence_obj_info() - fill kmem_obj_info struct
+ * @kpp: kmem_obj_info to be filled
+ * @object: the object
+ *
+ * Return:
+ * * false - not a KFENCE object
+ * * true - a KFENCE object, filled @kpp
+ *
+ * Copies information to @kpp for KFENCE objects.
+ */
+bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
+#endif
+
 #else /* CONFIG_KFENCE */
 
 static inline bool is_kfence_address(const void *addr) { return false; }
@@ -221,6 +237,14 @@ static inline bool __must_check kfence_h
 	return false;
 }
 
+#ifdef CONFIG_PRINTK
+struct kmem_obj_info;
+static inline bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+{
+	return false;
+}
+#endif
+
 #endif
 
 #endif /* _LINUX_KFENCE_H */
--- a/mm/kfence/core.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/kfence/core.c
@@ -231,27 +231,6 @@ static bool kfence_unprotect(unsigned lo
 	return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), false));
 }
 
-static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
-{
-	long index;
-
-	/* The checks do not affect performance; only called from slow-paths. */
-
-	if (!is_kfence_address((void *)addr))
-		return NULL;
-
-	/*
-	 * May be an invalid index if called with an address at the edge of
-	 * __kfence_pool, in which case we would report an "invalid access"
-	 * error.
-	 */
-	index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
-	if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
-		return NULL;
-
-	return &kfence_metadata[index];
-}
-
 static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
 {
 	unsigned long offset = (meta - kfence_metadata + 1) * PAGE_SIZE * 2;
--- a/mm/kfence/kfence.h~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/kfence/kfence.h
@@ -96,6 +96,27 @@ struct kfence_metadata {
 
 extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
 
+static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
+{
+	long index;
+
+	/* The checks do not affect performance; only called from slow-paths. */
+
+	if (!is_kfence_address((void *)addr))
+		return NULL;
+
+	/*
+	 * May be an invalid index if called with an address at the edge of
+	 * __kfence_pool, in which case we would report an "invalid access"
+	 * error.
+	 */
+	index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
+	if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
+		return NULL;
+
+	return &kfence_metadata[index];
+}
+
 /* KFENCE error types for report generation. */
 enum kfence_error_type {
 	KFENCE_ERROR_OOB,		/* Detected a out-of-bounds access. */
--- a/mm/kfence/report.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/kfence/report.c
@@ -273,3 +273,50 @@ void kfence_report_error(unsigned long a
 	/* We encountered a memory safety error, taint the kernel! */
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_STILL_OK);
 }
+
+#ifdef CONFIG_PRINTK
+static void kfence_to_kp_stack(const struct kfence_track *track, void **kp_stack)
+{
+	int i, j;
+
+	i = get_stack_skipnr(track->stack_entries, track->num_stack_entries, NULL);
+	for (j = 0; i < track->num_stack_entries && j < KS_ADDRS_COUNT; ++i, ++j)
+		kp_stack[j] = (void *)track->stack_entries[i];
+	if (j < KS_ADDRS_COUNT)
+		kp_stack[j] = NULL;
+}
+
+bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+{
+	struct kfence_metadata *meta = addr_to_metadata((unsigned long)object);
+	unsigned long flags;
+
+	if (!meta)
+		return false;
+
+	/*
+	 * If state is UNUSED at least show the pointer requested; the rest
+	 * would be garbage data.
+	 */
+	kpp->kp_ptr = object;
+
+	/* Requesting info an a never-used object is almost certainly a bug. */
+	if (WARN_ON(meta->state == KFENCE_OBJECT_UNUSED))
+		return true;
+
+	raw_spin_lock_irqsave(&meta->lock, flags);
+
+	kpp->kp_slab = slab;
+	kpp->kp_slab_cache = meta->cache;
+	kpp->kp_objp = (void *)meta->addr;
+	kfence_to_kp_stack(&meta->alloc_track, kpp->kp_stack);
+	if (meta->state == KFENCE_OBJECT_FREED)
+		kfence_to_kp_stack(&meta->free_track, kpp->kp_free_stack);
+	/* get_stack_skipnr() ensures the first entry is outside allocator. */
+	kpp->kp_ret = kpp->kp_stack[0];
+
+	raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+	return true;
+}
+#endif
--- a/mm/slab.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slab.c
@@ -3665,7 +3665,7 @@ EXPORT_SYMBOL(__kmalloc_node_track_calle
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
 {
 	struct kmem_cache *cachep;
 	unsigned int objnr;
--- a/mm/slab_common.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slab_common.c
@@ -555,6 +555,13 @@ bool kmem_valid_obj(void *object)
 }
 EXPORT_SYMBOL_GPL(kmem_valid_obj);
 
+static void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+{
+	if (__kfence_obj_info(kpp, object, slab))
+		return;
+	__kmem_obj_info(kpp, object, slab);
+}
+
 /**
  * kmem_dump_obj - Print available slab provenance information
  * @object: slab object for which to find provenance information.
@@ -590,6 +597,8 @@ void kmem_dump_obj(void *object)
 		pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name);
 	else
 		pr_cont(" slab%s", cp);
+	if (is_kfence_address(object))
+		pr_cont(" (kfence)");
 	if (kp.kp_objp)
 		pr_cont(" start %px", kp.kp_objp);
 	if (kp.kp_data_offset)
--- a/mm/slab.h~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slab.h
@@ -868,7 +868,7 @@ struct kmem_obj_info {
 	void *kp_stack[KS_ADDRS_COUNT];
 	void *kp_free_stack[KS_ADDRS_COUNT];
 };
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
 #endif
 
 #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
--- a/mm/slob.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slob.c
@@ -463,7 +463,7 @@ out:
 }
 
 #ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
 {
 	kpp->kp_ptr = object;
 	kpp->kp_slab = slab;
--- a/mm/slub.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slub.c
@@ -4312,7 +4312,7 @@ int __kmem_cache_shutdown(struct kmem_ca
 }
 
 #ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
 {
 	void *base;
 	int __maybe_unused i;
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 07/14] mm, page_alloc: fix build_zonerefs_node()
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: stable, richard.weiyang, mhocko, marmarek, david, jgross, akpm,
	patches, linux-mm, mm-commits, torvalds, akpm

From: Juergen Gross <jgross@suse.com>
Subject: mm, page_alloc: fix build_zonerefs_node()

Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from
zones with pages managed by the buddy allocator") only zones with free
memory are included in a built zonelist.  This is problematic when, for
example, all memory of a zone has been ballooned out at the time the
zonelists are rebuilt.

The decision whether to rebuild the zonelists when onlining new memory is
based on populated_zone() returning 0 for the zone the memory will be
added to.  The new zone is added to the zonelists only if it has free
memory pages (managed_zone() returns a non-zero value) after the memory
has been onlined.  This implies that onlining memory will always free the
added pages to the allocator immediately, but this is not true in all
cases: when e.g. running as a Xen guest, the onlined new memory will be
added only to the ballooned memory list and will be freed only when the
guest is ballooned up afterwards.

Another problem with using managed_zone() to decide whether a zone is
added to the zonelists is that a zone with all of its memory in use will
in fact be removed from all zonelists if the zonelists happen to be
rebuilt.

Use populated_zone() when building a zonelist as it has been done before
that commit.
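
For reference, the distinction relied on here reduces to the following
helpers, paraphrased from include/linux/mmzone.h (a sketch, not the
exact code):

	static inline bool populated_zone(struct zone *zone)
	{
		return zone->present_pages;		/* physical pages present in the zone */
	}

	static inline bool managed_zone(struct zone *zone)
	{
		return zone_managed_pages(zone);	/* pages currently under buddy allocator control */
	}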

There was a report that QubesOS (based on Xen) is hitting this problem.
Xen switched to the zone device functionality in kernel 5.9, and QubesOS
wants to use memory hotplugging for guests in order to be able to start a
guest with minimal memory and expand it as needed.  That report led to
this patch.

Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_alloc-fix-build_zonerefs_node
+++ a/mm/page_alloc.c
@@ -6131,7 +6131,7 @@ static int build_zonerefs_node(pg_data_t
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (managed_zone(zone)) {
+		if (populated_zone(zone)) {
 			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 08/14] mm: fix unexpected zeroed page mapping with zram swap
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: stable, senozhatsky, ngupta, ivan, david, axboe, minchan, akpm,
	patches, linux-mm, mm-commits, torvalds, akpm

From: Minchan Kim <minchan@kernel.org>
Subject: mm: fix unexpected zeroed page mapping with zram swap

With two processes sharing memory via CLONE_VM, a user process can be
corrupted by unexpectedly seeing a zeroed page.

    CPU A                        CPU B

do_swap_page                do_swap_page
SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
swap_readpage valid data
  swap_slot_free_notify
    delete zram entry
                            swap_readpage zeroed(invalid) data
                            pte_lock
                            map the *zero data* to userspace
                            pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return and next refault will
read zeroed data

swap_slot_free_notify() is bogus for the CLONE_VM case since copy_mm does
not increase the refcount of the swap slot, so it cannot tell whether it
is safe to discard the data from the backing device.  In that case, the
only lock it could rely on to synchronize swap slot freeing is the page
table lock.  Thus, this patch gets rid of the swap_slot_free_notify()
function.  With this patch, CPU A will see correct data.

    CPU A                        CPU B

do_swap_page                do_swap_page
SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
                            swap_readpage original data
                            pte_lock
                            map the original data
                            swap_free
                              swap_range_free
                                bd_disk->fops->swap_slot_free_notify
swap_readpage read zeroed data
                            pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return
on next refault will see mapped data by CPU B

A concern with this patch is increased memory consumption, since it could
keep wasted memory in compressed form in zram as well as in uncompressed
form in the address space.  However, most zram setups use no readahead and
do_swap_page is followed by swap_free, so the compressed form in zram will
be freed quickly.
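
For clarity, the SWP_SYNCHRONOUS_IO read path after this patch simply
reads and accounts the page; freeing the zram entry is deferred to
swap_free() under the page table lock.  Sketch of the resulting
swap_readpage() hunk (see the diff below):

	if (sis->flags & SWP_SYNCHRONOUS_IO) {
		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
		if (!ret) {
			/* no swap_slot_free_notify() here any more */
			count_vm_event(PSWPIN);
			goto out;
		}
	}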

Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_io.c |   54 -------------------------------------------------
 1 file changed, 54 deletions(-)

--- a/mm/page_io.c~mm-fix-unexpected-zeroed-page-mapping-with-zram-swap
+++ a/mm/page_io.c
@@ -51,54 +51,6 @@ void end_swap_bio_write(struct bio *bio)
 	bio_put(bio);
 }
 
-static void swap_slot_free_notify(struct page *page)
-{
-	struct swap_info_struct *sis;
-	struct gendisk *disk;
-	swp_entry_t entry;
-
-	/*
-	 * There is no guarantee that the page is in swap cache - the software
-	 * suspend code (at least) uses end_swap_bio_read() against a non-
-	 * swapcache page.  So we must check PG_swapcache before proceeding with
-	 * this optimization.
-	 */
-	if (unlikely(!PageSwapCache(page)))
-		return;
-
-	sis = page_swap_info(page);
-	if (data_race(!(sis->flags & SWP_BLKDEV)))
-		return;
-
-	/*
-	 * The swap subsystem performs lazy swap slot freeing,
-	 * expecting that the page will be swapped out again.
-	 * So we can avoid an unnecessary write if the page
-	 * isn't redirtied.
-	 * This is good for real swap storage because we can
-	 * reduce unnecessary I/O and enhance wear-leveling
-	 * if an SSD is used as the as swap device.
-	 * But if in-memory swap device (eg zram) is used,
-	 * this causes a duplicated copy between uncompressed
-	 * data in VM-owned memory and compressed data in
-	 * zram-owned memory.  So let's free zram-owned memory
-	 * and make the VM-owned decompressed page *dirty*,
-	 * so the page should be swapped out somewhere again if
-	 * we again wish to reclaim it.
-	 */
-	disk = sis->bdev->bd_disk;
-	entry.val = page_private(page);
-	if (disk->fops->swap_slot_free_notify && __swap_count(entry) == 1) {
-		unsigned long offset;
-
-		offset = swp_offset(entry);
-
-		SetPageDirty(page);
-		disk->fops->swap_slot_free_notify(sis->bdev,
-				offset);
-	}
-}
-
 static void end_swap_bio_read(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
@@ -114,7 +66,6 @@ static void end_swap_bio_read(struct bio
 	}
 
 	SetPageUptodate(page);
-	swap_slot_free_notify(page);
 out:
 	unlock_page(page);
 	WRITE_ONCE(bio->bi_private, NULL);
@@ -394,11 +345,6 @@ int swap_readpage(struct page *page, boo
 	if (sis->flags & SWP_SYNCHRONOUS_IO) {
 		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
 		if (!ret) {
-			if (trylock_page(page)) {
-				swap_slot_free_notify(page);
-				unlock_page(page);
-			}
-
 			count_vm_event(PSWPIN);
 			goto out;
 		}
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 09/14] mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: vbabka, nigupta, lkp, quic_charante, akpm, patches, linux-mm,
	mm-commits, torvalds, akpm

From: Charan Teja Kalla <quic_charante@quicinc.com>
Subject: mm: compaction: fix compiler warning when CONFIG_COMPACTION=n

The below warning is reported when CONFIG_COMPACTION=n:

   mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC'
defined but not used [-Wunused-const-variable=]
      56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC =
500;
         |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under the
CONFIG_COMPACTION #ifdef.  Also, since this was just a 'static const
unsigned int', use a #define for it.

Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/mm/compaction.c~mm-compaction-fix-compiler-warning-when-config_compaction=n
+++ a/mm/compaction.c
@@ -26,6 +26,11 @@
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
+/*
+ * Fragmentation score check interval for proactive compaction purposes.
+ */
+#define HPAGE_FRAG_CHECK_INTERVAL_MSEC	(500)
+
 static inline void count_compact_event(enum vm_event_item item)
 {
 	count_vm_event(item);
@@ -51,11 +56,6 @@ static inline void count_compact_events(
 #define pageblock_end_pfn(pfn)		block_end_pfn(pfn, pageblock_order)
 
 /*
- * Fragmentation score check interval for proactive compaction purposes.
- */
-static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
-
-/*
  * Page order with-respect-to which proactive compaction
  * calculates external fragmentation, which is used as
  * the "fragmentation score" of a node/zone.
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 10/14] hugetlb: do not demote poisoned hugetlb pages
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: stable, naoya.horiguchi, mike.kravetz, akpm, patches, linux-mm,
	mm-commits, torvalds, akpm

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: do not demote poisoned hugetlb pages

It is possible for poisoned hugetlb pages to reside on the free lists. 
The huge page allocation routines which dequeue entries from the free
lists make a point of avoiding poisoned pages.  There is no such check and
avoidance in the demote code path.
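
(The dequeue paths referred to above skip poisoned pages with a check
along these lines, paraphrased from the huge page allocation code; the
hunk below adds the demote-side equivalent.)

	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
		if (PageHWPoison(page))
			continue;	/* never hand out a poisoned huge page */
		/* ...otherwise dequeue and use this page... */
	}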

If a hugetlb page is on a free list, poison will only be set in the head
page rather than in the page with the actual error.  If such a page is
demoted, the poison flag may follow the wrong page: a page without the
error could end up with poison set, and the page with the actual error
could end up without the flag set.

Check for poison before attempting to demote a hugetlb page.  Also, return
-EBUSY to the caller if only poisoned pages are on the free list.

Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

--- a/mm/hugetlb.c~hugetlb-do-not-demote-poisoned-hugetlb-pages
+++ a/mm/hugetlb.c
@@ -3475,7 +3475,6 @@ static int demote_pool_huge_page(struct
 {
 	int nr_nodes, node;
 	struct page *page;
-	int rc = 0;
 
 	lockdep_assert_held(&hugetlb_lock);
 
@@ -3486,15 +3485,19 @@ static int demote_pool_huge_page(struct
 	}
 
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
-		if (!list_empty(&h->hugepage_freelists[node])) {
-			page = list_entry(h->hugepage_freelists[node].next,
-					struct page, lru);
-			rc = demote_free_huge_page(h, page);
-			break;
+		list_for_each_entry(page, &h->hugepage_freelists[node], lru) {
+			if (PageHWPoison(page))
+				continue;
+
+			return demote_free_huge_page(h, page);
 		}
 	}
 
-	return rc;
+	/*
+	 * Only way to get here is if all pages on free lists are poisoned.
+	 * Return -EBUSY so that caller will not retry.
+	 */
+	return -EBUSY;
 }
 
 #define HSTATE_ATTR_RO(_name) \
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 11/14] revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: viro, surenb, stable, sspatil, songliubraving, shuah, rppt,
	rientjes, regressions, ndesaulniers, mike.kravetz, maskray,
	kirill.shutemov, irogers, hughd, hjl.tools, ckennelly, adobriyan,
	akpm, patches, linux-mm, mm-commits, torvalds, akpm

From: Andrew Morton <akpm@linux-foundation.org>
Subject: revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"

925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
is an attempt to fix regressions due to 9630f0d60fec5f ("fs/binfmt_elf:
use PT_LOAD p_align values for static PIE").

But regressions continue to be reported:

https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
https://bugzilla.kernel.org/show_bug.cgi?id=215720
https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

This patch reverts the fix, so the original can also be reverted.

Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/binfmt_elf.c~revert-fs-binfmt_elf-fix-pt_load-p_align-values-for-loaders
+++ a/fs/binfmt_elf.c
@@ -1118,7 +1118,7 @@ out_free_interp:
 			 * without MAP_FIXED nor MAP_FIXED_NOREPLACE).
 			 */
 			alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
-			if (interpreter || alignment > ELF_MIN_ALIGN) {
+			if (alignment > ELF_MIN_ALIGN) {
 				load_bias = ELF_ET_DYN_BASE;
 				if (current->flags & PF_RANDOMIZE)
 					load_bias += arch_mmap_rnd();
_

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 12/14] revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:13   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:13 UTC (permalink / raw)
  To: viro, surenb, stable, sspatil, songliubraving, shuah, rppt,
	rientjes, regressions, ndesaulniers, mike.kravetz, maskray,
	kirill.shutemov, irogers, hughd, hjl.tools, ckennelly, adobriyan,
	akpm, patches, linux-mm, mm-commits, torvalds, akpm

From: Andrew Morton <akpm@linux-foundation.org>
Subject: revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"

Despite Mike's attempted fix (925346c129da117122), regression reports
continue:

https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
https://bugzilla.kernel.org/show_bug.cgi?id=215720
https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

So revert this patch.

Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/binfmt_elf.c~revert-fs-binfmt_elf-use-pt_load-p_align-values-for-static-pie
+++ a/fs/binfmt_elf.c
@@ -1117,11 +1117,11 @@ out_free_interp:
 			 * independently randomized mmap region (0 load_bias
 			 * without MAP_FIXED nor MAP_FIXED_NOREPLACE).
 			 */
-			alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
-			if (alignment > ELF_MIN_ALIGN) {
+			if (interpreter) {
 				load_bias = ELF_ET_DYN_BASE;
 				if (current->flags & PF_RANDOMIZE)
 					load_bias += arch_mmap_rnd();
+				alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
 				if (alignment)
 					load_bias &= ~(alignment - 1);
 				elf_flags |= MAP_FIXED_NOREPLACE;
_
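
For illustration, here is a minimal worked example of the load_bias masking
restored above (a standalone sketch with made-up values, not kernel code):
with a maximum PT_LOAD p_align of 2 MiB, the mask simply drops the low bits
of the randomized base so the chosen load_bias stays p_align-aligned.

	/* Standalone illustration only; the addresses are hypothetical. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long load_bias = 0x555555554000UL + 0x1a3000UL; /* base + random offset */
		unsigned long alignment = 0x200000UL;                    /* max PT_LOAD p_align: 2 MiB */

		if (alignment)
			load_bias &= ~(alignment - 1);

		printf("%#lx\n", load_bias);	/* prints 0x555555600000, a 2 MiB boundary */
		return 0;
	}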

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 13/14] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:14   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:14 UTC (permalink / raw)
  To: urezki, hch, chris, bhe, osandov, akpm, patches, linux-mm,
	mm-commits, torvalds, akpm

From: Omar Sandoval <osandov@fb.com>
Subject: mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore

Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge
the vmap areas instead of doing it lazily.

Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread. 
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever).  For example, consider the following scenario:

1. Thread reads from /proc/vmcore. This eventually calls
   __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
   vmap_lazy_nr to lazy_max_pages() + 1.
2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
   pages (one page plus the guard page) to the purge list and
   vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
   drain_vmap_work is scheduled.
3. Thread returns from the kernel and is scheduled out.
4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
   frees the 2 pages on the purge list. vmap_lazy_nr is now
   lazy_max_pages() + 1.
5. This is still over the threshold, so it tries to purge areas again,
   but doesn't find anything.
6. Repeat 5.

If the system is running with only one CPU (which is typical for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs.  If there is more than
one CPU or preemption is enabled, then the worker thread will spin forever
in the background.  (Note that if there were already pages to be purged at
the time that set_iounmap_nonlazy() was called, this bug is avoided.)

This can be reproduced with anything that reads from /proc/vmcore multiple
times.  E.g., vmcore-dmesg /proc/vmcore.

It turns out that improvements to vmap() over the years have obsoleted the
need for this "optimization".  I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore. 
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:

  |5.17  |5.18+fix|5.18+removal
4k|40.86s|  40.09s|      26.73s
1M|24.47s|  23.98s|      21.84s

The removal was the fastest (by a wide margin with 4k reads). This patch
removes set_iounmap_nonlazy().

Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b1a  ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/include/asm/io.h       |    2 --
 arch/x86/kernel/crash_dump_64.c |    1 -
 mm/vmalloc.c                    |   11 -----------
 3 files changed, 14 deletions(-)

--- a/arch/x86/include/asm/io.h~mm-vmalloc-fix-spinning-drain_vmap_work-after-reading-from-proc-vmcore
+++ a/arch/x86/include/asm/io.h
@@ -210,8 +210,6 @@ void __iomem *ioremap(resource_size_t of
 extern void iounmap(volatile void __iomem *addr);
 #define iounmap iounmap
 
-extern void set_iounmap_nonlazy(void);
-
 #ifdef __KERNEL__
 
 void memcpy_fromio(void *, const volatile void __iomem *, size_t);
--- a/arch/x86/kernel/crash_dump_64.c~mm-vmalloc-fix-spinning-drain_vmap_work-after-reading-from-proc-vmcore
+++ a/arch/x86/kernel/crash_dump_64.c
@@ -37,7 +37,6 @@ static ssize_t __copy_oldmem_page(unsign
 	} else
 		memcpy(buf, vaddr + offset, csize);
 
-	set_iounmap_nonlazy();
 	iounmap((void __iomem *)vaddr);
 	return csize;
 }
--- a/mm/vmalloc.c~mm-vmalloc-fix-spinning-drain_vmap_work-after-reading-from-proc-vmcore
+++ a/mm/vmalloc.c
@@ -1671,17 +1671,6 @@ static DEFINE_MUTEX(vmap_purge_lock);
 /* for per-CPU blocks */
 static void purge_fragmented_blocks_allcpus(void);
 
-#ifdef CONFIG_X86_64
-/*
- * called before a call to iounmap() if the caller wants vm_area_struct's
- * immediately freed.
- */
-void set_iounmap_nonlazy(void)
-{
-	atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
-}
-#endif /* CONFIG_X86_64 */
-
 /*
  * Purges all lazily-freed vmap areas.
  */
_
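
For illustration, a simplified C sketch of the worker loop described in the
changelog (paraphrased from the shape introduced by 690467c81b1a, not the
exact kernel source): once set_iounmap_nonlazy() has pinned vmap_lazy_nr at
lazy_max_pages() + 1 and the purge list is empty, the exit condition can
never be satisfied.

	/* Paraphrased sketch of drain_vmap_area_work(); illustrative only. */
	static void drain_vmap_area_work(struct work_struct *work)
	{
		unsigned long nr_lazy;

		do {
			mutex_lock(&vmap_purge_lock);
			__purge_vmap_area_lazy(ULONG_MAX, 0);	/* finds nothing to free */
			mutex_unlock(&vmap_purge_lock);

			/* Still lazy_max_pages() + 1, so loop again. */
			nr_lazy = atomic_long_read(&vmap_lazy_nr);
		} while (nr_lazy > lazy_max_pages());
	}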

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch 14/14] mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
  2022-04-15  2:12 incoming Andrew Morton
@ 2022-04-15  2:14   ` Andrew Morton
  2022-04-15  2:13   ` Andrew Morton
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15  2:14 UTC (permalink / raw)
  To: stable, catalin.marinas, patrick.wang.shcn, akpm, patches,
	linux-mm, mm-commits, torvalds, akpm

From: Patrick Wang <patrick.wang.shcn@gmail.com>
Subject: mm: kmemleak: take a full lowmem check in kmemleak_*_phys()

The kmemleak_*_phys() APIs do not check the address against lowmem's min
boundary, but the caller may pass an address below lowmem, which will
trigger an oops:

# echo scan > /sys/kernel/debug/kmemleak
[   54.888353] Unable to handle kernel paging request at virtual address ff5fffffffe00000
[   54.888932] Oops [#1]
[   54.889102] Modules linked in:
[   54.889326] CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33
[   54.889620] Hardware name: riscv-virtio,qemu (DT)
[   54.889901] epc : scan_block+0x74/0x15c
[   54.890215]  ra : scan_block+0x72/0x15c
[   54.890390] epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30
[   54.890607]  gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200
[   54.890835]  t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90
[   54.891024]  s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000
[   54.891201]  a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001
[   54.891377]  a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005
[   54.891552]  s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90
[   54.891727]  s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0
[   54.891903]  s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000
[   54.892078]  s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f
[   54.892271]  t5 : 0000000000000001 t6 : 0000000000000000
[   54.892408] status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d
[   54.892643] [<ffffffff801e5a1c>] scan_gray_list+0x12e/0x1a6
[   54.892824] [<ffffffff801e5d3e>] kmemleak_scan+0x2aa/0x57e
[   54.892961] [<ffffffff801e633c>] kmemleak_write+0x32a/0x40c
[   54.893096] [<ffffffff803915ac>] full_proxy_write+0x56/0x82
[   54.893235] [<ffffffff801ef456>] vfs_write+0xa6/0x2a6
[   54.893362] [<ffffffff801ef880>] ksys_write+0x6c/0xe2
[   54.893487] [<ffffffff801ef918>] sys_write+0x22/0x2a
[   54.893609] [<ffffffff8000397c>] ret_from_syscall+0x0/0x2
[   54.894183] ---[ end trace 0000000000000000 ]---

The callers may not know the actual address they pass (e.g. when it comes
from the devicetree).  So the kmemleak_*_phys() APIs should guarantee that
the address they finally use is in the lowmem range; check the address
against lowmem's min boundary.

Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com
Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kmemleak.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/kmemleak.c~mm-kmemleak-take-a-full-lowmem-check-in-kmemleak__phys
+++ a/mm/kmemleak.c
@@ -1132,7 +1132,7 @@ EXPORT_SYMBOL(kmemleak_no_scan);
 void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, int min_count,
 			       gfp_t gfp)
 {
-	if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+	if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
 		kmemleak_alloc(__va(phys), size, min_count, gfp);
 }
 EXPORT_SYMBOL(kmemleak_alloc_phys);
@@ -1146,7 +1146,7 @@ EXPORT_SYMBOL(kmemleak_alloc_phys);
  */
 void __ref kmemleak_free_part_phys(phys_addr_t phys, size_t size)
 {
-	if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+	if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
 		kmemleak_free_part(__va(phys), size);
 }
 EXPORT_SYMBOL(kmemleak_free_part_phys);
@@ -1158,7 +1158,7 @@ EXPORT_SYMBOL(kmemleak_free_part_phys);
  */
 void __ref kmemleak_not_leak_phys(phys_addr_t phys)
 {
-	if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+	if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
 		kmemleak_not_leak(__va(phys));
 }
 EXPORT_SYMBOL(kmemleak_not_leak_phys);
@@ -1170,7 +1170,7 @@ EXPORT_SYMBOL(kmemleak_not_leak_phys);
  */
 void __ref kmemleak_ignore_phys(phys_addr_t phys)
 {
-	if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+	if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
 		kmemleak_ignore(__va(phys));
 }
 EXPORT_SYMBOL(kmemleak_ignore_phys);
_
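
For illustration, the check added to each helper above can be read as the
following hypothetical predicate (not part of the patch, which open-codes
the test): a physical address is only translated with __va() if its PFN
lies in [min_low_pfn, max_low_pfn), i.e. the page is backed by lowmem.

	/* Hypothetical helper name; the patch open-codes this in each API. */
	static bool phys_addr_is_lowmem(phys_addr_t phys)
	{
		unsigned long pfn = PHYS_PFN(phys);

		return pfn >= min_low_pfn && pfn < max_low_pfn;
	}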

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-15  2:13   ` Andrew Morton
  (?)
@ 2022-04-15 22:10   ` Linus Torvalds
  2022-04-15 22:21     ` Matthew Wilcox
                       ` (3 more replies)
  -1 siblings, 4 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-04-15 22:10 UTC (permalink / raw)
  To: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra, Borislav Petkov
  Cc: patrice.chotard, Mikulas Patocka, markhemm, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
> but the x86 clear_user() is not nearly so well optimized as copy to user
> (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).

Ugh.

I've applied this patch, but honestly, the proper course of action
should just be to improve on clear_user().

If it really is important enough that we should care about that
performance, then we just should fix clear_user().

It's a very odd special thing right now (at least on x86-64) using
some strange handcrafted inline asm code.

I assume that 'rep stosb' is the fastest way to clear things on modern
CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
"rep stos" except for small areas, and probably that "store zeros by
hand" for older CPUs).

Adding PeterZ and Borislav (who seem to be the last ones to have
worked on the copy and clear_page stuff respectively) and the x86
maintainers in case somebody gets the urge to just fix this.

Because memory clearing should be faster than copying, and the thing
that makes copying fast is that FSRM and ERMS logic (the whole
"manually unrolled copy" is hopefully mostly a thing of the past and
we can consider it legacy)

             Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-15 22:10   ` Linus Torvalds
@ 2022-04-15 22:21     ` Matthew Wilcox
  2022-04-15 22:41     ` Hugh Dickins
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2022-04-15 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
	Borislav Petkov, patrice.chotard, Mikulas Patocka, markhemm,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Chuck Lever,
	Hugh Dickins, patches, Linux-MM, mm-commits

On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> > iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> > instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
> > but the x86 clear_user() is not nearly so well optimized as copy to user
> > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
> 
> Ugh.
> 
> I've applied this patch, but honestly, the proper course of action
> should just be to improve on clear_user().
> 
> If it really is important enough that we should care about that
> performance, then we just should fix clear_user().
> 
> It's a very odd special thing right now (at least on x86-64) using
> some strange handcrafted inline asm code.
> 
> I assume that 'rep stosb' is the fastest way to clear things on modern
> CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
> "rep stos" except for small areas, and probably that "store zeros by
> hand" for older CPUs).
> 
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.

Perhaps the x86 maintainers would like to start from
https://lore.kernel.org/lkml/20210523180423.108087-1-sneves@dei.uc.pt/
instead of pushing that work off on the submitter.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-15 22:10   ` Linus Torvalds
  2022-04-15 22:21     ` Matthew Wilcox
@ 2022-04-15 22:41     ` Hugh Dickins
  2022-04-16  6:36     ` Borislav Petkov
  2022-08-18 10:44     ` [tip: x86/cpu] " tip-bot2 for Borislav Petkov
  3 siblings, 0 replies; 82+ messages in thread
From: Hugh Dickins @ 2022-04-15 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
	Borislav Petkov, patrice.chotard, Mikulas Patocka, markhemm,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Chuck Lever,
	Hugh Dickins, patches, Linux-MM, mm-commits

On Fri, 15 Apr 2022, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> > iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> > instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
> > but the x86 clear_user() is not nearly so well optimized as copy to user
> > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
> 
> Ugh.
> 
> I've applied this patch,

Phew, thanks.

> but honestly, the proper course of action
> should just be to improve on clear_user().

You'll find no disagreement here: we've all been saying the same.
It's just that that work is yet to be done (or yet to be accepted).

> 
> If it really is important enough that we should care about that
> performance, then we just should fix clear_user().
> 
> It's a very odd special thing right now (at least on x86-64) using
> some strange handcrafted inline asm code.
> 
> I assume that 'rep stosb' is the fastest way to clear things on modern
> CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
> "rep stos" except for small areas, and probably that "store zeros by
> hand" for older CPUs).
> 
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.

Yes, it was exactly Borislav and PeterZ whom I first approached too,
link 3 in the commit message of the patch that this one is fixing,
https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/

Borislav wants a thorough good patch, and I don't blame him for that!

Hugh

> 
> Because memory clearing should be faster than copying, and the thing
> that makes copying fast is that FSRM and ERMS logic (the whole
> "manually unrolled copy" is hopefully mostly a thing of the past and
> we can consider it legacy)
> 
>              Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-15 22:10   ` Linus Torvalds
  2022-04-15 22:21     ` Matthew Wilcox
  2022-04-15 22:41     ` Hugh Dickins
@ 2022-04-16  6:36     ` Borislav Petkov
  2022-04-16 14:07       ` Mark Hemment
  2022-08-18 10:44     ` [tip: x86/cpu] " tip-bot2 for Borislav Petkov
  3 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-16  6:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
	patrice.chotard, Mikulas Patocka, markhemm, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.

I guess if enough people ask and keep asking, some people at least try
to move...

> Because memory clearing should be faster than copying, and the thing
> that makes copying fast is that FSRM and ERMS logic (the whole
> "manually unrolled copy" is hopefully mostly a thing of the past and
> we can consider it legacy)

So I did give it a look and it seems to me, if we want to do the
alternatives thing here, it will have to look something like
arch/x86/lib/copy_user_64.S.

I.e., the current __clear_user() will have to become the "handle_tail"
thing there which deals with uncopied rest-bytes at the end and the new
fsrm/erms/rep_good variants will then be alternative_call_2 or _3.

The fsrm thing will have only the handle_tail thing at the end when size
!= 0.

The others - erms and rep_good - will have to check for sizes smaller
than, say a cacheline, and for those call the handle_tail thing directly
instead of going into a REP loop. The current __clear_user() is still a
lot better than that copy_user_generic_unrolled() abomination. And it's
not like old CPUs would get any perf penalty - they'll simply use the
same code.

And then you need the labels for _ASM_EXTABLE_UA() exception handling.

Anyway, something along those lines.
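
In plain C, the split being described (a bulk 8-byte store loop plus a
"handle_tail" byte loop for the remainder and for tiny sizes) looks roughly
like the sketch below; illustrative only, with a made-up name, since the
real variants are asm with exception-table entries around every user access:

	/* Illustrative sketch only; not the proposed __clear_user(). */
	static unsigned long zero_range_sketch(void *addr, unsigned long size)
	{
		unsigned long *q = addr;
		unsigned char *b;
		unsigned long nq = size / 8, nb = size % 8, i;

		for (i = 0; i < nq; i++)	/* the "rep stosq" part */
			q[i] = 0;

		b = (unsigned char *)(q + nq);
		for (i = 0; i < nb; i++)	/* the "handle_tail" part */
			b[i] = 0;

		return 0;			/* bytes left unzeroed */
	}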

And then we'll need to benchmark this on a bunch of current machines to
make sure there's no funny surprises, perf-wise.

I can get cracking on this but I would advise people not to hold their
breaths. :)

Unless someone has a better idea or is itching to get hands dirty
her-/himself.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-16  6:36     ` Borislav Petkov
@ 2022-04-16 14:07       ` Mark Hemment
  2022-04-16 17:28         ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Mark Hemment @ 2022-04-16 14:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, markhemm,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Chuck Lever,
	Hugh Dickins, patches, Linux-MM, mm-commits


On Sat, 16 Apr 2022, Borislav Petkov wrote:

> On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> > Adding PeterZ and Borislav (who seem to be the last ones to have
> > worked on the copy and clear_page stuff respectively) and the x86
> > maintainers in case somebody gets the urge to just fix this.
> 
> I guess if enough people ask and keep asking, some people at least try
> to move...
> 
> > Because memory clearing should be faster than copying, and the thing
> > that makes copying fast is that FSRM and ERMS logic (the whole
> > "manually unrolled copy" is hopefully mostly a thing of the past and
> > we can consider it legacy)
> 
> So I did give it a look and it seems to me, if we want to do the
> alternatives thing here, it will have to look something like
> arch/x86/lib/copy_user_64.S.
> 
> I.e., the current __clear_user() will have to become the "handle_tail"
> thing there which deals with uncopied rest-bytes at the end and the new
> fsrm/erms/rep_good variants will then be alternative_call_2 or _3.
> 
> The fsrm thing will have only the handle_tail thing at the end when size
> != 0.
> 
> The others - erms and rep_good - will have to check for sizes smaller
> than, say a cacheline, and for those call the handle_tail thing directly
> instead of going into a REP loop. The current __clear_user() is still a
> lot better than that copy_user_generic_unrolled() abomination. And it's
> not like old CPUs would get any perf penalty - they'll simply use the
> same code.
> 
> And then you need the labels for _ASM_EXTABLE_UA() exception handling.
> 
> Anyway, something along those lines.
> 
> And then we'll need to benchmark this on a bunch of current machines to
> make sure there's no funny surprises, perf-wise.
> 
> I can get cracking on this but I would advise people not to hold their
> breaths. :)
> 
> Unless someone has a better idea or is itching to get hands dirty
> her-/himself.

I've done a skeleton implementation of alternative __clear_user() based on 
CPU features.
It has three versions of __clear_user();
o __clear_user_original() - similar to the 'standard' __clear_user()
o __clear_user_rep_good() - using rep stos{qb} when CPU has 'rep_good'
o __clear_user_erms() - using 'rep stosb' when CPU has 'erms'

Not claiming the implementation is ideal, but might be a useful starting 
point for someone.
Patch is against 5.18.0-rc2.
Only basic sanity testing done.

Simple performance testing done for large sizes, on a system (Intel E8400) 
which has rep_good but not erms;
# dd if=/dev/zero of=/dev/null bs=16384 count=10000
o *_original() - ~14.2GB/s.  Same as the 'standard' __clear_user().
o *_rep_good() - same throughput as *_original().
o *_erms()     - ~12.2GB/s (expected on a system without erms).

No performance testing done for zeroing small sizes.

Cheers,
Mark

Signed-off-by: Mark Hemment <markhemm@googlemail.com>
---
 arch/x86/include/asm/asm.h        |  39 +++++++++++++++
 arch/x86/include/asm/uaccess_64.h |  36 ++++++++++++++
 arch/x86/lib/clear_page_64.S      | 100 ++++++++++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c        |  32 ------------
 4 files changed, 175 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fbcfec4dc4cc..373ed6be7a8d 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -132,6 +132,35 @@
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 
+# define UNDEFINE_EXTABLE_TYPE_REG \
+	.purgem extable_type_reg ;
+
+# define DEFINE_EXTABLE_TYPE_REG \
+	.macro extable_type_reg type:req reg:req ;			\
+	.set .Lfound, 0	;						\
+	.set .Lregnr, 0 ;						\
+	.irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,	\
+	     r14,r15 ;							\
+	.ifc \reg, %\rs ;						\
+	.set .Lfound, .Lfound+1 ;					\
+	.long \type + (.Lregnr << 8) ;					\
+	.endif ;							\
+	.set .Lregnr, .Lregnr+1 ;					\
+	.endr ;								\
+	.set .Lregnr, 0 ;						\
+	.irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d, \
+	     r13d,r14d,r15d ;						\
+	.ifc \reg, %\rs ;						\
+	.set .Lfound, .Lfound+1 ;					\
+	.long \type + (.Lregnr << 8) ;					\
+	.endif ;							\
+	.set .Lregnr, .Lregnr+1 ;					\
+	.endr ;								\
+	.if (.Lfound != 1) ;						\
+	.error "extable_type_reg: bad register argument" ;		\
+	.endif ;							\
+	.endm ;
+
 # define _ASM_EXTABLE_TYPE(from, to, type)			\
 	.pushsection "__ex_table","a" ;				\
 	.balign 4 ;						\
@@ -140,6 +169,16 @@
 	.long type ;						\
 	.popsection
 
+# define _ASM_EXTABLE_TYPE_REG(from, to, type1, reg1)		\
+	.pushsection "__ex_table","a" ;				\
+	.balign 4 ;						\
+	.long (from) - . ;					\
+	.long (to) - . ;					\
+	DEFINE_EXTABLE_TYPE_REG					\
+	extable_type_reg reg=reg1, type=type1 ;			\
+	UNDEFINE_EXTABLE_TYPE_REG				\
+	.popsection
+
 # ifdef CONFIG_KPROBES
 #  define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..6a4995e4cfae 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,40 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long
+___clear_user(void __user *addr, unsigned long len)
+{
+	unsigned long	ret;
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+
+	might_fault();
+	alternative_call_2(
+		clear_user_original,
+		clear_user_rep_good,
+		X86_FEATURE_REP_GOOD,
+		clear_user_erms,
+		X86_FEATURE_ERMS,
+		ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
+		"1" (addr), "2" (len)
+		: "%rdx", "cc");
+	return ret;
+}
+
+#define __clear_user(d, n)	___clear_user(d, n)
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..abe1f44ea422 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
+#include <asm/smap.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +52,101 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_original)
+	ASM_STAC
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andq $7,%rax
+	testq %rcx,%rcx
+	jz 1f
+
+	.p2align 4
+0:	movq $0,(%rdi)
+	leaq 8(%rdi),%rdi
+	decq %rcx
+	jnz   0b
+
+1:	movq %rax,%rcx
+	testq %rcx,%rcx
+	jz 3f
+
+2:	movb $0,(%rdi)
+	incq %rdi
+	decl %ecx
+	jnz  2b
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
+	_ASM_EXTABLE_UA(2b, 3b)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_rep_good)
+	ASM_STAC
+	movq %rcx,%rdx
+	xorq %rax,%rax
+	shrq $3,%rcx
+	andq $7,%rdx
+
+0:	rep stosq
+	movq %rdx,%rcx
+
+1:	rep stosb
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rdx)
+	_ASM_EXTABLE_UA(1b, 3b)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_erms)
+	xorq %rax,%rax
+	ASM_STAC
+
+0:	rep stosb
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_UA(0b, 3b)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0402a749f3a0..3a2872c9c4a9 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,38 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
 unsigned long clear_user(void __user *to, unsigned long n)
 {
 	if (access_ok(to, n))

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-16 14:07       ` Mark Hemment
@ 2022-04-16 17:28         ` Borislav Petkov
  2022-04-16 17:42           ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-16 17:28 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sat, Apr 16, 2022 at 03:07:47PM +0100, Mark Hemment wrote:
> I've done a skeleton implementation of alternative __clear_user() based on 
> CPU features.

Cool!

Just a couple of quick notes - more indepth look next week.

> It has three versions of __clear_user();
> o __clear_user_original() - similar to the 'standard' __clear_user()
> o __clear_user_rep_good() - using rep stos{qb} when CPU has 'rep_good'
> o __clear_user_erms() - using 'rep stosb' when CPU has 'erms'

you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
should simply do rep; stosb regardless of the size. For that you can
define an alternative_call_3 similar to how the _2 variant is defined.

> Not claiming the implementation is ideal, but might be a useful starting
> point for someone.
> Patch is against 5.18.0-rc2.
> Only basic sanity testing done.
> 
> Simple performance testing done for large sizes, on a system (Intel E8400) 
> which has rep_good but not erms;
> # dd if=/dev/zero of=/dev/null bs=16384 count=10000
> o *_original() - ~14.2GB/s.  Same as the 'standard' __clear_user().
> o *_rep_good() - same throughput as *_original().
> o *_erms()     - ~12.2GB/s (expected on a system without erms).

Right.

I have a couple of boxes too - I can run the benchmarks on them too so
we should have enough perf data points eventually.

> diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
> index fbcfec4dc4cc..373ed6be7a8d 100644
> --- a/arch/x86/include/asm/asm.h
> +++ b/arch/x86/include/asm/asm.h
> @@ -132,6 +132,35 @@
>  /* Exception table entry */
>  #ifdef __ASSEMBLY__
>  
> +# define UNDEFINE_EXTABLE_TYPE_REG \
> +	.purgem extable_type_reg ;
> +
> +# define DEFINE_EXTABLE_TYPE_REG \
> +	.macro extable_type_reg type:req reg:req ;			\

I love that macro PeterZ!</sarcasm>

> +	.set .Lfound, 0	;						\
> +	.set .Lregnr, 0 ;						\
> +	.irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,	\
> +	     r14,r15 ;							\
> +	.ifc \reg, %\rs ;						\
> +	.set .Lfound, .Lfound+1 ;					\
> +	.long \type + (.Lregnr << 8) ;					\
> +	.endif ;							\
> +	.set .Lregnr, .Lregnr+1 ;					\
> +	.endr ;								\
> +	.set .Lregnr, 0 ;						\
> +	.irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d, \
> +	     r13d,r14d,r15d ;						\
> +	.ifc \reg, %\rs ;						\
> +	.set .Lfound, .Lfound+1 ;					\
> +	.long \type + (.Lregnr << 8) ;					\
> +	.endif ;							\
> +	.set .Lregnr, .Lregnr+1 ;					\
> +	.endr ;								\
> +	.if (.Lfound != 1) ;						\
> +	.error "extable_type_reg: bad register argument" ;		\
> +	.endif ;							\
> +	.endm ;
> +
>  # define _ASM_EXTABLE_TYPE(from, to, type)			\
>  	.pushsection "__ex_table","a" ;				\
>  	.balign 4 ;						\
> @@ -140,6 +169,16 @@
>  	.long type ;						\
>  	.popsection
>  
> +# define _ASM_EXTABLE_TYPE_REG(from, to, type1, reg1)		\
> +	.pushsection "__ex_table","a" ;				\
> +	.balign 4 ;						\
> +	.long (from) - . ;					\
> +	.long (to) - . ;					\
> +	DEFINE_EXTABLE_TYPE_REG					\
> +	extable_type_reg reg=reg1, type=type1 ;			\
> +	UNDEFINE_EXTABLE_TYPE_REG				\
> +	.popsection
> +
>  # ifdef CONFIG_KPROBES
>  #  define _ASM_NOKPROBE(entry)					\
>  	.pushsection "_kprobe_blacklist","aw" ;			\
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 45697e04d771..6a4995e4cfae 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -79,4 +79,40 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
>  	kasan_check_write(dst, size);
>  	return __copy_user_flushcache(dst, src, size);
>  }
> +
> +/*
> + * Zero Userspace.
> + */
> +
> +__must_check unsigned long
> +clear_user_original(void __user *addr, unsigned long len);
> +__must_check unsigned long
> +clear_user_rep_good(void __user *addr, unsigned long len);
> +__must_check unsigned long
> +clear_user_erms(void __user *addr, unsigned long len);
> +
> +static __always_inline __must_check unsigned long
> +___clear_user(void __user *addr, unsigned long len)
> +{
> +	unsigned long	ret;
> +
> +	/*
> +	 * No memory constraint because it doesn't change any memory gcc
> +	 * knows about.
> +	 */
> +
> +	might_fault();

I think you could do the stac(); clac() sandwich around the
alternative_call_2 here and remove the respective calls from the asm.

> +	alternative_call_2(
> +		clear_user_original,
> +		clear_user_rep_good,
> +		X86_FEATURE_REP_GOOD,
> +		clear_user_erms,
> +		X86_FEATURE_ERMS,
> +		ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),

You can do here:

                           : "+&c" (size), "+&D" (addr)
                           :: "eax");

because size and addr need to be earlyclobbers:

  e0a96129db57 ("x86: use early clobbers in usercopy*.c")

and then simply return size;

The asm modifies it anyway - no need for ret.
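
As a toy, kernel-independent illustration of that constraint style
(read-write earlyclobber operands in the fixed registers, with the modified
length returned directly; this is not the proposed __clear_user()):

	/* Toy example: zero n 8-byte words at dst, return words not stored. */
	static inline unsigned long zero_words(unsigned long *dst, unsigned long n)
	{
		asm volatile("rep stosq"
			     : "+&c" (n), "+&D" (dst)	/* both modified by the insn */
			     : "a" (0UL)
			     : "memory");
		return n;				/* 0 on completion */
	}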

In any case, thanks for doing that!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-16 17:28         ` Borislav Petkov
@ 2022-04-16 17:42           ` Linus Torvalds
  2022-04-16 21:15             ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-16 17:42 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sat, Apr 16, 2022 at 10:28 AM Borislav Petkov <bp@alien8.de> wrote:
>
> you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
> should simply do rep; stosb regardless of the size. For that you can
> define an alternative_call_3 similar to how the _2 variant is defined.

Honestly, my personal preference would be that with FSRM, we'd have an
alternative that looks something like

    asm volatile(
        "1:"
        ALTERNATIVE("call __stosb_user", "rep movsb", X86_FEATURE_FSRM)
        "2:"
       _ASM_EXTABLE_UA(1b, 2b)
        :"=c" (count), "=D" (dest),ASM_CALL_CONSTRAINT
        :"0" (count), "1" (dest), "a" (0)
        :"memory");

iow, the 'rep stosb' case would be inline.

Note that the above would have a few things to look out for:

 - special 'stosb' calling convention:

     %rax/%rcx/%rdx as inputs
     %rcx as "bytes not copied" return value
     %rdi can be clobbered

   so the actual functions would look a bit odd and would need to
save/restore some registers, but they'd basically just emulate "rep
stosb".

 - since the whole point is that the "rep stosb" is inlined, it also
means that the "call __stosb_user" is done within the STAC/CLAC
region, so objdump would have to be taught that's ok

but wouldn't it be lovely if we could start moving towards a model
where we can just inline 'memset' and 'memcpy' like this?

NOTE! The above asm has not been tested. I wrote it in this email. I'm
sure I messed something up.

               Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-16 17:42           ` Linus Torvalds
@ 2022-04-16 21:15             ` Borislav Petkov
  2022-04-17 19:41               ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-16 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sat, Apr 16, 2022 at 10:42:22AM -0700, Linus Torvalds wrote:
> On Sat, Apr 16, 2022 at 10:28 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
> > should simply do rep; stosb regardless of the size. For that you can
> > define an alternative_call_3 similar to how the _2 variant is defined.
> 
> Honestly, my personal preference would be that with FSRM, we'd have an
> alternative that looks something like
> 
>     asm volatile(
>         "1:"
>         ALTERNATIVE("call __stosb_user", "rep movsb", X86_FEATURE_FSRM)
>         "2:"
>        _ASM_EXTABLE_UA(1b, 2b)
>         :"=c" (count), "=D" (dest),ASM_CALL_CONSTRAINT
>         :"0" (count), "1" (dest), "a" (0)
>         :"memory");
> 
> iow, the 'rep stosb' case would be inline.

I knew you were gonna say that - we have talked about this in the past.
And I'll do you one better -- we have the patch-if-bit-not-set thing now
too, so I think it should work if we did:


       alternative_call_3(__clear_user_fsrm,
                          __clear_user_erms,   ALT_NOT(X86_FEATURE_FSRM),
                          __clear_user_string, ALT_NOT(X86_FEATURE_ERMS),
			  __clear_user_orig,   ALT_NOT(X86_FEATURE_REP_GOOD),
                          : "+&c" (size), "+&D" (addr)
                          :: "eax");

and yeah, you wanna get rid of the CALL even and I guess that could
be made to work - I just need to play with it a bit to hammer out the
details.

I.e., it would be most optimal if it ended up being

	ALTERNATIVE_3("rep stosb",
		      "call ... ", ALT_NOT(X86_FEATURE_FSRM),
		      ...


> Note that the above would have a few things to look out for:
> 
>  - special 'stosb' calling convention:
> 
>      %rax/%rcx/%rdx as inputs
>      %rcx as "bytes not copied" return value
>      %rdi can be clobbered
> 
>    so the actual functions would look a bit odd and would need to
> save/restore some registers, but they'd basically just emulate "rep
> stosb".

Right.

>  - since the whole point is that the "rep movsb" is inlined, it also
> means that the "call __stosb_user" is done within the STAC/CLAC
> region, so objdump would have to be taught that's ok
> 
> but wouldn't it be lovely if we could start moving towards a model
> where we can just inline 'memset' and 'memcpy' like this?

Yeah, inlined insns without even a CALL insn would be the most optimal
thing to do.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-16 21:15             ` Borislav Petkov
@ 2022-04-17 19:41               ` Borislav Petkov
  2022-04-17 20:56                 ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-17 19:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sat, Apr 16, 2022 at 11:15:16PM +0200, Borislav Petkov wrote:
> > but wouldn't it be lovely if we could start moving towards a model
> > where we can just inline 'memset' and 'memcpy' like this?
> 
> Yeah, inlined insns without even a CALL insn would be the most optimal
> thing to do.

Here's a diff ontop of Mark's patch, it inlines clear_user() everywhere
and it seems to patch properly

vmlinux has:

ffffffff812ba53e:       0f 87 f8 fd ff ff       ja     ffffffff812ba33c <load_elf_binary+0x60c>
ffffffff812ba544:       90                      nop
ffffffff812ba545:       90                      nop
ffffffff812ba546:       90                      nop
ffffffff812ba547:       4c 89 e7                mov    %r12,%rdi
ffffffff812ba54a:       f3 aa                   rep stos %al,%es:(%rdi)
ffffffff812ba54c:       90                      nop
ffffffff812ba54d:       90                      nop
ffffffff812ba54e:       90                      nop
ffffffff812ba54f:       90                      nop
ffffffff812ba550:       90                      nop
ffffffff812ba551:       90                      nop

which is the rep; stosb and when I boot my guest, it has been patched to:

0xffffffff812ba544 <+2068>:  0f 01 cb        stac   
0xffffffff812ba547 <+2071>:  4c 89 e7        mov    %r12,%rdi
0xffffffff812ba54a <+2074>:  e8 71 9a 15 00  call   0xffffffff81413fc0 <clear_user_rep_good>
0xffffffff812ba54f <+2079>:  0f 01 ca        clac

(stac and clac are also alternative-patched). And there's the call to
rep_good because the guest doesn't advertise FSRM or ERMS.

Anyway, more playing with this later to make sure it really does what it
should.

---

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index f78e2b3501a1..335d571e8a79 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -405,9 +405,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
 
 extern __must_check long strnlen_user(const char __user *str, long n);
 
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -429,6 +426,8 @@ extern struct movsl_mask {
 #define ARCH_HAS_NOCACHE_UACCESS 1
 
 #ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 # include <asm/uaccess_32.h>
 #else
 # include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 6a4995e4cfae..dc0f2bfc80a4 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -91,28 +91,34 @@ clear_user_rep_good(void __user *addr, unsigned long len);
 __must_check unsigned long
 clear_user_erms(void __user *addr, unsigned long len);
 
-static __always_inline __must_check unsigned long
-___clear_user(void __user *addr, unsigned long len)
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
 {
-	unsigned long	ret;
-
 	/*
 	 * No memory constraint because it doesn't change any memory gcc
 	 * knows about.
 	 */
 
 	might_fault();
-	alternative_call_2(
-		clear_user_original,
-		clear_user_rep_good,
-		X86_FEATURE_REP_GOOD,
-		clear_user_erms,
-		X86_FEATURE_ERMS,
-		ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
-		"1" (addr), "2" (len)
-		: "%rdx", "cc");
-	return ret;
+	stac();
+	asm_inline volatile(
+		"1:"
+		ALTERNATIVE_3("rep stosb",
+			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
+			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+		"2:"
+	       _ASM_EXTABLE_UA(1b, 2b)
+	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+	       : "a" (0));
+
+	clac();
+	return size;
 }
 
-#define __clear_user(d, n)	___clear_user(d, n)
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+	if (access_ok(to, n))
+		return __clear_user(to, n);
+	return n;
+}
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index abe1f44ea422..b80f710a7f30 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -62,9 +62,7 @@ EXPORT_SYMBOL_GPL(clear_page_erms)
  * Output:
  * rax uncopied bytes or 0 if successful.
  */
-
 SYM_FUNC_START(clear_user_original)
-	ASM_STAC
 	movq %rcx,%rax
 	shrq $3,%rcx
 	andq $7,%rax
@@ -86,7 +84,7 @@ SYM_FUNC_START(clear_user_original)
 	decl %ecx
 	jnz  2b
 
-3:	ASM_CLAC
+3:
 	movq %rcx,%rax
 	RET
 
@@ -105,11 +103,8 @@ EXPORT_SYMBOL(clear_user_original)
  * Output:
  * rax uncopied bytes or 0 if successful.
  */
-
 SYM_FUNC_START(clear_user_rep_good)
-	ASM_STAC
 	movq %rcx,%rdx
-	xorq %rax,%rax
 	shrq $3,%rcx
 	andq $7,%rdx
 
@@ -118,7 +113,7 @@ SYM_FUNC_START(clear_user_rep_good)
 
 1:	rep stosb
 
-3:	ASM_CLAC
+3:
 	movq %rcx,%rax
 	RET
 
@@ -135,15 +130,13 @@ EXPORT_SYMBOL(clear_user_rep_good)
  *
  * Output:
  * rax uncopied bytes or 0 if successful.
+ *
+ * XXX: check for small sizes and call the original version.
+ * Benchmark it first though.
  */
-
 SYM_FUNC_START(clear_user_erms)
-	xorq %rax,%rax
-	ASM_STAC
-
 0:	rep stosb
-
-3:	ASM_CLAC
+3:
 	movq %rcx,%rax
 	RET
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3a2872c9c4a9..8bf04d95dc04 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,14 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long clear_user(void __user *to, unsigned long n)
-{
-	if (access_ok(to, n))
-		return __clear_user(to, n);
-	return n;
-}
-EXPORT_SYMBOL(clear_user);
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index bd0c2c828940..ea48ee11521f 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1019,6 +1019,9 @@ static const char *uaccess_safe_builtin[] = {
 	"copy_mc_fragile_handle_tail",
 	"copy_mc_enhanced_fast_string",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+	"clear_user_erms",
+	"clear_user_rep_good",
+	"clear_user_original",
 	NULL
 };
 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-17 19:41               ` Borislav Petkov
@ 2022-04-17 20:56                 ` Linus Torvalds
  2022-04-18 10:15                   ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-17 20:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sun, Apr 17, 2022 at 12:41 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Anyway, more playing with this later to make sure it really does what it
> should.

I think the special calling conventions have tripped you up:

>  SYM_FUNC_START(clear_user_original)
> -       ASM_STAC
>         movq %rcx,%rax
>         shrq $3,%rcx
>         andq $7,%rax
> @@ -86,7 +84,7 @@ SYM_FUNC_START(clear_user_original)
>         decl %ecx
>         jnz  2b
>
> -3:     ASM_CLAC
> +3:
>         movq %rcx,%rax
>         RET

That 'movq %rcx,%rax' can't be right. The caller expects it to be zero
on input and stay zero on output.

But I think "xorl %eax,%eax" is good, since %eax was used as a
temporary in that function.

And the comment above the function should be fixed too.

>  SYM_FUNC_START(clear_user_rep_good)
> -       ASM_STAC
>         movq %rcx,%rdx
> -       xorq %rax,%rax
>         shrq $3,%rcx
>         andq $7,%rdx
>
> @@ -118,7 +113,7 @@ SYM_FUNC_START(clear_user_rep_good)
>
>  1:     rep stosb
>
> -3:     ASM_CLAC
> +3:
>         movq %rcx,%rax
>         RET

Same issue here.

Probably nothing notices, since %rcx *does* end up containing the
right value, and it's _likely_ that the compiler doesn't actually end
up re-using the zero value in %rax after the inline asm (that this bug
has corrupted), but if the compiler ever goes "Oh, I put zero in %rax,
so I'll just use that afterwards", this is going to blow up very
spectacularly and be very hard to debug.

> @@ -135,15 +130,13 @@ EXPORT_SYMBOL(clear_user_rep_good)
>   *
>   * Output:
>   * rax uncopied bytes or 0 if successful.
> + *
> + * XXX: check for small sizes and call the original version.
> + * Benchmark it first though.
>   */
> -
>  SYM_FUNC_START(clear_user_erms)
> -       xorq %rax,%rax
> -       ASM_STAC
> -
>  0:     rep stosb
> -
> -3:     ASM_CLAC
> +3:
>         movq %rcx,%rax
>         RET

.. and one more time.

Also, I do think that the rep_good and erms cases should probably
check for small copes, and use the clear_user_original thing for %rcx
< 64 or something like that.

It's what we do on the memcpy side - and more importantly, it's the
only difference between "erms" and FSRM. If the ERMS code doesn't do
anything different for small copies, why have it at all?

But other than these small issues, it looks good to me.

                  Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-17 20:56                 ` Linus Torvalds
@ 2022-04-18 10:15                   ` Borislav Petkov
  2022-04-18 17:10                     ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-18 10:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sun, Apr 17, 2022 at 01:56:25PM -0700, Linus Torvalds wrote:
> That 'movq %rcx,%rax' can't be right. The caller expects it to be zero
> on input and stay zero on output.
> 
> But I think "xorl %eax,%eax" is good, since %eax was used as a
> temporary in that function.

Yah, wanted to singlestep that whole asm anyway to make sure it is good.
And just started going through it - I think it can even be optimized a
bit to use %rax for the rest bytes and decrement it into 0 eventually.

The "xorl %eax,%eax" is still there, though, in case we fault on the
user access and so that we can clear it to the compiler's expectation.

I've added comments too so that it is clear what happens at a quick glance.

SYM_FUNC_START(clear_user_original)
        mov %rcx,%rax
        shr $3,%rcx             # qwords
        and $7,%rax             # rest bytes
        test %rcx,%rcx
        jz 1f

        # do the qwords first
        .p2align 4
0:      movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz   0b

1:      test %rax,%rax
        jz 3f

        # now do the rest bytes
2:      movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz  2b

3:
        xorl %eax,%eax
        RET

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-18 10:15                   ` Borislav Petkov
@ 2022-04-18 17:10                     ` Linus Torvalds
  2022-04-19  9:17                       ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-18 17:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Mon, Apr 18, 2022 at 3:15 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yah, wanted to singlestep that whole asm anyway to make sure it is good.
> And just started going through it - I think it can be even optimized a
> bit to use %rax for the rest bytes and decrement it into 0 eventually.

Ugh. If you do this, you need to have a big comment about how that
%rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
up doing "%rcx = %rcx*8+%rax" in ex_handler_ucopy_len() for the
exception case).

Plus you need to explain the xorl here:

> 3:
>         xorl %eax,%eax
>         RET

because with your re-written function it *looks* like %eax is already
zero, so - once again - this is about how the exception cases work and
get here magically.

So that needs some big comment about "if an exception happens, we jump
to label '3', and the exception handler will fix up %rcx, but we'll
need to clear %rax".

Or something like that.

But yes, that does look like it will work, it's just really subtle how
%rcx is zero for the 'rest bytes', and %rax is the byte count
remaining in addition.
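
To make that concrete with a made-up length: for len = 20 the prologue
leaves %rcx = 2 (qwords) and %rax = 4 (rest bytes); the qword loop
finishes with %rcx = 0, and if the byte loop then faults with 2 bytes
still to go, %rax = 2, so the fixup computes

	%rcx = 8*%rcx + %rax = 8*0 + 2 = 2

uncleared bytes for the caller, while the fixup target still has to
zero %rax for the compiler's sake.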

                  Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-18 17:10                     ` Linus Torvalds
@ 2022-04-19  9:17                       ` Borislav Petkov
  2022-04-19 16:41                         ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-19  9:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Mon, Apr 18, 2022 at 10:10:42AM -0700, Linus Torvalds wrote:
> Ugh. If you do this, you need to have a big comment about how that
> %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
> up doing "%rcx = %rcx*8+%rax" in ex_handler_ucopy_len() for the
> exception case).

Yap, and I reused your text and expanded it. You made me look at that
crazy DEFINE_EXTABLE_TYPE_REG macro finally so that I know what it does
in detail.

So I have the below now, it boots in the guest so it must be perfect.

---

/*
 * Default clear user-space.
 * Input:
 * rdi destination
 * rcx count
 *
 * Output:
 * rcx uncleared bytes or 0 if successful.
 */
SYM_FUNC_START(clear_user_original)
        mov %rcx,%rax
        shr $3,%rcx             # qwords
        and $7,%rax             # rest bytes
        test %rcx,%rcx
        jz 1f

        # do the qwords first
        .p2align 4
0:      movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz   0b

1:      test %rax,%rax
        jz 3f

        # now do the rest bytes
2:      movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz  2b

3:
        xorl %eax,%eax
        RET

        _ASM_EXTABLE_UA(0b, 3b)

        /*
         * The %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
         * up doing "%rcx = %rcx*8 + %rax" in ex_handler_ucopy_len() for the exception
         * case). That is, we use %rax above at label 2: for simpler asm but the number
         * of uncleared bytes will land in %rcx, as expected by the caller.
         *
         * %rax at label 3: still needs to be cleared in the exception case because this
         * is called from inline asm and the compiler expects %rax to be zero when exiting
         * the inline asm, in case it might reuse it somewhere.
         */
        _ASM_EXTABLE_TYPE_REG(2b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
SYM_FUNC_END(clear_user_original)
EXPORT_SYMBOL(clear_user_original)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-19  9:17                       ` Borislav Petkov
@ 2022-04-19 16:41                         ` Linus Torvalds
  2022-04-19 17:48                           ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-19 16:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Tue, Apr 19, 2022 at 2:17 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yap, and I reused your text and expanded it. You made me look at that
> crazy DEFINE_EXTABLE_TYPE_REG macro finally so that I know what it does
> in detail.
>
> So I have the below now, it boots in the guest so it must be perfect.

This looks fine to me.

Although honestly, I'd be even happier without those fancy exception
table tricks. I actually think things would be more legible if we had
explicit error return points that did the

err8:
        shrq $3,%rcx
        addq %rax,%rcx
err1:
        xorl %eax,%eax
        RET

things explicitly.

That's perhaps especially true since this whole thing now added a new
- and even more complex - error case with that _ASM_EXTABLE_TYPE_REG.

But I'm ok with the complex version too, I guess.

                   Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-19 16:41                         ` Linus Torvalds
@ 2022-04-19 17:48                           ` Borislav Petkov
  2022-04-21 15:06                             ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-19 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Tue, Apr 19, 2022 at 09:41:42AM -0700, Linus Torvalds wrote:
> But I'm ok with the complex version too, I guess.

I feel ya. I myself am, more often than not, averse to such complex
things if they can be solved in a simpler way, but we have those fancy
exception-handler tricks everywhere, and if we did it here differently
it would completely not fit the mental pattern and we would wonder why
this one is special...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-19 17:48                           ` Borislav Petkov
@ 2022-04-21 15:06                             ` Borislav Petkov
  2022-04-21 16:50                               ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-21 15:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

Some numbers with the microbenchmark:

$ for i in $(seq 1 100); do dd if=/dev/zero of=/dev/null bs=16K count=10000 2>&1 | grep GB; done > /tmp/w.log
$ awk 'BEGIN { sum = 0 } { sum +=$10 } END { print sum/100 }' /tmp/w.log

on AMD zen3

original:	20.11 Gb/s
rep_good:	34.662 Gb/s
erms:		36.378 Gb/s
fsrm:		36.398 Gb/s

I'll run some real benchmarks later but these numbers kinda speak for
themselves so it would be unlikely that it would show different perf
with a real, process-intensive benchmark...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-21 15:06                             ` Borislav Petkov
@ 2022-04-21 16:50                               ` Linus Torvalds
  2022-04-21 17:22                                 ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-21 16:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

[-- Attachment #1: Type: text/plain, Size: 1159 bytes --]

On Thu, Apr 21, 2022 at 8:06 AM Borislav Petkov <bp@alien8.de> wrote:
>
> on AMD zen3
>
> original:       20.11 Gb/s
> rep_good:       34.662 Gb/s
> erms:           36.378 Gb/s
> fsrm:           36.398 Gb/s

Looks good.

Of course, the interesting cases are the "took a page fault in the middle" ones.

A very simple basic test is something like the attached.

It does no error checking or anything else, but doing a 'strace
./a.out' should give you something like

  ...
  openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
  mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x7f10ddfd0000
  munmap(0x7f10ddfe0000, 65536)           = 0
  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
  exit_group(16)                          = ?

where that "read(..) = 16" is the important part. It correctly figured
out that it can only do 16 bytes (ok, 17, but we've always allowed the
user accessor functions to block).

With erms/fsrm, presumably you get that optimal "read(..) = 17".

I'm sure we have a test-case for this somewhere, but it was easier for
me to write a few lines of (bad) code than try to find it.

               Linus

[-- Attachment #2: t.c --]
[-- Type: text/x-csrc, Size: 397 bytes --]

#include <stddef.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

// Whatever
#define PAGE_SIZE 65536

int main(int argc, char **argv)
{
	int fd;
	void *map;

	fd = open("/dev/zero", O_RDONLY); 

	/* map three pages, then punch a hole where the middle one was */
	map = mmap(NULL, 3*PAGE_SIZE, PROT_READ | PROT_WRITE,
		MAP_PRIVATE |MAP_ANONYMOUS , -1, 0);
	munmap(map + PAGE_SIZE, PAGE_SIZE);

	/*
	 * Only 17 writable bytes before the hole: the return value shows
	 * how much the kernel managed to copy in before faulting.
	 */
	return read(fd, map + PAGE_SIZE - 17, PAGE_SIZE);
}

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-21 16:50                               ` Linus Torvalds
@ 2022-04-21 17:22                                 ` Linus Torvalds
  2022-04-24 19:37                                   ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-21 17:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Thu, Apr 21, 2022 at 9:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> where that "read(..) = 16" is the important part. It correctly figured
> out that it can only do 16 bytes (ok, 17, but we've always allowed the
> user accessor functions to block).

Bad choice of words - by "block" I meant "doing the accesses in
blocks", in this case 64-bit words.

Obviously the user accessors _also_ "block" in the sense of having to
wait for page faults and IO.

I think 'copy_{to,from}_user()' actually does go to the effort to try
to do byte-exact results, though.

In particular, see copy_user_handle_tail in arch/x86/lib/copy_user_64.S.

But I think that we long ago ended up deciding it really wasn't worth
doing it, and x86 ends up just going to unnecessary lengths for this
case.

               Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-21 17:22                                 ` Linus Torvalds
@ 2022-04-24 19:37                                   ` Borislav Petkov
  2022-04-24 19:54                                     ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-24 19:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Thu, Apr 21, 2022 at 10:22:32AM -0700, Linus Torvalds wrote:
> I think 'copy_{to,from}_user()' actually does go to the effort to try
> to do byte-exact results, though.

Yeah, we have had headaches with this byte-exact copying, wrt MCEs.

> In particular, see copy_user_handle_tail in arch/x86/lib/copy_user_64.S.
> 
> But I think that we long ago ended up deciding it really wasn't worth
> doing it, and x86 ends up just going to unnecessary lengths for this
> case.

You could give me some more details but AFAIU you mean that the
fallback to byte-sized reads is unnecessary and I can get rid of
copy_user_handle_tail? Because that would be a nice cleanup.

Anyway, I ran your short prog and it all looks like you predicted it:

fsrm:
----
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fafd78fe000
munmap(0x7fafd790e000, 65536)           = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17
exit_group(17)                          = ?
+++ exited with 17 +++

erms:
-----
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe0b5321000
munmap(0x7fe0b5331000, 65536)           = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17
exit_group(17)                          = ?
+++ exited with 17 +++

rep_good:
---------
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0b5f0c7000
munmap(0x7f0b5f0d7000, 65536)           = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
exit_group(16)                          = ?
+++ exited with 16 +++

original:
---------
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3ff61c6000
munmap(0x7f3ff61d6000, 65536)           = 0
read(3, strace: umoven: short read (17 < 33) @0x7f3ff61d5fef
0x7f3ff61d5fef, 65536)          = 3586
exit_group(3586)                        = ?
+++ exited with 2 +++

that "umoven: short read" is strace spitting out something about the
address space of the tracee being inaccessible. But the 17 bytes short
read is still there.

From strace sources:

/* legacy method of copying from tracee */
static int
umoven_peekdata(const int pid, kernel_ulong_t addr, unsigned int len,
		void *laddr)
{

	...

	switch (errno) {
			case EFAULT: case EIO: case EPERM:
				/* address space is inaccessible */
				if (nread) {
					perror_msg("umoven: short read (%u < %u) @0x%" PRI_klx,
						   nread, nread + len, addr - nread);
				}
				return -1;

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-24 19:37                                   ` Borislav Petkov
@ 2022-04-24 19:54                                     ` Linus Torvalds
  2022-04-24 20:24                                       ` Linus Torvalds
  2022-04-27  0:14                                       ` Borislav Petkov
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-04-24 19:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sun, Apr 24, 2022 at 12:37 PM Borislav Petkov <bp@alien8.de> wrote:
>
> You could give me some more details but AFAIU, you mean, that
> fallback to byte-sized reads is unnecessary and I can get rid of
> copy_user_handle_tail? Because that would be a nice cleanup.

Yeah, we already don't have that fallback in many other places.

For example: traditionally we implemented fixed-size small
copy_to/from_user() with get/put_user().

We don't do that any more, but see commit 4b842e4e25b1 ("x86: get rid
of small constant size cases in raw_copy_{to,from}_user()") and
realize how it historically never did the byte-loop fallback.

The "clear_user()" case is another example of something that was never
byte-exact.

And last time we discussed this, Al was looking at making it
byte-exact, and I'm pretty sure he noted that other architectures
already didn't always do it.

Let me go try to find it.

> Anyway, I ran your short prog and it all looks like you predicted it:

Well, this actually shows a bug.

> fsrm:
> ----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17

The above is good.

> erms:
> -----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17

This is good too: erms should do small copies with the expanded case,
but since it *thinks* it's a large copy, it will just use "rep stosb"
and be byte-exact.

> rep_good:
> ---------
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16

This is good: it starts off with "rep stosq", does two iterations, and
then fails on the third word, so it only succeeds for 16 bytes.

> original:
> ---------
> read(3, strace: umoven: short read (17 < 33) @0x7f3ff61d5fef
> 0x7f3ff61d5fef, 65536)          = 3586

This is broken.

And I'm *not* talking about the strace warning. The strace warning is
actually just a *result* of the kernel bug.

Look at that return value. It returns '3586'. That's just pure garbage.

So that 'original' routine simply returns the wrong value.

I suspect it's a %rax vs %rcx confusion again, but with your "patch on
top of earlier patch" I didn't go and sort it out.

                   Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-24 19:54                                     ` Linus Torvalds
@ 2022-04-24 20:24                                       ` Linus Torvalds
  2022-04-27  0:14                                       ` Borislav Petkov
  1 sibling, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-04-24 20:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sun, Apr 24, 2022 at 12:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And last time we discussed this, Al was looking at making it
> byte-exact, and I'm pretty sure he noted that other architectures
> already didn't always do it.
>
> Let me go try to find it.

Hmnm. I may have mis-remembered the details. The thread I was thinking
of was this:

   https://lore.kernel.org/all/20200719031733.GI2786714@ZenIV.linux.org.uk/

and while Al was arguing for not enforcing the exact byte count, he
still suggested that it must make *some* progress. But note the whole

 "There are two problems with that.  First of all, the promise was
  bogus - there are architectures where it is simply not true.  E.g. ppc
  (or alpha, or sparc, or...)  can have copy_from_user() called with source
  one word prior to an unmapped page, fetch that word, fail to fetch the next
  one and bugger off without doing any stores"

ie it's simply never been the case in general, and Al mentions ppc,
alpha and sparc as examples of architectures where it has not been
true.

(arm and arm64, btw, do seem to have the "try harder" byte copy loop
at the end, like x86 does).

And that's when I argued that we should just accept that the byte
exact behavior simply has never been reality, and we shouldn't even
try to make it be reality.

NOTE! We almost certainly do want to have some limit on how far off
we can be, though. I do *not* think we can just unroll the loop a ton,
and say "hey, we're doing copies in chunks of 16 words, so now we're
off by up to 128 bytes".

I'd suggest making it clear that being "off" by a word is fine,
because that's the natural thing for any architecture that needs to do
a "load low/high word" followed by "store aligned word" due to not
handling unaligned faults well (eg the whole traditional RISC thing).

And yes, I think it's actually somewhat detrimental to our test
coverage that x86 does the byte-exact thing, because it means that
*if* we have any code that depends on it, it will just happen to work
on x86, but then fail on architectures that don't get nearly the same
test coverage.

              Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-24 19:54                                     ` Linus Torvalds
  2022-04-24 20:24                                       ` Linus Torvalds
@ 2022-04-27  0:14                                       ` Borislav Petkov
  2022-04-27  1:29                                         ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-27  0:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Sun, Apr 24, 2022 at 12:54:57PM -0700, Linus Torvalds wrote:
> I suspect it's a %rax vs %rcx confusion again, but with your "patch on
> top of earlier patch" I didn't go and sort it out.

Finally had some quiet time to stare at this.

So when we enter the function, we shift %rcx to get the number of
qword-sized quantities to zero:

SYM_FUNC_START(clear_user_original)
	mov %rcx,%rax
	shr $3,%rcx		# qwords	<---

and we zero in qword quantities merrily:

        # do the qwords first
        .p2align 4
0:      movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz   0b 


but when we encounter the fault here, we return *%rcx* - not %rcx << 3
- the latter being the *bytes* leftover which we *actually* need to return
when we encounter the #PF.

So, we need to shift back when we fault during the qword-sized zeroing,
i.e., full function below, see label 3 there.

With that, strace looks good too:

openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7dc5000
munmap(0x7ffff7dd5000, 65536)           = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
exit_group(16)                          = ?
+++ exited with 16 +++

As to the byte-exact deal, I'll put it on my TODO to play with it later
and see how much asm we can shed from this simplification so thanks for
the pointers!

/* 
 * Default clear user-space.
 * Input:
 * rdi destination
 * rcx count
 *
 * Output:
 * rcx uncleared bytes or 0 if successful.
 */
SYM_FUNC_START(clear_user_original)
        mov %rcx,%rax
        shr $3,%rcx             # qwords
        and $7,%rax             # rest bytes
        test %rcx,%rcx
        jz 1f

        # do the qwords first
        .p2align 4
0:      movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz   0b

1:      test %rax,%rax
        jz 3f

        # now do the rest bytes
2:      movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz  2b

3:
        # convert qwords back into bytes to return to caller
        shl $3, %rcx

4:
        xorl %eax,%eax
        RET

        _ASM_EXTABLE_UA(0b, 3b)

        /*
         * The %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
         * up doing "%rcx = %rcx*8 + %rax" in ex_handler_ucopy_len() for the exception
         * case). That is, we use %rax above at label 2: for simpler asm but the number
         * of uncleared bytes will land in %rcx, as expected by the caller.
         *
         * %rax at label 3: still needs to be cleared in the exception case because this
         * is called from inline asm and the compiler expects %rax to be zero when exiting
         * the inline asm, in case it might reuse it somewhere.
         */
        _ASM_EXTABLE_TYPE_REG(2b, 4b, EX_TYPE_UCOPY_LEN8, %rax)


Btw, I'm wondering if using descriptive label names would make this function even more
understandable:

/*      
 * Default clear user-space.
 * Input:
 * rdi destination
 * rcx count
 *
 * Output:
 * rcx uncleared bytes or 0 if successful.
 */
SYM_FUNC_START(clear_user_original)
        mov %rcx,%rax
        shr $3,%rcx             # qwords
        and $7,%rax             # rest bytes
        test %rcx,%rcx
        jz .Lrest_bytes

        # do the qwords first
        .p2align 4

.Lqwords:
        movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz .Lqwords
        
.Lrest_bytes:
        test %rax,%rax
        jz .Lexit
        
        # now do the rest bytes
.Lbytes:
        movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz .Lbytes
        
.Lqwords_exit:
        # convert qwords back into bytes to return to caller
        shl $3, %rcx

.Lexit:
        xorl %eax,%eax
        RET
        
        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exit)
      
        /*
         * The %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
         * up doing "%rcx = %rcx*8 + %rax" in ex_handler_ucopy_len() for the exception
         * case). That is, we use %rax above in .Lbytes for simpler asm but the number
         * of uncleared bytes will land in %rcx, as expected by the caller.
         *
         * %rax at .Lexit still needs to be cleared in the exception case because this
         * is called from inline asm and the compiler expects %rax to be zero when exiting
         * the inline asm, in case it might reuse it somewhere.
         */
        _ASM_EXTABLE_TYPE_REG(.Lbytes, .Lexit, EX_TYPE_UCOPY_LEN8, %rax)
SYM_FUNC_END(clear_user_original) 
EXPORT_SYMBOL(clear_user_original)


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-27  0:14                                       ` Borislav Petkov
@ 2022-04-27  1:29                                         ` Linus Torvalds
  2022-04-27 10:41                                           ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-27  1:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Tue, Apr 26, 2022 at 5:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> So when we enter the function, we shift %rcx to get the number of
> qword-sized quantities to zero:
>
> SYM_FUNC_START(clear_user_original)
>         mov %rcx,%rax
>         shr $3,%rcx             # qwords        <---

Yes.

But that's what we do for "rep stosq" too, for all the same reasons.

> but when we encounter the fault here, we return *%rcx* - not %rcx << 3
> - latter being the *bytes* leftover which we *actually* need to return
> when we encounter the #PF.

Yes, but:

> So, we need to shift back when we fault during the qword-sized zeroing,
> i.e., full function below, see label 3 there.

No.

The problem is that you're using the wrong exception type.

Thanks for posting the whole thing, because that makes it much more obvious.

You have the exception table entries switched.

You should have

         _ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
        _ASM_EXTABLE_UA(2b, 3b)

and not need that label '4' at all.

Note how that "_ASM_EXTABLE_TYPE_REG" thing is literally designed to do

   %rcx = 8*%rcx+%rax

in the exception handler.

Of course, to get that regular

        _ASM_EXTABLE_UA(2b, 3b)

to work, you need to have the final byte count in %rcx, not in %rax so
that means that the "now do the rest of the bytes" case should have
done something like

        movl %eax,%ecx
2:      movb $0,(%rdi)
        inc %rdi
        decl %ecx
        jnz  2b

instead.

Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for
the second exception point, and keep %rcx as zero, and keep it in
%eax, and depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, and
knowing that if an exception does *not* happen, %rcx will be zero from
the word-size loop.

But that really seems much too subtle for me - why not just keep
things in %rcx, and try to make this look as much as possible like the
"rep stosq + rep stosb" case?

And finally: I still think that those fancy exception table things are
*much* too fancy, and much too subtle, and much too complicated.

So I'd actually prefer to get rid of them entirely, and make the code
use the "no register changes" exception, and make the exception
handler do a proper site-specific fixup. At that point, you can get
rid of all the "mask bits early" logic, get rid of all the extraneous
'test' instructions, and make it all look something like below.

NOTE! I've intentionally kept the %eax thing using 32-bit instructions
- smaller encoding, and only the low three bits matter, so why
move/mask full 64 bits?

NOTE2! Entirely untested. But I tried to make the normal code do
minimal work, and then fix things up in the exception case more. So it
just keeps the original count in the 32 bits in %eax until it wants to
test it, and then uses the 'andl' to both mask and test. And the
exception case knows that, so it masks there too.

I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and
EX_TYPE_UCOPY_LEN8 was unnecessary.

               Linus

SYM_FUNC_START(clear_user_original)
        movl %ecx,%eax
        shrq $3,%rcx             # qwords
        jz .Lrest_bytes

        # do the qwords first
        .p2align 4
.Lqwords:
        movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz .Lqwords

.Lrest_bytes:
        andl $7,%eax             # rest bytes
        jz .Lexit

        # now do the rest bytes
.Lbytes:
        movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz .Lbytes

.Lexit:
        xorl %eax,%eax
        RET

.Lqwords_exception:
        # convert qwords back into bytes to return to caller
        shlq $3, %rcx
        andl $7, %eax
        addq %rax,%rcx
        jmp .Lexit

.Lbytes_exception:
        movl %eax,%ecx
        jmp .Lexit

        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
        _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-27  1:29                                         ` Linus Torvalds
@ 2022-04-27 10:41                                           ` Borislav Petkov
  2022-04-27 16:00                                             ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-27 10:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Tue, Apr 26, 2022 at 06:29:16PM -0700, Linus Torvalds wrote:
> Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for
> the second exception point, and keep %rcx as zero, and keep it in
> %eax, and depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, and
> knowing that if an exception does *not* happen, %rcx will be zero from
> the word-size loop.

Exactly!

> But that really seems much too subtle for me - why not just keep
> things in %rcx, and try to make this look as much as possible like the
> "rep stosq + rep stosb" case?
> 
> And finally: I still think that those fancy exception table things are
> *much* too fancy, and much too subtle, and much too complicated.

This whole confusion goes to show that this really is too subtle. ;-\

I mean, it is probably fine for some simple asm where you want to have
the register fixup out of the way. But for something like that where
I wanted the asm to be optimal but not tricky at the same time, that
tricky "hidden fixup" might turn out to be not such a good idea.

And sure, yeah, we can make this work this way now and all but after
time passes, swapping all that tricky stuff back in is going to be a
pain. That's why I'm leaving breadcrumbs everywhere I can.

> So I'd actually prefer to get rid of them entirely, and make the code
> use the "no register changes" exception, and make the exception
> handler do a proper site-specific fixup. At that point, you can get
> rid of all the "mask bits early" logic, get rid of all the extraneous
> 'test' instructions, and make it all look something like below.

Yap, and the named labels make it even clearer.

> NOTE! I've intentionally kept the %eax thing using 32-bit instructions
> - smaller encoding, and only the low three bits matter, so why
> move/mask full 64 bits?

Some notes below.

> NOTE2! Entirely untested.

I'll poke at it.

> But I tried to make the normal code do minimal work, and then fix
> things up in the exception case more.

Yap.

> So it just keeps the original count in the 32 bits in %eax until it
> wants to test it, and then uses the 'andl' to both mask and test. And
> the exception case knows that, so it masks there too.

Sure.

> I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and
> EX_TYPE_UCOPY_LEN8 was unnecessary.

I've pointed this out to Mr. Z. He likes them as complex as possible.
:-P

> SYM_FUNC_START(clear_user_original)
>         movl %ecx,%eax

I'll add a comment here along the lines of us only needing the 32-bit
quantity of the size as we're dealing with the rest bytes and so we save
a REX byte in the opcode.

>         shrq $3,%rcx             # qwords

Any particular reason for the explicit "q" suffix? The register already
denotes the opsize and the generated opcode is the same.
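
FWIW, both spellings should assemble to the same encoding, e.g.:

	48 c1 e9 03	shr    $0x3,%rcx	# "shr  $3,%rcx"
	48 c1 e9 03	shr    $0x3,%rcx	# "shrq $3,%rcx"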

>         jz .Lrest_bytes
> 
>         # do the qwords first
>         .p2align 4
> .Lqwords:
>         movq $0,(%rdi)
>         lea 8(%rdi),%rdi
>         dec %rcx

Like here, for example.

I guess you wanna be explicit as to the operation width...

>         jnz .Lqwords
> 
> .Lrest_bytes:
>         andl $7,%eax             # rest bytes
>         jz .Lexit
> 
>         # now do the rest bytes
> .Lbytes:
>         movb $0,(%rdi)
>         inc %rdi
>         decl %eax
>         jnz .Lbytes
> 
> .Lexit:
>         xorl %eax,%eax
>         RET
> 
> .Lqwords_exception:
>         # convert qwords back into bytes to return to caller
>         shlq $3, %rcx
>         andl $7, %eax
>         addq %rax,%rcx
>         jmp .Lexit
> 
> .Lbytes_exception:
>         movl %eax,%ecx
>         jmp .Lexit
> 
>         _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
>         _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)

Yap, I definitely like this one more. Straightforward.

Thx!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-27 10:41                                           ` Borislav Petkov
@ 2022-04-27 16:00                                             ` Linus Torvalds
  2022-05-04 18:56                                               ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-27 16:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Wed, Apr 27, 2022 at 3:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Any particular reason for the explicit "q" suffix? The register already
> denotes the opsize and the generated opcode is the same.

No, I just added the 'q' suffix to distinguish it from the 'l'
instruction working on the same register one line earlier.

And then I wasn't consistent throughout, because it was really just
about me thinking about that "here we can use 32-bit ops, and here we
are working with the full possible 64-bit range".

So take that code as what it is - an untested and slightly edited
version of the code you sent me, meant for "how about like this"
purposes rather than anything else.

                   Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-04-27 16:00                                             ` Linus Torvalds
@ 2022-05-04 18:56                                               ` Borislav Petkov
  2022-05-04 19:22                                                 ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-04 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

Just to update folks here: I haven't forgotten about this - Mel and I
are running some benchmarks first and staring at results to see whether
all the hoopla is even worth it.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-05-04 18:56                                               ` Borislav Petkov
@ 2022-05-04 19:22                                                 ` Linus Torvalds
  2022-05-04 20:18                                                   ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-04 19:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits

On Wed, May 4, 2022 at 11:56 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Just to update folks here: I haven't forgotten about this - Mel and I
> are running some benchmarks first and staring at results to see whether
> all the hoopla is even worth it.

Side note: the "do FSRM inline" would likely be a really good thing
for "copy_to_user()", more so than the silly "clear_user()" that we
realistically do almost nowhere.

I doubt you can find "clear_user()" outside of benchmarks (but hey,
people do odd things).

But "copy_to_user()" is everywhere, and the I$ advantage of inlining
it might be noticeable on some real loads.

I remember some git profiles having copy_to_user very high due to
fstat(), for example - cp_new_stat64 and friends.

Of course, I haven't profiled git in ages, but I doubt that has
changed. Many of those kinds of loads are all about name lookup and
stat (basic things like "make" would be that too, if it weren't for
the fact that it spends a _lot_ of its time in user space string
handling).

The inlining advantage would obviously only show up on CPUs that
actually do FSRM. Which I think is currently only Ice Lake. I don't
have access to one.
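
A very rough sketch of what an inlined FSRM copy path could look like,
by analogy with the clear_user() rework above (illustrative only: the
wrapper name and the __copy_user_fallback helper with its %rcx-based
convention are made up here, and the callee would also need the objtool
uaccess whitelisting discussed earlier in the thread):

/*
 * Hypothetical: __copy_user_fallback takes dst/src/count in
 * %rdi/%rsi/%rcx, returns uncopied bytes in %rcx and preserves
 * all other registers (hence the save/restore mentioned before).
 */
static __always_inline __must_check unsigned long
rep_movs_to_user(void __user *to, const void *from, unsigned long len)
{
	stac();
	asm volatile(
		"1:\n\t"
		ALTERNATIVE("call __copy_user_fallback", "rep movsb",
			    X86_FEATURE_FSRM)
		"2:\n\t"
		_ASM_EXTABLE_UA(1b, 2b)
		: "+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
		: : "memory");
	clac();

	return len;		/* bytes not copied */
}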

                Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-05-04 19:22                                                 ` Linus Torvalds
@ 2022-05-04 20:18                                                   ` Borislav Petkov
  2022-05-04 20:40                                                     ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-04 20:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, May 04, 2022 at 12:22:34PM -0700, Linus Torvalds wrote:
> Side note: the "do FSRM inline" would likely be a really good thing
> for "copy_to_user()", more so than the silly "clear_user()" that we
> realistically do almost nowhere.

Right, that would be my next project.
> 
> I doubt you can find "clear_user()" outside of benchmarks (but hey,
> people do odd things).

Well, see preview below.

> But "copy_to_user()" is everywhere, and the I$ advantage of inlining
> it might be noticeable on some real loads.
> 
> I remember some git profiles having copy_to_user very high due to
> fstat(), for example - cp_new_stat64 and friends.
> 
> Of course, I haven't profiled git in ages, but I doubt that has

Yeah, see below.

> changed. Many of those kinds of loads are all about name lookup and
> stat (basic things like "make" would be that too, if it weren't for
> the fact that it spends a _lot_ of its time in user space string
> handling).
> 
> The inlining advantage would obviously only show up on CPUs that
> actually do FSRM. Which I think is currently only Ice Lake. I don't
> have access to one.

Zen3 has FSRM.

So below's the git test suite with clear_user on Zen3. It creates a lot
of processes so we get to clear_user a bunch and that's the inlined rep
movsb.

You can see some small but noticeable improvement:

gitsource
                                      rc              clear_use
                                     rc5             clear_user
Min       User         196.65 (   0.00%)      193.16 (   1.77%)
Min       System        57.20 (   0.00%)       55.89 (   2.29%)
Min       Elapsed      270.27 (   0.00%)      266.09 (   1.55%)
Min       CPU           93.00 (   0.00%)       93.00 (   0.00%)
Amean     User         197.05 (   0.00%)      194.14 *   1.48%*
Amean     System        57.41 (   0.00%)       56.35 *   1.83%*
Amean     Elapsed      270.97 (   0.00%)      266.90 *   1.50%*
Amean     CPU           93.00 (   0.00%)       93.00 (   0.00%)
Stddev    User           0.25 (   0.00%)        0.64 (-151.28%)
Stddev    System         0.24 (   0.00%)        0.31 ( -28.73%)
Stddev    Elapsed        0.56 (   0.00%)        0.62 ( -10.17%)
Stddev    CPU            0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  User           0.13 (   0.00%)        0.33 (-155.05%)
CoeffVar  System         0.41 (   0.00%)        0.54 ( -31.13%)
CoeffVar  Elapsed        0.21 (   0.00%)        0.23 ( -11.85%)
CoeffVar  CPU            0.00 (   0.00%)        0.00 (   0.00%)
Max       User         197.35 (   0.00%)      194.92 (   1.23%)
Max       System        57.75 (   0.00%)       56.64 (   1.92%)
Max       Elapsed      271.66 (   0.00%)      267.60 (   1.49%)
Max       CPU           93.00 (   0.00%)       93.00 (   0.00%)
BAmean-50 User         196.85 (   0.00%)      193.60 (   1.65%)
BAmean-50 System        57.20 (   0.00%)       56.05 (   2.01%)
BAmean-50 Elapsed      270.40 (   0.00%)      266.29 (   1.52%)
BAmean-50 CPU           93.00 (   0.00%)       93.00 (   0.00%)
BAmean-95 User         196.98 (   0.00%)      193.94 (   1.54%)
BAmean-95 System        57.32 (   0.00%)       56.28 (   1.81%)
BAmean-95 Elapsed      270.79 (   0.00%)      266.72 (   1.50%)
BAmean-95 CPU           93.00 (   0.00%)       93.00 (   0.00%)
BAmean-99 User         196.98 (   0.00%)      193.94 (   1.54%)
BAmean-99 System        57.32 (   0.00%)       56.28 (   1.81%)
BAmean-99 Elapsed      270.79 (   0.00%)      266.72 (   1.50%)
BAmean-99 CPU           93.00 (   0.00%)       93.00 (   0.00%)

                          rc   clear_use
                         rc5  clear_user
Duration User        1182.22     1165.67
Duration System       345.58      338.46
Duration Elapsed     1626.80     1602.99

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-05-04 20:18                                                   ` Borislav Petkov
@ 2022-05-04 20:40                                                     ` Linus Torvalds
  2022-05-04 21:01                                                       ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-04 20:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, May 4, 2022 at 1:18 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Zen3 has FSRM.

Sadly, I'm on Zen2 with my 3970X, and the Zen 3 threadrippers seem to
be basically impossible to get.

> So below's the git test suite with clear_user on Zen3. It creates a lot
> of processes so we get to clear_user a bunch and that's the inlined rep
> movsb.

Oh, the clear_user() in the ELF loader? I wouldn't have expected that
to be noticeable.

Now, clear_page() is *very* noticeable, but that has its own special
magic and doesn't use clear_user().

Maybe there is some other clear_user() case I never thought of. My dim
memories of profiles definitely had copy_to_user, clear_page and
copy_page being some of the top copy loops.

                   Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-05-04 20:40                                                     ` Linus Torvalds
@ 2022-05-04 21:01                                                       ` Borislav Petkov
  2022-05-04 21:09                                                         ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-04 21:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, May 04, 2022 at 01:40:07PM -0700, Linus Torvalds wrote:
> Sadly, I'm on Zen2 with my 3970X, and the Zen 3 threadrippers seem to
> be basically impossible to get.

Yeah, what we did is get a gigabyte board:

[    0.000000] DMI: GIGABYTE MZ32-AR0-00/MZ32-AR0-00, BIOS M06 07/10/2021

and stick a server CPU in it:

[    2.352371] smpboot: CPU0: AMD EPYC 7313 16-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)

so that we can have the memory encryption stuff. 32 threads is fairly
decent and kernel builds are fast enough, at least for me. If you need
to do a lot of allmodconfigs, you probably need something bigger though.

> Oh, the clear_user() in the ELF loader? I wouldn't have expected that
> to be noticeable.

Yah, that guy in load_elf_binary(). At least it did hit my breakpoint
fairly often so I thought, what is a benchmark that does create a lot of
processes...

> Now, clear_page() is *very* noticeable, but that has its own special
> magic and doesn't use clear_user().

Yeah, that's with the alternative CALL thing. Could be useful to try to
see how much the inlined rep; movsb would bring...

> Maybe there is some other clear_user() case I never thought of. My
> dim memories of profiles definitely had copy_to_user, clear_page and
> copy_page being some of the top copy loops.

I could try to do a perf probe or whatever fancy new thing we do now on
clear_user to get some numbers of how many times it gets called during
the benchmark run. Or do you wanna know the callers too?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
  2022-05-04 21:01                                                       ` Borislav Petkov
@ 2022-05-04 21:09                                                         ` Linus Torvalds
  2022-05-10  9:31                                                           ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-04 21:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, May 4, 2022 at 2:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> I could try to do a perf probe or whatever fancy new thing we do now on
> clear_user to get some numbers of how many times it gets called during
> the benchmark run. Or do you wanna know the callers too?

One of the non-performance reasons I like inlined memcpy is actually
that when you do a regular 'perf record' run, the cost of the memcpy
gets associated with the call-site.

Which is universally what I want for those things. I used to love our
inlined spinlocks for the same reason back when we did them.

Yeah, yeah, you can do it with callchain magic, but then you get it
all - and I really consider memcpy/memset to be a special case.
Normally I want the "oh, that leaf function is expensive", but not for
memcpy and memset (and not for spinlocks, but we'll never go back to
the old trivial spinlocks)

I don't tend to particularly care about "how many times has this been
called" kind of trace profiles. It's the actual expense in CPU cycles
I tend to care about.

That said, I cared deeply about those kinds of CPU profiles when I was
working with Al on the RCU path lookup code and looking for where the
problem spots were.

That was years ago.

I haven't really done serious profiling work for a while (which is
just as well, because it's one of the things that went backwards when
I switched to the Zen 2 threadripper for my main machine)

                  Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
  2022-05-04 21:09                                                         ` Linus Torvalds
@ 2022-05-10  9:31                                                           ` Borislav Petkov
  2022-05-10 17:17                                                             ` Linus Torvalds
  2022-05-10 17:28                                                             ` Linus Torvalds
  0 siblings, 2 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-10  9:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

Lemme fix that subject so that I can find it easier in my avalanche mbox...

On Wed, May 04, 2022 at 02:09:52PM -0700, Linus Torvalds wrote:
> I don't tend to particularly care about "how many times has this been
> called" kind of trace profiles. It's the actual expense in CPU cycles
> I tend to care about.

Yeah, but, I wanted to measure how much perf improvement that would
bring with the git test suite and wanted to know how often clear_user()
is called in conjunction with it. Because the benchmarks I ran would
show very small improvements and a PF benchmark would even show weird
things like slowdowns with higher core counts.

So for a test suite run of ~6 minutes, the function gets called under 700K
times, all from padzero:

           <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
           <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
           <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
           <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
	   ...

which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.

So Mel gave me the idea to simply measure how fast the function becomes. I.e.:

                start = rdtsc_ordered();
                ret = __clear_user(to, n);
                end = rdtsc_ordered();

Computing the mean average of all the samples collected during the test
suite run then shows some improvement:

clear_user_original:
Amean: 9219.71 (Sum: 6340154910, samples: 687674)

fsrm:
Amean: 8030.63 (Sum: 5522277720, samples: 687652)

That's on Zen3.

I'll run this on Icelake now too.
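
(In case anyone wants to reproduce the Amean computation: it is nothing
more than the arithmetic mean over the collected cycles values. Assuming
they've been extracted from the trace into one number per line on stdin,
a trivial helper like this does it - purely illustrative, not part of
any tree:

#include <stdio.h>

int main(void)
{
	unsigned long long cycles, sum = 0, n = 0;

	/* one cycles value per line, as extracted from the trace above */
	while (scanf("%llu", &cycles) == 1) {
		sum += cycles;
		n++;
	}

	if (n)
		printf("Amean: %.2f (Sum: %llu, samples: %llu)\n",
		       (double)sum / n, sum, n);

	return 0;
}
)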

> I haven't really done serious profiling work for a while (which is
> just as well, because it's one of the things that went backwards when
> I switched to the Zen 2 threadripper for my main machine)

Because the perf support there is not as advanced? Any pain points I can
forward?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
  2022-05-10  9:31                                                           ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
@ 2022-05-10 17:17                                                             ` Linus Torvalds
  2022-05-10 17:28                                                             ` Linus Torvalds
  1 sibling, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-05-10 17:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> > I haven't really done serious profiling work for a while (which is
> > just as well, because it's one of the things that went backwards when
> > I switched to the Zen 2 threadripper for my main machine)
>
> Because of the not as advanced perf support there? Any pain points I can
> forward?

It's not anything fancy, and it's not anything new - you've been cc'd
on me talking about it before.

As mentioned, I don't actually do anything fancy with profiling - I
basically almost always just want to do a simple

    perf record -e cycles:pp

so that I get reasonable instruction attribution for what the
expensive part actually is (where "actually is" is obviously always
just an approximation, I'm not claiming anything else - but I just
don't want to have to try to figure out some huge instruction skid
issue). And then (because I only tend to care about the kernel, and
don't care about _who_ is doing things), I just do

    perf report --sort=dso,symbol

and start looking at the kernel side of things. I then occasionally
enable -g, but I hate doing it, and it's usually because I see "oh
damn, some spinlock slowpath, let's see what the callers are" just to
figure out which spinlock it is.

VERY rudimentary, in other words. It's the "I don't know where the
time is going, so let's find out".

And that simple thing doesn't work, because Zen 2 doesn't like the
per-thread profiling. So I can be root and use '-a', and it works
fine, except I don't want to do things as root just for profiling.
Plus I don't actually want to see the ACPI "CPU idle" things.

I'm told the issue is that Zen 2 is not stable with IBS (aka "PEBS for
AMD"), which is all kinds of sad, but there it is.

As also mentioned, it's not actually a huge deal for me, because all I
do is read email and do "git pull". And the times when I used
profiling to find things git could need improvement on (usually
pathname lookup for "git diff") are long long gone.

               Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
  2022-05-10  9:31                                                           ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
  2022-05-10 17:17                                                             ` Linus Torvalds
@ 2022-05-10 17:28                                                             ` Linus Torvalds
  2022-05-10 18:10                                                               ` Borislav Petkov
  1 sibling, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-10 17:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> clear_user_original:
> Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
> fsrm:
> Amean: 8030.63 (Sum: 5522277720, samples: 687652)

Well, that's pretty conclusive.

I'm obviously very happy with fsrm. I've been pushing for that thing
for probably over two decades by now, because I absolutely detest
uarch optimizations for memset/memcpy that can never be done well in
software anyway (because it depends not just on cache organization,
but on cache sizes and dynamic cache hit/miss behavior of the load).

And one of the things I always wanted to do was to just have
memcpy/memset entirely inlined.

In fact, if you go back to the 0.01 linux kernel sources, you'll see
that they only compile with my bastardized version of gcc-1.40,
because I made the compiler inline those things with 'rep movs/stos',
and there was no other implementation of memcpy/memset at all.

That was a bit optimistic at the time, but here we are, 30+ years
later and it is finally looking possible, at least on some uarchs.

               Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
  2022-05-10 17:28                                                             ` Linus Torvalds
@ 2022-05-10 18:10                                                               ` Borislav Petkov
  2022-05-10 18:57                                                                 ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-10 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 10, 2022 at 10:28:28AM -0700, Linus Torvalds wrote:
> Well, that's pretty conclusive.

Yap. It appears I don't have a production-type Icelake so I probably
can't show the numbers there but at least I can check whether there's an
improvement too.

> I'm obviously very happy with fsrm. I've been pushing for that thing
> for probably over two decades by now,

The time sounds about right - I'm closing in on two decades poking at
the kernel myself and I've yet to see a more complex feature I've been
advocating for, materialize.

> because I absolutely detest uarch optimizations for memset/memcpy that
> can never be done well in software anyway (because it depends not just
> on cache organization, but on cache sizes and dynamic cache hit/miss
> behavior of the load).

Yeah, you want all that cacheline aggregation to happen underneath where
it can do all the checks etc.

> And one of the things I always wanted to do was to just have
> memcpy/memset entirely inlined.
> 
> In fact, if you go back to the 0.01 linux kernel sources, you'll see

LOL, I think I've seen those sources printed out on a wall somewhere.
:-)

> that they only compile with my bastardized version of gcc-1.40,
> because I made the compiler inline those things with 'rep movs/stos',
> and there was no other implementation of memcpy/memset at all.

Yeah, I have it on my todo to look at inlining the other primitives
too, and see whether that brings any improvements. Now our patching
infrastructure is nicely mature too so that we can be very creative
there.

> That was a bit optimistic at the time, but here we are, 30+ years
> later and it is finally looking possible, at least on some uarchs.

Yap, it takes "only" 30+ years. :-\

And when you think of all the crap stuff that got added in silicon
and *removed* *again* in the meantime... but I'm optimistic now that
Murphy's Law is not going to hold true anymore, we will finally start
optimizing hardware *and* software.

:-)))

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
  2022-05-10 18:10                                                               ` Borislav Petkov
@ 2022-05-10 18:57                                                                 ` Borislav Petkov
  2022-05-24 12:32                                                                   ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-10 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 10, 2022 at 08:10:14PM +0200, Borislav Petkov wrote:
> And when you think of all the crap stuff that got added in silicon
> and *removed* *again* in the meantime... but I'm optimistic now that
> Murphy's Law is not going to hold true anymore, we will finally start
  ^^^^^^

Dunno if this was a Freudian slip... :-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH] x86/clear_user: Make it faster
  2022-05-10 18:57                                                                 ` Borislav Petkov
@ 2022-05-24 12:32                                                                   ` Borislav Petkov
  2022-05-24 16:51                                                                     ` Linus Torvalds
                                                                                       ` (3 more replies)
  0 siblings, 4 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-24 12:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

Ok,

finally a somewhat final version, lightly tested.

I still need to run it on production Icelake and that is kinda being
delayed due to server room cooling issues (don't ask ;-\).

---
From: Borislav Petkov <bp@suse.de>

Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.

The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there - if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is taking place.

The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.

So for a ~6 minute run of the git test suite, the function gets called under
700K times, all from padzero():

  <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
  <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
  <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
  <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
   ...

which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.

So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:

  start = rdtsc_ordered();
  ret = __clear_user(to, n);
  end = rdtsc_ordered();

Computing the mean average of all the samples collected during the test
suite run then shows some improvement:

  clear_user_original:
  Amean: 9219.71 (Sum: 6340154910, samples: 687674)

  fsrm:
  Amean: 8030.63 (Sum: 5522277720, samples: 687652)

That's on Zen3.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
 arch/x86/include/asm/uaccess.h    |   5 +-
 arch/x86/include/asm/uaccess_64.h |  45 ++++++++++
 arch/x86/lib/clear_page_64.S      | 142 ++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c        |  40 ---------
 tools/objtool/check.c             |   3 +
 5 files changed, 192 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index f78e2b3501a1..335d571e8a79 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -405,9 +405,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
 
 extern __must_check long strnlen_user(const char __user *str, long n);
 
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -429,6 +426,8 @@ extern struct movsl_mask {
 #define ARCH_HAS_NOCACHE_UACCESS 1
 
 #ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 # include <asm/uaccess_32.h>
 #else
 # include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..4ffefb4bb844 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+	might_fault();
+	stac();
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+	asm volatile(
+		"1:\n\t"
+		ALTERNATIVE_3("rep stosb",
+			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
+			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+		"2:\n"
+	       _ASM_EXTABLE_UA(1b, 2b)
+	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+	       : "a" (0)
+		/* rep_good clobbers %rdx */
+	       : "rdx");
+
+	clac();
+
+	return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+	if (access_ok(to, n))
+		return __clear_user(to, n);
+	return n;
+}
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..6dd6c7fd1eb8 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +51,144 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+	/*
+	 * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+	 * i.e., no need for a 'q' suffix and thus a REX prefix.
+	 */
+	mov %ecx,%eax
+	shr $3,%rcx
+	jz .Lrest_bytes
+
+	# do the qwords first
+	.p2align 4
+.Lqwords:
+	movq $0,(%rdi)
+	lea 8(%rdi),%rdi
+	dec %rcx
+	jnz .Lqwords
+
+.Lrest_bytes:
+	and $7,  %eax
+	jz .Lexit
+
+	# now do the rest bytes
+.Lbytes:
+	movb $0,(%rdi)
+	inc %rdi
+	dec %eax
+	jnz .Lbytes
+
+.Lexit:
+	/*
+	 * %rax still needs to be cleared in the exception case because this function is called
+	 * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+	 * in case it might reuse it somewhere.
+	 */
+        xor %eax,%eax
+        RET
+
+.Lqwords_exception:
+        # convert remaining qwords back into bytes to return to caller
+        shl $3, %rcx
+        and $7, %eax
+        add %rax,%rcx
+        jmp .Lexit
+
+.Lbytes_exception:
+        mov %eax,%ecx
+        jmp .Lexit
+
+        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+        _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	ja  .Lprep
+	call clear_user_original
+	RET
+
+.Lprep:
+	# copy lower 32-bits for rest bytes
+	mov %ecx, %edx
+	shr $3, %rcx
+	jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+	rep stosq
+
+.Lrep_good_rest_bytes:
+	and $7, %edx
+	jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+	mov %edx, %ecx
+	rep stosb
+
+.Lrep_good_exit:
+	# see .Lexit comment above
+	xor %eax, %eax
+	RET
+
+.Lrep_good_qwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %edx
+	add %rdx, %rcx
+	jmp .Lrep_good_exit
+
+	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	ja  .Lerms_bytes
+	call clear_user_original
+	RET
+
+.Lerms_bytes:
+	rep stosb
+
+.Lerms_exit:
+	xorl %eax,%eax
+	RET
+
+	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
-	if (access_ok(to, n))
-		return __clear_user(to, n);
-	return n;
-}
-EXPORT_SYMBOL(clear_user);
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index ca5b74603008..e460aa004b88 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
 	"copy_mc_fragile_handle_tail",
 	"copy_mc_enhanced_fast_string",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+	"clear_user_erms",
+	"clear_user_rep_good",
+	"clear_user_original",
 	NULL
 };
 
-- 
2.35.1


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-05-24 12:32                                                                   ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
@ 2022-05-24 16:51                                                                     ` Linus Torvalds
  2022-05-24 17:30                                                                       ` Borislav Petkov
  2022-05-25 12:11                                                                     ` Mark Hemment
                                                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-24 16:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 24, 2022 at 5:32 AM Borislav Petkov <bp@alien8.de> wrote:
>
> finally a somewhat final version, lightly tested.

I can't find anything wrong with this, but who knows what
patch-blindness I have from looking at a few different versions of it.
Maybe my eyes just skim over it now.

I do note that the clearing of %rax here:

> +.Lerms_exit:
> +       xorl %eax,%eax
> +       RET

seems to be unnecessary, since %rax is never modified in the path
leading to this. But maybe it's just as well for consistency with the
cases where it *is* used as a temporary.

And I still suspect that "copy_to_user()" is *much* more interesting
than "clear_user()", but I guess we can't inline it anyway due to all
the other overhead (ie access_ok() and stac/clac).

And for a plain "call memcpy/memset", we'd need compiler help to do
this (at a minimum, we'd have to have the compiler use the 'rep
movs/stos' register logic, and then we could patch things in place
afterwards, with objtool creating the alternatives section or
something).

                    Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-05-24 16:51                                                                     ` Linus Torvalds
@ 2022-05-24 17:30                                                                       ` Borislav Petkov
  0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-24 17:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 24, 2022 at 09:51:56AM -0700, Linus Torvalds wrote:
> I can't find anything wrong with this, but who knows what
> patch-blindness I have from looking at a few different versions of it.
> Maybe my eyes just skim over it now.

Same here - I can't look at that code anymore. I'll try to gain some
distance and look at it again later, and do some more extensive testing
too.

> I do note that the clearing of %rax here:
> 
> > +.Lerms_exit:
> > +       xorl %eax,%eax
> > +       RET
> 
> seems to be unnecessary, since %rax is never modified in the path
> leading to this. But maybe just as well just for consistency with the
> cases where it *is* used as a temporary.

Yeah.

> And I still suspect that "copy_to_user()" is *much* more interesting
> than "clear_user()", but I guess we can't inline it anyway due to all
> the other overhead (ie access_ok() and stac/clac).
> 
> And for a plain "call memcpy/memset", we'd need compiler help to do
> this (at a minimum, we'd have to have the compiler use the 'rep
> movs/stos' register logic, and then we could patch things in place
> afterwards, with objtool creating the alternatives section or
> something).

Yeah, I have this on my todo to research them properly. Will report when
I have something.
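
Just to sketch what I have in mind for the copy_to_user() side
(ignoring the access_ok() question for now) - completely untested, and
"copy_user_fallback" is a made-up placeholder, not an existing symbol -
it would basically mirror the __clear_user() alternative above:

static __always_inline __must_check unsigned long
__copy_to_user_fsrm(void __user *to, const void *from, unsigned long size)
{
	stac();

	/*
	 * FSRM machines get the inlined "rep movsb"; everything else
	 * falls back to an out-of-line helper, same as clear_user().
	 */
	asm volatile(
		"1:\n\t"
		ALTERNATIVE("rep movsb",
			    "call copy_user_fallback", ALT_NOT(X86_FEATURE_FSRM))
		"2:\n"
		_ASM_EXTABLE_UA(1b, 2b)
		: "+c" (size), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
		: : "memory");

	clac();

	return size;
}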

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-05-24 12:32                                                                   ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
  2022-05-24 16:51                                                                     ` Linus Torvalds
@ 2022-05-25 12:11                                                                     ` Mark Hemment
  2022-05-27 11:28                                                                       ` Borislav Petkov
  2022-05-27 11:10                                                                     ` Ingo Molnar
  2022-06-22 14:21                                                                     ` Borislav Petkov
  3 siblings, 1 reply; 82+ messages in thread
From: Mark Hemment @ 2022-05-25 12:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, 24 May 2022 at 13:32, Borislav Petkov <bp@alien8.de> wrote:
>
> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
>
> ---
> From: Borislav Petkov <bp@suse.de>
>
> Based on a patch by Mark Hemment <markhemm@googlemail.com> and
> incorporating very sane suggestions from Linus.
>
> The point here is to have the default case with FSRM - which is supposed
> to be the majority of x86 hw out there - if not now then soon - be
> directly inlined into the instruction stream so that no function call
> overhead is taking place.
>
> The benchmarks I ran would show very small improvements and a PF
> benchmark would even show weird things like slowdowns with higher core
> counts.
>
> So for a ~6 minute run of the git test suite, the function gets called under
> 700K times, all from padzero():
>
>   <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
>   <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
>   <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
>   <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
>    ...
>
> which is around 1%-ish of the total time and which is consistent with
> the benchmark numbers.
>
> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
>   start = rdtsc_ordered();
>   ret = __clear_user(to, n);
>   end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
>   clear_user_original:
>   Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
>   fsrm:
>   Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.
>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
> ---
>  arch/x86/include/asm/uaccess.h    |   5 +-
>  arch/x86/include/asm/uaccess_64.h |  45 ++++++++++
>  arch/x86/lib/clear_page_64.S      | 142 ++++++++++++++++++++++++++++++
>  arch/x86/lib/usercopy_64.c        |  40 ---------
>  tools/objtool/check.c             |   3 +
>  5 files changed, 192 insertions(+), 43 deletions(-)

[...snip...]
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index fe59b8ac4fcc..6dd6c7fd1eb8 100644
...
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
> + * present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + */
> +SYM_FUNC_START(clear_user_rep_good)
> +       # call the original thing for less than a cacheline
> +       cmp $64, %rcx
> +       ja  .Lprep
> +       call clear_user_original
> +       RET
> +
> +.Lprep:
> +       # copy lower 32-bits for rest bytes
> +       mov %ecx, %edx
> +       shr $3, %rcx
> +       jz .Lrep_good_rest_bytes

A slight doubt here; comment says "less than a cacheline", but the code
is using 'ja' (jump if above) - so calls 'clear_user_original' for a
'len' less than or equal to 64.
Not sure of the intended behaviour for 64 bytes here, but
'copy_user_enhanced_fast_string' uses the slow-method for lengths less
than 64.  So, should this be coded as;
    cmp $64,%rcx
    jb clear_user_original
?

'clear_user_erms' has similar logic which might also need reworking.

> +
> +.Lrep_good_qwords:
> +       rep stosq
> +
> +.Lrep_good_rest_bytes:
> +       and $7, %edx
> +       jz .Lrep_good_exit
> +
> +.Lrep_good_bytes:
> +       mov %edx, %ecx
> +       rep stosb
> +
> +.Lrep_good_exit:
> +       # see .Lexit comment above
> +       xor %eax, %eax
> +       RET
> +
> +.Lrep_good_qwords_exception:
> +       # convert remaining qwords back into bytes to return to caller
> +       shl $3, %rcx
> +       and $7, %edx
> +       add %rdx, %rcx
> +       jmp .Lrep_good_exit
> +
> +       _ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
> +       _ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
> +SYM_FUNC_END(clear_user_rep_good)
> +EXPORT_SYMBOL(clear_user_rep_good)
> +
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + *
> + */
> +SYM_FUNC_START(clear_user_erms)
> +       # call the original thing for less than a cacheline
> +       cmp $64, %rcx
> +       ja  .Lerms_bytes
> +       call clear_user_original
> +       RET
> +
> +.Lerms_bytes:
> +       rep stosb
> +
> +.Lerms_exit:
> +       xorl %eax,%eax
> +       RET
> +
> +       _ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
> +SYM_FUNC_END(clear_user_erms)
> +EXPORT_SYMBOL(clear_user_erms)
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 0ae6cf804197..6c1f8ac5e721 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -14,46 +14,6 @@
>   * Zero Userspace
>   */
>
> -unsigned long __clear_user(void __user *addr, unsigned long size)
> -{
> -       long __d0;
> -       might_fault();
> -       /* no memory constraint because it doesn't change any memory gcc knows
> -          about */
> -       stac();
> -       asm volatile(
> -               "       testq  %[size8],%[size8]\n"
> -               "       jz     4f\n"
> -               "       .align 16\n"
> -               "0:     movq $0,(%[dst])\n"
> -               "       addq   $8,%[dst]\n"
> -               "       decl %%ecx ; jnz   0b\n"
> -               "4:     movq  %[size1],%%rcx\n"
> -               "       testl %%ecx,%%ecx\n"
> -               "       jz     2f\n"
> -               "1:     movb   $0,(%[dst])\n"
> -               "       incq   %[dst]\n"
> -               "       decl %%ecx ; jnz  1b\n"
> -               "2:\n"
> -
> -               _ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
> -               _ASM_EXTABLE_UA(1b, 2b)
> -
> -               : [size8] "=&c"(size), [dst] "=&D" (__d0)
> -               : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
> -       clac();
> -       return size;
> -}
> -EXPORT_SYMBOL(__clear_user);
> -
> -unsigned long clear_user(void __user *to, unsigned long n)
> -{
> -       if (access_ok(to, n))
> -               return __clear_user(to, n);
> -       return n;
> -}
> -EXPORT_SYMBOL(clear_user);
> -
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  /**
>   * clean_cache_range - write back a cache range with CLWB
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index ca5b74603008..e460aa004b88 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
>         "copy_mc_fragile_handle_tail",
>         "copy_mc_enhanced_fast_string",
>         "ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
> +       "clear_user_erms",
> +       "clear_user_rep_good",
> +       "clear_user_original",
>         NULL
>  };
>
> --
> 2.35.1
>
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

Cheers,
Mark

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-05-24 12:32                                                                   ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
  2022-05-24 16:51                                                                     ` Linus Torvalds
  2022-05-25 12:11                                                                     ` Mark Hemment
@ 2022-05-27 11:10                                                                     ` Ingo Molnar
  2022-06-22 14:21                                                                     ` Borislav Petkov
  3 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2022-05-27 11:10 UTC (permalink / raw)
  To: Borislav Petkov, Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, Mark Hemment, Andrew Morton,
	the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
	mm-commits, Mel Gorman


* Borislav Petkov <bp@alien8.de> wrote:

> Ok,
> 
> finally a somewhat final version, lightly tested.
> 
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).

> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
> 
>   start = rdtsc_ordered();
>   ret = __clear_user(to, n);
>   end = rdtsc_ordered();
> 
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
> 
>   clear_user_original:
>   Amean: 9219.71 (Sum: 6340154910, samples: 687674)
> 
>   fsrm:
>   Amean: 8030.63 (Sum: 5522277720, samples: 687652)
> 
> That's on Zen3.

As a side note, there's some rudimentary perf tooling that allows the 
user-space testing of kernel-space x86 memcpy and memset implementations:

 $ perf bench mem memcpy
 # Running 'mem/memcpy' benchmark:
 # function 'default' (Default memcpy() provided by glibc)
 # Copying 1MB bytes ...

       42.459239 GB/sec
 # function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
 # Copying 1MB bytes ...

       23.818598 GB/sec
 # function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
 # Copying 1MB bytes ...

       10.172526 GB/sec
 # function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
 # Copying 1MB bytes ...

       10.614810 GB/sec

Note how the actual implementation in arch/x86/lib/memcpy_64.S was used to 
build a user-space test into 'perf bench'.

For copy_user() & clear_user() some additional wrappery would be needed I 
guess, to wrap away stac()/clac()/might_sleep(), etc. ...
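
Something like the following, perhaps - a hypothetical shim, not code
perf actually carries, just to illustrate the kind of stubbing needed
before the .S routines could be linked into a user-space benchmark:

/* hypothetical user-space shim - none of this exists in perf today */
#define might_fault()	do { } while (0)
#define might_sleep()	do { } while (0)
#define stac()		do { } while (0)
#define clac()		do { } while (0)
#define __user		/* empty in user space */

/*
 * The _ASM_EXTABLE_UA() annotations in the .S files would need an
 * empty-section stub as well, since there is no exception table here.
 */

unsigned long clear_user_original(void __user *addr, unsigned long len);

static unsigned long bench_clear_user(void *buf, unsigned long len)
{
	/* no access_ok() - the benchmark owns the buffer */
	return clear_user_original(buf, len);
}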

[ Plus it could all be improved to measure cache hot & cache cold 
  performance, to use different sizes, etc. ]

Even with the limitation that it's not 100% equivalent to the kernel-space 
thing, especially for very short buffers, having the whole perf side 
benchmarking, profiling & statistics machinery available is a plus I think.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-05-25 12:11                                                                     ` Mark Hemment
@ 2022-05-27 11:28                                                                       ` Borislav Petkov
  0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-27 11:28 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, May 25, 2022 at 01:11:17PM +0100, Mark Hemment wrote:
> A slight doubt here; comment says "less than a cachline", but the code
> is using 'ja' (jump if above) - so calls 'clear_user_original' for a
> 'len' less than or equal to 64.
> Not sure of the intended behaviour for 64 bytes here, but
> 'copy_user_enhanced_fast_string' uses the slow-method for lengths less
> than 64.  So, should this be coded as;
>     cmp $64,%rcx
>     jb clear_user_original
> ?

Yeah, it probably doesn't matter whether you clear a cacheline the "old"
way or with some of the new ones. clear_user() performance matters only
in microbenchmarks, as I've come to realize.

But your suggestion simplifies the code so lemme do that.

Thx!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-05-24 12:32                                                                   ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
                                                                                       ` (2 preceding siblings ...)
  2022-05-27 11:10                                                                     ` Ingo Molnar
@ 2022-06-22 14:21                                                                     ` Borislav Petkov
  2022-06-22 15:06                                                                       ` Linus Torvalds
  3 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-06-22 14:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Tue, May 24, 2022 at 02:32:36PM +0200, Borislav Petkov wrote:
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).

So I finally got a production level ICL-X:

[    0.220822] smpboot: CPU0: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz (family: 0x6, model: 0x6a, stepping: 0x6)

and frankly, this looks really weird:

clear_user_original:
Amean: 19679.4 (Sum: 13652560764, samples: 693750)
Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

ERMS:
Amean: 20374.3 (Sum: 13910601024, samples: 682752)
Amean: 20453.7 (Sum: 14186223606, samples: 693576)

FSRM:
Amean: 20458.2 (Sum: 13918381386, samples: 680331)

so either that particular box is weird or Icelake really is slower wrt
FSRM or I've fat-fingered it somewhere.

But my measuring is as trivial as:

static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
{
	if (access_ok(to, n)) {
		unsigned long start, end, ret;

		start = rdtsc_ordered();
		ret = __clear_user(to, n);
		end = rdtsc_ordered();

		trace_printk("to: 0x%lx, size: %ld, cycles: %lu\n", (unsigned long)to, n, end - start);

		return ret;
	}

	return n;
}

so if anything I don't see what the problem could be...

Hmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-06-22 14:21                                                                     ` Borislav Petkov
@ 2022-06-22 15:06                                                                       ` Linus Torvalds
  2022-06-22 20:14                                                                         ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-06-22 15:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, Jun 22, 2022 at 9:21 AM Borislav Petkov <bp@alien8.de> wrote:
>
> and frankly, this looks really weird:

I'm not sure how valid the TSC thing is, with the extra
synchronization maybe interacting with the whole microcode engine
startup/stop thing.

I'm also not sure the rdtsc is doing the same thing on your AMD tests
vs your Intel tests - I suspect you end up both using 'rdtscp' (as
opposed to the 'lsync' variant we also have), but I don't think the
ordering really is all that well defined architecturally, so AMD may
have very different serialization rules than Intel does.

.. and that serialization may well be different wrt normal load/stores
and microcode.
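
To be concrete, the two flavors I mean are roughly these - illustration
only, not the exact rdtsc_ordered() implementation:

static inline unsigned long long tsc_lfence(void)
{
	unsigned int lo, hi;

	/* LFENCE waits for preceding instructions before the TSC read */
	asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi) : : "memory");

	return ((unsigned long long)hi << 32) | lo;
}

static inline unsigned long long tsc_rdtscp(void)
{
	unsigned int lo, hi, aux;

	/* RDTSCP waits for earlier instructions, but later ones can start early */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux) : : "memory");

	return ((unsigned long long)hi << 32) | lo;
}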

So those numbers look like they have a 3% difference, but I'm not 100%
convinced it might not be due to measuring artifacts. The fact that it
worked well for you on your AMD platform doesn't necessarily mean that
it has to work on icelake-x.

But it could equally easily be that "rep stosb" really just isn't any
better on that platform, and the numbers are just giving the plain
reality.

Or it could mean that it makes some cache access decision ("this is
big enough that let's not pollute L1 caches, do stores directly to
L2") that might be better for actual performance afterwards, but that
makes that clearing itself that bit slower.

IOW, I do think that microbenchmarks are kind of suspect to begin
with, and the rdtsc thing in particular may work better on some
microarchitectures than it does others.

Very hard to make a judgment call - I think the only thing that really
ends up mattering is the macro-benchmarks, but I think when you tried
that it was way too noisy to actually show any real signal.

That is, of course, a problem with memcpy and memset in general. It's
easy to do microbenchmarks for them, it's just not clear whether said
microbenchmarks give numbers that are actually meaningful, exactly
because of things like cache replacement policy etc.

And finally, I will repeat that this particular code probably just
isn't that important. The memory clearing for page allocation and
regular memcpy is where most of the real time is spent, so I don't
think that you should necessarily worry too much about this special
case.

                 Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-06-22 15:06                                                                       ` Linus Torvalds
@ 2022-06-22 20:14                                                                         ` Borislav Petkov
  2022-06-22 21:07                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-06-22 20:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, Jun 22, 2022 at 10:06:42AM -0500, Linus Torvalds wrote:
> I'm not sure how valid the TSC thing is, with the extra
> synchronization maybe interacting with the whole microcode engine
> startup/stop thing.

Very possible.

So I went and did the original microbenchmark which started people
looking into this in the first place and with it, it looks very
good:

before:

$ dd if=/dev/zero of=/dev/null bs=1024k status=progress
400823418880 bytes (401 GB, 373 GiB) copied, 17 s, 23.6 GB/s

after:

$ dd if=/dev/zero of=/dev/null bs=1024k status=progress
2696274771968 bytes (2.7 TB, 2.5 TiB) copied, 50 s, 53.9 GB/s

So that's very persuasive in my book.

> I'm also not sure the rdtsc is doing the same thing on your AMD tests
> vs your Intel tests - I suspect you end up both using 'rdtscp' (as

That is correct.

> opposed to the 'lsync' variant we also have), but I don't think the
> ordering really is all that well defined architecturally, so AMD may
> have very different serialization rules than Intel does.
>
> .. and that serialization may well be different wrt normal load/stores
> and microcode.

Well, if it is that, hw people have always been telling me to use RDTSC
to measure stuff but I will object next time.

> So those numbers look like they have a 3% difference, but I'm not 100%
> convinced it might not be due to measuring artifacts. The fact that it
> worked well for you on your AMD platform doesn't necessarily mean that
> it has to work on icelake-x.

Well, it certainly is something uarch-specific because that machine
had an eval Icelake sample before and it would show the same minute
slowdown too. I attributed it to the CPU being an eval sample but I
guess uarch-wise it didn't matter.

> But it could equally easily be that "rep stosb" really just isn't any
> better on that platform, and the numbers are just giving the plain
> reality.

Right.

> Or it could mean that it makes some cache access decision ("this is
> big enough that let's not pollute L1 caches, do stores directly to
> L2") that might be better for actual performance afterwards, but that
> makes that clearing itself that bit slower.

There's that too.

> IOW, I do think that microbenchmarks are kind of suspect to begin
> with, and the rdtsc thing in particular may work better on some
> microarchitectures than it does on others.
> 
> Very hard to make a judgment call - I think the only thing that really
> ends up mattering is the macro-benchmarks, but I think when you tried
> that it was way too noisy to actually show any real signal.

Yap, that was the reason why I went down to the microbenchmarks. But
even with the real benchmark, it would show slightly worse numbers on
ICL, which got me scratching my head as to why that is...

> That is, of course, a problem with memcpy and memset in general. It's
> easy to do microbenchmarks for them, it's just not clear whether said
> microbenchmarks give numbers that are actually meaningful, exactly
> because of things like cache replacement policy etc.
> 
> And finally, I will repeat that this particular code probably just
> isn't that important. The memory clearing for page allocation and
> regular memcpy is where most of the real time is spent, so I don't
> think that you should necessarily worry too much about this special
> case.

Yeah, I started poking at this because people would come with patches or
complain about stuff being slow but then no one would actually sit down
and do the measurements...

Oh well, anyway, I still think we should take that because that dd thing
above is pretty good-lookin' even on ICL. And we now have a good example
of how all this patching thing should work - have the insns patched
in and only replace with calls to other variants on the minority of
machines.

And the ICL slowdown is small enough and kinda hard to measure...

Thoughts?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-06-22 20:14                                                                         ` Borislav Petkov
@ 2022-06-22 21:07                                                                           ` Linus Torvalds
  2022-06-23  9:41                                                                             ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-06-22 21:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, Jun 22, 2022 at 3:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> before:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 400823418880 bytes (401 GB, 373 GiB) copied, 17 s, 23.6 GB/s
>
> after:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 2696274771968 bytes (2.7 TB, 2.5 TiB) copied, 50 s, 53.9 GB/s
>
> So that's very persuasive in my book.

Heh. Your numbers are very confusing, because apparently you just ^C'd
the thing randomly and they do different sizes (and the GB/s number is
what matters).

Might I suggest just using "count=XYZ" to make the sizes the same and
the numbers a bit more comparable? Because when I first looked at the
numbers I was like "oh, the first one finished in 17s, the second one
was three times slower!"

But yes, apparently that "rep stos" is *much* better with that /dev/zero test.

That does imply that what it does is to avoid polluting some cache
hierarchy, since your 'dd' test case doesn't actually ever *use* the
end result of the zeroing.

So yeah, memset and memcpy are just fundamentally hard to benchmark,
because what matters more than the cost of the op itself is often how
the end result interacts with the code around it.

For example, one of the things that I hope FSRM really does well is
when small copies (or memsets) are then used immediately afterwards -
does the just stored data by the microcode get nicely forwarded from
the store buffers (like it would if it was a loop of stores) or does
it mean that the store buffer is bypassed and subsequent loads will
then hit the L1 cache?

That is *not* an issue in this situation, since any clear_user() won't
be immediately loaded just a few instructions later, but it's
traditionally an issue for the "small memset/memcpy" case, where the
memset/memcpy destination is possibly accessed immediately afterwards
(either to make further modifications, or to just be read).

In a perfect world, you get all the memory forwarding logic kicking
in, which can really shortcircuit things on an OoO core and take the
memory pipeline out of the critical path, which then helps IPC.

And that's an area that legacy microcoded 'rep stosb' has not been
good at. Whether FSRM is quite there yet, I don't know.

(Somebody could test: do a 'store register to memory', then do a
'memcpy()' of that memory to another memory area, and then do a
register load from that new area - at least in _theory_ a very
aggressive microarchitecture could actually do that whole forwarding,
and make the latency from the original memory store to the final
memory load be zero cycles. I know AMD was supposedly doing that for
some of the simpler cases, and it *does* actually matter for real
world loads, because that memory indirection is often due to passing
data in structures as function arguments. So it sounds stupid to store
to memory and then immediately load it again, but it actually happens
_all_the_time_ even for smart software).
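
A minimal user-space sketch of that test - the names, array sizes and
the use of __rdtsc() here are purely illustrative assumptions, not code
from this thread - could look something like this:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <x86intrin.h>

  #define ITERS 1000000UL

  static uint64_t src[8], dst[8];

  int main(void)
  {
          uint64_t start, end;
          volatile uint64_t sink = 0;

          start = __rdtsc();
          for (unsigned long i = 0; i < ITERS; i++) {
                  src[0] = sink + i;              /* store register to memory  */
                  memcpy(dst, src, sizeof(src));  /* small memcpy of that data */
                  sink = dst[0];                  /* load from the copy target */
          }
          end = __rdtsc();

          /* rough cycles per store -> memcpy -> load chain */
          printf("%.2f cycles/iter\n", (double)(end - start) / ITERS);
          return 0;
  }

Because each iteration's store depends on the previous iteration's load
(via sink), the number roughly tracks the store-to-load latency through
the copy rather than raw memcpy throughput, which seems to be the
interesting part here.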

            Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86/clear_user: Make it faster
  2022-06-22 21:07                                                                           ` Linus Torvalds
@ 2022-06-23  9:41                                                                             ` Borislav Petkov
  2022-07-05 17:01                                                                               ` [PATCH -final] " Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-06-23  9:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

On Wed, Jun 22, 2022 at 04:07:19PM -0500, Linus Torvalds wrote:
> Might I suggest just using "count=XYZ" to make the sizes the same and
> the numbers a bit more comparable? Because when I first looked at the
> numbers I was like "oh, the first one finished in 17s, the second one
> was three times slower!"

Yah, I got confused too but then I looked at the rate...

But it looks like even this microbenchmark is hm, well, showing that
there's more than meets the eye. Look at the rates:

for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s

Now the same thing with smaller buffers:

for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied 
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s

So it is all down to alignment, microcode "activity" and other
planets-alignment things of the uarch.

It is not getting even close to 50GB/s with larger block sizes - 4M in
this case:

dd if=/dev/zero of=/dev/null bs=4M status=progress count=65536
249334595584 bytes (249 GB, 232 GiB) copied, 10 s, 24.9 GB/s
65536+0 records in
65536+0 records out
274877906944 bytes (275 GB, 256 GiB) copied, 10.9976 s, 25.0 GB/s

so it is all a bit: "yes, we can go faster, but <do all those
requirements first>" :-)
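
As a cross-check sketch (buffer size, alignment and loop count below are
illustrative choices meant to mirror the dd runs above), the same
/dev/zero read path can be timed from a tiny C program, which makes the
buffer size and alignment explicit instead of leaving them to dd:

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>

  #define BUF_SZ  (1UL << 20)     /* 1M, like bs=1M above    */
  #define LOOPS   (64UL * 1024)   /* like count=65536 above  */

  int main(void)
  {
          struct timespec t0, t1;
          void *buf = aligned_alloc(4096, BUF_SZ);
          int fd = open("/dev/zero", O_RDONLY);

          if (fd < 0 || !buf)
                  return 1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (unsigned long i = 0; i < LOOPS; i++)
                  if (read(fd, buf, BUF_SZ) != (ssize_t)BUF_SZ)
                          return 1;
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("%.1f GB/s\n", LOOPS * BUF_SZ / s / 1e9);

          free(buf);
          close(fd);
          return 0;
  }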

> But yes, apparently that "rep stos" is *much* better with that /dev/zero test.
> 
> That does imply that what it does is to avoid polluting some cache
> hierarchy, since your 'dd' test case doesn't actually ever *use* the
> end result of the zeroing.
> 
> So yeah, memset and memcpy are just fundamentally hard to benchmark,
> because what matters more than the cost of the op itself is often how
> the end result interacts with the code around it.

Yap, and this discarding of the end result is silly but well...

> For example, one of the things that I hope FSRM really does well is
> when small copies (or memsets) are then used immediately afterwards -
> does the just stored data by the microcode get nicely forwarded from
> the store buffers (like it would if it was a loop of stores) or does
> it mean that the store buffer is bypassed and subsequent loads will
> then hit the L1 cache?
> 
> That is *not* an issue in this situation, since any clear_user() won't
> be immediately loaded just a few instructions later, but it's
> traditionally an issue for the "small memset/memcpy" case, where the
> memset/memcpy destination is possibly accessed immediately afterwards
> (either to make further modifications, or to just be read).

Right.

> In a perfect world, you get all the memory forwarding logic kicking
> in, which can really shortcircuit things on an OoO core and take the
> memory pipeline out of the critical path, which then helps IPC.
> 
> And that's an area that legacy microcoded 'rep stosb' has not been
> good at. Whether FSRM is quite there yet, I don't know.

Right.

> (Somebody could test: do a 'store register to memory', then do a
> 'memcpy()' of that memory to another memory area, and then do a
> register load from that new area - at least in _theory_ a very
> aggressive microarchitecture could actually do that whole forwarding,
> and make the latency from the original memory store to the final
> memory load be zero cycles.

Ha, that would be an interesting exercise.

Hmm, but how would the hardware recognize it is the same data it
has in the cache at that new virtual address?

I presume it needs some smart tracking of cachelines. But smart tracking
costs, so it needs to be something that happens a lot in all the insn
traces hw guys look at when thinking of new "shortcuts" to raise IPC. :)

> I know AMD was supposedly doing that for some of the simpler cases,

Yap, the simpler cases are probably easy to track and I guess that's
where the hw does the forwarding properly, while for the more complex
ones it simply does the whole run-around, at least to a lower-level
cache if not to mem.

> and it *does* actually matter for real world loads, because that
> memory indirection is often due to passing data in structures as
> function arguments. So it sounds stupid to store to memory and then
> immediately load it again, but it actually happens _all_the_time_ even
> for smart software).

Right.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH -final] x86/clear_user: Make it faster
  2022-06-23  9:41                                                                             ` Borislav Petkov
@ 2022-07-05 17:01                                                                               ` Borislav Petkov
  2022-07-06  9:24                                                                                 ` Alexey Dobriyan
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-07-05 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
	Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
	patches, Linux-MM, mm-commits, Mel Gorman

Ok,

here's the final version with the Intel findings added.

I think I'm going to queue it - unless someone handwaves really loudly
- because it is a net code simplification, shows a good example of how
to inline insns instead of function calls - something which can be
used for converting other primitives later - and the so-called perf
impact on Intel is only very minute because you can only see it in
a microbenchmark. Oh, and it'll make CPU guys speed up their FSRM
microcode. :-P

Thx to all for the very helpful hints and ideas!

---
From: Borislav Petkov <bp@suse.de>
Date: Tue, 24 May 2022 11:01:18 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.

The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there - if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is taking place.

The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.

So for a ~6m run of the git test suite, the function gets called under
700K times, all from padzero():

  <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
  <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
  <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
  <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
   ...

which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.

So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:

  start = rdtsc_ordered();
  ret = __clear_user(to, n);
  end = rdtsc_ordered();

Computing the mean average of all the samples collected during the test
suite run then shows some improvement:

  clear_user_original:
  Amean: 9219.71 (Sum: 6340154910, samples: 687674)

  fsrm:
  Amean: 8030.63 (Sum: 5522277720, samples: 687652)

That's on Zen3.

The situation looks a lot more confusing on Intel:

Icelake:

  clear_user_original:
  Amean: 19679.4 (Sum: 13652560764, samples: 693750)
  Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

  ERMS:
  Amean: 20374.3 (Sum: 13910601024, samples: 682752)
  Amean: 20453.7 (Sum: 14186223606, samples: 693576)

  FSRM:
  Amean: 20458.2 (Sum: 13918381386, samples: 680331)

The original microbenchmark which people were complaining about:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
  32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
  37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
  62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
  60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
  53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
  55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
  55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
  54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
  50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
  58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s

Now the same thing with smaller buffers:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s

is also not conclusive because it all depends on the buffer sizes,
their alignments and when the microcode detects that cachelines can be
aggregated properly and copied in bigger sizes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
 arch/x86/include/asm/uaccess.h    |   5 +-
 arch/x86/include/asm/uaccess_64.h |  45 ++++++++++
 arch/x86/lib/clear_page_64.S      | 138 ++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c        |  40 ---------
 tools/objtool/check.c             |   3 +
 5 files changed, 188 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..c46207946e05 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -502,9 +502,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
 
 extern __must_check long strnlen_user(const char __user *str, long n);
 
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -526,6 +523,8 @@ extern struct movsl_mask {
 #define ARCH_HAS_NOCACHE_UACCESS 1
 
 #ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 # include <asm/uaccess_32.h>
 #else
 # include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..4ffefb4bb844 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+	might_fault();
+	stac();
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+	asm volatile(
+		"1:\n\t"
+		ALTERNATIVE_3("rep stosb",
+			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
+			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+		"2:\n"
+	       _ASM_EXTABLE_UA(1b, 2b)
+	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+	       : "a" (0)
+		/* rep_good clobbers %rdx */
+	       : "rdx");
+
+	clac();
+
+	return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+	if (access_ok(to, n))
+		return __clear_user(to, n);
+	return n;
+}
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..ecbfb4dd3b01 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +51,140 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+	/*
+	 * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+	 * i.e., no need for a 'q' suffix and thus a REX prefix.
+	 */
+	mov %ecx,%eax
+	shr $3,%rcx
+	jz .Lrest_bytes
+
+	# do the qwords first
+	.p2align 4
+.Lqwords:
+	movq $0,(%rdi)
+	lea 8(%rdi),%rdi
+	dec %rcx
+	jnz .Lqwords
+
+.Lrest_bytes:
+	and $7,  %eax
+	jz .Lexit
+
+	# now do the rest bytes
+.Lbytes:
+	movb $0,(%rdi)
+	inc %rdi
+	dec %eax
+	jnz .Lbytes
+
+.Lexit:
+	/*
+	 * %rax still needs to be cleared in the exception case because this function is called
+	 * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+	 * in case it might reuse it somewhere.
+	 */
+        xor %eax,%eax
+        RET
+
+.Lqwords_exception:
+        # convert remaining qwords back into bytes to return to caller
+        shl $3, %rcx
+        and $7, %eax
+        add %rax,%rcx
+        jmp .Lexit
+
+.Lbytes_exception:
+        mov %eax,%ecx
+        jmp .Lexit
+
+        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+        _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	jb clear_user_original
+
+.Lprep:
+	# copy lower 32-bits for rest bytes
+	mov %ecx, %edx
+	shr $3, %rcx
+	jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+	rep stosq
+
+.Lrep_good_rest_bytes:
+	and $7, %edx
+	jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+	mov %edx, %ecx
+	rep stosb
+
+.Lrep_good_exit:
+	# see .Lexit comment above
+	xor %eax, %eax
+	RET
+
+.Lrep_good_qwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %edx
+	add %rdx, %rcx
+	jmp .Lrep_good_exit
+
+	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	jb clear_user_original
+
+.Lerms_bytes:
+	rep stosb
+
+.Lerms_exit:
+	xorl %eax,%eax
+	RET
+
+	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
-	if (access_ok(to, n))
-		return __clear_user(to, n);
-	return n;
-}
-EXPORT_SYMBOL(clear_user);
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 864bb9dd3584..98598f3c74c3 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1024,6 +1024,9 @@ static const char *uaccess_safe_builtin[] = {
 	"copy_mc_fragile_handle_tail",
 	"copy_mc_enhanced_fast_string",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+	"clear_user_erms",
+	"clear_user_rep_good",
+	"clear_user_original",
 	NULL
 };
 
-- 
2.35.1



-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH -final] x86/clear_user: Make it faster
  2022-07-05 17:01                                                                               ` [PATCH -final] " Borislav Petkov
@ 2022-07-06  9:24                                                                                 ` Alexey Dobriyan
  2022-07-11 10:33                                                                                   ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Alexey Dobriyan @ 2022-07-06  9:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Mark Hemment, Andrew Morton,
	the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
	mm-commits, Mel Gorman

On Tue, Jul 05, 2022 at 07:01:06PM +0200, Borislav Petkov wrote:

> +	asm volatile(
> +		"1:\n\t"
> +		ALTERNATIVE_3("rep stosb",
> +			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
> +			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
> +			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
> +		"2:\n"
> +	       _ASM_EXTABLE_UA(1b, 2b)
> +	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
> +	       : "a" (0)
> +		/* rep_good clobbers %rdx */
> +	       : "rdx");

"+c" and "+D" should be enough for 1 instruction assembly?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH -final] x86/clear_user: Make it faster
  2022-07-06  9:24                                                                                 ` Alexey Dobriyan
@ 2022-07-11 10:33                                                                                   ` Borislav Petkov
  2022-07-12 12:32                                                                                     ` Alexey Dobriyan
  0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-07-11 10:33 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, Linus Torvalds, Mark Hemment, Andrew Morton,
	the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
	mm-commits, Mel Gorman

On Wed, Jul 06, 2022 at 12:24:12PM +0300, Alexey Dobriyan wrote:
> On Tue, Jul 05, 2022 at 07:01:06PM +0200, Borislav Petkov wrote:
> 
> > +	asm volatile(
> > +		"1:\n\t"
> > +		ALTERNATIVE_3("rep stosb",
> > +			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
> > +			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
> > +			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
> > +		"2:\n"
> > +	       _ASM_EXTABLE_UA(1b, 2b)
> > +	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
> > +	       : "a" (0)
> > +		/* rep_good clobbers %rdx */
> > +	       : "rdx");
> 
> "+c" and "+D" should be enough for 1 instruction assembly?

I'm looking at

  e0a96129db57 ("x86: use early clobbers in usercopy*.c")

which introduced the early clobbers and I'm thinking we want them
because "this operand is an earlyclobber operand, which is written
before the instruction is finished using the input operands" and we have
exception handling.

But maybe you need to be more verbose as to what you mean exactly...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH -final] x86/clear_user: Make it faster
  2022-07-11 10:33                                                                                   ` Borislav Petkov
@ 2022-07-12 12:32                                                                                     ` Alexey Dobriyan
  2022-08-06 12:49                                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Alexey Dobriyan @ 2022-07-12 12:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Linus Torvalds, Mark Hemment, Andrew Morton,
	the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
	mm-commits, Mel Gorman

On Mon, Jul 11, 2022 at 12:33:20PM +0200, Borislav Petkov wrote:
> On Wed, Jul 06, 2022 at 12:24:12PM +0300, Alexey Dobriyan wrote:
> > On Tue, Jul 05, 2022 at 07:01:06PM +0200, Borislav Petkov wrote:
> > 
> > > +	asm volatile(
> > > +		"1:\n\t"
> > > +		ALTERNATIVE_3("rep stosb",
> > > +			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
> > > +			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
> > > +			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
> > > +		"2:\n"
> > > +	       _ASM_EXTABLE_UA(1b, 2b)
> > > +	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
> > > +	       : "a" (0)
> > > +		/* rep_good clobbers %rdx */
> > > +	       : "rdx");
> > 
> > "+c" and "+D" should be enough for 1 instruction assembly?
> 
> I'm looking at
> 
>   e0a96129db57 ("x86: use early clobbers in usercopy*.c")
> 
> which introduced the early clobbers and I'm thinking we want them
> because "this operand is an earlyclobber operand, which is written
> before the instruction is finished using the input operands" and we have
> exception handling.
> 
> But maybe you need to be more verbose as to what you mean exactly...

This is the original code:

-#define __do_strncpy_from_user(dst,src,count,res)                         \
-do {                                                                      \
-       long __d0, __d1, __d2;                                             \
-       might_fault();                                                     \
-       __asm__ __volatile__(                                              \
-               "       testq %1,%1\n"                                     \
-               "       jz 2f\n"                                           \
-               "0:     lodsb\n"                                           \
-               "       stosb\n"                                           \
-               "       testb %%al,%%al\n"                                 \
-               "       jz 1f\n"                                           \
-               "       decq %1\n"                                         \
-               "       jnz 0b\n"                                          \
-               "1:     subq %1,%0\n"                                      \
-               "2:\n"                                                     \
-               ".section .fixup,\"ax\"\n"                                 \
-               "3:     movq %5,%0\n"                                      \
-               "       jmp 2b\n"                                          \
-               ".previous\n"                                              \
-               _ASM_EXTABLE(0b,3b)                                        \
-               : "=&r"(res), "=&c"(count), "=&a" (__d0), "=&S" (__d1),    \
-                 "=&D" (__d2)                                             \
-               : "i"(-EFAULT), "0"(count), "1"(count), "3"(src), "4"(dst) \
-               : "memory");                                               \
-} while (0)

I meant to say that the earlyclobber is necessary only because the asm
body is more than 1 instruction, so there is a possibility of writing to
some outputs before all inputs are consumed.

If the asm body is 1 insn there is no such possibility at all.
Now "rep stosb" is 1 instruction and the two alternative functions
masquerade as a single instruction.
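
A toy example of that distinction (purely illustrative, not code from
this patch): with a multi-insn body the earlyclobber is what keeps the
output register distinct from a still-needed input, while a single-insn
body consumes its inputs and produces its output in the same instruction:

  /*
   * Multi-insn body: %0 is written by the first insn before %1 is read
   * by the second one.  Without "&" the compiler may legally assign the
   * same register to both operands, and the mov would destroy the input.
   */
  static inline unsigned long two_insns(unsigned long seed)
  {
          unsigned long dst;

          asm("mov $0, %0\n\t"
              "add %1, %0"
              : "=&r" (dst)               /* earlyclobber required */
              : "r" (seed));
          return dst;
  }

  /*
   * Single-insn body: the one instruction reads its inputs and writes
   * its output at the same time, so a plain "+r" is sufficient.
   */
  static inline unsigned long one_insn(unsigned long val)
  {
          asm("add $1, %0" : "+r" (val));
          return val;
  }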

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH -final] x86/clear_user: Make it faster
  2022-07-12 12:32                                                                                     ` Alexey Dobriyan
@ 2022-08-06 12:49                                                                                       ` Borislav Petkov
  0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-08-06 12:49 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, Linus Torvalds, Mark Hemment, Andrew Morton,
	the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
	mm-commits, Mel Gorman

On Tue, Jul 12, 2022 at 03:32:27PM +0300, Alexey Dobriyan wrote:
> I meant to say that earlyclobber is necessary only because the asm body
> is more than 1 instruction so there is possibility of writing to some
> outputs before all inputs are consumed.

Ok, thanks for clarifying. Below is a lightly tested version with the
earlyclobbers removed.

---
From 1587f56520fb9faa55ebb0cf02d69bdea0e40170 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Tue, 24 May 2022 11:01:18 +0200
Subject: [PATCH] x86/clear_user: Make it faster
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.

The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there - if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is taking place.

Drop the early clobbers from the @size and @addr operands as those are
not needed anymore since we have single instruction alternatives.

The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.

So for a ~6m run of the git test suite, the function gets called under
700K times, all from padzero():

  <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
  <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
  <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
  <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
   ...

which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.

So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:

  start = rdtsc_ordered();
  ret = __clear_user(to, n);
  end = rdtsc_ordered();

Computing the mean average of all the samples collected during the test
suite run then shows some improvement:

  clear_user_original:
  Amean: 9219.71 (Sum: 6340154910, samples: 687674)

  fsrm:
  Amean: 8030.63 (Sum: 5522277720, samples: 687652)

That's on Zen3.

The situation looks a lot more confusing on Intel:

Icelake:

  clear_user_original:
  Amean: 19679.4 (Sum: 13652560764, samples: 693750)
  Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

  ERMS:
  Amean: 20374.3 (Sum: 13910601024, samples: 682752)
  Amean: 20453.7 (Sum: 14186223606, samples: 693576)

  FSRM:
  Amean: 20458.2 (Sum: 13918381386, samples: 680331)

The original microbenchmark which people were complaining about:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
  32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
  37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
  62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
  60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
  53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
  55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
  55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
  54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
  50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
  58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s

Now the same thing with smaller buffers:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s

is also not conclusive because it all depends on the buffer sizes,
their alignments and when the microcode detects that cachelines can be
aggregated properly and copied in bigger sizes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
 arch/x86/include/asm/uaccess.h    |   5 +-
 arch/x86/include/asm/uaccess_64.h |  45 ++++++++++
 arch/x86/lib/clear_page_64.S      | 138 ++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c        |  40 ---------
 tools/objtool/check.c             |   3 +
 5 files changed, 188 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..c46207946e05 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -502,9 +502,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
 
 extern __must_check long strnlen_user(const char __user *str, long n);
 
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -526,6 +523,8 @@ extern struct movsl_mask {
 #define ARCH_HAS_NOCACHE_UACCESS 1
 
 #ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 # include <asm/uaccess_32.h>
 #else
 # include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..d13d71af5cf6 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+	might_fault();
+	stac();
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+	asm volatile(
+		"1:\n\t"
+		ALTERNATIVE_3("rep stosb",
+			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
+			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+		"2:\n"
+	       _ASM_EXTABLE_UA(1b, 2b)
+	       : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
+	       : "a" (0)
+		/* rep_good clobbers %rdx */
+	       : "rdx");
+
+	clac();
+
+	return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+	if (access_ok(to, n))
+		return __clear_user(to, n);
+	return n;
+}
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..ecbfb4dd3b01 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +51,140 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+	/*
+	 * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+	 * i.e., no need for a 'q' suffix and thus a REX prefix.
+	 */
+	mov %ecx,%eax
+	shr $3,%rcx
+	jz .Lrest_bytes
+
+	# do the qwords first
+	.p2align 4
+.Lqwords:
+	movq $0,(%rdi)
+	lea 8(%rdi),%rdi
+	dec %rcx
+	jnz .Lqwords
+
+.Lrest_bytes:
+	and $7,  %eax
+	jz .Lexit
+
+	# now do the rest bytes
+.Lbytes:
+	movb $0,(%rdi)
+	inc %rdi
+	dec %eax
+	jnz .Lbytes
+
+.Lexit:
+	/*
+	 * %rax still needs to be cleared in the exception case because this function is called
+	 * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+	 * in case it might reuse it somewhere.
+	 */
+        xor %eax,%eax
+        RET
+
+.Lqwords_exception:
+        # convert remaining qwords back into bytes to return to caller
+        shl $3, %rcx
+        and $7, %eax
+        add %rax,%rcx
+        jmp .Lexit
+
+.Lbytes_exception:
+        mov %eax,%ecx
+        jmp .Lexit
+
+        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+        _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	jb clear_user_original
+
+.Lprep:
+	# copy lower 32-bits for rest bytes
+	mov %ecx, %edx
+	shr $3, %rcx
+	jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+	rep stosq
+
+.Lrep_good_rest_bytes:
+	and $7, %edx
+	jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+	mov %edx, %ecx
+	rep stosb
+
+.Lrep_good_exit:
+	# see .Lexit comment above
+	xor %eax, %eax
+	RET
+
+.Lrep_good_qwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %edx
+	add %rdx, %rcx
+	jmp .Lrep_good_exit
+
+	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	jb clear_user_original
+
+.Lerms_bytes:
+	rep stosb
+
+.Lerms_exit:
+	xorl %eax,%eax
+	RET
+
+	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
-	if (access_ok(to, n))
-		return __clear_user(to, n);
-	return n;
-}
-EXPORT_SYMBOL(clear_user);
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0cec74da7ffe..4b2e11726f4e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1071,6 +1071,9 @@ static const char *uaccess_safe_builtin[] = {
 	"copy_mc_fragile_handle_tail",
 	"copy_mc_enhanced_fast_string",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+	"clear_user_erms",
+	"clear_user_rep_good",
+	"clear_user_original",
 	NULL
 };
 
-- 
2.35.1


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [tip: x86/cpu] x86/clear_user: Make it faster
  2022-04-15 22:10   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2022-04-16  6:36     ` Borislav Petkov
@ 2022-08-18 10:44     ` tip-bot2 for Borislav Petkov
  3 siblings, 0 replies; 82+ messages in thread
From: tip-bot2 for Borislav Petkov @ 2022-08-18 10:44 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/cpu branch of tip:

Commit-ID:     0db7058e8e23e6bbab1b4747ecabd1784c34f50b
Gitweb:        https://git.kernel.org/tip/0db7058e8e23e6bbab1b4747ecabd1784c34f50b
Author:        Borislav Petkov <bp@suse.de>
AuthorDate:    Tue, 24 May 2022 11:01:18 +02:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Thu, 18 Aug 2022 12:36:42 +02:00

x86/clear_user: Make it faster

Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.

The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there - if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is taking place.

Drop the early clobbers from the @size and @addr operands as those are
not needed anymore since we have single instruction alternatives.

The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.

So for a ~6m run of the git test suite, the function gets called under
700K times, all from padzero():

  <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
  <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
  <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
  <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
   ...

which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.

So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:

  start = rdtsc_ordered();
  ret = __clear_user(to, n);
  end = rdtsc_ordered();

Computing the mean average of all the samples collected during the test
suite run then shows some improvement:

  clear_user_original:
  Amean: 9219.71 (Sum: 6340154910, samples: 687674)

  fsrm:
  Amean: 8030.63 (Sum: 5522277720, samples: 687652)

That's on Zen3.

The situation looks a lot more confusing on Intel:

Icelake:

  clear_user_original:
  Amean: 19679.4 (Sum: 13652560764, samples: 693750)
  Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

  ERMS:
  Amean: 20374.3 (Sum: 13910601024, samples: 682752)
  Amean: 20453.7 (Sum: 14186223606, samples: 693576)

  FSRM:
  Amean: 20458.2 (Sum: 13918381386, samples: 680331)

The original microbenchmark which people were complaining about:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
  32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
  37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
  62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
  60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
  53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
  55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
  55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
  54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
  50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
  58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s

Now the same thing with smaller buffers:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s

is also not conclusive because it all depends on the buffer sizes,
their alignments and when the microcode detects that cachelines can be
aggregated properly and copied in bigger sizes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
 arch/x86/include/asm/uaccess.h    |   5 +-
 arch/x86/include/asm/uaccess_64.h |  45 +++++++++-
 arch/x86/lib/clear_page_64.S      | 138 +++++++++++++++++++++++++++++-
 arch/x86/lib/usercopy_64.c        |  40 +--------
 tools/objtool/check.c             |   3 +-
 5 files changed, 188 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593..c462079 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -502,9 +502,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
 
 extern __must_check long strnlen_user(const char __user *str, long n);
 
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -526,6 +523,8 @@ extern struct movsl_mask {
 #define ARCH_HAS_NOCACHE_UACCESS 1
 
 #ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 # include <asm/uaccess_32.h>
 #else
 # include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e0..d13d71a 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+	might_fault();
+	stac();
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+	asm volatile(
+		"1:\n\t"
+		ALTERNATIVE_3("rep stosb",
+			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
+			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+		"2:\n"
+	       _ASM_EXTABLE_UA(1b, 2b)
+	       : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
+	       : "a" (0)
+		/* rep_good clobbers %rdx */
+	       : "rdx");
+
+	clac();
+
+	return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+	if (access_ok(to, n))
+		return __clear_user(to, n);
+	return n;
+}
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8a..ecbfb4d 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +51,140 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+	/*
+	 * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+	 * i.e., no need for a 'q' suffix and thus a REX prefix.
+	 */
+	mov %ecx,%eax
+	shr $3,%rcx
+	jz .Lrest_bytes
+
+	# do the qwords first
+	.p2align 4
+.Lqwords:
+	movq $0,(%rdi)
+	lea 8(%rdi),%rdi
+	dec %rcx
+	jnz .Lqwords
+
+.Lrest_bytes:
+	and $7,  %eax
+	jz .Lexit
+
+	# now do the rest bytes
+.Lbytes:
+	movb $0,(%rdi)
+	inc %rdi
+	dec %eax
+	jnz .Lbytes
+
+.Lexit:
+	/*
+	 * %rax still needs to be cleared in the exception case because this function is called
+	 * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+	 * in case it might reuse it somewhere.
+	 */
+	xor %eax,%eax
+	RET
+
+.Lqwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %eax
+	add %rax,%rcx
+	jmp .Lexit
+
+.Lbytes_exception:
+	mov %eax,%ecx
+	jmp .Lexit
+
+	_ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+	_ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	jb clear_user_original
+
+.Lprep:
+	# copy lower 32-bits for rest bytes
+	mov %ecx, %edx
+	shr $3, %rcx
+	jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+	rep stosq
+
+.Lrep_good_rest_bytes:
+	and $7, %edx
+	jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+	mov %edx, %ecx
+	rep stosb
+
+.Lrep_good_exit:
+	# see .Lexit comment above
+	xor %eax, %eax
+	RET
+
+.Lrep_good_qwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %edx
+	add %rdx, %rcx
+	jmp .Lrep_good_exit
+
+	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	jb clear_user_original
+
+.Lerms_bytes:
+	rep stosb
+
+.Lerms_exit:
+	xorl %eax,%eax
+	RET
+
+	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf8..6c1f8ac 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
-	if (access_ok(to, n))
-		return __clear_user(to, n);
-	return n;
-}
-EXPORT_SYMBOL(clear_user);
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0cec74d..4b2e117 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1071,6 +1071,9 @@ static const char *uaccess_safe_builtin[] = {
 	"copy_mc_fragile_handle_tail",
 	"copy_mc_enhanced_fast_string",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+	"clear_user_erms",
+	"clear_user_rep_good",
+	"clear_user_original",
 	NULL
 };
 

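For readers who want the control flow in one place, below is a rough,
hypothetical C model of what the ALTERNATIVE_3() dispatch in __clear_user()
plus the three assembly helpers above boil down to (in the real code %rdi
is the destination and %rcx the count).  It is a sketch only: it ignores
STAC/CLAC, the exception tables that make the helpers return the number of
uncleared bytes on a fault, and the fact that the variant is chosen by
alternatives patching at boot rather than by runtime branches:

/* Sketch only -- a C model of the dispatch, not the kernel implementation. */
#include <stdio.h>
#include <string.h>

static int has_rep_good = 1;	/* stands in for X86_FEATURE_REP_GOOD */
static int has_erms     = 1;	/* stands in for X86_FEATURE_ERMS */
static int has_fsrm     = 1;	/* stands in for X86_FEATURE_FSRM */

/* clear_user_original: qwords first, then the remaining bytes */
static unsigned long clear_original(unsigned char *dst, unsigned long len)
{
	unsigned long qwords = len / 8;
	unsigned long rest   = len & 7;

	while (qwords--) {		/* movq $0,(%rdi); lea 8(%rdi),%rdi */
		memset(dst, 0, 8);
		dst += 8;
	}
	while (rest--)			/* movb $0,(%rdi); inc %rdi */
		*dst++ = 0;

	return 0;			/* 0 == nothing left uncleared */
}

static unsigned long clear_user_model(unsigned char *dst, unsigned long len)
{
	/* the last matching ALT_NOT() in the ALTERNATIVE_3() wins */
	if (!has_rep_good)
		return clear_original(dst, len);

	/* clear_user_rep_good/clear_user_erms punt sizes below a cacheline */
	if ((!has_erms || !has_fsrm) && len < 64)
		return clear_original(dst, len);

	/*
	 * Modeled as one big memset; the real variants differ in whether
	 * they use rep stosq plus a byte tail or a single rep stosb.
	 */
	memset(dst, 0, len);
	return 0;
}

int main(void)
{
	unsigned char buf[100];

	memset(buf, 0xff, sizeof(buf));
	printf("uncleared bytes: %lu\n", clear_user_model(buf, sizeof(buf)));
	printf("buf[0]=%d, buf[99]=%d\n", buf[0], buf[99]);
	return 0;
}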

Thread overview: 82+ messages
2022-04-15  2:12 incoming Andrew Morton
2022-04-15  2:13 ` [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15 22:10   ` Linus Torvalds
2022-04-15 22:21     ` Matthew Wilcox
2022-04-15 22:41     ` Hugh Dickins
2022-04-16  6:36     ` Borislav Petkov
2022-04-16 14:07       ` Mark Hemment
2022-04-16 17:28         ` Borislav Petkov
2022-04-16 17:42           ` Linus Torvalds
2022-04-16 21:15             ` Borislav Petkov
2022-04-17 19:41               ` Borislav Petkov
2022-04-17 20:56                 ` Linus Torvalds
2022-04-18 10:15                   ` Borislav Petkov
2022-04-18 17:10                     ` Linus Torvalds
2022-04-19  9:17                       ` Borislav Petkov
2022-04-19 16:41                         ` Linus Torvalds
2022-04-19 17:48                           ` Borislav Petkov
2022-04-21 15:06                             ` Borislav Petkov
2022-04-21 16:50                               ` Linus Torvalds
2022-04-21 17:22                                 ` Linus Torvalds
2022-04-24 19:37                                   ` Borislav Petkov
2022-04-24 19:54                                     ` Linus Torvalds
2022-04-24 20:24                                       ` Linus Torvalds
2022-04-27  0:14                                       ` Borislav Petkov
2022-04-27  1:29                                         ` Linus Torvalds
2022-04-27 10:41                                           ` Borislav Petkov
2022-04-27 16:00                                             ` Linus Torvalds
2022-05-04 18:56                                               ` Borislav Petkov
2022-05-04 19:22                                                 ` Linus Torvalds
2022-05-04 20:18                                                   ` Borislav Petkov
2022-05-04 20:40                                                     ` Linus Torvalds
2022-05-04 21:01                                                       ` Borislav Petkov
2022-05-04 21:09                                                         ` Linus Torvalds
2022-05-10  9:31                                                           ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
2022-05-10 17:17                                                             ` Linus Torvalds
2022-05-10 17:28                                                             ` Linus Torvalds
2022-05-10 18:10                                                               ` Borislav Petkov
2022-05-10 18:57                                                                 ` Borislav Petkov
2022-05-24 12:32                                                                   ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
2022-05-24 16:51                                                                     ` Linus Torvalds
2022-05-24 17:30                                                                       ` Borislav Petkov
2022-05-25 12:11                                                                     ` Mark Hemment
2022-05-27 11:28                                                                       ` Borislav Petkov
2022-05-27 11:10                                                                     ` Ingo Molnar
2022-06-22 14:21                                                                     ` Borislav Petkov
2022-06-22 15:06                                                                       ` Linus Torvalds
2022-06-22 20:14                                                                         ` Borislav Petkov
2022-06-22 21:07                                                                           ` Linus Torvalds
2022-06-23  9:41                                                                             ` Borislav Petkov
2022-07-05 17:01                                                                               ` [PATCH -final] " Borislav Petkov
2022-07-06  9:24                                                                                 ` Alexey Dobriyan
2022-07-11 10:33                                                                                   ` Borislav Petkov
2022-07-12 12:32                                                                                     ` Alexey Dobriyan
2022-08-06 12:49                                                                                       ` Borislav Petkov
2022-08-18 10:44     ` [tip: x86/cpu] " tip-bot2 for Borislav Petkov
2022-04-15  2:13 ` [patch 03/14] mm/secretmem: fix panic when growing a memfd_secret Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 04/14] irq_work: use kasan_record_aux_stack_noalloc() record callstack Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 05/14] kasan: fix hw tags enablement when KUNIT tests are disabled Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 06/14] mm, kfence: support kmem_dump_obj() for KFENCE objects Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 07/14] mm, page_alloc: fix build_zonerefs_node() Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 08/14] mm: fix unexpected zeroed page mapping with zram swap Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 09/14] mm: compaction: fix compiler warning when CONFIG_COMPACTION=n Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 10/14] hugetlb: do not demote poisoned hugetlb pages Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 11/14] revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:13 ` [patch 12/14] revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" Andrew Morton
2022-04-15  2:13   ` Andrew Morton
2022-04-15  2:14 ` [patch 13/14] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore Andrew Morton
2022-04-15  2:14   ` Andrew Morton
2022-04-15  2:14 ` [patch 14/14] mm: kmemleak: take a full lowmem check in kmemleak_*_phys() Andrew Morton
2022-04-15  2:14   ` Andrew Morton
