stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash
@ 2022-03-01 20:13 Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 02/28] swiotlb: fix info leak with DMA_FROM_DEVICE Sasha Levin
                   ` (26 more replies)
  0 siblings, 27 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Sasha Levin, shuah,
	daniel, andrii, linux-kselftest, netdev, bpf

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>

[ Upstream commit a7e75016a0753c24d6c995bc02501ae35368e333 ]

Add a test that validates that timer value is not overwritten when doing
a copy_map_value call in the kernel. Without the prior fix, this test
triggers a crash.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220209070324.1093182-3-memxor@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 .../selftests/bpf/prog_tests/timer_crash.c    | 32 +++++++++++
 .../testing/selftests/bpf/progs/timer_crash.c | 54 +++++++++++++++++++
 2 files changed, 86 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_crash.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_crash.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer_crash.c b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
new file mode 100644
index 0000000000000..f74b82305da8c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include "timer_crash.skel.h"
+
+enum {
+	MODE_ARRAY,
+	MODE_HASH,
+};
+
+static void test_timer_crash_mode(int mode)
+{
+	struct timer_crash *skel;
+
+	skel = timer_crash__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "timer_crash__open_and_load"))
+		return;
+	skel->bss->pid = getpid();
+	skel->bss->crash_map = mode;
+	if (!ASSERT_OK(timer_crash__attach(skel), "timer_crash__attach"))
+		goto end;
+	usleep(1);
+end:
+	timer_crash__destroy(skel);
+}
+
+void test_timer_crash(void)
+{
+	if (test__start_subtest("array"))
+		test_timer_crash_mode(MODE_ARRAY);
+	if (test__start_subtest("hash"))
+		test_timer_crash_mode(MODE_HASH);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_crash.c b/tools/testing/selftests/bpf/progs/timer_crash.c
new file mode 100644
index 0000000000000..f8f7944e70dae
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_crash.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct map_elem {
+	struct bpf_timer timer;
+	struct bpf_spin_lock lock;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_elem);
+} amap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_elem);
+} hmap SEC(".maps");
+
+int pid = 0;
+int crash_map = 0; /* 0 for amap, 1 for hmap */
+
+SEC("fentry/do_nanosleep")
+int sys_enter(void *ctx)
+{
+	struct map_elem *e, value = {};
+	void *map = crash_map ? (void *)&hmap : (void *)&amap;
+
+	if (bpf_get_current_task_btf()->tgid != pid)
+		return 0;
+
+	*(void **)&value = (void *)0xdeadcaf3;
+
+	bpf_map_update_elem(map, &(int){0}, &value, 0);
+	/* For array map, doing bpf_map_update_elem will do a
+	 * check_and_free_timer_in_array, which will trigger the crash if timer
+	 * pointer was overwritten, for hmap we need to use bpf_timer_cancel.
+	 */
+	if (crash_map == 1) {
+		e = bpf_map_lookup_elem(map, &(int){0});
+		if (!e)
+			return 0;
+		bpf_timer_cancel(&e->timer);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 02/28] swiotlb: fix info leak with DMA_FROM_DEVICE
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 03/28] usb: dwc3: pci: add support for the Intel Raptor Lake-S Sasha Levin
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Halil Pasic, Christoph Hellwig, Sasha Levin, corbet,
	m.szyprowski, linux-doc, iommu

From: Halil Pasic <pasic@linux.ibm.com>

[ Upstream commit ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e ]

The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.

A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
   interface with: dxfer_len == 524288, dxdfer_dir == SG_DXFER_FROM_DEV
   and a corresponding dxferp. The peculiar thing about this is that TUR
   is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
   bounces the user-space buffer. As if the device was to transfer into
   it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in
   sg_build_indirect()") we make sure this first bounce buffer is
   allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
   device won't touch the buffer we prepare as if the we had a
   DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
   and the  buffer allocated by SG is mapped by the function
   virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
   scatter-gather and not scsi generics). This mapping involves bouncing
   via the swiotlb (we need swiotlb to do virtio in protected guest like
   s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
   (that is swiotlb) bounce buffer (which most likely contains some
   previous IO data), to the first bounce buffer, which contains all
   zeros.  Then we copy back the content of the first bounce buffer to
   the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
  ain't all zeros and fails.

One can argue that this is an swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in a sense that
it does not affect the outcome (if all other participants are well
behaved).

Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 Documentation/core-api/dma-attributes.rst | 8 ++++++++
 include/linux/dma-mapping.h               | 8 ++++++++
 kernel/dma/swiotlb.c                      | 3 ++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e926..17706dc91ec9f 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,11 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
 subsystem that the buffer is fully accessible at the elevated privilege
 level (and ideally inaccessible or at least read-only at the
 lesser-privileged levels).
+
+DMA_ATTR_OVERWRITE
+------------------
+
+This is a hint to the DMA-mapping subsystem that the device is expected to
+overwrite the entire mapped size, thus the caller does not require any of the
+previous buffer contents to be preserved. This allows bounce-buffering
+implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb13..6150d11a607e1 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,14 @@
  */
 #define DMA_ATTR_PRIVILEGED		(1UL << 9)
 
+/*
+ * This is a hint to the DMA-mapping subsystem that the device is expected
+ * to overwrite the entire mapped size, thus the caller does not require any
+ * of the previous buffer contents to be preserved. This allows
+ * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
+ */
+#define DMA_ATTR_OVERWRITE		(1UL << 10)
+
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
  * be given to a device to use as a DMA source or target.  It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c7..d958b1201092c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -582,7 +582,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
 	tlb_addr = slot_addr(mem->start, index) + offset;
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-	    (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
+	    (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
+	    dir == DMA_BIDIRECTIONAL))
 		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
 	return tlb_addr;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 03/28] usb: dwc3: pci: add support for the Intel Raptor Lake-S
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 02/28] swiotlb: fix info leak with DMA_FROM_DEVICE Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 04/28] pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID" Sasha Levin
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Heikki Krogerus, Greg Kroah-Hartman, Sasha Levin, balbi, linux-usb

From: Heikki Krogerus <heikki.krogerus@linux.intel.com>

[ Upstream commit 038438a25c45d5ac996e95a22fa9e76ff3d1f8c7 ]

This patch adds the necessary PCI ID for Intel Raptor Lake-S
devices.

Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20220214141948.18637-1-heikki.krogerus@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/usb/dwc3/dwc3-pci.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/usb/dwc3/dwc3-pci.c b/drivers/usb/dwc3/dwc3-pci.c
index 7ff8fc8f79a9b..4e69a9d829f23 100644
--- a/drivers/usb/dwc3/dwc3-pci.c
+++ b/drivers/usb/dwc3/dwc3-pci.c
@@ -43,6 +43,7 @@
 #define PCI_DEVICE_ID_INTEL_ADLP		0x51ee
 #define PCI_DEVICE_ID_INTEL_ADLM		0x54ee
 #define PCI_DEVICE_ID_INTEL_ADLS		0x7ae1
+#define PCI_DEVICE_ID_INTEL_RPLS		0x7a61
 #define PCI_DEVICE_ID_INTEL_TGL			0x9a15
 #define PCI_DEVICE_ID_AMD_MR			0x163a
 
@@ -409,6 +410,9 @@ static const struct pci_device_id dwc3_pci_id_table[] = {
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_ADLS),
 	  (kernel_ulong_t) &dwc3_pci_intel_swnode, },
 
+	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_RPLS),
+	  (kernel_ulong_t) &dwc3_pci_intel_swnode, },
+
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_TGL),
 	  (kernel_ulong_t) &dwc3_pci_intel_swnode, },
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 04/28] pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID"
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 02/28] swiotlb: fix info leak with DMA_FROM_DEVICE Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 03/28] usb: dwc3: pci: add support for the Intel Raptor Lake-S Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume Sasha Levin
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Andy Shevchenko, Sasha Levin, mika.westerberg, andy,
	linus.walleij, linux-gpio

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

[ Upstream commit 6f66db29e2415cbe8759c48584f9cae19b3c2651 ]

It appears that last minute change moved ACPI ID of Alder Lake-M
to the INTC1055, which is already in the driver.

This ID on the other hand will be used elsewhere.

This reverts commit 258435a1c8187f559549e515d2f77fa0b57bcd27.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/pinctrl/intel/pinctrl-tigerlake.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/pinctrl/intel/pinctrl-tigerlake.c b/drivers/pinctrl/intel/pinctrl-tigerlake.c
index 0bcd19597e4ad..3ddaeffc04150 100644
--- a/drivers/pinctrl/intel/pinctrl-tigerlake.c
+++ b/drivers/pinctrl/intel/pinctrl-tigerlake.c
@@ -749,7 +749,6 @@ static const struct acpi_device_id tgl_pinctrl_acpi_match[] = {
 	{ "INT34C5", (kernel_ulong_t)&tgllp_soc_data },
 	{ "INT34C6", (kernel_ulong_t)&tglh_soc_data },
 	{ "INTC1055", (kernel_ulong_t)&tgllp_soc_data },
-	{ "INTC1057", (kernel_ulong_t)&tgllp_soc_data },
 	{ }
 };
 MODULE_DEVICE_TABLE(acpi, tgl_pinctrl_acpi_match);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (2 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 04/28] pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID" Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:19   ` Paolo Bonzini
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode Sasha Levin
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Wanpeng Li, Paolo Bonzini, Sasha Levin, kvm

From: Wanpeng Li <wanpengli@tencent.com>

[ Upstream commit 4cb9a998b1ce25fad74a82f5a5c45a4ef40de337 ]

I saw the below splatting after the host suspended and resumed.

   WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
   CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
   RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
   Call Trace:
    <TASK>
    syscore_resume+0x90/0x340
    suspend_devices_and_enter+0xaee/0xe90
    pm_suspend.cold+0x36b/0x3c2
    state_store+0x82/0xf0
    kernfs_fop_write_iter+0x1b6/0x260
    new_sync_write+0x258/0x370
    vfs_write+0x33f/0x510
    ksys_write+0xc9/0x160
    do_syscall_64+0x3b/0xc0
    entry_SYSCALL_64_after_hwframe+0x44/0xae

lockdep_is_held() can return -1 when lockdep is disabled which triggers
this warning. Let's use lockdep_assert_not_held() which can detect
incorrect calls while holding a lock and it also avoids false negatives
when lockdep is disabled.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 virt/kvm/kvm_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 71ddc7a8bc302..6ae9e04d0585e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5347,9 +5347,7 @@ static int kvm_suspend(void)
 static void kvm_resume(void)
 {
 	if (kvm_usage_count) {
-#ifdef CONFIG_LOCKDEP
-		WARN_ON(lockdep_is_held(&kvm_count_lock));
-#endif
+		lockdep_assert_not_held(&kvm_count_lock);
 		hardware_enable_nolock(NULL);
 	}
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (3 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:19   ` Paolo Bonzini
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 Sasha Levin
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Anton Romanov, Paolo Bonzini, Sasha Levin, tglx, mingo, bp,
	dave.hansen, x86, kvm

From: Anton Romanov <romanton@google.com>

[ Upstream commit 3a55f729240a686aa8af00af436306c0cd532522 ]

If vcpu has tsc_always_catchup set each request updates pvclock data.
KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on
host's side and do hypercall inside pvclock_read_retry loop leading to
infinite loop in such situation.

v3:
    Removed warn
    Changed return code to KVM_EFAULT
v2:
    Added warn

Signed-off-by: Anton Romanov <romanton@google.com>
Message-Id: <20220216182653.506850-1-romanton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kvm/x86.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0714fa0e7ede0..18fc0367ef21a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8769,6 +8769,13 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
 	if (clock_type != KVM_CLOCK_PAIRING_WALLCLOCK)
 		return -KVM_EOPNOTSUPP;
 
+	/*
+	 * When tsc is in permanent catchup mode guests won't be able to use
+	 * pvclock_read_retry loop to get consistent view of pvclock
+	 */
+	if (vcpu->arch.tsc_always_catchup)
+		return -KVM_EOPNOTSUPP;
+
 	if (!kvm_get_walltime_and_clockread(&ts, &cycle))
 		return -KVM_EOPNOTSUPP;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (4 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:22   ` Paolo Bonzini
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 08/28] spi: rockchip: Fix error in getting num-cs property Sasha Levin
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Leonardo Bras, Paolo Bonzini, Sasha Levin, tglx, mingo, bp,
	dave.hansen, x86, chang.seok.bae, luto, kvm

From: Leonardo Bras <leobras@redhat.com>

[ Upstream commit ad856280ddea3401e1f5060ef20e6de9f6122c76 ]

During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel
swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().

When xsave feature is available, the fpu swap is done by:
- xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used
  to store the current state of the fpu registers to a buffer.
- xrstor(s) instruction, with (fpu_kernel_cfg.max_features &
  XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.

For xsave(s) the mask is used to limit what parts of the fpu regs will
be copied to the buffer. Likewise on xrstor(s), the mask is used to
limit what parts of the fpu regs will be changed.

The mask for xsave(s), the guest's fpstate->xfeatures, is defined on
kvm_arch_vcpu_create(), which (in summary) sets it to all features
supported by the cpu which are enabled on kernel config.

This means that xsave(s) will save to guest buffer all the fpu regs
contents the cpu has enabled when the guest is paused, even if they
are not used.

This would not be an issue, if xrstor(s) would also do that.

xrstor(s)'s mask for host/guest swap is basically every valid feature
contained in kernel config, except XFEATURE_MASK_PKRU.
Accordingto kernel src, it is instead switched in switch_to() and
flush_thread().

Then, the following happens with a host supporting PKRU starts a
guest that does not support it:
1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
4 - guest runs, then switch back to host,
5 - xsave(s) fpu regs to guest fpstate (buffer now have XFEATURE_MASK_PKRU)
6 - xrstor(s) host fpstate to fpu regs.
7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with
    XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)

On 5, even though the guest does not support PKRU, it does have the flag
set on guest fpstate, which is transferred to userspace via vcpu ioctl
KVM_GET_XSAVE.

This becomes a problem when the user decides on migrating the above guest
to another machine that does not support PKRU: the new host restores
guest's fpu regs to as they were before (xrstor(s)), but since the new
host don't support PKRU, a general-protection exception ocurs in xrstor(s)
and that crashes the guest.

This can be solved by making the guest's fpstate->user_xfeatures hold
a copy of guest_supported_xcr0. This way, on 7 the only flags copied to
userspace will be the ones compatible to guest requirements, and thus
there will be no issue during migration.

As a bonus, it will also fail if userspace tries to set fpu features
(with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
configuration.  Such features will never be returned by KVM_GET_XSAVE
or KVM_GET_XSAVE2.

Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures,
there is not need to set it in kvm_check_cpuid(). So, change
fpstate_realloc() so it does not touch fpstate->user_xfeatures if a
non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid()
calls it.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220217053028.96432-2-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kernel/fpu/xstate.c | 5 ++++-
 arch/x86/kvm/cpuid.c         | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed08..6ac01f9828530 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 		fpregs_restore_userregs();
 
 	newfps->xfeatures = curfps->xfeatures | xfeatures;
-	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
+	if (!guest_fpu)
+		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
 	curfps = fpu_install_fpstate(fpu, newfps);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bf18679757c70..875dce4aa2d28 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
 
+	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
+
 	kvm_update_pv_runtime(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 08/28] spi: rockchip: Fix error in getting num-cs property
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (5 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 09/28] spi: rockchip: terminate dma transmission when slave abort Sasha Levin
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Jon Lin, Mark Brown, Sasha Levin, heiko, linux-spi,
	linux-arm-kernel, linux-rockchip

From: Jon Lin <jon.lin@rock-chips.com>

[ Upstream commit 9382df0a98aad5bbcd4d634790305a1d786ad224 ]

Get num-cs u32 from dts of_node property rather than u16.

Signed-off-by: Jon Lin <jon.lin@rock-chips.com>
Link: https://lore.kernel.org/r/20220216014028.8123-2-jon.lin@rock-chips.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/spi/spi-rockchip.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/spi/spi-rockchip.c b/drivers/spi/spi-rockchip.c
index 553b6b9d02222..4f65ba3dd19c2 100644
--- a/drivers/spi/spi-rockchip.c
+++ b/drivers/spi/spi-rockchip.c
@@ -654,7 +654,7 @@ static int rockchip_spi_probe(struct platform_device *pdev)
 	struct spi_controller *ctlr;
 	struct resource *mem;
 	struct device_node *np = pdev->dev.of_node;
-	u32 rsd_nsecs;
+	u32 rsd_nsecs, num_cs;
 	bool slave_mode;
 
 	slave_mode = of_property_read_bool(np, "spi-slave");
@@ -764,8 +764,9 @@ static int rockchip_spi_probe(struct platform_device *pdev)
 		 * rk spi0 has two native cs, spi1..5 one cs only
 		 * if num-cs is missing in the dts, default to 1
 		 */
-		if (of_property_read_u16(np, "num-cs", &ctlr->num_chipselect))
-			ctlr->num_chipselect = 1;
+		if (of_property_read_u32(np, "num-cs", &num_cs))
+			num_cs = 1;
+		ctlr->num_chipselect = num_cs;
 		ctlr->use_gpio_descriptors = true;
 	}
 	ctlr->dev.of_node = pdev->dev.of_node;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 09/28] spi: rockchip: terminate dma transmission when slave abort
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (6 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 08/28] spi: rockchip: Fix error in getting num-cs property Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 10/28] drm/vc4: hdmi: Unregister codec device on unbind Sasha Levin
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Jon Lin, Mark Brown, Sasha Levin, heiko, linux-spi,
	linux-arm-kernel, linux-rockchip

From: Jon Lin <jon.lin@rock-chips.com>

[ Upstream commit 80808768e41324d2e23de89972b5406c1020e6e4 ]

After slave abort, all DMA should be stopped, or it will affect the
next transmission and maybe abort again.

Signed-off-by: Jon Lin <jon.lin@rock-chips.com>
Link: https://lore.kernel.org/r/20220216014028.8123-3-jon.lin@rock-chips.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/spi/spi-rockchip.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/spi/spi-rockchip.c b/drivers/spi/spi-rockchip.c
index 4f65ba3dd19c2..c6a1bb09be056 100644
--- a/drivers/spi/spi-rockchip.c
+++ b/drivers/spi/spi-rockchip.c
@@ -585,6 +585,12 @@ static int rockchip_spi_slave_abort(struct spi_controller *ctlr)
 {
 	struct rockchip_spi *rs = spi_controller_get_devdata(ctlr);
 
+	if (atomic_read(&rs->state) & RXDMA)
+		dmaengine_terminate_sync(ctlr->dma_rx);
+	if (atomic_read(&rs->state) & TXDMA)
+		dmaengine_terminate_sync(ctlr->dma_tx);
+	atomic_set(&rs->state, 0);
+	spi_enable_chip(rs, false);
 	rs->slave_abort = true;
 	spi_finalize_current_transfer(ctlr);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 10/28] drm/vc4: hdmi: Unregister codec device on unbind
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (7 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 09/28] spi: rockchip: terminate dma transmission when slave abort Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 11/28] of/fdt: move elfcorehdr reservation early for crash dump kernel Sasha Levin
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Maxime Ripard, Javier Martinez Canillas, Sasha Levin, emma,
	mripard, airlied, daniel, dri-devel

From: Maxime Ripard <maxime@cerno.tech>

[ Upstream commit e40945ab7c7f966d0c37b7bd7b0596497dfe228d ]

On bind we will register the HDMI codec device but we don't unregister
it on unbind, leading to a device leakage. Unregister our device at
unbind.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220127111452.222002-1-maxime@cerno.tech
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/vc4/vc4_hdmi.c | 8 ++++++++
 drivers/gpu/drm/vc4/vc4_hdmi.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.c b/drivers/gpu/drm/vc4/vc4_hdmi.c
index 24f11c07bc3c7..2f53ba54b81ac 100644
--- a/drivers/gpu/drm/vc4/vc4_hdmi.c
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.c
@@ -1522,6 +1522,7 @@ static int vc4_hdmi_audio_init(struct vc4_hdmi *vc4_hdmi)
 		dev_err(dev, "Couldn't register the HDMI codec: %ld\n", PTR_ERR(codec_pdev));
 		return PTR_ERR(codec_pdev);
 	}
+	vc4_hdmi->audio.codec_pdev = codec_pdev;
 
 	dai_link->cpus		= &vc4_hdmi->audio.cpu;
 	dai_link->codecs	= &vc4_hdmi->audio.codec;
@@ -1561,6 +1562,12 @@ static int vc4_hdmi_audio_init(struct vc4_hdmi *vc4_hdmi)
 
 }
 
+static void vc4_hdmi_audio_exit(struct vc4_hdmi *vc4_hdmi)
+{
+	platform_device_unregister(vc4_hdmi->audio.codec_pdev);
+	vc4_hdmi->audio.codec_pdev = NULL;
+}
+
 static irqreturn_t vc4_hdmi_hpd_irq_thread(int irq, void *priv)
 {
 	struct vc4_hdmi *vc4_hdmi = priv;
@@ -2299,6 +2306,7 @@ static void vc4_hdmi_unbind(struct device *dev, struct device *master,
 	kfree(vc4_hdmi->hdmi_regset.regs);
 	kfree(vc4_hdmi->hd_regset.regs);
 
+	vc4_hdmi_audio_exit(vc4_hdmi);
 	vc4_hdmi_cec_exit(vc4_hdmi);
 	vc4_hdmi_hotplug_exit(vc4_hdmi);
 	vc4_hdmi_connector_destroy(&vc4_hdmi->connector);
diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.h b/drivers/gpu/drm/vc4/vc4_hdmi.h
index 33e9f665ab8e4..c0492da736833 100644
--- a/drivers/gpu/drm/vc4/vc4_hdmi.h
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.h
@@ -113,6 +113,7 @@ struct vc4_hdmi_audio {
 	struct snd_soc_dai_link_component platform;
 	struct snd_dmaengine_dai_dma_data dma_data;
 	struct hdmi_audio_infoframe infoframe;
+	struct platform_device *codec_pdev;
 	bool streaming;
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 11/28] of/fdt: move elfcorehdr reservation early for crash dump kernel
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (8 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 10/28] drm/vc4: hdmi: Unregister codec device on unbind Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU Sasha Levin
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Nikhil Gupta, Rob Herring, Sasha Levin, robh+dt, frowand.list,
	devicetree

From: Nikhil Gupta <nikhil.gupta@nxp.com>

[ Upstream commit 132507ed04ce0c5559be04dd378fec4f3bbc00e8 ]

elfcorehdr_addr is fixed address passed to Second kernel which may be conflicted
with potential reserved memory in Second kernel,so fdt_reserve_elfcorehdr() ahead
of fdt_init_reserved_mem() can relieve this situation.

Signed-off-by: Nikhil Gupta <nikhil.gupta@nxp.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220128042321.15228-1-nikhil.gupta@nxp.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/of/fdt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 7e868e5995b7e..f66abb496ed16 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -644,8 +644,8 @@ void __init early_init_fdt_scan_reserved_mem(void)
 	}
 
 	fdt_scan_reserved_mem();
-	fdt_init_reserved_mem();
 	fdt_reserve_elfcorehdr();
+	fdt_init_reserved_mem();
 }
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (9 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 11/28] of/fdt: move elfcorehdr reservation early for crash dump kernel Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:17   ` Paolo Bonzini
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 13/28] drivers: hamradio: 6pack: fix UAF bug caused by mod_timer() Sasha Levin
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Wanpeng Li, Paolo Bonzini, Sasha Levin, tglx, mingo, bp,
	dave.hansen, x86, kvm

From: Wanpeng Li <wanpengli@tencent.com>

[ Upstream commit ec756e40e271866f951d77c5e923d8deb6002b15 ]

Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if
only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable
pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kernel/kvm.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 59abbdad7729c..ff3db164e52cb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -462,19 +462,22 @@ static bool pv_tlb_flush_supported(void)
 {
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
-		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
+		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
+		(num_possible_cpus() != 1));
 }
 
 static bool pv_ipi_supported(void)
 {
-	return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI);
+	return (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) &&
+	       (num_possible_cpus() != 1));
 }
 
 static bool pv_sched_yield_supported(void)
 {
 	return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
-	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
+	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
+	    (num_possible_cpus() != 1));
 }
 
 #define KVM_IPI_CLUSTER_SIZE	(2 * BITS_PER_LONG)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 13/28] drivers: hamradio: 6pack: fix UAF bug caused by mod_timer()
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (10 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 14/28] net-sysfs: add check for netdevice being present to speed_show Sasha Levin
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Duoming Zhou, Lin Ma, David S . Miller, Sasha Levin, ajk, kuba,
	linux-hams, netdev

From: Duoming Zhou <duoming@zju.edu.cn>

[ Upstream commit efe4186e6a1b54bf38b9e05450d43b0da1fd7739 ]

When a 6pack device is detaching, the sixpack_close() will act to cleanup
necessary resources. Although del_timer_sync() in sixpack_close()
won't return if there is an active timer, one could use mod_timer() in
sp_xmit_on_air() to wake up timer again by calling userspace syscall such
as ax25_sendmsg(), ax25_connect() and ax25_ioctl().

This unexpected waked handler, sp_xmit_on_air(), realizes nothing about
the undergoing cleanup and may still call pty_write() to use driver layer
resources that have already been released.

One of the possible race conditions is shown below:

      (USE)                      |      (FREE)
ax25_sendmsg()                   |
 ax25_queue_xmit()               |
  ...                            |
  sp_xmit()                      |
   sp_encaps()                   | sixpack_close()
    sp_xmit_on_air()             |  del_timer_sync(&sp->tx_t)
     mod_timer(&sp->tx_t,...)    |  ...
                                 |  unregister_netdev()
                                 |  ...
     (wait a while)              | tty_release()
                                 |  tty_release_struct()
                                 |   release_tty()
    sp_xmit_on_air()             |    tty_kref_put(tty_struct) //FREE
     pty_write(tty_struct) //USE |    ...

The corresponding fail log is shown below:
===============================================================
BUG: KASAN: use-after-free in __run_timers.part.0+0x170/0x470
Write of size 8 at addr ffff88800a652ab8 by task swapper/2/0
...
Call Trace:
  ...
  queue_work_on+0x3f/0x50
  pty_write+0xcd/0xe0pty_write+0xcd/0xe0
  sp_xmit_on_air+0xb2/0x1f0
  call_timer_fn+0x28/0x150
  __run_timers.part.0+0x3c2/0x470
  run_timer_softirq+0x3b/0x80
  __do_softirq+0xf1/0x380
  ...

This patch reorders the del_timer_sync() after the unregister_netdev()
to avoid UAF bugs. Because the unregister_netdev() is well synchronized,
it flushs out any pending queues, waits the refcount of net_device
decreases to zero and removes net_device from kernel. There is not any
running routines after executing unregister_netdev(). Therefore, we could
not arouse timer from userspace again.

Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/hamradio/6pack.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 8a19a06b505d1..ff2bb3d80fac8 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -668,11 +668,11 @@ static void sixpack_close(struct tty_struct *tty)
 	 */
 	netif_stop_queue(sp->dev);
 
+	unregister_netdev(sp->dev);
+
 	del_timer_sync(&sp->tx_t);
 	del_timer_sync(&sp->resync_t);
 
-	unregister_netdev(sp->dev);
-
 	/* Free all 6pack frame buffers after unreg. */
 	kfree(sp->rbuff);
 	kfree(sp->xbuff);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 14/28] net-sysfs: add check for netdevice being present to speed_show
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (11 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 13/28] drivers: hamradio: 6pack: fix UAF bug caused by mod_timer() Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 15/28] sr9700: sanity check for packet length Sasha Levin
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: suresh kumar, David S . Miller, Sasha Levin, kuba, atenart,
	edumazet, netdev

From: suresh kumar <suresh2514@gmail.com>

[ Upstream commit 4224cfd7fb6523f7a9d1c8bb91bb5df1e38eb624 ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/core/net-sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d7f9ee830d34c..9e5657f632453 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -213,7 +213,7 @@ static ssize_t speed_show(struct device *dev,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	if (netif_running(netdev)) {
+	if (netif_running(netdev) && netif_device_present(netdev)) {
 		struct ethtool_link_ksettings cmd;
 
 		if (!__ethtool_get_link_ksettings(netdev, &cmd))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 15/28] sr9700: sanity check for packet length
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (12 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 14/28] net-sysfs: add check for netdevice being present to speed_show Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 16/28] hwmon: (pmbus) Clear pmbus fault/warning bits after read Sasha Levin
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Oliver Neukum, Grant Grundler, David S . Miller, Sasha Levin,
	kuba, jgg, arnd, linux-usb, netdev

From: Oliver Neukum <oneukum@suse.com>

[ Upstream commit e9da0b56fe27206b49f39805f7dcda8a89379062 ]

A malicious device can leak heap data to user space
providing bogus frame lengths. Introduce a sanity check.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/usb/sr9700.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/sr9700.c b/drivers/net/usb/sr9700.c
index b658510cc9a42..5a53e63d33a60 100644
--- a/drivers/net/usb/sr9700.c
+++ b/drivers/net/usb/sr9700.c
@@ -413,7 +413,7 @@ static int sr9700_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 		/* ignore the CRC length */
 		len = (skb->data[1] | (skb->data[2] << 8)) - 4;
 
-		if (len > ETH_FRAME_LEN)
+		if (len > ETH_FRAME_LEN || len > skb->len)
 			return 0;
 
 		/* the last packet of current skb */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 16/28] hwmon: (pmbus) Clear pmbus fault/warning bits after read
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (13 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 15/28] sr9700: sanity check for packet length Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 17/28] nvme-tcp: send H2CData PDUs based on MAXH2CDATA Sasha Levin
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Vikash Chandola, Guenter Roeck, Sasha Levin, jdelvare, linux-hwmon

From: Vikash Chandola <vikash.chandola@linux.intel.com>

[ Upstream commit 35f165f08950a876f1b95a61d79c93678fba2fd6 ]

Almost all fault/warning bits in pmbus status registers remain set even
after fault/warning condition are removed. As per pmbus specification
these faults must be cleared by user.
Modify hwmon behavior to clear fault/warning bit after fetching data if
fault/warning bit was set. This allows to get fresh data in next read.

Signed-off-by: Vikash Chandola <vikash.chandola@linux.intel.com>
Link: https://lore.kernel.org/r/20220222131253.2426834-1-vikash.chandola@linux.intel.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/hwmon/pmbus/pmbus_core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/hwmon/pmbus/pmbus_core.c b/drivers/hwmon/pmbus/pmbus_core.c
index 776ee2237be20..ac2fbee1ba9c0 100644
--- a/drivers/hwmon/pmbus/pmbus_core.c
+++ b/drivers/hwmon/pmbus/pmbus_core.c
@@ -911,6 +911,11 @@ static int pmbus_get_boolean(struct i2c_client *client, struct pmbus_boolean *b,
 		pmbus_update_sensor_data(client, s2);
 
 	regval = status & mask;
+	if (regval) {
+		ret = pmbus_write_byte_data(client, page, reg, regval);
+		if (ret)
+			goto unlock;
+	}
 	if (s1 && s2) {
 		s64 v1, v2;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 17/28] nvme-tcp: send H2CData PDUs based on MAXH2CDATA
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (14 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 16/28] hwmon: (pmbus) Clear pmbus fault/warning bits after read Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 18/28] PCI: Mark all AMD Navi10 and Navi14 GPU ATS as broken Sasha Levin
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Varun Prakash, Sagi Grimberg, Christoph Hellwig, Sasha Levin,
	kbusch, axboe, linux-nvme

From: Varun Prakash <varun@chelsio.com>

[ Upstream commit c2700d2886a87f83f31e0a301de1d2350b52c79b ]

As per NVMe/TCP specification (revision 1.0a, section 3.6.2.3)
Maximum Host to Controller Data length (MAXH2CDATA): Specifies the
maximum number of PDU-Data bytes per H2CData PDU in bytes. This value
is a multiple of dwords and should be no less than 4,096.

Current code sets H2CData PDU data_length to r2t_length,
it does not check MAXH2CDATA value. Fix this by setting H2CData PDU
data_length to min(req->h2cdata_left, queue->maxh2cdata).

Also validate MAXH2CDATA value returned by target in ICResp PDU,
if it is not a multiple of dword or if it is less than 4096 return
-EINVAL from nvme_tcp_init_connection().

Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/tcp.c  | 63 +++++++++++++++++++++++++++++++---------
 include/linux/nvme-tcp.h |  1 +
 2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 891a36d02e7c7..65e00c64a588b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -44,6 +44,8 @@ struct nvme_tcp_request {
 	u32			data_len;
 	u32			pdu_len;
 	u32			pdu_sent;
+	u32			h2cdata_left;
+	u32			h2cdata_offset;
 	u16			ttag;
 	__le16			status;
 	struct list_head	entry;
@@ -95,6 +97,7 @@ struct nvme_tcp_queue {
 	struct nvme_tcp_request *request;
 
 	int			queue_size;
+	u32			maxh2cdata;
 	size_t			cmnd_capsule_len;
 	struct nvme_tcp_ctrl	*ctrl;
 	unsigned long		flags;
@@ -572,23 +575,26 @@ static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
 	return ret;
 }
 
-static void nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
-		struct nvme_tcp_r2t_pdu *pdu)
+static void nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req)
 {
 	struct nvme_tcp_data_pdu *data = req->pdu;
 	struct nvme_tcp_queue *queue = req->queue;
 	struct request *rq = blk_mq_rq_from_pdu(req);
+	u32 h2cdata_sent = req->pdu_len;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	u8 ddgst = nvme_tcp_ddgst_len(queue);
 
 	req->state = NVME_TCP_SEND_H2C_PDU;
 	req->offset = 0;
-	req->pdu_len = le32_to_cpu(pdu->r2t_length);
+	req->pdu_len = min(req->h2cdata_left, queue->maxh2cdata);
 	req->pdu_sent = 0;
+	req->h2cdata_left -= req->pdu_len;
+	req->h2cdata_offset += h2cdata_sent;
 
 	memset(data, 0, sizeof(*data));
 	data->hdr.type = nvme_tcp_h2c_data;
-	data->hdr.flags = NVME_TCP_F_DATA_LAST;
+	if (!req->h2cdata_left)
+		data->hdr.flags = NVME_TCP_F_DATA_LAST;
 	if (queue->hdr_digest)
 		data->hdr.flags |= NVME_TCP_F_HDGST;
 	if (queue->data_digest)
@@ -597,9 +603,9 @@ static void nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
 	data->hdr.pdo = data->hdr.hlen + hdgst;
 	data->hdr.plen =
 		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
-	data->ttag = pdu->ttag;
+	data->ttag = req->ttag;
 	data->command_id = nvme_cid(rq);
-	data->data_offset = pdu->r2t_offset;
+	data->data_offset = cpu_to_le32(req->h2cdata_offset);
 	data->data_length = cpu_to_le32(req->pdu_len);
 }
 
@@ -609,6 +615,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	struct nvme_tcp_request *req;
 	struct request *rq;
 	u32 r2t_length = le32_to_cpu(pdu->r2t_length);
+	u32 r2t_offset = le32_to_cpu(pdu->r2t_offset);
 
 	rq = nvme_find_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
@@ -633,14 +640,19 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 		return -EPROTO;
 	}
 
-	if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->data_sent)) {
+	if (unlikely(r2t_offset < req->data_sent)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"req %d unexpected r2t offset %u (expected %zu)\n",
-			rq->tag, le32_to_cpu(pdu->r2t_offset), req->data_sent);
+			rq->tag, r2t_offset, req->data_sent);
 		return -EPROTO;
 	}
 
-	nvme_tcp_setup_h2c_data_pdu(req, pdu);
+	req->pdu_len = 0;
+	req->h2cdata_left = r2t_length;
+	req->h2cdata_offset = r2t_offset;
+	req->ttag = pdu->ttag;
+
+	nvme_tcp_setup_h2c_data_pdu(req);
 	nvme_tcp_queue_request(req, false, true);
 
 	return 0;
@@ -928,6 +940,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 {
 	struct nvme_tcp_queue *queue = req->queue;
 	int req_data_len = req->data_len;
+	u32 h2cdata_left = req->h2cdata_left;
 
 	while (true) {
 		struct page *page = nvme_tcp_req_cur_page(req);
@@ -972,7 +985,10 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 				req->state = NVME_TCP_SEND_DDGST;
 				req->offset = 0;
 			} else {
-				nvme_tcp_done_send_req(queue);
+				if (h2cdata_left)
+					nvme_tcp_setup_h2c_data_pdu(req);
+				else
+					nvme_tcp_done_send_req(queue);
 			}
 			return 1;
 		}
@@ -1030,9 +1046,14 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
-	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
-			offset_in_page(pdu) + req->offset, len,
-			MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
+	if (!req->h2cdata_left)
+		ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+				offset_in_page(pdu) + req->offset, len,
+				MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
+	else
+		ret = sock_no_sendpage(queue->sock, virt_to_page(pdu),
+				offset_in_page(pdu) + req->offset, len,
+				MSG_DONTWAIT | MSG_MORE);
 	if (unlikely(ret <= 0))
 		return ret;
 
@@ -1052,6 +1073,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
 {
 	struct nvme_tcp_queue *queue = req->queue;
 	size_t offset = req->offset;
+	u32 h2cdata_left = req->h2cdata_left;
 	int ret;
 	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
 	struct kvec iov = {
@@ -1069,7 +1091,10 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
 		return ret;
 
 	if (offset + ret == NVME_TCP_DIGEST_LENGTH) {
-		nvme_tcp_done_send_req(queue);
+		if (h2cdata_left)
+			nvme_tcp_setup_h2c_data_pdu(req);
+		else
+			nvme_tcp_done_send_req(queue);
 		return 1;
 	}
 
@@ -1261,6 +1286,7 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
 	struct msghdr msg = {};
 	struct kvec iov;
 	bool ctrl_hdgst, ctrl_ddgst;
+	u32 maxh2cdata;
 	int ret;
 
 	icreq = kzalloc(sizeof(*icreq), GFP_KERNEL);
@@ -1344,6 +1370,14 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
 		goto free_icresp;
 	}
 
+	maxh2cdata = le32_to_cpu(icresp->maxdata);
+	if ((maxh2cdata % 4) || (maxh2cdata < NVME_TCP_MIN_MAXH2CDATA)) {
+		pr_err("queue %d: invalid maxh2cdata returned %u\n",
+		       nvme_tcp_queue_id(queue), maxh2cdata);
+		goto free_icresp;
+	}
+	queue->maxh2cdata = maxh2cdata;
+
 	ret = 0;
 free_icresp:
 	kfree(icresp);
@@ -2329,6 +2363,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	req->data_sent = 0;
 	req->pdu_len = 0;
 	req->pdu_sent = 0;
+	req->h2cdata_left = 0;
 	req->data_len = blk_rq_nr_phys_segments(rq) ?
 				blk_rq_payload_bytes(rq) : 0;
 	req->curr_bio = rq->bio;
diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
index 959e0bd9a913e..75470159a194d 100644
--- a/include/linux/nvme-tcp.h
+++ b/include/linux/nvme-tcp.h
@@ -12,6 +12,7 @@
 #define NVME_TCP_DISC_PORT	8009
 #define NVME_TCP_ADMIN_CCSZ	SZ_8K
 #define NVME_TCP_DIGEST_LENGTH	4
+#define NVME_TCP_MIN_MAXH2CDATA 4096
 
 enum nvme_tcp_pfv {
 	NVME_TCP_PFV_1_0 = 0x0,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 18/28] PCI: Mark all AMD Navi10 and Navi14 GPU ATS as broken
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (15 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 17/28] nvme-tcp: send H2CData PDUs based on MAXH2CDATA Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 19/28] gpio: Return EPROBE_DEFER if gc->to_irq is NULL Sasha Levin
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Alex Deucher, Bjorn Helgaas, Christian König, Guchun Chen,
	Sasha Levin, linux-pci

From: Alex Deucher <alexander.deucher@amd.com>

[ Upstream commit 3f1271b54edcc692da5a3663f2aa2a64781f9bc3 ]

There are enough VBIOS escapes without the proper workaround that some
users still hit this.  Microsoft never productized ATS on Windows so OEM
platforms that were Windows-only didn't always validate ATS.

The advantages of ATS are not worth it compared to the potential
instabilities on harvested boards.  Disable ATS on all Navi10 and Navi14
boards.

Symptoms include:

  amdgpu 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000]
  AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000]
  [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=6047, emitted seq=6049
  amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
  amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
  amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
  [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
  amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed

Related commits:

  e8946a53e2a6 ("PCI: Mark AMD Navi14 GPU ATS as broken")
  a2da5d8cc0b0 ("PCI: Mark AMD Raven iGPU ATS as broken in some platforms")
  45beb31d3afb ("PCI: Mark AMD Navi10 GPU rev 0x00 ATS as broken")
  5e89cd303e3a ("PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken")
  d28ca864c493 ("PCI: Mark AMD Stoney Radeon R7 GPU ATS as broken")
  9b44b0b09dec ("PCI: Mark AMD Stoney GPU ATS as broken")

[bhelgaas: add symptoms and related commits]
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1760
Link: https://lore.kernel.org/r/20220222160801.841643-1-alexander.deucher@amd.com
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/pci/quirks.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 20a9326907384..db864bf634a3e 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5344,11 +5344,6 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SERVERWORKS, 0x0422, quirk_no_ext_tags);
  */
 static void quirk_amd_harvest_no_ats(struct pci_dev *pdev)
 {
-	if ((pdev->device == 0x7312 && pdev->revision != 0x00) ||
-	    (pdev->device == 0x7340 && pdev->revision != 0xc5) ||
-	    (pdev->device == 0x7341 && pdev->revision != 0x00))
-		return;
-
 	if (pdev->device == 0x15d8) {
 		if (pdev->revision == 0xcf &&
 		    pdev->subsystem_vendor == 0xea50 &&
@@ -5370,10 +5365,19 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_amd_harvest_no_ats);
 /* AMD Iceland dGPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
 /* AMD Navi10 dGPU */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7310, quirk_amd_harvest_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7318, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7319, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731a, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731b, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731e, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731f, quirk_amd_harvest_no_ats);
 /* AMD Navi14 dGPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7341, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7347, quirk_amd_harvest_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x734f, quirk_amd_harvest_no_ats);
 /* AMD Raven platform iGPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
 #endif /* CONFIG_PCI_ATS */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 19/28] gpio: Return EPROBE_DEFER if gc->to_irq is NULL
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (16 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 18/28] PCI: Mark all AMD Navi10 and Navi14 GPU ATS as broken Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 20/28] drm/amdgpu: bypass tiling flag check in virtual display case (v2) Sasha Levin
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Shreeya Patel, Linus Walleij, Andy Shevchenko, kernel test robot,
	Bartosz Golaszewski, Sasha Levin, linux-gpio

From: Shreeya Patel <shreeya.patel@collabora.com>

[ Upstream commit ae42f9288846353982e2eab181fb41e7fd8bf60f ]

We are racing the registering of .to_irq when probing the
i2c driver. This results in random failure of touchscreen
devices.

Following explains the race condition better.

[gpio driver] gpio driver registers gpio chip
[gpio consumer] gpio is acquired
[gpio consumer] gpiod_to_irq() fails with -ENXIO
[gpio driver] gpio driver registers irqchip
gpiod_to_irq works at this point, but -ENXIO is fatal

We could see the following errors in dmesg logs when gc->to_irq is NULL

[2.101857] i2c_hid i2c-FTS3528:00: HID over i2c has not been provided an Int IRQ
[2.101953] i2c_hid: probe of i2c-FTS3528:00 failed with error -22

To avoid this situation, defer probing until to_irq is registered.
Returning -EPROBE_DEFER would be the first step towards avoiding
the failure of devices due to the race in registration of .to_irq.
Final solution to this issue would be to avoid using gc irq members
until they are fully initialized.

This issue has been reported many times in past and people have been
using workarounds like changing the pinctrl_amd to built-in instead
of loading it as a module or by adding a softdep for pinctrl_amd into
the config file.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209413
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Shreeya Patel <shreeya.patel@collabora.com>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpio/gpiolib.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index abfbf546d1599..7b3f7f4d1d063 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -3111,6 +3111,16 @@ int gpiod_to_irq(const struct gpio_desc *desc)
 
 		return retirq;
 	}
+#ifdef CONFIG_GPIOLIB_IRQCHIP
+	if (gc->irq.chip) {
+		/*
+		 * Avoid race condition with other code, which tries to lookup
+		 * an IRQ before the irqchip has been properly registered,
+		 * i.e. while gpiochip is still being brought up.
+		 */
+		return -EPROBE_DEFER;
+	}
+#endif
 	return -ENXIO;
 }
 EXPORT_SYMBOL_GPL(gpiod_to_irq);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 20/28] drm/amdgpu: bypass tiling flag check in virtual display case (v2)
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (17 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 19/28] gpio: Return EPROBE_DEFER if gc->to_irq is NULL Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 21/28] Revert "xen-netback: remove 'hotplug-status' once it has served its purpose" Sasha Levin
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Guchun Chen, Leslie Shi, Alex Deucher, Sasha Levin,
	christian.koenig, Xinhui.Pan, airlied, daniel, mdaenzer, contact,
	evan.quan, qingqing.zhuo, bas, markyacoub, seanpaul, amd-gfx,
	dri-devel

From: Guchun Chen <guchun.chen@amd.com>

[ Upstream commit e2b993302f40c4eb714ecf896dd9e1c5be7d4cd7 ]

vkms leverages common amdgpu framebuffer creation, and
also as it does not support FB modifier, there is no need
to check tiling flags when initing framebuffer when virtual
display is enabled.

This can fix below calltrace:

amdgpu 0000:00:08.0: GFX9+ requires FB check based on format modifier
WARNING: CPU: 0 PID: 1023 at drivers/gpu/drm/amd/amdgpu/amdgpu_display.c:1150 amdgpu_display_framebuffer_init+0x8e7/0xb40 [amdgpu]

v2: check adev->enable_virtual_display instead as vkms can be
	enabled in bare metal as well.

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index dc50c05f23fc2..5c08047adb594 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -1145,7 +1145,7 @@ int amdgpu_display_framebuffer_init(struct drm_device *dev,
 	if (ret)
 		return ret;
 
-	if (!dev->mode_config.allow_fb_modifiers) {
+	if (!dev->mode_config.allow_fb_modifiers && !adev->enable_virtual_display) {
 		drm_WARN_ONCE(dev, adev->family >= AMDGPU_FAMILY_AI,
 			      "GFX9+ requires FB check based on format modifier\n");
 		ret = check_tiling_flags_gfx6(rfb);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 21/28] Revert "xen-netback: remove 'hotplug-status' once it has served its purpose"
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (18 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 20/28] drm/amdgpu: bypass tiling flag check in virtual display case (v2) Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 22/28] Revert "xen-netback: Check for hotplug-status existence before watching" Sasha Levin
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Marek Marczykowski-Górecki, Paul Durrant, Jakub Kicinski,
	Sasha Levin, wei.liu, davem, xen-devel, netdev

From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

[ Upstream commit 0f4558ae91870692ce7f509c31c9d6ee721d8cdc ]

This reverts commit 1f2565780e9b7218cf92c7630130e82dcc0fe9c2.

The 'hotplug-status' node should not be removed as long as the vif
device remains configured. Otherwise the xen-netback would wait for
re-running the network script even if it was already called (in case of
the frontent re-connecting). But also, it _should_ be removed when the
vif device is destroyed (for example when unbinding the driver) -
otherwise hotplug script would not configure the device whenever it
re-appear.

Moving removal of the 'hotplug-status' node was a workaround for nothing
calling network script after xen-netback module is reloaded. But when
vif interface is re-created (on xen-netback unbind/bind for example),
the script should be called, regardless of who does that - currently
this case is not handled by the toolstack, and requires manual
script call. Keeping hotplug-status=connected to skip the call is wrong
and leads to not configured interface.

More discussion at
https://lore.kernel.org/xen-devel/afedd7cb-a291-e773-8b0d-4db9b291fa98@ipxe.org/T/#u

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20220222001817.2264967-1-marmarek@invisiblethingslab.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/xen-netback/xenbus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index d24b7a7993aa0..3fad58d22155b 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -256,6 +256,7 @@ static void backend_disconnect(struct backend_info *be)
 		unsigned int queue_index;
 
 		xen_unregister_watchers(vif);
+		xenbus_rm(XBT_NIL, be->dev->nodename, "hotplug-status");
 #ifdef CONFIG_DEBUG_FS
 		xenvif_debugfs_delif(vif);
 #endif /* CONFIG_DEBUG_FS */
@@ -675,7 +676,6 @@ static void hotplug_status_changed(struct xenbus_watch *watch,
 
 		/* Not interested in this watch anymore. */
 		unregister_hotplug_status_watch(be);
-		xenbus_rm(XBT_NIL, be->dev->nodename, "hotplug-status");
 	}
 	kfree(str);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 22/28] Revert "xen-netback: Check for hotplug-status existence before watching"
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (19 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 21/28] Revert "xen-netback: remove 'hotplug-status' once it has served its purpose" Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 23/28] ipv6: prevent a possible race condition with lifetimes Sasha Levin
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Marek Marczykowski-Górecki, Paul Durrant, Michael Brown,
	Jakub Kicinski, Sasha Levin, wei.liu, davem, xen-devel, netdev

From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

[ Upstream commit e8240addd0a3919e0fd7436416afe9aa6429c484 ]

This reverts commit 2afeec08ab5c86ae21952151f726bfe184f6b23d.

The reasoning in the commit was wrong - the code expected to setup the
watch even if 'hotplug-status' didn't exist. In fact, it relied on the
watch being fired the first time - to check if maybe 'hotplug-status' is
already set to 'connected'. Not registering a watch for non-existing
path (which is the case if hotplug script hasn't been executed yet),
made the backend not waiting for the hotplug script to execute. This in
turns, made the netfront think the interface is fully operational, while
in fact it was not (the vif interface on xen-netback side might not be
configured yet).

This was a workaround for 'hotplug-status' erroneously being removed.
But since that is reverted now, the workaround is not necessary either.

More discussion at
https://lore.kernel.org/xen-devel/afedd7cb-a291-e773-8b0d-4db9b291fa98@ipxe.org/T/#u

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Michael Brown <mbrown@fensystems.co.uk>
Link: https://lore.kernel.org/r/20220222001817.2264967-2-marmarek@invisiblethingslab.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/xen-netback/xenbus.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 3fad58d22155b..990360d75cb64 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -824,15 +824,11 @@ static void connect(struct backend_info *be)
 	xenvif_carrier_on(be->vif);
 
 	unregister_hotplug_status_watch(be);
-	if (xenbus_exists(XBT_NIL, dev->nodename, "hotplug-status")) {
-		err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch,
-					   NULL, hotplug_status_changed,
-					   "%s/%s", dev->nodename,
-					   "hotplug-status");
-		if (err)
-			goto err;
+	err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch, NULL,
+				   hotplug_status_changed,
+				   "%s/%s", dev->nodename, "hotplug-status");
+	if (!err)
 		be->have_hotplug_status_watch = 1;
-	}
 
 	netif_tx_wake_all_queues(be->vif->dev);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 23/28] ipv6: prevent a possible race condition with lifetimes
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (20 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 22/28] Revert "xen-netback: Check for hotplug-status existence before watching" Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 24/28] tracing: Ensure trace buffer is at least 4096 bytes large Sasha Levin
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Niels Dossche, David Ahern, Niels Dossche, Jakub Kicinski,
	Sasha Levin, davem, yoshfuji, netdev

From: Niels Dossche <dossche.niels@gmail.com>

[ Upstream commit 6c0d8833a605e195ae219b5042577ce52bf71fff ]

valid_lft, prefered_lft and tstamp are always accessed under the lock
"lock" in other places. Reading these without taking the lock may result
in inconsistencies regarding the calculation of the valid and preferred
variables since decisions are taken on these fields for those variables.

Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Link: https://lore.kernel.org/r/20220223131954.6570-1-niels.dossche@ugent.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 6652d96329a0c..ec9b8de5dd88a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4998,6 +4998,7 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
 	    nla_put_s32(skb, IFA_TARGET_NETNSID, args->netnsid))
 		goto error;
 
+	spin_lock_bh(&ifa->lock);
 	if (!((ifa->flags&IFA_F_PERMANENT) &&
 	      (ifa->prefered_lft == INFINITY_LIFE_TIME))) {
 		preferred = ifa->prefered_lft;
@@ -5019,6 +5020,7 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
 		preferred = INFINITY_LIFE_TIME;
 		valid = INFINITY_LIFE_TIME;
 	}
+	spin_unlock_bh(&ifa->lock);
 
 	if (!ipv6_addr_any(&ifa->peer_addr)) {
 		if (nla_put_in6_addr(skb, IFA_LOCAL, &ifa->addr) < 0 ||
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 24/28] tracing: Ensure trace buffer is at least 4096 bytes large
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (21 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 23/28] ipv6: prevent a possible race condition with lifetimes Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 25/28] tracing/osnoise: Make osnoise_main to sleep for microseconds Sasha Levin
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Sven Schnelle, Steven Rostedt, Sasha Levin, mingo

From: Sven Schnelle <svens@linux.ibm.com>

[ Upstream commit 7acf3a127bb7c65ff39099afd78960e77b2ca5de ]

Booting the kernel with 'trace_buf_size=1' give a warning at
boot during the ftrace selftests:

[    0.892809] Running postponed tracer tests:
[    0.892893] Testing tracer function:
[    0.901899] Callback from call_rcu_tasks_trace() invoked.
[    0.983829] Callback from call_rcu_tasks_rude() invoked.
[    1.072003] .. bad ring buffer .. corrupted trace buffer ..
[    1.091944] Callback from call_rcu_tasks() invoked.
[    1.097695] PASSED
[    1.097701] Testing dynamic ftrace: .. filter failed count=0 ..FAILED!
[    1.353474] ------------[ cut here ]------------
[    1.353478] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:1951 run_tracer_selftest+0x13c/0x1b0

Therefore enforce a minimum of 4096 bytes to make the selftest pass.

Link: https://lkml.kernel.org/r/20220214134456.1751749-1-svens@linux.ibm.com

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/trace/trace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb15059020445..9b5abe11be5c0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1472,10 +1472,12 @@ static int __init set_buf_size(char *str)
 	if (!str)
 		return 0;
 	buf_size = memparse(str, &str);
-	/* nr_entries can not be zero */
-	if (buf_size == 0)
-		return 0;
-	trace_buf_size = buf_size;
+	/*
+	 * nr_entries can not be zero and the startup
+	 * tests require some buffer space. Therefore
+	 * ensure we have at least 4096 bytes of buffer.
+	 */
+	trace_buf_size = max(4096UL, buf_size);
 	return 1;
 }
 __setup("trace_buf_size=", set_buf_size);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 25/28] tracing/osnoise: Make osnoise_main to sleep for microseconds
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (22 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 24/28] tracing: Ensure trace buffer is at least 4096 bytes large Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 26/28] tracing: Fix selftest config check for function graph start up test Sasha Levin
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Steven Rostedt, Sasha Levin

From: Daniel Bristot de Oliveira <bristot@kernel.org>

[ Upstream commit dd990352f01ee9a6c6eee152e5d11c021caccfe4 ]

osnoise's runtime and period are in the microseconds scale, but it is
currently sleeping in the millisecond's scale. This behavior roots in the
usage of hwlat as the skeleton for osnoise.

Make osnoise to sleep in the microseconds scale. Also, move the sleep to
a specialized function.

Link: https://lkml.kernel.org/r/302aa6c7bdf2d131719b22901905e9da122a11b2.1645197336.git.bristot@kernel.org

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/trace/trace_osnoise.c | 53 ++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index b58674e8644a6..58c788b0ca27c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1437,6 +1437,37 @@ static int run_osnoise(void)
 static struct cpumask osnoise_cpumask;
 static struct cpumask save_cpumask;
 
+/*
+ * osnoise_sleep - sleep until the next period
+ */
+static void osnoise_sleep(void)
+{
+	u64 interval;
+	ktime_t wake_time;
+
+	mutex_lock(&interface_lock);
+	interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
+	mutex_unlock(&interface_lock);
+
+	/*
+	 * differently from hwlat_detector, the osnoise tracer can run
+	 * without a pause because preemption is on.
+	 */
+	if (!interval) {
+		/* Let synchronize_rcu_tasks() make progress */
+		cond_resched_tasks_rcu_qs();
+		return;
+	}
+
+	wake_time = ktime_add_us(ktime_get(), interval);
+	__set_current_state(TASK_INTERRUPTIBLE);
+
+	while (schedule_hrtimeout_range(&wake_time, 0, HRTIMER_MODE_ABS)) {
+		if (kthread_should_stop())
+			break;
+	}
+}
+
 /*
  * osnoise_main - The osnoise detection kernel thread
  *
@@ -1445,30 +1476,10 @@ static struct cpumask save_cpumask;
  */
 static int osnoise_main(void *data)
 {
-	u64 interval;
 
 	while (!kthread_should_stop()) {
-
 		run_osnoise();
-
-		mutex_lock(&interface_lock);
-		interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
-		mutex_unlock(&interface_lock);
-
-		do_div(interval, USEC_PER_MSEC);
-
-		/*
-		 * differently from hwlat_detector, the osnoise tracer can run
-		 * without a pause because preemption is on.
-		 */
-		if (interval < 1) {
-			/* Let synchronize_rcu_tasks() make progress */
-			cond_resched_tasks_rcu_qs();
-			continue;
-		}
-
-		if (msleep_interruptible(interval))
-			break;
+		osnoise_sleep();
 	}
 
 	return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 26/28] tracing: Fix selftest config check for function graph start up test
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (23 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 25/28] tracing/osnoise: Make osnoise_main to sleep for microseconds Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 27/28] selftest/vm: fix map_fixed_noreplace test failure Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 28/28] selftests/memfd: clean up mapping in mfd_fail_write Sasha Levin
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Christophe Leroy, Steven Rostedt, Sasha Levin, mingo

From: Christophe Leroy <christophe.leroy@csgroup.eu>

[ Upstream commit c5229a0bd47814770c895e94fbc97ad21819abfe ]

CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is required to test
direct tramp.

Link: https://lkml.kernel.org/r/bdc7e594e13b0891c1d61bc8d56c94b1890eaed7.1640017960.git.christophe.leroy@csgroup.eu

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/trace/trace_selftest.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index afd937a46496e..abcadbe933bb7 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -784,9 +784,7 @@ static struct fgraph_ops fgraph_ops __initdata  = {
 	.retfunc		= &trace_graph_return,
 };
 
-#if defined(CONFIG_DYNAMIC_FTRACE) && \
-    defined(CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS)
-#define TEST_DIRECT_TRAMP
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 noinline __noclone static void trace_direct_tramp(void) { }
 #endif
 
@@ -849,7 +847,7 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 		goto out;
 	}
 
-#ifdef TEST_DIRECT_TRAMP
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 	tracing_reset_online_cpus(&tr->array_buffer);
 	set_graph_array(tr);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 27/28] selftest/vm: fix map_fixed_noreplace test failure
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (24 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 26/28] tracing: Fix selftest config check for function graph start up test Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 28/28] selftests/memfd: clean up mapping in mfd_fail_write Sasha Levin
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Aneesh Kumar K.V, Michael Ellerman, Jann Horn, Shuah Khan,
	Andrew Morton, Linus Torvalds, Sasha Levin, linux-mm,
	linux-kselftest

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>

[ Upstream commit f39c58008dee7ab5fc94c3f1995a21e886801df0 ]

On the latest RHEL the test fails due to executable mapped at 256MB
address

     # ./map_fixed_noreplace
    mmap() @ 0x10000000-0x10050000 p=0xffffffffffffffff result=File exists
    10000000-10010000 r-xp 00000000 fd:04 34905657                           /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
    10010000-10020000 r--p 00000000 fd:04 34905657                           /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
    10020000-10030000 rw-p 00010000 fd:04 34905657                           /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
    10029b90000-10029bc0000 rw-p 00000000 00:00 0                            [heap]
    7fffbb510000-7fffbb750000 r-xp 00000000 fd:04 24534                      /usr/lib64/libc.so.6
    7fffbb750000-7fffbb760000 r--p 00230000 fd:04 24534                      /usr/lib64/libc.so.6
    7fffbb760000-7fffbb770000 rw-p 00240000 fd:04 24534                      /usr/lib64/libc.so.6
    7fffbb780000-7fffbb7a0000 r--p 00000000 00:00 0                          [vvar]
    7fffbb7a0000-7fffbb7b0000 r-xp 00000000 00:00 0                          [vdso]
    7fffbb7b0000-7fffbb800000 r-xp 00000000 fd:04 24514                      /usr/lib64/ld64.so.2
    7fffbb800000-7fffbb810000 r--p 00040000 fd:04 24514                      /usr/lib64/ld64.so.2
    7fffbb810000-7fffbb820000 rw-p 00050000 fd:04 24514                      /usr/lib64/ld64.so.2
    7fffd93f0000-7fffd9420000 rw-p 00000000 00:00 0                          [stack]
    Error: couldn't map the space we need for the test

Fix this by finding a free address using mmap instead of hardcoding
BASE_ADDRESS.

Link: https://lkml.kernel.org/r/20220217083417.373823-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Jann Horn <jannh@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 .../selftests/vm/map_fixed_noreplace.c        | 49 ++++++++++++++-----
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/vm/map_fixed_noreplace.c b/tools/testing/selftests/vm/map_fixed_noreplace.c
index d91bde5112686..eed44322d1a63 100644
--- a/tools/testing/selftests/vm/map_fixed_noreplace.c
+++ b/tools/testing/selftests/vm/map_fixed_noreplace.c
@@ -17,9 +17,6 @@
 #define MAP_FIXED_NOREPLACE 0x100000
 #endif
 
-#define BASE_ADDRESS	(256ul * 1024 * 1024)
-
-
 static void dump_maps(void)
 {
 	char cmd[32];
@@ -28,18 +25,46 @@ static void dump_maps(void)
 	system(cmd);
 }
 
+static unsigned long find_base_addr(unsigned long size)
+{
+	void *addr;
+	unsigned long flags;
+
+	flags = MAP_PRIVATE | MAP_ANONYMOUS;
+	addr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+	if (addr == MAP_FAILED) {
+		printf("Error: couldn't map the space we need for the test\n");
+		return 0;
+	}
+
+	if (munmap(addr, size) != 0) {
+		printf("Error: couldn't map the space we need for the test\n");
+		return 0;
+	}
+	return (unsigned long)addr;
+}
+
 int main(void)
 {
+	unsigned long base_addr;
 	unsigned long flags, addr, size, page_size;
 	char *p;
 
 	page_size = sysconf(_SC_PAGE_SIZE);
 
+	//let's find a base addr that is free before we start the tests
+	size = 5 * page_size;
+	base_addr = find_base_addr(size);
+	if (!base_addr) {
+		printf("Error: couldn't map the space we need for the test\n");
+		return 1;
+	}
+
 	flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE;
 
 	// Check we can map all the areas we need below
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 5 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 
@@ -60,7 +85,7 @@ int main(void)
 	printf("unmap() successful\n");
 
 	errno = 0;
-	addr = BASE_ADDRESS + page_size;
+	addr = base_addr + page_size;
 	size = 3 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -80,7 +105,7 @@ int main(void)
 	 *     +4 |  free  | new
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 5 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -101,7 +126,7 @@ int main(void)
 	 *     +4 |  free  |
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS + (2 * page_size);
+	addr = base_addr + (2 * page_size);
 	size = page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -121,7 +146,7 @@ int main(void)
 	 *     +4 |  free  | new
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS + (3 * page_size);
+	addr = base_addr + (3 * page_size);
 	size = 2 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -141,7 +166,7 @@ int main(void)
 	 *     +4 |  free  |
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 2 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -161,7 +186,7 @@ int main(void)
 	 *     +4 |  free  |
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -181,7 +206,7 @@ int main(void)
 	 *     +4 |  free  |  new
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS + (4 * page_size);
+	addr = base_addr + (4 * page_size);
 	size = page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -192,7 +217,7 @@ int main(void)
 		return 1;
 	}
 
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 5 * page_size;
 	if (munmap((void *)addr, size) != 0) {
 		dump_maps();
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH AUTOSEL 5.16 28/28] selftests/memfd: clean up mapping in mfd_fail_write
  2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
                   ` (25 preceding siblings ...)
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 27/28] selftest/vm: fix map_fixed_noreplace test failure Sasha Levin
@ 2022-03-01 20:13 ` Sasha Levin
  26 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2022-03-01 20:13 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Mike Kravetz, Joel Fernandes, Shuah Khan, Andrew Morton,
	Linus Torvalds, Sasha Levin, gthelen, linux-kselftest

From: Mike Kravetz <mike.kravetz@oracle.com>

[ Upstream commit fda153c89af344d21df281009a9d046cf587ea0f ]

Running the memfd script ./run_hugetlbfs_test.sh will often end in error
as follows:

    memfd-hugetlb: CREATE
    memfd-hugetlb: BASIC
    memfd-hugetlb: SEAL-WRITE
    memfd-hugetlb: SEAL-FUTURE-WRITE
    memfd-hugetlb: SEAL-SHRINK
    fallocate(ALLOC) failed: No space left on device
    ./run_hugetlbfs_test.sh: line 60: 166855 Aborted                 (core dumped) ./memfd_test hugetlbfs
    opening: ./mnt/memfd
    fuse: DONE

If no hugetlb pages have been preallocated, run_hugetlbfs_test.sh will
allocate 'just enough' pages to run the test.  In the SEAL-FUTURE-WRITE
test the mfd_fail_write routine maps the file, but does not unmap.  As a
result, two hugetlb pages remain reserved for the mapping.  When the
fallocate call in the SEAL-SHRINK test attempts allocate all hugetlb
pages, it is short by the two reserved pages.

Fix by making sure to unmap in mfd_fail_write.

Link: https://lkml.kernel.org/r/20220219004340.56478-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 tools/testing/selftests/memfd/memfd_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 192a2899bae8f..94df2692e6e4a 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -455,6 +455,7 @@ static void mfd_fail_write(int fd)
 			printf("mmap()+mprotect() didn't fail as expected\n");
 			abort();
 		}
+		munmap(p, mfd_def_size);
 	}
 
 	/* verify PUNCH_HOLE fails */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU Sasha Levin
@ 2022-03-01 20:17   ` Paolo Bonzini
  0 siblings, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2022-03-01 20:17 UTC (permalink / raw)
  To: Sasha Levin, linux-kernel, stable
  Cc: Wanpeng Li, tglx, mingo, bp, dave.hansen, x86, kvm

On 3/1/22 21:13, Sasha Levin wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> [ Upstream commit ec756e40e271866f951d77c5e923d8deb6002b15 ]
> 
> Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if
> only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable
> pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.
> 
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>   arch/x86/kernel/kvm.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 59abbdad7729c..ff3db164e52cb 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -462,19 +462,22 @@ static bool pv_tlb_flush_supported(void)
>   {
>   	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
>   		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
> -		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
> +		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
> +		(num_possible_cpus() != 1));
>   }
>   
>   static bool pv_ipi_supported(void)
>   {
> -	return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI);
> +	return (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) &&
> +	       (num_possible_cpus() != 1));
>   }
>   
>   static bool pv_sched_yield_supported(void)
>   {
>   	return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
>   		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
> -	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
> +	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
> +	    (num_possible_cpus() != 1));
>   }
>   
>   #define KVM_IPI_CLUSTER_SIZE	(2 * BITS_PER_LONG)

NACK

Not really necessary.

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume Sasha Levin
@ 2022-03-01 20:19   ` Paolo Bonzini
  0 siblings, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2022-03-01 20:19 UTC (permalink / raw)
  To: Sasha Levin, linux-kernel, stable; +Cc: Wanpeng Li, kvm

On 3/1/22 21:13, Sasha Levin wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> [ Upstream commit 4cb9a998b1ce25fad74a82f5a5c45a4ef40de337 ]
> 
> I saw the below splatting after the host suspended and resumed.
> 
>     WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
>     CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
>     RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
>     Call Trace:
>      <TASK>
>      syscore_resume+0x90/0x340
>      suspend_devices_and_enter+0xaee/0xe90
>      pm_suspend.cold+0x36b/0x3c2
>      state_store+0x82/0xf0
>      kernfs_fop_write_iter+0x1b6/0x260
>      new_sync_write+0x258/0x370
>      vfs_write+0x33f/0x510
>      ksys_write+0xc9/0x160
>      do_syscall_64+0x3b/0xc0
>      entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> lockdep_is_held() can return -1 when lockdep is disabled which triggers
> this warning. Let's use lockdep_assert_not_held() which can detect
> incorrect calls while holding a lock and it also avoids false negatives
> when lockdep is disabled.
> 
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>   virt/kvm/kvm_main.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 71ddc7a8bc302..6ae9e04d0585e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5347,9 +5347,7 @@ static int kvm_suspend(void)
>   static void kvm_resume(void)
>   {
>   	if (kvm_usage_count) {
> -#ifdef CONFIG_LOCKDEP
> -		WARN_ON(lockdep_is_held(&kvm_count_lock));
> -#endif
> +		lockdep_assert_not_held(&kvm_count_lock);
>   		hardware_enable_nolock(NULL);
>   	}
>   }

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode Sasha Levin
@ 2022-03-01 20:19   ` Paolo Bonzini
  0 siblings, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2022-03-01 20:19 UTC (permalink / raw)
  To: Sasha Levin, linux-kernel, stable
  Cc: Anton Romanov, tglx, mingo, bp, dave.hansen, x86, kvm

On 3/1/22 21:13, Sasha Levin wrote:
> From: Anton Romanov <romanton@google.com>
> 
> [ Upstream commit 3a55f729240a686aa8af00af436306c0cd532522 ]
> 
> If vcpu has tsc_always_catchup set each request updates pvclock data.
> KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on
> host's side and do hypercall inside pvclock_read_retry loop leading to
> infinite loop in such situation.
> 
> v3:
>      Removed warn
>      Changed return code to KVM_EFAULT
> v2:
>      Added warn
> 
> Signed-off-by: Anton Romanov <romanton@google.com>
> Message-Id: <20220216182653.506850-1-romanton@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>   arch/x86/kvm/x86.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0714fa0e7ede0..18fc0367ef21a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8769,6 +8769,13 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
>   	if (clock_type != KVM_CLOCK_PAIRING_WALLCLOCK)
>   		return -KVM_EOPNOTSUPP;
>   
> +	/*
> +	 * When tsc is in permanent catchup mode guests won't be able to use
> +	 * pvclock_read_retry loop to get consistent view of pvclock
> +	 */
> +	if (vcpu->arch.tsc_always_catchup)
> +		return -KVM_EOPNOTSUPP;
> +
>   	if (!kvm_get_walltime_and_clockread(&ts, &cycle))
>   		return -KVM_EOPNOTSUPP;
>   

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 Sasha Levin
@ 2022-03-01 20:22   ` Paolo Bonzini
  2022-06-03 18:40     ` Peter Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2022-03-01 20:22 UTC (permalink / raw)
  To: Sasha Levin, linux-kernel, stable
  Cc: Leonardo Bras, tglx, mingo, bp, dave.hansen, x86, chang.seok.bae,
	luto, kvm

On 3/1/22 21:13, Sasha Levin wrote:
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index d28829403ed08..6ac01f9828530 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
>   		fpregs_restore_userregs();
>   
>   	newfps->xfeatures = curfps->xfeatures | xfeatures;
> -	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
> +
> +	if (!guest_fpu)
> +		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
> +
>   	newfps->xfd = curfps->xfd & ~xfeatures;
>   
>   	curfps = fpu_install_fpstate(fpu, newfps);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index bf18679757c70..875dce4aa2d28 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   	vcpu->arch.guest_supported_xcr0 =
>   		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
>   
> +	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
> +
>   	kvm_update_pv_runtime(vcpu);
>   
>   	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);

Leonardo, was this also buggy in 5.16?  (I should have asked for a Fixes 
tag...).

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-03-01 20:22   ` Paolo Bonzini
@ 2022-06-03 18:40     ` Peter Xu
  2022-06-06 16:18       ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Xu @ 2022-06-03 18:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sasha Levin, linux-kernel, stable, Leonardo Bras, tglx, mingo,
	bp, dave.hansen, x86, chang.seok.bae, luto, kvm,
	Sean Christopherson

On Tue, Mar 01, 2022 at 09:22:10PM +0100, Paolo Bonzini wrote:
> On 3/1/22 21:13, Sasha Levin wrote:
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index d28829403ed08..6ac01f9828530 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
> >   		fpregs_restore_userregs();
> >   	newfps->xfeatures = curfps->xfeatures | xfeatures;
> > -	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
> > +
> > +	if (!guest_fpu)
> > +		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
> > +
> >   	newfps->xfd = curfps->xfd & ~xfeatures;
> >   	curfps = fpu_install_fpstate(fpu, newfps);
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index bf18679757c70..875dce4aa2d28 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >   	vcpu->arch.guest_supported_xcr0 =
> >   		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
> > +	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
> > +
> >   	kvm_update_pv_runtime(vcpu);
> >   	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> 
> Leonardo, was this also buggy in 5.16?  (I should have asked for a Fixes
> tag...).

I just stumbled over this patch on some migration tests in the past few
days..

In short, I was migrating a VM from 5.15 host to 5.18 host and the guest
trigger double fault immediately after the switch-over (I think that's when
it's trying to do vmenter, a VECTOR_DF was injected), with either precopy
or postcopy.

After I upgrade 5.15 src host to 5.18 host, problem goes away. I did a
bisect on dest and surprisingly it points to this commit.

Side note: I'm using two hosts that have the same processor model, so no
case of missing features on either side - they just match.

I'm not really sure whether this is a bug or by design - do we require this
patch to be applied to all stable branches to make the guest not crash
after migration, or it is unexpected?

FWICT, this patch modifies user_xfeatures while we don't do that trick
before. It sounds reasonable to me from the 1st glance, say if the guest
didn't enable some of the fpu features so we don't need to migrate those
fpu state chunks as we're migrating things based on user_xfeatures, and it
sounds good to solve the migration issue on "has-pksu" host to "no-pksu"
host as described in the patch commit message.

However there seems to be something missing at least to me, on why it'll
fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
patch, but 0x0 if with it.

I think what it should be happening is user_xfeatures will be set on src
with 0x7 (old kernel), so we should have migrated some more chunks to dest,
but I just don't quickly understand why that's a problem there because
fundamentally when we restore the fpu status (fpu_swap_kvm_fpstate) we use
the max feature bitmask anyway, and the dest hardware should support all of
them.  I don't quickly see how that could trigger a double fault, though.

I'll continue the dig probably next week, before that, any thoughts?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-03 18:40     ` Peter Xu
@ 2022-06-06 16:18       ` Paolo Bonzini
  2022-06-06 21:27         ` Peter Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2022-06-06 16:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sasha Levin, linux-kernel, stable, Leonardo Bras, tglx, mingo,
	bp, dave.hansen, x86, chang.seok.bae, luto, kvm,
	Sean Christopherson

On 6/3/22 20:40, Peter Xu wrote:
> I'm not really sure whether this is a bug or by design - do we require this
> patch to be applied to all stable branches to make the guest not crash
> after migration, or it is unexpected?

Yes, we do, though the only reported bug was for PKRU.

> However there seems to be something missing at least to me, on why it'll
> fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> patch, but 0x0 if with it.

What CPU model are you using for the VM?  For example, if the source 
lacks this patch but the destination has it, the source will transmit 
YMM registers, but the destination will fail to set them if they are not 
available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace 
tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not 
compatible to the guest configuration.  Such features will never be 
returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-06 16:18       ` Paolo Bonzini
@ 2022-06-06 21:27         ` Peter Xu
  2022-06-07 12:54           ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Xu @ 2022-06-06 21:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sasha Levin, linux-kernel, stable, Leonardo Bras, tglx, mingo,
	bp, dave.hansen, x86, chang.seok.bae, luto, kvm,
	Sean Christopherson

On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > However there seems to be something missing at least to me, on why it'll
> > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > patch, but 0x0 if with it.
> 
> What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

I just tried two other options with: (1) -cpu host, and (2) -cpu Haswell
(the choice of Haswell was really random..), with the same 5.15->5.18
migration scenario, both of them will not trigger the same guest kernel
crash.  Only qemu64 will.

Both hosts have Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

> For example, if the source lacks this patch but the destination has it,
> the source will transmit YMM registers, but the destination will fail to
> set them if they are not available for the selected CPU model.
> 
> See the commit message: "As a bonus, it will also fail if userspace tries to
> set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> the guest configuration.  Such features will never be returned by
> KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
(probably by failing validate_user_xstate_header when checking against the
user_xfeatures on dest host). But that's probably not my case, because here
KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
the precopy migration completes (or for postcopy when the switchover is
done).

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-06 21:27         ` Peter Xu
@ 2022-06-07 12:54           ` Paolo Bonzini
  2022-06-07 15:04             ` Sean Christopherson
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2022-06-07 12:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sasha Levin, linux-kernel, stable, Leonardo Bras, tglx, mingo,
	bp, dave.hansen, x86, chang.seok.bae, luto, kvm,
	Sean Christopherson

On 6/6/22 23:27, Peter Xu wrote:
> On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
>>> However there seems to be something missing at least to me, on why it'll
>>> fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
>>> In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
>>> patch, but 0x0 if with it.
>>
>> What CPU model are you using for the VM?
> 
> I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

>> For example, if the source lacks this patch but the destination has it,
>> the source will transmit YMM registers, but the destination will fail to
>> set them if they are not available for the selected CPU model.
>>
>> See the commit message: "As a bonus, it will also fail if userspace tries to
>> set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
>> the guest configuration.  Such features will never be returned by
>> KVM_GET_XSAVE or KVM_GET_XSAVE2."
> 
> IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> (probably by failing validate_user_xstate_header when checking against the
> user_xfeatures on dest host). But that's probably not my case, because here
> KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> the precopy migration completes (or for postcopy when the switchover is
> done).

Difficult to say what's happening without seeing at least the guest code 
around the double fault (above you said "fail a migration" and I thought 
that was a different scenario than the double fault), and possibly which 
was the first exception that contributed to the double fault.

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-07 12:54           ` Paolo Bonzini
@ 2022-06-07 15:04             ` Sean Christopherson
  2022-06-07 18:17               ` Peter Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2022-06-07 15:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, Sasha Levin, linux-kernel, stable, Leonardo Bras, tglx,
	mingo, bp, dave.hansen, x86, chang.seok.bae, luto, kvm

On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> On 6/6/22 23:27, Peter Xu wrote:
> > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > However there seems to be something missing at least to me, on why it'll
> > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > patch, but 0x0 if with it.
> > > 
> > > What CPU model are you using for the VM?
> > 
> > I didn't specify it, assuming it's qemu64 with no extra parameters.
> 
> Ok, so indeed it lacks AVX and this patch can have an effect.
> 
> > > For example, if the source lacks this patch but the destination has it,
> > > the source will transmit YMM registers, but the destination will fail to
> > > set them if they are not available for the selected CPU model.
> > > 
> > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > the guest configuration.  Such features will never be returned by
> > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > 
> > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > (probably by failing validate_user_xstate_header when checking against the
> > user_xfeatures on dest host). But that's probably not my case, because here
> > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > the precopy migration completes (or for postcopy when the switchover is
> > done).
> 
> Difficult to say what's happening without seeing at least the guest code
> around the double fault (above you said "fail a migration" and I thought
> that was a different scenario than the double fault), and possibly which was
> the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on
bisecting this (if necessary?) and sending a backport to v5.15?  There's another
bug report that is more than likely hitting the same bug.

https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-07 15:04             ` Sean Christopherson
@ 2022-06-07 18:17               ` Peter Xu
  2022-06-07 18:47                 ` Sean Christopherson
  2022-06-07 18:55                 ` Peter Xu
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Xu @ 2022-06-07 18:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Sasha Levin, linux-kernel, stable, Leonardo Bras,
	tglx, mingo, bp, dave.hansen, x86, chang.seok.bae, luto, kvm

On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > On 6/6/22 23:27, Peter Xu wrote:
> > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > However there seems to be something missing at least to me, on why it'll
> > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > patch, but 0x0 if with it.
> > > > 
> > > > What CPU model are you using for the VM?
> > > 
> > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > 
> > Ok, so indeed it lacks AVX and this patch can have an effect.
> > 
> > > > For example, if the source lacks this patch but the destination has it,
> > > > the source will transmit YMM registers, but the destination will fail to
> > > > set them if they are not available for the selected CPU model.
> > > > 
> > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > the guest configuration.  Such features will never be returned by
> > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > 
> > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > (probably by failing validate_user_xstate_header when checking against the
> > > user_xfeatures on dest host). But that's probably not my case, because here
> > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > the precopy migration completes (or for postcopy when the switchover is
> > > done).
> > 
> > Difficult to say what's happening without seeing at least the guest code
> > around the double fault (above you said "fail a migration" and I thought
> > that was a different scenario than the double fault), and possibly which was
> > the first exception that contributed to the double fault.
> 
> Regardless of why the guest explodes in the way it does, is someone planning on
> bisecting this (if necessary?) and sending a backport to v5.15?  There's another
> bug report that is more than likely hitting the same bug.

What's the bisection you mentioned?  I actually did a bisection and I also
checked reverting Leo's change can also fix this issue.  Or do you mean
something else?

> 
> https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

That is kvm64, and I agree it could be the same problem since both qemu64
and kvm64 models do not have any xsave feature bit declared in cpuid 0xd,
so potentially we could be migrating some fpu states to it even with
user_xfeatures==0 on dest host.

So today I continued the investigation, and I think what's really missing
is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and
continues even if it returns -EINVAL.  IOW, I'm wondering whether we should
fail properly and start to check kvm_arch_put_registers() retcode.  But
that'll be a QEMU fix, and it'll at least not causing random faults
(e.g. double faults) in guest but we should fail the migration gracefully.

Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in
your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():

	WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));

It'll be great if you'd like to check that up.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-07 18:17               ` Peter Xu
@ 2022-06-07 18:47                 ` Sean Christopherson
  2022-06-07 21:01                   ` Peter Xu
  2022-06-07 18:55                 ` Peter Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2022-06-07 18:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Sasha Levin, linux-kernel, stable, Leonardo Bras,
	tglx, mingo, bp, dave.hansen, x86, chang.seok.bae, luto, kvm

On Tue, Jun 07, 2022, Peter Xu wrote:
> On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> > On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > > On 6/6/22 23:27, Peter Xu wrote:
> > > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > > However there seems to be something missing at least to me, on why it'll
> > > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > > patch, but 0x0 if with it.
> > > > > 
> > > > > What CPU model are you using for the VM?
> > > > 
> > > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > > 
> > > Ok, so indeed it lacks AVX and this patch can have an effect.
> > > 
> > > > > For example, if the source lacks this patch but the destination has it,
> > > > > the source will transmit YMM registers, but the destination will fail to
> > > > > set them if they are not available for the selected CPU model.
> > > > > 
> > > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > > the guest configuration.  Such features will never be returned by
> > > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > > 
> > > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > > (probably by failing validate_user_xstate_header when checking against the
> > > > user_xfeatures on dest host). But that's probably not my case, because here
> > > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > > the precopy migration completes (or for postcopy when the switchover is
> > > > done).
> > > 
> > > Difficult to say what's happening without seeing at least the guest code
> > > around the double fault (above you said "fail a migration" and I thought
> > > that was a different scenario than the double fault), and possibly which was
> > > the first exception that contributed to the double fault.
> > 
> > Regardless of why the guest explodes in the way it does, is someone planning on
> > bisecting this (if necessary?) and sending a backport to v5.15?  There's another
> > bug report that is more than likely hitting the same bug.
> 
> What's the bisection you mentioned?  I actually did a bisection and I also
> checked reverting Leo's change can also fix this issue.  Or do you mean
> something else?

Oooooh, sorry!  I got completely turned around.  You ran into a bug with the
fix.  I thought that you were hitting the same issues as Mike where migrating
between hosts with different capabilities is broken in v5.15, but works in v5.18.

> > https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net
> 
> That is kvm64, and I agree it could be the same problem since both qemu64
> and kvm64 models do not have any xsave feature bit declared in cpuid 0xd,
> so potentially we could be migrating some fpu states to it even with
> user_xfeatures==0 on dest host.
> 
> So today I continued the investigation, and I think what's really missing
> is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and
> continues even if it returns -EINVAL.  IOW, I'm wondering whether we should
> fail properly and start to check kvm_arch_put_registers() retcode.  But
> that'll be a QEMU fix, and it'll at least not causing random faults
> (e.g. double faults) in guest but we should fail the migration gracefully.
> 
> Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in
> your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
> 
> 	WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
> 
> It'll be great if you'd like to check that up.

Ugh, userspace can force KVM_MP_STATE_UNINITIALIZED via KVM_SET_MP_STATE.  Looks
like QEMU does that when emulating RESET.

Logically, a full RESET of the xAPIC seems like the right thing to do.  I think
we can get away with that without breaking ABI?  And kvm_lapic_reset() has a
related bug where it stops the HR timer but not doesn't handle the HV timer :-/

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e69b83708f05..948aba894245 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2395,7 +2395,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
                return;

        /* Stop the timer in case it's a reset to an active apic */
-       hrtimer_cancel(&apic->lapic_timer.timer);
+       cancel_apic_timer(&apic->lapic_timer.timer);

        /* The xAPIC ID is set at RESET even if the APIC was already enabled. */
        if (!init_event)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 540651cd28d7..ed2c7cb1642d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10912,6 +10912,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
             mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
                goto out;

+       if (mp_state->mp_state == KVM_MP_STATE_UNINITIALIZED)
+               kvm_lapic_reset(vcpu, false);
+
        if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
                vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
                set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-07 18:17               ` Peter Xu
  2022-06-07 18:47                 ` Sean Christopherson
@ 2022-06-07 18:55                 ` Peter Xu
  2022-06-08 20:34                   ` Leonardo Bras Soares Passos
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Xu @ 2022-06-07 18:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Sasha Levin, linux-kernel, stable, Leonardo Bras,
	tglx, mingo, bp, dave.hansen, x86, chang.seok.bae, luto, kvm

On Tue, Jun 07, 2022 at 02:17:54PM -0400, Peter Xu wrote:
> On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> > On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > > On 6/6/22 23:27, Peter Xu wrote:
> > > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > > However there seems to be something missing at least to me, on why it'll
> > > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > > patch, but 0x0 if with it.
> > > > > 
> > > > > What CPU model are you using for the VM?
> > > > 
> > > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > > 
> > > Ok, so indeed it lacks AVX and this patch can have an effect.
> > > 
> > > > > For example, if the source lacks this patch but the destination has it,
> > > > > the source will transmit YMM registers, but the destination will fail to
> > > > > set them if they are not available for the selected CPU model.
> > > > > 
> > > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > > the guest configuration.  Such features will never be returned by
> > > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > > 
> > > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > > (probably by failing validate_user_xstate_header when checking against the
> > > > user_xfeatures on dest host). But that's probably not my case, because here
> > > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > > the precopy migration completes (or for postcopy when the switchover is
> > > > done).
> > > 
> > > Difficult to say what's happening without seeing at least the guest code
> > > around the double fault (above you said "fail a migration" and I thought
> > > that was a different scenario than the double fault), and possibly which was
> > > the first exception that contributed to the double fault.
> > 
> > Regardless of why the guest explodes in the way it does, is someone planning on
> > bisecting this (if necessary?) and sending a backport to v5.15?  There's another
> > bug report that is more than likely hitting the same bug.
> 
> What's the bisection you mentioned?  I actually did a bisection and I also
> checked reverting Leo's change can also fix this issue.  Or do you mean
> something else?

Ah, I forgot to mention on the "stable tree decisions": IIUC it also means
we should apply Leo's patch to all the stable trees if possible, then
migrations between them won't trigger the misterous faults anymore,
including when migrating to the latest Linux versions.

However there's the delimma that other kernels (any kernel that does not
have Leo's patch) will start to fail migrations to the stable branches that
apply Leo's patch too..  So that's kind of a slight pity.  It's just IIUC
the stable trees are more important, because it should have a broader
audience (most Linux distros)?

> 
> > 
> > https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net
> 
> That is kvm64, and I agree it could be the same problem since both qemu64
> and kvm64 models do not have any xsave feature bit declared in cpuid 0xd,
> so potentially we could be migrating some fpu states to it even with
> user_xfeatures==0 on dest host.
> 
> So today I continued the investigation, and I think what's really missing
> is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and
> continues even if it returns -EINVAL.  IOW, I'm wondering whether we should
> fail properly and start to check kvm_arch_put_registers() retcode.  But
> that'll be a QEMU fix, and it'll at least not causing random faults
> (e.g. double faults) in guest but we should fail the migration gracefully.
> 
> Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in
> your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
> 
> 	WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
> 
> It'll be great if you'd like to check that up.
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-07 18:47                 ` Sean Christopherson
@ 2022-06-07 21:01                   ` Peter Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Xu @ 2022-06-07 21:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Sasha Levin, linux-kernel, stable, Leonardo Bras,
	tglx, mingo, bp, dave.hansen, x86, chang.seok.bae, luto, kvm

On Tue, Jun 07, 2022 at 06:47:41PM +0000, Sean Christopherson wrote:
> On Tue, Jun 07, 2022, Peter Xu wrote:
> > On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> > > On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > > > On 6/6/22 23:27, Peter Xu wrote:
> > > > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > > > However there seems to be something missing at least to me, on why it'll
> > > > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > > > patch, but 0x0 if with it.
> > > > > > 
> > > > > > What CPU model are you using for the VM?
> > > > > 
> > > > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > > > 
> > > > Ok, so indeed it lacks AVX and this patch can have an effect.
> > > > 
> > > > > > For example, if the source lacks this patch but the destination has it,
> > > > > > the source will transmit YMM registers, but the destination will fail to
> > > > > > set them if they are not available for the selected CPU model.
> > > > > > 
> > > > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > > > the guest configuration.  Such features will never be returned by
> > > > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > > > 
> > > > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > > > (probably by failing validate_user_xstate_header when checking against the
> > > > > user_xfeatures on dest host). But that's probably not my case, because here
> > > > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > > > the precopy migration completes (or for postcopy when the switchover is
> > > > > done).
> > > > 
> > > > Difficult to say what's happening without seeing at least the guest code
> > > > around the double fault (above you said "fail a migration" and I thought
> > > > that was a different scenario than the double fault), and possibly which was
> > > > the first exception that contributed to the double fault.
> > > 
> > > Regardless of why the guest explodes in the way it does, is someone planning on
> > > bisecting this (if necessary?) and sending a backport to v5.15?  There's another
> > > bug report that is more than likely hitting the same bug.
> > 
> > What's the bisection you mentioned?  I actually did a bisection and I also
> > checked reverting Leo's change can also fix this issue.  Or do you mean
> > something else?
> 
> Oooooh, sorry!  I got completely turned around.  You ran into a bug with the
> fix.  I thought that you were hitting the same issues as Mike where migrating
> between hosts with different capabilities is broken in v5.15, but works in v5.18.

Aha, no worry.

> 
> > > https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net
> > 
> > That is kvm64, and I agree it could be the same problem since both qemu64
> > and kvm64 models do not have any xsave feature bit declared in cpuid 0xd,
> > so potentially we could be migrating some fpu states to it even with
> > user_xfeatures==0 on dest host.
> > 
> > So today I continued the investigation, and I think what's really missing
> > is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and
> > continues even if it returns -EINVAL.  IOW, I'm wondering whether we should
> > fail properly and start to check kvm_arch_put_registers() retcode.  But
> > that'll be a QEMU fix, and it'll at least not causing random faults
> > (e.g. double faults) in guest but we should fail the migration gracefully.
> > 
> > Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in
> > your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
> > 
> > 	WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
> > 
> > It'll be great if you'd like to check that up.
> 
> Ugh, userspace can force KVM_MP_STATE_UNINITIALIZED via KVM_SET_MP_STATE.  Looks
> like QEMU does that when emulating RESET.
> 
> Logically, a full RESET of the xAPIC seems like the right thing to do.  I think
> we can get away with that without breaking ABI?  And kvm_lapic_reset() has a
> related bug where it stops the HR timer but not doesn't handle the HV timer :-/
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e69b83708f05..948aba894245 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2395,7 +2395,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
>                 return;
> 
>         /* Stop the timer in case it's a reset to an active apic */
> -       hrtimer_cancel(&apic->lapic_timer.timer);
> +       cancel_apic_timer(&apic->lapic_timer.timer);

Needs to be:

  +       cancel_apic_timer(apic);

> 
>         /* The xAPIC ID is set at RESET even if the APIC was already enabled. */
>         if (!init_event)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 540651cd28d7..ed2c7cb1642d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10912,6 +10912,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>              mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
>                 goto out;
> 
> +       if (mp_state->mp_state == KVM_MP_STATE_UNINITIALIZED)
> +               kvm_lapic_reset(vcpu, false);
> +
>         if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
>                 vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>                 set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);
> 

The change looks reasonable, but sadly I did a quick run and it still
triggers.. :-/ So there seems to be something else missing.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-07 18:55                 ` Peter Xu
@ 2022-06-08 20:34                   ` Leonardo Bras Soares Passos
  2022-06-08 20:53                     ` Peter Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-06-08 20:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Paolo Bonzini, Sasha Levin, linux-kernel,
	stable, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Chang S. Bae, Andy Lutomirski, kvm

Hello Peter,

On Tue, Jun 7, 2022 at 5:07 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jun 07, 2022 at 02:17:54PM -0400, Peter Xu wrote:
> > On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> > > On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > > > On 6/6/22 23:27, Peter Xu wrote:
> > > > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > > > However there seems to be something missing at least to me, on why it'll
> > > > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > > > patch, but 0x0 if with it.
> > > > > >
> > > > > > What CPU model are you using for the VM?
> > > > >
> > > > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > > >
> > > > Ok, so indeed it lacks AVX and this patch can have an effect.
> > > >
> > > > > > For example, if the source lacks this patch but the destination has it,
> > > > > > the source will transmit YMM registers, but the destination will fail to
> > > > > > set them if they are not available for the selected CPU model.
> > > > > >
> > > > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > > > the guest configuration.  Such features will never be returned by
> > > > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > > >
> > > > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > > > (probably by failing validate_user_xstate_header when checking against the
> > > > > user_xfeatures on dest host). But that's probably not my case, because here
> > > > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > > > the precopy migration completes (or for postcopy when the switchover is
> > > > > done).
> > > >
> > > > Difficult to say what's happening without seeing at least the guest code
> > > > around the double fault (above you said "fail a migration" and I thought
> > > > that was a different scenario than the double fault), and possibly which was
> > > > the first exception that contributed to the double fault.
> > >
> > > Regardless of why the guest explodes in the way it does, is someone planning on
> > > bisecting this (if necessary?) and sending a backport to v5.15?  There's another
> > > bug report that is more than likely hitting the same bug.
> >
> > What's the bisection you mentioned?  I actually did a bisection and I also
> > checked reverting Leo's change can also fix this issue.  Or do you mean
> > something else?
>
> Ah, I forgot to mention on the "stable tree decisions": IIUC it also means
> we should apply Leo's patch to all the stable trees if possible, then
> migrations between them won't trigger the misterous faults anymore,
> including when migrating to the latest Linux versions.
>
> However there's the delimma that other kernels (any kernel that does not
> have Leo's patch) will start to fail migrations to the stable branches that
> apply Leo's patch too..

IIUC, you commented before that the migration issue should be solved with a
QEMU fix, is that correct? That would mean something like 'QEMU is relying on a
kernel bug to work', and should be no blocker for fixing the kernel.

If that's the case, I think we should apply the fix to every supported
stable branch that
have the fpku issue, and in parallel come with a qemu fix for that.

What do you think about it?

Best regards,
Leo

> So that's kind of a slight pity.  It's just IIUC
> the stable trees are more important, because it should have a broader
> audience (most Linux distros)?
>
> >
> > >
> > > https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net
> >
> > That is kvm64, and I agree it could be the same problem since both qemu64
> > and kvm64 models do not have any xsave feature bit declared in cpuid 0xd,
> > so potentially we could be migrating some fpu states to it even with
> > user_xfeatures==0 on dest host.
> >
> > So today I continued the investigation, and I think what's really missing
> > is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and
> > continues even if it returns -EINVAL.  IOW, I'm wondering whether we should
> > fail properly and start to check kvm_arch_put_registers() retcode.  But
> > that'll be a QEMU fix, and it'll at least not causing random faults
> > (e.g. double faults) in guest but we should fail the migration gracefully.
> >
> > Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in
> > your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
> >
> >       WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
> >
> > It'll be great if you'd like to check that up.
> >
> > Thanks,
> >
> > --
> > Peter Xu
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  2022-06-08 20:34                   ` Leonardo Bras Soares Passos
@ 2022-06-08 20:53                     ` Peter Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Xu @ 2022-06-08 20:53 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Sean Christopherson, Paolo Bonzini, Sasha Levin, linux-kernel,
	stable, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Chang S. Bae, Andy Lutomirski, kvm

On Wed, Jun 08, 2022 at 05:34:18PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter,
> 
> On Tue, Jun 7, 2022 at 5:07 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jun 07, 2022 at 02:17:54PM -0400, Peter Xu wrote:
> > > On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> > > > On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > > > > On 6/6/22 23:27, Peter Xu wrote:
> > > > > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > > > > However there seems to be something missing at least to me, on why it'll
> > > > > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > > > > patch, but 0x0 if with it.
> > > > > > >
> > > > > > > What CPU model are you using for the VM?
> > > > > >
> > > > > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > > > >
> > > > > Ok, so indeed it lacks AVX and this patch can have an effect.
> > > > >
> > > > > > > For example, if the source lacks this patch but the destination has it,
> > > > > > > the source will transmit YMM registers, but the destination will fail to
> > > > > > > set them if they are not available for the selected CPU model.
> > > > > > >
> > > > > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > > > > the guest configuration.  Such features will never be returned by
> > > > > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > > > >
> > > > > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > > > > (probably by failing validate_user_xstate_header when checking against the
> > > > > > user_xfeatures on dest host). But that's probably not my case, because here
> > > > > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > > > > the precopy migration completes (or for postcopy when the switchover is
> > > > > > done).
> > > > >
> > > > > Difficult to say what's happening without seeing at least the guest code
> > > > > around the double fault (above you said "fail a migration" and I thought
> > > > > that was a different scenario than the double fault), and possibly which was
> > > > > the first exception that contributed to the double fault.
> > > >
> > > > Regardless of why the guest explodes in the way it does, is someone planning on
> > > > bisecting this (if necessary?) and sending a backport to v5.15?  There's another
> > > > bug report that is more than likely hitting the same bug.
> > >
> > > What's the bisection you mentioned?  I actually did a bisection and I also
> > > checked reverting Leo's change can also fix this issue.  Or do you mean
> > > something else?
> >
> > Ah, I forgot to mention on the "stable tree decisions": IIUC it also means
> > we should apply Leo's patch to all the stable trees if possible, then
> > migrations between them won't trigger the misterous faults anymore,
> > including when migrating to the latest Linux versions.
> >
> > However there's the delimma that other kernels (any kernel that does not
> > have Leo's patch) will start to fail migrations to the stable branches that
> > apply Leo's patch too..
> 
> IIUC, you commented before that the migration issue should be solved with a
> QEMU fix, is that correct? That would mean something like 'QEMU is relying on a
> kernel bug to work', and should be no blocker for fixing the kernel.

The QEMU fix (that I posted [1]) is not a real fix, only the kernel fix is.

The QEMU patchset only allows the migration to fail early, the kernel patch
allows the migration to go through with no problem as long as both sides
are applied with the fix (or both are not..).  So there're two issues we're
tackling with and IMHO we should fix both.

[1] https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/

> 
> If that's the case, I think we should apply the fix to every supported
> stable branch that
> have the fpku issue, and in parallel come with a qemu fix for that.
> 
> What do you think about it?

Yes I mostly agree with you. I think your patch still does the right thing
by not migrating anything the guest doesn't even support, and that seems to
be the only way to fix the pksu-like issue on migrations between hosts with
different processor configurations.  But it'll also bring other unwanted
side effects, that's why IMHO we need some careful thoughts and I hope I
didn't miss anything important.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-06-08 20:53 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-01 20:13 [PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 02/28] swiotlb: fix info leak with DMA_FROM_DEVICE Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 03/28] usb: dwc3: pci: add support for the Intel Raptor Lake-S Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 04/28] pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID" Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume Sasha Levin
2022-03-01 20:19   ` Paolo Bonzini
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode Sasha Levin
2022-03-01 20:19   ` Paolo Bonzini
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 Sasha Levin
2022-03-01 20:22   ` Paolo Bonzini
2022-06-03 18:40     ` Peter Xu
2022-06-06 16:18       ` Paolo Bonzini
2022-06-06 21:27         ` Peter Xu
2022-06-07 12:54           ` Paolo Bonzini
2022-06-07 15:04             ` Sean Christopherson
2022-06-07 18:17               ` Peter Xu
2022-06-07 18:47                 ` Sean Christopherson
2022-06-07 21:01                   ` Peter Xu
2022-06-07 18:55                 ` Peter Xu
2022-06-08 20:34                   ` Leonardo Bras Soares Passos
2022-06-08 20:53                     ` Peter Xu
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 08/28] spi: rockchip: Fix error in getting num-cs property Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 09/28] spi: rockchip: terminate dma transmission when slave abort Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 10/28] drm/vc4: hdmi: Unregister codec device on unbind Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 11/28] of/fdt: move elfcorehdr reservation early for crash dump kernel Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU Sasha Levin
2022-03-01 20:17   ` Paolo Bonzini
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 13/28] drivers: hamradio: 6pack: fix UAF bug caused by mod_timer() Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 14/28] net-sysfs: add check for netdevice being present to speed_show Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 15/28] sr9700: sanity check for packet length Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 16/28] hwmon: (pmbus) Clear pmbus fault/warning bits after read Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 17/28] nvme-tcp: send H2CData PDUs based on MAXH2CDATA Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 18/28] PCI: Mark all AMD Navi10 and Navi14 GPU ATS as broken Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 19/28] gpio: Return EPROBE_DEFER if gc->to_irq is NULL Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 20/28] drm/amdgpu: bypass tiling flag check in virtual display case (v2) Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 21/28] Revert "xen-netback: remove 'hotplug-status' once it has served its purpose" Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 22/28] Revert "xen-netback: Check for hotplug-status existence before watching" Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 23/28] ipv6: prevent a possible race condition with lifetimes Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 24/28] tracing: Ensure trace buffer is at least 4096 bytes large Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 25/28] tracing/osnoise: Make osnoise_main to sleep for microseconds Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 26/28] tracing: Fix selftest config check for function graph start up test Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 27/28] selftest/vm: fix map_fixed_noreplace test failure Sasha Levin
2022-03-01 20:13 ` [PATCH AUTOSEL 5.16 28/28] selftests/memfd: clean up mapping in mfd_fail_write Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).