[MODERATED] [PATCH v6 0/5] NX 0

* [MODERATED] [PATCH v6 0/5] NX 0
@ 2019-10-11 10:40 Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 1/5] NX 1 Paolo Bonzini
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-11 10:40 UTC
  To: speck

- new recovery heuristics (suggested by Andrea Arcangeli)
- removed prerequisite patches that are already upstream
- debugfs entry now read-only
- use fast shadow page invalidation when toggling nx_huge_pages

Junaid Shahid (2):
  kvm: Add helper function for creating VM worker threads
  kvm: x86: mmu: Recovery of shattered NX large pages

Paolo Bonzini (1):
  kvm: mmu: ITLB_MULTIHIT mitigation

Pawan Gupta (2):
  x86: Add ITLB_MULTIHIT bug infrastructure
  x86/cpu: Add Tremont to the cpu vulnerability whitelist

 Documentation/ABI/testing/sysfs-devices-system-cpu |   1 +
 Documentation/admin-guide/kernel-parameters.txt    |  17 ++
 arch/x86/include/asm/cpufeatures.h                 |   1 +
 arch/x86/include/asm/kvm_host.h                    |   6 +
 arch/x86/include/asm/msr-index.h                   |   7 +
 arch/x86/kernel/cpu/bugs.c                         |  24 ++
 arch/x86/kernel/cpu/common.c                       |  73 +++---
 arch/x86/kvm/mmu.c                                 | 264 ++++++++++++++++++++-
 arch/x86/kvm/mmu.h                                 |   4 +
 arch/x86/kvm/paging_tmpl.h                         |  29 ++-
 arch/x86/kvm/x86.c                                 |  12 +
 drivers/base/cpu.c                                 |   8 +
 include/linux/cpu.h                                |   2 +
 include/linux/kvm_host.h                           |   6 +
 virt/kvm/kvm_main.c                                | 114 ++++++++-
 15 files changed, 522 insertions(+), 46 deletions(-)

-- 
1.8.3.1

* [MODERATED] [PATCH v6 1/5] NX 1
  2019-10-11 10:40 [MODERATED] [PATCH v6 0/5] NX 0 Paolo Bonzini
@ 2019-10-11 10:40 ` Paolo Bonzini
  2019-10-11 13:18   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-10-15 11:24   ` Thomas Gleixner
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 2/5] NX 2 Paolo Bonzini
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-11 10:40 UTC
  To: speck


Some processors may incur a machine check error, possibly
resulting in an unrecoverable CPU hang, when an instruction fetch
encounters a TLB multi-hit in the instruction TLB. This can occur
when the page size is changed along with either the physical
address or cache type [1].

This issue affects both bare-metal x86 page tables and EPT.

This can be mitigated by either eliminating the use of large
pages or by using careful TLB invalidations when changing the
page size in the page tables.

Just like Spectre, Meltdown, L1TF and MDS, a new bit has been
allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will
be set on CPUs which are mitigated against this issue.
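
As a rough illustration of the detection rule (a userspace sketch, not
kernel code): the bug is flagged only when the CPU is neither on the
whitelist nor reports PSCHANGE_MC_NO. The helper name and its two
arguments are assumptions made for this example; the bit position is
the one this patch defines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 6 of IA32_ARCH_CAPABILITIES, as defined by this patch. */
#define ARCH_CAP_PSCHANGE_MC_NO (UINT64_C(1) << 6)

/*
 * Hypothetical helper, for illustration only.
 * 'whitelisted' stands in for cpu_matches(NO_ITLB_MULTIHIT);
 * 'ia32_cap' is the MSR value, or 0 when the MSR is not enumerated.
 */
static bool itlb_multihit_affected(bool whitelisted, uint64_t ia32_cap)
{
    return !whitelisted && !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO);
}
```

Passing 0 for a CPU without the MSR mirrors the patch, where ia32_cap
is initialized to 0 and only read when ARCH_CAPABILITIES is present.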

[1] For example please refer to erratum SKL002 in "6th Generation
Intel Processor Family Specification Update"
https://www.intel.com/content/www/us/en/products/docs/processors/core/desktop-6th-gen-core-family-spec-update.html
https://www.google.com/search?q=site:intel.com+SKL002

Many processors other than Skylake are also affected, and the
erratum referred to above does not fully disclose the issue and
its impact, either on Skylake or across the other affected CPUs.

Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |  1 +
 arch/x86/include/asm/cpufeatures.h                 |  1 +
 arch/x86/include/asm/msr-index.h                   |  7 +++
 arch/x86/kernel/cpu/bugs.c                         | 13 ++++
 arch/x86/kernel/cpu/common.c                       | 71 ++++++++++++----------
 drivers/base/cpu.c                                 |  8 +++
 include/linux/cpu.h                                |  2 +
 7 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 06d0931119cc..55bf5e1538ad 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -486,6 +486,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 		/sys/devices/system/cpu/vulnerabilities/l1tf
 		/sys/devices/system/cpu/vulnerabilities/mds
+		/sys/devices/system/cpu/vulnerabilities/itlb_multihit
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 0652d3eed9bd..66aaad7611b2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -399,5 +399,6 @@
 #define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
 #define X86_BUG_MSBDS_ONLY		X86_BUG(20) /* CPU is only affected by the  MSDBS variant of BUG_MDS */
 #define X86_BUG_SWAPGS			X86_BUG(21) /* CPU is affected by speculation through SWAPGS */
+#define X86_BUG_ITLB_MULTIHIT		X86_BUG(22) /* CPU may incur MCE during certain page attribute changes */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 20ce682a2540..c678899b21db 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -93,6 +93,13 @@
 						  * Microarchitectural Data
 						  * Sampling (MDS) vulnerabilities.
 						  */
+#define ARCH_CAP_PSCHANGE_MC_NO		BIT(6)	 /*
+						  * The processor is not susceptible to a
+						  * machine check error due to modifying the
+						  * code page size along with either the
+						  * physical address or cache type
+						  * without TLB invalidation.
+						  */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			BIT(0)	/*
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 91c2561b905f..ecd0126648ea 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1311,6 +1311,11 @@ static ssize_t l1tf_show_state(char *buf)
 }
 #endif
 
+static ssize_t itlb_multihit_show_state(char *buf)
+{
+	return sprintf(buf, "Processor vulnerable\n");
+}
+
 static ssize_t mds_show_state(char *buf)
 {
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
@@ -1398,6 +1403,9 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 	case X86_BUG_MDS:
 		return mds_show_state(buf);
 
+	case X86_BUG_ITLB_MULTIHIT:
+		return itlb_multihit_show_state(buf);
+
 	default:
 		break;
 	}
@@ -1434,4 +1442,9 @@ ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *bu
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
 }
+
+ssize_t cpu_show_itlb_multihit(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_ITLB_MULTIHIT);
+}
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9ae7d1bcd4f4..fc00b2349a9f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1016,13 +1016,14 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
 #endif
 }
 
-#define NO_SPECULATION	BIT(0)
-#define NO_MELTDOWN	BIT(1)
-#define NO_SSB		BIT(2)
-#define NO_L1TF		BIT(3)
-#define NO_MDS		BIT(4)
-#define MSBDS_ONLY	BIT(5)
-#define NO_SWAPGS	BIT(6)
+#define NO_SPECULATION		BIT(0)
+#define NO_MELTDOWN		BIT(1)
+#define NO_SSB			BIT(2)
+#define NO_L1TF			BIT(3)
+#define NO_MDS			BIT(4)
+#define MSBDS_ONLY		BIT(5)
+#define NO_SWAPGS		BIT(6)
+#define NO_ITLB_MULTIHIT	BIT(7)
 
 #define VULNWL(_vendor, _family, _model, _whitelist)	\
 	{ X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
@@ -1043,27 +1044,27 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
 	VULNWL(NSC,	5, X86_MODEL_ANY,	NO_SPECULATION),
 
 	/* Intel Family 6 */
-	VULNWL_INTEL(ATOM_SALTWELL,		NO_SPECULATION),
-	VULNWL_INTEL(ATOM_SALTWELL_TABLET,	NO_SPECULATION),
-	VULNWL_INTEL(ATOM_SALTWELL_MID,		NO_SPECULATION),
-	VULNWL_INTEL(ATOM_BONNELL,		NO_SPECULATION),
-	VULNWL_INTEL(ATOM_BONNELL_MID,		NO_SPECULATION),
-
-	VULNWL_INTEL(ATOM_SILVERMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
-	VULNWL_INTEL(ATOM_SILVERMONT_D,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
-	VULNWL_INTEL(ATOM_SILVERMONT_MID,	NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
-	VULNWL_INTEL(ATOM_AIRMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
-	VULNWL_INTEL(XEON_PHI_KNL,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
-	VULNWL_INTEL(XEON_PHI_KNM,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
+	VULNWL_INTEL(ATOM_SALTWELL,		NO_SPECULATION | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_SALTWELL_TABLET,	NO_SPECULATION | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_SALTWELL_MID,		NO_SPECULATION | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_BONNELL,		NO_SPECULATION | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_BONNELL_MID,		NO_SPECULATION | NO_ITLB_MULTIHIT),
+
+	VULNWL_INTEL(ATOM_SILVERMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_SILVERMONT_D,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_SILVERMONT_MID,	NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_AIRMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(XEON_PHI_KNL,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(XEON_PHI_KNM,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
 
 	VULNWL_INTEL(CORE_YONAH,		NO_SSB),
 
-	VULNWL_INTEL(ATOM_AIRMONT_MID,		NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
-	VULNWL_INTEL(ATOM_AIRMONT_NP,		NO_L1TF | NO_SWAPGS),
+	VULNWL_INTEL(ATOM_AIRMONT_MID,		NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_AIRMONT_NP,		NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
 
-	VULNWL_INTEL(ATOM_GOLDMONT,		NO_MDS | NO_L1TF | NO_SWAPGS),
-	VULNWL_INTEL(ATOM_GOLDMONT_D,		NO_MDS | NO_L1TF | NO_SWAPGS),
-	VULNWL_INTEL(ATOM_GOLDMONT_PLUS,	NO_MDS | NO_L1TF | NO_SWAPGS),
+	VULNWL_INTEL(ATOM_GOLDMONT,		NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_GOLDMONT_D,		NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_INTEL(ATOM_GOLDMONT_PLUS,	NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
 
 	/*
 	 * Technically, swapgs isn't serializing on AMD (despite it previously
@@ -1074,14 +1075,14 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
 	 */
 
 	/* AMD Family 0xf - 0x12 */
-	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
-	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
-	VULNWL_AMD(0x11,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
-	VULNWL_AMD(0x12,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
+	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_AMD(0x11,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_AMD(0x12,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
 
 	/* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
-	VULNWL_AMD(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS),
-	VULNWL_HYGON(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS),
+	VULNWL_AMD(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+	VULNWL_HYGON(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
 	{}
 };
 
@@ -1096,15 +1097,19 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;
 
+	if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
+		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
+
+	/* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
+	if (!cpu_matches(NO_ITLB_MULTIHIT) && !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
+		setup_force_cpu_bug(X86_BUG_ITLB_MULTIHIT);
+
 	if (cpu_matches(NO_SPECULATION))
 		return;
 
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
 
-	if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
-		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
-
 	if (!cpu_matches(NO_SSB) && !(ia32_cap & ARCH_CAP_SSB_NO) &&
 	   !cpu_has(c, X86_FEATURE_AMD_SSB_NO))
 		setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index cc37511de866..f1a6e020ed8d 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -554,12 +554,19 @@ ssize_t __weak cpu_show_mds(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_itlb_multihit(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
 static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
 static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
 static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
+static DEVICE_ATTR(itlb_multihit, 0444, cpu_show_itlb_multihit, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -568,6 +575,7 @@ ssize_t __weak cpu_show_mds(struct device *dev,
 	&dev_attr_spec_store_bypass.attr,
 	&dev_attr_l1tf.attr,
 	&dev_attr_mds.attr,
+	&dev_attr_itlb_multihit.attr,
 	NULL
 };
 
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d0633ebdaa9c..038866a28f2c 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -59,6 +59,8 @@ extern ssize_t cpu_show_l1tf(struct device *dev,
 			     struct device_attribute *attr, char *buf);
 extern ssize_t cpu_show_mds(struct device *dev,
 			    struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_itlb_multihit(struct device *dev,
+				      struct device_attribute *attr, char *buf);
 
 extern __printf(4, 5)
 struct device *cpu_device_create(struct device *parent, void *drvdata,
-- 
1.8.3.1

* [MODERATED] [PATCH v6 2/5] NX 2
  2019-10-11 10:40 [MODERATED] [PATCH v6 0/5] NX 0 Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 1/5] NX 1 Paolo Bonzini
@ 2019-10-11 10:40 ` Paolo Bonzini
  2019-10-11 13:18   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-10-15 19:37   ` Tyler Hicks
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 3/5] NX 3 Paolo Bonzini
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-11 10:40 UTC
  To: speck


This patch adds the new CPU family ATOM_TREMONT_D to the CPU vulnerability
whitelist. ATOM_TREMONT_D is not affected by X86_BUG_ITLB_MULTIHIT. It may
also be unaffected by other bugs, but this is not known at this point;
further whitelist flags could be added later.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/cpu/common.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fc00b2349a9f..c652ca9dc046 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1074,6 +1074,8 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
 	 * good enough for our purposes.
 	 */
 
+	VULNWL_INTEL(ATOM_TREMONT_D,		NO_ITLB_MULTIHIT),
+
 	/* AMD Family 0xf - 0x12 */
 	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
 	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
-- 
1.8.3.1

* [MODERATED] [PATCH v6 3/5] NX 3
  2019-10-11 10:40 [MODERATED] [PATCH v6 0/5] NX 0 Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 1/5] NX 1 Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 2/5] NX 2 Paolo Bonzini
@ 2019-10-11 10:40 ` Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 4/5] NX 4 Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 5/5] NX 5 Paolo Bonzini
  4 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-11 10:40 UTC
  To: speck


With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check.  Unfortunately, if EPT
page tables use huge pages, it is possible for a malicious guest to cause
this situation.

This patch adds a knob to mark huge pages as non-executable. When the
nx_huge_pages parameter is enabled (and we are using EPT), all huge pages
are marked as NX. If the guest attempts to execute in one of those pages,
the page is broken down into 4K pages, which are then marked executable.

This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen.  With nested EPT, the nested guest can again cause problems, so
shadow and direct EPT are treated the same.
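
The policy described above can be sketched in plain C as follows. This
is only an illustration under stated assumptions, not the KVM code: both
helper names are hypothetical, and the constant values are chosen for
the example (level 1 is taken to be the 4 KiB leaf level).

```c
#include <stdbool.h>

#define PT_PAGE_TABLE_LEVEL 1   /* 4 KiB leaf level (assumed for the sketch) */
#define ACC_EXEC_MASK       1u  /* illustrative value for the exec permission */

/* When the knob is on, a huge mapping never carries exec permission. */
static unsigned int nx_adjust_access(int level, unsigned int pte_access,
                                     bool nx_huge_pages)
{
    if (level > PT_PAGE_TABLE_LEVEL && nx_huge_pages)
        pte_access &= ~ACC_EXEC_MASK;
    return pte_access;
}

/* An instruction fetch fault forces a 4 KiB mapping, which may be executable. */
static int nx_fault_level(int wanted_level, bool fetch_fault,
                          bool nx_huge_pages)
{
    if (fetch_fault && nx_huge_pages)
        return PT_PAGE_TABLE_LEVEL;
    return wanted_level;
}
```

Data accesses therefore keep using huge pages; only pages that the
guest actually executes from are broken down.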

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  11 ++
 arch/x86/include/asm/kvm_host.h                 |   2 +
 arch/x86/kernel/cpu/bugs.c                      |  13 ++-
 arch/x86/kvm/mmu.c                              | 135 ++++++++++++++++++++++--
 arch/x86/kvm/paging_tmpl.h                      |  29 +++--
 arch/x86/kvm/x86.c                              |   1 +
 6 files changed, 178 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c7ac2f3ac99f..2ec88c723740 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2055,6 +2055,17 @@
 			KVM MMU at runtime.
 			Default is 0 (off)
 
+	kvm.nx_huge_pages=
+			[KVM] Controls the sw workaround for bug
+			X86_BUG_ITLB_MULTIHIT.
+			force	: Always deploy workaround.
+			off	: Default. Never deploy workaround.
+			auto	: Deploy workaround based on presence of
+				  X86_BUG_ITLB_MULTIHIT.
+
+			If the sw workaround is enabled for the host, guests
+			need not enable it for nested guests.
+
 	kvm-amd.nested=	[KVM,AMD] Allow nested virtualization in KVM/SVM.
 			Default is 1 (enabled)
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5d8056ff7390..55b59141663b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -313,6 +313,7 @@ struct kvm_mmu_page {
 	bool unsync;
 	u8 mmu_valid_gen;
 	bool mmio_cached;
+	bool lpage_disallowed; /* Can't be replaced by an equiv large page */
 
 	/*
 	 * The following two entries are used to key the shadow page in the
@@ -944,6 +945,7 @@ struct kvm_vm_stat {
 	ulong mmu_unsync;
 	ulong remote_tlb_flush;
 	ulong lpages;
+	ulong nx_lpage_splits;
 	ulong max_mmu_page_hash_collisions;
 };
 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ecd0126648ea..fe0378e03dbb 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1149,6 +1149,9 @@ void x86_spec_ctrl_setup_ap(void)
 		x86_amd_ssb_disable();
 }
 
+bool itlb_multihit_kvm_mitigation;
+EXPORT_SYMBOL_GPL(itlb_multihit_kvm_mitigation);
+
 #undef pr_fmt
 #define pr_fmt(fmt)	"L1TF: " fmt
 
@@ -1304,17 +1307,25 @@ static ssize_t l1tf_show_state(char *buf)
 		       l1tf_vmx_states[l1tf_vmx_mitigation],
 		       sched_smt_active() ? "vulnerable" : "disabled");
 }
+
+static ssize_t itlb_multihit_show_state(char *buf)
+{
+	if (itlb_multihit_kvm_mitigation)
+		return sprintf(buf, "KVM: Mitigation: Split huge pages\n");
+	else
+		return sprintf(buf, "KVM: Vulnerable\n");
+}
 #else
 static ssize_t l1tf_show_state(char *buf)
 {
 	return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
 }
-#endif
 
 static ssize_t itlb_multihit_show_state(char *buf)
 {
 	return sprintf(buf, "Processor vulnerable\n");
 }
+#endif
 
 static ssize_t mds_show_state(char *buf)
 {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..837beefdf0a5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -47,6 +47,20 @@
 #include <asm/kvm_page_track.h>
 #include "trace.h"
 
+extern bool itlb_multihit_kvm_mitigation;
+
+static int __read_mostly nx_huge_pages = -1;
+
+static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
+
+static struct kernel_param_ops nx_huge_pages_ops = {
+	.set = set_nx_huge_pages,
+	.get = param_get_bool,
+};
+
+module_param_cb(nx_huge_pages, &nx_huge_pages_ops, &nx_huge_pages, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages, "bool");
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -352,6 +366,11 @@ static inline bool spte_ad_need_write_protect(u64 spte)
 	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
 }
 
+static bool is_nx_huge_page_enabled(void)
+{
+	return READ_ONCE(nx_huge_pages);
+}
+
 static inline u64 spte_shadow_accessed_mask(u64 spte)
 {
 	MMU_WARN_ON(is_mmio_spte(spte));
@@ -1190,6 +1209,15 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 }
 
+static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	if (sp->lpage_disallowed)
+		return;
+
+	++kvm->stat.nx_lpage_splits;
+	sp->lpage_disallowed = true;
+}
+
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	struct kvm_memslots *slots;
@@ -1207,6 +1235,12 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
 
+static void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	--kvm->stat.nx_lpage_splits;
+	sp->lpage_disallowed = false;
+}
+
 static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
 					  struct kvm_memory_slot *slot)
 {
@@ -2792,6 +2826,9 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
 			kvm_reload_remote_mmus(kvm);
 	}
 
+	if (sp->lpage_disallowed)
+		unaccount_huge_nx_page(kvm, sp);
+
 	sp->role.invalid = 1;
 	return list_unstable;
 }
@@ -3013,6 +3050,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (!speculative)
 		spte |= spte_shadow_accessed_mask(spte);
 
+	if (level > PT_PAGE_TABLE_LEVEL && (pte_access & ACC_EXEC_MASK) &&
+	    is_nx_huge_page_enabled()) {
+		pte_access &= ~ACC_EXEC_MASK;
+	}
+
 	if (pte_access & ACC_EXEC_MASK)
 		spte |= shadow_x_mask;
 	else
@@ -3233,9 +3275,32 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
+static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
+				       gfn_t gfn, kvm_pfn_t *pfnp, int *levelp)
+{
+	int level = *levelp;
+	u64 spte = *it.sptep;
+
+	if (it.level == level && level > PT_PAGE_TABLE_LEVEL &&
+	    is_nx_huge_page_enabled() &&
+	    is_shadow_present_pte(spte) &&
+	    !is_large_pte(spte)) {
+		/*
+		 * A small SPTE exists for this pfn, but FNAME(fetch)
+		 * and __direct_map would like to create a large PTE
+		 * instead: just force them to go down another level,
+		 * patching back for them into pfn the next 9 bits of
+		 * the address.
+		 */
+		u64 page_mask = KVM_PAGES_PER_HPAGE(level) - KVM_PAGES_PER_HPAGE(level - 1);
+		*pfnp |= gfn & page_mask;
+		(*levelp)--;
+	}
+}
+
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 			int map_writable, int level, kvm_pfn_t pfn,
-			bool prefault)
+			bool prefault, bool lpage_disallowed)
 {
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
@@ -3248,6 +3313,12 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
 	for_each_shadow_entry(vcpu, gpa, it) {
+		/*
+		 * We cannot overwrite existing page tables with an NX
+		 * large page, as the leaf could be executable.
+		 */
+		disallowed_hugepage_adjust(it, gfn, &pfn, &level);
+
 		base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 		if (it.level == level)
 			break;
@@ -3258,6 +3329,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 					      it.level - 1, true, ACC_ALL);
 
 			link_shadow_page(vcpu, it.sptep, sp);
+			if (lpage_disallowed)
+				account_huge_nx_page(vcpu->kvm, sp);
 		}
 	}
 
@@ -3550,11 +3623,14 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 {
 	int r;
 	int level;
-	bool force_pt_level = false;
+	bool force_pt_level;
 	kvm_pfn_t pfn;
 	unsigned long mmu_seq;
 	bool map_writable, write = error_code & PFERR_WRITE_MASK;
+	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
+				is_nx_huge_page_enabled();
 
+	force_pt_level = lpage_disallowed;
 	level = mapping_level(vcpu, gfn, &force_pt_level);
 	if (likely(!force_pt_level)) {
 		/*
@@ -3588,7 +3664,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 		goto out_unlock;
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, v, write, map_writable, level, pfn, prefault);
+	r = __direct_map(vcpu, v, write, map_writable, level, pfn,
+			 prefault, false);
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
@@ -4174,6 +4251,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	unsigned long mmu_seq;
 	int write = error_code & PFERR_WRITE_MASK;
 	bool map_writable;
+	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
+				is_nx_huge_page_enabled();
 
 	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
 
@@ -4184,8 +4263,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	if (r)
 		return r;
 
-	force_pt_level = !check_hugepage_cache_consistency(vcpu, gfn,
-							   PT_DIRECTORY_LEVEL);
+	force_pt_level =
+		lpage_disallowed ||
+		!check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL);
 	level = mapping_level(vcpu, gfn, &force_pt_level);
 	if (likely(!force_pt_level)) {
 		if (level > PT_DIRECTORY_LEVEL &&
@@ -4214,7 +4294,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		goto out_unlock;
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault);
+	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
+			 prefault, lpage_disallowed);
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
@@ -6155,10 +6236,52 @@ static void kvm_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
 }
 
+static void __set_nx_huge_pages(bool val)
+{
+	nx_huge_pages = itlb_multihit_kvm_mitigation = val;
+}
+
+static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
+{
+	bool old_val = nx_huge_pages;
+	bool new_val;
+
+	/* In "auto" mode deploy workaround only if CPU has the bug. */
+	if (sysfs_streq(val, "off"))
+		new_val = 0;
+	else if (sysfs_streq(val, "force"))
+		new_val = 1;
+	else if (sysfs_streq(val, "auto"))
+		new_val = boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT);
+	else if (strtobool(val, &new_val) < 0)
+		return -EINVAL;
+
+	__set_nx_huge_pages(new_val);
+
+	if (new_val != old_val) {
+		struct kvm *kvm;
+		int idx;
+
+		mutex_lock(&kvm_lock);
+
+		list_for_each_entry(kvm, &vm_list, vm_list) {
+			idx = srcu_read_lock(&kvm->srcu);
+			kvm_mmu_zap_all_fast(kvm);
+			srcu_read_unlock(&kvm->srcu, idx);
+		}
+		mutex_unlock(&kvm_lock);
+	}
+
+	return 0;
+}
+
 int kvm_mmu_module_init(void)
 {
 	int ret = -ENOMEM;
 
+	if (nx_huge_pages == -1)
+		__set_nx_huge_pages(boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT));
+
 	/*
 	 * MMU roles use union aliasing which is, generally speaking, an
 	 * undefined behavior. However, we supposedly know how compilers behave
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7d5cdb3af594..97b21e7fd013 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -614,13 +614,14 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int write_fault, int hlevel,
-			 kvm_pfn_t pfn, bool map_writable, bool prefault)
+			 kvm_pfn_t pfn, bool map_writable, bool prefault,
+			 bool lpage_disallowed)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
 	unsigned direct_access, access = gw->pt_access;
 	int top_level, ret;
-	gfn_t base_gfn;
+	gfn_t gfn, base_gfn;
 
 	direct_access = gw->pte_access;
 
@@ -665,13 +666,25 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			link_shadow_page(vcpu, it.sptep, sp);
 	}
 
-	base_gfn = gw->gfn;
+	/*
+	 * FNAME(page_fault) might have clobbered the bottom bits of
+	 * gw->gfn, restore them from the virtual address.
+	 */
+	gfn = gw->gfn | ((addr & PT_LVL_OFFSET_MASK(gw->level)) >> PAGE_SHIFT);
+	base_gfn = gfn;
 
 	trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
 
 	for (; shadow_walk_okay(&it); shadow_walk_next(&it)) {
 		clear_sp_write_flooding_count(it.sptep);
-		base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+
+		/*
+		 * We cannot overwrite existing page tables with an NX
+		 * large page, as the leaf could be executable.
+		 */
+		disallowed_hugepage_adjust(it, gfn, &pfn, &hlevel);
+
+		base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 		if (it.level == hlevel)
 			break;
 
@@ -683,6 +696,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			sp = kvm_mmu_get_page(vcpu, base_gfn, addr,
 					      it.level - 1, true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
+			if (lpage_disallowed)
+				account_huge_nx_page(vcpu->kvm, sp);
 		}
 	}
 
@@ -759,9 +774,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int r;
 	kvm_pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
-	bool force_pt_level = false;
 	unsigned long mmu_seq;
 	bool map_writable, is_self_change_mapping;
+	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
+				is_nx_huge_page_enabled();
+	bool force_pt_level = lpage_disallowed;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
@@ -851,7 +868,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (!force_pt_level)
 		transparent_hugepage_adjust(vcpu, walker.gfn, &pfn, &level);
 	r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
-			 level, pfn, map_writable, prefault);
+			 level, pfn, map_writable, prefault, lpage_disallowed);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a55252c69118..3d7fad762007 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -213,6 +213,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "mmu_unsync", VM_STAT(mmu_unsync) },
 	{ "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
 	{ "largepages", VM_STAT(lpages, .mode = 0444) },
+	{ "nx_largepages_splitted", VM_STAT(nx_lpage_splits, .mode = 0444) },
 	{ "max_mmu_page_hash_collisions",
 		VM_STAT(max_mmu_page_hash_collisions) },
 	{ NULL }
-- 
1.8.3.1

* [MODERATED] [PATCH v6 4/5] NX 4
  2019-10-11 10:40 [MODERATED] [PATCH v6 0/5] NX 0 Paolo Bonzini
                   ` (2 preceding siblings ...)
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 3/5] NX 3 Paolo Bonzini
@ 2019-10-11 10:40 ` Paolo Bonzini
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 5/5] NX 5 Paolo Bonzini
  4 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-11 10:40 UTC (permalink / raw)
  To: speck


This adds a function to create a kernel thread associated with a given
VM. In particular, it ensures that the worker thread inherits the
priority and cgroups of the calling thread.
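
As a rough userspace analogue (not the kernel code), the init handshake can be
sketched with POSIX threads. The creator keeps the init context on its own
stack, so the worker copies what it needs, reports its init status, and never
touches the context again. The names worker_fn_t, worker_init_context,
create_worker and example_double are hypothetical. The sketch assumes a plain
mutex/condvar in place of the kernel's struct completion, and it models neither
the park/unpark step nor the cgroup and priority inheritance.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for kvm_vm_thread_fn_t. */
typedef int (*worker_fn_t)(uintptr_t data);

/* Lives on the creator's stack, like init_context in the patch. */
struct worker_init_context {
	worker_fn_t fn;
	uintptr_t data;
	int err;		/* init status reported by the worker */
	int done;		/* set once the worker has finished init */
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

static void *worker_trampoline(void *arg)
{
	struct worker_init_context *ctx = arg;
	/* Copy everything needed later; ctx dies once the creator returns. */
	worker_fn_t fn = ctx->fn;
	uintptr_t data = ctx->data;

	/* Per-thread setup would go here; this sketch has none that can fail. */
	pthread_mutex_lock(&ctx->lock);
	ctx->err = 0;
	ctx->done = 1;
	pthread_cond_signal(&ctx->cond);
	pthread_mutex_unlock(&ctx->lock);
	ctx = NULL;	/* must not be dereferenced from here on */

	return (void *)(intptr_t)fn(data);
}

/* Start a worker and wait until it has finished its init phase. */
int create_worker(worker_fn_t fn, uintptr_t data, pthread_t *thread_out)
{
	struct worker_init_context ctx = { .fn = fn, .data = data };
	int err;

	assert(thread_out != NULL);
	pthread_mutex_init(&ctx.lock, NULL);
	pthread_cond_init(&ctx.cond, NULL);

	err = pthread_create(thread_out, NULL, worker_trampoline, &ctx);
	if (!err) {
		/* Block until the worker no longer needs the stack context. */
		pthread_mutex_lock(&ctx.lock);
		while (!ctx.done)
			pthread_cond_wait(&ctx.cond, &ctx.lock);
		pthread_mutex_unlock(&ctx.lock);
		err = ctx.err;
	}

	pthread_cond_destroy(&ctx.cond);
	pthread_mutex_destroy(&ctx.lock);
	return err;
}

/* Example worker body used only to exercise the sketch. */
int example_double(uintptr_t data)
{
	return (int)(data * 2);
}
```

A caller would pass a worker function and an opaque value to create_worker()
and later pthread_join() the returned thread. Build with -pthread where the C
library requires it.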

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/linux/kvm_host.h |  6 ++++
 virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 719fc3e15ea4..52ed5f66e8f9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1382,4 +1382,10 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
 
+typedef int (*kvm_vm_thread_fn_t)(struct kvm *kvm, uintptr_t data);
+
+int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
+				uintptr_t data, const char *name,
+				struct task_struct **thread_ptr);
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67ef3f2e19e8..513b49be83e0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -50,6 +50,7 @@
 #include <linux/bsearch.h>
 #include <linux/io.h>
 #include <linux/lockdep.h>
+#include <linux/kthread.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -4367,3 +4368,86 @@ void kvm_exit(void)
 	kvm_vfio_ops_exit();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);
+
+struct kvm_vm_worker_thread_context {
+	struct kvm *kvm;
+	struct task_struct *parent;
+	struct completion init_done;
+	kvm_vm_thread_fn_t thread_fn;
+	uintptr_t data;
+	int err;
+};
+
+static int kvm_vm_worker_thread(void *context)
+{
+	/*
+	 * The init_context is allocated on the stack of the parent thread, so
+	 * we have to locally copy anything that is needed beyond initialization
+	 */
+	struct kvm_vm_worker_thread_context *init_context = context;
+	struct kvm *kvm = init_context->kvm;
+	kvm_vm_thread_fn_t thread_fn = init_context->thread_fn;
+	uintptr_t data = init_context->data;
+	int err;
+
+	err = kthread_park(current);
+	/* kthread_park(current) is never supposed to return an error */
+	WARN_ON(err != 0);
+	if (err)
+		goto init_complete;
+
+	err = cgroup_attach_task_all(init_context->parent, current);
+	if (err) {
+		kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
+			__func__, err);
+		goto init_complete;
+	}
+
+	set_user_nice(current, task_nice(init_context->parent));
+
+init_complete:
+	init_context->err = err;
+	complete(&init_context->init_done);
+	init_context = NULL;
+
+	if (err)
+		return err;
+
+	/* Wait to be woken up by the spawner before proceeding. */
+	kthread_parkme();
+
+	if (!kthread_should_stop())
+		err = thread_fn(kvm, data);
+
+	return err;
+}
+
+int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
+				uintptr_t data, const char *name,
+				struct task_struct **thread_ptr)
+{
+	struct kvm_vm_worker_thread_context init_context = {};
+	struct task_struct *thread;
+
+	*thread_ptr = NULL;
+	init_context.kvm = kvm;
+	init_context.parent = current;
+	init_context.thread_fn = thread_fn;
+	init_context.data = data;
+	init_completion(&init_context.init_done);
+
+	thread = kthread_run(kvm_vm_worker_thread, &init_context,
+			     "%s-%d", name, task_pid_nr(current));
+	if (IS_ERR(thread))
+		return PTR_ERR(thread);
+
+	/* kthread_run is never supposed to return NULL */
+	WARN_ON(thread == NULL);
+
+	wait_for_completion(&init_context.init_done);
+
+	if (!init_context.err)
+		*thread_ptr = thread;
+
+	return init_context.err;
+}
-- 
1.8.3.1

* [MODERATED] [PATCH v6 5/5] NX 5
  2019-10-11 10:40 [MODERATED] [PATCH v6 0/5] NX 0 Paolo Bonzini
                   ` (3 preceding siblings ...)
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 4/5] NX 4 Paolo Bonzini
@ 2019-10-11 10:40 ` Paolo Bonzini
  4 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-11 10:40 UTC (permalink / raw)
  To: speck


The page table pages corresponding to broken down large pages are
zapped in FIFO order, so that the large page can potentially
be recovered, if it is no longer being used for execution.  This removes
the performance penalty for walking deeper EPT page tables.

By default, a shattered large page stays split for about one hour once
the guest reaches a steady state.
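
The "about one hour" figure follows from the arithmetic in the patch: each
pass zaps DIV_ROUND_UP(nx_lpage_splits, ratio) pages from the head of a FIFO
list, passes run 60 * HZ apart, and the default ratio is 60. A minimal
userspace sketch of that arithmetic follows. The function names are
hypothetical, and the second helper assumes the number of split pages stays
constant between passes, which is only an approximation of a steady state.

```c
#include <stdint.h>

/* Same rounding as the kernel's DIV_ROUND_UP(), for unsigned operands. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* Pages to zap in one pass; a ratio of 0 disables recovery. */
uint64_t nx_pages_to_zap(uint64_t splits, unsigned int ratio)
{
	return ratio ? div_round_up(splits, ratio) : 0;
}

/*
 * Passes needed for the newest list entry to reach the head, assuming
 * (hypothetically) that the number of split pages stays at "splits".
 * Returns 0 when nothing would ever be zapped.
 */
uint64_t nx_passes_until_zapped(uint64_t splits, unsigned int ratio)
{
	uint64_t per_pass = nx_pages_to_zap(splits, ratio);

	return per_pass ? div_round_up(splits, per_pass) : 0;
}
```

With 600 split pages and the default ratio of 60, each one-minute pass zaps
10 pages, so the newest entry waits 60 passes, i.e. roughly an hour.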

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   6 ++
 arch/x86/include/asm/kvm_host.h                 |   4 +
 arch/x86/kvm/mmu.c                              | 129 ++++++++++++++++++++++++
 arch/x86/kvm/mmu.h                              |   4 +
 arch/x86/kvm/x86.c                              |  11 ++
 virt/kvm/kvm_main.c                             |  30 +++++-
 6 files changed, 183 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2ec88c723740..d5fdf80b3fac 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2066,6 +2066,12 @@
 			If the sw workaround is enabled for the host, guests
 			need not enable it for nested guests.
 
+	kvm.nx_huge_pages_recovery_ratio=
+			[KVM] Controls how many 4KiB pages are periodically zapped
+			back to huge pages.  0 disables the recovery, otherwise if
+			the value is N KVM will zap 1/Nth of the 4KiB pages every
+			minute.  The default is 60.
+
 	kvm-amd.nested=	[KVM,AMD] Allow nested virtualization in KVM/SVM.
 			Default is 1 (enabled)
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55b59141663b..a21339a26436 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -310,6 +310,8 @@ struct kvm_rmap_head {
 struct kvm_mmu_page {
 	struct list_head link;
 	struct hlist_node hash_link;
+	struct list_head lpage_disallowed_link;
+
 	bool unsync;
 	u8 mmu_valid_gen;
 	bool mmio_cached;
@@ -858,6 +860,7 @@ struct kvm_arch {
 	 */
 	struct list_head active_mmu_pages;
 	struct list_head zapped_obsolete_pages;
+	struct list_head lpage_disallowed_mmu_pages;
 	struct kvm_page_track_notifier_node mmu_sp_tracker;
 	struct kvm_page_track_notifier_head track_notifier_head;
 
@@ -932,6 +935,7 @@ struct kvm_arch {
 	bool exception_payload_enabled;
 
 	struct kvm_pmu_event_filter *pmu_event_filter;
+	struct task_struct *nx_lpage_recovery_thread;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 837beefdf0a5..e6a5748a12d5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -37,6 +37,7 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
+#include <linux/kthread.h>
 
 #include <asm/page.h>
 #include <asm/pat.h>
@@ -50,16 +51,26 @@
 extern bool itlb_multihit_kvm_mitigation;
 
 static int __read_mostly nx_huge_pages = -1;
+static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
 
 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
+static int set_nx_huge_pages_recovery_ratio(const char *val, const struct kernel_param *kp);
 
 static struct kernel_param_ops nx_huge_pages_ops = {
 	.set = set_nx_huge_pages,
 	.get = param_get_bool,
 };
 
+static struct kernel_param_ops nx_huge_pages_recovery_ratio_ops = {
+	.set = set_nx_huge_pages_recovery_ratio,
+	.get = param_get_uint,
+};
+
 module_param_cb(nx_huge_pages, &nx_huge_pages_ops, &nx_huge_pages, 0644);
 __MODULE_PARM_TYPE(nx_huge_pages, "bool");
+module_param_cb(nx_huge_pages_recovery_ratio, &nx_huge_pages_recovery_ratio_ops,
+		&nx_huge_pages_recovery_ratio, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
 
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
@@ -1215,6 +1226,8 @@ static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 		return;
 
 	++kvm->stat.nx_lpage_splits;
+	list_add_tail(&sp->lpage_disallowed_link,
+		      &kvm->arch.lpage_disallowed_mmu_pages);
 	sp->lpage_disallowed = true;
 }
 
@@ -1239,6 +1252,7 @@ static void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	--kvm->stat.nx_lpage_splits;
 	sp->lpage_disallowed = false;
+	list_del(&sp->lpage_disallowed_link);
 }
 
 static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -6268,6 +6282,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 			idx = srcu_read_lock(&kvm->srcu);
 			kvm_mmu_zap_all_fast(kvm);
 			srcu_read_unlock(&kvm->srcu, idx);
+
+			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
 		}
 		mutex_unlock(&kvm_lock);
 	}
@@ -6361,3 +6377,116 @@ void kvm_mmu_module_exit(void)
 	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
+static int set_nx_huge_pages_recovery_ratio(const char *val, const struct kernel_param *kp)
+{
+	unsigned int old_val;
+	int err;
+
+	old_val = nx_huge_pages_recovery_ratio;
+	err = param_set_uint(val, kp);
+	if (err)
+		return err;
+
+	if (READ_ONCE(nx_huge_pages) &&
+	    !old_val && nx_huge_pages_recovery_ratio) {
+		struct kvm *kvm;
+
+		mutex_lock(&kvm_lock);
+
+		list_for_each_entry(kvm, &vm_list, vm_list)
+			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
+
+		mutex_unlock(&kvm_lock);
+	}
+
+	return err;
+}
+
+static void kvm_recover_nx_lpages(struct kvm *kvm)
+{
+	int rcu_idx;
+	struct kvm_mmu_page *sp;
+	unsigned int ratio;
+	LIST_HEAD(invalid_list);
+	ulong to_zap;
+
+	rcu_idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+
+	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
+	while (to_zap && !list_empty(&kvm->arch.lpage_disallowed_mmu_pages)) {
+		/*
+		 * We use a separate list instead of just using active_mmu_pages
+		 * because the number of lpage_disallowed pages is expected to
+		 * be relatively small compared to the total.
+		 */
+		sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
+				      struct kvm_mmu_page,
+				      lpage_disallowed_link);
+		WARN_ON_ONCE(!sp->lpage_disallowed);
+		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		WARN_ON_ONCE(sp->lpage_disallowed);
+
+		if (!--to_zap || need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			kvm_mmu_commit_zap_page(kvm, &invalid_list);
+			if (to_zap)
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, rcu_idx);
+}
+
+static long get_nx_lpage_recovery_timeout(u64 start_time)
+{
+	return READ_ONCE(nx_huge_pages) && READ_ONCE(nx_huge_pages_recovery_ratio)
+		? start_time + 60 * HZ - get_jiffies_64()
+		: MAX_SCHEDULE_TIMEOUT;
+}
+
+static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
+{
+	u64 start_time;
+	long remaining_time;
+
+	while (true) {
+		start_time = get_jiffies_64();
+		remaining_time = get_nx_lpage_recovery_timeout(start_time);
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		while (!kthread_should_stop() && remaining_time > 0) {
+			schedule_timeout(remaining_time);
+			remaining_time = get_nx_lpage_recovery_timeout(start_time);
+			set_current_state(TASK_INTERRUPTIBLE);
+		}
+
+		set_current_state(TASK_RUNNING);
+
+		if (kthread_should_stop())
+			return 0;
+
+		kvm_recover_nx_lpages(kvm);
+	}
+}
+
+int kvm_mmu_post_init_vm(struct kvm *kvm)
+{
+	int err;
+
+	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 0,
+					  "kvm-nx-lpage-recovery",
+					  &kvm->arch.nx_lpage_recovery_thread);
+	if (!err)
+		kthread_unpark(kvm->arch.nx_lpage_recovery_thread);
+
+	return err;
+}
+
+void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
+{
+	if (kvm->arch.nx_lpage_recovery_thread)
+		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
+}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 11f8ec89433b..d55674f44a18 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -210,4 +210,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
+
+int kvm_mmu_post_init_vm(struct kvm *kvm);
+void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d7fad762007..655e8a932ab9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9422,6 +9422,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
+	INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
 	atomic_set(&kvm->arch.noncoherent_dma_count, 0);
 
@@ -9450,6 +9451,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	return kvm_x86_ops->vm_init(kvm);
 }
 
+int kvm_arch_post_init_vm(struct kvm *kvm)
+{
+	return kvm_mmu_post_init_vm(kvm);
+}
+
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 {
 	vcpu_load(vcpu);
@@ -9551,6 +9557,11 @@ int x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
 }
 EXPORT_SYMBOL_GPL(x86_set_memory_region);
 
+void kvm_arch_pre_destroy_vm(struct kvm *kvm)
+{
+	kvm_mmu_pre_destroy_vm(kvm);
+}
+
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
 	if (current->mm == kvm->mm) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 513b49be83e0..f6d4385aad65 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -626,6 +626,23 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
 	return 0;
 }
 
+/*
+ * Called after the VM is otherwise initialized, but just before adding it to
+ * the vm_list.
+ */
+int __weak kvm_arch_post_init_vm(struct kvm *kvm)
+{
+	return 0;
+}
+
+/*
+ * Called just after removing the VM from the vm_list, but before doing any
+ * other destruction.
+ */
+void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
+{
+}
+
 static struct kvm *kvm_create_vm(unsigned long type)
 {
 	int r, i;
@@ -676,11 +693,15 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		rcu_assign_pointer(kvm->buses[i],
 			kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT));
 		if (!kvm->buses[i])
-			goto out_err;
+			goto out_err_no_mmu_notifier;
 	}
 
 	r = kvm_init_mmu_notifier(kvm);
 	if (r)
+		goto out_err_no_mmu_notifier;
+
+	r = kvm_arch_post_init_vm(kvm);
+	if (r)
 		goto out_err;
 
 	mutex_lock(&kvm_lock);
@@ -692,6 +713,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	return kvm;
 
 out_err:
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+	if (kvm->mmu_notifier.ops)
+		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
+#endif
+out_err_no_mmu_notifier:
 	cleanup_srcu_struct(&kvm->irq_srcu);
 out_err_no_irq_srcu:
 	cleanup_srcu_struct(&kvm->srcu);
@@ -734,6 +760,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	mutex_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	mutex_unlock(&kvm_lock);
+	kvm_arch_pre_destroy_vm(kvm);
+
 	kvm_free_irq_routing(kvm);
 	for (i = 0; i < KVM_NR_BUSES; i++) {
 		struct kvm_io_bus *bus = kvm_get_bus(kvm, i);
-- 
1.8.3.1

* [MODERATED] Re: [PATCH v6 1/5] NX 1
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 1/5] NX 1 Paolo Bonzini
@ 2019-10-11 13:18   ` Konrad Rzeszutek Wilk
  2019-10-15 11:24   ` Thomas Gleixner
  1 sibling, 0 replies; 12+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-10-11 13:18 UTC (permalink / raw)
  To: speck

On Fri, Oct 11, 2019 at 12:40:12PM +0200, speck for Paolo Bonzini wrote:
> 
> Some processors may incur a machine check error possibly
> resulting in an unrecoverable cpu hang when an instruction fetch
> encounters a TLB multi-hit in the instruction TLB. This can occur
> when the page size is changed along with either the physical
> address or cache type [1].
> 
> This issue affects both bare-metal x86 page tables and EPT.
> 
> This can be mitigated by either eliminating the use of large
> pages or by using careful TLB invalidations when changing the
> page size in the page tables.
> 
> Just like Spectre, Meltdown, L1TF and MDS, a new bit has been
> allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will
> be set on CPUs which are mitigated against this issue.
> 
> [1] For example please refer to erratum SKL002 in "6th Generation
> Intel Processor Family Specification Update"
> https://www.intel.com/content/www/us/en/products/docs/processors/core/desktop-6th-gen-core-family-spec-update.html
> https://www.google.com/search?q=site:intel.com+SKL002
> 
> Many processors other than Skylake are also affected, and the erratum
> (referred to above) does not fully disclose the issue and its impact,
> either on Skylake or across all the affected CPUs.
> 
> Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com>
> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu |  1 +
>  arch/x86/include/asm/cpufeatures.h                 |  1 +
>  arch/x86/include/asm/msr-index.h                   |  7 +++
>  arch/x86/kernel/cpu/bugs.c                         | 13 ++++
>  arch/x86/kernel/cpu/common.c                       | 71 ++++++++++++----------
>  drivers/base/cpu.c                                 |  8 +++
>  include/linux/cpu.h                                |  2 +
>  7 files changed, 70 insertions(+), 33 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 06d0931119cc..55bf5e1538ad 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -486,6 +486,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
>  		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
>  		/sys/devices/system/cpu/vulnerabilities/l1tf
>  		/sys/devices/system/cpu/vulnerabilities/mds
> +		/sys/devices/system/cpu/vulnerabilities/itlb_multihit
>  Date:		January 2018
>  Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
>  Description:	Information about CPU vulnerabilities
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 0652d3eed9bd..66aaad7611b2 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -399,5 +399,6 @@
>  #define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
>  #define X86_BUG_MSBDS_ONLY		X86_BUG(20) /* CPU is only affected by the  MSDBS variant of BUG_MDS */
>  #define X86_BUG_SWAPGS			X86_BUG(21) /* CPU is affected by speculation through SWAPGS */
> +#define X86_BUG_ITLB_MULTIHIT		X86_BUG(22) /* CPU may incur MCE during certain page attribute changes */
>  
>  #endif /* _ASM_X86_CPUFEATURES_H */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 20ce682a2540..c678899b21db 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -93,6 +93,13 @@
>  						  * Microarchitectural Data
>  						  * Sampling (MDS) vulnerabilities.
>  						  */
> +#define ARCH_CAP_PSCHANGE_MC_NO		BIT(6)	 /*
> +						  * The processor is not susceptible to a
> +						  * machine check error due to modifying the
> +						  * code page size along with either the
> +						  * physical address or cache type
> +						  * without TLB invalidation.
> +						  */
>  
>  #define MSR_IA32_FLUSH_CMD		0x0000010b
>  #define L1D_FLUSH			BIT(0)	/*
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 91c2561b905f..ecd0126648ea 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1311,6 +1311,11 @@ static ssize_t l1tf_show_state(char *buf)
>  }
>  #endif
>  
> +static ssize_t itlb_multihit_show_state(char *buf)
> +{
> +	return sprintf(buf, "Processor vulnerable\n");
> +}
> +
>  static ssize_t mds_show_state(char *buf)
>  {
>  	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
> @@ -1398,6 +1403,9 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
>  	case X86_BUG_MDS:
>  		return mds_show_state(buf);
>  
> +	case X86_BUG_ITLB_MULTIHIT:
> +		return itlb_multihit_show_state(buf);
> +
>  	default:
>  		break;
>  	}
> @@ -1434,4 +1442,9 @@ ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *bu
>  {
>  	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
>  }
> +
> +ssize_t cpu_show_itlb_multihit(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	return cpu_show_common(dev, attr, buf, X86_BUG_ITLB_MULTIHIT);
> +}
>  #endif
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 9ae7d1bcd4f4..fc00b2349a9f 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1016,13 +1016,14 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
>  #endif
>  }
>  
> -#define NO_SPECULATION	BIT(0)
> -#define NO_MELTDOWN	BIT(1)
> -#define NO_SSB		BIT(2)
> -#define NO_L1TF		BIT(3)
> -#define NO_MDS		BIT(4)
> -#define MSBDS_ONLY	BIT(5)
> -#define NO_SWAPGS	BIT(6)
> +#define NO_SPECULATION		BIT(0)
> +#define NO_MELTDOWN		BIT(1)
> +#define NO_SSB			BIT(2)
> +#define NO_L1TF			BIT(3)
> +#define NO_MDS			BIT(4)
> +#define MSBDS_ONLY		BIT(5)
> +#define NO_SWAPGS		BIT(6)
> +#define NO_ITLB_MULTIHIT	BIT(7)
>  
>  #define VULNWL(_vendor, _family, _model, _whitelist)	\
>  	{ X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
> @@ -1043,27 +1044,27 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
>  	VULNWL(NSC,	5, X86_MODEL_ANY,	NO_SPECULATION),
>  
>  	/* Intel Family 6 */
> -	VULNWL_INTEL(ATOM_SALTWELL,		NO_SPECULATION),
> -	VULNWL_INTEL(ATOM_SALTWELL_TABLET,	NO_SPECULATION),
> -	VULNWL_INTEL(ATOM_SALTWELL_MID,		NO_SPECULATION),
> -	VULNWL_INTEL(ATOM_BONNELL,		NO_SPECULATION),
> -	VULNWL_INTEL(ATOM_BONNELL_MID,		NO_SPECULATION),
> -
> -	VULNWL_INTEL(ATOM_SILVERMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> -	VULNWL_INTEL(ATOM_SILVERMONT_D,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> -	VULNWL_INTEL(ATOM_SILVERMONT_MID,	NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> -	VULNWL_INTEL(ATOM_AIRMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> -	VULNWL_INTEL(XEON_PHI_KNL,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> -	VULNWL_INTEL(XEON_PHI_KNM,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> +	VULNWL_INTEL(ATOM_SALTWELL,		NO_SPECULATION | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_SALTWELL_TABLET,	NO_SPECULATION | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_SALTWELL_MID,		NO_SPECULATION | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_BONNELL,		NO_SPECULATION | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_BONNELL_MID,		NO_SPECULATION | NO_ITLB_MULTIHIT),
> +
> +	VULNWL_INTEL(ATOM_SILVERMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_SILVERMONT_D,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_SILVERMONT_MID,	NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_AIRMONT,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(XEON_PHI_KNL,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(XEON_PHI_KNM,		NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  
>  	VULNWL_INTEL(CORE_YONAH,		NO_SSB),
>  
> -	VULNWL_INTEL(ATOM_AIRMONT_MID,		NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
> -	VULNWL_INTEL(ATOM_AIRMONT_NP,		NO_L1TF | NO_SWAPGS),
> +	VULNWL_INTEL(ATOM_AIRMONT_MID,		NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_AIRMONT_NP,		NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  
> -	VULNWL_INTEL(ATOM_GOLDMONT,		NO_MDS | NO_L1TF | NO_SWAPGS),
> -	VULNWL_INTEL(ATOM_GOLDMONT_D,		NO_MDS | NO_L1TF | NO_SWAPGS),
> -	VULNWL_INTEL(ATOM_GOLDMONT_PLUS,	NO_MDS | NO_L1TF | NO_SWAPGS),
> +	VULNWL_INTEL(ATOM_GOLDMONT,		NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_GOLDMONT_D,		NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_INTEL(ATOM_GOLDMONT_PLUS,	NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  
>  	/*
>  	 * Technically, swapgs isn't serializing on AMD (despite it previously
> @@ -1074,14 +1075,14 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
>  	 */
>  
>  	/* AMD Family 0xf - 0x12 */
> -	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
> -	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
> -	VULNWL_AMD(0x11,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
> -	VULNWL_AMD(0x12,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
> +	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_AMD(0x11,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_AMD(0x12,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  
>  	/* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
> -	VULNWL_AMD(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS),
> -	VULNWL_HYGON(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS),
> +	VULNWL_AMD(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
> +	VULNWL_HYGON(X86_FAMILY_ANY,	NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  	{}
>  };
>  
> @@ -1096,15 +1097,19 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
>  {
>  	u64 ia32_cap = 0;
>  
> +	if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
> +		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
> +
> +	/* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
> +	if (!cpu_matches(NO_ITLB_MULTIHIT) && !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
> +		setup_force_cpu_bug(X86_BUG_ITLB_MULTIHIT);
> +
>  	if (cpu_matches(NO_SPECULATION))
>  		return;
>  
>  	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
>  	setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
>  
> -	if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
> -		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
> -
>  	if (!cpu_matches(NO_SSB) && !(ia32_cap & ARCH_CAP_SSB_NO) &&
>  	   !cpu_has(c, X86_FEATURE_AMD_SSB_NO))
>  		setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index cc37511de866..f1a6e020ed8d 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -554,12 +554,19 @@ ssize_t __weak cpu_show_mds(struct device *dev,
>  	return sprintf(buf, "Not affected\n");
>  }
>  
> +ssize_t __weak cpu_show_itlb_multihit(struct device *dev,
> +			    struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "Not affected\n");
> +}
> +
>  static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
>  static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
>  static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
>  static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
>  static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
>  static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
> +static DEVICE_ATTR(itlb_multihit, 0444, cpu_show_itlb_multihit, NULL);
>  
>  static struct attribute *cpu_root_vulnerabilities_attrs[] = {
>  	&dev_attr_meltdown.attr,
> @@ -568,6 +575,7 @@ ssize_t __weak cpu_show_mds(struct device *dev,
>  	&dev_attr_spec_store_bypass.attr,
>  	&dev_attr_l1tf.attr,
>  	&dev_attr_mds.attr,
> +	&dev_attr_itlb_multihit.attr,
>  	NULL
>  };
>  
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index d0633ebdaa9c..038866a28f2c 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -59,6 +59,8 @@ extern ssize_t cpu_show_l1tf(struct device *dev,
>  			     struct device_attribute *attr, char *buf);
>  extern ssize_t cpu_show_mds(struct device *dev,
>  			    struct device_attribute *attr, char *buf);
> +extern ssize_t cpu_show_itlb_multihit(struct device *dev,
> +				      struct device_attribute *attr, char *buf);
>  
>  extern __printf(4, 5)
>  struct device *cpu_device_create(struct device *parent, void *drvdata,
> -- 
> 1.8.3.1

* [MODERATED] Re: [PATCH v6 2/5] NX 2
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 2/5] NX 2 Paolo Bonzini
@ 2019-10-11 13:18   ` Konrad Rzeszutek Wilk
  2019-10-15 19:37   ` Tyler Hicks
  1 sibling, 0 replies; 12+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-10-11 13:18 UTC (permalink / raw)
  To: speck

On Fri, Oct 11, 2019 at 03:40:13AM -0700, speck for Paolo Bonzini wrote:
> 
> This patch adds new cpu family ATOM_TREMONT_X to the cpu vulnerability
> whitelist. ATOM_TREMONT_X is not affected by X86_BUG_ITLB_MULTIHIT. There
> may be more bugs not affecting ATOM_TREMONT_X which are not known at
> this point and could be added later.
> 
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---
>  arch/x86/kernel/cpu/common.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index fc00b2349a9f..c652ca9dc046 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1074,6 +1074,8 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
>  	 * good enough for our purposes.
>  	 */
>  
> +	VULNWL_INTEL(ATOM_TREMONT_D,		NO_ITLB_MULTIHIT),
> +
>  	/* AMD Family 0xf - 0x12 */
>  	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
> -- 
> 1.8.3.1

* Re: [PATCH v6 1/5] NX 1
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 1/5] NX 1 Paolo Bonzini
  2019-10-11 13:18   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-10-15 11:24   ` Thomas Gleixner
  2019-10-15 19:48     ` [MODERATED] " Paolo Bonzini
  1 sibling, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2019-10-15 11:24 UTC (permalink / raw)
  To: speck

On Fri, 11 Oct 2019, speck for Paolo Bonzini wrote:

This one and all other patches lack a

  From: ...
  Subject: ...

in the mail body, which requires manual fiddling with those patches when
applying them.

> Some processors may incur a machine check error possibly
> resulting in an unrecoverable cpu hang when an instruction fetch
> encounters a TLB multi-hit in the instruction TLB. This can occur
> when the page size is changed along with either the physical
> address or cache type [1].
> 
> This issue affects both bare-metal x86 page tables and EPT.
> 
> This can be mitigated by either eliminating the use of large
> pages or by using careful TLB invalidations when changing the
> page size in the page tables.
> 
> Just like Spectre, Meltdown, L1TF and MDS, a new bit has been
> allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will
> be set on CPUs which are mitigated against this issue.
> 
> [1] For example please refer to erratum SKL002 in "6th Generation
> Intel Processor Family Specification Update"
> https://www.intel.com/content/www/us/en/products/docs/processors/core/desktop-6th-gen-core-family-spec-update.html
> https://www.google.com/search?q=site:intel.com+SKL002

Please don't provide links like this. The first one will be outdated in
the worst case before this goes public. The second one gives the right
answer today, but will it still do so 5 years from now?

I asked a gazillion times in the past to stick a copy of the errata PDF
into bugzilla.kernel.org and provide a link to the BZ entry. That has a
decent chance to be a permalink.

Sure, we don't want a public bugzilla entry for this right now, but we can
create a dummy entry and update the subject plus upload the file when the
stuff goes public.

      https://bugzilla.kernel.org/show_bug.cgi?id=205195

is innocent enough for this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [MODERATED] Re: [PATCH v6 2/5] NX 2
  2019-10-11 10:40 ` [MODERATED] [PATCH v6 2/5] NX 2 Paolo Bonzini
  2019-10-11 13:18   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-10-15 19:37   ` Tyler Hicks
  1 sibling, 0 replies; 12+ messages in thread
From: Tyler Hicks @ 2019-10-15 19:37 UTC (permalink / raw)
  To: speck

On 2019-10-11 12:40:13, speck for Paolo Bonzini wrote:
> 
> This patch adds new cpu family ATOM_TREMONT_X to the cpu vulnerability
> whitelist. ATOM_TREMONT_X is not affected by X86_BUG_ITLB_MULTIHIT. There
> may be more bugs not affecting ATOM_TREMONT_X which are not known at
> this point and could be added later.

Please run s/TREMONT_X/TREMONT_D/ across the patch description to match
the code after commit 5ebb34edbefa ("x86/intel: Aggregate microserver
naming").

Tyler

> 
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kernel/cpu/common.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index fc00b2349a9f..c652ca9dc046 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1074,6 +1074,8 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
>  	 * good enough for our purposes.
>  	 */
>  
> +	VULNWL_INTEL(ATOM_TREMONT_D,		NO_ITLB_MULTIHIT),
> +
>  	/* AMD Family 0xf - 0x12 */
>  	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
>  	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
> -- 
> 1.8.3.1
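
For readers following the whitelist hunk quoted above: the description of
patch 1/5 says a new MSR_IA32_ARCH_CAPABILITIES bit (PSCHANGE_MC_NO) is set
on CPUs that are not affected, and this patch adds a NO_ITLB_MULTIHIT
whitelist entry for ATOM_TREMONT_D. A minimal userspace sketch of how the two
sources could combine is below. It is not the code from the series. The
function name itlb_multihit_affected() and the value chosen for
NO_ITLB_MULTIHIT are hypothetical, and the bit position used for
PSCHANGE_MC_NO (bit 6) is an assumption that does not appear anywhere in this
thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed position of PSCHANGE_MC_NO in IA32_ARCH_CAPABILITIES.
 * The thread names the bit but does not give its position.
 */
#define ARCH_CAP_PSCHANGE_MC_NO	(1ULL << 6)

/*
 * Hypothetical whitelist flag value for illustration only; the real
 * NO_ITLB_MULTIHIT definition is not shown in the quoted hunk.
 */
#define NO_ITLB_MULTIHIT	(1U << 0)

/*
 * Sketch: a CPU is treated as affected unless its model is whitelisted
 * or the hardware itself reports PSCHANGE_MC_NO.
 *
 * whitelist_flags: flags from a matching whitelist entry, 0 if none.
 * arch_cap:        value read from IA32_ARCH_CAPABILITIES, 0 if the
 *                  MSR is not available.
 */
bool itlb_multihit_affected(unsigned int whitelist_flags, uint64_t arch_cap)
{
	if (whitelist_flags & NO_ITLB_MULTIHIT)
		return false;

	if (arch_cap & ARCH_CAP_PSCHANGE_MC_NO)
		return false;

	return true;
}
```

With this shape, a whitelist entry such as the ATOM_TREMONT_D one only matters
for parts that do not set the capability bit themselves; either source alone
is enough to treat the CPU as unaffected.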

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [MODERATED] Re: [PATCH v6 1/5] NX 1
  2019-10-15 11:24   ` Thomas Gleixner
@ 2019-10-15 19:48     ` Paolo Bonzini
  2019-10-23 16:18       ` Tyler Hicks
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2019-10-15 19:48 UTC (permalink / raw)
  To: speck

On 15/10/19 13:24, speck for Thomas Gleixner wrote:
> On Fri, 11 Oct 2019, speck for Paolo Bonzini wrote:
> 
> This one and all other patches lack a
> 
>   From: ...
>   Subject: ...
>   
> in the mail body, which requires manual fiddling with those patches when
> applying them.
> 
>       https://bugzilla.kernel.org/show_bug.cgi?id=205195
> 
> is innocent enough for this.

Will fix and resend.

Paolo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [MODERATED] Re: [PATCH v6 1/5] NX 1
  2019-10-15 19:48     ` [MODERATED] " Paolo Bonzini
@ 2019-10-23 16:18       ` Tyler Hicks
  0 siblings, 0 replies; 12+ messages in thread
From: Tyler Hicks @ 2019-10-23 16:18 UTC (permalink / raw)
  To: speck

On 2019-10-15 21:48:13, speck for Paolo Bonzini wrote:
> On 15/10/19 13:24, speck for Thomas Gleixner wrote:
> > On Fri, 11 Oct 2019, speck for Paolo Bonzini wrote:
> > 
> > This one and all other patches lack a
> > 
> >   From: ...
> >   Subject: ...
> >   
> > in the mail body, which requires manual fiddling with those patches when
> > applying them.
> > 
> >       https://bugzilla.kernel.org/show_bug.cgi?id=205195
> > 
> > is innocent enough for this.
> 
> Will fix and resend.

Hi Paolo - I wanted to check in to see if a v7 could be sent out soon so
that final reviews can happen and backports can be started. Thanks!

Tyler

> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-10-23 16:18 UTC | newest]

Thread overview: 12+ messages
2019-10-11 10:40 [MODERATED] [PATCH v6 0/5] NX 0 Paolo Bonzini
2019-10-11 10:40 ` [MODERATED] [PATCH v6 1/5] NX 1 Paolo Bonzini
2019-10-11 13:18   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-10-15 11:24   ` Thomas Gleixner
2019-10-15 19:48     ` [MODERATED] " Paolo Bonzini
2019-10-23 16:18       ` Tyler Hicks
2019-10-11 10:40 ` [MODERATED] [PATCH v6 2/5] NX 2 Paolo Bonzini
2019-10-11 13:18   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-10-15 19:37   ` Tyler Hicks
2019-10-11 10:40 ` [MODERATED] [PATCH v6 3/5] NX 3 Paolo Bonzini
2019-10-11 10:40 ` [MODERATED] [PATCH v6 4/5] NX 4 Paolo Bonzini
2019-10-11 10:40 ` [MODERATED] [PATCH v6 5/5] NX 5 Paolo Bonzini
