linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
@ 2014-07-04  1:23 Stewart Smith
  2014-07-08  5:06 ` [PATCH v2] " Stewart Smith
  0 siblings, 1 reply; 21+ messages in thread
From: Stewart Smith @ 2014-07-04  1:23 UTC (permalink / raw)
  To: linuxppc-dev, paulus; +Cc: Stewart Smith

The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has a way to store and load the contents of the
L2 cache, or of the L2 plus the MRU way of the L3 cache". We initiate
storing of the log (a list of cache line addresses) using the logmpp
instruction and start the restore by writing to an SPR.

The logmpp instruction takes its parameters in a single 64-bit register:
- starting address of the table in which to store the log of L2/L2+L3
  cache contents
  - 32kB for L2
  - 128kB for L2+L3
  - aligned to the maximum size of the table (32kB or 128kB)
- log control (no-op, L2 only, L2 and L3, abort logout)

We should abort any ongoing logging before initiating a new one.

To initiate the restore, we write to the MPPR SPR. The format of what to
write to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, or read until the end of the table)
- prefetch rate (from fastest possible to slower: roughly every 8, 16, 24
  or 32 cycles)
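
As a rough illustration only (using the defines this patch adds below; "buf"
here stands for the physical address of a suitably sized and aligned log
buffer), composing these values looks something like:

    /* sketch: start logging L2 contents into the table at 'buf' */
    unsigned long cmd = (buf & PPC_MPPE_ADDRESS_MASK) | PPC_MPPE_LOG_L2;
    asm volatile(PPC_LOGMPP(R1) : : "r" (cmd));

    /* later: prefetch the whole logged table back into L2 */
    cmd = (buf & PPC_MPPE_ADDRESS_MASK) | PPC_MPPE_WHOLE_TABLE;
    mtspr(SPRN_MPPR, cmd);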

The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.

The best-case scenario for doing this is when some vcores are running very
cache-heavy workloads. The worst case is when they get almost no cache hits,
so we just generate needless memory operations.

This implementation just does L2 store/load. In my benchmarks this proves
to be useful.

Benchmark 1:
 - 16 core POWER8
 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
 - No split core/SMT
 - two guests running sysbench memory test.
   sysbench --test=memory --num-threads=8 run
 - one guest running apache bench (of default HTML page)
   ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure the performance of a real-world application
(Apache) while the other guests are cache-hot with their own workloads. The
sysbench memory benchmark does pointer-sized writes to a (small) memory
buffer in a loop.

In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times was largely unchanged.

Benchmark 2:
 - Same VM config as benchmark 1
 - all three guests running the sysbench memory benchmark

This benchmark aims to see whether there is a positive or negative effect on
this cache-heavy workload. Due to the nature of the benchmark (stores), we may
not see a difference in raw performance, but hopefully an improvement in
consistency of performance (when a vcore is switched in, it doesn't have to
wait repeatedly for cachelines to be pulled in).

The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.

Benchmark 3:
 - same 3 guests and CPU configuration as benchmarks 1 and 2
 - two idle guests
 - one guest running the STREAM benchmark

This scenario also saw a performance improvement with this patch. On the Copy
and Scale workloads from STREAM, I got a 5-6% improvement with this patch. For
Add and Triad, it was around 10% (or more).

Benchmark 4:
 - same 3 guests as previous benchmarks
 - two guests running sysbench --memory, a distinctly different cache-heavy
   workload
 - one guest running the STREAM benchmark

Similar improvements to benchmark 3.

Benchmark 5:
 - 1 guest, 8 VCPUs, Ubuntu 14.04
 - Host configured with split core (SMT8, subcores-per-core=4)
 - STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board
of STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/kvm_host.h   |    1 +
 arch/powerpc/include/asm/ppc-opcode.h |   10 +++++++
 arch/powerpc/include/asm/reg.h        |    1 +
 arch/powerpc/kvm/book3s_hv.c          |   53 ++++++++++++++++++++++++++++++++-
 4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1eaea2d..5c0e9fc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -305,6 +305,7 @@ struct kvmppc_vcore {
 	u32 arch_compat;
 	ulong pcr;
 	ulong dpdes;		/* doorbell state (POWER8) */
+	unsigned long mppe; /* Micro Partition Prefetch buffer */
 };
 
 #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 3132bb9..6201440 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -139,6 +139,7 @@
 #define PPC_INST_ISEL			0x7c00001e
 #define PPC_INST_ISEL_MASK		0xfc00003e
 #define PPC_INST_LDARX			0x7c0000a8
+#define PPC_INST_LOGMPP			0x7c0007e4
 #define PPC_INST_LSWI			0x7c0004aa
 #define PPC_INST_LSWX			0x7c00042a
 #define PPC_INST_LWARX			0x7c000028
@@ -275,6 +276,13 @@
 #define __PPC_EH(eh)	0
 #endif
 
+/* POWER8 Micro Partition Prefetch parameters */
+#define PPC_MPPE_ADDRESS_MASK 0xffffffffc000
+#define PPC_MPPE_WHOLE_TABLE (0x2ULL << 60)
+#define PPC_MPPE_LOG_L2 (0x02ULL << 54)
+#define PPC_MPPE_LOG_L2L3 (0x01ULL << 54)
+#define PPC_MPPE_LOG_ABORT (0x03ULL << 54)
+
 /* Deal with instructions that older assemblers aren't aware of */
 #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
 					__PPC_RA(a) | __PPC_RB(b))
@@ -283,6 +291,8 @@
 #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
+#define PPC_LOGMPP(b)		stringify_in_c(.long PPC_INST_LOGMPP | \
+					__PPC_RB(b))
 #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e5d2e0b..5164beb 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -224,6 +224,7 @@
 #define   CTRL_TE	0x00c00000	/* thread enable */
 #define   CTRL_RUNLATCH	0x1
 #define SPRN_DAWR	0xB4
+#define SPRN_MPPR	0xB8	/* Micro Partition Prefetch Register */
 #define SPRN_CIABR	0xBB
 #define   CIABR_PRIV		0x3
 #define   CIABR_PRIV_USER	1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8227dba..d19906e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1528,6 +1528,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 	int i, need_vpa_update;
 	int srcu_idx;
 	struct kvm_vcpu *vcpus_to_update[threads_per_core];
+	phys_addr_t phy_addr, tmp;
 
 	/* don't start if any threads have a signal pending */
 	need_vpa_update = 0;
@@ -1590,9 +1591,51 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 
+	/* If we have a saved list of L2/L3, restore it */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mppe) {
+		phy_addr = virt_to_phys((void *)vc->mppe);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
+#endif
+		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
+		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
+
+		/* For sanity, abort any 'save' requests in progress */
+		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
+
+		/* Inititate a cache-load request */
+		mtspr(SPRN_MPPR, tmp);
+	}
+
+	/* Allocate memory before switching out of guest so we don't
+	   trash L2/L3 with memory allocation stuff */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mppe) {
+#if defined(CONFIG_PPC_64K_PAGES)
+		vc->mppe = __get_free_pages(GFP_KERNEL|__GFP_ZERO, 0);
+#elif defined(CONFIG_PPC_4K_PAGES)
+		vc->mppe = __get_free_pages(GFP_KERNEL|__GFP_ZERO, 4);
+#endif
+	}
+
 	__kvmppc_vcore_entry();
 
 	spin_lock(&vc->lock);
+
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mppe) {
+		phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mppe);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
+#endif
+		tmp = PPC_MPPE_ADDRESS_MASK & phy_addr;
+		tmp = tmp | PPC_MPPE_LOG_L2;
+
+		/* Abort any existing 'fetch' operations for this core */
+		mtspr(SPRN_MPPR, tmp&0x0fffffffffffffff);
+
+		/* Finally, issue logmpp to save cache contents for L2 */
+		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
+	}
+
 	/* disable sending of IPIs on virtual external irqs */
 	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
 		vcpu->cpu = -1;
@@ -2329,8 +2372,16 @@ static void kvmppc_free_vcores(struct kvm *kvm)
 {
 	long int i;
 
-	for (i = 0; i < KVM_MAX_VCORES; ++i)
+	for (i = 0; i < KVM_MAX_VCORES; ++i) {
+		if (kvm->arch.vcores[i] && kvm->arch.vcores[i]->mppe) {
+#if defined(CONFIG_PPC_64K_PAGES)
+			free_pages(kvm->arch.vcores[i]->mppe, 0);
+#elif defined(CONFIG_PPC_4K_PAGES)
+			free_pages(kvm->arch.vcores[i]->mppe, 4);
+#endif
+		}
 		kfree(kvm->arch.vcores[i]);
+	}
 	kvm->arch.online_vcores = 0;
 }
 
-- 
1.7.10.4


* [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-04  1:23 [PATCH] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 Stewart Smith
@ 2014-07-08  5:06 ` Stewart Smith
  2014-07-08 10:41   ` Alexander Graf
  2014-07-17  3:19   ` [PATCH v3] " Stewart Smith
  0 siblings, 2 replies; 21+ messages in thread
From: Stewart Smith @ 2014-07-08  5:06 UTC (permalink / raw)
  To: linuxppc-dev, paulus, kvm-ppc; +Cc: Stewart Smith

The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has a way to store and load the contents of the
L2 cache, or of the L2 plus the MRU way of the L3 cache". We initiate
storing of the log (a list of cache line addresses) using the logmpp
instruction and start the restore by writing to an SPR.

The logmpp instruction takes its parameters in a single 64-bit register:
- starting address of the table in which to store the log of L2/L2+L3
  cache contents
  - 32kB for L2
  - 128kB for L2+L3
  - aligned to the maximum size of the table (32kB or 128kB)
- log control (no-op, L2 only, L2 and L3, abort logout)

We should abort any ongoing logging before initiating a new one.

To initiate the restore, we write to the MPPR SPR. The format of what to
write to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, or read until the end of the table)
- prefetch rate (from fastest possible to slower: roughly every 8, 16, 24
  or 32 cycles)

The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.

The best-case scenario for doing this is when some vcores are running very
cache-heavy workloads. The worst case is when they get almost no cache hits,
so we just generate needless memory operations.

This implementation just does L2 store/load. In my benchmarks this proves
to be useful.

Benchmark 1:
 - 16 core POWER8
 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
 - No split core/SMT
 - two guests running sysbench memory test.
   sysbench --test=memory --num-threads=8 run
 - one guest running apache bench (of default HTML page)
   ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure the performance of a real-world application
(Apache) while the other guests are cache-hot with their own workloads. The
sysbench memory benchmark does pointer-sized writes to a (small) memory
buffer in a loop.

In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times was largely unchanged.

Benchmark 2:
 - Same VM config as benchmark 1
 - all three guests running the sysbench memory benchmark

This benchmark aims to see whether there is a positive or negative effect on
this cache-heavy workload. Due to the nature of the benchmark (stores), we may
not see a difference in raw performance, but hopefully an improvement in
consistency of performance (when a vcore is switched in, it doesn't have to
wait repeatedly for cachelines to be pulled in).

The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.

Benchmark 3:
 - same 3 guests and CPU configuration as benchmarks 1 and 2
 - two idle guests
 - one guest running the STREAM benchmark

This scenario also saw a performance improvement with this patch. On the Copy
and Scale workloads from STREAM, I got a 5-6% improvement with this patch. For
Add and Triad, it was around 10% (or more).

Benchmark 4:
 - same 3 guests as previous benchmarks
 - two guests running sysbench --memory, a distinctly different cache-heavy
   workload
 - one guest running the STREAM benchmark

Similar improvements to benchmark 3.

Benchmark 5:
 - 1 guest, 8 VCPUs, Ubuntu 14.04
 - Host configured with split core (SMT8, subcores-per-core=4)
 - STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board
of STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

--
changes since v1:
- s/mppe/mpp_buffer/
- add MPP_BUFFER_ORDER define.
---
 arch/powerpc/include/asm/kvm_host.h   |    1 +
 arch/powerpc/include/asm/ppc-opcode.h |   10 ++++++
 arch/powerpc/include/asm/reg.h        |    1 +
 arch/powerpc/kvm/book3s_hv.c          |   54 ++++++++++++++++++++++++++++++++-
 4 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1eaea2d..83ed249 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -305,6 +305,7 @@ struct kvmppc_vcore {
 	u32 arch_compat;
 	ulong pcr;
 	ulong dpdes;		/* doorbell state (POWER8) */
+	unsigned long mpp_buffer; /* Micro Partition Prefetch buffer */
 };
 
 #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 3132bb9..6201440 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -139,6 +139,7 @@
 #define PPC_INST_ISEL			0x7c00001e
 #define PPC_INST_ISEL_MASK		0xfc00003e
 #define PPC_INST_LDARX			0x7c0000a8
+#define PPC_INST_LOGMPP			0x7c0007e4
 #define PPC_INST_LSWI			0x7c0004aa
 #define PPC_INST_LSWX			0x7c00042a
 #define PPC_INST_LWARX			0x7c000028
@@ -275,6 +276,13 @@
 #define __PPC_EH(eh)	0
 #endif
 
+/* POWER8 Micro Partition Prefetch parameters */
+#define PPC_MPPE_ADDRESS_MASK 0xffffffffc000
+#define PPC_MPPE_WHOLE_TABLE (0x2ULL << 60)
+#define PPC_MPPE_LOG_L2 (0x02ULL << 54)
+#define PPC_MPPE_LOG_L2L3 (0x01ULL << 54)
+#define PPC_MPPE_LOG_ABORT (0x03ULL << 54)
+
 /* Deal with instructions that older assemblers aren't aware of */
 #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
 					__PPC_RA(a) | __PPC_RB(b))
@@ -283,6 +291,8 @@
 #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
+#define PPC_LOGMPP(b)		stringify_in_c(.long PPC_INST_LOGMPP | \
+					__PPC_RB(b))
 #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e5d2e0b..5164beb 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -224,6 +224,7 @@
 #define   CTRL_TE	0x00c00000	/* thread enable */
 #define   CTRL_RUNLATCH	0x1
 #define SPRN_DAWR	0xB4
+#define SPRN_MPPR	0xB8	/* Micro Partition Prefetch Register */
 #define SPRN_CIABR	0xBB
 #define   CIABR_PRIV		0x3
 #define   CIABR_PRIV_USER	1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8227dba..41dab67 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -67,6 +67,13 @@
 /* Used as a "null" value for timebase values */
 #define TB_NIL	(~(u64)0)
 
+#if defined(CONFIG_PPC_64K_PAGES)
+#define MPP_BUFFER_ORDER	0
+#elif defined(CONFIG_PPC_4K_PAGES)
+#define MPP_BUFFER_ORDER	4
+#endif
+
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 	int i, need_vpa_update;
 	int srcu_idx;
 	struct kvm_vcpu *vcpus_to_update[threads_per_core];
+	phys_addr_t phy_addr, tmp;
 
 	/* don't start if any threads have a signal pending */
 	need_vpa_update = 0;
@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 
+	/* If we have a saved list of L2/L3, restore it */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
+		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
+#endif
+		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
+		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
+
+		/* For sanity, abort any 'save' requests in progress */
+		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
+
+		/* Inititate a cache-load request */
+		mtspr(SPRN_MPPR, tmp);
+	}
+
+	/* Allocate memory before switching out of guest so we don't
+	   trash L2/L3 with memory allocation stuff */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
+		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
+						  MPP_BUFFER_ORDER);
+	}
+
 	__kvmppc_vcore_entry();
 
 	spin_lock(&vc->lock);
+
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
+		phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
+#endif
+		tmp = PPC_MPPE_ADDRESS_MASK & phy_addr;
+		tmp = tmp | PPC_MPPE_LOG_L2;
+
+		/* Abort any existing 'fetch' operations for this core */
+		mtspr(SPRN_MPPR, tmp&0x0fffffffffffffff);
+
+		/* Finally, issue logmpp to save cache contents for L2 */
+		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
+	}
+
 	/* disable sending of IPIs on virtual external irqs */
 	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
 		vcpu->cpu = -1;
@@ -2329,8 +2376,13 @@ static void kvmppc_free_vcores(struct kvm *kvm)
 {
 	long int i;
 
-	for (i = 0; i < KVM_MAX_VCORES; ++i)
+	for (i = 0; i < KVM_MAX_VCORES; ++i) {
+		if (kvm->arch.vcores[i] && kvm->arch.vcores[i]->mpp_buffer) {
+			free_pages(kvm->arch.vcores[i]->mpp_buffer,
+				   MPP_BUFFER_ORDER);
+		}
 		kfree(kvm->arch.vcores[i]);
+	}
 	kvm->arch.online_vcores = 0;
 }
 
-- 
1.7.10.4


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-08  5:06 ` [PATCH v2] " Stewart Smith
@ 2014-07-08 10:41   ` Alexander Graf
  2014-07-08 22:59     ` Stewart Smith
  2014-07-17  3:19   ` [PATCH v3] " Stewart Smith
  1 sibling, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-07-08 10:41 UTC (permalink / raw)
  To: Stewart Smith, linuxppc-dev, paulus, kvm-ppc


On 08.07.14 07:06, Stewart Smith wrote:
> The POWER8 processor has a Micro Partition Prefetch Engine, which is
> a fancy way of saying "has way to store and load contents of L2 or
> L2+MRU way of L3 cache". We initiate the storing of the log (list of
> addresses) using the logmpp instruction and start restore by writing
> to a SPR.
>
> The logmpp instruction takes parameters in a single 64bit register:
> - starting address of the table to store log of L2/L2+L3 cache contents
>    - 32kb for L2
>    - 128kb for L2+L3
>    - Aligned relative to maximum size of the table (32kb or 128kb)
> - Log control (no-op, L2 only, L2 and L3, abort logout)
>
> We should abort any ongoing logging before initiating one.
>
> To initiate restore, we write to the MPPR SPR. The format of what to write
> to the SPR is similar to the logmpp instruction parameter:
> - starting address of the table to read from (same alignment requirements)
> - table size (no data, until end of table)
> - prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
>    32 cycles)
>
> The idea behind loading and storing the contents of L2/L3 cache is to
> reduce memory latency in a system that is frequently swapping vcores on
> a physical CPU.
>
> The best case scenario for doing this is when some vcores are doing very
> cache heavy workloads. The worst case is when they have about 0 cache hits,
> so we just generate needless memory operations.
>
> This implementation just does L2 store/load. In my benchmarks this proves
> to be useful.
>
> Benchmark 1:
>   - 16 core POWER8
>   - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
>   - No split core/SMT
>   - two guests running sysbench memory test.
>     sysbench --test=memory --num-threads=8 run
>   - one guest running apache bench (of default HTML page)
>     ab -n 490000 -c 400 http://localhost/
>
> This benchmark aims to measure performance of real world application (apache)
> where other guests are cache hot with their own workloads. The sysbench memory
> benchmark does pointer sized writes to a (small) memory buffer in a loop.
>
> In this benchmark with this patch I can see an improvement both in requests
> per second (~5%) and in mean and median response times (again, about 5%).
> The spread of minimum and maximum response times were largely unchanged.
>
> benchmark 2:
>   - Same VM config as benchmark 1
>   - all three guests running sysbench memory benchmark
>
> This benchmark aims to see if there is a positive or negative affect to this
> cache heavy benchmark. Although due to the nature of the benchmark (stores) we
> may not see a difference in performance, but rather hopefully an improvement
> in consistency of performance (when vcore switched in, don't have to wait
> many times for cachelines to be pulled in)
>
> The results of this benchmark are improvements in consistency of performance
> rather than performance itself. With this patch, the few outliers in duration
> go away and we get more consistent performance in each guest.
>
> benchmark 3:
>   - same 3 guests and CPU configuration as benchmark 1 and 2.
>   - two idle guests
>   - 1 guest running STREAM benchmark
>
> This scenario also saw performance improvement with this patch. On Copy and
> Scale workloads from STREAM, I got 5-6% improvement with this patch. For
> Add and triad, it was around 10% (or more).
>
> benchmark 4:
>   - same 3 guests as previous benchmarks
>   - two guests running sysbench --memory, distinctly different cache heavy
>     workload
>   - one guest running STREAM benchmark.
>
> Similar improvements to benchmark 3.
>
> benchmark 5:
>   - 1 guest, 8 VCPUs, Ubuntu 14.04
>   - Host configured with split core (SMT8, subcores-per-core=4)
>   - STREAM benchmark
>
> In this benchmark, we see a 10-20% performance improvement across the board
> of STREAM benchmark results with this patch.
>
> Based on preliminary investigation and microbenchmarks
> by Prerna Saxena <prerna@linux.vnet.ibm.com>
>
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>
> --
> changes since v1:
> - s/mppe/mpp_buffer/
> - add MPP_BUFFER_ORDER define.
> ---
>   arch/powerpc/include/asm/kvm_host.h   |    1 +
>   arch/powerpc/include/asm/ppc-opcode.h |   10 ++++++
>   arch/powerpc/include/asm/reg.h        |    1 +
>   arch/powerpc/kvm/book3s_hv.c          |   54 ++++++++++++++++++++++++++++++++-
>   4 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 1eaea2d..83ed249 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -305,6 +305,7 @@ struct kvmppc_vcore {
>   	u32 arch_compat;
>   	ulong pcr;
>   	ulong dpdes;		/* doorbell state (POWER8) */
> +	unsigned long mpp_buffer; /* Micro Partition Prefetch buffer */
>   };
>   
>   #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
> index 3132bb9..6201440 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -139,6 +139,7 @@
>   #define PPC_INST_ISEL			0x7c00001e
>   #define PPC_INST_ISEL_MASK		0xfc00003e
>   #define PPC_INST_LDARX			0x7c0000a8
> +#define PPC_INST_LOGMPP			0x7c0007e4
>   #define PPC_INST_LSWI			0x7c0004aa
>   #define PPC_INST_LSWX			0x7c00042a
>   #define PPC_INST_LWARX			0x7c000028
> @@ -275,6 +276,13 @@
>   #define __PPC_EH(eh)	0
>   #endif
>   
> +/* POWER8 Micro Partition Prefetch parameters */
> +#define PPC_MPPE_ADDRESS_MASK 0xffffffffc000
> +#define PPC_MPPE_WHOLE_TABLE (0x2ULL << 60)
> +#define PPC_MPPE_LOG_L2 (0x02ULL << 54)
> +#define PPC_MPPE_LOG_L2L3 (0x01ULL << 54)
> +#define PPC_MPPE_LOG_ABORT (0x03ULL << 54)
> +
>   /* Deal with instructions that older assemblers aren't aware of */
>   #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
>   					__PPC_RA(a) | __PPC_RB(b))
> @@ -283,6 +291,8 @@
>   #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
>   					___PPC_RT(t) | ___PPC_RA(a) | \
>   					___PPC_RB(b) | __PPC_EH(eh))
> +#define PPC_LOGMPP(b)		stringify_in_c(.long PPC_INST_LOGMPP | \
> +					__PPC_RB(b))
>   #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
>   					___PPC_RT(t) | ___PPC_RA(a) | \
>   					___PPC_RB(b) | __PPC_EH(eh))
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index e5d2e0b..5164beb 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -224,6 +224,7 @@
>   #define   CTRL_TE	0x00c00000	/* thread enable */
>   #define   CTRL_RUNLATCH	0x1
>   #define SPRN_DAWR	0xB4
> +#define SPRN_MPPR	0xB8	/* Micro Partition Prefetch Register */
>   #define SPRN_CIABR	0xBB
>   #define   CIABR_PRIV		0x3
>   #define   CIABR_PRIV_USER	1
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8227dba..41dab67 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -67,6 +67,13 @@
>   /* Used as a "null" value for timebase values */
>   #define TB_NIL	(~(u64)0)
>   
> +#if defined(CONFIG_PPC_64K_PAGES)
> +#define MPP_BUFFER_ORDER	0
> +#elif defined(CONFIG_PPC_4K_PAGES)
> +#define MPP_BUFFER_ORDER	4
> +#endif
> +
> +
>   static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
>   static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
>   
> @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>   	int i, need_vpa_update;
>   	int srcu_idx;
>   	struct kvm_vcpu *vcpus_to_update[threads_per_core];
> +	phys_addr_t phy_addr, tmp;

Please put the variable declarations into the if () branch so that the 
compiler can catch potential leaks :)
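
i.e. something along these lines (sketch only):

    if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
        phys_addr_t phy_addr, tmp;  /* only needed inside this branch */

        phy_addr = virt_to_phys((void *)vc->mpp_buffer);
        /* ... rest as before ... */
    }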

>   
>   	/* don't start if any threads have a signal pending */
>   	need_vpa_update = 0;
> @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>   
>   	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>   
> +	/* If we have a saved list of L2/L3, restore it */
> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
> +		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
> +#if defined(CONFIG_PPC_4K_PAGES)
> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);

get_free_pages() is automatically aligned to the order, no?

> +#endif
> +		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
> +		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
> +
> +		/* For sanity, abort any 'save' requests in progress */
> +		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
> +
> +		/* Inititate a cache-load request */
> +		mtspr(SPRN_MPPR, tmp);
> +	}

In fact, this whole block up here could be a function, no?

> +
> +	/* Allocate memory before switching out of guest so we don't
> +	   trash L2/L3 with memory allocation stuff */
> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
> +		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
> +						  MPP_BUFFER_ORDER);

get_order(64 * 1024)?

Also, why allocate it here and not on vcore creation?
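
i.e. roughly this (a sketch of the suggestion only; get_order() rounds a byte
count up to an allocation order):

    /* at vcore creation time rather than in kvmppc_run_core() */
    if (cpu_has_feature(CPU_FTR_ARCH_207S))
        vcore->mpp_buffer = __get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                             get_order(64 * 1024));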

> +	}
> +
>   	__kvmppc_vcore_entry();
>   
>   	spin_lock(&vc->lock);
> +
> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
> +		phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
> +#if defined(CONFIG_PPC_4K_PAGES)
> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
> +#endif
> +		tmp = PPC_MPPE_ADDRESS_MASK & phy_addr;
> +		tmp = tmp | PPC_MPPE_LOG_L2;
> +
> +		/* Abort any existing 'fetch' operations for this core */
> +		mtspr(SPRN_MPPR, tmp&0x0fffffffffffffff);

pretty magical, no?

> +
> +		/* Finally, issue logmpp to save cache contents for L2 */
> +		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
> +	}

This too should be a separate function.
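
Something like the following would capture both blocks (function names are
purely illustrative, and the abort handling is simplified):

    static void kvmppc_start_restoring_l2_cache(struct kvmppc_vcore *vc)
    {
        unsigned long cmd = virt_to_phys((void *)vc->mpp_buffer) &
                                PPC_MPPE_ADDRESS_MASK;

        /* abort any 'save' still in progress, then kick off the prefetch */
        asm volatile(PPC_LOGMPP(R1) : : "r" (cmd | PPC_MPPE_LOG_ABORT));
        mtspr(SPRN_MPPR, cmd | PPC_MPPE_WHOLE_TABLE);
    }

    static void kvmppc_start_saving_l2_cache(struct kvmppc_vcore *vc)
    {
        unsigned long cmd = virt_to_phys((void *)vc->mpp_buffer) &
                                PPC_MPPE_ADDRESS_MASK;

        /* abort any fetch in progress, then log the L2 contents */
        mtspr(SPRN_MPPR, cmd);
        asm volatile(PPC_LOGMPP(R1) : : "r" (cmd | PPC_MPPE_LOG_L2));
    }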


Alex

> +
>   	/* disable sending of IPIs on virtual external irqs */
>   	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
>   		vcpu->cpu = -1;
> @@ -2329,8 +2376,13 @@ static void kvmppc_free_vcores(struct kvm *kvm)
>   {
>   	long int i;
>   
> -	for (i = 0; i < KVM_MAX_VCORES; ++i)
> +	for (i = 0; i < KVM_MAX_VCORES; ++i) {
> +		if (kvm->arch.vcores[i] && kvm->arch.vcores[i]->mpp_buffer) {
> +			free_pages(kvm->arch.vcores[i]->mpp_buffer,
> +				   MPP_BUFFER_ORDER);
> +		}
>   		kfree(kvm->arch.vcores[i]);
> +	}
>   	kvm->arch.online_vcores = 0;
>   }
>   


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-08 10:41   ` Alexander Graf
@ 2014-07-08 22:59     ` Stewart Smith
  2014-07-10 11:05       ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Stewart Smith @ 2014-07-08 22:59 UTC (permalink / raw)
  To: Alexander Graf, linuxppc-dev, paulus, kvm-ppc

Hi!

Thanks for review, much appreciated!

Alexander Graf <agraf@suse.de> writes:
> On 08.07.14 07:06, Stewart Smith wrote:
>> @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>   	int i, need_vpa_update;
>>   	int srcu_idx;
>>   	struct kvm_vcpu *vcpus_to_update[threads_per_core];
>> +	phys_addr_t phy_addr, tmp;
>
> Please put the variable declarations into the if () branch so that the 
> compiler can catch potential leaks :)

ack. will fix.

>> @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>   
>>   	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>>   
>> +	/* If we have a saved list of L2/L3, restore it */
>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
>> +		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
>> +#if defined(CONFIG_PPC_4K_PAGES)
>> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
>
> get_free_pages() is automatically aligned to the order, no?

That's what Paul reckoned too, and we've since attempted to find anywhere
that documents that behaviour. Do you happen to be able to point to
docs/source that say this is part of the API?

>> +#endif
>> +		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
>> +		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
>> +
>> +		/* For sanity, abort any 'save' requests in progress */
>> +		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
>> +
>> +		/* Inititate a cache-load request */
>> +		mtspr(SPRN_MPPR, tmp);
>> +	}
>
> In fact, this whole block up here could be a function, no?

It could, perfectly happy for it to be one. Will fix.

>> +
>> +	/* Allocate memory before switching out of guest so we don't
>> +	   trash L2/L3 with memory allocation stuff */
>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
>> +		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
>> +						  MPP_BUFFER_ORDER);
>
> get_order(64 * 1024)?
>
> Also, why allocate it here and not on vcore creation?

There's also the possibility of saving/restoring part of the L3 cache
as well, and I was envisioning a future patch to this which checks a
flag in the vcore (maybe exposed via sysfs or whatever mechanism is
applicable) for whether it should save/restore L2 or L2/L3, so it makes a
bit more sense allocating it there rather than elsewhere.

There's also no real reason to fail to create a vcore if we can't
allocate a buffer for L2/L3 cache contents - retrying later is perfectly
harmless.

>> +
>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
>> +		phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
>> +#if defined(CONFIG_PPC_4K_PAGES)
>> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
>> +#endif
>> +		tmp = PPC_MPPE_ADDRESS_MASK & phy_addr;
>> +		tmp = tmp | PPC_MPPE_LOG_L2;
>> +
>> +		/* Abort any existing 'fetch' operations for this core */
>> +		mtspr(SPRN_MPPR, tmp&0x0fffffffffffffff);
>
> pretty magical, no?

Yeah, sorry about that.... I'll hide it in a function at least, and do
something a bit better there... IIRC this magic came from the initial
work that proved the hardware functionality, so I'll go back and re-read
some hardware documentation to see where exactly this came from...
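
(My current understanding - to be confirmed against the docs - is that the
mask just forces the table-size bits of MPPR to zero, which is the "abort any
fetch in progress" encoding. A less magic sketch, with the zero encoding given
a name:)

    #define PPC_MPPR_FETCH_ABORT (0x0ULL << 60)

    mtspr(SPRN_MPPR, (phy_addr & PPC_MPPE_ADDRESS_MASK) | PPC_MPPR_FETCH_ABORT);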

>> +
>> +		/* Finally, issue logmpp to save cache contents for L2 */
>> +		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
>> +	}
>
> This too should be a separate function.

ack.


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-08 22:59     ` Stewart Smith
@ 2014-07-10 11:05       ` Alexander Graf
  2014-07-10 13:07         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-07-10 11:05 UTC (permalink / raw)
  To: Stewart Smith, linuxppc-dev, paulus, kvm-ppc, Mel Gorman


On 09.07.14 00:59, Stewart Smith wrote:
> Hi!
>
> Thanks for review, much appreciated!
>
> Alexander Graf <agraf@suse.de> writes:
>> On 08.07.14 07:06, Stewart Smith wrote:
>>> @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>    	int i, need_vpa_update;
>>>    	int srcu_idx;
>>>    	struct kvm_vcpu *vcpus_to_update[threads_per_core];
>>> +	phys_addr_t phy_addr, tmp;
>> Please put the variable declarations into the if () branch so that the
>> compiler can catch potential leaks :)
> ack. will fix.
>
>>> @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>    
>>>    	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>>>    
>>> +	/* If we have a saved list of L2/L3, restore it */
>>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
>>> +		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
>>> +#if defined(CONFIG_PPC_4K_PAGES)
>>> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
>> get_free_pages() is automatically aligned to the order, no?
> That's what Paul reckoned too, and then we've attempted to find anywhere
> that documents that behaviour. Happen to be able to point to docs/source
> that say this is part of API?

Phew - it's probably buried somewhere. I could only find this document 
saying that we always get order-aligned allocations:

http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html

Mel, do you happen to have any pointer to something that explicitly (or 
even properly implicitly) says that get_free_pages() returns 
order-aligned memory?

>
>>> +#endif
>>> +		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
>>> +		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
>>> +
>>> +		/* For sanity, abort any 'save' requests in progress */
>>> +		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
>>> +
>>> +		/* Inititate a cache-load request */
>>> +		mtspr(SPRN_MPPR, tmp);
>>> +	}
>> In fact, this whole block up here could be a function, no?
> It could, perfectly happy for it to be one. Will fix.
>
>>> +
>>> +	/* Allocate memory before switching out of guest so we don't
>>> +	   trash L2/L3 with memory allocation stuff */
>>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
>>> +		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
>>> +						  MPP_BUFFER_ORDER);
>> get_order(64 * 1024)?
>>
>> Also, why allocate it here and not on vcore creation?
> There's also the possibility of saving/restorting part of the L3 cache
> as well, and I was envisioning a future patch to this which checks a
> flag in vcore (maybe exposed via sysfs or whatever mechanism is
> applicable) if it should save/restore L2 or L2/L3, so thus it makes a
> bit more sense allocating it there rather than elsewhere.
>
> There's also no real reason to fail to create a vcore if we can't
> allocate a buffer for L2/L3 cache contents - retrying later is perfectly
> harmless.

If we fail during core creation, just don't save/restore L2 cache
contents at all. I really prefer to have allocation and deallocation all
at init time - and such low-order allocations will most likely succeed.

Let's leave the L3 cache bits for later when we know whether it actually 
has an impact. I personally doubt it :).


Alex


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-10 11:05       ` Alexander Graf
@ 2014-07-10 13:07         ` Mel Gorman
  2014-07-10 13:17           ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2014-07-10 13:07 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Stewart Smith, paulus, linuxppc-dev, kvm-ppc

On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
> 
> On 09.07.14 00:59, Stewart Smith wrote:
> >Hi!
> >
> >Thanks for review, much appreciated!
> >
> >Alexander Graf <agraf@suse.de> writes:
> >>On 08.07.14 07:06, Stewart Smith wrote:
> >>>@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
> >>>   	int i, need_vpa_update;
> >>>   	int srcu_idx;
> >>>   	struct kvm_vcpu *vcpus_to_update[threads_per_core];
> >>>+	phys_addr_t phy_addr, tmp;
> >>Please put the variable declarations into the if () branch so that the
> >>compiler can catch potential leaks :)
> >ack. will fix.
> >
> >>>@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
> >>>   	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
> >>>+	/* If we have a saved list of L2/L3, restore it */
> >>>+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
> >>>+		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
> >>>+#if defined(CONFIG_PPC_4K_PAGES)
> >>>+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
> >>get_free_pages() is automatically aligned to the order, no?
> >That's what Paul reckoned too, and then we've attempted to find anywhere
> >that documents that behaviour. Happen to be able to point to docs/source
> >that say this is part of API?
> 
> Phew - it's probably buried somewhere. I could only find this
> document saying that we always get order-aligned allocations:
> 
> http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
> 
> Mel, do you happen to have any pointer to something that explicitly
> (or even properly implicitly) says that get_free_pages() returns
> order-aligned memory?
> 

I did not read the whole thread so I lack context and will just answer
this part.

There is no guarantee that pages are returned in PFN order for multiple
requests to the page allocator. This is the relevant comment in
rmqueue_bulk

                /*
                 * Split buddy pages returned by expand() are received here
                 * in physical page order. The page is added to the callers and
                 * list and the list head then moves forward. From the callers
                 * perspective, the linked list is ordered by page number in
                 * some conditions. This is useful for IO devices that can
                 * merge IO requests if the physical pages are ordered
                 * properly.
                 */

It will probably be true early in the lifetime of the system but the mileage
will vary on systems with a lot of uptime. If you depend on this behaviour
for correctness then you will have a bad day.

High-order page requests to the page allocator are guaranteed to be in physical
order. However, this does not apply to vmalloc() where allocations are
only guaranteed to be virtually contiguous.
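
To make that concrete for this use case (sketch only):

    /* A single order-4 request returns one physically contiguous block, and
     * the buddy allocator only hands out blocks aligned to their own order,
     * so on a 4k-page kernel this pointer is 64kB aligned. */
    unsigned long buf = __get_free_pages(GFP_KERNEL | __GFP_ZERO, 4);

    WARN_ON(buf && (buf & ((PAGE_SIZE << 4) - 1)));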

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-10 13:07         ` Mel Gorman
@ 2014-07-10 13:17           ` Alexander Graf
  2014-07-10 13:30             ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-07-10 13:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stewart Smith, paulus, linuxppc-dev, kvm-ppc


On 10.07.14 15:07, Mel Gorman wrote:
> On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
>> On 09.07.14 00:59, Stewart Smith wrote:
>>> Hi!
>>>
>>> Thanks for review, much appreciated!
>>>
>>> Alexander Graf <agraf@suse.de> writes:
>>>> On 08.07.14 07:06, Stewart Smith wrote:
>>>>> @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>>>    	int i, need_vpa_update;
>>>>>    	int srcu_idx;
>>>>>    	struct kvm_vcpu *vcpus_to_update[threads_per_core];
>>>>> +	phys_addr_t phy_addr, tmp;
>>>> Please put the variable declarations into the if () branch so that the
>>>> compiler can catch potential leaks :)
>>> ack. will fix.
>>>
>>>>> @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>>>    	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>>>>> +	/* If we have a saved list of L2/L3, restore it */
>>>>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
>>>>> +		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
>>>>> +#if defined(CONFIG_PPC_4K_PAGES)
>>>>> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
>>>> get_free_pages() is automatically aligned to the order, no?
>>> That's what Paul reckoned too, and then we've attempted to find anywhere
>>> that documents that behaviour. Happen to be able to point to docs/source
>>> that say this is part of API?
>> Phew - it's probably buried somewhere. I could only find this
>> document saying that we always get order-aligned allocations:
>>
>> http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
>>
>> Mel, do you happen to have any pointer to something that explicitly
>> (or even properly implicitly) says that get_free_pages() returns
>> order-aligned memory?
>>
> I did not read the whole thread so I lack context and will just answer
> this part.
>
> There is no guarantee that pages are returned in PFN order for multiple
> requests to the page allocator. This is the relevant comment in
> rmqueue_bulk
>
>                  /*
>                   * Split buddy pages returned by expand() are received here
>                   * in physical page order. The page is added to the callers and
>                   * list and the list head then moves forward. From the callers
>                   * perspective, the linked list is ordered by page number in
>                   * some conditions. This is useful for IO devices that can
>                   * merge IO requests if the physical pages are ordered
>                   * properly.
>                   */
>
> It will probably be true early in the lifetime of the system but the milage
> will vary on systems with a lot of uptime. If you depend on this behaviour
> for correctness then you will have a bad day.
>
> High-order page requests to the page allocator are guaranteed to be in physical
> order. However, this does not apply to vmalloc() where allocations are
> only guaranteed to be virtually contiguous.

Hrm, ok to be very concrete:

   Does __get_free_pages(..., 4); on a 4k page size system give me a 64k 
aligned pointer? :)


Alex


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-10 13:17           ` Alexander Graf
@ 2014-07-10 13:30             ` Mel Gorman
  2014-07-10 13:30               ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2014-07-10 13:30 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Stewart Smith, paulus, linuxppc-dev, kvm-ppc

On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote:
> 
> On 10.07.14 15:07, Mel Gorman wrote:
> >On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
> >>On 09.07.14 00:59, Stewart Smith wrote:
> >>>Hi!
> >>>
> >>>Thanks for review, much appreciated!
> >>>
> >>>Alexander Graf <agraf@suse.de> writes:
> >>>>On 08.07.14 07:06, Stewart Smith wrote:
> >>>>>@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
> >>>>>   	int i, need_vpa_update;
> >>>>>   	int srcu_idx;
> >>>>>   	struct kvm_vcpu *vcpus_to_update[threads_per_core];
> >>>>>+	phys_addr_t phy_addr, tmp;
> >>>>Please put the variable declarations into the if () branch so that the
> >>>>compiler can catch potential leaks :)
> >>>ack. will fix.
> >>>
> >>>>>@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
> >>>>>   	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
> >>>>>+	/* If we have a saved list of L2/L3, restore it */
> >>>>>+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
> >>>>>+		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
> >>>>>+#if defined(CONFIG_PPC_4K_PAGES)
> >>>>>+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
> >>>>get_free_pages() is automatically aligned to the order, no?
> >>>That's what Paul reckoned too, and then we've attempted to find anywhere
> >>>that documents that behaviour. Happen to be able to point to docs/source
> >>>that say this is part of API?
> >>Phew - it's probably buried somewhere. I could only find this
> >>document saying that we always get order-aligned allocations:
> >>
> >>http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
> >>
> >>Mel, do you happen to have any pointer to something that explicitly
> >>(or even properly implicitly) says that get_free_pages() returns
> >>order-aligned memory?
> >>
> >I did not read the whole thread so I lack context and will just answer
> >this part.
> >
> >There is no guarantee that pages are returned in PFN order for multiple
> >requests to the page allocator. This is the relevant comment in
> >rmqueue_bulk
> >
> >                 /*
> >                  * Split buddy pages returned by expand() are received here
> >                  * in physical page order. The page is added to the callers and
> >                  * list and the list head then moves forward. From the callers
> >                  * perspective, the linked list is ordered by page number in
> >                  * some conditions. This is useful for IO devices that can
> >                  * merge IO requests if the physical pages are ordered
> >                  * properly.
> >                  */
> >
> >It will probably be true early in the lifetime of the system but the milage
> >will vary on systems with a lot of uptime. If you depend on this behaviour
> >for correctness then you will have a bad day.
> >
> >High-order page requests to the page allocator are guaranteed to be in physical
> >order. However, this does not apply to vmalloc() where allocations are
> >only guaranteed to be virtually contiguous.
> 
> Hrm, ok to be very concrete:
> 
>   Does __get_free_pages(..., 4); on a 4k page size system give me a
> 64k aligned pointer? :)
> 

Yes.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-10 13:30             ` Mel Gorman
@ 2014-07-10 13:30               ` Alexander Graf
  0 siblings, 0 replies; 21+ messages in thread
From: Alexander Graf @ 2014-07-10 13:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Stewart Smith, paulus, linuxppc-dev, kvm-ppc


On 10.07.14 15:30, Mel Gorman wrote:
> On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote:
>> On 10.07.14 15:07, Mel Gorman wrote:
>>> On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
>>>> On 09.07.14 00:59, Stewart Smith wrote:
>>>>> Hi!
>>>>>
>>>>> Thanks for review, much appreciated!
>>>>>
>>>>> Alexander Graf <agraf@suse.de> writes:
>>>>>> On 08.07.14 07:06, Stewart Smith wrote:
>>>>>>> @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>>>>>    	int i, need_vpa_update;
>>>>>>>    	int srcu_idx;
>>>>>>>    	struct kvm_vcpu *vcpus_to_update[threads_per_core];
>>>>>>> +	phys_addr_t phy_addr, tmp;
>>>>>> Please put the variable declarations into the if () branch so that the
>>>>>> compiler can catch potential leaks :)
>>>>> ack. will fix.
>>>>>
>>>>>>> @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>>>>>    	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>>>>>>> +	/* If we have a saved list of L2/L3, restore it */
>>>>>>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
>>>>>>> +		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
>>>>>>> +#if defined(CONFIG_PPC_4K_PAGES)
>>>>>>> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
>>>>>> get_free_pages() is automatically aligned to the order, no?
>>>>> That's what Paul reckoned too, and then we've attempted to find anywhere
>>>>> that documents that behaviour. Happen to be able to point to docs/source
>>>>> that say this is part of API?
>>>> Phew - it's probably buried somewhere. I could only find this
>>>> document saying that we always get order-aligned allocations:
>>>>
>>>> http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
>>>>
>>>> Mel, do you happen to have any pointer to something that explicitly
>>>> (or even properly implicitly) says that get_free_pages() returns
>>>> order-aligned memory?
>>>>
>>> I did not read the whole thread so I lack context and will just answer
>>> this part.
>>>
>>> There is no guarantee that pages are returned in PFN order for multiple
>>> requests to the page allocator. This is the relevant comment in
>>> rmqueue_bulk
>>>
>>>                  /*
>>>                   * Split buddy pages returned by expand() are received here
>>>                   * in physical page order. The page is added to the callers and
>>>                   * list and the list head then moves forward. From the callers
>>>                   * perspective, the linked list is ordered by page number in
>>>                   * some conditions. This is useful for IO devices that can
>>>                   * merge IO requests if the physical pages are ordered
>>>                   * properly.
>>>                   */
>>>
>>> It will probably be true early in the lifetime of the system but the milage
>>> will vary on systems with a lot of uptime. If you depend on this behaviour
>>> for correctness then you will have a bad day.
>>>
>>> High-order page requests to the page allocator are guaranteed to be in physical
>>> order. However, this does not apply to vmalloc() where allocations are
>>> only guaranteed to be virtually contiguous.
>> Hrm, ok to be very concrete:
>>
>>    Does __get_free_pages(..., 4); on a 4k page size system give me a
>> 64k aligned pointer? :)
>>
> Yes.

Awesome - thanks a lot! :)


Alex


* [PATCH v3] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-08  5:06 ` [PATCH v2] " Stewart Smith
  2014-07-08 10:41   ` Alexander Graf
@ 2014-07-17  3:19   ` Stewart Smith
  2014-07-17  7:55     ` Alexander Graf
                       ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Stewart Smith @ 2014-07-17  3:19 UTC (permalink / raw)
  To: linuxppc-dev, paulus, kvm-ppc, Alexander Graf; +Cc: Stewart Smith

The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has a way to store and load the contents of the
L2 cache, or of the L2 plus the MRU way of the L3 cache". We initiate
storing of the log (a list of cache line addresses) using the logmpp
instruction and start the restore by writing to an SPR.

The logmpp instruction takes its parameters in a single 64-bit register:
- starting address of the table in which to store the log of L2/L2+L3
  cache contents
  - 32kB for L2
  - 128kB for L2+L3
  - aligned to the maximum size of the table (32kB or 128kB)
- log control (no-op, L2 only, L2 and L3, abort logout)

We should abort any ongoing logging before initiating a new one.

To initiate the restore, we write to the MPPR SPR. The format of what to
write to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, or read until the end of the table)
- prefetch rate (from fastest possible to slower: roughly every 8, 16, 24
  or 32 cycles)

The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.

The best-case scenario for doing this is when some vcores are running very
cache-heavy workloads. The worst case is when they get almost no cache hits,
so we just generate needless memory operations.

This implementation just does L2 store/load. In my benchmarks this proves
to be useful.

Benchmark 1:
 - 16 core POWER8
 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
 - No split core/SMT
 - two guests running sysbench memory test.
   sysbench --test=memory --num-threads=8 run
 - one guest running apache bench (of default HTML page)
   ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure the performance of a real-world application
(Apache) while the other guests are cache-hot with their own workloads. The
sysbench memory benchmark does pointer-sized writes to a (small) memory
buffer in a loop.

In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times was largely unchanged.

Benchmark 2:
 - Same VM config as benchmark 1
 - all three guests running the sysbench memory benchmark

This benchmark aims to see whether there is a positive or negative effect on
this cache-heavy workload. Due to the nature of the benchmark (stores), we may
not see a difference in raw performance, but hopefully an improvement in
consistency of performance (when a vcore is switched in, it doesn't have to
wait repeatedly for cachelines to be pulled in).

The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.

Benchmark 3:
 - same 3 guests and CPU configuration as benchmarks 1 and 2
 - two idle guests
 - one guest running the STREAM benchmark

This scenario also saw a performance improvement with this patch. On the Copy
and Scale workloads from STREAM, I got a 5-6% improvement with this patch. For
Add and Triad, it was around 10% (or more).

Benchmark 4:
 - same 3 guests as previous benchmarks
 - two guests running sysbench --memory, a distinctly different cache-heavy
   workload
 - one guest running the STREAM benchmark

Similar improvements to benchmark 3.

Benchmark 5:
 - 1 guest, 8 VCPUs, Ubuntu 14.04
 - Host configured with split core (SMT8, subcores-per-core=4)
 - STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board
of STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

--
changes since v2:
- based on feedback from Alexander Graf:
  - move save and restore of cache to separate functions
  - move allocation of mpp_buffer to vcore creation
  - get_free_pages() does actually allocate pages aligned to order
    (Mel Gorman confirms)
  - make SPR and logmpp parameters a bit less magic, especially around abort

changes since v1:
- s/mppe/mpp_buffer/
- add MPP_BUFFER_ORDER define.
---
 arch/powerpc/include/asm/kvm_host.h   |    2 +
 arch/powerpc/include/asm/ppc-opcode.h |   17 +++++++
 arch/powerpc/include/asm/reg.h        |    1 +
 arch/powerpc/kvm/book3s_hv.c          |   89 +++++++++++++++++++++++++++++----
 4 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1eaea2d..5769497 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -305,6 +305,8 @@ struct kvmppc_vcore {
 	u32 arch_compat;
 	ulong pcr;
 	ulong dpdes;		/* doorbell state (POWER8) */
+	unsigned long mpp_buffer; /* Micro Partition Prefetch buffer */
+	bool mpp_buffer_is_valid;
 };
 
 #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 3132bb9..c636841 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -139,6 +139,7 @@
 #define PPC_INST_ISEL			0x7c00001e
 #define PPC_INST_ISEL_MASK		0xfc00003e
 #define PPC_INST_LDARX			0x7c0000a8
+#define PPC_INST_LOGMPP			0x7c0007e4
 #define PPC_INST_LSWI			0x7c0004aa
 #define PPC_INST_LSWX			0x7c00042a
 #define PPC_INST_LWARX			0x7c000028
@@ -275,6 +276,20 @@
 #define __PPC_EH(eh)	0
 #endif
 
+/* POWER8 Micro Partition Prefetch (MPP) parameters */
+/* Address mask is common for LOGMPP instruction and MPPR SPR */
+#define PPC_MPPE_ADDRESS_MASK 0xffffffffc000
+
+/* Bits 60 and 61 of MPP SPR should be set to one of the following */
+/* Aborting the fetch is indeed setting 00 in the table size bits */
+#define PPC_MPPR_FETCH_ABORT (0x0ULL << 60)
+#define PPC_MPPR_FETCH_WHOLE_TABLE (0x2ULL << 60)
+
+/* Bits 54 and 55 of register for LOGMPP instruction should be set to: */
+#define PPC_LOGMPP_LOG_L2 (0x02ULL << 54)
+#define PPC_LOGMPP_LOG_L2L3 (0x01ULL << 54)
+#define PPC_LOGMPP_LOG_ABORT (0x03ULL << 54)
+
 /* Deal with instructions that older assemblers aren't aware of */
 #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
 					__PPC_RA(a) | __PPC_RB(b))
@@ -283,6 +298,8 @@
 #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
+#define PPC_LOGMPP(b)		stringify_in_c(.long PPC_INST_LOGMPP | \
+					__PPC_RB(b))
 #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e5d2e0b..5164beb 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -224,6 +224,7 @@
 #define   CTRL_TE	0x00c00000	/* thread enable */
 #define   CTRL_RUNLATCH	0x1
 #define SPRN_DAWR	0xB4
+#define SPRN_MPPR	0xB8	/* Micro Partition Prefetch Register */
 #define SPRN_CIABR	0xBB
 #define   CIABR_PRIV		0x3
 #define   CIABR_PRIV_USER	1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8227dba..b5a8b11 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -67,6 +67,13 @@
 /* Used as a "null" value for timebase values */
 #define TB_NIL	(~(u64)0)
 
+#if defined(CONFIG_PPC_64K_PAGES)
+#define MPP_BUFFER_ORDER	0
+#elif defined(CONFIG_PPC_4K_PAGES)
+#define MPP_BUFFER_ORDER	4
+#endif
+
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
@@ -1258,6 +1265,32 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
 	return r;
 }
 
+static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
+{
+	struct kvmppc_vcore *vcore;
+
+	vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
+
+	if (vcore == NULL)
+		return NULL;
+
+	INIT_LIST_HEAD(&vcore->runnable_threads);
+	spin_lock_init(&vcore->lock);
+	init_waitqueue_head(&vcore->wq);
+	vcore->preempt_tb = TB_NIL;
+	vcore->lpcr = kvm->arch.lpcr;
+	vcore->first_vcpuid = core * threads_per_core;
+	vcore->kvm = kvm;
+
+	vcore->mpp_buffer_is_valid = false;
+
+	if (cpu_has_feature(CPU_FTR_ARCH_207S))
+		vcore->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
+						     MPP_BUFFER_ORDER);
+
+	return vcore;
+}
+
 static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 						   unsigned int id)
 {
@@ -1298,16 +1331,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 	mutex_lock(&kvm->lock);
 	vcore = kvm->arch.vcores[core];
 	if (!vcore) {
-		vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
-		if (vcore) {
-			INIT_LIST_HEAD(&vcore->runnable_threads);
-			spin_lock_init(&vcore->lock);
-			init_waitqueue_head(&vcore->wq);
-			vcore->preempt_tb = TB_NIL;
-			vcore->lpcr = kvm->arch.lpcr;
-			vcore->first_vcpuid = core * threads_per_core;
-			vcore->kvm = kvm;
-		}
+		vcore = kvmppc_vcore_create(kvm, core);
 		kvm->arch.vcores[core] = vcore;
 		kvm->arch.online_vcores++;
 	}
@@ -1516,6 +1540,37 @@ static int on_primary_thread(void)
 	return 1;
 }
 
+static void ppc_start_saving_l2_cache(struct kvmppc_vcore *vc)
+{
+	phys_addr_t phy_addr, tmp;
+
+	phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
+
+	tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
+
+	mtspr(SPRN_MPPR, tmp | PPC_MPPR_FETCH_ABORT);
+
+	asm volatile(PPC_LOGMPP(R1) : : "r" (tmp | PPC_LOGMPP_LOG_L2));
+
+	vc->mpp_buffer_is_valid = true;
+}
+
+static void ppc_start_restoring_l2_cache(const struct kvmppc_vcore *vc)
+{
+	phys_addr_t phy_addr, tmp;
+
+	phy_addr = virt_to_phys((void *)vc->mpp_buffer);
+
+	tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
+
+	/* We must abort any in-progress save operations to ensure
+	 * the table is valid so that prefetch engine knows when to
+	 * stop prefetching. */
+	asm volatile(PPC_LOGMPP(R1) : : "r" (tmp | PPC_LOGMPP_LOG_ABORT));
+
+	mtspr(SPRN_MPPR, tmp | PPC_MPPR_FETCH_WHOLE_TABLE);
+}
+
 /*
  * Run a set of guest threads on a physical core.
  * Called with vc->lock held.
@@ -1590,9 +1645,16 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 
+	if (vc->mpp_buffer_is_valid)
+		ppc_start_restoring_l2_cache(vc);
+
 	__kvmppc_vcore_entry();
 
 	spin_lock(&vc->lock);
+
+	if (vc->mpp_buffer)
+		ppc_start_saving_l2_cache(vc);
+
 	/* disable sending of IPIs on virtual external irqs */
 	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
 		vcpu->cpu = -1;
@@ -2329,8 +2391,13 @@ static void kvmppc_free_vcores(struct kvm *kvm)
 {
 	long int i;
 
-	for (i = 0; i < KVM_MAX_VCORES; ++i)
+	for (i = 0; i < KVM_MAX_VCORES; ++i) {
+		if (kvm->arch.vcores[i] && kvm->arch.vcores[i]->mpp_buffer) {
+			free_pages(kvm->arch.vcores[i]->mpp_buffer,
+				   MPP_BUFFER_ORDER);
+		}
 		kfree(kvm->arch.vcores[i]);
+	}
 	kvm->arch.online_vcores = 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-17  3:19   ` [PATCH v3] " Stewart Smith
@ 2014-07-17  7:55     ` Alexander Graf
  2014-07-18  4:10       ` Stewart Smith
  2014-07-17 23:52     ` Paul Mackerras
  2014-07-18  4:18     ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Stewart Smith
  2 siblings, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-07-17  7:55 UTC (permalink / raw)
  To: Stewart Smith, linuxppc-dev, paulus, kvm-ppc


On 17.07.14 05:19, Stewart Smith wrote:
> The POWER8 processor has a Micro Partition Prefetch Engine, which is
> a fancy way of saying "has way to store and load contents of L2 or
> L2+MRU way of L3 cache". We initiate the storing of the log (list of
> addresses) using the logmpp instruction and start restore by writing
> to a SPR.
>
> The logmpp instruction takes parameters in a single 64bit register:
> - starting address of the table to store log of L2/L2+L3 cache contents
>    - 32kb for L2
>    - 128kb for L2+L3
>    - Aligned relative to maximum size of the table (32kb or 128kb)
> - Log control (no-op, L2 only, L2 and L3, abort logout)
>
> We should abort any ongoing logging before initiating one.
>
> To initiate restore, we write to the MPPR SPR. The format of what to write
> to the SPR is similar to the logmpp instruction parameter:
> - starting address of the table to read from (same alignment requirements)
> - table size (no data, until end of table)
> - prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
>    32 cycles)
>
> The idea behind loading and storing the contents of L2/L3 cache is to
> reduce memory latency in a system that is frequently swapping vcores on
> a physical CPU.
>
> The best case scenario for doing this is when some vcores are doing very
> cache heavy workloads. The worst case is when they have about 0 cache hits,
> so we just generate needless memory operations.
>
> This implementation just does L2 store/load. In my benchmarks this proves
> to be useful.
>
> Benchmark 1:
>   - 16 core POWER8
>   - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
>   - No split core/SMT
>   - two guests running sysbench memory test.
>     sysbench --test=memory --num-threads=8 run
>   - one guest running apache bench (of default HTML page)
>     ab -n 490000 -c 400 http://localhost/
>
> This benchmark aims to measure performance of real world application (apache)
> where other guests are cache hot with their own workloads. The sysbench memory
> benchmark does pointer sized writes to a (small) memory buffer in a loop.
>
> In this benchmark with this patch I can see an improvement both in requests
> per second (~5%) and in mean and median response times (again, about 5%).
> The spread of minimum and maximum response times were largely unchanged.
>
> benchmark 2:
>   - Same VM config as benchmark 1
>   - all three guests running sysbench memory benchmark
>
> This benchmark aims to see if there is a positive or negative affect to this
> cache heavy benchmark. Although due to the nature of the benchmark (stores) we
> may not see a difference in performance, but rather hopefully an improvement
> in consistency of performance (when vcore switched in, don't have to wait
> many times for cachelines to be pulled in)
>
> The results of this benchmark are improvements in consistency of performance
> rather than performance itself. With this patch, the few outliers in duration
> go away and we get more consistent performance in each guest.
>
> benchmark 3:
>   - same 3 guests and CPU configuration as benchmark 1 and 2.
>   - two idle guests
>   - 1 guest running STREAM benchmark
>
> This scenario also saw performance improvement with this patch. On Copy and
> Scale workloads from STREAM, I got 5-6% improvement with this patch. For
> Add and triad, it was around 10% (or more).
>
> benchmark 4:
>   - same 3 guests as previous benchmarks
>   - two guests running sysbench --memory, distinctly different cache heavy
>     workload
>   - one guest running STREAM benchmark.
>
> Similar improvements to benchmark 3.
>
> benchmark 5:
>   - 1 guest, 8 VCPUs, Ubuntu 14.04
>   - Host configured with split core (SMT8, subcores-per-core=4)
>   - STREAM benchmark
>
> In this benchmark, we see a 10-20% performance improvement across the board
> of STREAM benchmark results with this patch.
>
> Based on preliminary investigation and microbenchmarks
> by Prerna Saxena <prerna@linux.vnet.ibm.com>
>
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

A lot nicer :)

>
> --
> changes since v2:
> - based on feedback from Alexander Graf:
>    - move save and restore of cache to separate functions
>    - move allocation of mpp_buffer to vcore creation
>    - get_free_pages() does actually allocate pages aligned to order
>      (Mel Gorman confirms)
>    - make SPR and logmpp parameters a bit less magic, especially around abort
>
> changes since v1:
> - s/mppe/mpp_buffer/
> - add MPP_BUFFER_ORDER define.
> ---
>   arch/powerpc/include/asm/kvm_host.h   |    2 +
>   arch/powerpc/include/asm/ppc-opcode.h |   17 +++++++
>   arch/powerpc/include/asm/reg.h        |    1 +
>   arch/powerpc/kvm/book3s_hv.c          |   89 +++++++++++++++++++++++++++++----
>   4 files changed, 98 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 1eaea2d..5769497 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -305,6 +305,8 @@ struct kvmppc_vcore {
>   	u32 arch_compat;
>   	ulong pcr;
>   	ulong dpdes;		/* doorbell state (POWER8) */
> +	unsigned long mpp_buffer; /* Micro Partition Prefetch buffer */

Just make this a void*?

> +	bool mpp_buffer_is_valid;
>   };
>   
>   #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
> index 3132bb9..c636841 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -139,6 +139,7 @@
>   #define PPC_INST_ISEL			0x7c00001e
>   #define PPC_INST_ISEL_MASK		0xfc00003e
>   #define PPC_INST_LDARX			0x7c0000a8
> +#define PPC_INST_LOGMPP			0x7c0007e4
>   #define PPC_INST_LSWI			0x7c0004aa
>   #define PPC_INST_LSWX			0x7c00042a
>   #define PPC_INST_LWARX			0x7c000028
> @@ -275,6 +276,20 @@
>   #define __PPC_EH(eh)	0
>   #endif
>   
> +/* POWER8 Micro Partition Prefetch (MPP) parameters */
> +/* Address mask is common for LOGMPP instruction and MPPR SPR */
> +#define PPC_MPPE_ADDRESS_MASK 0xffffffffc000
> +
> +/* Bits 60 and 61 of MPP SPR should be set to one of the following */
> +/* Aborting the fetch is indeed setting 00 in the table size bits */
> +#define PPC_MPPR_FETCH_ABORT (0x0ULL << 60)
> +#define PPC_MPPR_FETCH_WHOLE_TABLE (0x2ULL << 60)
> +
> +/* Bits 54 and 55 of register for LOGMPP instruction should be set to: */
> +#define PPC_LOGMPP_LOG_L2 (0x02ULL << 54)
> +#define PPC_LOGMPP_LOG_L2L3 (0x01ULL << 54)
> +#define PPC_LOGMPP_LOG_ABORT (0x03ULL << 54)
> +
>   /* Deal with instructions that older assemblers aren't aware of */
>   #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
>   					__PPC_RA(a) | __PPC_RB(b))
> @@ -283,6 +298,8 @@
>   #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
>   					___PPC_RT(t) | ___PPC_RA(a) | \
>   					___PPC_RB(b) | __PPC_EH(eh))
> +#define PPC_LOGMPP(b)		stringify_in_c(.long PPC_INST_LOGMPP | \
> +					__PPC_RB(b))
>   #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
>   					___PPC_RT(t) | ___PPC_RA(a) | \
>   					___PPC_RB(b) | __PPC_EH(eh))
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index e5d2e0b..5164beb 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -224,6 +224,7 @@
>   #define   CTRL_TE	0x00c00000	/* thread enable */
>   #define   CTRL_RUNLATCH	0x1
>   #define SPRN_DAWR	0xB4
> +#define SPRN_MPPR	0xB8	/* Micro Partition Prefetch Register */
>   #define SPRN_CIABR	0xBB
>   #define   CIABR_PRIV		0x3
>   #define   CIABR_PRIV_USER	1
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8227dba..b5a8b11 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -67,6 +67,13 @@
>   /* Used as a "null" value for timebase values */
>   #define TB_NIL	(~(u64)0)
>   
> +#if defined(CONFIG_PPC_64K_PAGES)
> +#define MPP_BUFFER_ORDER	0
> +#elif defined(CONFIG_PPC_4K_PAGES)
> +#define MPP_BUFFER_ORDER	4
> +#endif
> +
> +
>   static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
>   static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
>   
> @@ -1258,6 +1265,32 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
>   	return r;
>   }
>   
> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> +{
> +	struct kvmppc_vcore *vcore;
> +
> +	vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
> +
> +	if (vcore == NULL)
> +		return NULL;
> +
> +	INIT_LIST_HEAD(&vcore->runnable_threads);
> +	spin_lock_init(&vcore->lock);
> +	init_waitqueue_head(&vcore->wq);
> +	vcore->preempt_tb = TB_NIL;
> +	vcore->lpcr = kvm->arch.lpcr;
> +	vcore->first_vcpuid = core * threads_per_core;
> +	vcore->kvm = kvm;
> +
> +	vcore->mpp_buffer_is_valid = false;
> +
> +	if (cpu_has_feature(CPU_FTR_ARCH_207S))
> +		vcore->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
> +						     MPP_BUFFER_ORDER);
> +
> +	return vcore;
> +}
> +
>   static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>   						   unsigned int id)
>   {
> @@ -1298,16 +1331,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>   	mutex_lock(&kvm->lock);
>   	vcore = kvm->arch.vcores[core];
>   	if (!vcore) {
> -		vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
> -		if (vcore) {
> -			INIT_LIST_HEAD(&vcore->runnable_threads);
> -			spin_lock_init(&vcore->lock);
> -			init_waitqueue_head(&vcore->wq);
> -			vcore->preempt_tb = TB_NIL;
> -			vcore->lpcr = kvm->arch.lpcr;
> -			vcore->first_vcpuid = core * threads_per_core;
> -			vcore->kvm = kvm;
> -		}
> +		vcore = kvmppc_vcore_create(kvm, core);
>   		kvm->arch.vcores[core] = vcore;
>   		kvm->arch.online_vcores++;
>   	}
> @@ -1516,6 +1540,37 @@ static int on_primary_thread(void)
>   	return 1;
>   }
>   
> +static void ppc_start_saving_l2_cache(struct kvmppc_vcore *vc)

Please use the "kvmppc" name space.

> +{
> +	phys_addr_t phy_addr, tmp;
> +
> +	phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
> +
> +	tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;

I would prefer if you give the variable a more telling name.

> +
> +	mtspr(SPRN_MPPR, tmp | PPC_MPPR_FETCH_ABORT);
> +
> +	asm volatile(PPC_LOGMPP(R1) : : "r" (tmp | PPC_LOGMPP_LOG_L2));

Can you move this asm() into a static inline function in generic code 
somewhere?

> +
> +	vc->mpp_buffer_is_valid = true;

Where does this ever get unset? And what point does this variable make? 
Can't you just check on if (vc->mpp_buffer)?

Also, a single whitespace line between every instruction you do looks 
weird ;). When you have the feeling that the code flow is weird enough 
that you need empty lines between every real line, there's probably 
something wrong in the code flow :).


Alex

> +}
> +
> +static void ppc_start_restoring_l2_cache(const struct kvmppc_vcore *vc)
> +{
> +	phys_addr_t phy_addr, tmp;
> +
> +	phy_addr = virt_to_phys((void *)vc->mpp_buffer);
> +
> +	tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
> +
> +	/* We must abort any in-progress save operations to ensure
> +	 * the table is valid so that prefetch engine knows when to
> +	 * stop prefetching. */
> +	asm volatile(PPC_LOGMPP(R1) : : "r" (tmp | PPC_LOGMPP_LOG_ABORT));
> +
> +	mtspr(SPRN_MPPR, tmp | PPC_MPPR_FETCH_WHOLE_TABLE);
> +}
> +
>   /*
>    * Run a set of guest threads on a physical core.
>    * Called with vc->lock held.
> @@ -1590,9 +1645,16 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>   
>   	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>   
> +	if (vc->mpp_buffer_is_valid)
> +		ppc_start_restoring_l2_cache(vc);
> +
>   	__kvmppc_vcore_entry();
>   
>   	spin_lock(&vc->lock);
> +
> +	if (vc->mpp_buffer)
> +		ppc_start_saving_l2_cache(vc);
> +
>   	/* disable sending of IPIs on virtual external irqs */
>   	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
>   		vcpu->cpu = -1;
> @@ -2329,8 +2391,13 @@ static void kvmppc_free_vcores(struct kvm *kvm)
>   {
>   	long int i;
>   
> -	for (i = 0; i < KVM_MAX_VCORES; ++i)
> +	for (i = 0; i < KVM_MAX_VCORES; ++i) {
> +		if (kvm->arch.vcores[i] && kvm->arch.vcores[i]->mpp_buffer) {
> +			free_pages(kvm->arch.vcores[i]->mpp_buffer,
> +				   MPP_BUFFER_ORDER);
> +		}
>   		kfree(kvm->arch.vcores[i]);
> +	}
>   	kvm->arch.online_vcores = 0;
>   }
>   

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-17  3:19   ` [PATCH v3] " Stewart Smith
  2014-07-17  7:55     ` Alexander Graf
@ 2014-07-17 23:52     ` Paul Mackerras
  2014-07-18  4:10       ` Stewart Smith
  2014-07-18  4:18     ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Stewart Smith
  2 siblings, 1 reply; 21+ messages in thread
From: Paul Mackerras @ 2014-07-17 23:52 UTC (permalink / raw)
  To: Stewart Smith; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc

On Thu, Jul 17, 2014 at 01:19:57PM +1000, Stewart Smith wrote:

> The POWER8 processor has a Micro Partition Prefetch Engine, which is
> a fancy way of saying "has way to store and load contents of L2 or
> L2+MRU way of L3 cache". We initiate the storing of the log (list of
> addresses) using the logmpp instruction and start restore by writing
> to a SPR.
> 
> The logmpp instruction takes parameters in a single 64bit register:
> - starting address of the table to store log of L2/L2+L3 cache contents
>   - 32kb for L2
>   - 128kb for L2+L3
>   - Aligned relative to maximum size of the table (32kb or 128kb)
> - Log control (no-op, L2 only, L2 and L3, abort logout)
> 
> We should abort any ongoing logging before initiating one.

Do we ever want to wait for ongoing logging to finish?

[snip]

> +#if defined(CONFIG_PPC_64K_PAGES)
> +#define MPP_BUFFER_ORDER	0
> +#elif defined(CONFIG_PPC_4K_PAGES)
> +#define MPP_BUFFER_ORDER	4

Why 4 not 3?  You only need 32kB, don't you?

> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> +{
> +	struct kvmppc_vcore *vcore;
> +
> +	vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
> +
> +	if (vcore == NULL)
> +		return NULL;
> +
> +	INIT_LIST_HEAD(&vcore->runnable_threads);
> +	spin_lock_init(&vcore->lock);
> +	init_waitqueue_head(&vcore->wq);
> +	vcore->preempt_tb = TB_NIL;
> +	vcore->lpcr = kvm->arch.lpcr;
> +	vcore->first_vcpuid = core * threads_per_core;
> +	vcore->kvm = kvm;

Is there a particular reason why you need to pull this code out into a
separate function?  If so, it would be a little nicer if you did that
in a separate patch, to make it easier to see that the code motion
changes nothing.

> @@ -1590,9 +1645,16 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>  
>  	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>  
> +	if (vc->mpp_buffer_is_valid)
> +		ppc_start_restoring_l2_cache(vc);
> +
>  	__kvmppc_vcore_entry();
>  
>  	spin_lock(&vc->lock);
> +
> +	if (vc->mpp_buffer)
> +		ppc_start_saving_l2_cache(vc);

I wonder if we would get better performance improvements if we kicked
this off earlier, for instance before we save all the FP/VSX state and
switch the MMU?  I guess that could be a subsequent patch.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-17 23:52     ` Paul Mackerras
@ 2014-07-18  4:10       ` Stewart Smith
  0 siblings, 0 replies; 21+ messages in thread
From: Stewart Smith @ 2014-07-18  4:10 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc

Paul Mackerras <paulus@samba.org> writes:
> On Thu, Jul 17, 2014 at 01:19:57PM +1000, Stewart Smith wrote:
>
>> The POWER8 processor has a Micro Partition Prefetch Engine, which is
>> a fancy way of saying "has way to store and load contents of L2 or
>> L2+MRU way of L3 cache". We initiate the storing of the log (list of
>> addresses) using the logmpp instruction and start restore by writing
>> to a SPR.
>> 
>> The logmpp instruction takes parameters in a single 64bit register:
>> - starting address of the table to store log of L2/L2+L3 cache contents
>>   - 32kb for L2
>>   - 128kb for L2+L3
>>   - Aligned relative to maximum size of the table (32kb or 128kb)
>> - Log control (no-op, L2 only, L2 and L3, abort logout)
>> 
>> We should abort any ongoing logging before initiating one.
>
> Do we ever want to wait for ongoing logging to finish?

*Probably* not... but as far as I can see the hardware doesn't expose a
way to find out if there is any ongoing logging or a way to be notified
when it's done.

>> +#if defined(CONFIG_PPC_64K_PAGES)
>> +#define MPP_BUFFER_ORDER	0
>> +#elif defined(CONFIG_PPC_4K_PAGES)
>> +#define MPP_BUFFER_ORDER	4
>
> Why 4 not 3?  You only need 32kB, don't you?

Correct. whoops.
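
For the record, the arithmetic behind the corrected value (what v4 below
uses) is:

/*
 * The L2 log needs a 32kB buffer, aligned to its own size:
 *   CONFIG_PPC_4K_PAGES:  32kB / 4kB  = 8 pages -> order 3
 *   CONFIG_PPC_64K_PAGES: 32kB fits in one page -> order 0
 * __get_free_pages() hands back blocks aligned to their order, so the
 * order-3 allocation is already 32kB aligned.
 */
#if defined(CONFIG_PPC_64K_PAGES)
#define MPP_BUFFER_ORDER	0
#elif defined(CONFIG_PPC_4K_PAGES)
#define MPP_BUFFER_ORDER	3
#endif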

>> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>> +{
>> +	struct kvmppc_vcore *vcore;
>> +
>> +	vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
>> +
>> +	if (vcore == NULL)
>> +		return NULL;
>> +
>> +	INIT_LIST_HEAD(&vcore->runnable_threads);
>> +	spin_lock_init(&vcore->lock);
>> +	init_waitqueue_head(&vcore->wq);
>> +	vcore->preempt_tb = TB_NIL;
>> +	vcore->lpcr = kvm->arch.lpcr;
>> +	vcore->first_vcpuid = core * threads_per_core;
>> +	vcore->kvm = kvm;
>
> Is there a particular reason why you need to pull this code out into a
> separate function?  If so, it would be a little nicer if you did that
> in a separate patch, to make it easier to see that the code motion
> changes nothing.

ack, done.

>> @@ -1590,9 +1645,16 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>  
>>  	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>>  
>> +	if (vc->mpp_buffer_is_valid)
>> +		ppc_start_restoring_l2_cache(vc);
>> +
>>  	__kvmppc_vcore_entry();
>>  
>>  	spin_lock(&vc->lock);
>> +
>> +	if (vc->mpp_buffer)
>> +		ppc_start_saving_l2_cache(vc);
>
> I wonder if we would get better performance improvements if we kicked
> this off earlier, for instance before we save all the FP/VSX state and
> switch the MMU?  I guess that could be a subsequent patch.

Possibly, yes. Maybe something to look at in a future patch, along with
whether also doing the L3 save/restore is a benefit.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-17  7:55     ` Alexander Graf
@ 2014-07-18  4:10       ` Stewart Smith
  2014-07-28 12:30         ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Stewart Smith @ 2014-07-18  4:10 UTC (permalink / raw)
  To: Alexander Graf, linuxppc-dev, paulus, kvm-ppc

Alexander Graf <agraf@suse.de> writes:
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 1eaea2d..5769497 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -305,6 +305,8 @@ struct kvmppc_vcore {
>>   	u32 arch_compat;
>>   	ulong pcr;
>>   	ulong dpdes;		/* doorbell state (POWER8) */
>> +	unsigned long mpp_buffer; /* Micro Partition Prefetch buffer */
>
> Just make this a void*?

get_free_pages returns an unsigned long and free_pages accepts an
unsigned long, so I was just avoiding the cast. Is the style in this
case to do void* rather than unsigned long and cast it everywhere?

In v4 of the patch I've gone to void* anyway.
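
For reference, with the field as a void * the casts just move to the
allocator boundary; this is the shape v4 (below) ends up with:

	/* allocation, at vcore creation */
	vcore->mpp_buffer = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
						     MPP_BUFFER_ORDER);

	/* freeing, when the vcores are torn down */
	free_pages((unsigned long)vc->mpp_buffer, MPP_BUFFER_ORDER);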

>> @@ -1516,6 +1540,37 @@ static int on_primary_thread(void)
>>   	return 1;
>>   }
>>   
>> +static void ppc_start_saving_l2_cache(struct kvmppc_vcore *vc)
>
> Please use the "kvmppc" name space.

ack, done.

>> +	phys_addr_t phy_addr, tmp;
>> +
>> +	phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
>> +
>> +	tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
>
> I would prefer if you give the variable a more telling name.

ack.

>> +
>> +	mtspr(SPRN_MPPR, tmp | PPC_MPPR_FETCH_ABORT);
>> +
>> +	asm volatile(PPC_LOGMPP(R1) : : "r" (tmp | PPC_LOGMPP_LOG_L2));
>
> Can you move this asm() into a static inline function in generic code 
> somewhere?

okay. It seems the best place may be powerpc/include/asm/cache.h -
simply because it deals with cache things. I'm open to better
suggestions :)
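
For concreteness, the wrapper ends up as a small static inline in
asm/cache.h in the v4 patch below:

static inline void logmpp(u64 x)
{
	asm volatile(PPC_LOGMPP(R1) : : "r" (x));
}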

>
>> +
>> +	vc->mpp_buffer_is_valid = true;
>
> Where does this ever get unset? And what point does this variable make? 
> Can't you just check on if (vc->mpp_buffer)?

The problem with having moved the memory allocation for mpp_buffer to
vcore setup is that we'll have vc->mpp_buffer != NULL but have some
random contents in it, so when we first start executing the vcore, we
shouldn't initiate prefetching (hence mpp_buffer_is_valid).

If we point the prefetch engine to random memory contents, we get the
most amazing array of incomprehensible illegal accesses :)

Aborting the save of the L2 contents before starting to read them back
in makes the hardware ensure the content of that buffer is terminated
correctly.

The hardware docs don't describe the exact format of what it puts in the
buffer, just that there's an 'end of table' bit set in the last entry.
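
Condensing the two relevant sites from the patch, the flag is only set once
a save has been kicked off, and is what gates pointing the prefetch engine
at the buffer on the next switch-in:

	/* switch-out path, in kvmppc_start_saving_l2_cache() */
	vc->mpp_buffer_is_valid = true;

	/* switch-in path, in kvmppc_run_core(), before __kvmppc_vcore_entry() */
	if (vc->mpp_buffer_is_valid)
		kvmppc_start_restoring_l2_cache(vc);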

> Also, a single whitespace line between every instruction you do looks 
> weird ;). When you have the feeling that the code flow is weird enough 
> that you need empty lines between every real line, there's probably 
> something wrong in the code flow :).

ok, looks a bit better with logmpp as func rather than asm block.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV
  2014-07-17  3:19   ` [PATCH v3] " Stewart Smith
  2014-07-17  7:55     ` Alexander Graf
  2014-07-17 23:52     ` Paul Mackerras
@ 2014-07-18  4:18     ` Stewart Smith
  2014-07-18  4:18       ` [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function Stewart Smith
                         ` (2 more replies)
  2 siblings, 3 replies; 21+ messages in thread
From: Stewart Smith @ 2014-07-18  4:18 UTC (permalink / raw)
  To: linuxppc-dev, paulus, kvm-ppc, Alexander Graf; +Cc: Stewart Smith

changes since v3:
- use kvmppc namespace
- MPP_BUFFER_ORDER of 3 not 4, as we only need 32k and it's already 32k aligned
- split out kvmppc_vcore_create in separate patch
- give a variable a better name: s/tmp/mpp_addr/
- logmpp becomes static inline function

Stewart Smith (2):
  Split out struct kvmppc_vcore creation to separate function
  Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

 arch/powerpc/include/asm/cache.h      |    7 +++
 arch/powerpc/include/asm/kvm_host.h   |    2 +
 arch/powerpc/include/asm/ppc-opcode.h |   17 +++++++
 arch/powerpc/include/asm/reg.h        |    1 +
 arch/powerpc/kvm/book3s_hv.c          |   88 ++++++++++++++++++++++++++++-----
 5 files changed, 104 insertions(+), 11 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function
  2014-07-18  4:18     ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Stewart Smith
@ 2014-07-18  4:18       ` Stewart Smith
  2014-07-18  7:47         ` Paul Mackerras
  2014-07-18  4:18       ` [PATCH v4 2/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 Stewart Smith
  2014-07-28 12:34       ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Alexander Graf
  2 siblings, 1 reply; 21+ messages in thread
From: Stewart Smith @ 2014-07-18  4:18 UTC (permalink / raw)
  To: linuxppc-dev, paulus, kvm-ppc, Alexander Graf; +Cc: Stewart Smith

No code changes, just split it out into a function so that the addition
of micro partition prefetch buffer allocation (in a subsequent patch) looks
neater and doesn't require excessive indentation.

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7a12edb..cfe14f8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1228,6 +1228,26 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
 	return r;
 }
 
+static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
+{
+	struct kvmppc_vcore *vcore;
+
+	vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
+
+	if (vcore == NULL)
+		return NULL;
+
+	INIT_LIST_HEAD(&vcore->runnable_threads);
+	spin_lock_init(&vcore->lock);
+	init_waitqueue_head(&vcore->wq);
+	vcore->preempt_tb = TB_NIL;
+	vcore->lpcr = kvm->arch.lpcr;
+	vcore->first_vcpuid = core * threads_per_subcore;
+	vcore->kvm = kvm;
+
+	return vcore;
+}
+
 static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 						   unsigned int id)
 {
@@ -1279,16 +1299,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 	mutex_lock(&kvm->lock);
 	vcore = kvm->arch.vcores[core];
 	if (!vcore) {
-		vcore = kzalloc(sizeof(struct kvmppc_vcore), GFP_KERNEL);
-		if (vcore) {
-			INIT_LIST_HEAD(&vcore->runnable_threads);
-			spin_lock_init(&vcore->lock);
-			init_waitqueue_head(&vcore->wq);
-			vcore->preempt_tb = TB_NIL;
-			vcore->lpcr = kvm->arch.lpcr;
-			vcore->first_vcpuid = core * threads_per_subcore;
-			vcore->kvm = kvm;
-		}
+		vcore = kvmppc_vcore_create(kvm, core);
 		kvm->arch.vcores[core] = vcore;
 		kvm->arch.online_vcores++;
 	}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 2/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-18  4:18     ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Stewart Smith
  2014-07-18  4:18       ` [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function Stewart Smith
@ 2014-07-18  4:18       ` Stewart Smith
  2014-07-18  7:48         ` Paul Mackerras
  2014-07-28 12:34       ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Alexander Graf
  2 siblings, 1 reply; 21+ messages in thread
From: Stewart Smith @ 2014-07-18  4:18 UTC (permalink / raw)
  To: linuxppc-dev, paulus, kvm-ppc, Alexander Graf; +Cc: Stewart Smith

The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has way to store and load contents of L2 or
L2+MRU way of L3 cache". We initiate the storing of the log (list of
addresses) using the logmpp instruction and start restore by writing
to a SPR.

The logmpp instruction takes parameters in a single 64bit register:
- starting address of the table to store log of L2/L2+L3 cache contents
  - 32kb for L2
  - 128kb for L2+L3
  - Aligned relative to maximum size of the table (32kb or 128kb)
- Log control (no-op, L2 only, L2 and L3, abort logout)

We should abort any ongoing logging before initiating one.

To initiate restore, we write to the MPPR SPR. The format of what to write
to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, until end of table)
- prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
  32 cycles)

The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.

The best case scenario for doing this is when some vcores are doing very
cache heavy workloads. The worst case is when they have about 0 cache hits,
so we just generate needless memory operations.

This implementation just does L2 store/load. In my benchmarks this proves
to be useful.
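
As a concrete illustration of the layouts described above, using the
constants this patch adds (mpp_addr here stands for the physical address of
the log buffer masked with PPC_MPPE_ADDRESS_MASK):

	/* switch-out: kick off a log of the current L2 contents */
	mtspr(SPRN_MPPR, mpp_addr | PPC_MPPR_FETCH_ABORT);	/* stop any fetch in flight */
	logmpp(mpp_addr | PPC_LOGMPP_LOG_L2);			/* log control: L2 only */

	/* switch-in: terminate any in-progress log, then prefetch it back */
	logmpp(mpp_addr | PPC_LOGMPP_LOG_ABORT);
	mtspr(SPRN_MPPR, mpp_addr | PPC_MPPR_FETCH_WHOLE_TABLE);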

Benchmark 1:
 - 16 core POWER8
 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
 - No split core/SMT
 - two guests running sysbench memory test.
   sysbench --test=memory --num-threads=8 run
 - one guest running apache bench (of default HTML page)
   ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure performance of real world application (apache)
where other guests are cache hot with their own workloads. The sysbench memory
benchmark does pointer sized writes to a (small) memory buffer in a loop.

In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times was largely unchanged.

benchmark 2:
 - Same VM config as benchmark 1
 - all three guests running sysbench memory benchmark

This benchmark aims to see if there is a positive or negative effect on this
cache heavy benchmark. Due to the nature of the benchmark (stores) we may not
see a difference in raw performance, but rather, hopefully, an improvement in
consistency of performance (when a vcore is switched in, it does not have to
wait repeatedly for cachelines to be pulled in).

The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.

benchmark 3:
 - same 3 guests and CPU configuration as benchmark 1 and 2.
 - two idle guests
 - 1 guest running STREAM benchmark

This scenario also saw a performance improvement with this patch: on the Copy
and Scale workloads from STREAM I got a 5-6% improvement, and for Add and
Triad it was around 10% (or more).

benchmark 4:
 - same 3 guests as previous benchmarks
 - two guests running sysbench --memory, a distinctly different cache heavy
   workload
 - one guest running STREAM benchmark.

Similar improvements to benchmark 3.

benchmark 5:
 - 1 guest, 8 VCPUs, Ubuntu 14.04
 - Host configured with split core (SMT8, subcores-per-core=4)
 - STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board
in STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

--
changes since v3:
- use kvmppc namespace
- MPP_BUFFER_ORDER of 3 not 4, as we only need 32k and it's already 32k aligned
- split out kvmppc_vcore_create in separate patch
- give a variable a better name: s/tmp/mpp_addr/
- logmpp becomes static inline function

changes since v2:
- based on feedback from Alexander Graf:
  - move save and restore of cache to separate functions
  - move allocation of mpp_buffer to vcore creation
  - get_free_pages() does actually allocate pages aligned to order
    (Mel Gorman confirms)
  - make SPR and logmpp parameters a bit less magic, especially around abort

changes since v1:
- s/mppe/mpp_buffer/
- add MPP_BUFFER_ORDER define.
---
 arch/powerpc/include/asm/cache.h      |    7 ++++
 arch/powerpc/include/asm/kvm_host.h   |    2 ++
 arch/powerpc/include/asm/ppc-opcode.h |   17 ++++++++++
 arch/powerpc/include/asm/reg.h        |    1 +
 arch/powerpc/kvm/book3s_hv.c          |   57 ++++++++++++++++++++++++++++++++-
 5 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index ed0afc1..34a05a1 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -3,6 +3,7 @@
 
 #ifdef __KERNEL__
 
+#include <asm/reg.h>
 
 /* bytes per L1 cache line */
 #if defined(CONFIG_8xx) || defined(CONFIG_403GCX)
@@ -39,6 +40,12 @@ struct ppc64_caches {
 };
 
 extern struct ppc64_caches ppc64_caches;
+
+static inline void logmpp(u64 x)
+{
+	asm volatile(PPC_LOGMPP(R1) : : "r" (x));
+}
+
 #endif /* __powerpc64__ && ! __ASSEMBLY__ */
 
 #if defined(__ASSEMBLY__)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bb66d8b..5add9ac 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -305,6 +305,8 @@ struct kvmppc_vcore {
 	u32 arch_compat;
 	ulong pcr;
 	ulong dpdes;		/* doorbell state (POWER8) */
+	void *mpp_buffer; /* Micro Partition Prefetch buffer */
+	bool mpp_buffer_is_valid;
 };
 
 #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 3132bb9..c636841 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -139,6 +139,7 @@
 #define PPC_INST_ISEL			0x7c00001e
 #define PPC_INST_ISEL_MASK		0xfc00003e
 #define PPC_INST_LDARX			0x7c0000a8
+#define PPC_INST_LOGMPP			0x7c0007e4
 #define PPC_INST_LSWI			0x7c0004aa
 #define PPC_INST_LSWX			0x7c00042a
 #define PPC_INST_LWARX			0x7c000028
@@ -275,6 +276,20 @@
 #define __PPC_EH(eh)	0
 #endif
 
+/* POWER8 Micro Partition Prefetch (MPP) parameters */
+/* Address mask is common for LOGMPP instruction and MPPR SPR */
+#define PPC_MPPE_ADDRESS_MASK 0xffffffffc000
+
+/* Bits 60 and 61 of MPP SPR should be set to one of the following */
+/* Aborting the fetch is indeed setting 00 in the table size bits */
+#define PPC_MPPR_FETCH_ABORT (0x0ULL << 60)
+#define PPC_MPPR_FETCH_WHOLE_TABLE (0x2ULL << 60)
+
+/* Bits 54 and 55 of register for LOGMPP instruction should be set to: */
+#define PPC_LOGMPP_LOG_L2 (0x02ULL << 54)
+#define PPC_LOGMPP_LOG_L2L3 (0x01ULL << 54)
+#define PPC_LOGMPP_LOG_ABORT (0x03ULL << 54)
+
 /* Deal with instructions that older assemblers aren't aware of */
 #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
 					__PPC_RA(a) | __PPC_RB(b))
@@ -283,6 +298,8 @@
 #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
+#define PPC_LOGMPP(b)		stringify_in_c(.long PPC_INST_LOGMPP | \
+					__PPC_RB(b))
 #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
 					___PPC_RT(t) | ___PPC_RA(a) | \
 					___PPC_RB(b) | __PPC_EH(eh))
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index bffd89d..68cc9f9 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -225,6 +225,7 @@
 #define   CTRL_TE	0x00c00000	/* thread enable */
 #define   CTRL_RUNLATCH	0x1
 #define SPRN_DAWR	0xB4
+#define SPRN_MPPR	0xB8	/* Micro Partition Prefetch Register */
 #define SPRN_RPR	0xBA	/* Relative Priority Register */
 #define SPRN_CIABR	0xBB
 #define   CIABR_PRIV		0x3
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cfe14f8..03b758c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -35,6 +35,7 @@
 
 #include <asm/reg.h>
 #include <asm/cputable.h>
+#include <asm/cache.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -67,6 +68,13 @@
 /* Used as a "null" value for timebase values */
 #define TB_NIL	(~(u64)0)
 
+#if defined(CONFIG_PPC_64K_PAGES)
+#define MPP_BUFFER_ORDER	0
+#elif defined(CONFIG_PPC_4K_PAGES)
+#define MPP_BUFFER_ORDER	3
+#endif
+
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
@@ -1245,6 +1253,13 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
 	vcore->first_vcpuid = core * threads_per_subcore;
 	vcore->kvm = kvm;
 
+	vcore->mpp_buffer_is_valid = false;
+
+	if (cpu_has_feature(CPU_FTR_ARCH_207S))
+		vcore->mpp_buffer = (void *)__get_free_pages(
+			GFP_KERNEL|__GFP_ZERO,
+			MPP_BUFFER_ORDER);
+
 	return vcore;
 }
 
@@ -1511,6 +1526,33 @@ static int on_primary_thread(void)
 	return 1;
 }
 
+static void kvmppc_start_saving_l2_cache(struct kvmppc_vcore *vc)
+{
+	phys_addr_t phy_addr, mpp_addr;
+
+	phy_addr = (phys_addr_t)virt_to_phys(vc->mpp_buffer);
+	mpp_addr = phy_addr & PPC_MPPE_ADDRESS_MASK;
+
+	mtspr(SPRN_MPPR, mpp_addr | PPC_MPPR_FETCH_ABORT);
+	logmpp(mpp_addr | PPC_LOGMPP_LOG_L2);
+
+	vc->mpp_buffer_is_valid = true;
+}
+
+static void kvmppc_start_restoring_l2_cache(const struct kvmppc_vcore *vc)
+{
+	phys_addr_t phy_addr, mpp_addr;
+
+	phy_addr = virt_to_phys(vc->mpp_buffer);
+	mpp_addr = phy_addr & PPC_MPPE_ADDRESS_MASK;
+
+	/* We must abort any in-progress save operations to ensure
+	 * the table is valid so that prefetch engine knows when to
+	 * stop prefetching. */
+	logmpp(mpp_addr | PPC_LOGMPP_LOG_ABORT);
+	mtspr(SPRN_MPPR, mpp_addr | PPC_MPPR_FETCH_WHOLE_TABLE);
+}
+
 /*
  * Run a set of guest threads on a physical core.
  * Called with vc->lock held.
@@ -1588,9 +1630,16 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 
+	if (vc->mpp_buffer_is_valid)
+		kvmppc_start_restoring_l2_cache(vc);
+
 	__kvmppc_vcore_entry();
 
 	spin_lock(&vc->lock);
+
+	if (vc->mpp_buffer)
+		kvmppc_start_saving_l2_cache(vc);
+
 	/* disable sending of IPIs on virtual external irqs */
 	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
 		vcpu->cpu = -1;
@@ -2334,8 +2383,14 @@ static void kvmppc_free_vcores(struct kvm *kvm)
 {
 	long int i;
 
-	for (i = 0; i < KVM_MAX_VCORES; ++i)
+	for (i = 0; i < KVM_MAX_VCORES; ++i) {
+		if (kvm->arch.vcores[i] && kvm->arch.vcores[i]->mpp_buffer) {
+			struct kvmppc_vcore *vc = kvm->arch.vcores[i];
+			free_pages((unsigned long)vc->mpp_buffer,
+				   MPP_BUFFER_ORDER);
+		}
 		kfree(kvm->arch.vcores[i]);
+	}
 	kvm->arch.online_vcores = 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function
  2014-07-18  4:18       ` [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function Stewart Smith
@ 2014-07-18  7:47         ` Paul Mackerras
  0 siblings, 0 replies; 21+ messages in thread
From: Paul Mackerras @ 2014-07-18  7:47 UTC (permalink / raw)
  To: Stewart Smith; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc

On Fri, Jul 18, 2014 at 02:18:42PM +1000, Stewart Smith wrote:
> No code changes, just split it out to a function so that with the addition
> of micro partition prefetch buffer allocation (in subsequent patch) looks
> neater and doesn't require excessive indentation.
> 
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

Acked-by: Paul Mackerras <paulus@samba.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-18  4:18       ` [PATCH v4 2/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 Stewart Smith
@ 2014-07-18  7:48         ` Paul Mackerras
  0 siblings, 0 replies; 21+ messages in thread
From: Paul Mackerras @ 2014-07-18  7:48 UTC (permalink / raw)
  To: Stewart Smith; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc

On Fri, Jul 18, 2014 at 02:18:43PM +1000, Stewart Smith wrote:
> The POWER8 processor has a Micro Partition Prefetch Engine, which is
> a fancy way of saying "has way to store and load contents of L2 or
> L2+MRU way of L3 cache". We initiate the storing of the log (list of
> addresses) using the logmpp instruction and start restore by writing
> to a SPR.
> 
> The logmpp instruction takes parameters in a single 64bit register:
> - starting address of the table to store log of L2/L2+L3 cache contents
>   - 32kb for L2
>   - 128kb for L2+L3
>   - Aligned relative to maximum size of the table (32kb or 128kb)
> - Log control (no-op, L2 only, L2 and L3, abort logout)
> 
> We should abort any ongoing logging before initiating one.
> 
> To initiate restore, we write to the MPPR SPR. The format of what to write
> to the SPR is similar to the logmpp instruction parameter:
> - starting address of the table to read from (same alignment requirements)
> - table size (no data, until end of table)
> - prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
>   32 cycles)
> 
> The idea behind loading and storing the contents of L2/L3 cache is to
> reduce memory latency in a system that is frequently swapping vcores on
> a physical CPU.
> 
> The best case scenario for doing this is when some vcores are doing very
> cache heavy workloads. The worst case is when they have about 0 cache hits,
> so we just generate needless memory operations.
> 
> This implementation just does L2 store/load. In my benchmarks this proves
> to be useful.
> 
> Benchmark 1:
>  - 16 core POWER8
>  - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
>  - No split core/SMT
>  - two guests running sysbench memory test.
>    sysbench --test=memory --num-threads=8 run
>  - one guest running apache bench (of default HTML page)
>    ab -n 490000 -c 400 http://localhost/
> 
> This benchmark aims to measure performance of real world application (apache)
> where other guests are cache hot with their own workloads. The sysbench memory
> benchmark does pointer sized writes to a (small) memory buffer in a loop.
> 
> In this benchmark with this patch I can see an improvement both in requests
> per second (~5%) and in mean and median response times (again, about 5%).
> The spread of minimum and maximum response times were largely unchanged.
> 
> benchmark 2:
>  - Same VM config as benchmark 1
>  - all three guests running sysbench memory benchmark
> 
> This benchmark aims to see if there is a positive or negative affect to this
> cache heavy benchmark. Although due to the nature of the benchmark (stores) we
> may not see a difference in performance, but rather hopefully an improvement
> in consistency of performance (when vcore switched in, don't have to wait
> many times for cachelines to be pulled in)
> 
> The results of this benchmark are improvements in consistency of performance
> rather than performance itself. With this patch, the few outliers in duration
> go away and we get more consistent performance in each guest.
> 
> benchmark 3:
>  - same 3 guests and CPU configuration as benchmark 1 and 2.
>  - two idle guests
>  - 1 guest running STREAM benchmark
> 
> This scenario also saw performance improvement with this patch. On Copy and
> Scale workloads from STREAM, I got 5-6% improvement with this patch. For
> Add and triad, it was around 10% (or more).
> 
> benchmark 4:
>  - same 3 guests as previous benchmarks
>  - two guests running sysbench --memory, distinctly different cache heavy
>    workload
>  - one guest running STREAM benchmark.
> 
> Similar improvements to benchmark 3.
> 
> benchmark 5:
>  - 1 guest, 8 VCPUs, Ubuntu 14.04
>  - Host configured with split core (SMT8, subcores-per-core=4)
>  - STREAM benchmark
> 
> In this benchmark, we see a 10-20% performance improvement across the board
> of STREAM benchmark results with this patch.
> 
> Based on preliminary investigation and microbenchmarks
> by Prerna Saxena <prerna@linux.vnet.ibm.com>
> 
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

Acked-by: Paul Mackerras <paulus@samba.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
  2014-07-18  4:10       ` Stewart Smith
@ 2014-07-28 12:30         ` Alexander Graf
  0 siblings, 0 replies; 21+ messages in thread
From: Alexander Graf @ 2014-07-28 12:30 UTC (permalink / raw)
  To: Stewart Smith, linuxppc-dev, paulus, kvm-ppc


On 18.07.14 06:10, Stewart Smith wrote:
> Alexander Graf <agraf@suse.de> writes:
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>> index 1eaea2d..5769497 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -305,6 +305,8 @@ struct kvmppc_vcore {
>>>    	u32 arch_compat;
>>>    	ulong pcr;
>>>    	ulong dpdes;		/* doorbell state (POWER8) */
>>> +	unsigned long mpp_buffer; /* Micro Partition Prefetch buffer */
>> Just make this a void*?
> get_free_pages returns an unsigned long and free_pages accepts an
> unsigned long, so I was just avoiding the cast. Is the style in this
> case to do void* rather than unsigned long and cast it everywhere?
>
> In v4 of patch I've gone to void* anyway.

It's probably just a matter of personal taste, but I personally prefer 
to keep pointers to memory locations in pointers.

>
>>> @@ -1516,6 +1540,37 @@ static int on_primary_thread(void)
>>>    	return 1;
>>>    }
>>>    
>>> +static void ppc_start_saving_l2_cache(struct kvmppc_vcore *vc)
>> Please use the "kvmppc" name space.
> ack, done.
>
>>> +	phys_addr_t phy_addr, tmp;
>>> +
>>> +	phy_addr = (phys_addr_t)virt_to_phys((void *)vc->mpp_buffer);
>>> +
>>> +	tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
>> I would prefer if you give the variable a more telling name.
> ack.
>
>>> +
>>> +	mtspr(SPRN_MPPR, tmp | PPC_MPPR_FETCH_ABORT);
>>> +
>>> +	asm volatile(PPC_LOGMPP(R1) : : "r" (tmp | PPC_LOGMPP_LOG_L2));
>> Can you move this asm() into a static inline function in generic code
>> somewhere?
> okay. It seems the best place may be powerpc/include/asm/cache.h -
> simply because it deals with cache things. I'm open to better
> suggestions :)
>
>>> +
>>> +	vc->mpp_buffer_is_valid = true;
>> Where does this ever get unset? And what point does this variable make?
>> Can't you just check on if (vc->mpp_buffer)?
> The problem with having moved the memory allocation for mpp_buffer to
> vcore setup is that we'll have vc->mpp_buffer != NULL but have some
> random contents in it, so when we first start executing the vcore, we
> shouldn't initiate prefetching (hence mpp_buffer_is_valid).
>
> If we point the prefetch engine to random memory contents, we get the
> most amazing array of incomprehensible illegal accesses :)

I see :). That makes a lot of sense indeed. Maybe rename the variable to 
mpp_content_is_valid to indicate that we are not looking at a valid 
buffer, but valid content?


Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV
  2014-07-18  4:18     ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Stewart Smith
  2014-07-18  4:18       ` [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function Stewart Smith
  2014-07-18  4:18       ` [PATCH v4 2/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 Stewart Smith
@ 2014-07-28 12:34       ` Alexander Graf
  2 siblings, 0 replies; 21+ messages in thread
From: Alexander Graf @ 2014-07-28 12:34 UTC (permalink / raw)
  To: Stewart Smith, linuxppc-dev, paulus, kvm-ppc


On 18.07.14 06:18, Stewart Smith wrote:
> changes since v3:
> - use kvmppc namespace
> - MPP_BUFFER_ORDER of 3 not 4, as we only need 32k and it's already 32k aligned
> - split out kvmppc_vcore_create in separate patch
> - give a variable a better name: s/tmp/mpp_addr/
> - logmpp becomes static inline function
>
> Stewart Smith (2):
>    Split out struct kvmppc_vcore creation to separate function
>    Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

I only realized just now that I've commented on v3 :). Sorry. I guess 
the name isn't really important enough for a respin. If you come up with 
a really nice one, I'd be happy to apply another patch that renames it.

Thanks, applied all to kvm-ppc-queue.


Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-07-28 12:34 UTC | newest]

Thread overview: 21+ messages
2014-07-04  1:23 [PATCH] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 Stewart Smith
2014-07-08  5:06 ` [PATCH v2] " Stewart Smith
2014-07-08 10:41   ` Alexander Graf
2014-07-08 22:59     ` Stewart Smith
2014-07-10 11:05       ` Alexander Graf
2014-07-10 13:07         ` Mel Gorman
2014-07-10 13:17           ` Alexander Graf
2014-07-10 13:30             ` Mel Gorman
2014-07-10 13:30               ` Alexander Graf
2014-07-17  3:19   ` [PATCH v3] " Stewart Smith
2014-07-17  7:55     ` Alexander Graf
2014-07-18  4:10       ` Stewart Smith
2014-07-28 12:30         ` Alexander Graf
2014-07-17 23:52     ` Paul Mackerras
2014-07-18  4:10       ` Stewart Smith
2014-07-18  4:18     ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Stewart Smith
2014-07-18  4:18       ` [PATCH v4 1/2] Split out struct kvmppc_vcore creation to separate function Stewart Smith
2014-07-18  7:47         ` Paul Mackerras
2014-07-18  4:18       ` [PATCH v4 2/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 Stewart Smith
2014-07-18  7:48         ` Paul Mackerras
2014-07-28 12:34       ` [PATCH v4 0/2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV Alexander Graf
