* [RFC 00/10] KVM: Add TMEM host/guest support
@ 2012-06-06 13:07 Sasha Levin
  2012-06-06 13:24 ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2012-06-06 13:07 UTC (permalink / raw)
  To: avi, mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk
  Cc: kvm, Sasha Levin

This patch series adds support for passing TMEM commands between KVM guests
and the host. This opens up the possibility of using TMEM across guests, and
possibly across hosts with RAMster.

Since frontswap was merged in the 3.4 cycle, the kernel now has all the
facilities required to work with TMEM; there is no longer a dependency on
out-of-tree code.

We can split this patch series into two parts:

 - The guest side, which is basically two shims that proxy mm/cleancache.c
 and mm/frontswap.c requests from the guest back to the host. This is done
 using a new KVM_HC_TMEM hypercall.

 - The host side, which is a rather small shim that connects KVM to zcache.


It's worth noting that this patch series doesn't have any significant logic in
it; it is mostly a collection of shims that pass TMEM commands across hypercalls.
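
As a rough illustration of the guest-side shim, here is a hedged sketch
(illustrative only - the descriptor layout, field names and the kvm_tmem_call()
helper below are not the actual patch code; only KVM_HC_TMEM and
kvm_hypercall1() come from the series/kernel): a TMEM command is marshalled
into a small descriptor and its guest-physical address is handed to the host,
which unpacks it and forwards it to zcache.

	#include <linux/types.h>
	#include <linux/kvm_para.h>
	#include <asm/io.h>

	/* Illustrative marshalling structure; the series defines its own layout. */
	struct kvm_tmem_op {
		u32 cmd;	/* put/get/invalidate, mirroring the tmem command set */
		u32 pool_id;	/* pool created per filesystem / swap device */
		u64 object;	/* object id, e.g. derived from the inode */
		u64 index;	/* page index within the object */
		u64 gpa;	/* guest-physical address of the page payload */
	};

	static long kvm_tmem_call(struct kvm_tmem_op *op)
	{
		/* One guest exit per TMEM command; the host-side shim
		 * unpacks the descriptor and calls into zcache. */
		return kvm_hypercall1(KVM_HC_TMEM, virt_to_phys(op));
	}

The cleancache and frontswap shims then amount to filling in such a descriptor
from their callback arguments and issuing the hypercall.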

I ran benchmarks using both the "streaming test" proposed by Avi, and some
general fio tests. Since the fio tests showed similar results to the
streaming test, and no anomalies, here is the summary of the streaming tests:

First, trying to stream a 26GB random file without KVM TMEM:
real    7m36.046s
user    0m17.113s
sys     5m23.809s

And with KVM TMEM:
real    7m36.018s
user    0m17.124s
sys     5m28.391s

 - No significant difference.

Now, trying to stream a 16GB file that compresses nicely, first without KVM TMEM:
real    5m10.299s
user    0m11.311s
sys     3m40.139s

And a second run without dropping cache:
real    4m33.951s
user    0m10.869s
sys     3m13.789s

Now, with KVM TMEM:
real    4m55.528s
user    0m11.119s
sys     3m33.243s

And a second run:
real    2m53.713s
user    0m7.971s
sys     2m29.807s

So KVM TMEM shows a nice performance increase once it can store pages on the host.

Sasha Levin (10):
  KVM: reintroduce hc_gpa
  KVM: wire up the TMEM HC
  zcache: export zcache interface
  KVM: add KVM TMEM entries in the appropriate config menu entry
  KVM: bring in general tmem definitions
  zcache: move out client declaration and add a KVM client
  KVM: add KVM TMEM host side interface
  KVM: add KVM TMEM guest support
  KVM: support guest side cleancache
  KVM: support guest side frontswap

 arch/x86/kvm/Kconfig                 |    1 +
 arch/x86/kvm/Makefile                |    2 +
 arch/x86/kvm/tmem/Kconfig            |   43 +++++++++++
 arch/x86/kvm/tmem/Makefile           |    6 ++
 arch/x86/kvm/tmem/cleancache.c       |  120 +++++++++++++++++++++++++++++
 arch/x86/kvm/tmem/frontswap.c        |  139 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/tmem/guest.c            |   95 +++++++++++++++++++++++
 arch/x86/kvm/tmem/guest.h            |   11 +++
 arch/x86/kvm/tmem/host.c             |   78 +++++++++++++++++++
 arch/x86/kvm/tmem/host.h             |   20 +++++
 arch/x86/kvm/tmem/tmem.h             |   62 +++++++++++++++
 arch/x86/kvm/x86.c                   |   13 +++
 drivers/staging/zcache/zcache-main.c |   48 ++++++++++--
 drivers/staging/zcache/zcache.h      |   20 +++++
 include/linux/kvm_para.h             |    1 +
 15 files changed, 652 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/kvm/tmem/Kconfig
 create mode 100644 arch/x86/kvm/tmem/Makefile
 create mode 100644 arch/x86/kvm/tmem/cleancache.c
 create mode 100644 arch/x86/kvm/tmem/frontswap.c
 create mode 100644 arch/x86/kvm/tmem/guest.c
 create mode 100644 arch/x86/kvm/tmem/guest.h
 create mode 100644 arch/x86/kvm/tmem/host.c
 create mode 100644 arch/x86/kvm/tmem/host.h
 create mode 100644 arch/x86/kvm/tmem/tmem.h
 create mode 100644 drivers/staging/zcache/zcache.h

-- 
1.7.8.6



* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-06 13:07 [RFC 00/10] KVM: Add TMEM host/guest support Sasha Levin
@ 2012-06-06 13:24 ` Avi Kivity
  2012-06-08 13:20   ` Sasha Levin
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-06 13:24 UTC (permalink / raw)
  To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On 06/06/2012 04:07 PM, Sasha Levin wrote:
> This patch series adds support for passing TMEM commands between KVM guests
> and the host. This opens the possibility to use TMEM cross-guests and
> posibly across hosts with RAMster.
> 
> Since frontswap was merged in the 3.4 cycle, the kernel now has all facilities
> required to work with TMEM. There is no longer a dependency on out of tree
> code.
> 
> We can split this patch series into two:
> 
>  - The guest side, which is basically two shims that proxy mm/cleancache.c
>  and mm/frontswap.c requests from the guest back to the host. This is done
>  using a new KVM_HC_TMEM hypercall.
> 
>  - The host side, which is a rather small shim which connects KVM to zcache.
> 
> 
> It's worth noting that this patch series don't have any significant logic in
> it, and is mostly a collection of shims to pass TMEM commands across hypercalls.
> 
> I ran benchmarks using both the "streaming test" proposed by Avi, and some
> general fio tests. Since the fio tests showed similar results to the
> streaming test, and no anomalies, here is the summary of the streaming tests:
> 
> First, trying to stream a 26GB random file without KVM TMEM:
> real    7m36.046s
> user    0m17.113s
> sys     5m23.809s
> 
> And with KVM TMEM:
> real    7m36.018s
> user    0m17.124s
> sys     5m28.391s

These results give about 47 usec per page system time (seems quite
high), whereas the difference is 0.7 user per page (seems quite low, for
1 or 2 syscalls per page).  Can you post a snapshot of kvm_stat while
this is running?
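
Worked through, assuming 4 KiB pages, the arithmetic is roughly:

	26 GiB / 4 KiB                      ~= 6.8M pages
	~324s sys / 6.8M pages              ~= 47 usec/page
	(328.4s - 323.8s) / 6.8M pages      ~= 0.7 usec/page added with KVM TMEM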


> 
>  - No significant difference.
> 
> Now, trying to stream a 16gb file that compresses nicely, first without KVM TMEM:
> real    5m10.299s
> user    0m11.311s
> sys     3m40.139s
> 
> And a second run without dropping cache:
> real    4m33.951s
> user    0m10.869s
> sys     3m13.789s
> 
> Now, with KVM TMEM:
> real    4m55.528s
> user    0m11.119s
> sys     3m33.243s

How is the first run faster?  Is it not doing extra work, pushing pages
to the host?

> 
> And a second run:
> real    2m53.713s
> user    0m7.971s
> sys     2m29.807s

A nice result, yes.

> 
> So KVM TMEM shows a nice performance increase once it can store pages on the host.

How was caching set up?  cache=none (in qemu terms) is most
representative, but cache=writeback also allows the host to cache guest
pages, while cache=writeback with cleancache enabled in the host should
give the same effect, but with the extra hypercalls and an extra
copy to manage the host pagecache.  It would be good to see results for
all three settings.

-- 
error compiling committee.c: too many arguments to function


* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-06 13:24 ` Avi Kivity
@ 2012-06-08 13:20   ` Sasha Levin
  2012-06-08 16:06     ` Dan Magenheimer
  2012-06-11  8:09     ` Avi Kivity
  0 siblings, 2 replies; 20+ messages in thread
From: Sasha Levin @ 2012-06-08 13:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

I re-ran the benchmarks in a single-user environment to get more stable results, increasing the test files to 50GB each.

First, a test of the good-case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:

First, no KVM TMEM, caching=none:

	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
	2048+0 records in
	2048+0 records out
	8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s

	real    1m56.349s
	user    0m0.015s
	sys     0m15.671s
	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
	2048+0 records in
	2048+0 records out
	8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s

	real    1m56.255s
	user    0m0.018s
	sys     0m15.504s

Now, no KVM TMEM, caching=writeback:

	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
	2048+0 records in
	2048+0 records out
	8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s

	real    2m2.965s
	user    0m0.015s
	sys     0m11.025s
	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
	2048+0 records in
	2048+0 records out
	8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s

	real    1m50.968s
	user    0m0.011s
	sys     0m10.108s

And finally, KVM TMEM on, caching=none:

	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
	2048+0 records in
	2048+0 records out
	8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s

	real    1m59.123s
	user    0m0.020s
	sys     0m29.336s

	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
	2048+0 records in
	2048+0 records out
	8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s

	real    0m36.950s
	user    0m0.005s
	sys     0m35.308s

This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:

	kvm statistics

	 kvm_exit                                   1952342   36037
	 kvm_entry                                  1952334   36034
	 kvm_hypercall                              1710568   33948
	 kvm_apic                                    109027    1319
	 kvm_emulate_insn                             63745     673
	 kvm_mmio                                     63483     669
	 kvm_inj_virq                                 45899     654
	 kvm_apic_accept_irq                          45809     654
	 kvm_pio                                      18445      52
	 kvm_set_irq                                  19102      50
	 kvm_msi_set_irq                              17809      47
	 kvm_fpu                                        244      18
	 kvm_apic_ipi                                   368       8
	 kvm_cr                                          70       6
	 kvm_userspace_exit                             897       5
	 kvm_cpuid                                       48       5
	 vcpu_match_mmio                                257       3
	 kvm_pic_set_irq                               1293       3
	 kvm_ioapic_set_irq                            1293       3
	 kvm_ack_irq                                     84       1
	 kvm_page_fault                               60538       0


Now, for the worst-case "streaming test". I've tried streaming two files, one which compresses well (zeros) and one full of random bits, doing two runs for each.

First, the baseline - no KVM TMEM, caching=none:

Zero file:
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s

	real    11m43.583s
	user    0m0.106s
	sys     1m42.075s
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s

	real    11m31.284s
	user    0m0.100s
	sys     1m41.235s

Random file:
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s

	real    10m55.847s
	user    0m0.107s
	sys     1m39.852s
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s

	real    10m52.739s
	user    0m0.120s
	sys     1m39.712s

Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:

Zeros:
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s

	real    11m44.536s
	user    0m0.088s
	sys     2m0.639s
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s

	real    11m30.561s
	user    0m0.088s
	sys     1m57.637s

Random:
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s

	real    10m56.480s
	user    0m0.034s
	sys     3m18.750s
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s

	real    10m58.499s
	user    0m0.046s
	sys     3m23.678s

Next, with KVM TMEM enabled, caching=none:

Zeros:
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s

	real    11m51.916s
	user    0m0.081s
	sys     2m59.952s
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s

	real    11m31.102s
	user    0m0.082s
	sys     3m6.500s

Random:
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s

	real    10m56.445s
	user    0m0.062s
	sys     5m53.236s
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s

	real    10m53.404s
	user    0m0.066s
	sys     5m57.087s


This is a snapshot of kvm_stats while this test was running:

	kvm statistics

	 kvm_entry                                   168179   20729
	 kvm_exit                                    168179   20728
	 kvm_hypercall                               131808   16409
	 kvm_apic                                     17305    2006
	 kvm_mmio                                     10877    1259
	 kvm_emulate_insn                             10974    1258
	 kvm_page_fault                                6270     866
	 kvm_inj_virq                                  6532     751
	 kvm_apic_accept_irq                           6516     751
	 kvm_set_irq                                   4888     536
	 kvm_msi_set_irq                               4471     536
	 kvm_pio                                       4714     529
	 kvm_userspace_exit                             300       2
	 vcpu_match_mmio                                 83       2
	 kvm_apic_ipi                                    69       2
	 kvm_pic_set_irq                                417       0
	 kvm_ioapic_set_irq                             417       0
	 kvm_fpu                                         76       0
	 kvm_ack_irq                                     27       0
	 kvm_cr                                          24       0
	 kvm_cpuid                                       16       0

And finally, KVM TMEM enabled, with caching=writeback:

Zeros:
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 710.62 s, 75.5 MB/s

	real    11m50.698s
	user    0m0.078s
	sys     3m29.920s
	12800+0 records in
	12800+0 records out
	53687091200 bytes (54 GB) copied, 686.286 s, 78.2 MB/s

	real    11m26.321s
	user    0m0.088s
	sys     3m25.931s

Random:
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 673.831 s, 78.4 MB/s

	real    11m13.883s
	user    0m0.047s
	sys     4m5.569s
	12594+1 records in
	12594+1 records out
	52824875008 bytes (53 GB) copied, 673.594 s, 78.4 MB/s

	real    11m13.619s
	user    0m0.056s
	sys     4m12.134s



* RE: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-08 13:20   ` Sasha Levin
@ 2012-06-08 16:06     ` Dan Magenheimer
  2012-06-11 11:17       ` Avi Kivity
  2012-06-11  8:09     ` Avi Kivity
  1 sibling, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-08 16:06 UTC (permalink / raw)
  To: Sasha Levin, Avi Kivity; +Cc: mtosatti, gregkh, sjenning, Konrad Wilk, kvm

> From: Sasha Levin [mailto:levinsasha928@gmail.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
> 
> I re-ran benchmarks in a single user environment to get more stable results, increasing the test files
> to 50gb each.

Nice results Sasha!

The non-increase in real time and the significant increase in sys
time demonstrate that tmem should have little or no impact as
long as there are sufficient unused CPU cycles... since
tmem is most active on I/O-bound workloads, when there tends
to be lots of idle cpu time, tmem is usually "free".  But
if KVM perfectly load-balances across the sum of all guests
so that there is little or no cpu idle time (rare but
possible), there will be a measurable impact.

For a true worst case analysis, try running cpus=1.  (One
can argue that anyone who runs KVM on a single cpu system
deserves what they get ;-)  But, the "WasActive" patch[1]
(if adapted slightly for the KVM-TMEM patch) should eliminate
the negative impact on systime of streaming workloads even
on cpus=1.

> From: Avi Kivity [mailto:avi@redhat.com]
>  <this comment was on Sasha's first round of benchmarking>
> These results give about 47 usec per page system time (seems quite
> high), whereas the difference is 0.7 user per page (seems quite low, for
> 1 or 2 syscalls per page).  Can you post a snapshot of kvm_stat while
> this is running?

Note that the userspace difference is likely all noise.
No tmem/zcache activities should be done in userspace.  All
the activity results from either a page fault or kswapd.

Since each streamed page (assuming no WasActive patch) should
result in one hypercall and one lzo1x page compression, I suspect
that 47usec is a good estimate of the sum of those on Sasha's machine.

[1] https://lkml.org/lkml/2012/1/25/300 
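
For scale on the compression half of that per-page estimate, the work is one
lzo1x pass over a 4 KiB page; here is a hedged, simplified sketch of that step
(the function below is illustrative, not zcache code, and zcache's real buffer
and per-cpu workmem handling differs):

	#include <linux/types.h>
	#include <linux/lzo.h>
	#include <asm/page.h>

	/*
	 * Compress one page.  dst must hold lzo1x_worst_compress(PAGE_SIZE)
	 * bytes and wrkmem must be LZO1X_1_MEM_COMPRESS bytes.
	 */
	static int compress_one_page(const void *page_va, unsigned char *dst,
				     size_t *dst_len, void *wrkmem)
	{
		return lzo1x_1_compress(page_va, PAGE_SIZE, dst, dst_len, wrkmem);
	}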


* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-08 13:20   ` Sasha Levin
  2012-06-08 16:06     ` Dan Magenheimer
@ 2012-06-11  8:09     ` Avi Kivity
  2012-06-11 10:26       ` Sasha Levin
  1 sibling, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-11  8:09 UTC (permalink / raw)
  To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On 06/08/2012 04:20 PM, Sasha Levin wrote:
> I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.
> 
> First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:
> 
> First, no KVM TMEM, caching=none:
> 
> 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 	2048+0 records in
> 	2048+0 records out
> 	8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
> 
> 	real    1m56.349s
> 	user    0m0.015s
> 	sys     0m15.671s
> 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 	2048+0 records in
> 	2048+0 records out
> 	8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
> 
> 	real    1m56.255s
> 	user    0m0.018s
> 	sys     0m15.504s
> 
> Now, no KVM TMEM, caching=writeback:
> 
> 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 	2048+0 records in
> 	2048+0 records out
> 	8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
> 
> 	real    2m2.965s
> 	user    0m0.015s
> 	sys     0m11.025s
> 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 	2048+0 records in
> 	2048+0 records out
> 	8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
> 
> 	real    1m50.968s
> 	user    0m0.011s
> 	sys     0m10.108s

Strange that system time is lower with cache=writeback.

> 
> And finally, KVM TMEM on, caching=none:
> 
> 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 	2048+0 records in
> 	2048+0 records out
> 	8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
> 
> 	real    1m59.123s
> 	user    0m0.020s
> 	sys     0m29.336s
> 
> 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 	2048+0 records in
> 	2048+0 records out
> 	8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
> 
> 	real    0m36.950s
> 	user    0m0.005s
> 	sys     0m35.308s

So system time more than doubled compared to non-tmem cache=none.  The
overhead per page is 17s / (8589934592/4096) = 8.1usec.  Seems quite high.


'perf top' while this is running would be interesting.

> 
> This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
> 
> 	kvm statistics
> 
> 	 kvm_exit                                   1952342   36037
> 	 kvm_entry                                  1952334   36034
> 	 kvm_hypercall                              1710568   33948

In that test, 56k pages/sec were transferred.  Why are we seeing only
33k hypercalls/sec?  Shouldn't we have two hypercalls/page (one when
evicting a page to make some room, one to read the new page from tmem)?

> 
> 
> Now, for the worst case "streaming test". I've tried streaming two files, one which has good compression (zeros), and one full with random bits. Doing two runs for each.
> 
> First, the baseline - no KVM TMEM, caching=none:
> 
> Zero file:
> 	12800+0 records in
> 	12800+0 records out
> 	53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
> 
> 	real    11m43.583s
> 	user    0m0.106s
> 	sys     1m42.075s
> 	12800+0 records in
> 	12800+0 records out
> 	53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
> 
> 	real    11m31.284s
> 	user    0m0.100s
> 	sys     1m41.235s
> 
> Random file:
> 	12594+1 records in
> 	12594+1 records out
> 	52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
> 
> 	real    10m55.847s
> 	user    0m0.107s
> 	sys     1m39.852s
> 	12594+1 records in
> 	12594+1 records out
> 	52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
> 
> 	real    10m52.739s
> 	user    0m0.120s
> 	sys     1m39.712s
> 
> Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
> 
> Zeros:
> 	12800+0 records in
> 	12800+0 records out
> 	53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
> 
> 	real    11m44.536s
> 	user    0m0.088s
> 	sys     2m0.639s
> 	12800+0 records in
> 	12800+0 records out
> 	53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
> 
> 	real    11m30.561s
> 	user    0m0.088s
> 	sys     1m57.637s

zcache appears not to be helping at all; it's just adding overhead.  Is
even the compressed file too large?

overhead = 1.4 usec/page.

> 
> Random:
> 	12594+1 records in
> 	12594+1 records out
> 	52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
> 
> 	real    10m56.480s
> 	user    0m0.034s
> 	sys     3m18.750s
> 	12594+1 records in
> 	12594+1 records out
> 	52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
> 
> 	real    10m58.499s
> 	user    0m0.046s
> 	sys     3m23.678s

Overhead grows to 7.6 usec/page.

> 
> Next, with KVM TMEM enabled, caching=none:
> 
> Zeros:
> 	12800+0 records in
> 	12800+0 records out
> 	53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
> 
> 	real    11m51.916s
> 	user    0m0.081s
> 	sys     2m59.952s
> 	12800+0 records in
> 	12800+0 records out
> 	53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
> 
> 	real    11m31.102s
> 	user    0m0.082s
> 	sys     3m6.500s

Overhead = 6.6 usec/page.

> 
> Random:
> 	12594+1 records in
> 	12594+1 records out
> 	52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
> 
> 	real    10m56.445s
> 	user    0m0.062s
> 	sys     5m53.236s
> 	12594+1 records in
> 	12594+1 records out
> 	52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
> 
> 	real    10m53.404s
> 	user    0m0.066s
> 	sys     5m57.087s


Overhead = 19 usec/page.

This is pretty steep.  We have flash storage doing a million iops/sec,
and here you add 19 microseconds to that.

> 
> 
> This is a snapshot of kvm_stats while this test was running:
> 
> 	kvm statistics
> 
> 	 kvm_entry                                   168179   20729
> 	 kvm_exit                                    168179   20728
> 	 kvm_hypercall                               131808   16409

The last test was running 19k pages/sec, doesn't quite fit with this
measurement.  Is the measurement stable or fluctuating?

> 
> And finally, KVM TMEM enabled, with caching=writeback:

I'm not sure what the point of this is?  You have two host-caching
mechanisms running in parallel, are you trying to increase overhead
while reducing effective cache size?


My conclusion is that the overhead is quite high, but please double
check my numbers, maybe I missed something obvious.

-- 
error compiling committee.c: too many arguments to function


* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11  8:09     ` Avi Kivity
@ 2012-06-11 10:26       ` Sasha Levin
  2012-06-11 11:45         ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2012-06-11 10:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On Mon, 2012-06-11 at 11:09 +0300, Avi Kivity wrote:
> On 06/08/2012 04:20 PM, Sasha Levin wrote:
> > I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.
> > 
> > First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:
> > 
> > First, no KVM TMEM, caching=none:
> > 
> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 	2048+0 records in
> > 	2048+0 records out
> > 	8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
> > 
> > 	real    1m56.349s
> > 	user    0m0.015s
> > 	sys     0m15.671s
> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 	2048+0 records in
> > 	2048+0 records out
> > 	8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
> > 
> > 	real    1m56.255s
> > 	user    0m0.018s
> > 	sys     0m15.504s
> > 
> > Now, no KVM TMEM, caching=writeback:
> > 
> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 	2048+0 records in
> > 	2048+0 records out
> > 	8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
> > 
> > 	real    2m2.965s
> > 	user    0m0.015s
> > 	sys     0m11.025s
> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 	2048+0 records in
> > 	2048+0 records out
> > 	8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
> > 
> > 	real    1m50.968s
> > 	user    0m0.011s
> > 	sys     0m10.108s
> 
> Strange that system time is lower with cache=writeback.

Maybe because these pages don't get written out immediately? I don't
have a better guess.

> > And finally, KVM TMEM on, caching=none:
> > 
> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 	2048+0 records in
> > 	2048+0 records out
> > 	8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
> > 
> > 	real    1m59.123s
> > 	user    0m0.020s
> > 	sys     0m29.336s
> > 
> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 	2048+0 records in
> > 	2048+0 records out
> > 	8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
> > 
> > 	real    0m36.950s
> > 	user    0m0.005s
> > 	sys     0m35.308s
> 
> So system time more than doubled compared to non-tmem cache=none.  The
> overhead per page is 17s / (8589934592/4096) = 8.1usec.  Seems quite high.

Right, but consider it didn't increase real time at all.

> 'perf top' while this is running would be interesting.

I'll update later with this.

> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
> > 
> > 	kvm statistics
> > 
> > 	 kvm_exit                                   1952342   36037
> > 	 kvm_entry                                  1952334   36034
> > 	 kvm_hypercall                              1710568   33948
> 
> In that test, 56k pages/sec were transferred.  Why are we seeing only
> 33k hypercalls/sec?  Shouldn't we have two hypercalls/page (one when
> evicting a page to make some room, one to read the new page from tmem)?

The guest doesn't do eviction at all, in fact - it doesn't know how big
the cache is so even if it wanted to, it couldn't evict pages (the only
thing it does is invalidate pages which have changed in the guest).

This means it only takes one hypercall/page instead of two.

> > 
> > 
> > Now, for the worst case "streaming test". I've tried streaming two files, one which has good compression (zeros), and one full with random bits. Doing two runs for each.
> > 
> > First, the baseline - no KVM TMEM, caching=none:
> > 
> > Zero file:
> > 	12800+0 records in
> > 	12800+0 records out
> > 	53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
> > 
> > 	real    11m43.583s
> > 	user    0m0.106s
> > 	sys     1m42.075s
> > 	12800+0 records in
> > 	12800+0 records out
> > 	53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
> > 
> > 	real    11m31.284s
> > 	user    0m0.100s
> > 	sys     1m41.235s
> > 
> > Random file:
> > 	12594+1 records in
> > 	12594+1 records out
> > 	52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
> > 
> > 	real    10m55.847s
> > 	user    0m0.107s
> > 	sys     1m39.852s
> > 	12594+1 records in
> > 	12594+1 records out
> > 	52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
> > 
> > 	real    10m52.739s
> > 	user    0m0.120s
> > 	sys     1m39.712s
> > 
> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
> > 
> > Zeros:
> > 	12800+0 records in
> > 	12800+0 records out
> > 	53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
> > 
> > 	real    11m44.536s
> > 	user    0m0.088s
> > 	sys     2m0.639s
> > 	12800+0 records in
> > 	12800+0 records out
> > 	53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
> > 
> > 	real    11m30.561s
> > 	user    0m0.088s
> > 	sys     1m57.637s
> 
> zcache appears not to be helping at all; it's just adding overhead.  Is
> even the compressed file too large?
> 
> overhead = 1.4 usec/page.

Correct - I had to further increase the size of this file so that
zcache would fail here as well. The good case was tested before; here I
wanted to see what would happen with files that wouldn't get much
benefit from either regular caching or zcache.

> > 
> > Random:
> > 	12594+1 records in
> > 	12594+1 records out
> > 	52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
> > 
> > 	real    10m56.480s
> > 	user    0m0.034s
> > 	sys     3m18.750s
> > 	12594+1 records in
> > 	12594+1 records out
> > 	52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
> > 
> > 	real    10m58.499s
> > 	user    0m0.046s
> > 	sys     3m23.678s
> 
> Overhead grows to 7.6 usec/page.
> 
> > 
> > Next, with KVM TMEM enabled, caching=none:
> > 
> > Zeros:
> > 	12800+0 records in
> > 	12800+0 records out
> > 	53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
> > 
> > 	real    11m51.916s
> > 	user    0m0.081s
> > 	sys     2m59.952s
> > 	12800+0 records in
> > 	12800+0 records out
> > 	53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
> > 
> > 	real    11m31.102s
> > 	user    0m0.082s
> > 	sys     3m6.500s
> 
> Overhead = 6.6 usec/page.
> 
> > 
> > Random:
> > 	12594+1 records in
> > 	12594+1 records out
> > 	52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
> > 
> > 	real    10m56.445s
> > 	user    0m0.062s
> > 	sys     5m53.236s
> > 	12594+1 records in
> > 	12594+1 records out
> > 	52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
> > 
> > 	real    10m53.404s
> > 	user    0m0.066s
> > 	sys     5m57.087s
> 
> 
> Overhead = 19 usec/page.
> 
> This is pretty steep.  We have flash storage doing a million iops/sec,
> and here you add 19 microseconds to that.

Might be interesting to test it with flash storage as well...

> > 
> > 
> > This is a snapshot of kvm_stats while this test was running:
> > 
> > 	kvm statistics
> > 
> > 	 kvm_entry                                   168179   20729
> > 	 kvm_exit                                    168179   20728
> > 	 kvm_hypercall                               131808   16409
> 
> The last test was running 19k pages/sec, doesn't quite fit with this
> measurement.  Is the measurement stable or fluctuating?

It's pretty stable when running the "zero" pages, but when switching to
random files it somewhat fluctuates.

> > 
> > And finally, KVM TMEM enabled, with caching=writeback:
> 
> I'm not sure what the point of this is?  You have two host-caching
> mechanisms running in parallel, are you trying to increase overhead
> while reducing effective cache size?

I thought that you'd asked for this test:

On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote:
> while cache=writeback with cleancache enabled in the host should
> give the same effect, but with the extra hypercalls and an extra
> copy to manage the host pagecache.  It would be good to see results for all three settings.

> My conclusion is that the overhead is quite high, but please double
> check my numbers, maybe I missed something obvious.

I'm not sure what options I have to lower the overhead here - should I be
using something other than hypercalls to communicate with the host?

I know that there are several things being worked on from zcache
perspective (WasActive, batching, etc), but is there something that
could be done within the scope of kvm-tmem?

It would be interesting to see results for Xen/TMEM and compare
them with these results.



* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-08 16:06     ` Dan Magenheimer
@ 2012-06-11 11:17       ` Avi Kivity
  0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 11:17 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/08/2012 07:06 PM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:avi@redhat.com]
>>  <this comment was on Sasha's first round of benchmarking>
>> These results give about 47 usec per page system time (seems quite
>> high), whereas the difference is 0.7 user per page (seems quite low, for
>> 1 or 2 syscalls per page).  Can you post a snapshot of kvm_stat while
>> this is running?
> 
> Note that the userspace difference is likely all noise.
> No tmem/zcache activites should be done in userspace.  All
> the activites result from either a page fault or kswapd.

s/user/usec/...

> Since each streamed page (assuming no WasActive patch) should
> result in one hypercall and one lz01x page compression, I suspect
> that 47usec is a good estimate of the sum of those on Sasha's machine.

It's a huge number for a page.  The newer results give lower numbers,
but still quite high.


-- 
error compiling committee.c: too many arguments to function


* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 10:26       ` Sasha Levin
@ 2012-06-11 11:45         ` Avi Kivity
  2012-06-11 15:44           ` Dan Magenheimer
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 11:45 UTC (permalink / raw)
  To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On 06/11/2012 01:26 PM, Sasha Levin wrote:
>> 
>> Strange that system time is lower with cache=writeback.
> 
> Maybe because these pages don't get written out immediately? I don't
> have a better guess.

From the guest point of view, it's the same flow.  btw, this is a read,
so the difference would be readahead, not write-behind, but the
difference in system time is still unexplained.

> 
>> > And finally, KVM TMEM on, caching=none:
>> > 
>> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
>> > 	2048+0 records in
>> > 	2048+0 records out
>> > 	8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
>> > 
>> > 	real    1m59.123s
>> > 	user    0m0.020s
>> > 	sys     0m29.336s
>> > 
>> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
>> > 	2048+0 records in
>> > 	2048+0 records out
>> > 	8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
>> > 
>> > 	real    0m36.950s
>> > 	user    0m0.005s
>> > 	sys     0m35.308s
>> 
>> So system time more than doubled compared to non-tmem cache=none.  The
>> overhead per page is 17s / (8589934592/4096) = 8.1usec.  Seems quite high.
> 
> Right, but consider it didn't increase real time at all.

Real time is bounded by disk bandwidth.  It's a consideration of course,
and all forms of caching increase cpu utilization for the cache miss
case, but in this case the overhead is excessive due to the lack of
batching and due to compression overhead.

> 
>> 'perf top' while this is running would be interesting.
> 
> I'll update later with this.
> 
>> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
>> > 
>> > 	kvm statistics
>> > 
>> > 	 kvm_exit                                   1952342   36037
>> > 	 kvm_entry                                  1952334   36034
>> > 	 kvm_hypercall                              1710568   33948
>> 
>> In that test, 56k pages/sec were transferred.  Why are we seeing only
>> 33k hypercalls/sec?  Shouldn't we have two hypercalls/page (one when
>> evicting a page to make some room, one to read the new page from tmem)?
> 
> The guest doesn't do eviction at all, in fact - it doesn't know how big
> the cache is so even if it wanted to, it couldn't evict pages (the only
> thing it does is invalidate pages which have changed in the guest).

IIUC, when the guest reads a page, it first has to make room in its own
pagecache; before dropping a clean page it calls cleancache to dispose
of it, which calls a hypercall which compresses and stores it on the
host.  Next a page is allocated and a cleancache hypercall is made to
see if it is in host tmem.  So two hypercalls per page, once guest
pagecache is full.
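
In guest code terms, that flow maps onto the two cleancache callbacks, one
hypercall each (a hedged sketch: the callback signatures follow the mainline
cleancache API, but kvm_tmem_op/kvm_tmem_call() are the illustrative names
from the sketch in the cover letter and TMEM_PUT_PAGE/TMEM_GET_PAGE are
assumed command numbers, not the actual patch code):

	#include <linux/cleancache.h>
	#include <linux/mm.h>
	#include <asm/io.h>

	/* Called by mm/cleancache.c when a clean page is dropped from the
	 * guest pagecache: the first hypercall per replaced page. */
	static void kvm_cc_put_page(int pool, struct cleancache_filekey key,
				    pgoff_t index, struct page *page)
	{
		struct kvm_tmem_op op = {
			.cmd = TMEM_PUT_PAGE, .pool_id = pool,
			.object = key.u.ino, .index = index,
			.gpa = page_to_phys(page),
		};
		kvm_tmem_call(&op);	/* host compresses and stores via zcache */
	}

	/* Called when the guest reads a page that may still be in host tmem:
	 * the second hypercall per replaced page. */
	static int kvm_cc_get_page(int pool, struct cleancache_filekey key,
				   pgoff_t index, struct page *page)
	{
		struct kvm_tmem_op op = {
			.cmd = TMEM_GET_PAGE, .pool_id = pool,
			.object = key.u.ino, .index = index,
			.gpa = page_to_phys(page),
		};
		return kvm_tmem_call(&op) ? -1 : 0;	/* 0: filled from host */
	}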

> 
> This means it only takes one hypercall/page instead of two.
>> > 
>> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
>> > 
>> > Zeros:
>> > 	12800+0 records in
>> > 	12800+0 records out
>> > 	53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
>> > 
>> > 	real    11m44.536s
>> > 	user    0m0.088s
>> > 	sys     2m0.639s
>> > 	12800+0 records in
>> > 	12800+0 records out
>> > 	53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
>> > 
>> > 	real    11m30.561s
>> > 	user    0m0.088s
>> > 	sys     1m57.637s
>> 
>> zcache appears not to be helping at all; it's just adding overhead.  Is
>> even the compressed file too large?
>> 
>> overhead = 1.4 usec/page.
> 
> Correct, I've had to further increase the size of this file so that
> zcache would fail here as well. The good case was tested before, here I
> wanted to see what will happen with files that wouldn't have much
> benefit from both regular caching and zcache.

Well, zeroes is not a good test for this since it minimizes zcache
allocation overhead.

>> 
>> 
>> Overhead = 19 usec/page.
>> 
>> This is pretty steep.  We have flash storage doing a million iops/sec,
>> and here you add 19 microseconds to that.
> 
> Might be interesting to test it with flash storage as well...

Try http://sg.danny.cz/sg/sdebug26.html.  You can use it to emulate a
large fast block device without needing tons of RAM (but you can still
populate it with nonzero data).

If using qemu, try ,aio=native to minimize overhead further.

> 
>> > 
>> > 
>> > This is a snapshot of kvm_stats while this test was running:
>> > 
>> > 	kvm statistics
>> > 
>> > 	 kvm_entry                                   168179   20729
>> > 	 kvm_exit                                    168179   20728
>> > 	 kvm_hypercall                               131808   16409
>> 
>> The last test was running 19k pages/sec, doesn't quite fit with this
>> measurement.  Is the measurement stable or fluctuating?
> 
> It's pretty stable when running the "zero" pages, but when switching to
> random files it somewhat fluctuates.

Well, weird.

> 
>> > 
>> > And finally, KVM TMEM enabled, with caching=writeback:
>> 
>> I'm not sure what the point of this is?  You have two host-caching
>> mechanisms running in parallel, are you trying to increase overhead
>> while reducing effective cache size?
> 
> I thought that you've asked for this test:
> 
> On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote:
>> while cache=writeback with cleancache enabled in the host should
>> give the same effect, but with the extra hypercalls and an extra
>> copy to manage the host pagecache.  It would be good to see results for all three settings.
> 

Ah, so it's a worser worst case.  But somehow it's better than cache=none?

>> My conclusion is that the overhead is quite high, but please double
>> check my numbers, maybe I missed something obvious.
> 
> I'm not sure what options I have to lower the overhead here, should I be
> using something other than hypercalls to communicate with the host?
> 
> I know that there are several things being worked on from zcache
> perspective (WasActive, batching, etc), but is there something that
> could be done within the scope of kvm-tmem?
> 
> It would be interesting in seeing results for Xen/TMEM and comparing
> them to these results.

Batching will drastically reduce the number of hypercalls.  A different
alternative is to use ballooning to feed the guest free memory so it
doesn't need to hypercall at all.  Deciding how to divide free memory
among the guests is hard (but then so is deciding how to divide tmem
memory among guests), and adding dedup on top of that is also hard (ksm?
zksm?).  IMO letting the guest have the memory and manage it on its own
will be much simpler and faster compared to the constant chatting that
has to go on if the host manages this memory.
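
One possible shape for that batching (purely hypothetical, reusing the
illustrative kvm_tmem_op descriptor from earlier in the thread - this is not
something the series implements): queue put operations guest-side and flush
them with a single exit.  Gets would still need a synchronous path, since the
guest needs the data before the read can complete, so batching mostly helps
the put side.

	/* Hypothetical batched interface: amortize one hypercall over many puts. */
	#define KVM_TMEM_BATCH_MAX	64	/* arbitrary batch size */

	struct kvm_tmem_batch {
		u32 nr_ops;
		struct kvm_tmem_op ops[KVM_TMEM_BATCH_MAX];
	};

	static long kvm_tmem_flush_batch(struct kvm_tmem_batch *b)
	{
		/* one guest exit for up to KVM_TMEM_BATCH_MAX queued operations */
		return kvm_hypercall2(KVM_HC_TMEM, virt_to_phys(b), b->nr_ops);
	}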

-- 
error compiling committee.c: too many arguments to function


* RE: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 11:45         ` Avi Kivity
@ 2012-06-11 15:44           ` Dan Magenheimer
  2012-06-11 17:06             ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-11 15:44 UTC (permalink / raw)
  To: Avi Kivity, Sasha Levin; +Cc: mtosatti, gregkh, sjenning, Konrad Wilk, kvm

> From: Avi Kivity [mailto:avi@redhat.com]
> >
> > The guest doesn't do eviction at all, in fact - it doesn't know how big
> > the cache is so even if it wanted to, it couldn't evict pages (the only
> > thing it does is invalidate pages which have changed in the guest).
> 
> IIUC, when the guest reads a page, it first has to make room in its own
> pagecache; before dropping a clean page it calls cleancache to dispose
> of it, which calls a hypercall which compresses and stores it on the
> host.  Next a page is allocated and a cleancache hypercall is made to
> see if it is in host tmem.  So two hypercalls per page, once guest
> pagecache is full.

Yes, Avi is correct here.

> >> This is pretty steep.  We have flash storage doing a million iops/sec,
> >> and here you add 19 microseconds to that.
> >
> > Might be interesting to test it with flash storage as well...

Well, to be fair, you are comparing a device that costs many
thousands of $US to a software solution that uses idle CPU
cycles and no additional RAM.
 
> Batching will drastically reduce the number of hypercalls.

For the record, batching CAN be implemented... ramster is essentially
an implementation of batching where the local system is the "guest"
and the remote system is the "host".  But with ramster the
overhead to move the data (whether batched or not) is much MUCH
worse than a hypercall, and ramster still shows a performance advantage.

So, IMHO, one step at a time.  Get the foundation code in
place and tune it later if a batching implementation can
be demonstrated to improve performance sufficiently.

> A different
> alternative is to use ballooning to feed the guest free memory so it
> doesn't need to hypercall at all.  Deciding how to divide free memory
> among the guests is hard (but then so is deciding how to divide tmem
> memory among guests), and adding dedup on top of that is also hard (ksm?
> zksm?).  IMO letting the guest have the memory and manage it on its own
> will be much simpler and faster compared to the constant chatting that
> has to go on if the host manages this memory.

Here we disagree, maybe violently.  All existing solutions that
try to manage memory across multiple tenants from an "external
memory manager policy" fail miserably.  Tmem is at least trying
something new by actively involving both the host and the guest
in the policy (the guest decides which pages, the host decides how many)
and without the massive changes required for something like
IBM's solution (forgot what it was called).  Yes, tmem has
overhead but since the overhead only occurs where pages
would otherwise have to be read/written from disk, the
overhead is well "hidden".

BTW, dedup in zcache is fairly easy to implement because the
pages can only be read/written as an entire page and only
through a well-defined API.  Xen does it (with optional
compression), zcache could also, but it never made much sense
for zcache when there was only one tenant.  KVM of course
benefits from KSM, but IIUC KSM only works on anonymous pages.


* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 15:44           ` Dan Magenheimer
@ 2012-06-11 17:06             ` Avi Kivity
  2012-06-11 19:25               ` Sasha Levin
  2012-06-12  1:18               ` Dan Magenheimer
  0 siblings, 2 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 17:06 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
> > >> This is pretty steep.  We have flash storage doing a million iops/sec,
> > >> and here you add 19 microseconds to that.
> > >
> > > Might be interesting to test it with flash storage as well...
>
> Well, to be fair, you are comparing a device that costs many
> thousands of $US to a software solution that uses idle CPU
> cycles and no additional RAM.

You don't know that those cycles are idle.  And when in fact you have no
additional RAM, those cycles are wasted to no benefit.

The fact that I/O is being performed doesn't mean that we can waste
cpu.  Those cpu cycles can be utilized by other processes on the same
guest or by other guests.


>  
> > Batching will drastically reduce the number of hypercalls.
>
> For the record, batching CAN be implemented... ramster is essentially
> an implementation of batching where the local system is the "guest"
> and the remote system is the "host".  But with ramster the
> overhead to move the data (whether batched or not) is much MUCH
> worse than a hypercall and ramster still shows performance advantage.

Sure, you can buffer pages in memory but then you add yet another copy. 
I know you think copies are cheap but I disagree.


> So, IMHO, one step at a time.  Get the foundation code in
> place and tune it later if a batching implementation can
> be demonstrated to improve performance sufficiently.

Sorry, no, first demonstrate no performance regressions, then we can
talk about performance improvements.

> > A different
> > alternative is to use ballooning to feed the guest free memory so it
> > doesn't need to hypercall at all.  Deciding how to divide free memory
> > among the guests is hard (but then so is deciding how to divide tmem
> > memory among guests), and adding dedup on top of that is also hard (ksm?
> > zksm?).  IMO letting the guest have the memory and manage it on its own
> > will be much simpler and faster compared to the constant chatting that
> > has to go on if the host manages this memory.
>
> Here we disagree, maybe violently.  All existing solutions that
> try to do manage memory across multiple tenants from an "external
> memory manager policy" fail miserably.  Tmem is at least trying
> something new by actively involving both the host and the guest
> in the policy (guest decides which pages, host decided how many)
> and without the massive changes required for something like
> IBM's solution (forgot what it was called).  

cmm2

> Yes, tmem has
> overhead but since the overhead only occurs where pages
> would otherwise have to be read/written from disk, the
> overhead is well "hidden".

The overhead is NOT hidden.  We spent a lot of effort tuning virtio-blk to
reduce its overhead, and now you add 6-20 microseconds per page.  A
guest may easily be reading a quarter million pages per second, this
adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.

Note that you don't even have to issue I/O to get a tmem hypercall
invoked.  Allocate a ton of memory and you get cleancache calls for
each page that passes through the tail of the LRU.  Again with the upper
end, allocating a gigabyte can now take a few seconds extra.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 17:06             ` Avi Kivity
@ 2012-06-11 19:25               ` Sasha Levin
  2012-06-11 19:56                 ` Sasha Levin
  2012-06-12 10:12                 ` Avi Kivity
  2012-06-12  1:18               ` Dan Magenheimer
  1 sibling, 2 replies; 20+ messages in thread
From: Sasha Levin @ 2012-06-11 19:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote:
> Sorry, no, first demonstrate no performance regressions, then we can
> talk about performance improvements. 

No performance regressions? For caching? How would that work?

Or even if you meant just the kvm-tmem interface overhead, I don't see
how that would work.



* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 19:25               ` Sasha Levin
@ 2012-06-11 19:56                 ` Sasha Levin
  2012-06-12 11:46                   ` Avi Kivity
  2012-06-12 10:12                 ` Avi Kivity
  1 sibling, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2012-06-11 19:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On Mon, 2012-06-11 at 21:25 +0200, Sasha Levin wrote:
> On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote:
> > Sorry, no, first demonstrate no performance regressions, then we can
> > talk about performance improvements. 
> 
> No performance regressions? For caching? How would that work?
> 
> Or even if you meant just the kvm-tmem interface overhead, I don't see
> how that would work.

btw, so far we've been poking on half of the code here.

What about frontswap over kvm-tmem? are there any specific tests you'd
like to see there?



* RE: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 17:06             ` Avi Kivity
  2012-06-11 19:25               ` Sasha Levin
@ 2012-06-12  1:18               ` Dan Magenheimer
  2012-06-12 10:09                 ` Avi Kivity
  1 sibling, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-12  1:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

> From: Avi Kivity [mailto:avi@redhat.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
> 
> On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
> > > >> This is pretty steep.  We have flash storage doing a million iops/sec,
> > > >> and here you add 19 microseconds to that.
> > > >
> > > > Might be interesting to test it with flash storage as well...
> >
> > Well, to be fair, you are comparing a device that costs many
> > thousands of $US to a software solution that uses idle CPU
> > cycles and no additional RAM.
> 
> You don't know that those cycles are idle.  And when in fact you have no
> additional RAM, those cycles are wasted to no benefit.
> 
> The fact that I/O is being performed doesn't mean that we can waste
> cpu.  Those cpu cycles can be utilized by other processes on the same
> guest or by other guests.

You're right of course, so I apologize for oversimplifying... but
so are you.  Let's take a step back:

IMHO, a huge part (majority?) of computer science these days is
trying to beat Amdahl's law.  On many machines/workloads,
especially in virtual environments, RAM is the bottleneck.
Tmem's role is, when RAM is the bottleneck, to increase RAM
effective size AND, in a multi-tenant environment, flexibility
at the cost of CPU cycles.  But tmem also is designed to be very
dynamically flexible so that it either has low CPU cost when it
is not being used OR can be dynamically disabled/re-enabled with
reasonably low overhead.

Why I think you are oversimplifying:  "those cpu cycles can be
utilized by other processes on the same guest or by other
guests" pre-supposes that cpu availability is the bottleneck.
It would be interesting if it were possible to measure for how
many systems (with modern processors) this is true.
I'm not arguing that they don't exist, but I suspect they are
fairly rare these days, even for KVM systems.

> > > Batching will drastically reduce the number of hypercalls.
> >
> > For the record, batching CAN be implemented... ramster is essentially
> > an implementation of batching where the local system is the "guest"
> > and the remote system is the "host".  But with ramster the
> > overhead to move the data (whether batched or not) is much MUCH
> > worse than a hypercall and ramster still shows performance advantage.
> 
> Sure, you can buffer pages in memory but then you add yet another copy.
> I know you think copies are cheap but I disagree.

I only think copies are *relatively* cheap.  Orders of magnitude
cheaper than some alternatives.  So if it takes two page copies
or even ten to replace a disk access, yes I think copies are cheap.
(But I do understand your point.)

> > So, IMHO, one step at a time.  Get the foundation code in
> > place and tune it later if a batching implementation can
> > be demonstrated to improve performance sufficiently.
> 
> Sorry, no, first demonstrate no performance regressions, then we can
> talk about performance improvements.

Well, that's an awfully hard bar to clear - one that even the many
changes merged into the core Linux mm subsystem every release couldn't meet.
Any change to memory management will have some positive impacts on some
workloads and some negative impacts on others.

> > > A different
> > > alternative is to use ballooning to feed the guest free memory so it
> > > doesn't need to hypercall at all.  Deciding how to divide free memory
> > > among the guests is hard (but then so is deciding how to divide tmem
> > > memory among guests), and adding dedup on top of that is also hard (ksm?
> > > zksm?).  IMO letting the guest have the memory and manage it on its own
> > > will be much simpler and faster compared to the constant chatting that
> > > has to go on if the host manages this memory.
> >
> > Here we disagree, maybe violently.  All existing solutions that
> > try to do manage memory across multiple tenants from an "external
> > memory manager policy" fail miserably.  Tmem is at least trying
> > something new by actively involving both the host and the guest
> > in the policy (guest decides which pages, host decided how many)
> > and without the massive changes required for something like
> > IBM's solution (forgot what it was called).
> 
> cmm2

That's the one.  Thanks for the reminder!

> > Yes, tmem has
> > overhead but since the overhead only occurs where pages
> > would otherwise have to be read/written from disk, the
> > overhead is well "hidden".
> 
> The overhead is NOT hidden.  We spent many efforts to tune virtio-blk to
> reduce its overhead, and now you add 6-20 microseconds per page.  A
> guest may easily be reading a quarter million pages per second, this
> adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.
> 
> Note that you don't even have to issue I/O to get a tmem hypercall
> invoked.  Alllocate a ton of memory and you get cleancache calls for
> each page that passes through the tail of the LRU.  Again with the upper
> end, allocating a gigabyte can now take a few seconds extra.

Though not precisely so, we are arguing throughput vs latency here
and the two can't always be mixed.

And if, in allocating a GB of memory, you are tossing out useful
pagecache pages, and those pagecache pages can instead be preserved
by tmem thus saving N page faults and order(N) disk accesses,
your savings are false economy.  I think Sasha's numbers
demonstrate that nicely.

Anyway, as I've said all along, let's look at the numbers.
I've always admitted that tmem on an old uniprocessor should
be disabled.  If no performance degradation in that environment
is a requirement for KVM-tmem to be merged, that is certainly
your choice.  And if "more CPU cycles used" is a metric,
definitely, tmem is not going to pass because that's exactly
what it's doing: trading more CPU cycles for better RAM
efficiency == fewer disk accesses.



* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-12  1:18               ` Dan Magenheimer
@ 2012-06-12 10:09                 ` Avi Kivity
  2012-06-12 16:40                   ` Dan Magenheimer
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 10:09 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/12/2012 04:18 AM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:avi@redhat.com]
>> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>> 
>> On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
>> > > >> This is pretty steep.  We have flash storage doing a million iops/sec,
>> > > >> and here you add 19 microseconds to that.
>> > > >
>> > > > Might be interesting to test it with flash storage as well...
>> >
>> > Well, to be fair, you are comparing a device that costs many
>> > thousands of $US to a software solution that uses idle CPU
>> > cycles and no additional RAM.
>> 
>> You don't know that those cycles are idle.  And when in fact you have no
>> additional RAM, those cycles are wasted to no benefit.
>> 
>> The fact that I/O is being performed doesn't mean that we can waste
>> cpu.  Those cpu cycles can be utilized by other processes on the same
>> guest or by other guests.
> 
> You're right of course, so I apologize for oversimplifying... but
> so are you.  Let's take a step back:
> 
> IMHO, a huge part (majority?) of computer science these days is
> trying to beat Amdahl's law.  On many machines/workloads,
> especially in virtual environments, RAM is the bottleneck.
> Tmem's role is, when RAM is the bottleneck, to increase RAM
> effective size AND, in a multi-tenant environment, flexibility
> at the cost of CPU cycles.  But tmem also is designed to be very
> dynamically flexible so that it either has low CPU cost when it is
> not being used OR can be dynamically disabled/re-enabled with
> reasonably low overhead.
> 
> Why I think you are oversimplifying:  "those cpu cycles can be
> utilized by other processes on the same guest or by other
> guests" pre-supposes that cpu availability is the bottleneck.
> It would be interesting if it were possible to measure for how
> many systems (with modern processors) this is actually true.
> I'm not arguing that they don't exist but I suspect they are
> fairly rare these days, even for KVM systems.

In a given host, either cpu or memory is the bottleneck.  If you have
both free memory and free cycles, you pack more guests on that machine.
 During off-peak you may have both, but we need to see what happens
during the peak; off-peak we're doing okay.

So on such a host, during peak, either the cpu is churning away and we
can't spare those cycles for tmem, or memory is packed full of guests
and tmem won't provide much benefit (but will still consume those cycles).

> 
>> > > Batching will drastically reduce the number of hypercalls.
>> >
>> > For the record, batching CAN be implemented... ramster is essentially
>> > an implementation of batching where the local system is the "guest"
>> > and the remote system is the "host".  But with ramster the
>> > overhead to move the data (whether batched or not) is much MUCH
>> > worse than a hypercall and ramster still shows performance advantage.
>> 
>> Sure, you can buffer pages in memory but then you add yet another copy.
>> I know you think copies are cheap but I disagree.
> 
> I only think copies are *relatively* cheap.  Orders of magnitude
> cheaper than some alternatives.  So if it takes two page copies
> or even ten to replace a disk access, yes I think copies are cheap.
> (But I do understand your point.)

The copies are cheaper than a disk access, yes, but you need to factor
in the probability of a disk access being saved.  cleancache already
works on the tail end of the LRU; we're dumping those pages because they
have low access frequency, so the probability starts out low.  If many
guests are active (so we need the cpu resources), then they also compete
for tmem resources, and per-guest it becomes less effective as well.
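
Spelled out per page that falls off the tail of the LRU (a rough model,
not a measurement):

    cpu spent    = t_put + p * t_get      (t_put is paid unconditionally)
    I/O avoided  = p * t_disk

where p is the probability that the page is re-read while the host still
holds it.  The cpu side is certain; only the I/O side depends on p.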

> 
>> > So, IMHO, one step at a time.  Get the foundation code in
>> > place and tune it later if a batching implementation can
>> > be demonstrated to improve performance sufficiently.
>> 
>> Sorry, no, first demonstrate no performance regressions, then we can
>> talk about performance improvements.
> 
> > Well, that's an awfully hard bar to clear; even many of the changes
> > merged into the core Linux mm subsystem every release couldn't meet it.
> Any change to memory management will have some positive impacts on some
> workloads and some negative impacts on others.

Right, that's too harsh.  But these benchmarks show a doubling (or even
more) of cpu overhead, and that holds whether the cache is effective or
not.  That is simply way too much to consider.

Look at the block, vfs, and mm layers.  Huge pains have been taken to
batch everything and avoid per-page work -- 20 years of not having
enough cycles.  And here you throw all this out of the window with
per-page crossing of the guest/host boundary.
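
Concretely, that per-page crossing looks roughly like this on the guest
side (a sketch only; the hypercall number, the argument struct and the
helper names are placeholders, not the actual patch code):

#include <linux/types.h>
#include <linux/cleancache.h>
#include <linux/kvm_para.h>
#include <asm/io.h>

#define KVM_HC_TMEM		9	/* placeholder hypercall number */
#define TMEM_OP_PUT_PAGE	1	/* placeholder op code */

struct tmem_hc_args {
	u32 op;
	u32 pool_id;
	u64 object_id;		/* derived from the cleancache file key */
	u64 index;
	u64 page_gpa;		/* guest-physical address of the page */
};

/* Called for every page falling off the tail of the guest's LRU. */
static void kvm_cleancache_put_page(int pool_id,
				    struct cleancache_filekey key,
				    pgoff_t index, struct page *page)
{
	struct tmem_hc_args args = {
		.op		= TMEM_OP_PUT_PAGE,
		.pool_id	= pool_id,
		.object_id	= key.u.ino,
		.index		= index,
		.page_gpa	= page_to_phys(page),
	};

	/* One vmexit per page; the host then copies the page out. */
	kvm_hypercall1(KVM_HC_TMEM, virt_to_phys(&args));
}

Every one of those calls is a guest/host transition plus at least one
page copy on the host side.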

> 
>> > Yes, tmem has
>> > overhead but since the overhead only occurs where pages
>> > would otherwise have to be read/written from disk, the
>> > overhead is well "hidden".
>> 
>> The overhead is NOT hidden.  We have spent a lot of effort tuning
>> virtio-blk to reduce its overhead, and now you add 6-20 microseconds
>> per page.  A guest may easily be reading a quarter million pages per
>> second; this adds up very fast - at the upper end you're consuming 5
>> vcpus just for tmem.
>> 
>> Note that you don't even have to issue I/O to get a tmem hypercall
>> invoked.  Allocate a ton of memory and you get cleancache calls for
>> each page that passes through the tail of the LRU.  Again with the upper
>> end, allocating a gigabyte can now take a few seconds extra.
> 
> Though not precisely so, we are arguing throughput vs latency here
> and the two can't always be mixed.
> 
> And if, in allocating a GB of memory, you are tossing out useful
> pagecache pages, and those pagecache pages can instead be preserved
> by tmem thus saving N page faults and order(N) disk accesses,
> your savings are a false economy.  I think Sasha's numbers
> demonstrate that nicely.


It depends.  If you have an 8GB guest, then saving the tail end of an
8GB LRU may improve your caching or it may not.  But the impact on that
allocation is certain.  You're trading off possible marginal improvement
for unconditional performance degradation.
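
Using the upper-end figures quoted above, the certain side of that trade
works out to roughly

    250,000 pages/s * 20 us/page = 5 cpu-seconds per second ~= 5 vcpus

while the uncertain side depends entirely on how many of those pages are
ever read back from the host.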

> 
> Anyway, as I've said all along, let's look at the numbers.
> I've always admitted that tmem on an old uniprocessor should
> be disabled.  If no performance degradation in that environment
> is a requirement for KVM-tmem to be merged, that is certainly
> your choice.  And if "more CPU cycles used" is a metric,
> definitely, tmem is not going to pass because that's exactly
> what it's doing: trading more CPU cycles for better RAM
> efficiency == fewer disk accesses.

Again, the cpu cycles spent are certain, and double the effort needed to
get those pages in the first place.  Disk accesses saved will depend on
the workload, and on host memory availability.  Turning tmem on will
certainly generate performance regressions as well as improvements.
Maybe on Xen the tradeoff is different (hypercalls ought to be faster on
xenpv), but the numbers I saw on kvm aren't good.


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 19:25               ` Sasha Levin
  2012-06-11 19:56                 ` Sasha Levin
@ 2012-06-12 10:12                 ` Avi Kivity
  1 sibling, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 10:12 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/11/2012 10:25 PM, Sasha Levin wrote:
> On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote:
>> Sorry, no, first demonstrate no performance regressions, then we can
>> talk about performance improvements. 
> 
> No performance regressions? For caching? How would that work?

A small degradation might be acceptable.  2X cpu consumption is not.

IMO "using host memory" is the problem, because it involves copies and
hypercalls.  Try giving the memory to the guest, either through the
balloon or through a pci device that exposes memory that can be
withdrawn.  That will make everything *much* faster.

> 
> Or even if you meant just the kvm-tmem interface overhead, I don't see
> how that would work.
> 

I meant the overall overhead, as seen by users.

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-11 19:56                 ` Sasha Levin
@ 2012-06-12 11:46                   ` Avi Kivity
  2012-06-12 11:58                     ` Gleb Natapov
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 11:46 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/11/2012 10:56 PM, Sasha Levin wrote:
> 
> btw, so far we've been poking on half of the code here.
> 
> What about frontswap over kvm-tmem? are there any specific tests you'd
> like to see there?

hmm.  On one hand, no one swaps these days so there aren't any good
benchmarks for it.  On the other hand, with swapping, at least we're
guaranteed the page will be read in the future (unlike cache, where it's
quite possible it won't be).  I don't know.


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-12 11:46                   ` Avi Kivity
@ 2012-06-12 11:58                     ` Gleb Natapov
  2012-06-12 12:01                       ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Gleb Natapov @ 2012-06-12 11:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Sasha Levin, Dan Magenheimer, mtosatti, gregkh, sjenning,
	Konrad Wilk, kvm

On Tue, Jun 12, 2012 at 02:46:38PM +0300, Avi Kivity wrote:
> On 06/11/2012 10:56 PM, Sasha Levin wrote:
> > 
> > btw, so far we've been poking on half of the code here.
> > 
> > What about frontswap over kvm-tmem? are there any specific tests you'd
> > like to see there?
> 
> hmm.  On one hand, no one swaps these days so there aren't any good
> benchmarks for it.  On the other hand, with swapping, at least we're
> guaranteed the page will be read in the future (unlike cache, where it's
> quite possible it won't be).  I don't know.
> 
> 
A swapped page can be discarded without being read, too.

--
			Gleb.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-12 11:58                     ` Gleb Natapov
@ 2012-06-12 12:01                       ` Avi Kivity
  0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 12:01 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Sasha Levin, Dan Magenheimer, mtosatti, gregkh, sjenning,
	Konrad Wilk, kvm

On 06/12/2012 02:58 PM, Gleb Natapov wrote:
> On Tue, Jun 12, 2012 at 02:46:38PM +0300, Avi Kivity wrote:
>> On 06/11/2012 10:56 PM, Sasha Levin wrote:
>> > 
>> > btw, so far we've been poking on half of the code here.
>> > 
>> > What about frontswap over kvm-tmem? are there any specific tests you'd
>> > like to see there?
>> 
>> hmm.  On one hand, no one swaps these days so there aren't any good
>> benchmarks for it.  On the other hand, with swapping, at least we're
>> guaranteed the page will be read in the future (unlike cache, where it's
>> quite possible it won't be).  I don't know.
>> 
>> 
> A swapped page can be discarded without being read, too.

Right.

The effects of frontswap can be achieved by swapping to a block device
that sets cache=writeback, more or less (esp. with trim support, you can
discard pages that you won't be needing again before they hit disk).
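
For example, something along these lines (a sketch; whether the guest's
discards actually reach the backing image depends on the QEMU version
and the device model used for the swap disk):

  # host: back the guest's swap disk with a writeback-cached image
  qemu-system-x86_64 ... \
      -device virtio-scsi-pci \
      -drive file=swap.img,if=none,id=swap0,cache=writeback,discard=unmap \
      -device scsi-hd,drive=swap0

  # guest: use it as swap, discarding freed slots
  mkswap /dev/sdb          # device name depends on the guest's layout
  swapon --discard /dev/sdb

Writes then land in the host page cache, which is roughly the effect a
frontswap put would have, and trimmed slots never need to hit the disk.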


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-12 10:09                 ` Avi Kivity
@ 2012-06-12 16:40                   ` Dan Magenheimer
  2012-06-12 17:54                     ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-12 16:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

> From: Avi Kivity [mailto:avi@redhat.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support

I started off with a point-by-point comment on most of your
responses about the tradeoffs of how tmem works, but decided
it best to simply say we disagree and kvm-tmem will need to
prove who is right.

> >> Sorry, no, first demonstrate no performance regressions, then we can
> >> talk about performance improvements.
> >
> > > Well, that's an awfully hard bar to clear; even many of the changes
> > > merged into the core Linux mm subsystem every release couldn't meet it.
> > Any change to memory management will have some positive impacts on some
> > workloads and some negative impacts on others.
> 
> Right, that's too harsh.  But these benchmarks show a doubling (or even
> more) of cpu overhead, and that holds whether the cache is effective or
> not.  That is simply way too much to consider.

One point here... remember you have contrived a worst case
scenario.  The one case Sasha provided outside of that contrived
worst case, as you commented, looks very nice.  So the costs/benefits
remain to be seen over a wider set of workloads.

Also, even that contrived case should look quite a bit better
with WasActive properly implemented.

> Look at the block, vfs, and mm layers.  Huge pains have been taken to
> batch everything and avoid per-page work -- 20 years of not having
> enough cycles.  And here you throw all this out of the window with
> per-page crossing of the guest/host boundary.

Well, to be fair, those 20 years of effort were because
(1) disk seeks are a million times slower than an in-RAM page
copy and (2) SMP systems were rare and expensive.  The
world changes...


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-12 16:40                   ` Dan Magenheimer
@ 2012-06-12 17:54                     ` Avi Kivity
  0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 17:54 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/12/2012 07:40 PM, Dan Magenheimer wrote:
> > From: Avi Kivity [mailto:avi@redhat.com]
> > Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> I started off with a point-by-point comment on most of your
> responses about the tradeoffs of how tmem works, but decided
> it best to simply say we disagree and kvm-tmem will need to
> prove who is right.

That is why I am asking for benchmarks.

> > >> Sorry, no, first demonstrate no performance regressions, then we can
> > >> talk about performance improvements.
> > >
> > > Well, that's an awfully hard bar to clear; even many of the changes
> > > merged into the core Linux mm subsystem every release couldn't meet it.
> > > Any change to memory management will have some positive impacts on some
> > > workloads and some negative impacts on others.
> > 
> > Right, that's too harsh.  But these benchmarks show a doubling (or even
> > more) of cpu overhead, and that holds whether the cache is effective or
> > not.  That is simply way too much to consider.
>
> One point here... remember you have contrived a worst case
> scenario.  The one case Sasha provided outside of that contrived
> worst case, as you commented, looks very nice.  So the costs/benefits
> remain to be seen over a wider set of workloads.

While the workload is contrived, decreasing benefits with increasing
cache size is nothing new.  And here tmem is increasing the cost of all
caching, without guaranteeing any return.

> Also, even that contrived case should look quite a bit better
> with WasActive properly implemented.

I'll be happy to see benchmarks of improved code.

> > Look at the block, vfs, and mm layers.  Huge pains have been taken to
> > batch everything and avoid per-page work -- 20 years of not having
> > enough cycles.  And here you throw all this out of the window with
> > per-page crossing of the guest/host boundary.
>
> Well, to be fair, those 20 years of effort were because
> (1) disk seeks are a million times slower than an in-RAM page
> copy and (2) SMP systems were rare and expensive.  The
> world changes...

I don't see how smp matters here.  You have more cores, you put more
work on them, you don't expect the OS or hypervisor to consume them for
you.  In any case you're consuming this cpu on the same core as the
guest, so you're reducing throughput (if caching is ineffective).

Disks are still slow, even fast flash arrays, but tmem is not the only
solution to that problem.  You say ballooning has not proven itself in
this area but that doesn't mean it has been proven not to work; and it
doesn't suffer from the inefficiency of crossing the guest/host boundary.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-06-12 17:55 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-06 13:07 [RFC 00/10] KVM: Add TMEM host/guest support Sasha Levin
2012-06-06 13:24 ` Avi Kivity
2012-06-08 13:20   ` Sasha Levin
2012-06-08 16:06     ` Dan Magenheimer
2012-06-11 11:17       ` Avi Kivity
2012-06-11  8:09     ` Avi Kivity
2012-06-11 10:26       ` Sasha Levin
2012-06-11 11:45         ` Avi Kivity
2012-06-11 15:44           ` Dan Magenheimer
2012-06-11 17:06             ` Avi Kivity
2012-06-11 19:25               ` Sasha Levin
2012-06-11 19:56                 ` Sasha Levin
2012-06-12 11:46                   ` Avi Kivity
2012-06-12 11:58                     ` Gleb Natapov
2012-06-12 12:01                       ` Avi Kivity
2012-06-12 10:12                 ` Avi Kivity
2012-06-12  1:18               ` Dan Magenheimer
2012-06-12 10:09                 ` Avi Kivity
2012-06-12 16:40                   ` Dan Magenheimer
2012-06-12 17:54                     ` Avi Kivity
