* [RFC 00/10] KVM: Add TMEM host/guest support
@ 2012-06-06 13:07 Sasha Levin
2012-06-06 13:24 ` Avi Kivity
0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2012-06-06 13:07 UTC (permalink / raw)
To: avi, mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk
Cc: kvm, Sasha Levin
This patch series adds support for passing TMEM commands between KVM guests
and the host. This opens the possibility of using TMEM across guests, and
possibly across hosts with RAMster.
Since frontswap was merged in the 3.4 cycle, the kernel now has all facilities
required to work with TMEM. There is no longer a dependency on out of tree
code.
We can split this patch series into two:
- The guest side, which is basically two shims that proxy mm/cleancache.c
and mm/frontswap.c requests from the guest back to the host. This is done
using a new KVM_HC_TMEM hypercall.
- The host side, which is a rather small shim which connects KVM to zcache.
It's worth noting that this patch series doesn't contain any significant logic;
it is mostly a collection of shims that pass TMEM commands across hypercalls.
I ran benchmarks using both the "streaming test" proposed by Avi, and some
general fio tests. Since the fio tests showed similar results to the
streaming test, and no anomalies, here is the summary of the streaming tests:
First, trying to stream a 26GB random file without KVM TMEM:
real 7m36.046s
user 0m17.113s
sys 5m23.809s
And with KVM TMEM:
real 7m36.018s
user 0m17.124s
sys 5m28.391s
- No significant difference.
Now, trying to stream a 16GB file that compresses nicely, first without KVM TMEM:
real 5m10.299s
user 0m11.311s
sys 3m40.139s
And a second run without dropping cache:
real 4m33.951s
user 0m10.869s
sys 3m13.789s
Now, with KVM TMEM:
real 4m55.528s
user 0m11.119s
sys 3m33.243s
And a second run:
real 2m53.713s
user 0m7.971s
sys 2m29.807s
So KVM TMEM shows a nice performance increase once it can store pages on the host.
Sasha Levin (10):
KVM: reintroduce hc_gpa
KVM: wire up the TMEM HC
zcache: export zcache interface
KVM: add KVM TMEM entries in the appropriate config menu entry
KVM: bring in general tmem definitions
zcache: move out client declaration and add a KVM client
KVM: add KVM TMEM host side interface
KVM: add KVM TMEM guest support
KVM: support guest side cleancache
KVM: support guest side frontswap
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/Makefile | 2 +
arch/x86/kvm/tmem/Kconfig | 43 +++++++++++
arch/x86/kvm/tmem/Makefile | 6 ++
arch/x86/kvm/tmem/cleancache.c | 120 +++++++++++++++++++++++++++++
arch/x86/kvm/tmem/frontswap.c | 139 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/tmem/guest.c | 95 +++++++++++++++++++++++
arch/x86/kvm/tmem/guest.h | 11 +++
arch/x86/kvm/tmem/host.c | 78 +++++++++++++++++++
arch/x86/kvm/tmem/host.h | 20 +++++
arch/x86/kvm/tmem/tmem.h | 62 +++++++++++++++
arch/x86/kvm/x86.c | 13 +++
drivers/staging/zcache/zcache-main.c | 48 ++++++++++--
drivers/staging/zcache/zcache.h | 20 +++++
include/linux/kvm_para.h | 1 +
15 files changed, 652 insertions(+), 7 deletions(-)
create mode 100644 arch/x86/kvm/tmem/Kconfig
create mode 100644 arch/x86/kvm/tmem/Makefile
create mode 100644 arch/x86/kvm/tmem/cleancache.c
create mode 100644 arch/x86/kvm/tmem/frontswap.c
create mode 100644 arch/x86/kvm/tmem/guest.c
create mode 100644 arch/x86/kvm/tmem/guest.h
create mode 100644 arch/x86/kvm/tmem/host.c
create mode 100644 arch/x86/kvm/tmem/host.h
create mode 100644 arch/x86/kvm/tmem/tmem.h
create mode 100644 drivers/staging/zcache/zcache.h
--
1.7.8.6
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-06 13:07 [RFC 00/10] KVM: Add TMEM host/guest support Sasha Levin
@ 2012-06-06 13:24 ` Avi Kivity
2012-06-08 13:20 ` Sasha Levin
0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-06 13:24 UTC (permalink / raw)
To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm
On 06/06/2012 04:07 PM, Sasha Levin wrote:
> This patch series adds support for passing TMEM commands between KVM guests
> and the host. This opens the possibility of using TMEM across guests, and
> possibly across hosts with RAMster.
>
> Since frontswap was merged in the 3.4 cycle, the kernel now has all facilities
> required to work with TMEM. There is no longer a dependency on out of tree
> code.
>
> We can split this patch series into two:
>
> - The guest side, which is basically two shims that proxy mm/cleancache.c
> and mm/frontswap.c requests from the guest back to the host. This is done
> using a new KVM_HC_TMEM hypercall.
>
> - The host side, which is a rather small shim which connects KVM to zcache.
>
>
> It's worth noting that this patch series doesn't contain any significant logic;
> it is mostly a collection of shims that pass TMEM commands across hypercalls.
>
> I ran benchmarks using both the "streaming test" proposed by Avi, and some
> general fio tests. Since the fio tests showed similar results to the
> streaming test, and no anomalies, here is the summary of the streaming tests:
>
> First, trying to stream a 26GB random file without KVM TMEM:
> real 7m36.046s
> user 0m17.113s
> sys 5m23.809s
>
> And with KVM TMEM:
> real 7m36.018s
> user 0m17.124s
> sys 5m28.391s
These results give about 47 usec per page system time (seems quite
high), whereas the difference is 0.7 user per page (seems quite low, for
1 or 2 syscalls per page). Can you post a snapshot of kvm_stat while
this is running?
>
> - No significant difference.
>
> Now, trying to stream a 16GB file that compresses nicely, first without KVM TMEM:
> real 5m10.299s
> user 0m11.311s
> sys 3m40.139s
>
> And a second run without dropping cache:
> real 4m33.951s
> user 0m10.869s
> sys 3m13.789s
>
> Now, with KVM TMEM:
> real 4m55.528s
> user 0m11.119s
> sys 3m33.243s
How is the first run faster? Is it not doing extra work, pushing pages
to the host?
>
> And a second run:
> real 2m53.713s
> user 0m7.971s
> sys 2m29.807s
A nice result, yes.
>
> So KVM TMEM shows a nice performance increase once it can store pages on the host.
How was caching set up? cache=none (in qemu terms) is most
representative, but cache=writeback also allows the host to cache guest
pages, while cache=writeback with cleancache enabled in the host should
give the same effect, but with the extra hypercalls and an extra
copy to manage the host pagecache. It would be good to see results for
all three settings.
--
error compiling committee.c: too many arguments to function
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-06 13:24 ` Avi Kivity
@ 2012-06-08 13:20 ` Sasha Levin
2012-06-08 16:06 ` Dan Magenheimer
2012-06-11 8:09 ` Avi Kivity
0 siblings, 2 replies; 20+ messages in thread
From: Sasha Levin @ 2012-06-08 13:20 UTC (permalink / raw)
To: Avi Kivity; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm
I re-ran the benchmarks in a single-user environment to get more stable results, increasing the test files to 50GB each.
First, a test of the good-case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host's RAM:
First, no KVM TMEM, caching=none:
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
real 1m56.349s
user 0m0.015s
sys 0m15.671s
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
real 1m56.255s
user 0m0.018s
sys 0m15.504s
Now, no KVM TMEM, caching=writeback:
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
real 2m2.965s
user 0m0.015s
sys 0m11.025s
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
real 1m50.968s
user 0m0.011s
sys 0m10.108s
And finally, KVM TMEM on, caching=none:
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
real 1m59.123s
user 0m0.020s
sys 0m29.336s
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
real 0m36.950s
user 0m0.005s
sys 0m35.308s
This is a snapshot of kvm_stat while the 2nd run in the KVM TMEM test was going:
kvm statistics
kvm_exit 1952342 36037
kvm_entry 1952334 36034
kvm_hypercall 1710568 33948
kvm_apic 109027 1319
kvm_emulate_insn 63745 673
kvm_mmio 63483 669
kvm_inj_virq 45899 654
kvm_apic_accept_irq 45809 654
kvm_pio 18445 52
kvm_set_irq 19102 50
kvm_msi_set_irq 17809 47
kvm_fpu 244 18
kvm_apic_ipi 368 8
kvm_cr 70 6
kvm_userspace_exit 897 5
kvm_cpuid 48 5
vcpu_match_mmio 257 3
kvm_pic_set_irq 1293 3
kvm_ioapic_set_irq 1293 3
kvm_ack_irq 84 1
kvm_page_fault 60538 0
Now, for the worst-case "streaming test", I've tried streaming two files: one which has good compression (zeros), and one full of random bits, doing two runs of each.
First, the baseline - no KVM TMEM, caching=none:
Zero file:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
real 11m43.583s
user 0m0.106s
sys 1m42.075s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
real 11m31.284s
user 0m0.100s
sys 1m41.235s
Random file:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
real 10m55.847s
user 0m0.107s
sys 1m39.852s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
real 10m52.739s
user 0m0.120s
sys 1m39.712s
Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
Zeros:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
real 11m44.536s
user 0m0.088s
sys 2m0.639s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
real 11m30.561s
user 0m0.088s
sys 1m57.637s
Random:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
real 10m56.480s
user 0m0.034s
sys 3m18.750s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
real 10m58.499s
user 0m0.046s
sys 3m23.678s
Next, with KVM TMEM enabled, caching=none:
Zeros:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
real 11m51.916s
user 0m0.081s
sys 2m59.952s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
real 11m31.102s
user 0m0.082s
sys 3m6.500s
Random:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
real 10m56.445s
user 0m0.062s
sys 5m53.236s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
real 10m53.404s
user 0m0.066s
sys 5m57.087s
This is a snapshot of kvm_stat while this test was running:
kvm statistics
kvm_entry 168179 20729
kvm_exit 168179 20728
kvm_hypercall 131808 16409
kvm_apic 17305 2006
kvm_mmio 10877 1259
kvm_emulate_insn 10974 1258
kvm_page_fault 6270 866
kvm_inj_virq 6532 751
kvm_apic_accept_irq 6516 751
kvm_set_irq 4888 536
kvm_msi_set_irq 4471 536
kvm_pio 4714 529
kvm_userspace_exit 300 2
vcpu_match_mmio 83 2
kvm_apic_ipi 69 2
kvm_pic_set_irq 417 0
kvm_ioapic_set_irq 417 0
kvm_fpu 76 0
kvm_ack_irq 27 0
kvm_cr 24 0
kvm_cpuid 16 0
And finally, KVM TMEM enabled, with caching=writeback:
Zeros:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 710.62 s, 75.5 MB/s
real 11m50.698s
user 0m0.078s
sys 3m29.920s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 686.286 s, 78.2 MB/s
real 11m26.321s
user 0m0.088s
sys 3m25.931s
Random:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 673.831 s, 78.4 MB/s
real 11m13.883s
user 0m0.047s
sys 4m5.569s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 673.594 s, 78.4 MB/s
real 11m13.619s
user 0m0.056s
sys 4m12.134s
* RE: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-08 13:20 ` Sasha Levin
@ 2012-06-08 16:06 ` Dan Magenheimer
2012-06-11 11:17 ` Avi Kivity
2012-06-11 8:09 ` Avi Kivity
1 sibling, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-08 16:06 UTC (permalink / raw)
To: Sasha Levin, Avi Kivity; +Cc: mtosatti, gregkh, sjenning, Konrad Wilk, kvm
> From: Sasha Levin [mailto:levinsasha928@gmail.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> I re-ran benchmarks in a single user environment to get more stable results, increasing the test files
> to 50gb each.
Nice results, Sasha!
The non-increase in real time and the significant increase in sys time
demonstrate that tmem should have little or no impact as
long as there are sufficient unused CPU cycles. Since
tmem is most active on I/O-bound workloads, where there tends
to be lots of idle CPU time, tmem is usually "free". But
if KVM perfectly load-balances across the sum of all guests
so that there is little or no idle CPU time (rare but
possible), there will be a measurable impact.
For a true worst-case analysis, try running with cpus=1. One
can argue that anyone who runs KVM on a single-CPU system
deserves what they get ;-). But the "WasActive" patch[1]
(if adapted slightly for the KVM TMEM patches) should eliminate
the negative impact on system time of streaming workloads even
with cpus=1.
> From: Avi Kivity [mailto:avi@redhat.com]
> <this comment was on Sasha's first round of benchmarking>
> These results give about 47 usec per page system time (seems quite
> high), whereas the difference is 0.7 user per page (seems quite low, for
> 1 or 2 syscalls per page). Can you post a snapshot of kvm_stat while
> this is running?
Note that the userspace difference is likely all noise.
No tmem/zcache activities should be done in userspace. All
the activities result from either a page fault or kswapd.
Since each streamed page (assuming no WasActive patch) should
result in one hypercall and one lzo1x page compression, I suspect
that 47 usec is a good estimate of the sum of those on Sasha's machine.
[1] https://lkml.org/lkml/2012/1/25/300
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-08 13:20 ` Sasha Levin
2012-06-08 16:06 ` Dan Magenheimer
@ 2012-06-11 8:09 ` Avi Kivity
2012-06-11 10:26 ` Sasha Levin
1 sibling, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 8:09 UTC (permalink / raw)
To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm
On 06/08/2012 04:20 PM, Sasha Levin wrote:
> I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.
>
> First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:
>
> First, no KVM TMEM, caching=none:
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
>
> real 1m56.349s
> user 0m0.015s
> sys 0m15.671s
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
>
> real 1m56.255s
> user 0m0.018s
> sys 0m15.504s
>
> Now, no KVM TMEM, caching=writeback:
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
>
> real 2m2.965s
> user 0m0.015s
> sys 0m11.025s
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
>
> real 1m50.968s
> user 0m0.011s
> sys 0m10.108s
Strange that system time is lower with cache=writeback.
>
> And finally, KVM TMEM on, caching=none:
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
>
> real 1m59.123s
> user 0m0.020s
> sys 0m29.336s
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
>
> real 0m36.950s
> user 0m0.005s
> sys 0m35.308s
So system time more than doubled compared to non-tmem cache=none. The
overhead per page is 17s / (8589934592/4096) = 8.1usec. Seems quite high.
'perf top' while this is running would be interesting.
>
> This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
>
> kvm statistics
>
> kvm_exit 1952342 36037
> kvm_entry 1952334 36034
> kvm_hypercall 1710568 33948
In that test, 56k pages/sec were transferred. Why are we seeing only
33k hypercalls/sec? Shouldn't we have two hypercalls/page (one when
evicting a page to make some room, one to read the new page from tmem)?
>
>
> Now, for the worst-case "streaming test", I've tried streaming two files: one which has good compression (zeros), and one full of random bits, doing two runs of each.
>
> First, the baseline - no KVM TMEM, caching=none:
>
> Zero file:
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
>
> real 11m43.583s
> user 0m0.106s
> sys 1m42.075s
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
>
> real 11m31.284s
> user 0m0.100s
> sys 1m41.235s
>
> Random file:
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
>
> real 10m55.847s
> user 0m0.107s
> sys 1m39.852s
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
>
> real 10m52.739s
> user 0m0.120s
> sys 1m39.712s
>
> Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
>
> Zeros:
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
>
> real 11m44.536s
> user 0m0.088s
> sys 2m0.639s
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
>
> real 11m30.561s
> user 0m0.088s
> sys 1m57.637s
zcache appears not to be helping at all; it's just adding overhead. Is
the file too large even after compression?
overhead = 1.4 usec/page.
>
> Random:
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
>
> real 10m56.480s
> user 0m0.034s
> sys 3m18.750s
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
>
> real 10m58.499s
> user 0m0.046s
> sys 3m23.678s
Overhead grows to 7.6 usec/page.
>
> Next, with KVM TMEM enabled, caching=none:
>
> Zeros:
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
>
> real 11m51.916s
> user 0m0.081s
> sys 2m59.952s
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
>
> real 11m31.102s
> user 0m0.082s
> sys 3m6.500s
Overhead = 6.6 usec/page.
>
> Random:
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
>
> real 10m56.445s
> user 0m0.062s
> sys 5m53.236s
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
>
> real 10m53.404s
> user 0m0.066s
> sys 5m57.087s
Overhead = 19 usec/page.
This is pretty steep. We have flash storage doing a million iops/sec,
and here you add 19 microseconds to that.
>
>
> This is a snapshot of kvm_stats while this test was running:
>
> kvm statistics
>
> kvm_entry 168179 20729
> kvm_exit 168179 20728
> kvm_hypercall 131808 16409
The last test was running 19k pages/sec, doesn't quite fit with this
measurement. Is the measurement stable or fluctuating?
>
> And finally, KVM TMEM enabled, with caching=writeback:
I'm not sure what the point of this is. You have two host-caching
mechanisms running in parallel - are you trying to increase overhead
while reducing the effective cache size?
My conclusion is that the overhead is quite high, but please double
check my numbers, maybe I missed something obvious.
--
error compiling committee.c: too many arguments to function
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 8:09 ` Avi Kivity
@ 2012-06-11 10:26 ` Sasha Levin
2012-06-11 11:45 ` Avi Kivity
0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2012-06-11 10:26 UTC (permalink / raw)
To: Avi Kivity; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm
On Mon, 2012-06-11 at 11:09 +0300, Avi Kivity wrote:
> On 06/08/2012 04:20 PM, Sasha Levin wrote:
> > I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.
> >
> > First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:
> >
> > First, no KVM TMEM, caching=none:
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
> >
> > real 1m56.349s
> > user 0m0.015s
> > sys 0m15.671s
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
> >
> > real 1m56.255s
> > user 0m0.018s
> > sys 0m15.504s
> >
> > Now, no KVM TMEM, caching=writeback:
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
> >
> > real 2m2.965s
> > user 0m0.015s
> > sys 0m11.025s
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
> >
> > real 1m50.968s
> > user 0m0.011s
> > sys 0m10.108s
>
> Strange that system time is lower with cache=writeback.
Maybe because these pages don't get written out immediately? I don't
have a better guess.
> > And finally, KVM TMEM on, caching=none:
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
> >
> > real 1m59.123s
> > user 0m0.020s
> > sys 0m29.336s
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
> >
> > real 0m36.950s
> > user 0m0.005s
> > sys 0m35.308s
>
> So system time more than doubled compared to non-tmem cache=none. The
> overhead per page is 17s / (8589934592/4096) = 8.1usec. Seems quite high.
Right, but consider it didn't increase real time at all.
> 'perf top' while this is running would be interesting.
I'll update later with this.
> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
> >
> > kvm statistics
> >
> > kvm_exit 1952342 36037
> > kvm_entry 1952334 36034
> > kvm_hypercall 1710568 33948
>
> In that test, 56k pages/sec were transferred. Why are we seeing only
> 33k hypercalls/sec? Shouldn't we have two hypercalls/page (one when
> evicting a page to make some room, one to read the new page from tmem)?
The guest doesn't do eviction at all. In fact, it doesn't know how big
the cache is, so even if it wanted to, it couldn't evict pages (the only
thing it does is invalidate pages which have changed in the guest).
This means it only takes one hypercall/page instead of two.
> >
> >
> > Now, for the worst-case "streaming test", I've tried streaming two files: one which has good compression (zeros), and one full of random bits, doing two runs of each.
> >
> > First, the baseline - no KVM TMEM, caching=none:
> >
> > Zero file:
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
> >
> > real 11m43.583s
> > user 0m0.106s
> > sys 1m42.075s
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
> >
> > real 11m31.284s
> > user 0m0.100s
> > sys 1m41.235s
> >
> > Random file:
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
> >
> > real 10m55.847s
> > user 0m0.107s
> > sys 1m39.852s
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
> >
> > real 10m52.739s
> > user 0m0.120s
> > sys 1m39.712s
> >
> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
> >
> > Zeros:
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
> >
> > real 11m44.536s
> > user 0m0.088s
> > sys 2m0.639s
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
> >
> > real 11m30.561s
> > user 0m0.088s
> > sys 1m57.637s
>
> zcache appears not to be helping at all; it's just adding overhead. Is
> even the compressed file too large?
>
> overhead = 1.4 usec/page.
Correct - I had to further increase the size of this file so that
zcache would fail here as well. The good case was tested before; here I
wanted to see what happens with files that don't benefit much
from either regular caching or zcache.
> >
> > Random:
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
> >
> > real 10m56.480s
> > user 0m0.034s
> > sys 3m18.750s
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
> >
> > real 10m58.499s
> > user 0m0.046s
> > sys 3m23.678s
>
> Overhead grows to 7.6 usec/page.
>
> >
> > Next, with KVM TMEM enabled, caching=none:
> >
> > Zeros:
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
> >
> > real 11m51.916s
> > user 0m0.081s
> > sys 2m59.952s
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
> >
> > real 11m31.102s
> > user 0m0.082s
> > sys 3m6.500s
>
> Overhead = 6.6 usec/page.
>
> >
> > Random:
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
> >
> > real 10m56.445s
> > user 0m0.062s
> > sys 5m53.236s
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
> >
> > real 10m53.404s
> > user 0m0.066s
> > sys 5m57.087s
>
>
> Overhead = 19 usec/page.
>
> This is pretty steep. We have flash storage doing a million iops/sec,
> and here you add 19 microseconds to that.
Might be interesting to test it with flash storage as well...
> >
> >
> > This is a snapshot of kvm_stats while this test was running:
> >
> > kvm statistics
> >
> > kvm_entry 168179 20729
> > kvm_exit 168179 20728
> > kvm_hypercall 131808 16409
>
> The last test was running 19k pages/sec, doesn't quite fit with this
> measurement. Is the measurement stable or fluctuating?
It's pretty stable when running the "zero" pages, but when switching to
the random file it fluctuates somewhat.
> >
> > And finally, KVM TMEM enabled, with caching=writeback:
>
> I'm not sure what the point of this is? You have two host-caching
> mechanisms running in parallel, are you trying to increase overhead
> while reducing effective cache size?
I thought you had asked for this test:
On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote:
> while cache=writeback with cleancache enabled in the host should
> give the same effect, but with the extra hypercalls, but with an extra
> copy to manage the host pagecache. It would be good to see results for all three settings.
> My conclusion is that the overhead is quite high, but please double
> check my numbers, maybe I missed something obvious.
I'm not sure what options I have to lower the overhead here - should I be
using something other than hypercalls to communicate with the host?
I know that there are several things being worked on from the zcache
perspective (WasActive, batching, etc.), but is there something that
could be done within the scope of kvm-tmem?
It would be interesting to see results for Xen/TMEM and compare
them to these results.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-08 16:06 ` Dan Magenheimer
@ 2012-06-11 11:17 ` Avi Kivity
0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 11:17 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On 06/08/2012 07:06 PM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:avi@redhat.com]
>> <this comment was on Sasha's first round of benchmarking>
>> These results give about 47 usec per page system time (seems quite
>> high), whereas the difference is 0.7 user per page (seems quite low, for
>> 1 or 2 syscalls per page). Can you post a snapshot of kvm_stat while
>> this is running?
>
> Note that the userspace difference is likely all noise.
> No tmem/zcache activities should be done in userspace. All
> the activities result from either a page fault or kswapd.
s/user/usec/...
> Since each streamed page (assuming no WasActive patch) should
> result in one hypercall and one lzo1x page compression, I suspect
> that 47 usec is a good estimate of the sum of those on Sasha's machine.
It's a huge number for a page. The newer results give lower numbers,
but still quite high.
--
error compiling committee.c: too many arguments to function
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 10:26 ` Sasha Levin
@ 2012-06-11 11:45 ` Avi Kivity
2012-06-11 15:44 ` Dan Magenheimer
0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 11:45 UTC (permalink / raw)
To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm
On 06/11/2012 01:26 PM, Sasha Levin wrote:
>>
>> Strange that system time is lower with cache=writeback.
>
> Maybe because these pages don't get written out immediately? I don't
> have a better guess.
From the guest point of view, it's the same flow. btw, this is a read,
so the difference would be readahead, not write-behind, but the
difference in system time is still unexplained.
>
>> > And finally, KVM TMEM on, caching=none:
>> >
>> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
>> > 2048+0 records in
>> > 2048+0 records out
>> > 8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
>> >
>> > real 1m59.123s
>> > user 0m0.020s
>> > sys 0m29.336s
>> >
>> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
>> > 2048+0 records in
>> > 2048+0 records out
>> > 8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
>> >
>> > real 0m36.950s
>> > user 0m0.005s
>> > sys 0m35.308s
>>
>> So system time more than doubled compared to non-tmem cache=none. The
>> overhead per page is 17s / (8589934592/4096) = 8.1usec. Seems quite high.
>
> Right, but consider it didn't increase real time at all.
Real time is bounded by disk bandwidth. It's a consideration of course,
and all forms of caching increase CPU utilization in the cache-miss
case, but here the overhead is excessive, due both to the lack of
batching and to compression overhead.
>
>> 'perf top' while this is running would be interesting.
>
> I'll update later with this.
>
>> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
>> >
>> > kvm statistics
>> >
>> > kvm_exit 1952342 36037
>> > kvm_entry 1952334 36034
>> > kvm_hypercall 1710568 33948
>>
>> In that test, 56k pages/sec were transferred. Why are we seeing only
>> 33k hypercalls/sec? Shouldn't we have two hypercalls/page (one when
>> evicting a page to make some room, one to read the new page from tmem)?
>
> The guest doesn't do eviction at all, in fact - it doesn't know how big
> the cache is so even if it wanted to, it couldn't evict pages (the only
> thing it does is invalidate pages which have changed in the guest).
IIUC, when the guest reads a page, it first has to make room in its own
pagecache; before dropping a clean page it calls cleancache to dispose
of it, which calls a hypercall which compresses and stores it on the
host. Next a page is allocated and a cleancache hypercall is made to
see if it is in host tmem. So two hypercalls per page, once guest
pagecache is full.
>
> This means it only takes one hypercall/page instead of two.
>> >
>> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
>> >
>> > Zeros:
>> > 12800+0 records in
>> > 12800+0 records out
>> > 53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
>> >
>> > real 11m44.536s
>> > user 0m0.088s
>> > sys 2m0.639s
>> > 12800+0 records in
>> > 12800+0 records out
>> > 53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
>> >
>> > real 11m30.561s
>> > user 0m0.088s
>> > sys 1m57.637s
>>
>> zcache appears not to be helping at all; it's just adding overhead. Is
>> even the compressed file too large?
>>
>> overhead = 1.4 usec/page.
>
> Correct, I've had to further increase the size of this file so that
> zcache would fail here as well. The good case was tested before, here I
> wanted to see what will happen with files that wouldn't have much
> benefit from both regular caching and zcache.
Well, zeroes is not a good test for this since it minimizes zcache
allocation overhead.
>>
>>
>> Overhead = 19 usec/page.
>>
>> This is pretty steep. We have flash storage doing a million iops/sec,
>> and here you add 19 microseconds to that.
>
> Might be interesting to test it with flash storage as well...
Try http://sg.danny.cz/sg/sdebug26.html. You can use it to emulate a
large fast block device without needing tons of RAM (but you can still
populate it with nonzero data).
If using qemu, try ,aio=native to minimize overhead further.
>
>> >
>> >
>> > This is a snapshot of kvm_stats while this test was running:
>> >
>> > kvm statistics
>> >
>> > kvm_entry 168179 20729
>> > kvm_exit 168179 20728
>> > kvm_hypercall 131808 16409
>>
>> The last test was running 19k pages/sec, doesn't quite fit with this
>> measurement. Is the measurement stable or fluctuating?
>
> It's pretty stable when running the "zero" pages, but when switching to
> random files it somewhat fluctuates.
Well, weird.
>
>> >
>> > And finally, KVM TMEM enabled, with caching=writeback:
>>
>> I'm not sure what the point of this is? You have two host-caching
>> mechanisms running in parallel, are you trying to increase overhead
>> while reducing effective cache size?
>
> I thought that you've asked for this test:
>
> On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote:
>> while cache=writeback with cleancache enabled in the host should
>> give the same effect, but with the extra hypercalls, but with an extra
>> copy to manage the host pagecache. It would be good to see results for all three settings.
>
Ah, so it's an even worse worst case. But somehow it's better than cache=none?
>> My conclusion is that the overhead is quite high, but please double
>> check my numbers, maybe I missed something obvious.
>
> I'm not sure what options I have to lower the overhead here, should I be
> using something other than hypercalls to communicate with the host?
>
> I know that there are several things being worked on from zcache
> perspective (WasActive, batching, etc), but is there something that
> could be done within the scope of kvm-tmem?
>
> It would be interesting in seeing results for Xen/TMEM and comparing
> them to these results.
Batching will drastically reduce the number of hypercalls. A different
alternative is to use ballooning to feed the guest free memory so it
doesn't need to hypercall at all. Deciding how to divide free memory
among the guests is hard (but then so is deciding how to divide tmem
memory among guests), and adding dedup on top of that is also hard (ksm?
zksm?). IMO letting the guest have the memory and manage it on its own
will be much simpler and faster compared to the constant chatting that
has to go on if the host manages this memory.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 11:45 ` Avi Kivity
@ 2012-06-11 15:44 ` Dan Magenheimer
2012-06-11 17:06 ` Avi Kivity
0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-11 15:44 UTC (permalink / raw)
To: Avi Kivity, Sasha Levin; +Cc: mtosatti, gregkh, sjenning, Konrad Wilk, kvm
> From: Avi Kivity [mailto:avi@redhat.com]
> >
> > The guest doesn't do eviction at all, in fact - it doesn't know how big
> > the cache is so even if it wanted to, it couldn't evict pages (the only
> > thing it does is invalidate pages which have changed in the guest).
>
> IIUC, when the guest reads a page, it first has to make room in its own
> pagecache; before dropping a clean page it calls cleancache to dispose
> of it, which calls a hypercall which compresses and stores it on the
> host. Next a page is allocated and a cleancache hypercall is made to
> see if it is in host tmem. So two hypercalls per page, once guest
> pagecache is full.
Yes, Avi is correct here.
> >> This is pretty steep. We have flash storage doing a million iops/sec,
> >> and here you add 19 microseconds to that.
> >
> > Might be interesting to test it with flash storage as well...
Well, to be fair, you are comparing a device that costs many
thousands of $US to a software solution that uses idle CPU
cycles and no additional RAM.
> Batching will drastically reduce the number of hypercalls.
For the record, batching CAN be implemented... ramster is essentially
an implementation of batching where the local system is the "guest"
and the remote system is the "host". But with ramster the
overhead to move the data (whether batched or not) is much MUCH
worse than a hypercall and ramster still shows performance advantage.
So, IMHO, one step at a time. Get the foundation code in
place and tune it later if a batching implementation can
be demonstrated to improve performance sufficiently.
> A different
> alternative is to use ballooning to feed the guest free memory so it
> doesn't need to hypercall at all. Deciding how to divide free memory
> among the guests is hard (but then so is deciding how to divide tmem
> memory among guests), and adding dedup on top of that is also hard (ksm?
> zksm?). IMO letting the guest have the memory and manage it on its own
> will be much simpler and faster compared to the constant chatting that
> has to go on if the host manages this memory.
Here we disagree, maybe violently. All existing solutions that
try to manage memory across multiple tenants from an "external
memory manager policy" fail miserably. Tmem is at least trying
something new by actively involving both the host and the guest
in the policy (guest decides which pages, host decides how many)
and without the massive changes required for something like
IBM's solution (forgot what it was called). Yes, tmem has
overhead but since the overhead only occurs where pages
would otherwise have to be read/written from disk, the
overhead is well "hidden".
BTW, dedup in zcache is fairly easy to implement because the
pages can only be read/written as an entire page and only
through a well-defined API. Xen does it (with optional
compression), zcache could also, but it never made much sense
for zcache when there was only one tenant. KVM of course
benefits from KSM, but IIUC KSM only works on anonymous pages.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 15:44 ` Dan Magenheimer
@ 2012-06-11 17:06 ` Avi Kivity
2012-06-11 19:25 ` Sasha Levin
2012-06-12 1:18 ` Dan Magenheimer
0 siblings, 2 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-11 17:06 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
> > >> This is pretty steep. We have flash storage doing a million iops/sec,
> > >> and here you add 19 microseconds to that.
> > >
> > > Might be interesting to test it with flash storage as well...
>
> Well, to be fair, you are comparing a device that costs many
> thousands of $US to a software solution that uses idle CPU
> cycles and no additional RAM.
You don't know that those cycles are idle. And when in fact you have no
additional RAM, those cycles are wasted to no benefit.
The fact that I/O is being performed doesn't mean that we can waste
cpu. Those cpu cycles can be utilized by other processes on the same
guest or by other guests.
>
> > Batching will drastically reduce the number of hypercalls.
>
> For the record, batching CAN be implemented... ramster is essentially
> an implementation of batching where the local system is the "guest"
> and the remote system is the "host". But with ramster the
> overhead to move the data (whether batched or not) is much MUCH
> worse than a hypercall and ramster still shows performance advantage.
Sure, you can buffer pages in memory but then you add yet another copy.
I know you think copies are cheap but I disagree.
> So, IMHO, one step at a time. Get the foundation code in
> place and tune it later if a batching implementation can
> be demonstrated to improve performance sufficiently.
Sorry, no, first demonstrate no performance regressions, then we can
talk about performance improvements.
> > A different
> > alternative is to use ballooning to feed the guest free memory so it
> > doesn't need to hypercall at all. Deciding how to divide free memory
> > among the guests is hard (but then so is deciding how to divide tmem
> > memory among guests), and adding dedup on top of that is also hard (ksm?
> > zksm?). IMO letting the guest have the memory and manage it on its own
> > will be much simpler and faster compared to the constant chatting that
> > has to go on if the host manages this memory.
>
> Here we disagree, maybe violently. All existing solutions that
> try to do manage memory across multiple tenants from an "external
> memory manager policy" fail miserably. Tmem is at least trying
> something new by actively involving both the host and the guest
> in the policy (guest decides which pages, host decided how many)
> and without the massive changes required for something like
> IBM's solution (forgot what it was called).
cmm2
> Yes, tmem has
> overhead but since the overhead only occurs where pages
> would otherwise have to be read/written from disk, the
> overhead is well "hidden".
The overhead is NOT hidden. We spent a lot of effort tuning virtio-blk to
reduce its overhead, and now you add 6-20 microseconds per page. A
guest may easily be reading a quarter million pages per second; this
adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.
Note that you don't even have to issue I/O to get a tmem hypercall
invoked. Allocate a ton of memory and you get cleancache calls for
each page that passes through the tail of the LRU. Again with the upper
end, allocating a gigabyte can now take a few seconds extra.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 17:06 ` Avi Kivity
@ 2012-06-11 19:25 ` Sasha Levin
2012-06-11 19:56 ` Sasha Levin
2012-06-12 10:12 ` Avi Kivity
2012-06-12 1:18 ` Dan Magenheimer
1 sibling, 2 replies; 20+ messages in thread
From: Sasha Levin @ 2012-06-11 19:25 UTC (permalink / raw)
To: Avi Kivity; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote:
> Sorry, no, first demonstrate no performance regressions, then we can
> talk about performance improvements.
No performance regressions? For caching? How would that work?
Or even if you meant just the kvm-tmem interface overhead, I don't see
how that would work.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 19:25 ` Sasha Levin
@ 2012-06-11 19:56 ` Sasha Levin
2012-06-12 11:46 ` Avi Kivity
2012-06-12 10:12 ` Avi Kivity
1 sibling, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2012-06-11 19:56 UTC (permalink / raw)
To: Avi Kivity; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On Mon, 2012-06-11 at 21:25 +0200, Sasha Levin wrote:
> On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote:
> > Sorry, no, first demonstrate no performance regressions, then we can
> > talk about performance improvements.
>
> No performance regressions? For caching? How would that work?
>
> Or even if you meant just the kvm-tmem interface overhead, I don't see
> how that would work.
btw, so far we've been poking on half of the code here.
What about frontswap over kvm-tmem? are there any specific tests you'd
like to see there?
* RE: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 17:06 ` Avi Kivity
2012-06-11 19:25 ` Sasha Levin
@ 2012-06-12 1:18 ` Dan Magenheimer
2012-06-12 10:09 ` Avi Kivity
1 sibling, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-12 1:18 UTC (permalink / raw)
To: Avi Kivity; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
> From: Avi Kivity [mailto:avi@redhat.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
> > > >> This is pretty steep. We have flash storage doing a million iops/sec,
> > > >> and here you add 19 microseconds to that.
> > > >
> > > > Might be interesting to test it with flash storage as well...
> >
> > Well, to be fair, you are comparing a device that costs many
> > thousands of $US to a software solution that uses idle CPU
> > cycles and no additional RAM.
>
> You don't know that those cycles are idle. And when in fact you have no
> additional RAM, those cycles are wasted to no benefit.
>
> The fact that I/O is being performed doesn't mean that we can waste
> cpu. Those cpu cycles can be utilized by other processes on the same
> guest or by other guests.
You're right of course, so I apologize for oversimplifying... but
so are you. Let's take a step back:
IMHO, a huge part (majority?) of computer science these days is
trying to beat Amdahl's law. On many machines/workloads,
especially in virtual environments, RAM is the bottleneck.
Tmem's role is, when RAM is the bottleneck, to increase RAM
effective size AND, in a multi-tenant environment, flexibility
at the cost of CPU cycles. But tmem also is designed to be very
dynamically flexible so that it either has low CPU cost when it is
not being used OR can be dynamically disabled/re-enabled with
reasonably low overhead.
Why I think you are oversimplifying: "those cpu cycles can be
utilized by other processes on the same guest or by other
guests" pre-supposes that cpu availability is the bottleneck.
It would be interesting, if it were possible, to measure for how
many systems (with modern processors) this is true.
I'm not arguing that they don't exist but I suspect they are
fairly rare these days, even for KVM systems.
> > > Batching will drastically reduce the number of hypercalls.
> >
> > For the record, batching CAN be implemented... ramster is essentially
> > an implementation of batching where the local system is the "guest"
> > and the remote system is the "host". But with ramster the
> > overhead to move the data (whether batched or not) is much MUCH
> > worse than a hypercall and ramster still shows performance advantage.
>
> Sure, you can buffer pages in memory but then you add yet another copy.
> I know you think copies are cheap but I disagree.
I only think copies are *relatively* cheap. Orders of magnitude
cheaper than some alternatives. So if it takes two page copies
or even ten to replace a disk access, yes I think copies are cheap.
(But I do understand your point.)
> > So, IMHO, one step at a time. Get the foundation code in
> > place and tune it later if a batching implementation can
> > be demonstrated to improve performance sufficiently.
>
> Sorry, no, first demonstrate no performance regressions, then we can
> talk about performance improvements.
Well that's an awfully hard bar to clear, even with any of the many
changes being merged every release into the core Linux mm subsystem.
Any change to memory management will have some positive impacts on some
workloads and some negative impacts on others.
> > > A different
> > > alternative is to use ballooning to feed the guest free memory so it
> > > doesn't need to hypercall at all. Deciding how to divide free memory
> > > among the guests is hard (but then so is deciding how to divide tmem
> > > memory among guests), and adding dedup on top of that is also hard (ksm?
> > > zksm?). IMO letting the guest have the memory and manage it on its own
> > > will be much simpler and faster compared to the constant chatting that
> > > has to go on if the host manages this memory.
> >
> > Here we disagree, maybe violently. All existing solutions that
> > try to manage memory across multiple tenants from an "external
> > memory manager policy" fail miserably. Tmem is at least trying
> > something new by actively involving both the host and the guest
> > in the policy (guest decides which pages, host decides how many)
> > and without the massive changes required for something like
> > IBM's solution (forgot what it was called).
>
> cmm2
That's the one. Thanks for the reminder!
> > Yes, tmem has
> > overhead but since the overhead only occurs where pages
> > would otherwise have to be read/written from disk, the
> > overhead is well "hidden".
>
> The overhead is NOT hidden. We spent a lot of effort tuning virtio-blk to
> reduce its overhead, and now you add 6-20 microseconds per page. A
> guest may easily be reading a quarter million pages per second; this
> adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.
>
> Note that you don't even have to issue I/O to get a tmem hypercall
> invoked. Allocate a ton of memory and you get cleancache calls for
> each page that passes through the tail of the LRU. Again with the upper
> end, allocating a gigabyte can now take a few seconds extra.
Though not precisely so, we are arguing throughput vs latency here
and the two can't always be mixed.
And if, in allocating a GB of memory, you are tossing out useful
pagecache pages, and those pagecache pages can instead be preserved
by tmem thus saving N page faults and order(N) disk accesses,
your savings are false economy. I think Sasha's numbers
demonstrate that nicely.
Anyway, as I've said all along, let's look at the numbers.
I've always admitted that tmem on an old uniprocessor should
be disabled. If no performance degradation in that environment
is a requirement for KVM-tmem to be merged, that is certainly
your choice. And if "more CPU cycles used" is a metric,
definitely, tmem is not going to pass because that's exactly
what it's doing: trading more CPU cycles for better RAM
efficiency == less disk accesses.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-12 1:18 ` Dan Magenheimer
@ 2012-06-12 10:09 ` Avi Kivity
2012-06-12 16:40 ` Dan Magenheimer
0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 10:09 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On 06/12/2012 04:18 AM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:avi@redhat.com]
>> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>>
>> On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
>> > > >> This is pretty steep. We have flash storage doing a million iops/sec,
>> > > >> and here you add 19 microseconds to that.
>> > > >
>> > > > Might be interesting to test it with flash storage as well...
>> >
>> > Well, to be fair, you are comparing a device that costs many
>> > thousands of $US to a software solution that uses idle CPU
>> > cycles and no additional RAM.
>>
>> You don't know that those cycles are idle. And when in fact you have no
>> additional RAM, those cycles are wasted to no benefit.
>>
>> The fact that I/O is being performed doesn't mean that we can waste
>> cpu. Those cpu cycles can be utilized by other processes on the same
>> guest or by other guests.
>
> You're right of course, so I apologize for oversimplifying... but
> so are you. Let's take a step back:
>
> IMHO, a huge part (majority?) of computer science these days is
> trying to beat Amdahl's law. On many machines/workloads,
> especially in virtual environments, RAM is the bottleneck.
> Tmem's role is, when RAM is the bottleneck, to increase RAM
> effective size AND, in a multi-tenant environment, flexibility
> at the cost of CPU cycles. But tmem also is designed to be very
> dynamically flexible so that it either has low CPU cost when it
> not being used OR can be dynamically disabled/re-enabled with
> reasonably low overhead.
>
> Why I think you are oversimplifying: "those cpu cycles can be
> utilized by other processes on the same guest or by other
> guests" pre-supposes that cpu availability is the bottleneck.
> It would be interesting, if it were possible, to measure for how
> many systems (with modern processors) this is true.
> I'm not arguing that they don't exist but I suspect they are
> fairly rare these days, even for KVM systems.
In a given host, either cpu or memory is the bottleneck. If you have
both free memory and free cycles, you pack more guests on that machine.
During off-peak you may have both, but we need to see what happens
during the peak; off-peak we're doing okay.
So on such a host, during peak, either the cpu is churning away and we
can't spare those cycles for tmem, or memory is packed full of guests
and tmem won't provide much benefit (but still consume those cycles).
>
>> > > Batching will drastically reduce the number of hypercalls.
>> >
>> > For the record, batching CAN be implemented... ramster is essentially
>> > an implementation of batching where the local system is the "guest"
>> > and the remote system is the "host". But with ramster the
>> > overhead to move the data (whether batched or not) is much MUCH
>> > worse than a hypercall and ramster still shows performance advantage.
>>
>> Sure, you can buffer pages in memory but then you add yet another copy.
>> I know you think copies are cheap but I disagree.
>
> I only think copies are *relatively* cheap. Orders of magnitude
> cheaper than some alternatives. So if it takes two page copies
> or even ten to replace a disk access, yes I think copies are cheap.
> (But I do understand your point.)
The copies are cheaper than a disk access, yes, but you need to factor
in the probability of a disk access being saved. cleancache already
works on the tail end of the lru, we're dumping those pages because they
have low access frequency, so the probability starts out low. If many
guests are active (so we need the cpu resources), then they also compete
for tmem resources, and per-guest it becomes less effective as well.
>
>> > So, IMHO, one step at a time. Get the foundation code in
>> > place and tune it later if a batching implementation can
>> > be demonstrated to improve performance sufficiently.
>>
>> Sorry, no, first demonstrate no performance regressions, then we can
>> talk about performance improvements.
>
> Well that's an awfully hard bar to clear, even with any of the many
> changes being merged every release into the core Linux mm subsystem.
> Any change to memory management will have some positive impacts on some
> workloads and some negative impacts on others.
Right, that's too harsh. But these benchmarks show a doubling (or even
more) of cpu overhead, and that holds whether the cache is effective or
not. That is simply way too much to consider.
Look at the block, vfs, and mm layers. Huge pains have been taken to
batch everything and avoid per-page work -- 20 years of not having
enough cycles. And here you throw all this out of the window with
per-page crossing of the guest/host boundary.
>
>> > Yes, tmem has
>> > overhead but since the overhead only occurs where pages
>> > would otherwise have to be read/written from disk, the
>> > overhead is well "hidden".
>>
>> The overhead is NOT hidden. We spent a lot of effort tuning virtio-blk to
>> reduce its overhead, and now you add 6-20 microseconds per page. A
>> guest may easily be reading a quarter million pages per second; this
>> adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.
>>
>> Note that you don't even have to issue I/O to get a tmem hypercall
>> invoked. Allocate a ton of memory and you get cleancache calls for
>> each page that passes through the tail of the LRU. Again with the upper
>> end, allocating a gigabyte can now take a few seconds extra.
>
> Though not precisely so, we are arguing throughput vs latency here
> and the two can't always be mixed.
>
> And if, in allocating a GB of memory, you are tossing out useful
> pagecache pages, and those pagecache pages can instead be preserved
> by tmem thus saving N page faults and order(N) disk accesses,
> your savings are false economy. I think Sasha's numbers
> demonstrate that nicely.
It depends. If you have an 8GB guest, then saving the tail end of an
8GB LRU may improve your caching or it may not. But the impact on that
allocation is certain. You're trading off possible marginal improvement
for unconditional performance degradation.
>
> Anyway, as I've said all along, let's look at the numbers.
> I've always admitted that tmem on an old uniprocessor should
> be disabled. If no performance degradation in that environment
> is a requirement for KVM-tmem to be merged, that is certainly
> your choice. And if "more CPU cycles used" is a metric,
> definitely, tmem is not going to pass because that's exactly
> what it's doing: trading more CPU cycles for better RAM
> efficiency == less disk accesses.
Again, the cpu cycles spent are certain, and double the effort needed to
get those pages in the first place. Disk accesses saved will depend on
the workload, and on host memory availability. Turning tmem on will
certainly generate performance regressions as well as improvements.
Maybe on Xen the tradeoff is different (hypercalls ought to be faster on
xenpv), but the numbers I saw on kvm aren't good.
--
error compiling committee.c: too many arguments to function
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 19:25 ` Sasha Levin
2012-06-11 19:56 ` Sasha Levin
@ 2012-06-12 10:12 ` Avi Kivity
1 sibling, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 10:12 UTC (permalink / raw)
To: Sasha Levin; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On 06/11/2012 10:25 PM, Sasha Levin wrote:
> On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote:
>> Sorry, no, first demonstrate no performance regressions, then we can
>> talk about performance improvements.
>
> No performance regressions? For caching? How would that work?
A small degradation might be acceptable. 2X cpu consumption is not.
IMO "using host memory" is the problem, because it involves copies and
hypercalls. Try giving the memory to the guest, either through the
balloon or through a pci device that exposes memory that can be
withdrawn. That will make everything *much* faster.
>
> Or even if you meant just the kvm-tmem interface overhead, I don't see
> how that would work.
>
I meant the overall overhead, as seen by users.
--
error compiling committee.c: too many arguments to function
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-11 19:56 ` Sasha Levin
@ 2012-06-12 11:46 ` Avi Kivity
2012-06-12 11:58 ` Gleb Natapov
0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 11:46 UTC (permalink / raw)
To: Sasha Levin; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On 06/11/2012 10:56 PM, Sasha Levin wrote:
>
> btw, so far we've been poking on half of the code here.
>
> What about frontswap over kvm-tmem? are there any specific tests you'd
> like to see there?
hmm. On one hand, no one swaps these days so there aren't any good
benchmarks for it. On the other hand, with swapping, at least we're
guaranteed the page will be read in the future (unlike cache, where it's
quite possible it won't be). I don't know.
--
error compiling committee.c: too many arguments to function
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-12 11:46 ` Avi Kivity
@ 2012-06-12 11:58 ` Gleb Natapov
2012-06-12 12:01 ` Avi Kivity
0 siblings, 1 reply; 20+ messages in thread
From: Gleb Natapov @ 2012-06-12 11:58 UTC (permalink / raw)
To: Avi Kivity
Cc: Sasha Levin, Dan Magenheimer, mtosatti, gregkh, sjenning,
Konrad Wilk, kvm
On Tue, Jun 12, 2012 at 02:46:38PM +0300, Avi Kivity wrote:
> On 06/11/2012 10:56 PM, Sasha Levin wrote:
> >
> > btw, so far we've been poking on half of the code here.
> >
> > What about frontswap over kvm-tmem? are there any specific tests you'd
> > like to see there?
>
> hmm. On one hand, no one swaps these days so there aren't any good
> benchmarks for it. On the other hand, with swapping, at least we're
> guaranteed the page will be read in the future (unlike cache, where it's
> quite possible it won't be). I don't know.
>
>
Swapped page can be discarded without reading too.
--
Gleb.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-12 11:58 ` Gleb Natapov
@ 2012-06-12 12:01 ` Avi Kivity
0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 12:01 UTC (permalink / raw)
To: Gleb Natapov
Cc: Sasha Levin, Dan Magenheimer, mtosatti, gregkh, sjenning,
Konrad Wilk, kvm
On 06/12/2012 02:58 PM, Gleb Natapov wrote:
> On Tue, Jun 12, 2012 at 02:46:38PM +0300, Avi Kivity wrote:
>> On 06/11/2012 10:56 PM, Sasha Levin wrote:
>> >
>> > btw, so far we've been poking on half of the code here.
>> >
>> > What about frontswap over kvm-tmem? are there any specific tests you'd
>> > like to see there?
>>
>> hmm. On one hand, no one swaps these days so there aren't any good
>> benchmarks for it. On the other hand, with swapping, at least we're
>> guaranteed the page will be read in the future (unlike cache, where it's
>> quite possible it won't be). I don't know.
>>
>>
> Swapped page can be discarded without reading too.
Right.
The effects of frontswap can be achieved by swapping to a block device
that sets cache=writeback, more or less (esp. with trim support, you can
discard pages that you won't be needing again before they hit disk).
--
error compiling committee.c: too many arguments to function
* RE: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-12 10:09 ` Avi Kivity
@ 2012-06-12 16:40 ` Dan Magenheimer
2012-06-12 17:54 ` Avi Kivity
0 siblings, 1 reply; 20+ messages in thread
From: Dan Magenheimer @ 2012-06-12 16:40 UTC (permalink / raw)
To: Avi Kivity; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
> From: Avi Kivity [mailto:avi@redhat.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
I started off with a point-by-point comment on most of your
responses about the tradeoffs of how tmem works, but decided
it best to simply say we disagree and kvm-tmem will need to
prove who is right.
> >> Sorry, no, first demonstrate no performance regressions, then we can
> >> talk about performance improvements.
> >
> > Well that's an awfully hard bar to clear, even with any of the many
> > changes being merged every release into the core Linux mm subsystem.
> > Any change to memory management will have some positive impacts on some
> > workloads and some negative impacts on others.
>
> Right, that's too harsh. But these benchmarks show a doubling (or even
> more) of cpu overhead, and that is whether the cache is effective or
> not. That is simply way too much to consider.
One point here... remember you have contrived a worst case
scenario. The one case Sasha provided outside of that contrived
worst case, as you commented, looks very nice. So the costs/benefits
remain to be seen over a wider set of workloads.
Also, even that contrived case should look quite a bit better
with WasActive properly implemented.
> Look at the block, vfs, and mm layers. Huge pains have been taken to
> batch everything and avoid per-page work -- 20 years of not having
> enough cycles. And here you throw all this out of the window with
> per-page crossing of the guest/host boundary.
Well, to be fair, those 20 years of effort were because
(1) disk seeks are a million times slower than an in-RAM page
copy and (2) SMP systems were rare and expensive. The
world changes...
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
2012-06-12 16:40 ` Dan Magenheimer
@ 2012-06-12 17:54 ` Avi Kivity
0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 17:54 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm
On 06/12/2012 07:40 PM, Dan Magenheimer wrote:
> > From: Avi Kivity [mailto:avi@redhat.com]
> > Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> I started off with a point-by-point comment on most of your
> responses about the tradeoffs of how tmem works, but decided
> it best to simply say we disagree and kvm-tmem will need to
> prove who is right.
That is why I am asking for benchmarks.
> > >> Sorry, no, first demonstrate no performance regressions, then we can
> > >> talk about performance improvements.
> > >
> > > Well that's an awfully hard bar to clear, even with any of the many
> > > changes being merged every release into the core Linux mm subsystem.
> > > Any change to memory management will have some positive impacts on some
> > > workloads and some negative impacts on others.
> >
> > Right, that's too harsh. But these benchmarks show a doubling (or even
> > more) of cpu overhead, and that is whether the cache is effective or
> > not. That is simply way too much to consider.
>
> One point here... remember you have contrived a worst case
> scenario. The one case Sasha provided outside of that contrived
> worst case, as you commented, looks very nice. So the costs/benefits
> remain to be seen over a wider set of workloads.
While the workload is contrived, decreasing benefits with increasing
cache size is nothing new. And here tmem is increasing the cost of all
caching, without guaranteeing any return.
> Also, even that contrived case should look quite a bit better
> with WasActive properly implemented.
I'll be happy to see benchmarks of improved code.
> > Look at the block, vfs, and mm layers. Huge pains have been taken to
> > batch everything and avoid per-page work -- 20 years of not having
> > enough cycles. And here you throw all this out of the window with
> > per-page crossing of the guest/host boundary.
>
> Well, to be fair, those 20 years of effort were because
> (1) disk seeks are a million times slower than an in-RAM page
> copy and (2) SMP systems were rare and expensive. The
> world changes...
I don't see how SMP matters here. You have more cores, you put more
work on them, you don't expect the OS or hypervisor to consume them for
you. In any case you're consuming this cpu on the same core as the
guest, so you're reducing throughput (if caching is ineffective).
Disks are still slow, even fast flash arrays, but tmem is not the only
solution to that problem. You say ballooning has not proven itself in
this area but that doesn't mean it has been proven not to work; and it
doesn't suffer from the inefficiency of crossing the guest/host boundary.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Thread overview: 20+ messages
2012-06-06 13:07 [RFC 00/10] KVM: Add TMEM host/guest support Sasha Levin
2012-06-06 13:24 ` Avi Kivity
2012-06-08 13:20 ` Sasha Levin
2012-06-08 16:06 ` Dan Magenheimer
2012-06-11 11:17 ` Avi Kivity
2012-06-11 8:09 ` Avi Kivity
2012-06-11 10:26 ` Sasha Levin
2012-06-11 11:45 ` Avi Kivity
2012-06-11 15:44 ` Dan Magenheimer
2012-06-11 17:06 ` Avi Kivity
2012-06-11 19:25 ` Sasha Levin
2012-06-11 19:56 ` Sasha Levin
2012-06-12 11:46 ` Avi Kivity
2012-06-12 11:58 ` Gleb Natapov
2012-06-12 12:01 ` Avi Kivity
2012-06-12 10:12 ` Avi Kivity
2012-06-12 1:18 ` Dan Magenheimer
2012-06-12 10:09 ` Avi Kivity
2012-06-12 16:40 ` Dan Magenheimer
2012-06-12 17:54 ` Avi Kivity