From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932619AbXAGRrc (ORCPT ); Sun, 7 Jan 2007 12:47:32 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932620AbXAGRrc (ORCPT ); Sun, 7 Jan 2007 12:47:32 -0500 Received: from mx2.mail.elte.hu ([157.181.151.9]:57303 "EHLO mx2.mail.elte.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932619AbXAGRrb (ORCPT ); Sun, 7 Jan 2007 12:47:31 -0500 Date: Sun, 7 Jan 2007 18:44:17 +0100 From: Ingo Molnar To: Avi Kivity Cc: kvm-devel , linux-kernel@vger.kernel.org Subject: Re: [announce] [patch] KVM paravirtualization for Linux Message-ID: <20070107174416.GA14607@elte.hu> References: <20070105215223.GA5361@elte.hu> <45A0E586.50806@qumranet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <45A0E586.50806@qumranet.com> User-Agent: Mutt/1.4.2.2i X-ELTE-VirusStatus: clean X-ELTE-SpamScore: -5.9 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-5.9 required=5.9 tests=ALL_TRUSTED,BAYES_00 autolearn=no SpamAssassin version=3.0.3 -3.3 ALL_TRUSTED Did not pass through any untrusted hosts -2.6 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org * Avi Kivity wrote: > >2-task context-switch performance (in microseconds, lower is better): > > > > native: 1.11 > > ---------------------------------- > > Qemu: 61.18 > > KVM upstream: 53.01 > > KVM trunk: 6.36 > > KVM trunk+paravirt/cr3: 1.60 > > > >i.e. 2-task context-switch performance is faster by a factor of 4, and > >is now quite close to native speed! > > > > Very impressive! The gain probably comes not only from avoiding the > vmentry/vmexit, but also from avoiding the flushing of the global page > tlb entries. 90% of the win comes from the avoidance of the VM exit. To quantify this more precisely i have added an artificial __flush_tlb_global() call to after switch_to(), just to see how much impact an extra global flush has on the native kernel. Context-switch cost went from 1.11 usecs to 1.65 usecs. Then i added a __flush_tlb(), which made the cost go to 1.75, which means that the global flush component is at around 0.5 usecs. > >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better): > > > > native: 0.25 > > ---------------------------------- > > Qemu: 7.8 > > KVM upstream: 2.8 > > KVM trunk: 0.55 > > KVM paravirt/cr3: 0.36 > > > >almost twice as fast. > > > >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better): > > > > native: 0.9 > > ---------------------------------- > > Qemu: 35.2 > > KVM upstream: 9.4 > > KVM trunk: 2.8 > > KVM paravirt/cr3: 2.2 > > > >still a 30% improvement - which isnt too bad considering that 200 tasks > >are context-switching in this workload and the cr3 cache in current CPUs > >is only 4 entries. > > > > This is a little too good to be true. Were both runs with the same > KVM_NUM_MMU_PAGES? yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' run should have been labeled as: 'cr3 tree with paravirt turned off'. That's not completely 'trunk' but close to it, and all other changes (like elimination of unnecessary TLB flushes) are fairly applied to both. i also did a run with much less MMU cache pages of 256, and hackbench 1 stayed the same, while hackbench 5 numbers started fluctuating badly (i think that workload if trashing the MMU cache badly). > I'm also concerned that at this point in time the cr3 optimizations > will only show an improvement in microbenchmarks. In real life > workloads a context switch is usually preceded by an I/O, and with the > current sorry state of kvm I/O the context switch time would be > dominated by the I/O time. oh, i agreed completely - but in my opinion accelerating virtual I/O is really easy. Accelerating the context-switch path (and basic syscall overhead like KVM does) is /hard/. So i wanted to see whether KVM runs well in all the hard cases, before looking at the low hanging performance fruits in the I/O area =B-) also note that there's lots of internal reasons why application workloads can be heavily context-switching - it's not just I/O that generates them. (pipes, critical sections / futexes, etc.) So having near-native performance for context-switches is very important. > >+ if (irq & 8) { > >+ outb(cached_slave_mask, PIC_SLAVE_IMR); > >+ outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */ > >+ outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' > >to master-IRQ2 */ > >+ } else { > >+ outb(cached_master_mask, PIC_MASTER_IMR); > >+ /* 'Specific EOI' to master: */ > >+ outb(0x60+irq, PIC_MASTER_CMD); > >+ } > >+ spin_unlock_irqrestore(&i8259A_lock, flags); > >+} > > Any reason this can't be applied to mainline? There's probably no > downside to native, and it would benefit all virtualization solutions > equally. this is legacy stuff ... > >- u64 *pae_root; > >+ u64 *pae_root[KVM_CR3_CACHE_SIZE]; > > hmm. wouldn't it be simpler to have pae_root always point at the > current root? does that guarantee that it's available? I wanted to 'pin' the root itself this way, to make sure that if a guest switches to it via the cache, that it's truly available and a valid root. cr3 addresses are non-virtual so this is the only mechanism available to guarantee that the host-side memory truly contains a root pagetable. > >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE; > >+ } > > } > > vcpu->mmu.root_hpa = INVALID_PAGE; > > } > > You keep the page directories pinned here. [...] yes. > [...] This can be a problem if a guest frees a page directory, and > then starts using it as a regular page. kvm sometimes chooses not to > emulate a write to a guest page table, but instead to zap it, which is > impossible when the page is freed. You need to either unpin the page > when that happens, or add a hypercall to let kvm know when a page > directory is freed. the cache is zapped upon pagefaults anyway, so unpinning ought to be possible. Which one would you prefer? > >- for (i = 0; i < 4; ++i) > >- vcpu->mmu.pae_root[i] = INVALID_PAGE; > >+ for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) { > >+ /* > >+ * When emulating 32-bit mode, cr3 is only 32 bits even on > >+ * x86_64. Therefore we need to allocate shadow page tables > >+ * in the first 4GB of memory, which happens to fit the DMA32 > >+ * zone: > >+ */ > >+ page = alloc_page(GFP_KERNEL | __GFP_DMA32); > >+ if (!page) > >+ goto error_1; > >+ > >+ ASSERT(!vcpu->mmu.pae_root[j]); > >+ vcpu->mmu.pae_root[j] = page_address(page); > >+ for (i = 0; i < 4; ++i) > >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE; > >+ } > > Since a pae root uses just 32 bytes, you can store all cache entries > in a single page. Not that it matters much. yeah - i wanted to extend the current code in a safe way, before optimizing it. > >+#define KVM_API_MAGIC 0x87654321 > >+ > > is the vmm userspace interface. The guest/host > interface should probably go somewhere else. yeah. kvm_para.h? Ingo