From mboxrd@z Thu Jan 1 00:00:00 1970
From: MJ embd
Subject: Re: [PATCH 27/27] KVM: PPC: Add Documentation about PV interface
Date: Fri, 9 Jul 2010 14:41:01 +0530
Message-ID:
References: <1277980982-12433-1-git-send-email-agraf@suse.de> <1277980982-12433-28-git-send-email-agraf@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: kvm-ppc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, KVM list , linuxppc-dev
To: Alexander Graf
Return-path:
In-Reply-To: <1277980982-12433-28-git-send-email-agraf-l3A5Bk7waGM@public.gmane.org>
Sender: kvm-ppc-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: kvm.vger.kernel.org

On Thu, Jul 1, 2010 at 4:13 PM, Alexander Graf wrote:
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
> Signed-off-by: Alexander Graf
>
> ---
>
> v1 -> v2:
>
>  - clarify guest implementation
>  - clarify that privileged instructions still work
>  - explain safe MSR bits
>  - Fix dsisr patch description
>  - change hypervisor calls to use new register values
> ---
>  Documentation/kvm/ppc-pv.txt |  185 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 185 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/ppc-pv.txt
>
> diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
> new file mode 100644
> index 0000000..82de6c6
> --- /dev/null
> +++ b/Documentation/kvm/ppc-pv.txt
> @@ -0,0 +1,185 @@
> +The PPC KVM paravirtual interface
> +=================================
> +
> +The basic execution principle by which KVM on PowerPC works is to run all kernel
> +space code in PR=1 which is user space. This way we trap all privileged
> +instructions and can emulate them accordingly.
> +
> +Unfortunately that is also the downfall. There are quite some privileged
> +instructions that needlessly return us to the hypervisor even though they
> +could be handled differently.
> +
> +This is what the PPC PV interface helps with. It takes privileged instructions
> +and transforms them into unprivileged ones with some help from the hypervisor.
> +This cuts down virtualization costs by about 50% on some of my benchmarks.
> +
> +The code for that interface can be found in arch/powerpc/kernel/kvm*
> +
> +Querying for existence
> +======================
> +
> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
> +the PVR register contains an id that identifies your CPU type. If, however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +       LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +       mfpvr   r5
> +       [r5 still contains KVM_PVR_PARA]
> +
> +Once determined to run under a PV capable KVM, you can now use hypercalls as
> +described below.
> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> +       1) Call an invalid instruction
> +       2) Call the "sc" instruction with a parameter to "sc"
> +       3) Call the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
> +rather unclear if the sc is targeted directly for the hypervisor or the
> +supervisor. It would also require that we read the syscall issuing instruction
> +every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
> +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
> +
> +The parameters are as follows:
> +
> +       r0              KVM_SC_MAGIC_R0
> +       r3              KVM_SC_MAGIC_R3         Return code
> +       r4              Hypercall number
> +       r5              First parameter
> +       r6              Second parameter
> +       r7              Third parameter
> +       r8              Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
> +
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +With this hypercall issued the guest always gets the magic page mapped at the
> +desired location in effective and physical address space. For now, we always
> +map the page to -4096. This way we can access it using absolute load and store
> +functions. The following instruction reads the first field of the magic page:
> +
> +       ld      rX, -4096(0)
> +
> +The interface is designed to be extensible should there be need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Only if the host supports the additional features, make use of them.
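
An inline note to check my understanding of the -4096 mapping. This is only a sketch, not code from the patch: it assumes the displacement in "ld rX, -4096(0)" is a sign-extended 16-bit value added to a base of 0, and the helper names magic_ea32 and magic_ea64 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): effective address of
 * "ld rX, disp(0)" on a 32-bit guest. The base is the literal value 0,
 * so the address is just the sign-extended displacement. */
uint32_t magic_ea32(int16_t disp)
{
    /* "ld"/"std" encode the displacement in units of 4 bytes, so the
     * two low bits have to be zero. */
    assert((disp & 3) == 0);
    return (uint32_t)(int32_t)disp;
}

/* Same calculation for a 64-bit guest. */
uint64_t magic_ea64(int16_t disp)
{
    assert((disp & 3) == 0);
    return (uint64_t)(int64_t)disp;
}
```

If that reading is right, magic_ea32(-4096) is 0xfffff000 and magic_ea64(-4096) is 0xfffffffffffff000, so the magic page occupies the last 4 KiB of the effective address space and every field can be reached without setting up a base register.
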
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> +       __u64 scratch1;
> +       __u64 scratch2;
> +       __u64 scratch3;
> +       __u64 critical;         /* Guest may not get interrupts if == r1 */
> +       __u64 sprg0;
> +       __u64 sprg1;
> +       __u64 sprg2;
> +       __u64 sprg3;
> +       __u64 srr0;
> +       __u64 srr1;
> +       __u64 dar;
> +       __u64 msr;
> +       __u32 dsisr;
> +       __u32 int_pending;      /* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields are always 32
> +bit aligned.
> +
> +MSR bits
> +========
> +
> +The MSR contains bits that require hypervisor intervention and bits that do
> +not require direct hypervisor intervention because they only get interpreted
> +when entering the guest or don't have any impact on the hypervisor's behavior.
> +
> +The following bits are safe to be set inside the guest:
> +
> +  MSR_EE
> +  MSR_RI
> +  MSR_CR
> +  MSR_ME
> +
> +If any other bit changes in the MSR, please still use mtmsr(d).
> +
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accommodate for big
> +endianness.
> +
> +The following is a list of mappings the Linux kernel performs when running as
> +guest. Implementing any of those mappings is optional, as the instruction traps
> +also act on the shared page. So calling privileged instructions still works as
> +before.
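
Another inline note, this time on the 32 bit "ld" to "lwz" rewrite with the added offset of 4. The sketch below is my own reading, not code from the series: the struct is copied from the quoted layout with C99 fixed-width types standing in for __u64/__u32, and low_word_offset and magic_disp are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout copied from the quoted patch; uint64_t/uint32_t replace the
 * kernel's __u64/__u32 so this builds in user space. */
struct kvm_vcpu_arch_shared {
    uint64_t scratch1;
    uint64_t scratch2;
    uint64_t scratch3;
    uint64_t critical;
    uint64_t sprg0;
    uint64_t sprg1;
    uint64_t sprg2;
    uint64_t sprg3;
    uint64_t srr0;
    uint64_t srr1;
    uint64_t dar;
    uint64_t msr;
    uint32_t dsisr;
    uint32_t int_pending;
};

/* Hypothetical helper: offset a 32-bit "lwz"/"stw" would use in place of
 * a 64-bit "ld"/"std". On big endian the low word of a 64-bit field sits
 * 4 bytes after the start of the field. */
size_t low_word_offset(size_t field_offset)
{
    assert(field_offset % 8 == 0);  /* only meaningful for 64-bit fields */
    return field_offset + 4;
}

/* Hypothetical helper: displacement relative to base 0, assuming the
 * page is mapped at -4096 as the text says. */
long magic_disp(size_t offset_in_page)
{
    return -4096L + (long)offset_in_page;
}
```

With this layout msr starts at byte 88, so low_word_offset() gives 92 and magic_disp(92) gives -4004. If I read the text correctly, a 32 bit guest would then turn "mfmsr rX" into "lwz rX, -4004(0)", while dsisr needs no adjustment because it is already a 32 bit field.
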
> +
> +From                    To
> +====                    ==
> +
> +mfmsr   rX              ld      rX, magic_page->msr
> +mfsprg  rX, 0           ld      rX, magic_page->sprg0
> +mfsprg  rX, 1           ld      rX, magic_page->sprg1
> +mfsprg  rX, 2           ld      rX, magic_page->sprg2
> +mfsprg  rX, 3           ld      rX, magic_page->sprg3
> +mfsrr0  rX              ld      rX, magic_page->srr0
> +mfsrr1  rX              ld      rX, magic_page->srr1
> +mfdar   rX              ld      rX, magic_page->dar
> +mfdsisr rX              lwz     rX, magic_page->dsisr
> +
> +mtmsr   rX              std     rX, magic_page->msr
> +mtsprg  0, rX           std     rX, magic_page->sprg0
> +mtsprg  1, rX           std     rX, magic_page->sprg1
> +mtsprg  2, rX           std     rX, magic_page->sprg2
> +mtsprg  3, rX           std     rX, magic_page->sprg3
> +mtsrr0  rX              std     rX, magic_page->srr0
> +mtsrr1  rX              std     rX, magic_page->srr1
> +mtdar   rX              std     rX, magic_page->dar
> +mtdsisr rX              stw     rX, magic_page->dsisr
> +
> +tlbsync                 nop
> +
> +mtmsrd  rX, 0           b
> +mtmsr                   b
> +
> +mtmsrd  rX, 1           b
> +
> +[BookE only]
> +wrteei  [0|1]           b
> +
> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver.
> +To enable patching of those, we keep some
> +RAM around where we can live translate instructions to. What happens is the
> +following:
> +
> +       1) copy emulation code to memory
> +       2) patch that code to fit the emulated instruction
> +       3) patch that code to return to the original pc + 4
> +       4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
> +

Which patch implements this mapping? Can you please point me to it?

> --
> 1.6.0.2
>

-- 
-mj