Date: Thu, 30 Jul 2020 18:18:28 -0700 (PDT)
From: Stefano Stabellini
To: Julien Grall
Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson, George Dunlap,
 Bertrand Marquis, Jan Beulich, xen-devel@lists.xenproject.org, nd@arm.com,
 Volodymyr Babchuk, Roger Pau Monné
Subject: Re: [PATCH v3] xen/arm: Convert runstate address during hypcall
References: <3911d221ce9ed73611b93aa437b9ca227d6aa201.1596099067.git.bertrand.marquis@arm.com>

On Thu, 30 Jul 2020, Julien Grall wrote:
> Hi Bertrand,
>
> To avoid extra work on your side, I would recommend to wait a bit before
> sending a new version. It would be good to at least settle the conversation
> in v2 regarding the approach taken.
>
> On 30/07/2020 11:24, Bertrand Marquis wrote:
> > At the moment on Arm, a Linux guest running with KPTI enabled will
> > cause the following error when a context switch happens in user mode:
> > (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
> >
> > The error is caused by the virtual address for the runstate area
> > registered by the guest only being accessible when the guest is running
> > in kernel space, if KPTI is enabled.
> >
> > To solve this issue, this patch does the translation from virtual
> > address to physical address during the hypercall and maps the
> > required pages using vmap. This removes the conversion from virtual
> > to physical address during the context switch, which solves the
> > problem with KPTI.
>
> To echo what Jan said on the previous version, this is a change in a stable
> ABI and therefore may break existing guests. FAOD, I agree in principle with
> the idea. However, we want to explain why breaking the ABI is the *only*
> viable solution.
>
> From my understanding, it is not possible to fix this without an ABI breakage
> because the hypervisor doesn't know when the guest will switch back from
> userspace to kernel space. The risk is that the information provided by the
> runstate wouldn't be accurate and could affect how the guest handles stolen
> time.
>
> Additionally, there are a few issues with the current interface:
> 1) It assumes the virtual address cannot be re-used by userspace.
> Thankfully Linux has a split address space, but this may change with KPTI in
> place.
> 2) When updating the page-tables, the guest has to go through an invalid
> mapping, so the translation may fail at any point.
>
> IOW, the existing interface can lead to random memory corruption and
> inaccuracy of the stolen time.
>
> > This is done only on the arm architecture; the behaviour on x86 is not
> > modified by this patch and the address conversion is done as before
> > during each context switch.
> >
> > This introduces several limitations in comparison to the previous
> > behaviour (on arm only):
> > - if the guest is remapping the area at a different physical address, Xen
> >   will continue to update the area at the previous physical address. As
> >   the area is in kernel space and usually defined as a global variable,
> >   this is something which is believed not to happen. If this is required
> >   by a guest, it will have to call the hypercall again with the new area
> >   (even if it is at the same virtual address).
> > - the area needs to be mapped during the hypercall. For the same reasons
> >   as for the previous case, even if the area is registered for a different
> >   vcpu. It is believed that registering an area using an unmapped virtual
> >   address is not something that is done.
>
> It is not clear whether the virtual address refers to the current vCPU or
> the vCPU you register the runstate for. From the past discussion, I think
> you refer to the former. It would be good to clarify.
>
> Additionally, all the new restrictions should be documented in the public
> interface, so an OS developer can find the differences between the
> architectures.

Just to paraphrase what Julien wrote, it would be good to improve the
commit message with the points suggested and also write a note in the
header file about the changes to the interface.
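As an illustration of what such a note could spell out, a well-behaved
guest-side registration under the new rules would look roughly like the
sketch below. This is only a sketch using Linux-style names (DEFINE_PER_CPU,
HYPERVISOR_vcpu_op); it is not part of this patch. The point is that the
area lives in kernel data, so it is 64-bit aligned, always mapped, and keeps
its physical address, and that the hypercall has to be issued again if the
guest ever remaps the area:

    /* Illustrative guest-side registration, not part of this patch. */
    static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);

    static void register_runstate_area(unsigned int cpu)
    {
        struct vcpu_register_runstate_memory_area area;

        /* Kernel data: 64-bit aligned, always mapped, stable physical address. */
        area.addr.v = &per_cpu(runstate, cpu);

        /* Must be re-issued if the guest ever remaps the area. */
        if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
                               cpu, &area))
            BUG();
    }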
> To answer Jan's concern, we certainly don't know all the guest OSes in
> existence, however we also need to balance the benefit for a large majority
> of the users.
>
> From previous discussion, the current approach was deemed to be acceptable
> on Arm and, AFAICT, also x86 (see [1]).
>
> TBH, I would rather see the approach be common. For that, we would need an
> agreement from Andrew and Jan on the approach here. Meanwhile, I think this
> is the best approach to address the concern from Arm users.

+1

> > inline functions in headers could not be used as the architecture
> > domain.h is included before the global domain.h making it impossible
> > to use the struct vcpu inside the architecture header.
> > This should not have any performance impact as the hypercall is only
> > called once per vcpu usually.
> >
> > Signed-off-by: Bertrand Marquis
> >
> > ---
> > Changes in v2
> >   - use vmap to map the pages during the hypercall.
> >   - reintroduce initial copy during hypercall.
> >
> > Changes in v3
> >   - Fix Coding style
> >   - Fix vaddr printing on arm32
> >   - use write_atomic to modify state_entry_time update bit (only
> >     in guest structure as the bit is not used inside Xen copy)
> >
> > ---
> >  xen/arch/arm/domain.c        | 161 ++++++++++++++++++++++++++++++-----
> >  xen/arch/x86/domain.c        |  29 ++++++-
> >  xen/arch/x86/x86_64/domain.c |   4 +-
> >  xen/common/domain.c          |  19 ++---
> >  xen/include/asm-arm/domain.h |   9 ++
> >  xen/include/asm-x86/domain.h |  16 ++++
> >  xen/include/xen/domain.h     |   5 ++
> >  xen/include/xen/sched.h      |  16 +---
> >  8 files changed, 206 insertions(+), 53 deletions(-)
> >
> > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> > index 31169326b2..8b36946017 100644
> > --- a/xen/arch/arm/domain.c
> > +++ b/xen/arch/arm/domain.c
> > @@ -19,6 +19,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> > @@ -275,36 +276,156 @@ static void ctxt_switch_to(struct vcpu *n)
> >      virt_timer_restore(n);
> >  }
> > -/* Update per-VCPU guest runstate shared memory area (if registered). */
> > -static void update_runstate_area(struct vcpu *v)
> > +static void cleanup_runstate_vcpu_locked(struct vcpu *v)
> >  {
> > -    void __user *guest_handle = NULL;
> > +    if ( v->arch.runstate_guest )
> > +    {
> > +        vunmap((void *)((unsigned long)v->arch.runstate_guest & PAGE_MASK));
> > +
> > +        put_page(v->arch.runstate_guest_page[0]);
> > +
> > +        if ( v->arch.runstate_guest_page[1] )
> > +            put_page(v->arch.runstate_guest_page[1]);
> > +
> > +        v->arch.runstate_guest = NULL;
> > +    }
> > +}
> > +
> > +void arch_vcpu_cleanup_runstate(struct vcpu *v)
> > +{
> > +    spin_lock(&v->arch.runstate_guest_lock);
> > +
> > +    cleanup_runstate_vcpu_locked(v);
> > +
> > +    spin_unlock(&v->arch.runstate_guest_lock);
> > +}
> > +
> > +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr)
> > +{
> > +    unsigned int offset;
> > +    mfn_t mfn[2];
> > +    struct page_info *page;
> > +    unsigned int numpages;
> >      struct vcpu_runstate_info runstate;
> > +    void *p;
> > -    if ( guest_handle_is_null(runstate_guest(v)) )
> > -        return;
> > +    /* user can pass a NULL address to unregister a previous area */
> > +    if ( vaddr == 0 )
> > +        return 0;
> > +
> > +    offset = vaddr & ~PAGE_MASK;
> > +
> > +    /* provided address must be aligned to a 64bit */
> > +    if ( offset % alignof(struct vcpu_runstate_info) )
>
> This new restriction wants to be explained in the commit message and public
> header.
> > > + { > > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at > > 0x%"PRIvaddr > > + ": Invalid offset\n", vaddr); > > We usually enforce 80 character per lines except for format string. So it is > easier to grep them in the code. > > > + return -EINVAL; > > + } > > + > > + page = get_page_from_gva(v, vaddr, GV2M_WRITE); > > + if ( !page ) > > + { > > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at > > 0x%"PRIvaddr > > + ": Page is not mapped\n", vaddr); > > + return -EINVAL; > > + } > > + > > + mfn[0] = page_to_mfn(page); > > + v->arch.runstate_guest_page[0] = page; > > + > > + if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) ) > > + { > > + /* guest area is crossing pages */ > > + page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE); > > + if ( !page ) > > + { > > + put_page(v->arch.runstate_guest_page[0]); > > + gprintk(XENLOG_WARNING, > > + "Cannot map runstate pointer at 0x%"PRIvaddr > > + ": 2nd Page is not mapped\n", vaddr); > > + return -EINVAL; > > + } > > + mfn[1] = page_to_mfn(page); > > + v->arch.runstate_guest_page[1] = page; > > + numpages = 2; > > + } > > + else > > + { > > + v->arch.runstate_guest_page[1] = NULL; > > + numpages = 1; > > + } > > - memcpy(&runstate, &v->runstate, sizeof(runstate)); > > + p = vmap(mfn, numpages); > > + if ( !p ) > > + { > > + put_page(v->arch.runstate_guest_page[0]); > > + if ( numpages == 2 ) > > + put_page(v->arch.runstate_guest_page[1]); > > - if ( VM_ASSIST(v->domain, runstate_update_flag) ) > > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at > > 0x%"PRIvaddr > > + ": vmap error\n", vaddr); > > + return -EINVAL; > > + } > > + > > + v->arch.runstate_guest = p + offset; > > + > > + if (v == current) > > + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate)); > > + else > > { > > - guest_handle = &v->runstate_guest.p->state_entry_time + 1; > > - guest_handle--; > > - runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > > - __raw_copy_to_guest(guest_handle, > > - (void *)(&runstate.state_entry_time + 1) - 1, > > 1); > > - smp_wmb(); > > + vcpu_runstate_get(v, &runstate); > > + memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate)); > > } > > - __copy_to_guest(runstate_guest(v), &runstate, 1); > > + return 0; > > +} > > + > > +int arch_vcpu_setup_runstate(struct vcpu *v, > > + struct vcpu_register_runstate_memory_area > > area) > > +{ > > + int rc; > > + > > + spin_lock(&v->arch.runstate_guest_lock); > > + > > + /* cleanup if we are recalled */ > > + cleanup_runstate_vcpu_locked(v); > > + > > + rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v); > > + > > + spin_unlock(&v->arch.runstate_guest_lock); > > - if ( guest_handle ) > > + return rc; > > +} > > + > > + > > +/* Update per-VCPU guest runstate shared memory area (if registered). */ > > +static void update_runstate_area(struct vcpu *v) > > +{ > > + spin_lock(&v->arch.runstate_guest_lock); > > + > > + if ( v->arch.runstate_guest ) > > { > > - runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; > > - smp_wmb(); > > - __raw_copy_to_guest(guest_handle, > > - (void *)(&runstate.state_entry_time + 1) - 1, > > 1); > > + if ( VM_ASSIST(v->domain, runstate_update_flag) ) > > + { > > + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > > + write_atomic(&(v->arch.runstate_guest->state_entry_time), > > + v->runstate.state_entry_time); > > NIT: You want to indent v-> at the same level as the argument from the first > line. > > Also, I think you are missing a smp_wmb() here. 
I just wanted to add that I reviewed the patch and aside from the smp_wmb
(and the couple of code style NITs), there is no other issue in the patch
that I could find. No further comments from my side.

> > +        }
> > +
> > +        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
> > +
> > +        if ( VM_ASSIST(v->domain, runstate_update_flag) )
> > +        {
> > +            /* copy must be done before switching the bit */
> > +            smp_wmb();
> > +            v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
> > +            write_atomic(&(v->arch.runstate_guest->state_entry_time),
> > +                        v->runstate.state_entry_time);
>
> Same remark for the indentation.
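To make the barrier point concrete, the ordering being asked for in
update_runstate_area() would roughly be the following (a sketch only, reusing
the names from the patch, not the final code): the guest must observe
XEN_RUNSTATE_UPDATE set before the new data, and the new data before the bit
is cleared, so each write_atomic()/memcpy() pair needs an smp_wmb() between
them:

    if ( VM_ASSIST(v->domain, runstate_update_flag) )
    {
        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
        write_atomic(&v->arch.runstate_guest->state_entry_time,
                     v->runstate.state_entry_time);
        /* Update bit must be visible before the new runstate data. */
        smp_wmb();
    }

    memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));

    if ( VM_ASSIST(v->domain, runstate_update_flag) )
    {
        /* Copy must be visible before the update bit is cleared. */
        smp_wmb();
        v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
        write_atomic(&v->arch.runstate_guest->state_entry_time,
                     v->runstate.state_entry_time);
    }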