Subject: Re: [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space
From: Paolo Bonzini
To: Thomas Gleixner, LKML
Cc: x86@kernel.org, "Paul E. McKenney", Andy Lutomirski, Alexandre Chartre,
    Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu, Petr Mladek,
    Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
    Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon
Date: Wed, 6 May 2020 09:00:16 +0200
References: <20200505131602.633487962@linutronix.de>
 <20200505134059.369802541@linutronix.de>
In-Reply-To: <20200505134059.369802541@linutronix.de>

On 05/05/20 15:16, Thomas Gleixner wrote:
> The async page fault injection into kernel space creates more problems than
> it solves. The host has absolutely no knowledge about the state of the
> guest if the fault happens in CPL0. The only restriction for the host is
> interrupt disabled state. If interrupts are enabled in the guest then the
> exception can hit arbitrary code. The HALT based wait in non-preemptible
> code is a hacky replacement for a proper hypercall.
>
> For the ongoing work to restrict instrumentation and make the RCU idle
> interaction well defined the required extra work for supporting async
> pagefault in CPL0 is just not justified and creates complexity for a
> dubious benefit.
>
> The CPL3 injection is well defined and does not cause any issues as it is
> more or less the same as a regular page fault from CPL3.
>
> Suggested-by: Andy Lutomirski
> Signed-off-by: Thomas Gleixner
> ---
>  arch/x86/kernel/kvm.c |  100 +++-----------------------------------------------
>  1 file changed, 7 insertions(+), 93 deletions(-)
>
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,6 @@ struct kvm_task_sleep_node {
>  	struct swait_queue_head wq;
>  	u32 token;
>  	int cpu;
> -	bool use_halt;
>  };
>
>  static struct kvm_task_sleep_head {
> @@ -98,8 +97,7 @@ static struct kvm_task_sleep_node *_find
>  	return NULL;
>  }
>
> -static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> -				    struct kvm_task_sleep_node *n)
> +static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -117,7 +115,6 @@ static bool kvm_async_pf_queue_task(u32
>
>  	n->token = token;
>  	n->cpu = smp_processor_id();
> -	n->use_halt = use_halt;
>  	init_swait_queue_head(&n->wq);
>  	hlist_add_head(&n->link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -138,7 +135,7 @@ void kvm_async_pf_task_wait_schedule(u32
>
>  	lockdep_assert_irqs_disabled();
>
> -	if (!kvm_async_pf_queue_task(token, false, &n))
> +	if (!kvm_async_pf_queue_task(token, &n))
>  		return;
>
>  	for (;;) {
> @@ -154,91 +151,10 @@ void kvm_async_pf_task_wait_schedule(u32
>  }
>  EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>
> -/*
> - * Invoked from the async page fault handler.
> - */
> -static void kvm_async_pf_task_wait_halt(u32 token)
> -{
> -	struct kvm_task_sleep_node n;
> -
> -	if (!kvm_async_pf_queue_task(token, true, &n))
> -		return;
> -
> -	for (;;) {
> -		if (hlist_unhashed(&n.link))
> -			break;
> -		/*
> -		 * No point in doing anything about RCU here. Any RCU read
> -		 * side critical section or RCU watching section can be
> -		 * interrupted by VMEXITs and the host is free to keep the
> -		 * vCPU scheduled out as long as it sees fit. This is not
> -		 * any different just because of the halt induced voluntary
> -		 * VMEXIT.
> -		 *
> -		 * Also the async page fault could have interrupted any RCU
> -		 * watching context, so invoking rcu_irq_exit()/enter()
> -		 * around this is not gaining anything.
> -		 */
> -		native_safe_halt();
> -		local_irq_disable();
> -	}
> -}
> -
> -/* Invoked from the async page fault handler */
> -static void kvm_async_pf_task_wait(u32 token, bool usermode)
> -{
> -	bool can_schedule;
> -
> -	/*
> -	 * No need to check whether interrupts were disabled because the
> -	 * host will (hopefully) only inject an async page fault into
> -	 * interrupt enabled regions.
> -	 *
> -	 * If CONFIG_PREEMPTION is enabled then check whether the code
> -	 * which triggered the page fault is preemptible. This covers user
> -	 * mode as well because preempt_count() is obviously 0 there.
> -	 *
> -	 * The check for rcu_preempt_depth() is also required because
> -	 * voluntary scheduling inside a rcu read locked section is not
> -	 * allowed.
> -	 *
> -	 * The idle task is already covered by this because idle always
> -	 * has a preempt count > 0.
> -	 *
> -	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> -	 * coming from user mode as there is no indication whether the
> -	 * context which triggered the page fault could schedule or not.
> -	 */
> -	if (IS_ENABLED(CONFIG_PREEMPTION))
> -		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> -	else
> -		can_schedule = usermode;
> -
> -	/*
> -	 * If the kernel context is allowed to schedule then RCU is
> -	 * watching because no preemptible code in the kernel is inside RCU
> -	 * idle state. So it can be treated like user mode. User mode is
> -	 * safe because the #PF entry invoked enter_from_user_mode().
> -	 *
> -	 * For the non schedulable case invoke rcu_irq_enter() for
> -	 * now. This will be moved out to the pagefault entry code later
> -	 * and only invoked when really needed.
> -	 */
> -	if (can_schedule) {
> -		kvm_async_pf_task_wait_schedule(token);
> -	} else {
> -		rcu_irq_enter();
> -		kvm_async_pf_task_wait_halt(token);
> -		rcu_irq_exit();
> -	}
> -}
> -
>  static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>  {
>  	hlist_del_init(&n->link);
> -	if (n->use_halt)
> -		smp_send_reschedule(n->cpu);
> -	else if (swq_has_sleeper(&n->wq))
> +	if (swq_has_sleeper(&n->wq))
>  		swake_up_one(&n->wq);
>  }
>
> @@ -337,8 +253,10 @@ bool __kvm_handle_async_pf(struct pt_reg
>  		panic("Host injected async #PF in interrupt disabled region\n");
>
>  	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
> -		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, user_mode(regs));
> +		if (unlikely(!(user_mode(regs))))
> +			panic("Host injected async #PF in kernel mode\n");
> +		/* Page is swapped out by the host. */
> +		kvm_async_pf_task_wait_schedule(token);
>  	} else {
>  		rcu_irq_enter();
>  		kvm_async_pf_task_wake(token);
>  		rcu_irq_exit();
>  	}
> @@ -397,10 +315,6 @@ static void kvm_guest_cpu_init(void)
>  		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
>
>  		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
> -
> -#ifdef CONFIG_PREEMPTION
> -		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> -#endif
>  		pa |= KVM_ASYNC_PF_ENABLED;
>
>  		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
>

Acked-by: Paolo Bonzini
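
To spell out the net effect for anyone skimming rather than applying the
diff: with KVM_ASYNC_PF_SEND_ALWAYS no longer set in the MSR_KVM_ASYNC_PF_EN
value, "page not present" notifications should only ever arrive while the
vCPU is in user mode, so the schedule()-based wait is always legal and a
kernel-mode injection is treated as a host bug. Roughly, as a condensed
paraphrase of the hunks above (illustrative sketch only; the wrapper name
and signature are hypothetical, the real logic lives in
__kvm_handle_async_pf() whose full context is truncated in the hunk, and
locking plus the sleeper hash table are omitted):

    /* Sketch: guest-side async #PF handling after this patch. */
    static bool async_pf_sketch(struct pt_regs *regs, u32 token, u32 reason)
    {
            if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
                    /* Page was swapped out; only user mode may see this now. */
                    if (unlikely(!user_mode(regs)))
                            panic("Host injected async #PF in kernel mode\n");
                    /* Sleep until the host reports the page as present again. */
                    kvm_async_pf_task_wait_schedule(token);
            } else {
                    /* "Page ready" notification: wake the task sleeping on token. */
                    rcu_irq_enter();
                    kvm_async_pf_task_wake(token);
                    rcu_irq_exit();
            }
            return true;
    }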