Date: Mon, 22 Mar 2021 20:15:40 +0100
From: Borislav Petkov
To: Sean Christopherson
Cc: Kai
    Huang, kvm@vger.kernel.org, x86@kernel.org, linux-sgx@vger.kernel.org,
    linux-kernel@vger.kernel.org, jarkko@kernel.org, luto@kernel.org,
    dave.hansen@intel.com, rick.p.edgecombe@intel.com, haitao.huang@intel.com,
    pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com
Subject: Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
Message-ID: <20210322191540.GH6481@zn.tnic>
References: <062acb801926b2ade2f9fe1672afb7113453a741.1616136308.git.kai.huang@intel.com>
    <20210322181646.GG6481@zn.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> Not necessarily. This can only trigger in the host, and thus require a host
> reboot, if the host is also running enclaves. If the CSP is not running
> enclaves, or is running its enclaves in a separate VM, then this path cannot be
> reached.

That's what I meant. Rebooting guests is a lot easier, ofc.

Or are you saying, this can trigger *only* when they're running enclaves
on the *host* too?

> EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> running as a guest).

We get those on a daily basis.

> IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> failure was also quite fatal to SGX, i.e. this is just the canary in
> the coal mine.
>
> It's certainly possible to add more sophisticated error handling, e.g. throw
> the pages onto a list and periodically try to recover them. But, since the vast
> majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> sophisticated handling is quite low on the list of priorities.
>
> Dave wanted the "page leaked" error message so that it's abundantly clear that
> the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".

So this sounds to me like this should BUG too eventually.
Or is this one of those "this should never happen" things so no one should
worry?

Whatever it is, if an admin sees this message in dmesg and doesn't get a
lengthy explanation what she/he is supposed to do, I don't think she/he
will be as relaxed.

Hell, people open bugs for correctable ECCs and are asking whether they
need to replace their hardware.

So let's play this out: put yourself in an admin's shoes and tell me how
should an admin react when she/he sees that?

Should the kernel probably also say: "Don't worry, you have enough memory
and what's a 4K, who cares? You'll reboot eventually."

Or should the kernel say "You need to reboot ASAP."

And so on...

So what is the scenario here and what kind of reaction is that message
supposed to cause, recovery action, blabla, the whole spiel?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette