From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2389BECAAD3 for ; Thu, 1 Sep 2022 21:54:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233907AbiIAVyA (ORCPT ); Thu, 1 Sep 2022 17:54:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50512 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229892AbiIAVx7 (ORCPT ); Thu, 1 Sep 2022 17:53:59 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 421A279604; Thu, 1 Sep 2022 14:53:58 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id D30C561F92; Thu, 1 Sep 2022 21:53:57 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 90E21C433C1; Thu, 1 Sep 2022 21:53:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1662069237; bh=1CeJuFupFm54R4oRzjbPCSPjJ1OwgvU/hufLKHsYuuo=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=q5I8W5Re0j6x5QS+OM32mN5AXWg9segdPvWl2W9aG7p+XSM307RMBQoUf4PmX0hCM uhBwUKo83LdQJuuDxOFj9Y6612ZD8/kEcZYPsDlPuHccmawFyoc3FTSE1C+Ub+smJA WbRbr3OrUjrcpaWMQd86S0B5aSpN7RdeMjyuKwqWoslJp9ayAtiGkWRV1ft7VMvqfy hMTgUecLH8wpxmRcf479yrUk+D/FertFz3A20MqW2hrkmXyjq2dh1wigdKxzUVuTZx +B6w8aQ7+oZY4u+ClzQdOkHjWvWIIagrnLI4LuFPedqWDrBI92mkG0PYXe7/cCNUKN 5g8RI5RV7Qcmg== Date: Fri, 2 Sep 2022 00:53:52 +0300 From: Jarkko Sakkinen To: Reinette Chatre Cc: linux-sgx@vger.kernel.org, Haitao Huang , Vijay Dhanraj , Dave Hansen , Paul Menzel , stable@vger.kernel.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" , "H. Peter Anvin" , "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)" Subject: Re: [PATCH v2 2/6] x86/sgx: Do not consider unsanitized pages an error Message-ID: References: <20220831173829.126661-1-jarkko@kernel.org> <20220831173829.126661-3-jarkko@kernel.org> <24906e57-461f-6c94-9e78-0d8507df01bb@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <24906e57-461f-6c94-9e78-0d8507df01bb@intel.com> Precedence: bulk List-ID: X-Mailing-List: linux-sgx@vger.kernel.org On Wed, Aug 31, 2022 at 01:39:53PM -0700, Reinette Chatre wrote: > Hi Jarkko, > > On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote: > > In sgx_init(), if misc_register() fails or misc_register() succeeds but > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > prematurely stopped. This may leave some unsanitized pages, which does > > not matter, because SGX will be disabled for the whole power cycle. > > > > This triggers WARN_ON() because sgx_dirty_page_list ends up being > > non-empty, and dumps the call stack: > > > > [ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff > > [ 0.268591] ------------[ cut here ]------------ > > [ 0.268592] WARNING: CPU: 6 PID: 83 at > > arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 > > [ 0.268598] Modules linked in: > > [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 > > [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 > > 07/06/2022 > > [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 > > [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f > > 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff > > ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 > > [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 > > [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: > > 0000000000000000 > > [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: > > 00000000ffffffff > > [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: > > ffff8dcd820a4180 > > [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: > > ffffb6c74006bce0 > > [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: > > 0000000000000000 > > [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) > > knlGS:0000000000000000 > > [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: > > 00000000003706e0 > > [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > 0000000000000000 > > [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > 0000000000000400 > > [ 0.268622] Call Trace: > > [ 0.268624] > > [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 > > [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 > > [ 0.268634] ? __kthread_parkme+0x36/0x90 > > [ 0.268637] kthread+0xe5/0x110 > > [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 > > [ 0.268642] ret_from_fork+0x1f/0x30 > > [ 0.268647] > > [ 0.268648] ---[ end trace 0000000000000000 ]--- > > > > Are you still planning to trim this? > > > Ultimately this can crash the kernel, if the following is set: > > > > /proc/sys/kernel/panic_on_warn > > > > In premature stop, print nothing, as the number is by practical means a > > random number. Otherwise, it is an indicator of a bug in the driver, and > > therefore print the number of unsanitized pages with pr_err(). > > I think that "print the number of unsanitized pages with pr_err()" > contradicts the patch subject of "Do not consider unsanitized pages > an error". > > ... > > > @@ -388,17 +393,40 @@ void sgx_reclaim_direct(void) > > > > static int ksgxd(void *p) > > { > > + long ret; > > + > > set_freezable(); > > > > /* > > * Sanitize pages in order to recover from kexec(). The 2nd pass is > > * required for SECS pages, whose child pages blocked EREMOVE. > > */ > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > + ret = __sgx_sanitize_pages(&sgx_dirty_page_list); > > + if (ret == -ECANCELED) > > + /* kthread stopped */ > > + return 0; > > > > - /* sanity check: */ > > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > > + ret = __sgx_sanitize_pages(&sgx_dirty_page_list); > > + switch (ret) { > > + case 0: > > + /* success, no unsanitized pages */ > > + break; > > + > > + case -ECANCELED: > > + /* kthread stopped */ > > + return 0; > > + > > + default: > > + /* > > + * Never expected to happen in a working driver. If it happens > > + * the bug is expected to be in the sanitization process, but > > + * successfully sanitized pages are still valid and driver can > > + * be used and most importantly debugged without issues. To put > > + * short, the global state of kernel is not corrupted so no > > + * reason to do any more complicated rollback. > > + */ > > + pr_err("%ld unsanitized pages\n", ret); > > + } > > > > while (!kthread_should_stop()) { > > if (try_to_freeze()) > > > I think I am missing something here. A lot of logic is added here but I > do not see why it is necessary. ksgxd() knows via kthread_should_stop() if > the reclaimer was canceled. I am thus wondering, could the above not be > simplified to something similar to V1: > > @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) > > static int ksgxd(void *p) > { > + unsigned long left_dirty; > + > set_freezable(); > > /* > @@ -395,10 +402,10 @@ static int ksgxd(void *p) > * required for SECS pages, whose child pages blocked EREMOVE. > */ > __sgx_sanitize_pages(&sgx_dirty_page_list); IMHO, would make sense also to have here: if (!kthread_should_stop()) return 0; > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > - /* sanity check: */ > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); > + if (left_dirty && !kthread_should_stop()) > + pr_err("%lu unsanitized pages\n", left_dirty); That would be incorrect, if the function returned because of kthread stopped. If you do the check here you already have a window where kthread could have been stopped anyhow. So even this would be less correct: if (kthreas_should_stop()) { return 0; } else if (left_dirty) { pr_err("%lu unsanitized pages\n", left_dirty); } So in the end you end as complicated and less correct fix. This all is explained in the commit message. If you unconditionally print error, you don't have a meaning for the number of unsanitized pags. BR, Jarkko