From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3E698C61DA4 for ; Thu, 2 Feb 2023 22:56:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References: Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=Xo0FBVaLNjScZ4ESRGU0zwxj1oMMzGCwSRVYXdBFEDk=; b=lOhca1X+T+F0Wk 1uHY6IXVR6Gg6SxUyS2NGbNSZ1Z8xWLQUsN0VxWddR1Xz6s50k6q/pIymTw8kefY05bCoGyS9FQqb DCvZN0jm1oPO5OWNfSp9+2O3Ll9PyP1l4kZTV2vX+0jSRM7hQNtSjLxjMsKIaWSsFosA0JKW8HrTm Ajvs81oVSB811XhCIOd7ehJo5y73yXIeR0+/ZiyiwY8hUo36S1zTWydHTrlN583e2QZQaLQdoPN2B mRQ+3+HSyhShiGODxM6hgYiBOBquU+xmg7vCIqpojHBKu2fc/x0l2DMnI0N9cA35aXfRAp48GiSWM ztIm7jSoSjaHzAkcWhDQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1pNiVO-00HMY7-4V; Thu, 02 Feb 2023 22:56:34 +0000 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1pNiVK-00HMXY-3k for linux-riscv@lists.infradead.org; Thu, 02 Feb 2023 22:56:32 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1675378587; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=XMJSsnUmUwDCBuOmcITqEg5FKDzJKTZTwWrtPHX2Mzk=; b=GmToqAq7xdqxUKlEy3kXF8evXkmP1iurk8S1T+DCQWAbIVEdQSgk57dB6nrzaQntXbisC6 DLJGEcvofaScDnMO0xlN+8IQLkwD6UL+eLp13Q3UVKnKRf8VJaw58Irne6FpLXZNSIozgc RZpuZcsRu1+28/ijnZTEh23zpW/DQbA= Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com [209.85.160.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-387-K26x3XdfO4WSwz0702EIZg-1; Thu, 02 Feb 2023 17:56:25 -0500 X-MC-Unique: K26x3XdfO4WSwz0702EIZg-1 Received: by mail-qt1-f200.google.com with SMTP id m7-20020ac807c7000000b003b80b35c136so1730889qth.5 for ; Thu, 02 Feb 2023 14:56:25 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=XMJSsnUmUwDCBuOmcITqEg5FKDzJKTZTwWrtPHX2Mzk=; b=jUDUZx+oHg0+bF1Y9lzHqPSJC7H/5yTYBd4npppw9Vc8Ucc2UJyqCsl/m8OFTCj7sL KkoM17rADedhvg5jfDKmcWszjFVd6WzlzvxoA35ZonikLLUrpYKI/ZcT9cxr5u4dl/GK UQV4zMe0ehw64seUC6jzwIuGkY75FD9q7pP6NCcpyPtudHbcyQe1U4GSdcDmsSy2nox5 TkTf6+5p33pLFFQcokKBxatFuCZHq5TIP6aO4UszFum5pougiEU2JrWmPXM0XpN+nbzK A+UklhqOZ6jd52OcoBwRfaeB5PwYKbAF7JwLC8qvXfLq4GbHM3ko+lgt0+SJCRaPogih wuRg== X-Gm-Message-State: AO0yUKUtkZiawV3y8dI0I4oAhtoKTS+/O02jJ3rPxZgkj51u84seI1hA CILcKTEVDbnawUpdB9q1GQTD4FEPP+79WAuKnEGCZd6LiT0J4G5KL29Gj8x5CEPywjpEzJhCPI6 FfqzLYnW/GuPQW+HY8/aDFkVUHIWf X-Received: by 2002:a05:622a:1e10:b0:3b8:68df:fc72 with SMTP id br16-20020a05622a1e1000b003b868dffc72mr12702100qtb.2.1675378585236; Thu, 02 Feb 2023 14:56:25 -0800 (PST) X-Google-Smtp-Source: AK7set98ufe9BnObmCyLO5vL17bxJ3WNbuFwrxLR6rxAVU1vknl5Zn7Dx1QXwCn46sOWBGmNV32apQ== X-Received: by 2002:a05:622a:1e10:b0:3b8:68df:fc72 with SMTP id br16-20020a05622a1e1000b003b868dffc72mr12702061qtb.2.1675378584842; Thu, 02 Feb 2023 14:56:24 -0800 (PST) Received: from x1n (bras-base-aurron9127w-grc-56-70-30-145-63.dsl.bell.ca. [70.30.145.63]) by smtp.gmail.com with ESMTPSA id j9-20020a05620a410900b00729b7d71ac7sm661766qko.33.2023.02.02.14.56.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 02 Feb 2023 14:56:24 -0800 (PST) Date: Thu, 2 Feb 2023 17:56:22 -0500 From: Peter Xu To: Al Viro Cc: Linus Torvalds , linux-arch@vger.kernel.org, linux-alpha@vger.kernel.org, linux-ia64@vger.kernel.org, linux-hexagon@vger.kernel.org, linux-m68k@lists.linux-m68k.org, Michal Simek , Dinh Nguyen , openrisc@lists.librecores.org, linux-parisc@vger.kernel.org, linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org Subject: Re: [RFC][PATCHSET] VM_FAULT_RETRY fixes Message-ID: References: MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Disposition: inline X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230202_145630_271663_90D0F882 X-CRM114-Status: GOOD ( 53.05 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org On Wed, Feb 01, 2023 at 10:18:11PM +0000, Al Viro wrote: > On Wed, Feb 01, 2023 at 02:48:22PM -0500, Peter Xu wrote: > > > I do also see a common pattern of the possibility to have a generic fault > > handler like generic_page_fault(). > > > > It probably should start with taking the mmap_sem until providing some > > retval that is much easier to digest further by the arch-dependent code, so > > it can directly do something rather than parsing the bitmask in a > > duplicated way (hence the new retval should hopefully not a bitmask anymore > > but a "what to do"). > > > > Maybe it can be something like: > > > > /** > > * enum page_fault_retval - Higher level fault retval, generalized from > > * vm_fault_reason above that is only used by hardware page fault handlers. > > * It generalizes the bitmask-versioned retval into something that the arch > > * dependent code should react upon. > > * > > * @PF_RET_COMPLETED: The page fault is completed successfully > > * @PF_RET_BAD_AREA: The page fault address falls in a bad area > > * (e.g., vma not found, expand_stack() fails..) > > FWIW, there's a fun discrepancy - VM_FAULT_SIGSEGV may yield SEGV_MAPERR > or SEGV_ACCERR; depends upon the architecture. Not that there'd been > many places that return VM_FAULT_SIGSEGV these days... Good thing, too, > since otherwise e.g. csky would oops... > > > * @PF_RET_ACCESS_ERR: The page fault has access errors > > * (e.g., write fault on !VM_WRITE vmas) > > * @PF_RET_KERN_FIXUP: The page fault requires kernel fixups > > * (e.g., during copy_to_user() but fault failed?) > > * @PF_RET_HWPOISON: The page fault encountered poisoned pages > > * @PF_RET_SIGNAL: The page fault encountered poisoned pages > > ?? > > > * ... > > */ > > enum page_fault_retval { > > PF_RET_DONE = 0, > > PF_RET_BAD_AREA, > > PF_RET_ACCESS_ERR, > > PF_RET_KERN_FIXUP, > > PF_RET_HWPOISON, > > PF_RET_SIGNAL, > > ... > > }; > > > > As a start we may still want to return some more information (perhaps still > > the vm_fault_t alongside? Or another union that will provide different > > information based on different PF_RET_*). One major thing is I see how we > > handle VM_FAULT_HWPOISON and also the fact that we encode something more > > into the bitmask on page sizes (VM_FAULT_HINDEX_MASK). > > > > So the generic helper could, hopefully, hide the complexity of: > > > > - Taking and releasing of mmap lock > > - find_vma(), and also relevant checks on access or stack handling > > Umm... arm is a bit special here: > if (addr < FIRST_USER_ADDRESS) > return VM_FAULT_BADMAP; > with no counterparts elsewhere. For this specific case IIUC it's the same as bad_area. VM_FAULT_BADMAP is further handled later in do_page_fault() there for either arm/arm64. This reminded me this, on how arm defines the private retvals, while the generic ones grows and probably no one noticed they can collapse already.. #define VM_FAULT_BADMAP 0x010000 #define VM_FAULT_BADACCESS 0x020000 enum vm_fault_reason { ... VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000, }; VM_FAULT_HINDEX_MASK is only used by VM_FAULT_HWPOISON_LARGE, so I think arm[64] could expect some surprise when it hit hugetlb hwpoison pages... maybe I should prepare a patch for arm. > > > - handle_mm_fault() itself (of course...) > > - detect signals > > - handle page fault retries (so, in the new layer of retval there should > > have nothing telling it to retry; it should always be the ultimate result) > > agreed. > > - unlock mmap; don't leave that to caller. > > > - parse different errors into "what the arch code should do", and > > generalize the common ones, e.g. > > - OOM, do pagefault_out_of_memory() for user-mode > > - VM_FAULT_SIGSEGV, which should be able to merge into PF_RET_BAD_AREA? > > - ... > > AFAICS, all errors in kernel mode => no_context. > > > It'll simplify things if we can unify some small details like whether the > > -EFAULT above should contain a sigbus. > > > > A trivial detail I found when I was looking at this is, x86_64 passes in > > different signals to kernelmode_fixup_or_oops() - in do_user_addr_fault() > > there're three call sites and each of them pass over a differerent signal. > > IIUC that will only make a difference if there's a nested page fault during > > the vsyscall emulation (but I may be wrong too because I'm new to this > > code), and I have no idea when it'll happen and whether that needs to be > > strictly followed. > > From my (very incomplete so far) dig through that pile: > Q: do we still have the cases when handle_mm_fault() does > not return any of VM_FAULT_COMPLETED | VM_FAULT_RETRY | VM_FAULT_ERROR? > That gets treated as unlock + VM_FAULT_COMPLETED, but do we still need > that? > Q: can VM_FAULT_RETRY be mixed with anything in VM_FAULT_ERROR? > What locking, if that happens? For this one, I don't think they can be mixed. IMHO RETRY only binds with a wait, so if we didn't wait and found issue, we return ERROR; if we decided to wait, we will try nothing more besides return after wait with the RETRY. We should just never check any error at all if the wait happened. Otherwise there's a bug of potential deadlock. I'll skip some details in this email either above or below; I agree there're so many trivial details to take care of to not break a thing. IMHO it'll be merely impossible to merge things across most (if not to say, all) archs. It will need to be start from one or at least a few that still shares a major common base - I would still rely on x86 as a start - then we try to use the helper in as much archs as possible. Even on x86, I do also see challenges so I'm not sure whether a common enough routine can be abstracted indeed. But I believe there's a way to do this because obviously we still see tons of duplicated logics falling around. It may definitely need time to think out where's the best spot to start, and how to gradually move towards covering more archs starting from one. Thanks, > * details of storing the fault details (for ptrace, mostly) > vary a lot; no chance to unify, AFAICS. > * requirements for vma flags also differ; e.g. read fault on > alpha is explicitly OK with absence of VM_READ if VM_WRITE is there. > Probably should go by way of arm and pass the mask that must > have non-empty intersection with vma->vm_flags? Because *that* > is very likely to be a part of ABI - mmap(2) callers that rely > upon the flags being OK for given architecture are quite possible. > * mmap lock is also quite variable in how it's taken; > x86 and arm have fun dance with trylock/search for exception handler/etc. > Other architectures do not; OTOH, there's a prefetch stuck in itanic > variant, with comment about mmap_sem being performance-critical... > * logics for stack expansion includes this twist: > if (!(vma->vm_flags & VM_GROWSDOWN)) > goto map_err; > if (user_mode(regs)) { > /* Accessing the stack below usp is always a bug. The > "+ 256" is there due to some instructions doing > pre-decrement on the stack and that doesn't show up > until later. */ > if (address + 256 < rdusp()) > goto map_err; > } > if (expand_stack(vma, address)) > goto map_err; > That's m68k; ISTR similar considerations elsewhere, but I could be > wrong. > -- Peter Xu _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 127F2C05027 for ; Thu, 2 Feb 2023 22:57:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232051AbjBBW5Q (ORCPT ); Thu, 2 Feb 2023 17:57:16 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60324 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231868AbjBBW5P (ORCPT ); Thu, 2 Feb 2023 17:57:15 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2525C2E0C1 for ; Thu, 2 Feb 2023 14:56:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1675378586; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=XMJSsnUmUwDCBuOmcITqEg5FKDzJKTZTwWrtPHX2Mzk=; b=Qrhu3FdSyOTGjOrIiNtSjcuk6EDxhTBrCAwAs61q5aigpYh4dY+r6TKvSHR1dXl4ksz5zg r848RHlrcszkcifWfgewZj8OxoPXW0/xg2NZy/FNmfhWqk5pc36Gn1KchnPBEnL7nmZzvC Hx1qQ3Nll8YDYp12OhgK2ooLYPP/QWQ= Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-368-BTfkq5nGNwGrcdrq3-pvyw-1; Thu, 02 Feb 2023 17:56:25 -0500 X-MC-Unique: BTfkq5nGNwGrcdrq3-pvyw-1 Received: by mail-qk1-f198.google.com with SMTP id u10-20020a05620a0c4a00b00705e77d6207so2219068qki.5 for ; Thu, 02 Feb 2023 14:56:25 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=XMJSsnUmUwDCBuOmcITqEg5FKDzJKTZTwWrtPHX2Mzk=; b=kvtyjlgIQ7Po3IS8Osc+g0nR1P20xslmaFHtNOvAKt4k7Ob2xvfkDkKlNDmaH8fACX 1wJt1oOt/w/a30k99fURGh2bTH6Z/dCchtA5sFd9lMaiysmaYbCFNS8zZh2GiFbaFPkx IvzkbfjJgT4j8XTMhyGpM/gB8M9y96q7hPC+AqU1x4WgSPq+3l73xgMs3T8SxFdNiFW0 lnr/lsTo+7LZmCu0lBUuv0lUthvq37XK2ePWHDBffc9TtM5DWR6BYD+fB3uIDODXwU// T3W4Oe8q/ipyAwvIPkSLUalkiXb7MWPiok+Sxi/H19mFMPlwC7PwRZBFwV9iNCQ5v0A8 myMA== X-Gm-Message-State: AO0yUKXimzHW+XXMl67BPiaEN/y6n69GUjKqIEtw96nV5XGS6NL5O7z+ 8z2jCmnVT8iWoRI/11lE4HkzDxUohRBCEUHwyqliznlAHDNjTL1sniYkgWFzNSMJoNZH7Aa7dQV EfqwOlraKZq7hIhIQYmx9hEgi X-Received: by 2002:a05:622a:1e10:b0:3b8:68df:fc72 with SMTP id br16-20020a05622a1e1000b003b868dffc72mr12702095qtb.2.1675378585235; Thu, 02 Feb 2023 14:56:25 -0800 (PST) X-Google-Smtp-Source: AK7set98ufe9BnObmCyLO5vL17bxJ3WNbuFwrxLR6rxAVU1vknl5Zn7Dx1QXwCn46sOWBGmNV32apQ== X-Received: by 2002:a05:622a:1e10:b0:3b8:68df:fc72 with SMTP id br16-20020a05622a1e1000b003b868dffc72mr12702061qtb.2.1675378584842; Thu, 02 Feb 2023 14:56:24 -0800 (PST) Received: from x1n (bras-base-aurron9127w-grc-56-70-30-145-63.dsl.bell.ca. [70.30.145.63]) by smtp.gmail.com with ESMTPSA id j9-20020a05620a410900b00729b7d71ac7sm661766qko.33.2023.02.02.14.56.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 02 Feb 2023 14:56:24 -0800 (PST) Date: Thu, 2 Feb 2023 17:56:22 -0500 From: Peter Xu To: Al Viro Cc: Linus Torvalds , linux-arch@vger.kernel.org, linux-alpha@vger.kernel.org, linux-ia64@vger.kernel.org, linux-hexagon@vger.kernel.org, linux-m68k@lists.linux-m68k.org, Michal Simek , Dinh Nguyen , openrisc@lists.librecores.org, linux-parisc@vger.kernel.org, linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org Subject: Re: [RFC][PATCHSET] VM_FAULT_RETRY fixes Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-parisc@vger.kernel.org On Wed, Feb 01, 2023 at 10:18:11PM +0000, Al Viro wrote: > On Wed, Feb 01, 2023 at 02:48:22PM -0500, Peter Xu wrote: > > > I do also see a common pattern of the possibility to have a generic fault > > handler like generic_page_fault(). > > > > It probably should start with taking the mmap_sem until providing some > > retval that is much easier to digest further by the arch-dependent code, so > > it can directly do something rather than parsing the bitmask in a > > duplicated way (hence the new retval should hopefully not a bitmask anymore > > but a "what to do"). > > > > Maybe it can be something like: > > > > /** > > * enum page_fault_retval - Higher level fault retval, generalized from > > * vm_fault_reason above that is only used by hardware page fault handlers. > > * It generalizes the bitmask-versioned retval into something that the arch > > * dependent code should react upon. > > * > > * @PF_RET_COMPLETED: The page fault is completed successfully > > * @PF_RET_BAD_AREA: The page fault address falls in a bad area > > * (e.g., vma not found, expand_stack() fails..) > > FWIW, there's a fun discrepancy - VM_FAULT_SIGSEGV may yield SEGV_MAPERR > or SEGV_ACCERR; depends upon the architecture. Not that there'd been > many places that return VM_FAULT_SIGSEGV these days... Good thing, too, > since otherwise e.g. csky would oops... > > > * @PF_RET_ACCESS_ERR: The page fault has access errors > > * (e.g., write fault on !VM_WRITE vmas) > > * @PF_RET_KERN_FIXUP: The page fault requires kernel fixups > > * (e.g., during copy_to_user() but fault failed?) > > * @PF_RET_HWPOISON: The page fault encountered poisoned pages > > * @PF_RET_SIGNAL: The page fault encountered poisoned pages > > ?? > > > * ... > > */ > > enum page_fault_retval { > > PF_RET_DONE = 0, > > PF_RET_BAD_AREA, > > PF_RET_ACCESS_ERR, > > PF_RET_KERN_FIXUP, > > PF_RET_HWPOISON, > > PF_RET_SIGNAL, > > ... > > }; > > > > As a start we may still want to return some more information (perhaps still > > the vm_fault_t alongside? Or another union that will provide different > > information based on different PF_RET_*). One major thing is I see how we > > handle VM_FAULT_HWPOISON and also the fact that we encode something more > > into the bitmask on page sizes (VM_FAULT_HINDEX_MASK). > > > > So the generic helper could, hopefully, hide the complexity of: > > > > - Taking and releasing of mmap lock > > - find_vma(), and also relevant checks on access or stack handling > > Umm... arm is a bit special here: > if (addr < FIRST_USER_ADDRESS) > return VM_FAULT_BADMAP; > with no counterparts elsewhere. For this specific case IIUC it's the same as bad_area. VM_FAULT_BADMAP is further handled later in do_page_fault() there for either arm/arm64. This reminded me this, on how arm defines the private retvals, while the generic ones grows and probably no one noticed they can collapse already.. #define VM_FAULT_BADMAP 0x010000 #define VM_FAULT_BADACCESS 0x020000 enum vm_fault_reason { ... VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000, }; VM_FAULT_HINDEX_MASK is only used by VM_FAULT_HWPOISON_LARGE, so I think arm[64] could expect some surprise when it hit hugetlb hwpoison pages... maybe I should prepare a patch for arm. > > > - handle_mm_fault() itself (of course...) > > - detect signals > > - handle page fault retries (so, in the new layer of retval there should > > have nothing telling it to retry; it should always be the ultimate result) > > agreed. > > - unlock mmap; don't leave that to caller. > > > - parse different errors into "what the arch code should do", and > > generalize the common ones, e.g. > > - OOM, do pagefault_out_of_memory() for user-mode > > - VM_FAULT_SIGSEGV, which should be able to merge into PF_RET_BAD_AREA? > > - ... > > AFAICS, all errors in kernel mode => no_context. > > > It'll simplify things if we can unify some small details like whether the > > -EFAULT above should contain a sigbus. > > > > A trivial detail I found when I was looking at this is, x86_64 passes in > > different signals to kernelmode_fixup_or_oops() - in do_user_addr_fault() > > there're three call sites and each of them pass over a differerent signal. > > IIUC that will only make a difference if there's a nested page fault during > > the vsyscall emulation (but I may be wrong too because I'm new to this > > code), and I have no idea when it'll happen and whether that needs to be > > strictly followed. > > From my (very incomplete so far) dig through that pile: > Q: do we still have the cases when handle_mm_fault() does > not return any of VM_FAULT_COMPLETED | VM_FAULT_RETRY | VM_FAULT_ERROR? > That gets treated as unlock + VM_FAULT_COMPLETED, but do we still need > that? > Q: can VM_FAULT_RETRY be mixed with anything in VM_FAULT_ERROR? > What locking, if that happens? For this one, I don't think they can be mixed. IMHO RETRY only binds with a wait, so if we didn't wait and found issue, we return ERROR; if we decided to wait, we will try nothing more besides return after wait with the RETRY. We should just never check any error at all if the wait happened. Otherwise there's a bug of potential deadlock. I'll skip some details in this email either above or below; I agree there're so many trivial details to take care of to not break a thing. IMHO it'll be merely impossible to merge things across most (if not to say, all) archs. It will need to be start from one or at least a few that still shares a major common base - I would still rely on x86 as a start - then we try to use the helper in as much archs as possible. Even on x86, I do also see challenges so I'm not sure whether a common enough routine can be abstracted indeed. But I believe there's a way to do this because obviously we still see tons of duplicated logics falling around. It may definitely need time to think out where's the best spot to start, and how to gradually move towards covering more archs starting from one. Thanks, > * details of storing the fault details (for ptrace, mostly) > vary a lot; no chance to unify, AFAICS. > * requirements for vma flags also differ; e.g. read fault on > alpha is explicitly OK with absence of VM_READ if VM_WRITE is there. > Probably should go by way of arm and pass the mask that must > have non-empty intersection with vma->vm_flags? Because *that* > is very likely to be a part of ABI - mmap(2) callers that rely > upon the flags being OK for given architecture are quite possible. > * mmap lock is also quite variable in how it's taken; > x86 and arm have fun dance with trylock/search for exception handler/etc. > Other architectures do not; OTOH, there's a prefetch stuck in itanic > variant, with comment about mmap_sem being performance-critical... > * logics for stack expansion includes this twist: > if (!(vma->vm_flags & VM_GROWSDOWN)) > goto map_err; > if (user_mode(regs)) { > /* Accessing the stack below usp is always a bug. The > "+ 256" is there due to some instructions doing > pre-decrement on the stack and that doesn't show up > until later. */ > if (address + 256 < rdusp()) > goto map_err; > } > if (expand_stack(vma, address)) > goto map_err; > That's m68k; ISTR similar considerations elsewhere, but I could be > wrong. > -- Peter Xu From mboxrd@z Thu Jan 1 00:00:00 1970 From: Peter Xu Date: Thu, 02 Feb 2023 22:56:22 +0000 Subject: Re: [RFC][PATCHSET] VM_FAULT_RETRY fixes Message-Id: List-Id: References: In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Al Viro Cc: Linus Torvalds , linux-arch@vger.kernel.org, linux-alpha@vger.kernel.org, linux-ia64@vger.kernel.org, linux-hexagon@vger.kernel.org, linux-m68k@lists.linux-m68k.org, Michal Simek , Dinh Nguyen , openrisc@lists.librecores.org, linux-parisc@vger.kernel.org, linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org On Wed, Feb 01, 2023 at 10:18:11PM +0000, Al Viro wrote: > On Wed, Feb 01, 2023 at 02:48:22PM -0500, Peter Xu wrote: > > > I do also see a common pattern of the possibility to have a generic fault > > handler like generic_page_fault(). > > > > It probably should start with taking the mmap_sem until providing some > > retval that is much easier to digest further by the arch-dependent code, so > > it can directly do something rather than parsing the bitmask in a > > duplicated way (hence the new retval should hopefully not a bitmask anymore > > but a "what to do"). > > > > Maybe it can be something like: > > > > /** > > * enum page_fault_retval - Higher level fault retval, generalized from > > * vm_fault_reason above that is only used by hardware page fault handlers. > > * It generalizes the bitmask-versioned retval into something that the arch > > * dependent code should react upon. > > * > > * @PF_RET_COMPLETED: The page fault is completed successfully > > * @PF_RET_BAD_AREA: The page fault address falls in a bad area > > * (e.g., vma not found, expand_stack() fails..) > > FWIW, there's a fun discrepancy - VM_FAULT_SIGSEGV may yield SEGV_MAPERR > or SEGV_ACCERR; depends upon the architecture. Not that there'd been > many places that return VM_FAULT_SIGSEGV these days... Good thing, too, > since otherwise e.g. csky would oops... > > > * @PF_RET_ACCESS_ERR: The page fault has access errors > > * (e.g., write fault on !VM_WRITE vmas) > > * @PF_RET_KERN_FIXUP: The page fault requires kernel fixups > > * (e.g., during copy_to_user() but fault failed?) > > * @PF_RET_HWPOISON: The page fault encountered poisoned pages > > * @PF_RET_SIGNAL: The page fault encountered poisoned pages > > ?? > > > * ... > > */ > > enum page_fault_retval { > > PF_RET_DONE = 0, > > PF_RET_BAD_AREA, > > PF_RET_ACCESS_ERR, > > PF_RET_KERN_FIXUP, > > PF_RET_HWPOISON, > > PF_RET_SIGNAL, > > ... > > }; > > > > As a start we may still want to return some more information (perhaps still > > the vm_fault_t alongside? Or another union that will provide different > > information based on different PF_RET_*). One major thing is I see how we > > handle VM_FAULT_HWPOISON and also the fact that we encode something more > > into the bitmask on page sizes (VM_FAULT_HINDEX_MASK). > > > > So the generic helper could, hopefully, hide the complexity of: > > > > - Taking and releasing of mmap lock > > - find_vma(), and also relevant checks on access or stack handling > > Umm... arm is a bit special here: > if (addr < FIRST_USER_ADDRESS) > return VM_FAULT_BADMAP; > with no counterparts elsewhere. For this specific case IIUC it's the same as bad_area. VM_FAULT_BADMAP is further handled later in do_page_fault() there for either arm/arm64. This reminded me this, on how arm defines the private retvals, while the generic ones grows and probably no one noticed they can collapse already.. #define VM_FAULT_BADMAP 0x010000 #define VM_FAULT_BADACCESS 0x020000 enum vm_fault_reason { ... VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000, }; VM_FAULT_HINDEX_MASK is only used by VM_FAULT_HWPOISON_LARGE, so I think arm[64] could expect some surprise when it hit hugetlb hwpoison pages... maybe I should prepare a patch for arm. > > > - handle_mm_fault() itself (of course...) > > - detect signals > > - handle page fault retries (so, in the new layer of retval there should > > have nothing telling it to retry; it should always be the ultimate result) > > agreed. > > - unlock mmap; don't leave that to caller. > > > - parse different errors into "what the arch code should do", and > > generalize the common ones, e.g. > > - OOM, do pagefault_out_of_memory() for user-mode > > - VM_FAULT_SIGSEGV, which should be able to merge into PF_RET_BAD_AREA? > > - ... > > AFAICS, all errors in kernel mode => no_context. > > > It'll simplify things if we can unify some small details like whether the > > -EFAULT above should contain a sigbus. > > > > A trivial detail I found when I was looking at this is, x86_64 passes in > > different signals to kernelmode_fixup_or_oops() - in do_user_addr_fault() > > there're three call sites and each of them pass over a differerent signal. > > IIUC that will only make a difference if there's a nested page fault during > > the vsyscall emulation (but I may be wrong too because I'm new to this > > code), and I have no idea when it'll happen and whether that needs to be > > strictly followed. > > From my (very incomplete so far) dig through that pile: > Q: do we still have the cases when handle_mm_fault() does > not return any of VM_FAULT_COMPLETED | VM_FAULT_RETRY | VM_FAULT_ERROR? > That gets treated as unlock + VM_FAULT_COMPLETED, but do we still need > that? > Q: can VM_FAULT_RETRY be mixed with anything in VM_FAULT_ERROR? > What locking, if that happens? For this one, I don't think they can be mixed. IMHO RETRY only binds with a wait, so if we didn't wait and found issue, we return ERROR; if we decided to wait, we will try nothing more besides return after wait with the RETRY. We should just never check any error at all if the wait happened. Otherwise there's a bug of potential deadlock. I'll skip some details in this email either above or below; I agree there're so many trivial details to take care of to not break a thing. IMHO it'll be merely impossible to merge things across most (if not to say, all) archs. It will need to be start from one or at least a few that still shares a major common base - I would still rely on x86 as a start - then we try to use the helper in as much archs as possible. Even on x86, I do also see challenges so I'm not sure whether a common enough routine can be abstracted indeed. But I believe there's a way to do this because obviously we still see tons of duplicated logics falling around. It may definitely need time to think out where's the best spot to start, and how to gradually move towards covering more archs starting from one. Thanks, > * details of storing the fault details (for ptrace, mostly) > vary a lot; no chance to unify, AFAICS. > * requirements for vma flags also differ; e.g. read fault on > alpha is explicitly OK with absence of VM_READ if VM_WRITE is there. > Probably should go by way of arm and pass the mask that must > have non-empty intersection with vma->vm_flags? Because *that* > is very likely to be a part of ABI - mmap(2) callers that rely > upon the flags being OK for given architecture are quite possible. > * mmap lock is also quite variable in how it's taken; > x86 and arm have fun dance with trylock/search for exception handler/etc. > Other architectures do not; OTOH, there's a prefetch stuck in itanic > variant, with comment about mmap_sem being performance-critical... > * logics for stack expansion includes this twist: > if (!(vma->vm_flags & VM_GROWSDOWN)) > goto map_err; > if (user_mode(regs)) { > /* Accessing the stack below usp is always a bug. The > "+ 256" is there due to some instructions doing > pre-decrement on the stack and that doesn't show up > until later. */ > if (address + 256 < rdusp()) > goto map_err; > } > if (expand_stack(vma, address)) > goto map_err; > That's m68k; ISTR similar considerations elsewhere, but I could be > wrong. > -- Peter Xu