From: Andreas Gruenbacher <agruenba@redhat.com>
Date: Thu, 21 Oct 2021 16:42:33 +0200
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()
To: Catalin Marinas
Cc: Linus Torvalds, Al Viro, Christoph Hellwig, "Darrick J. Wong",
    Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
    Linux Kernel Mailing List, ocfs2-devel@oss.oracle.com,
    Josef Bacik, Will Deacon

Wong" , Jan Kara , Matthew Wilcox , cluster-devel , linux-fsdevel , Linux Kernel Mailing List , "ocfs2-devel@oss.oracle.com" , Josef Bacik , Will Deacon Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas wrote: > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote: > > On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds > > wrote: > > > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas wrote: > > > > > > > > +#ifdef CONFIG_ARM64_MTE > > > > +#define FAULT_GRANULE_SIZE (16) > > > > +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1)) > > > > > > [...] > > > > > > > If this looks in the right direction, I'll do some proper patches > > > > tomorrow. > > > > > > Looks fine to me. It's going to be quite expensive and bad for caches, though. > > > > > > That said, fault_in_writable() is _supposed_ to all be for the slow > > > path when things go south and the normal path didn't work out, so I > > > think it's fine. > > > > Let me get back to this; I'm actually not convinced that we need to > > worry about sub-page-size fault granules in fault_in_pages_readable or > > fault_in_pages_writeable. > > > > From a filesystem point of view, we can get into trouble when a > > user-space read or write triggers a page fault while we're holding > > filesystem locks, and that page fault ends up calling back into the > > filesystem. To deal with that, we're performing those user-space > > accesses with page faults disabled. > > Yes, this makes sense. > > > When a page fault would occur, we > > get back an error instead, and then we try to fault in the offending > > pages. If a page is resident and we still get a fault trying to access > > it, trying to fault in the same page again isn't going to help and we > > have a true error. > > You can't be sure the second fault is a true error. The unlocked > fault_in_*() may race with some LRU scheme making the pte not accessible > or a write-back making it clean/read-only. copy_to_user() with > pagefault_disabled() fails again but that's a benign fault. The > filesystem should re-attempt the fault-in (gup would correct the pte), > disable page faults and copy_to_user(), potentially in an infinite loop. > If you bail out on the second/third uaccess following a fault_in_*() > call, you may get some unexpected errors (though very rare). Maybe the > filesystems avoid this problem somehow but I couldn't figure it out. Good point, we can indeed only bail out if both the user copy and the fault-in fail. But probing the entire memory range in fault domain granularity in the page fault-in functions still doesn't actually make sense. Those functions really only need to guarantee that we'll be able to make progress eventually. From that point of view, it should be enough to probe the first byte of the requested memory range, so when one of those functions reports that the next N bytes should be accessible, this really means that the first byte surely isn't permanently inaccessible and that the rest is likely accessible. Functions fault_in_readable and fault_in_writeable already work that way, so this only leaves function fault_in_safe_writeable to worry about. > > We're clearly looking at memory at a page > > granularity; faults at a sub-page level don't matter at this level of > > abstraction (but they do show similar error behavior). 
But probing the entire memory range in fault granule granularity in
the page fault-in functions still doesn't actually make sense. Those
functions only need to guarantee that we'll eventually be able to make
progress. From that point of view, it should be enough to probe the
first byte of the requested memory range: when one of those functions
reports that the next N bytes should be accessible, this really means
that the first byte surely isn't permanently inaccessible and that the
rest is likely accessible. The functions fault_in_readable and
fault_in_writeable already work that way, so this only leaves
fault_in_safe_writeable to worry about.
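
For reference, the probing those two functions do has roughly the
following shape (a simplified sketch under a hypothetical name, not
the kernel implementation; writing a zero byte is tolerable in this
context because callers are about to overwrite the buffer anyway):

/*
 * Probe the first byte exactly, then one byte per page.  Returns the
 * number of bytes not faulted in (0 on full success).
 */
size_t probe_writeable(char __user *uaddr, size_t size)
{
	char __user *cur = uaddr;
	char __user *end = uaddr + size;

	while (cur < end) {
		if (put_user(0, cur))
			return end - cur;  /* cur onwards inaccessible */
		/* continue at the start of the next page */
		cur = (char __user *)PAGE_ALIGN((unsigned long)cur + 1);
	}
	return 0;
}

Only the first byte is probed exactly; everything beyond it is checked
at page granularity, and "likely accessible" is all we need from that.
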
> > We're clearly looking at memory at a page granularity; faults at a
> > sub-page level don't matter at this level of abstraction (but they
> > do show similar error behavior). To avoid getting stuck, when it
> > gets a short result or -EFAULT, the filesystem implements the
> > following backoff strategy: first, it tries to fault in a number
> > of pages. When the read or write still doesn't make progress, it
> > scales back and faults in a single page. Finally, when that still
> > doesn't help, it gives up. This strategy is needed for actual page
> > faults, but it also handles sub-page faults appropriately as long
> > as the user-space access functions give sensible results.
>
> As I said above, I think with this approach there's a small chance of
> incorrectly reporting an error when the fault is recoverable. If you
> change it to an infinite loop, you'd run into the sub-page fault
> problem.

Yes, I see now, thanks.
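
For readers following along, that backoff looks roughly like this (a
hypothetical sketch, not actual filesystem code: locked_write() stands
in for the write performed under filesystem locks with page faults
disabled, and the window size is made up):

#define FAULT_IN_WINDOW	(32 * PAGE_SIZE)	/* made-up size */

ssize_t write_with_backoff(struct file *file, const char __user *buf,
			   size_t len, loff_t *pos)
{
	size_t window = min_t(size_t, len, FAULT_IN_WINDOW);
	ssize_t ret;

	for (;;) {
		ret = locked_write(file, buf, len, pos);
		if (ret != -EFAULT)
			return ret;	/* progress, or a real error */
		if (!window)
			return -EFAULT;	/* one page didn't help: give up */

		/* Fault in a number of pages and retry; next time
		 * around, scale back to a single page, then give up. */
		fault_in_readable(buf, window);
		window = window > PAGE_SIZE ? PAGE_SIZE : 0;
	}
}

Bailing out after the single-page attempt is what opens the small
window for spurious errors; turning that final bail-out into another
retry would avoid them, but could spin forever on sub-page faults.
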
> There are some places with such infinite loops: futex_wake_op(),
> search_ioctl() in the btrfs code. I still have to get my head around
> generic_perform_write() but I think we get away here because it
> faults in the page with a get_user() rather than gup (and
> copy_from_user() is guaranteed to make progress if any bytes can
> still be accessed).
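
For context, the potentially endless pattern in those loops is roughly
the following (paraphrased, not the exact btrfs code;
do_search_and_copy() is a hypothetical stand-in for the
pagefault-disabled copy-out):

/*
 * fault_in_pages_writeable() probes one byte per page, so it can keep
 * succeeding while the copy keeps faulting at sub-page (e.g. MTE tag)
 * granularity -- in which case this loop never terminates.
 */
while (1) {
	if (fault_in_pages_writeable(ubuf, size))
		return -EFAULT;			/* page-granular probe */
	ret = do_search_and_copy(ubuf, size);	/* pagefaults disabled */
	if (ret != -EFAULT)
		break;				/* done, or a real error */
	/* -EFAULT: retry; spins forever if the fault is sub-page */
}
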
Thanks,
Andreas