From: Laurent Dufour
Date: Wed, 19 Feb 2020 18:14:41 +0100
Subject: Re: Splitting the mmap_sem
To: Peter Zijlstra, Matthew Wilcox
Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", linux-mm@kvack.org
In-Reply-To: <20200207085234.GB14914@hirez.programming.kicks-ass.net>

On 07/02/2020 at 09:52, Peter Zijlstra wrote:
> On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote:
>> On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote:
>>> On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
>>>> then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
>>>> walk the tree as we do now, allocate memory for PTEs, sleep waiting for
>>>> pages to arrive back from disc, etc, etc, then at the end of the fault,
>>>> call srcu_read_unlock(&vma_srcu).
>>>
>>> So far so good,...
>>>
>>>> munmap() would consist of removing the
>>>> VMA from the tree, then calling synchronize_srcu() to wait for all faults
>>>> to finish, then putting the backing file, etc, etc and freeing the VMA.
>>>
>>> call_srcu(), and the (s)rcu callback will then fput() and such things
>>> more.
>>>
>>> synchronize_srcu() (like synchronize_rcu()) is stupid slow and would
>>> make munmap()/exit()/etc.. unusable.
>>
>> I'll need to think about that a bit. I was convinced we needed to wait
>> for the current pagefaults to finish before we could return from munmap().
>> I need to convince myself that it's OK to return to userspace while the
>> page faults for that range are still proceeding on other CPUs.
>
> File-io might be in progress, any actual faults will result in SIGFAULT
> instead of installing a PTE.
>
> It is not fundamentally different from any threaded uaf race.
>
>>>> This seems pretty reasonable, and investigation could actually proceed
>>>> before the Maple tree work lands. Today, that would be:
>>>>
>>>>     srcu_read_lock(&vmas_srcu);
>>>>     down_read(&mm->mmap_sem);
>>>>     find_vma(mm, address);
>>>>     up_read(&mm->mmap_sem);
>>>>     ... rest of fault handler path ...
>>>>     srcu_read_unlock(&vmas_srcu);
>>>>
>>>> Kind of a pain because we still call find_vma() in the per-arch page
>>>> fault handler, but for prototyping, we'd only have to do one or two
>>>> architectures.
>>>
>>> If you look at the earlier speculative page-fault patches by Laurent,
>>> which were based on my still earlier patches, you'll find most of this
>>> there.
>>>
>>> The tricky bit was validating everything on the second page-table walk,
>>> so see if nothing had fundamentally changed, specifically the VMA,
>>> before installing the PTE. If you do this without mmap_sem, you need to
>>> hold ptlock to pin stuff while validating everything you did earlier.
>>
>> The patches Laurent posted used regular RCU and a per-VMA refcount, not
>> SRCU.
>
> That are his later patches, and I distinctly disagree with that
> approach.
>
> If you look at the patches here:
>
>   https://lkml.kernel.org/r/cover.1479465699.git.ldufour@linux.vnet.ibm.com
>
> you'll find it uses SRCU.

For the record, I switched from SRCU to RCU and a ref counter because,
with SRCU, performance was hurt by the scheduling needed to handle
SRCU's asynchronous events.

I may have missed something, but using RCU and a ref counter avoided
this 20% overhead.

>> If you use SRCU, why would you need a second page table walk?
>
> Because SRCU only ensures the VMA object remains extant, it does not
> prevent modification of it, normally that guarantee is provided by
> mmap_sem, but we're not going to use that.
>
> Instead, what we serialize on is the (split) ptlock. So we do the first
> page-walk and ptlock to verify the vma-lookup, then we drop ptlock and
> do the file-io, then we page-walk and take ptlock again, verify the vma
> (again) and install the PTE. If anything goes wrong, we bail.
>
> See this patch:
>
>   https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com
>