Date: Fri, 30 Oct 2020 20:51:21 -0300
From: Jason Gunthorpe
To: Peter Xu, "Ahmed S. Darwish"
CC: Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V,
	Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard,
	Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM,
	Michal Hocko, Oleg Nesterov
Subject: Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork
Message-ID: <20201030235121.GQ2620339@nvidia.com>
References: <0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com>
	<2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com>
	<20201030225250.GB6357@xz-x1>
In-Reply-To: <20201030225250.GB6357@xz-x1>

On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote:
> Hi, Jason,
>
> I think the patch mostly looks good to me, but I have a few pure questions,
> not directly related to the patch itself, but around its context.
>
> Since I _feel_ like there'll be a new version to update the comments below,
> maybe I can still ask aloud... Please bear with me. :)

No problem

> On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote:
> > Slow GUP is safe against this race because copy_page_range() is only
> > called while holding the exclusive side of the mmap_lock on the src
> > mm_struct.
>
> Pure question: I understand that this patch requires this, but... Could
> anyone remind me why a read lock of mmap_sem was not enough for fork()
> before this one?

I do not know why fork uses the exclusive lock, but the write side of the
seqcount must be exclusive or it doesn't work properly. I was happy to find
we already had the locking to support this.

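To make that concrete, here is a minimal userspace model of the write
protocol (my sketch, not the kernel code; plain pthreads and C11 atomics,
and the function name is made up). The mutex stands in for the exclusive
mmap_lock and shows why only one writer may run: with two concurrent
writers the increments could interleave and leave the counter even while
the page tables were still being rewritten.

/*
 * Userspace sketch only -- not kernel code. The mutex models the
 * exclusive mmap_lock; write_protect_seq models the new mm field.
 */
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint write_protect_seq;	/* even = idle, odd = fork busy */

static void fork_write_side_model(void)	/* hypothetical name */
{
	pthread_mutex_lock(&mmap_lock);		/* exclusive writer */
	atomic_fetch_add_explicit(&write_protect_seq, 1,
				  memory_order_release);	/* -> odd */

	/* ... write-protect the parent's pages for COW here ... */

	atomic_fetch_add_explicit(&write_protect_seq, 1,
				  memory_order_release);	/* -> even */
	pthread_mutex_unlock(&mmap_lock);
}
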
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index 5e5480a0a32d7d..2520f6e05f4d44 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -57,6 +57,7 @@ struct mm_struct efi_mm = {
> >  	.mm_rb			= RB_ROOT,
> >  	.mm_users		= ATOMIC_INIT(2),
> >  	.mm_count		= ATOMIC_INIT(1),
> > +	.write_protect_seq	= SEQCNT_ZERO(efi_mm.write_protect_seq),
> >  	MMAP_LOCK_INITIALIZER(efi_mm)
> >  	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
> >  	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
>
> Another pure question: I'm just curious how you found all the statically
> defined mm_structs, and made sure all of them are covered (just in case an
> un-initialized seqcount could fail strangely).

I searched for all MMAP_LOCK_INITIALIZER() sites and assumed that Michel got
them all when he added it :)

> Actually I'm thinking whether we should have one place to keep all the init
> vars for all the statically defined mm_structs, so we don't need to find
> them every time, but only change that one place.

I was thinking that as well; most of the places are all the same.

> > diff --git a/mm/memory.c b/mm/memory.c
> > index c48f8df6e50268..294c2c3c4fe00d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> >  					0, src_vma, src_mm, addr, end);
> >  		mmu_notifier_invalidate_range_start(&range);
> > +		/*
> > +		 * The read side doesn't spin, it goes to the mmap_lock, so
> > +		 * the raw version is used to avoid disabling preemption here
> > +		 */
> > +		mmap_assert_write_locked(src_mm);
> > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
>
> Would raw_write_seqcount_begin() be better here?

Hum.. I felt no, because it has the preempt stuff added into it. However it
would work: __seqcount_lock_preemptible() == false for the seqcount_t case
(see below).

Looking more closely, maybe the right API to pick is write_seqcount_t_begin()
and write_seqcount_t_end()?

However, I have no idea what the intention of the '*_seqcount_t_*' family is -
it only seems to be used to implement the seqlock. Let's add Ahmed, perhaps
he can give some guidance (see next section)?

> My understanding is that we used raw_write_seqcount_t_begin() because we're
> under a spin lock, so we assume preemption is already disabled.

Here we rely on the exclusive mmap_lock, not a spinlock. This ensures only
one write side is running concurrently, as required by seqcount.

The concern about preemption disable is that it wasn't held for fork()
before, and we don't need it. I understand preemption-disable regions must
be short or the RT people will not be happy, and holding one across all of
copy_page_range() sounds bad.

Ahmed explained in commit 8117ab508f that the seqcount_t write side has
preemption disabled because it can livelock RT kernels if the read side
spins after preempting the write side. Eg look at how
__read_seqcount_begin() is implemented:

	while ((seq = __seqcount_sequence(s)) & 1)			\
		cpu_relax();						\

However, in this patch we don't spin on the read side. If the read side
collides with a writer it immediately goes to the mmap_lock, which is a
sleeping lock, so it will sort itself out properly even if it was preempted.

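For contrast with the spinning loop above, the non-spinning read side can be
modelled like this (again my userspace sketch with made-up names, not the
kernel's gup_fast): on a collision the reader never cpu_relax()-spins, it
just reports failure so the caller can fall back to the sleeping mmap_lock
and the slow GUP path.

/*
 * Userspace sketch only. write_protect_seq is the same counter as in
 * the writer sketch earlier; the function name is hypothetical.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint write_protect_seq;

static bool gup_fast_model(void)
{
	unsigned int start = atomic_load_explicit(&write_protect_seq,
						  memory_order_acquire);

	if (start & 1)		/* writer active: bail out, don't spin */
		return false;

	/* ... speculative, lockless page table walk would go here ... */

	/* Fail if a writer started (or completed) in the meantime */
	return atomic_load_explicit(&write_protect_seq,
				    memory_order_acquire) == start;
}

/* Caller: if (!gup_fast_model()) take mmap_lock and use slow GUP. */
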
> An even further pure question on __seqcount_preemptible() (feel free to
> ignore this question!): I saw that __seqcount_preemptible() seems to have
> been constantly defined as "return false". Not sure what happened there..

The new code has a range of seqcount_t types; see
Documentation/locking/seqlock.rst 'Sequence counters with associated locks'.

It uses _Generic to do a bit of meta-programming and creates a compile-time
table of lock properties:

SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t,  false,    s->lock,        raw_spin, raw_spin_lock(s->lock))
SEQCOUNT_LOCKNAME(spinlock,     spinlock_t,      __SEQ_RT, s->lock,        spin,     spin_lock(s->lock))
SEQCOUNT_LOCKNAME(rwlock,       rwlock_t,        __SEQ_RT, s->lock,        read,     read_lock(s->lock))
SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     s->lock,        mutex,    mutex_lock(s->lock))
SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL))

As well as a default set of properties for a plain seqcount_t. The
__seqcount_preemptible() for seqcount_t is selected by the _Generic:

#define __seqprop(s, prop) _Generic(*(s),				\
	seqcount_t:		__seqprop_##prop((void *)(s)),		\

And it says preemption must be disabled before using the lock:

static inline void __seqprop_assert(const seqcount_t *s)
{
	lockdep_assert_preemption_disabled();
}

And thus there is no need to automatically disable preemption:

static inline bool __seqprop_preemptible(const seqcount_t *s)
{
	return false;
}

Other lock subtypes are different; eg the codegen for mutex will use
lockdep_assert_held(s->lock) for _assert and true for _preemptible().

So if we map the 'write begin' entry points:

 write_seqcount_begin       - enforces preemption off
 raw_write_seqcount_begin   - auto-disables preemption if required (false
                              for seqcount_t)
 raw_write_seqcount_t_begin - no preemption handling
 write_seqcount_t_begin     - no preemption handling

Jason