Date: Wed, 10 Feb 2021 12:42:52 -0400
From: Jason Gunthorpe
To: Matthew Wilcox
Cc: Laurent Dufour, linux-mm@kvack.org, "Liam R. Howlett", Paul McKenney
Subject: Re: synchronize_rcu in munmap?
Message-ID: <20210210164252.GM4718@ziepe.ca>
References: <20210208132643.GP308988@casper.infradead.org> <20210209142941.GY308988@casper.infradead.org> <17e3b4d0-8a16-75ba-e1c7-b678e4cf2089@linux.ibm.com> <20210209173822.GH4718@ziepe.ca> <20210209195817.GZ308988@casper.infradead.org>
In-Reply-To: <20210209195817.GZ308988@casper.infradead.org>

On Tue, Feb 09, 2021 at 07:58:17PM +0000, Matthew Wilcox wrote:
> > The pagewalk.c needs to call its ops in a sleepable context, otherwise
> > it could just use the normal page table locks.. Not sure RCU could be
> > fit into here?
>
> Depends on the caller of walk_page_*() whether the ops need to sleep
> or not.

We could create a non-sleeping-op version of walk_page that uses the PTL
locks; that would avoid the need for the mmap_sem in places that can
tolerate non-sleeping ops. Page tables can't be freed while the PTL
spinlocks are held. This is certainly easier to reason about than trying
to understand whether the races from having no locks at all are OK.

> The specific problem we're trying to solve here is avoiding
> taking the mmap_sem in /proc/$pid/smaps.

I thought you were trying to remove the mmap_sem around the VMA related
things?
The secondary role the mmap_sem has for the page table itself is
unrelated to the VMA. To be compatible with pagewalk.c's no-PTL-locks
design, anything that calls tlb_finish_mmu() with freed_tables = 1 has
to also hold the write side of the mmap_sem. This ensures that the page
table memory cannot be freed under the read side of the mmap_sem.

If you delete the mmap_sem as a VMA lock we can still keep this idea,
adding a new rwsem lock, something like the below. If someone thinks of
a smarter way to serialize this later then it is at least clearly
documented.

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 03c33c93a582b9..98ee4d0d9416d3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -287,6 +287,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	inc_tlb_flush_pending(tlb->mm);
 }
 
+#define page_table_memory_write_lock(mm) mmap_assert_write_locked(mm)
+#define page_table_memory_write_unlock(mm)
+
 /**
  * tlb_finish_mmu - finish an mmu_gather structure
  * @tlb: the mmu_gather structure to finish
@@ -299,6 +302,16 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
+	if (tlb->freed_tables) {
+		/*
+		 * Page table levels to be freed are now removed from the page
+		 * table itself and on the free list in the mmu_gather. Exclude
+		 * any readers of this memory before we progress to freeing.
+		 */
+		page_table_memory_write_lock(tlb->mm);
+		page_table_memory_write_unlock(tlb->mm);
+	}
+
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
 	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f17706..c190565ee0b404 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -311,6 +311,15 @@ static int walk_page_test(unsigned long start, unsigned long end,
 	return 0;
 }
 
+/*
+ * Write side of the page_table_memory is held across any place that will kfree
+ * a page table level, eg calls to tlb_finish_mmu() where struct mmu_gather
+ * freed_tables = 1. It is a sleepable alternative to the page table spin locks
+ * that allows semi-lockless reading.
+ */
+#define page_table_memory_read_lock(mm) mmap_assert_locked(mm)
+#define page_table_memory_read_unlock(mm)
+
 static int __walk_page_range(unsigned long start, unsigned long end,
 			struct mm_walk *walk)
 {
@@ -324,12 +333,16 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 		return err;
 	}
 
+	page_table_memory_read_lock(walk->mm);
+
 	if (vma && is_vm_hugetlb_page(vma)) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
 		err = walk_pgd_range(start, end, walk);
 
+	page_table_memory_read_unlock(walk->mm);
+
 	if (vma && ops->post_vma)
 		ops->post_vma(walk);