Date: Mon, 8 Feb 2021 13:26:43 +0000
From: Matthew Wilcox
To: linux-mm@kvack.org
Cc: "Liam R. Howlett", Laurent Dufour, Paul McKenney
Subject: synchronize_rcu in munmap?
Message-ID: <20210208132643.GP308988@casper.infradead.org>

TLDR: I think we're going to need to call synchronize_rcu() in munmap(),
and I know this can be quite expensive.  Is this a reasonable price to
pay for avoiding taking the mmap_sem for page faults?

We're taking the first tentative steps towards moving the page fault
handler from being protected by the mmap_sem to being protected by the
RCU lock.  The first part is replacing the rbtree with the maple tree,
so a walk can be RCU safe.  Liam has taken care of that part.

The next step is to introduce an RCU user that's perhaps less critical
than the page fault handler to see the consequences.  So I chose
/proc/$pid/maps.  This also covers smaps and numa_maps.  It's probably
the simplest user, and taking the mmap_sem in this path actually affects
some important workloads.

Here's the tree:
https://git.infradead.org/users/willy/linux-maple.git/shortlog/refs/heads/proc-vma-rcu

The first of the two patches simply frees VMAs through the RCU
mechanism.  That means we can now walk the tree and read from the VMA,
knowing the VMA will remain a VMA until we drop the RCU lock.  It might
be removed from the tree by a racing munmap(), but the memory will not
be reallocated.

The second patch converts the seq_file business of finding the VMAs
from being protected by the mmap_sem to the rcu_lock.
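
For the first patch, the freeing side is roughly the shape below.  This
is only a sketch: I'm assuming struct vm_area_struct grows a struct
rcu_head (called vm_rcu here), and the names may not match what's
actually in the tree.

static void vm_area_free_rcu(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head,
					struct vm_area_struct, vm_rcu);

	kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
	/*
	 * Defer the real free until every walker that may have picked
	 * this VMA up under rcu_read_lock() has finished with it.
	 */
	call_rcu(&vma->vm_rcu, vm_area_free_rcu);
}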
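
The read side in the second patch then looks something like this (again
a sketch: the maple tree iterator names are from memory, and I've left
out the seq_file cursor handling, so show_all_vmas() is an invented
stand-in for the m_start()/m_next() machinery):

static int show_all_vmas(struct seq_file *m, struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	MA_STATE(mas, &mm->mm_mt, 0, 0);

	rcu_read_lock();		/* was: mmap_read_lock(mm) */
	mas_for_each(&mas, vma, ULONG_MAX) {
		/*
		 * The VMA cannot be freed under us (the first patch
		 * defers the kmem_cache_free() to an RCU callback), but
		 * it can still be modified or unlinked by a concurrent
		 * munmap().
		 */
		show_map_vma(m, vma);
	}
	rcu_read_unlock();
	return 0;
}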
Problem: We're quite free about modifying VMAs.  So vm_start and vm_end
might get modified while we're in the middle of the walk.  I think
that's OK in some circumstances (eg a second mmap causes vm_end to be
increased and we miss the increase), but others are less OK.  For
example, mremap(MREMAP_MAYMOVE) may result in us seeing output from
/proc/$pid/maps like this:

7f38db534000-7f38db81a000 r--p 00000000 fe:01 55416 /usr/lib/locale/locale-archive
7f38db81a000-7f38db83f000 r--p 00000000 fe:01 123093081 /usr/lib/x86_64-linux-gnu/libc-2.31.so
55d237566000-55d237567000 rw-p 00000000 00:00 0
7f38db98a000-7f38db9d4000 r--p 00170000 fe:01 123093081 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f38db9d4000-7f38db9d5000 ---p 001ba000 fe:01 123093081 /usr/lib/x86_64-linux-gnu/libc-2.31.so

and I'm not sure how comfortable we are with userspace parsers coping
with that kind of situation.  We should probably allocate a new VMA and
free the old one in this situation.

Next problem: /proc/$pid/smaps calls walk_page_vma(), which starts out
by saying:

	mmap_assert_locked(walk.mm);

which made me realise that smaps is also going to walk the page tables.
So the page tables have to be pinned by the existence of the VMA.  Which
means the page tables must be freed by the same RCU callback that frees
the VMA.  But doing that means that a task which calls mmap(); munmap();
mmap(); must avoid allocating the same address for the second mmap
(until the RCU grace period has elapsed), otherwise threads on other
CPUs may see the stale PTEs instead of the new ones.

Solution 1: Move the page table freeing into the RCU callback, call
synchronize_rcu() in munmap().

Solution 2: Refcount the VMA and free the page tables when the refcount
drops to zero.  This doesn't actually work because the stale PTE problem
still exists.

Solution 3: When unmapping a VMA, instead of erasing the VMA from the
maple tree, put a "dead" entry in its place.  Once the RCU freeing and
the TLB shootdown have happened, erase the entry, and the range can then
be allocated again.  If we do that, MAP_FIXED will have to
synchronize_rcu() if it overlaps a dead entry (rough sketch below).

Thoughts?
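
To make solution 3 a bit more concrete, here's the kind of check I'm
imagining on the MAP_FIXED side.  Very much a sketch: the dead-entry
marker and both helpers are invented names, and whether XA_ZERO_ENTRY
(or a dedicated tagged pointer) is the right marker is an open question.

/*
 * munmap() stores DEAD_VMA over the range instead of erasing it; the
 * RCU callback that frees the VMA and the page tables erases it once
 * the grace period and TLB shootdown are done.
 */
#define DEAD_VMA	XA_ZERO_ENTRY

static bool range_has_dead_vma(struct mm_struct *mm, unsigned long start,
			       unsigned long end)
{
	MA_STATE(mas, &mm->mm_mt, start, start);
	void *entry;

	mas_for_each(&mas, entry, end - 1) {
		if (entry == DEAD_VMA)
			return true;
	}
	return false;
}

/* Called from the MAP_FIXED path before handing the range back out. */
static void wait_for_dead_vmas(struct mm_struct *mm, unsigned long start,
			       unsigned long len)
{
	/*
	 * synchronize_rcu() waits for a grace period, but the callback
	 * that erases the dead entry may still be pending, hence the
	 * loop rather than a single wait.
	 */
	while (range_has_dead_vma(mm, start, start + len))
		synchronize_rcu();
}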