Date: Tue, 3 Dec 2019 14:21:47 -0800
From: Matthew Wilcox
To: linux-mm@kvack.org
Subject: Splitting the mmap_sem
Message-ID: <20191203222147.GV20752@bombadil.infradead.org>

[My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for their feedback on an earlier version of this. I think the solution we discussed doesn't quite work, so here's one which I think does. See the last two paragraphs in particular.]
My preferred solution to the mmap_sem scalability problem is to allow VMAs to be looked up under the RCU read lock, then take a per-VMA lock. I've been focusing on the first half of this problem (looking up VMAs in an RCU-safe data structure) and ignoring the second half (taking a lock while holding the RCU lock). We can't take a semaphore while holding the RCU lock in case we have to sleep -- the VMA might not exist any more when we wake up. Making the per-VMA lock a spinlock would be a massive change -- fault handlers are currently called with the mmap_sem held and may sleep. So I think we need a per-VMA refcount. That lets us sleep while handling a fault. There are over 100 fault handlers in the kernel, and I don't want to change the locking in all of them.

That makes modifications to the tree a little tricky. At the moment, we take the rwsem for write, which waits for all readers to finish; then we modify the VMAs, then we allow readers back in. With RCU, there is no way to block readers, so different threads may (at the same time) see both an old and a new VMA for the same virtual address. So calling mmap() looks like this:

	allocate a new VMA
	update pointer(s) in maple tree
	sleep until old VMAs have a zero refcount
	synchronize_rcu()
	free old VMAs
	flush caches for affected range
	return to userspace

While one thread is calling mmap(MAP_FIXED), two other threads which are accessing the same address may see different data from each other and have different page translations in their respective CPU caches until the thread calling mmap() returns. I believe this is OK, but would greatly appreciate hearing from people who know better.

Some people are concerned that a reference count on the VMA will lead to contention moving from the mmap_sem to the refcount on a very large VMA, for workloads which have one giant VMA covering the entire working set.
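The lookup-then-refcount step can be sketched as a user-space model (illustrative only, not kernel code: `vma_tryget`/`vma_put` are made-up names here, and the surrounding rcu_read_lock()/maple tree walk is elided). The key property is that a refcount which has already dropped to zero can never be revived, so a reader that loses the race must retry the lookup:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of a per-VMA refcount.
 * In the kernel this would be taken under rcu_read_lock(),
 * after looking the VMA up in the maple tree. */
struct vma {
	atomic_int refcount;	/* 0 means the VMA is being torn down */
};

/* Try to take a reference; fails if the count already hit zero.
 * Once we hold a reference, the fault handler may sleep. */
static int vma_tryget(struct vma *v)
{
	int old = atomic_load(&v->refcount);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&v->refcount, &old, old + 1))
			return 1;	/* got a reference */
	}
	return 0;	/* VMA is going away: retry the tree walk */
}

static void vma_put(struct vma *v)
{
	atomic_fetch_sub(&v->refcount, 1);
}
```

In the munmap/mmap(MAP_FIXED) path, the writer would drop the initial reference and then sleep until the count reaches zero before freeing the VMA, which is the "sleep until old VMAs have a zero refcount" step above.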
For those workloads, I propose we use the existing ->map_pages() callback (changed to return a vm_fault_t from the current void). It will be called with the RCU lock held and no reference count on the VMA. If it needs to sleep, it should bump the refcount, drop the RCU lock, prepare enough so that the next call will not need to sleep, then drop the refcount and return VM_FAULT_RETRY so the VM knows the VMA is no longer good and it needs to walk the VMA tree from the start.

We currently only have one ->map_pages() callback, and it's filemap_map_pages(). It only needs to sleep in one place -- to allocate a PTE table. I think that can be allocated ahead of time if needed.
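The prepare-then-retry shape of that callback can be modelled in user space like this (an illustrative sketch, not the real filemap_map_pages(): `struct fault_state`, `FAULT_RETRY` and the malloc() standing in for PTE-table allocation are all made up here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-ins for vm_fault_t return values. */
#define FAULT_DONE	0
#define FAULT_RETRY	1

struct fault_state {
	void *prealloc_pte;	/* stands in for a preallocated PTE table */
};

/* Model of a ->map_pages()-style callback: if the only sleeping step
 * (allocating the PTE table) is needed, do it now and return RETRY so
 * the caller walks the VMA tree again from the start.  In the proposal
 * the callback would bump the VMA refcount and drop the RCU lock
 * around the allocation. */
static int map_pages(struct fault_state *fs)
{
	if (!fs->prealloc_pte) {
		fs->prealloc_pte = malloc(64);	/* "sleeping" allocation */
		return FAULT_RETRY;
	}
	return FAULT_DONE;	/* fast path: nothing to sleep for */
}
```

On the second call, the preallocated table is already there, so the fault completes without sleeping -- which is why a single sleeping site makes this scheme workable for filemap_map_pages().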