From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934401AbdKBUIp (ORCPT ); Thu, 2 Nov 2017 16:08:45 -0400 Received: from mx1.redhat.com ([209.132.183.28]:49134 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934221AbdKBUIn (ORCPT ); Thu, 2 Nov 2017 16:08:43 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 107916A7C6 Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=aarcange@redhat.com Date: Thu, 2 Nov 2017 21:08:40 +0100 From: Andrea Arcangeli To: Laurent Dufour Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org, akpm@linux-foundation.org, kirill@shutemov.name, ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox , benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org, Thomas Gleixner , Ingo Molnar , hpa@zytor.com, Will Deacon , Sergey Senozhatsky , Alexei Starovoitov , linux-kernel@vger.kernel.org, linux-mm@kvack.org, haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com, npiggin@gmail.com, bsingharora@gmail.com, Tim Chen , linuxppc-dev@lists.ozlabs.org, x86@kernel.org Subject: Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count Message-ID: <20171102200840.GC22686@redhat.com> References: <1507729966-10660-1-git-send-email-ldufour@linux.vnet.ibm.com> <1507729966-10660-8-git-send-email-ldufour@linux.vnet.ibm.com> <20171026101833.GF563@redhat.com> <2cbea37c-c2a7-bfd4-4528-fd273b210e29@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.1 (2017-09-22) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Thu, 02 Nov 2017 20:08:43 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote: > I think there is some memory barrier missing when the VMA is modified so > currently the modifications done in the VMA structure may not be written > down at the time the pte is locked. So doing that change will also requires > to call smp_wmb() before locking the page tables. In the current patch this > is ensured by the call to write_seqcount_end(). > Doing so will still require to have a memory barrier when touching the VMA. > Not sure we get far better performance compared to the sequence count > change. But I'll give it a try anyway ;) Luckily smp_wmb is a noop on x86. I would suggest to ignore the above issue completely if you give it a try, and then if this performs, we can just embed a smp_wmb() before spin_lock() somewhere in pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose spin_lock isn't a smp_wmb() equivalent. I would focus at flushing writes before every pagetable spin_lock for non-x86 archs, rather than after all vma modifications. That should be easier to keep under control and it's going to be more efficient too as if something there are fewer spin locks than vma modifications. For non-x86 archs we may then need a smp_wmb__before_spin_lock. That looks more self contained than surrounding all vma modifications and it's a noop on x86 anyway. I thought about the contention detection logic too yesterday: to detect contention we could have a mm->mmap_sem_contention_jiffies and if down_read_trylock_exclusive() [same as down_read_if_not_hold in prev mail] fails (and it'll fail if either read or write mmap_sem is hold, so also convering mremap/mprotect etc..) we set mm->mmap_sem_contention_jiffies = jiffies and then to know if you must not touch the mmap_sem at all, you compare jiffies against mmap_sem_contention_jiffies, if it's equal we go speculative. If that's not enough we can just keep going speculative for a few more jiffies with time_before(). The srcu lock is non concerning because the inc/dec of the fast path is in per-cpu cacheline of course, no false sharing possible there or it wouldn't be any better than a normal lock. The vma revalidation is already done by khugepaged and mm/userfaultfd, both need to drop the mmap_sem and continue working on the pagetables, so we already know it's workable and not too slow. Summarizing.. by using a runtime contention triggered speculative design that goes speculative only when contention is runtime-detected using the above logic (or equivalent), and by having to revalidate the vma by hand with find_vma without knowing instantly if the vma become stale, we will run with a substantially slower speculative page fault than with your current speculative always-on design, but the slower speculative page fault runtime will still scale 100% in SMP so it should still be faster on large SMP systems. The pros is that it won't regress the mmap/brk vma modifications. The whole complexity of tracking the vma modifications should also go away and the resulting code should be more maintainable and less risky to break in subtle ways impossible to reproduce. Thanks! Andrea