From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755351AbbCSOKc (ORCPT ); Thu, 19 Mar 2015 10:10:32 -0400 Received: from cantor2.suse.de ([195.135.220.15]:41407 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751675AbbCSOK3 (ORCPT ); Thu, 19 Mar 2015 10:10:29 -0400 Date: Thu, 19 Mar 2015 14:10:22 +0000 From: Mel Gorman To: Linus Torvalds Cc: Dave Chinner , Ingo Molnar , Andrew Morton , Aneesh Kumar , Linux Kernel Mailing List , Linux-MM , xfs@oss.sgi.com, ppc-dev Subject: Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur Message-ID: <20150319141022.GD3087@suse.de> References: <20150312131045.GE3406@suse.de> <20150312184925.GH3406@suse.de> <20150317070655.GB10105@dastard> <20150317205104.GA28621@dastard> <20150317220840.GC28621@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote: > > - something completely different that I am entirely missing > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. > Minimally, there is still the window where we clear the PTE to set the protections. During that window, a fault can occur. In the old code which was inherently racy and unsafe, the fault might still go ahead deferring a potential migration for a short period. In the current code, it'll stall on the lock, notice the PTE is changed and refault so the overhead is very different but functionally correct. In the old code, pte_write had complex interactions with background cleaning and sync in the case of file mappings (not applicable to Dave's case but still it's unpredictable behaviour). pte_dirty is close but there are interactions with the application as the timing of writes vs the PTE scanner matter. Even if we restored the original behaviour, it would still be very difficult to understand all the interactions between userspace and kernel. The patch below should be tested because it's clearer what the intent is. Using the VMA flags is coarse but it's not vulnerable to timing artifacts that behave differently depending on the machine. My preliminary testing shows it helps but not by much. It does not restore performance to where it was but it's easier to understand which is important if there are changes in the scheduler later. In combination, I also think that slowing PTE scanning when migration fails is the correct action even if it is unrelated to the patch Dave bisected to. It's stupid to increase scanning rates and incurs more faults when migrations are failing so I'll be testing that next. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 626e93db28ba..2f12e9fcf1a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, flags |= TNF_FAULT_LOCAL; } - /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. - * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. - */ - if (!pmd_dirty(pmd)) + /* See similar comment in do_numa_page for explanation */ + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* diff --git a/mm/memory.c b/mm/memory.c index 411144f977b1..20beb6647dba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. + * Avoid grouping on RO pages in general. RO pages shouldn't hurt as + * much anyway since they can be in shared cache state. This misses + * the case where a mapping is writable but the process never writes + * to it but pte_write gets cleared during protection updates and + * pte_dirty has unpredictable behaviour between PTE scan updates, + * background writeback, dirty balancing and application behaviour. * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. + * TODO: Note that the ideal here would be to avoid a situation where a + * NUMA fault is taken immediately followed by a write fault in + * some cases which would have lower overhead overall but would be + * invasive as the fault paths would need to be unified. */ - if (!pte_dirty(pte)) + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 9AFD07F37 for ; Thu, 19 Mar 2015 09:10:35 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 445D5AC001 for ; Thu, 19 Mar 2015 07:10:32 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id oGH3cx8aLKcIJaye (version=TLSv1 cipher=AES256-SHA bits=256 verify=NO) for ; Thu, 19 Mar 2015 07:10:29 -0700 (PDT) Date: Thu, 19 Mar 2015 14:10:22 +0000 From: Mel Gorman Subject: Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur Message-ID: <20150319141022.GD3087@suse.de> References: <20150312131045.GE3406@suse.de> <20150312184925.GH3406@suse.de> <20150317070655.GB10105@dastard> <20150317205104.GA28621@dastard> <20150317220840.GC28621@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Linus Torvalds Cc: Linux Kernel Mailing List , xfs@oss.sgi.com, Linux-MM , Aneesh Kumar , Andrew Morton , ppc-dev , Ingo Molnar On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote: > > - something completely different that I am entirely missing > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. > Minimally, there is still the window where we clear the PTE to set the protections. During that window, a fault can occur. In the old code which was inherently racy and unsafe, the fault might still go ahead deferring a potential migration for a short period. In the current code, it'll stall on the lock, notice the PTE is changed and refault so the overhead is very different but functionally correct. In the old code, pte_write had complex interactions with background cleaning and sync in the case of file mappings (not applicable to Dave's case but still it's unpredictable behaviour). pte_dirty is close but there are interactions with the application as the timing of writes vs the PTE scanner matter. Even if we restored the original behaviour, it would still be very difficult to understand all the interactions between userspace and kernel. The patch below should be tested because it's clearer what the intent is. Using the VMA flags is coarse but it's not vulnerable to timing artifacts that behave differently depending on the machine. My preliminary testing shows it helps but not by much. It does not restore performance to where it was but it's easier to understand which is important if there are changes in the scheduler later. In combination, I also think that slowing PTE scanning when migration fails is the correct action even if it is unrelated to the patch Dave bisected to. It's stupid to increase scanning rates and incurs more faults when migrations are failing so I'll be testing that next. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 626e93db28ba..2f12e9fcf1a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, flags |= TNF_FAULT_LOCAL; } - /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. - * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. - */ - if (!pmd_dirty(pmd)) + /* See similar comment in do_numa_page for explanation */ + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* diff --git a/mm/memory.c b/mm/memory.c index 411144f977b1..20beb6647dba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. + * Avoid grouping on RO pages in general. RO pages shouldn't hurt as + * much anyway since they can be in shared cache state. This misses + * the case where a mapping is writable but the process never writes + * to it but pte_write gets cleared during protection updates and + * pte_dirty has unpredictable behaviour between PTE scan updates, + * background writeback, dirty balancing and application behaviour. * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. + * TODO: Note that the ideal here would be to avoid a situation where a + * NUMA fault is taken immediately followed by a write fault in + * some cases which would have lower overhead overall but would be + * invasive as the fault paths would need to be unified. */ - if (!pte_dirty(pte)) + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f53.google.com (mail-wg0-f53.google.com [74.125.82.53]) by kanga.kvack.org (Postfix) with ESMTP id A5C716B0038 for ; Thu, 19 Mar 2015 10:10:30 -0400 (EDT) Received: by wggv3 with SMTP id v3so63574360wgg.1 for ; Thu, 19 Mar 2015 07:10:30 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id ht10si3192048wib.115.2015.03.19.07.10.28 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 19 Mar 2015 07:10:29 -0700 (PDT) Date: Thu, 19 Mar 2015 14:10:22 +0000 From: Mel Gorman Subject: Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur Message-ID: <20150319141022.GD3087@suse.de> References: <20150312131045.GE3406@suse.de> <20150312184925.GH3406@suse.de> <20150317070655.GB10105@dastard> <20150317205104.GA28621@dastard> <20150317220840.GC28621@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Chinner , Ingo Molnar , Andrew Morton , Aneesh Kumar , Linux Kernel Mailing List , Linux-MM , xfs@oss.sgi.com, ppc-dev On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote: > > - something completely different that I am entirely missing > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. > Minimally, there is still the window where we clear the PTE to set the protections. During that window, a fault can occur. In the old code which was inherently racy and unsafe, the fault might still go ahead deferring a potential migration for a short period. In the current code, it'll stall on the lock, notice the PTE is changed and refault so the overhead is very different but functionally correct. In the old code, pte_write had complex interactions with background cleaning and sync in the case of file mappings (not applicable to Dave's case but still it's unpredictable behaviour). pte_dirty is close but there are interactions with the application as the timing of writes vs the PTE scanner matter. Even if we restored the original behaviour, it would still be very difficult to understand all the interactions between userspace and kernel. The patch below should be tested because it's clearer what the intent is. Using the VMA flags is coarse but it's not vulnerable to timing artifacts that behave differently depending on the machine. My preliminary testing shows it helps but not by much. It does not restore performance to where it was but it's easier to understand which is important if there are changes in the scheduler later. In combination, I also think that slowing PTE scanning when migration fails is the correct action even if it is unrelated to the patch Dave bisected to. It's stupid to increase scanning rates and incurs more faults when migrations are failing so I'll be testing that next. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 626e93db28ba..2f12e9fcf1a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, flags |= TNF_FAULT_LOCAL; } - /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. - * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. - */ - if (!pmd_dirty(pmd)) + /* See similar comment in do_numa_page for explanation */ + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* diff --git a/mm/memory.c b/mm/memory.c index 411144f977b1..20beb6647dba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. + * Avoid grouping on RO pages in general. RO pages shouldn't hurt as + * much anyway since they can be in shared cache state. This misses + * the case where a mapping is writable but the process never writes + * to it but pte_write gets cleared during protection updates and + * pte_dirty has unpredictable behaviour between PTE scan updates, + * background writeback, dirty balancing and application behaviour. * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. + * TODO: Note that the ideal here would be to avoid a situation where a + * NUMA fault is taken immediately followed by a write fault in + * some cases which would have lower overhead overall but would be + * invasive as the fault paths would need to be unified. */ - if (!pte_dirty(pte)) + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx2.suse.de (cantor2.suse.de [195.135.220.15]) (using TLSv1 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id BBAC41A02CB for ; Fri, 20 Mar 2015 01:10:31 +1100 (AEDT) Date: Thu, 19 Mar 2015 14:10:22 +0000 From: Mel Gorman To: Linus Torvalds Subject: Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur Message-ID: <20150319141022.GD3087@suse.de> References: <20150312131045.GE3406@suse.de> <20150312184925.GH3406@suse.de> <20150317070655.GB10105@dastard> <20150317205104.GA28621@dastard> <20150317220840.GC28621@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 In-Reply-To: Cc: Dave Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, Linux-MM , Aneesh Kumar , Andrew Morton , ppc-dev , Ingo Molnar List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote: > > - something completely different that I am entirely missing > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. > Minimally, there is still the window where we clear the PTE to set the protections. During that window, a fault can occur. In the old code which was inherently racy and unsafe, the fault might still go ahead deferring a potential migration for a short period. In the current code, it'll stall on the lock, notice the PTE is changed and refault so the overhead is very different but functionally correct. In the old code, pte_write had complex interactions with background cleaning and sync in the case of file mappings (not applicable to Dave's case but still it's unpredictable behaviour). pte_dirty is close but there are interactions with the application as the timing of writes vs the PTE scanner matter. Even if we restored the original behaviour, it would still be very difficult to understand all the interactions between userspace and kernel. The patch below should be tested because it's clearer what the intent is. Using the VMA flags is coarse but it's not vulnerable to timing artifacts that behave differently depending on the machine. My preliminary testing shows it helps but not by much. It does not restore performance to where it was but it's easier to understand which is important if there are changes in the scheduler later. In combination, I also think that slowing PTE scanning when migration fails is the correct action even if it is unrelated to the patch Dave bisected to. It's stupid to increase scanning rates and incurs more faults when migrations are failing so I'll be testing that next. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 626e93db28ba..2f12e9fcf1a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, flags |= TNF_FAULT_LOCAL; } - /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. - * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. - */ - if (!pmd_dirty(pmd)) + /* See similar comment in do_numa_page for explanation */ + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* diff --git a/mm/memory.c b/mm/memory.c index 411144f977b1..20beb6647dba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Avoid grouping on DSO/COW pages in specific and RO pages - * in general, RO pages shouldn't hurt as much anyway since - * they can be in shared cache state. + * Avoid grouping on RO pages in general. RO pages shouldn't hurt as + * much anyway since they can be in shared cache state. This misses + * the case where a mapping is writable but the process never writes + * to it but pte_write gets cleared during protection updates and + * pte_dirty has unpredictable behaviour between PTE scan updates, + * background writeback, dirty balancing and application behaviour. * - * FIXME! This checks "pmd_dirty()" as an approximation of - * "is this a read-only page", since checking "pmd_write()" - * is even more broken. We haven't actually turned this into - * a writable page, so pmd_write() will always be false. + * TODO: Note that the ideal here would be to avoid a situation where a + * NUMA fault is taken immediately followed by a write fault in + * some cases which would have lower overhead overall but would be + * invasive as the fault paths would need to be unified. */ - if (!pte_dirty(pte)) + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /*