From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 23 Dec 2016 15:53:05 +0100
From: Michal Hocko
To: Minchan Kim
Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161223145305.GF23109@dhcp22.suse.cz>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> <20161223115421.GD23109@dhcp22.suse.cz> <20161223140131.GA5724@bbox>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161223140131.GA5724@bbox>
Sender: owner-linux-mm@kvack.org
List-ID:

On Fri 23-12-16 23:01:31, Minchan Kim wrote:
> On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 18:53:36, Minchan Kim wrote:
[...]
> > > stucks until VM marked the pmd dirty.
> > >
> > > How the emulation work depends on the architecture. In case of arm64,
> > > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
> > > mark the pte dirty via triggering page fault when store access happens.
> > > Once the page fault occurs, VM marks the pte dirty and arch code for
> > > setting pte will clear PTE_RDONLY for application to proceed.
> > >
> > > IOW, if VM doesn't mark the pte dirty, application hangs forever by
> > > repeated fault(i.e., store op but the pte is PTE_RDONLY).
> > >
> > > This patch enables dirty-bit emulation for those architectures.
> >
> > Yes this is helpful and much more clear, thank you. One thing that is
> > still not clear to me is why cannot we handle that in the arch specific
> > code. I mean what is the side effect of doing pmd_mkdirty for
> > architectures which do not need it?
>
> For architecture which supports H/W access/dirty bit, it couldn't be
> reached there code path so there is no side effect, I think.

ahh, I knew I was missing something. It definitely wasn't obvious to me
and my x86 config it simply generates code to call
huge_pmd_set_accessed.

> A thing
> I can think of is just increasing code size little bit. Maybe, we
> could optimize away some ifdef magic but not sure worth it.

it is not
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 23 Dec 2016 23:01:31 +0900
From: Minchan Kim
To: Michal Hocko
CC: "Kirill A. Shutemov" , Andrew Morton , , Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , , , "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161223140131.GA5724@bbox>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> <20161223115421.GD23109@dhcp22.suse.cz>
MIME-Version: 1.0
In-Reply-To: <20161223115421.GD23109@dhcp22.suse.cz>
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
List-ID:

On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 18:53:36, Minchan Kim wrote:
> > Hi,
> >
> > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote:
> > > On Thu 22-12-16 23:52:03, Minchan Kim wrote:
> > > [...]
> > > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim
> > > > Date: Thu, 22 Dec 2016 23:43:49 +0900
> > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> > > >
> > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> > > >
> > > > The problem is page fault handler supports only accessed flag emulation
> > > > for THP page of SW-dirty/accessed architecture.
> > > >
> > > > This patch enables dirty-bit emulation for those architectures.
> > > > Without it, MADV_FREE makes application hang by repeated fault forever.
> > >
> > > The changelog is rather terse and considering the issue is rather subtle
> > > and it aims the stable tree I think it could see more information. How
> > > do we end up looping in the page fault and why the dirty pmd stops it.
> > > Could you update the changelog to be more verbose, please? I am still
> > > digesting this patch but I believe it is correct fwiw...
> > >
> > >
> > How about this? Feel free to suggest better wording.
> >
> > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> >
> > The problem is currently page fault handler doesn't supports dirty bit
> > emulation of pte for non-HW dirty-bit architecture so that application
>
> s@pte@pmd@ ?

It would be more clear. Will update with it.

>
> > stucks until VM marked the pmd dirty.
> >
> > How the emulation work depends on the architecture. In case of arm64,
> > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
> > mark the pte dirty via triggering page fault when store access happens.
> > Once the page fault occurs, VM marks the pte dirty and arch code for
> > setting pte will clear PTE_RDONLY for application to proceed.
> >
> > IOW, if VM doesn't mark the pte dirty, application hangs forever by
> > repeated fault(i.e., store op but the pte is PTE_RDONLY).
> >
> > This patch enables dirty-bit emulation for those architectures.
>
> Yes this is helpful and much more clear, thank you. One thing that is
> still not clear to me is why cannot we handle that in the arch specific
> code. I mean what is the side effect of doing pmd_mkdirty for
> architectures which do not need it?

For architecture which supports H/W access/dirty bit, it couldn't be
reached there code path so there is no side effect, I think. A thing
I can think of is just increasing code size little bit. Maybe, we
could optimize away some ifdef magic but not sure worth it. We have been
handling pte (not pmd) emulation the same way for several decades.

Anyway, it should be off-topic, I think.

Thanks.

>
> --
> Michal Hocko
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 23 Dec 2016 12:54:21 +0100
From: Michal Hocko
To: Minchan Kim
Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A .
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161223115421.GD23109@dhcp22.suse.cz>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161223095336.GA5305@bbox>
Sender: owner-linux-mm@kvack.org
List-ID:

On Fri 23-12-16 18:53:36, Minchan Kim wrote:
> Hi,
>
> On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote:
> > On Thu 22-12-16 23:52:03, Minchan Kim wrote:
> > [...]
> > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim
> > > Date: Thu, 22 Dec 2016 23:43:49 +0900
> > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> > >
> > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> > >
> > > The problem is page fault handler supports only accessed flag emulation
> > > for THP page of SW-dirty/accessed architecture.
> > >
> > > This patch enables dirty-bit emulation for those architectures.
> > > Without it, MADV_FREE makes application hang by repeated fault forever.
> >
> > The changelog is rather terse and considering the issue is rather subtle
> > and it aims the stable tree I think it could see more information. How
> > do we end up looping in the page fault and why the dirty pmd stops it.
> > Could you update the changelog to be more verbose, please? I am still
> > digesting this patch but I believe it is correct fwiw...
> >
>
> How about this? Feel free to suggest better wording.
>
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
>
> The problem is currently page fault handler doesn't supports dirty bit
> emulation of pte for non-HW dirty-bit architecture so that application

s@pte@pmd@ ?

> stucks until VM marked the pmd dirty.
>
> How the emulation work depends on the architecture. In case of arm64,
> when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
> mark the pte dirty via triggering page fault when store access happens.
> Once the page fault occurs, VM marks the pte dirty and arch code for
> setting pte will clear PTE_RDONLY for application to proceed.
>
> IOW, if VM doesn't mark the pte dirty, application hangs forever by
> repeated fault(i.e., store op but the pte is PTE_RDONLY).
>
> This patch enables dirty-bit emulation for those architectures.

Yes this is helpful and much more clear, thank you. One thing that is
still not clear to me is why cannot we handle that in the arch specific
code. I mean what is the side effect of doing pmd_mkdirty for
architectures which do not need it?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 23 Dec 2016 18:53:36 +0900
From: Minchan Kim
To: Michal Hocko
CC: "Kirill A. Shutemov" , Andrew Morton , , Jason Evans , "Kirill A .
Shutemov" , Will Deacon , Catalin Marinas , , , "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161223095336.GA5305@bbox>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz>
MIME-Version: 1.0
In-Reply-To: <20161223091725.GA23117@dhcp22.suse.cz>
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
List-ID:

Hi,

On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote:
> On Thu 22-12-16 23:52:03, Minchan Kim wrote:
> [...]
> > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> > From: Minchan Kim
> > Date: Thu, 22 Dec 2016 23:43:49 +0900
> > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> >
> > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> >
> > The problem is page fault handler supports only accessed flag emulation
> > for THP page of SW-dirty/accessed architecture.
> >
> > This patch enables dirty-bit emulation for those architectures.
> > Without it, MADV_FREE makes application hang by repeated fault forever.
>
> The changelog is rather terse and considering the issue is rather subtle
> and it aims the stable tree I think it could see more information. How
> do we end up looping in the page fault and why the dirty pmd stops it.
> Could you update the changelog to be more verbose, please? I am still
> digesting this patch but I believe it is correct fwiw...
>

How about this? Feel free to suggest better wording.

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is currently page fault handler doesn't supports dirty bit
emulation of pte for non-HW dirty-bit architecture so that application
stucks until VM marked the pmd dirty.

How the emulation work depends on the architecture. In case of arm64,
when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
mark the pte dirty via triggering page fault when store access happens.
Once the page fault occurs, VM marks the pte dirty and arch code for
setting pte will clear PTE_RDONLY for application to proceed.

IOW, if VM doesn't mark the pte dirty, application hangs forever by
repeated fault(i.e., store op but the pte is PTE_RDONLY).

This patch enables dirty-bit emulation for those architectures.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 23 Dec 2016 10:17:25 +0100
From: Michal Hocko
To: Minchan Kim
Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161223091725.GA23117@dhcp22.suse.cz>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161222145203.GA18970@bbox>
Sender: owner-linux-mm@kvack.org
List-ID:

On Thu 22-12-16 23:52:03, Minchan Kim wrote:
[...]
> >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> From: Minchan Kim
> Date: Thu, 22 Dec 2016 23:43:49 +0900
> Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
>
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
>
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
>
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.

The changelog is rather terse and considering the issue is rather subtle
and it aims the stable tree I think it could see more information. How
do we end up looping in the page fault and why the dirty pmd stops it.
Could you update the changelog to be more verbose, please? I am still
digesting this patch but I believe it is correct fwiw...

Thanks!

> [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
>
> Cc: Jason Evans
> Cc: Kirill A. Shutemov
> Cc: Will Deacon
> Cc: Catalin Marinas
> Cc: linux-arch@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: [4.5+]
> Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
> Reported-by: Andreas Schwab
> Signed-off-by: Minchan Kim
> ---
> * from v1
>   * Remove __handle_mm_fault part - Kirill
>
>  mm/huge_memory.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 10eedbf..29ec8a4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
>  	pmd_t entry;
>  	unsigned long haddr;
> +	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  
>  	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>  		goto unlock;
>  
>  	entry = pmd_mkyoung(orig_pmd);
> +	if (write)
> +		entry = pmd_mkdirty(entry);
>  	haddr = vmf->address & HPAGE_PMD_MASK;
> -	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
> -				vmf->flags & FAULT_FLAG_WRITE))
> +	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
>  		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
>  
>  unlock:
> --
> 2.7.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Andreas Schwab
To: Minchan Kim
Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A .
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "\[4.5+\]"
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox>
Date: Thu, 22 Dec 2016 23:12:32 +0100
In-Reply-To: <20161222145203.GA18970@bbox> (Minchan Kim's message of "Thu, 22 Dec 2016 23:52:03 +0900")
Message-ID: <8737hftxyn.fsf@suse.de>
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
List-ID:

On Dec 22 2016, Minchan Kim wrote:

> From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> From: Minchan Kim
> Date: Thu, 22 Dec 2016 23:43:49 +0900
> Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
>
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
>
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
>
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.
>
> [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Successfully tested a backport to 4.9.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 22 Dec 2016 21:35:33 +0300
From: "Kirill A. Shutemov"
To: Minchan Kim
Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A .
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161222183533.GA29876@node.shutemov.name>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161222145203.GA18970@bbox>
Sender: owner-linux-mm@kvack.org
List-ID:

On Thu, Dec 22, 2016 at 11:52:03PM +0900, Minchan Kim wrote:
> Hello,
>
> On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. Shutemov wrote:
>
> < snip >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 36c774f..7408ddc 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> > >  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
> > >  
> > > -			if ((vmf.flags & FAULT_FLAG_WRITE) &&
> > > -					!pmd_write(orig_pmd)) {
> > > -				ret = wp_huge_pmd(&vmf, orig_pmd);
> > > -				if (!(ret & VM_FAULT_FALLBACK))
> > > +			if (vmf.flags & FAULT_FLAG_WRITE) {
> > > +				if (!pmd_write(orig_pmd)) {
> > > +					ret = wp_huge_pmd(&vmf, orig_pmd);
> > > +					if (ret == VM_FAULT_FALLBACK)
> >
> > In theory, more than one flag can be set and it would lead to
> > false-negative. Bit check was the right thing.
> >
> > And I don't understand why do you need to change code in
> > __handle_mm_fault() at all.
> > From what I see change to huge_pmd_set_accessed() should be enough.
>
> Yeb. Thanks for the review.
> Here v2 goes.
>
> From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> From: Minchan Kim
> Date: Thu, 22 Dec 2016 23:43:49 +0900
> Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
>
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
>
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
>
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.
>
> [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
>
> Cc: Jason Evans
> Cc: Kirill A. Shutemov
> Cc: Will Deacon
> Cc: Catalin Marinas
> Cc: linux-arch@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: [4.5+]
> Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
> Reported-by: Andreas Schwab
> Signed-off-by: Minchan Kim

Acked-by: Kirill A. Shutemov

-- 
Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 22 Dec 2016 23:52:03 +0900
From: Minchan Kim
To: "Kirill A. Shutemov"
Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A .
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161222145203.GA18970@bbox>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161222081713.GA32480@node.shutemov.name>
Sender: owner-linux-mm@kvack.org
List-ID:

Hello,

On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. Shutemov wrote:

< snip >

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 36c774f..7408ddc 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> >  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
> >  
> > -			if ((vmf.flags & FAULT_FLAG_WRITE) &&
> > -					!pmd_write(orig_pmd)) {
> > -				ret = wp_huge_pmd(&vmf, orig_pmd);
> > -				if (!(ret & VM_FAULT_FALLBACK))
> > +			if (vmf.flags & FAULT_FLAG_WRITE) {
> > +				if (!pmd_write(orig_pmd)) {
> > +					ret = wp_huge_pmd(&vmf, orig_pmd);
> > +					if (ret == VM_FAULT_FALLBACK)
>
> In theory, more than one flag can be set and it would lead to
> false-negative. Bit check was the right thing.
>
> And I don't understand why do you need to change code in
> __handle_mm_fault() at all.
> From what I see change to huge_pmd_set_accessed() should be enough.

Yeb. Thanks for the review.
Here v2 goes.

>>From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
From: Minchan Kim
Date: Thu, 22 Dec 2016 23:43:49 +0900
Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is page fault handler supports only accessed flag emulation
for THP page of SW-dirty/accessed architecture.

This patch enables dirty-bit emulation for those architectures.
Without it, MADV_FREE makes application hang by repeated fault forever.

[1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Cc: Jason Evans
Cc: Kirill A. Shutemov
Cc: Will Deacon
Cc: Catalin Marinas
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: [4.5+]
Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Reported-by: Andreas Schwab
Signed-off-by: Minchan Kim
---
* from v1
  * Remove __handle_mm_fault part - Kirill

 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..29ec8a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	pmd_t entry;
 	unsigned long haddr;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto unlock;
 
 	entry = pmd_mkyoung(orig_pmd);
+	if (write)
+		entry = pmd_mkdirty(entry);
 	haddr = vmf->address & HPAGE_PMD_MASK;
-	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
-				vmf->flags & FAULT_FLAG_WRITE))
+	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
 		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
 
 unlock:
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 22 Dec 2016 11:17:13 +0300
From: "Kirill A. Shutemov"
To: Minchan Kim
Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A .
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <20161222081713.GA32480@node.shutemov.name>
References: <1482364101-16204-1-git-send-email-minchan@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1482364101-16204-1-git-send-email-minchan@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:

On Thu, Dec 22, 2016 at 08:48:21AM +0900, Minchan Kim wrote:
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.

I guess you wanted to put b8d3c4c3009d before [1], right?

> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
>
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
>
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.
>
> [1] mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
>
> Cc: Jason Evans
> Cc: Kirill A. Shutemov
> Cc: Will Deacon
> Cc: Catalin Marinas
> Cc: linux-arch@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: [4.5+]
> Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
> Reported-by: Andreas Schwab
> Signed-off-by: Minchan Kim
> ---
>  mm/huge_memory.c | 6 ++++--
>  mm/memory.c | 18 ++++++++++--------
>  2 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 10eedbf..29ec8a4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
>  	pmd_t entry;
>  	unsigned long haddr;
> +	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  
>  	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>  		goto unlock;
>  
>  	entry = pmd_mkyoung(orig_pmd);
> +	if (write)
> +		entry = pmd_mkdirty(entry);
>  	haddr = vmf->address & HPAGE_PMD_MASK;
> -	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
> -				vmf->flags & FAULT_FLAG_WRITE))
> +	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
>  		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
>  
>  unlock:
> diff --git a/mm/memory.c b/mm/memory.c
> index 36c774f..7408ddc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
>  
> -			if ((vmf.flags & FAULT_FLAG_WRITE) &&
> -					!pmd_write(orig_pmd)) {
> -				ret = wp_huge_pmd(&vmf, orig_pmd);
> -				if (!(ret & VM_FAULT_FALLBACK))
> +			if (vmf.flags & FAULT_FLAG_WRITE) {
> +				if (!pmd_write(orig_pmd)) {
> +					ret = wp_huge_pmd(&vmf, orig_pmd);
> +					if (ret == VM_FAULT_FALLBACK)

In theory, more than one flag can be set and it would lead to
false-negative. Bit check was the right thing.

And I don't understand why do you need to change code in
__handle_mm_fault() at all.
>>From what I see change to huge_pmd_set_accessed() should be enough.

> +						goto pte_fault;
>  					return ret;
> -			} else {
> -				huge_pmd_set_accessed(&vmf, orig_pmd);
> -				return 0;
> +				}
>  			}
> +
> +			huge_pmd_set_accessed(&vmf, orig_pmd);
> +			return 0;
>  		}
>  	}
> -
> +pte_fault:
>  	return handle_pte_fault(&vmf);
>  }
>  
> --
> 2.7.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

-- 
Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm@kvack.org, Minchan Kim , Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab
Subject: [PATCH] mm: pmd dirty emulation in page fault handler
Date: Thu, 22 Dec 2016 08:48:21 +0900
Message-Id: <1482364101-16204-1-git-send-email-minchan@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is page fault handler supports only accessed flag emulation
for THP page of SW-dirty/accessed architecture.

This patch enables dirty-bit emulation for those architectures.
Without it, MADV_FREE makes application hang by repeated fault forever.

[1] mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Cc: Jason Evans
Cc: Kirill A. Shutemov
Cc: Will Deacon
Cc: Catalin Marinas
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: [4.5+]
Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Reported-by: Andreas Schwab
Signed-off-by: Minchan Kim
---
 mm/huge_memory.c | 6 ++++--
 mm/memory.c | 18 ++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..29ec8a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	pmd_t entry;
 	unsigned long haddr;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto unlock;
 
 	entry = pmd_mkyoung(orig_pmd);
+	if (write)
+		entry = pmd_mkdirty(entry);
 	haddr = vmf->address & HPAGE_PMD_MASK;
-	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
-				vmf->flags & FAULT_FLAG_WRITE))
+	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
 		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
 
 unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 36c774f..7408ddc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-			if ((vmf.flags & FAULT_FLAG_WRITE) &&
-					!pmd_write(orig_pmd)) {
-				ret = wp_huge_pmd(&vmf, orig_pmd);
-				if (!(ret & VM_FAULT_FALLBACK))
+			if (vmf.flags & FAULT_FLAG_WRITE) {
+				if (!pmd_write(orig_pmd)) {
+					ret = wp_huge_pmd(&vmf, orig_pmd);
+					if (ret == VM_FAULT_FALLBACK)
+						goto pte_fault;
 					return ret;
-			} else {
-				huge_pmd_set_accessed(&vmf, orig_pmd);
-				return 0;
+				}
 			}
+
+			huge_pmd_set_accessed(&vmf, orig_pmd);
+			return 0;
 		}
 	}
-
+pte_fault:
 	return handle_pte_fault(&vmf);
 }
 
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe
linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Kirill A. Shutemov" Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Date: Thu, 22 Dec 2016 11:17:13 +0300 Message-ID: <20161222081713.GA32480@node.shutemov.name> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <1482364101-16204-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org To: Minchan Kim Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab List-Id: linux-arch.vger.kernel.org On Thu, Dec 22, 2016 at 08:48:21AM +0900, Minchan Kim wrote: > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. I guess you wanted put b8d3c4c3009d before [1], right? > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. 
Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch@vger.kernel.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim > --- > mm/huge_memory.c | 6 ++++-- > mm/memory.c | 18 ++++++++++-------- > 2 files changed, 14 insertions(+), 10 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 10eedbf..29ec8a4 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) > { > pmd_t entry; > unsigned long haddr; > + bool write = vmf->flags & FAULT_FLAG_WRITE; > > vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); > if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) > goto unlock; > > entry = pmd_mkyoung(orig_pmd); > + if (write) > + entry = pmd_mkdirty(entry); > haddr = vmf->address & HPAGE_PMD_MASK; > - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, > - vmf->flags & FAULT_FLAG_WRITE)) > + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write)) > update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); > > unlock: > diff --git a/mm/memory.c b/mm/memory.c > index 36c774f..7408ddc 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > - !pmd_write(orig_pmd)) { > - ret = wp_huge_pmd(&vmf, orig_pmd); > - if (!(ret & VM_FAULT_FALLBACK)) > + if (vmf.flags & FAULT_FLAG_WRITE) { > + if (!pmd_write(orig_pmd)) { > + ret = wp_huge_pmd(&vmf, orig_pmd); > + if (ret == VM_FAULT_FALLBACK) In theory, more than one flag can be set and it would lead to false-negative. Bit check was the right thing. 
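[Editor's sketch of the point about the bit check. The DEMO_* names and values are made up for illustration and are not the kernel's VM_FAULT_* definitions; only the idea, that a return value may carry several flag bits at once, comes from the review comment above.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, chosen only for this illustration. */
#define DEMO_FAULT_MAJOR    0x0004u
#define DEMO_FAULT_FALLBACK 0x0800u

/* Equality test: true only when FALLBACK is the sole bit set. */
static bool is_fallback_eq(unsigned int ret)
{
	return ret == DEMO_FAULT_FALLBACK;
}

/* Bit test: true whenever the FALLBACK bit is set, whatever else is set. */
static bool is_fallback_bit(unsigned int ret)
{
	return (ret & DEMO_FAULT_FALLBACK) != 0;
}

/* Usage: with two bits set, the equality test misses the fallback. */
static void demo_self_check(void)
{
	unsigned int ret = DEMO_FAULT_FALLBACK | DEMO_FAULT_MAJOR;

	assert(!is_fallback_eq(ret));	/* false negative */
	assert(is_fallback_bit(ret));	/* bit check still sees it */
}
```

With a single flag set, both helpers agree; they diverge only when a second bit accompanies the fallback bit, which is the false-negative case described above.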
And I don't understand why do you need to change code in __handle_mm_fault() at all. >From what I see change to huge_pmd_set_accessed() should be enough. > + goto pte_fault; > return ret; > - } else { > - huge_pmd_set_accessed(&vmf, orig_pmd); > - return 0; > + } > } > + > + huge_pmd_set_accessed(&vmf, orig_pmd); > + return 0; > } > } > - > +pte_fault: > return handle_pte_fault(&vmf); > } > > -- > 2.7.4 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Date: Thu, 22 Dec 2016 23:52:03 +0900 Message-ID: <20161222145203.GA18970@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <20161222081713.GA32480@node.shutemov.name> Sender: stable-owner@vger.kernel.org To: "Kirill A. Shutemov" Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab List-Id: linux-arch.vger.kernel.org Hello, On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. 
Shutemov wrote:

< snip >

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 36c774f..7408ddc 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  		if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> >  			return do_huge_pmd_numa_page(&vmf, orig_pmd);
> >
> > -		if ((vmf.flags & FAULT_FLAG_WRITE) &&
> > -				!pmd_write(orig_pmd)) {
> > -			ret = wp_huge_pmd(&vmf, orig_pmd);
> > -			if (!(ret & VM_FAULT_FALLBACK))
> > +		if (vmf.flags & FAULT_FLAG_WRITE) {
> > +			if (!pmd_write(orig_pmd)) {
> > +				ret = wp_huge_pmd(&vmf, orig_pmd);
> > +				if (ret == VM_FAULT_FALLBACK)
>
> In theory, more than one flag can be set and it would lead to
> false-negative. Bit check was the right thing.
>
> And I don't understand why do you need to change code in
> __handle_mm_fault() at all.
> From what I see change to huge_pmd_set_accessed() should be enough.

Yeb. Thanks for the review. Here v2 goes.

>From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
From: Minchan Kim
Date: Thu, 22 Dec 2016 23:43:49 +0900
Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is page fault handler supports only accessed flag emulation
for THP page of SW-dirty/accessed architecture.

This patch enables dirty-bit emulation for those architectures.
Without it, MADV_FREE makes application hang by repeated fault forever.

[1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Cc: Jason Evans
Cc: Kirill A. Shutemov
Cc: Will Deacon
Cc: Catalin Marinas
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: [4.5+]
Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Reported-by: Andreas Schwab
Signed-off-by: Minchan Kim
---
* from v1
  * Remove __handle_mm_fault part - Kirill

 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..29ec8a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	pmd_t entry;
 	unsigned long haddr;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto unlock;
 
 	entry = pmd_mkyoung(orig_pmd);
+	if (write)
+		entry = pmd_mkdirty(entry);
 	haddr = vmf->address & HPAGE_PMD_MASK;
-	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
-				vmf->flags & FAULT_FLAG_WRITE))
+	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
 		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
 
 unlock:
-- 
2.7.4

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andreas Schwab Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Date: Thu, 22 Dec 2016 23:12:32 +0100 Message-ID: <8737hftxyn.fsf@suse.de> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> Mime-Version: 1.0 Content-Type: text/plain Return-path: In-Reply-To: <20161222145203.GA18970@bbox> (Minchan Kim's message of "Thu, 22 Dec 2016 23:52:03 +0900") Sender: owner-linux-mm@kvack.org To: Minchan Kim Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . 
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" List-Id: linux-arch.vger.kernel.org On Dez 22 2016, Minchan Kim wrote: > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > From: Minchan Kim > Date: Thu, 22 Dec 2016 23:43:49 +0900 > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called Successfully tested a backport to 4.9. Andreas. -- Andreas Schwab, SUSE Labs, schwab@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Date: Fri, 23 Dec 2016 10:17:25 +0100 Message-ID: <20161223091725.GA23117@dhcp22.suse.cz> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from mx2.suse.de ([195.135.220.15]:55542 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751497AbcLWJRb (ORCPT ); Fri, 23 Dec 2016 04:17:31 -0500 Content-Disposition: inline In-Reply-To: <20161222145203.GA18970@bbox> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Minchan Kim Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab On Thu 22-12-16 23:52:03, Minchan Kim wrote: [...] > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > From: Minchan Kim > Date: Thu, 22 Dec 2016 23:43:49 +0900 > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. The changelog is rather terse and considering the issue is rather subtle and it aims the stable tree I think it could see more information. How do we end up looping in the page fault and why the dirty pmd stops it. Could you update the changelog to be more verbose, please? 
I am still digesting this patch but I believe it is correct fwiw... Thanks! > [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch@vger.kernel.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim > --- > * from v1 > * Remove __handle_mm_fault part - Kirill > > mm/huge_memory.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 10eedbf..29ec8a4 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) > { > pmd_t entry; > unsigned long haddr; > + bool write = vmf->flags & FAULT_FLAG_WRITE; > > vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); > if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) > goto unlock; > > entry = pmd_mkyoung(orig_pmd); > + if (write) > + entry = pmd_mkdirty(entry); > haddr = vmf->address & HPAGE_PMD_MASK; > - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, > - vmf->flags & FAULT_FLAG_WRITE)) > + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write)) > update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); > > unlock: > -- > 2.7.4 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Date: Fri, 23 Dec 2016 18:53:36 +0900 Message-ID: <20161223095336.GA5305@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: In-Reply-To: <20161223091725.GA23117@dhcp22.suse.cz> Content-Disposition: inline Sender: owner-linux-mm@kvack.org To: Michal Hocko Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab List-Id: linux-arch.vger.kernel.org Hi, On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > [...] > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > From: Minchan Kim > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > The problem is page fault handler supports only accessed flag emulation > > for THP page of SW-dirty/accessed architecture. > > > > This patch enables dirty-bit emulation for those architectures. > > Without it, MADV_FREE makes application hang by repeated fault forever. > > The changelog is rather terse and considering the issue is rather subtle > and it aims the stable tree I think it could see more information. How > do we end up looping in the page fault and why the dirty pmd stops it. > Could you update the changelog to be more verbose, please? 
I am still > digesting this patch but I believe it is correct fwiw... > How about this? Feel free to suggest better wording. Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de The problem is that the page fault handler currently doesn't support dirty-bit emulation of pte for non-HW dirty-bit architectures, so the application gets stuck until the VM marks the pmd dirty. How the emulation works depends on the architecture. In the case of arm64, when it first sets up a pte, it sets PTE_RDONLY on the pte to get a chance to mark the pte dirty by triggering a page fault when a store access happens. Once the page fault occurs, the VM marks the pte dirty and the arch code for setting the pte clears PTE_RDONLY so that the application can proceed. IOW, if the VM doesn't mark the pte dirty, the application hangs forever on repeated faults (i.e., a store op while the pte is still PTE_RDONLY). This patch enables dirty-bit emulation for those architectures. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Date: Fri, 23 Dec 2016 23:01:31 +0900 Message-ID: <20161223140131.GA5724@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> <20161223115421.GD23109@dhcp22.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: In-Reply-To: <20161223115421.GD23109@dhcp22.suse.cz> Content-Disposition: inline Sender: owner-linux-mm@kvack.org To: Michal Hocko Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . 
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab List-Id: linux-arch.vger.kernel.org On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote: > On Fri 23-12-16 18:53:36, Minchan Kim wrote: > > Hi, > > > > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > > > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > > > [...] > > > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > > > From: Minchan Kim > > > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > > > > > The problem is page fault handler supports only accessed flag emulation > > > > for THP page of SW-dirty/accessed architecture. > > > > > > > > This patch enables dirty-bit emulation for those architectures. > > > > Without it, MADV_FREE makes application hang by repeated fault forever. > > > > > > The changelog is rather terse and considering the issue is rather subtle > > > and it aims the stable tree I think it could see more information. How > > > do we end up looping in the page fault and why the dirty pmd stops it. > > > Could you update the changelog to be more verbose, please? I am still > > > digesting this patch but I believe it is correct fwiw... > > > > > > > How about this? Feel free to suggest better wording. > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > The problem is currently page fault handler doesn't supports dirty bit > > emulation of pte for non-HW dirty-bit architecture so that application > > s@pte@pmd@ ? It would be more clear. Will update with it. > > > stucks until VM marked the pmd dirty. 
> > > > How the emulation work depends on the architecture. In case of arm64, > > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to > > mark the pte dirty via triggering page fault when store access happens. > > Once the page fault occurs, VM marks the pte dirty and arch code for > > setting pte will clear PTE_RDONLY for application to proceed. > > > > IOW, if VM doesn't mark the pte dirty, application hangs forever by > > repeated fault(i.e., store op but the pte is PTE_RDONLY). > > > > This patch enables dirty-bit emulation for those architectures. > > Yes this is helpful and much more clear, thank you. One thing that is > still not clear to me is why cannot we handle that in the arch specific > code. I mean what is the side effect of doing pmd_mkdirty for > architectures which do not need it? For architectures which support a H/W access/dirty bit, that code path cannot be reached, so there is no side effect, I think. The only thing I can think of is a slight increase in code size. Maybe we could optimize that away with some ifdef magic, but I'm not sure it is worth it. We have handled pte (not pmd) emulation the same way for several decades. Anyway, it should be off-topic, I think. Thanks. > > -- > Michal Hocko > SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: 
Received: from mail-wm0-f67.google.com ([74.125.82.67]:35642 "EHLO mail-wm0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754479AbcLVSfo (ORCPT ); Thu, 22 Dec 2016 13:35:44 -0500 Received: by mail-wm0-f67.google.com with SMTP id l2so12124117wml.2 for ; Thu, 22 Dec 2016 10:35:37 -0800 (PST) Date: Thu, 22 Dec 2016 21:35:33 +0300 From: "Kirill A. Shutemov" Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Message-ID: <20161222183533.GA29876@node.shutemov.name> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161222145203.GA18970@bbox> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Minchan Kim Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab Message-ID: <20161222183533.XgtxGwMfWb2jEt596ahxMqLMU0EJdpBYVqoM6Z1j74Y@z> On Thu, Dec 22, 2016 at 11:52:03PM +0900, Minchan Kim wrote: > Hello, > > On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. 
Shutemov wrote: > > < snip > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 36c774f..7408ddc 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > > > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > > > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > > > > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > > > - !pmd_write(orig_pmd)) { > > > - ret = wp_huge_pmd(&vmf, orig_pmd); > > > - if (!(ret & VM_FAULT_FALLBACK)) > > > + if (vmf.flags & FAULT_FLAG_WRITE) { > > > + if (!pmd_write(orig_pmd)) { > > > + ret = wp_huge_pmd(&vmf, orig_pmd); > > > + if (ret == VM_FAULT_FALLBACK) > > > > In theory, more than one flag can be set and it would lead to > > false-negative. Bit check was the right thing. > > > > And I don't understand why do you need to change code in > > __handle_mm_fault() at all. > > From what I see change to huge_pmd_set_accessed() should be enough. > > Yeb. Thanks for the review. Here v2 goes. > > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > From: Minchan Kim > Date: Thu, 22 Dec 2016 23:43:49 +0900 > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. 
Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch@vger.kernel.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim Acked-by: Kirill A. Shutemov -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx2.suse.de ([195.135.220.15]:38002 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758635AbcLWLy0 (ORCPT ); Fri, 23 Dec 2016 06:54:26 -0500 Date: Fri, 23 Dec 2016 12:54:21 +0100 From: Michal Hocko Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Message-ID: <20161223115421.GD23109@dhcp22.suse.cz> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161223095336.GA5305@bbox> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Minchan Kim Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab Message-ID: <20161223115421.a55DfynvQNvZsjBzGuHGjPL_qjNxfq-gx99rQQ1pTBY@z> On Fri 23-12-16 18:53:36, Minchan Kim wrote: > Hi, > > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > > [...] > > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > > From: Minchan Kim > > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. 
> > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > > > The problem is page fault handler supports only accessed flag emulation > > > for THP page of SW-dirty/accessed architecture. > > > > > > This patch enables dirty-bit emulation for those architectures. > > > Without it, MADV_FREE makes application hang by repeated fault forever. > > > > The changelog is rather terse and considering the issue is rather subtle > > and it aims the stable tree I think it could see more information. How > > do we end up looping in the page fault and why the dirty pmd stops it. > > Could you update the changelog to be more verbose, please? I am still > > digesting this patch but I believe it is correct fwiw... > > > > How about this? Feel free to suggest better wording. > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is currently page fault handler doesn't supports dirty bit > emulation of pte for non-HW dirty-bit architecture so that application s@pte@pmd@ ? > stucks until VM marked the pmd dirty. > > How the emulation work depends on the architecture. In case of arm64, > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to > mark the pte dirty via triggering page fault when store access happens. > Once the page fault occurs, VM marks the pte dirty and arch code for > setting pte will clear PTE_RDONLY for application to proceed. > > IOW, if VM doesn't mark the pte dirty, application hangs forever by > repeated fault(i.e., store op but the pte is PTE_RDONLY). > > This patch enables dirty-bit emulation for those architectures. Yes this is helpful and much more clear, thank you. One thing that is still not clear to me is why cannot we handle that in the arch specific code. I mean what is the side effect of doing pmd_mkdirty for architectures which do not need it? 
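[Editor's sketch of the fault loop described in the quoted changelog. This is a toy user-space model under stated assumptions, not kernel or arm64 code: the TOY_* bits and toy_* helpers are hypothetical names. It only mirrors the described behaviour, where an entry stays read-only until it has been marked dirty, so a write fault handler that marks the entry young but never dirty cannot make progress.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry bits for this model only. */
#define TOY_YOUNG  0x1u
#define TOY_DIRTY  0x2u
#define TOY_RDONLY 0x4u

/* Arch-side install: the entry is writable only once it is dirty. */
static unsigned int toy_set_entry(unsigned int entry)
{
	if (entry & TOY_DIRTY)
		return entry & ~TOY_RDONLY;
	return entry | TOY_RDONLY;
}

/* Fault handler: always marks young; marks dirty on write only if asked. */
static unsigned int toy_fault(unsigned int entry, bool write, bool mkdirty)
{
	entry |= TOY_YOUNG;
	if (write && mkdirty)
		entry |= TOY_DIRTY;
	return toy_set_entry(entry);
}

/*
 * Retry a store until the entry is writable. Returns the number of
 * faults taken, or -1 if the entry is still read-only after max_faults.
 */
static int toy_store(unsigned int *entry, bool mkdirty, int max_faults)
{
	int faults = 0;

	assert(entry != NULL);
	while (*entry & TOY_RDONLY) {
		if (faults == max_faults)
			return -1;
		*entry = toy_fault(*entry, true, mkdirty);
		faults++;
	}
	return faults;
}
```

In this model, calling toy_store() with mkdirty set to false never clears TOY_RDONLY and gives up at the fault limit, while mkdirty set to true lets the store proceed after a single fault. That is the difference the pmd_mkdirty() call is meant to make in the changelog's description.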
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f66.google.com ([74.125.82.66]:32787 "EHLO mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750697AbcLVIYI (ORCPT ); Thu, 22 Dec 2016 03:24:08 -0500 Received: by mail-wm0-f66.google.com with SMTP id u144so34702953wmu.0 for ; Thu, 22 Dec 2016 00:24:08 -0800 (PST) Date: Thu, 22 Dec 2016 11:17:13 +0300 From: "Kirill A. Shutemov" Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Message-ID: <20161222081713.GA32480@node.shutemov.name> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1482364101-16204-1-git-send-email-minchan@kernel.org> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Minchan Kim Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . 
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab Message-ID: <20161222081713.vtP612Lh7L6C94C7MsTjCU5QE1GA5BdWqHMvpHe0w6s@z> On Thu, Dec 22, 2016 at 08:48:21AM +0900, Minchan Kim wrote: > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. I guess you wanted put b8d3c4c3009d before [1], right? > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch@vger.kernel.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim > --- > mm/huge_memory.c | 6 ++++-- > mm/memory.c | 18 ++++++++++-------- > 2 files changed, 14 insertions(+), 10 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 10eedbf..29ec8a4 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) > { > pmd_t entry; > unsigned long haddr; > + bool write = vmf->flags & FAULT_FLAG_WRITE; > > vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); > if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) > goto unlock; > > entry = pmd_mkyoung(orig_pmd); > + if (write) > + entry = pmd_mkdirty(entry); > haddr = vmf->address & HPAGE_PMD_MASK; > - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, > - vmf->flags & FAULT_FLAG_WRITE)) > + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write)) > 
update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); > > unlock: > diff --git a/mm/memory.c b/mm/memory.c > index 36c774f..7408ddc 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > - !pmd_write(orig_pmd)) { > - ret = wp_huge_pmd(&vmf, orig_pmd); > - if (!(ret & VM_FAULT_FALLBACK)) > + if (vmf.flags & FAULT_FLAG_WRITE) { > + if (!pmd_write(orig_pmd)) { > + ret = wp_huge_pmd(&vmf, orig_pmd); > + if (ret == VM_FAULT_FALLBACK) In theory, more than one flag can be set and it would lead to false-negative. Bit check was the right thing. And I don't understand why do you need to change code in __handle_mm_fault() at all. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from LGEAMRELO11.lge.com ([156.147.23.51]:43923 "EHLO lgeamrelo11.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752614AbcLVOwI (ORCPT ); Thu, 22 Dec 2016 09:52:08 -0500 Date: Thu, 22 Dec 2016 23:52:03 +0900 From: Minchan Kim Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Message-ID: <20161222145203.GA18970@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161222081713.GA32480@node.shutemov.name> Sender: linux-arch-owner@vger.kernel.org List-ID: To: "Kirill A. Shutemov" Cc: Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab Message-ID: <20161222145203.DLWbA7lWDT7kWb5msHDSgysswHn2VClv7-2FKdUm0d8@z> Hello, On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. 
Shutemov wrote: < snip > > > diff --git a/mm/memory.c b/mm/memory.c > > index 36c774f..7408ddc 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > > - !pmd_write(orig_pmd)) { > > - ret = wp_huge_pmd(&vmf, orig_pmd); > > - if (!(ret & VM_FAULT_FALLBACK)) > > + if (vmf.flags & FAULT_FLAG_WRITE) { > > + if (!pmd_write(orig_pmd)) { > > + ret = wp_huge_pmd(&vmf, orig_pmd); > > + if (ret == VM_FAULT_FALLBACK) > > In theory, more than one flag can be set and it would lead to > false-negative. Bit check was the right thing. > > And I don't understand why do you need to change code in > __handle_mm_fault() at all. > From what I see change to huge_pmd_set_accessed() should be enough. Yeb. Thanks for the review. Here v2 goes. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx2.suse.de ([195.135.220.15]:46454 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757203AbcLVWMl (ORCPT ); Thu, 22 Dec 2016 17:12:41 -0500 From: Andreas Schwab Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> Date: Thu, 22 Dec 2016 23:12:32 +0100 In-Reply-To: <20161222145203.GA18970@bbox> (Minchan Kim's message of "Thu, 22 Dec 2016 23:52:03 +0900") Message-ID: <8737hftxyn.fsf@suse.de> MIME-Version: 1.0 Content-Type: text/plain Sender: linux-arch-owner@vger.kernel.org List-ID: To: Minchan Kim Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . 
Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" Message-ID: <20161222221232.15x89yP2nGIBVPHSfjjvzgX74dvIk1aqPZQjVlejDdM@z> On Dez 22 2016, Minchan Kim wrote: > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > From: Minchan Kim > Date: Thu, 22 Dec 2016 23:43:49 +0900 > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called Successfully tested a backport to 4.9. Andreas. -- Andreas Schwab, SUSE Labs, schwab@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different." From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from LGEAMRELO11.lge.com ([156.147.23.51]:45296 "EHLO lgeamrelo11.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932659AbcLWJxk (ORCPT ); Fri, 23 Dec 2016 04:53:40 -0500 Date: Fri, 23 Dec 2016 18:53:36 +0900 From: Minchan Kim Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Message-ID: <20161223095336.GA5305@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> MIME-Version: 1.0 In-Reply-To: <20161223091725.GA23117@dhcp22.suse.cz> Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline Sender: linux-arch-owner@vger.kernel.org List-ID: To: Michal Hocko Cc: "Kirill A. 
Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab Message-ID: <20161223095336.EWCvDryk29tppoDUyuSUCK3eUsxWb_37wuAccbtywUU@z> Hi, On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > [...] > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > From: Minchan Kim > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > The problem is page fault handler supports only accessed flag emulation > > for THP page of SW-dirty/accessed architecture. > > > > This patch enables dirty-bit emulation for those architectures. > > Without it, MADV_FREE makes application hang by repeated fault forever. > > The changelog is rather terse and considering the issue is rather subtle > and it aims the stable tree I think it could see more information. How > do we end up looping in the page fault and why the dirty pmd stops it. > Could you update the changelog to be more verbose, please? I am still > digesting this patch but I believe it is correct fwiw... > How about this? Feel free to suggest better wording. Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de The problem is currently page fault handler doesn't supports dirty bit emulation of pte for non-HW dirty-bit architecture so that application stucks until VM marked the pmd dirty. How the emulation work depends on the architecture. In case of arm64, when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to mark the pte dirty via triggering page fault when store access happens. 
Once the page fault occurs, VM marks the pte dirty and arch code for setting pte will clear PTE_RDONLY for application to proceed. IOW, if VM doesn't mark the pte dirty, application hangs forever by repeated fault(i.e., store op but the pte is PTE_RDONLY). This patch enables dirty-bit emulation for those architectures. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from LGEAMRELO13.lge.com ([156.147.23.53]:36670 "EHLO lgeamrelo13.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758723AbcLWOBf (ORCPT ); Fri, 23 Dec 2016 09:01:35 -0500 Date: Fri, 23 Dec 2016 23:01:31 +0900 From: Minchan Kim Subject: Re: [PATCH] mm: pmd dirty emulation in page fault handler Message-ID: <20161223140131.GA5724@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> <20161223115421.GD23109@dhcp22.suse.cz> MIME-Version: 1.0 In-Reply-To: <20161223115421.GD23109@dhcp22.suse.cz> Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline Sender: linux-arch-owner@vger.kernel.org List-ID: To: Michal Hocko Cc: "Kirill A. Shutemov" , Andrew Morton , linux-mm@kvack.org, Jason Evans , "Kirill A . Shutemov" , Will Deacon , Catalin Marinas , linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, "[4.5+]" , Andreas Schwab Message-ID: <20161223140131.uCiB2D1pYJQ7Mn5aR9vKKYeVvjMpjaVdnOkgSLnWI6I@z> On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote: > On Fri 23-12-16 18:53:36, Minchan Kim wrote: > > Hi, > > > > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > > > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > > > [...] 
> > > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > > > From: Minchan Kim > > > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > > > > > The problem is page fault handler supports only accessed flag emulation > > > > for THP page of SW-dirty/accessed architecture. > > > > > > > > This patch enables dirty-bit emulation for those architectures. > > > > Without it, MADV_FREE makes application hang by repeated fault forever. > > > > > > The changelog is rather terse and considering the issue is rather subtle > > > and it aims the stable tree I think it could see more information. How > > > do we end up looping in the page fault and why the dirty pmd stops it. > > > Could you update the changelog to be more verbose, please? I am still > > > digesting this patch but I believe it is correct fwiw... > > > > > > > How about this? Feel free to suggest better wording. > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > The problem is currently page fault handler doesn't supports dirty bit > > emulation of pte for non-HW dirty-bit architecture so that application > > s@pte@pmd@ ? It would be more clear. Will update with it. > > > stucks until VM marked the pmd dirty. > > > > How the emulation work depends on the architecture. In case of arm64, > > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to > > mark the pte dirty via triggering page fault when store access happens. > > Once the page fault occurs, VM marks the pte dirty and arch code for > > setting pte will clear PTE_RDONLY for application to proceed. 
> >
> > IOW, if VM doesn't mark the pte dirty, application hangs forever by
> > repeated fault(i.e., store op but the pte is PTE_RDONLY).
> >
> > This patch enables dirty-bit emulation for those architectures.
>
> Yes this is helpful and much more clear, thank you. One thing that is
> still not clear to me is why cannot we handle that in the arch specific
> code. I mean what is the side effect of doing pmd_mkdirty for
> architectures which do not need it?

For architectures which support H/W access/dirty bits, that code path
can't be reached, so there is no side effect, I think. The only thing
I can think of is a slight increase in code size. Maybe we could
optimize it away with some ifdef magic, but I'm not sure it's worth it.
We have handled pte (not pmd) emulation the same way for several
decades. Anyway, that is off-topic, I think.

Thanks.

>
> --
> Michal Hocko
> SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: minchan@kernel.org (Minchan Kim)
Date: Thu, 22 Dec 2016 08:48:21 +0900
Subject: [PATCH] mm: pmd dirty emulation in page fault handler
Message-ID: <1482364101-16204-1-git-send-email-minchan@kernel.org>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is page fault handler supports only accessed flag emulation
for THP page of SW-dirty/accessed architecture.

This patch enables dirty-bit emulation for those architectures.
Without it, MADV_FREE makes application hang by repeated fault forever.

[1] mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Cc: Jason Evans
Cc: Kirill A.
Shutemov Cc: Will Deacon Cc: Catalin Marinas Cc: linux-arch at vger.kernel.org Cc: linux-arm-kernel at lists.infradead.org Cc: [4.5+] Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Reported-by: Andreas Schwab Signed-off-by: Minchan Kim --- mm/huge_memory.c | 6 ++++-- mm/memory.c | 18 ++++++++++-------- 2 files changed, 14 insertions(+), 10 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 10eedbf..29ec8a4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) { pmd_t entry; unsigned long haddr; + bool write = vmf->flags & FAULT_FLAG_WRITE; vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) goto unlock; entry = pmd_mkyoung(orig_pmd); + if (write) + entry = pmd_mkdirty(entry); haddr = vmf->address & HPAGE_PMD_MASK; - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, - vmf->flags & FAULT_FLAG_WRITE)) + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write)) update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); unlock: diff --git a/mm/memory.c b/mm/memory.c index 36c774f..7408ddc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf, orig_pmd); - if ((vmf.flags & FAULT_FLAG_WRITE) && - !pmd_write(orig_pmd)) { - ret = wp_huge_pmd(&vmf, orig_pmd); - if (!(ret & VM_FAULT_FALLBACK)) + if (vmf.flags & FAULT_FLAG_WRITE) { + if (!pmd_write(orig_pmd)) { + ret = wp_huge_pmd(&vmf, orig_pmd); + if (ret == VM_FAULT_FALLBACK) + goto pte_fault; return ret; - } else { - huge_pmd_set_accessed(&vmf, orig_pmd); - return 0; + } } + + huge_pmd_set_accessed(&vmf, orig_pmd); + return 0; } } - +pte_fault: return handle_pte_fault(&vmf); } -- 2.7.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: 
kirill@shutemov.name (Kirill A. Shutemov) Date: Thu, 22 Dec 2016 11:17:13 +0300 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <1482364101-16204-1-git-send-email-minchan@kernel.org> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> Message-ID: <20161222081713.GA32480@node.shutemov.name> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Thu, Dec 22, 2016 at 08:48:21AM +0900, Minchan Kim wrote: > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. I guess you wanted put b8d3c4c3009d before [1], right? > http://lkml.kernel.org/r/mvmmvfy37g1.fsf at hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. 
Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch at vger.kernel.org > Cc: linux-arm-kernel at lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim > --- > mm/huge_memory.c | 6 ++++-- > mm/memory.c | 18 ++++++++++-------- > 2 files changed, 14 insertions(+), 10 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 10eedbf..29ec8a4 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) > { > pmd_t entry; > unsigned long haddr; > + bool write = vmf->flags & FAULT_FLAG_WRITE; > > vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); > if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) > goto unlock; > > entry = pmd_mkyoung(orig_pmd); > + if (write) > + entry = pmd_mkdirty(entry); > haddr = vmf->address & HPAGE_PMD_MASK; > - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, > - vmf->flags & FAULT_FLAG_WRITE)) > + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write)) > update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); > > unlock: > diff --git a/mm/memory.c b/mm/memory.c > index 36c774f..7408ddc 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > - !pmd_write(orig_pmd)) { > - ret = wp_huge_pmd(&vmf, orig_pmd); > - if (!(ret & VM_FAULT_FALLBACK)) > + if (vmf.flags & FAULT_FLAG_WRITE) { > + if (!pmd_write(orig_pmd)) { > + ret = wp_huge_pmd(&vmf, orig_pmd); > + if (ret == VM_FAULT_FALLBACK) In theory, more than one flag can be set and it would lead to false-negative. Bit check was the right thing. 
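The point about the bit check can be shown with a minimal, standalone sketch. The flag values below are made up for illustration and are not the kernel's VM_FAULT_* definitions; the helper names are hypothetical as well. The sketch only assumes that a fault return value is a bitmask in which several flags may be set at once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only (not the kernel's). */
#define DEMO_FAULT_MAJOR    0x0004u
#define DEMO_FAULT_FALLBACK 0x0800u

/* Equality test: true only when FALLBACK is the sole bit set. */
static inline bool fallback_by_equality(unsigned int ret)
{
    return ret == DEMO_FAULT_FALLBACK;
}

/* Bit test: true whenever FALLBACK is set, whatever else is set. */
static inline bool fallback_by_bit(unsigned int ret)
{
    return (ret & DEMO_FAULT_FALLBACK) != 0;
}

/* Usage sketch: a return value that carries two flags at once. */
static inline void demo_fallback_check(void)
{
    unsigned int ret = DEMO_FAULT_FALLBACK | DEMO_FAULT_MAJOR;

    assert(fallback_by_bit(ret));       /* bit test sees the fallback */
    assert(!fallback_by_equality(ret)); /* equality test misses it */
}
```

With an equality test, a caller that is supposed to fall back to the pte path would instead return early whenever a second flag accompanies the fallback bit, which is the false negative described above.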
And I don't understand why do you need to change code in __handle_mm_fault() at all. >>From what I see change to huge_pmd_set_accessed() should be enough. > + goto pte_fault; > return ret; > - } else { > - huge_pmd_set_accessed(&vmf, orig_pmd); > - return 0; > + } > } > + > + huge_pmd_set_accessed(&vmf, orig_pmd); > + return 0; > } > } > - > +pte_fault: > return handle_pte_fault(&vmf); > } > > -- > 2.7.4 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo at kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email at kvack.org -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 From: minchan@kernel.org (Minchan Kim) Date: Thu, 22 Dec 2016 23:52:03 +0900 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161222081713.GA32480@node.shutemov.name> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> Message-ID: <20161222145203.GA18970@bbox> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hello, On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. Shutemov wrote: < snip > > > diff --git a/mm/memory.c b/mm/memory.c > > index 36c774f..7408ddc 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > > - !pmd_write(orig_pmd)) { > > - ret = wp_huge_pmd(&vmf, orig_pmd); > > - if (!(ret & VM_FAULT_FALLBACK)) > > + if (vmf.flags & FAULT_FLAG_WRITE) { > > + if (!pmd_write(orig_pmd)) { > > + ret = wp_huge_pmd(&vmf, orig_pmd); > > + if (ret == VM_FAULT_FALLBACK) > > In theory, more than one flag can be set and it would lead to > false-negative. Bit check was the right thing. 
> > And I don't understand why do you need to change code in > __handle_mm_fault() at all. > From what I see change to huge_pmd_set_accessed() should be enough. Yeb. Thanks for the review. Here v2 goes. >>From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 22 Dec 2016 23:43:49 +0900 Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. http://lkml.kernel.org/r/mvmmvfy37g1.fsf at hawking.suse.de The problem is page fault handler supports only accessed flag emulation for THP page of SW-dirty/accessed architecture. This patch enables dirty-bit emulation for those architectures. Without it, MADV_FREE makes application hang by repeated fault forever. [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called Cc: Jason Evans Cc: Kirill A. Shutemov Cc: Will Deacon Cc: Catalin Marinas Cc: linux-arch at vger.kernel.org Cc: linux-arm-kernel at lists.infradead.org Cc: [4.5+] Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Reported-by: Andreas Schwab Signed-off-by: Minchan Kim --- * from v1 * Remove __handle_mm_fault part - Kirill mm/huge_memory.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 10eedbf..29ec8a4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) { pmd_t entry; unsigned long haddr; + bool write = vmf->flags & FAULT_FLAG_WRITE; vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) goto unlock; entry = pmd_mkyoung(orig_pmd); + if (write) + entry = pmd_mkdirty(entry); haddr = vmf->address & HPAGE_PMD_MASK; - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, - vmf->flags & FAULT_FLAG_WRITE)) + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, 
write)) update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); unlock: -- 2.7.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: kirill@shutemov.name (Kirill A. Shutemov) Date: Thu, 22 Dec 2016 21:35:33 +0300 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161222145203.GA18970@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> Message-ID: <20161222183533.GA29876@node.shutemov.name> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Thu, Dec 22, 2016 at 11:52:03PM +0900, Minchan Kim wrote: > Hello, > > On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. Shutemov wrote: > > < snip > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 36c774f..7408ddc 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > > > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) > > > return do_huge_pmd_numa_page(&vmf, orig_pmd); > > > > > > - if ((vmf.flags & FAULT_FLAG_WRITE) && > > > - !pmd_write(orig_pmd)) { > > > - ret = wp_huge_pmd(&vmf, orig_pmd); > > > - if (!(ret & VM_FAULT_FALLBACK)) > > > + if (vmf.flags & FAULT_FLAG_WRITE) { > > > + if (!pmd_write(orig_pmd)) { > > > + ret = wp_huge_pmd(&vmf, orig_pmd); > > > + if (ret == VM_FAULT_FALLBACK) > > > > In theory, more than one flag can be set and it would lead to > > false-negative. Bit check was the right thing. > > > > And I don't understand why do you need to change code in > > __handle_mm_fault() at all. > > From what I see change to huge_pmd_set_accessed() should be enough. > > Yeb. Thanks for the review. Here v2 goes. 
> > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > From: Minchan Kim > Date: Thu, 22 Dec 2016 23:43:49 +0900 > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > http://lkml.kernel.org/r/mvmmvfy37g1.fsf at hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. > > [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch at vger.kernel.org > Cc: linux-arm-kernel at lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim Acked-by: Kirill A. Shutemov -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 From: mhocko@kernel.org (Michal Hocko) Date: Fri, 23 Dec 2016 10:17:25 +0100 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161222145203.GA18970@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> Message-ID: <20161223091725.GA23117@dhcp22.suse.cz> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Thu 22-12-16 23:52:03, Minchan Kim wrote: [...] > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > From: Minchan Kim > Date: Thu, 22 Dec 2016 23:43:49 +0900 > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. 
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf at hawking.suse.de > > The problem is page fault handler supports only accessed flag emulation > for THP page of SW-dirty/accessed architecture. > > This patch enables dirty-bit emulation for those architectures. > Without it, MADV_FREE makes application hang by repeated fault forever. The changelog is rather terse and considering the issue is rather subtle and it aims the stable tree I think it could see more information. How do we end up looping in the page fault and why the dirty pmd stops it. Could you update the changelog to be more verbose, please? I am still digesting this patch but I believe it is correct fwiw... Thanks! > [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called > > Cc: Jason Evans > Cc: Kirill A. Shutemov > Cc: Will Deacon > Cc: Catalin Marinas > Cc: linux-arch at vger.kernel.org > Cc: linux-arm-kernel at lists.infradead.org > Cc: [4.5+] > Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") > Reported-by: Andreas Schwab > Signed-off-by: Minchan Kim > --- > * from v1 > * Remove __handle_mm_fault part - Kirill > > mm/huge_memory.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 10eedbf..29ec8a4 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) > { > pmd_t entry; > unsigned long haddr; > + bool write = vmf->flags & FAULT_FLAG_WRITE; > > vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); > if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) > goto unlock; > > entry = pmd_mkyoung(orig_pmd); > + if (write) > + entry = pmd_mkdirty(entry); > haddr = vmf->address & HPAGE_PMD_MASK; > - if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, > - vmf->flags & FAULT_FLAG_WRITE)) > + if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write)) > 
update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd); > > unlock: > -- > 2.7.4 -- Michal Hocko SUSE Labs
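The question raised earlier in the thread (how the fault ends up looping, and why a dirty pmd stops it) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the DEMO_* bits and demo_* functions are hypothetical and are not kernel or arm64 definitions. It assumes, as the thread describes for arm64, that a clean writable entry carries a read-only bit and that the arch code clears that bit only once the entry has been marked dirty.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical entry bits, for illustration only. */
#define DEMO_YOUNG  0x1u
#define DEMO_DIRTY  0x2u
#define DEMO_WRITE  0x4u /* mapping permits writes */
#define DEMO_RDONLY 0x8u /* stores trap while this is set */

/* Arch-side step: drop RDONLY only for a dirty, writable entry. */
static inline unsigned int demo_set_access_flags(unsigned int entry)
{
    if ((entry & DEMO_DIRTY) && (entry & DEMO_WRITE))
        entry &= ~DEMO_RDONLY;
    return entry;
}

/* Generic fault step, loosely shaped like huge_pmd_set_accessed(). */
static inline unsigned int demo_fault(unsigned int entry, bool write,
                                      bool emulate_dirty)
{
    entry |= DEMO_YOUNG;
    if (write && emulate_dirty)
        entry |= DEMO_DIRTY;
    return demo_set_access_flags(entry);
}

/*
 * Retry a store until it succeeds. Returns the number of faults taken,
 * or -1 if the store still traps after max_faults attempts.
 */
static inline int demo_store(unsigned int entry, bool emulate_dirty,
                             int max_faults)
{
    int faults = 0;

    while (entry & DEMO_RDONLY) {
        if (faults == max_faults)
            return -1;
        entry = demo_fault(entry, true, emulate_dirty);
        faults++;
    }
    return faults;
}

/* Usage sketch: a clean, writable entry, as after the dirty bit is cleared. */
static inline void demo_madv_free_case(void)
{
    unsigned int clean = DEMO_WRITE | DEMO_RDONLY;

    assert(demo_store(clean, false, 1000) == -1); /* accessed-only: loops */
    assert(demo_store(clean, true, 1000) == 1);   /* dirty emulation: one fault */
}
```

In this model, a handler that only sets the accessed bit never changes what makes the store trap, so every retry faults again; marking the entry dirty on a write fault lets the arch step clear the read-only bit and the store proceeds after a single fault.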
From mboxrd@z Thu Jan 1 00:00:00 1970 From: mhocko@kernel.org (Michal Hocko) Date: Fri, 23 Dec 2016 12:54:21 +0100 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161223095336.GA5305@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> Message-ID: <20161223115421.GD23109@dhcp22.suse.cz> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Fri 23-12-16 18:53:36, Minchan Kim wrote: > Hi, > > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > > [...] > > > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > > From: Minchan Kim > > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > > > The problem is page fault handler supports only accessed flag emulation > > > for THP page of SW-dirty/accessed architecture. > > > > > > This patch enables dirty-bit emulation for those architectures. > > > Without it, MADV_FREE makes application hang by repeated fault forever. > > > > The changelog is rather terse and considering the issue is rather subtle > > and it aims the stable tree I think it could see more information. How > > do we end up looping in the page fault and why the dirty pmd stops it. > > Could you update the changelog to be more verbose, please? I am still > > digesting this patch but I believe it is correct fwiw... > > > > How about this? Feel free to suggest better wording. > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > The problem is currently page fault handler doesn't supports dirty bit > emulation of pte for non-HW dirty-bit architecture so that application s@pte@pmd@ ? > stucks until VM marked the pmd dirty. > > How the emulation work depends on the architecture. In case of arm64, > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to > mark the pte dirty via triggering page fault when store access happens. > Once the page fault occurs, VM marks the pte dirty and arch code for > setting pte will clear PTE_RDONLY for application to proceed. > > IOW, if VM doesn't mark the pte dirty, application hangs forever by > repeated fault(i.e., store op but the pte is PTE_RDONLY). > > This patch enables dirty-bit emulation for those architectures. Yes this is helpful and much more clear, thank you. One thing that is still not clear to me is why cannot we handle that in the arch specific code. I mean what is the side effect of doing pmd_mkdirty for architectures which do not need it? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: minchan@kernel.org (Minchan Kim) Date: Fri, 23 Dec 2016 18:53:36 +0900 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161223091725.GA23117@dhcp22.suse.cz> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> Message-ID: <20161223095336.GA5305@bbox> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hi, On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > [...]
> > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > From: Minchan Kim > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > The problem is page fault handler supports only accessed flag emulation > > for THP page of SW-dirty/accessed architecture. > > > > This patch enables dirty-bit emulation for those architectures. > > Without it, MADV_FREE makes application hang by repeated fault forever. > > The changelog is rather terse and considering the issue is rather subtle > and it aims the stable tree I think it could see more information. How > do we end up looping in the page fault and why the dirty pmd stops it. > Could you update the changelog to be more verbose, please? I am still > digesting this patch but I believe it is correct fwiw... > How about this? Feel free to suggest better wording. Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de The problem is currently page fault handler doesn't supports dirty bit emulation of pte for non-HW dirty-bit architecture so that application stucks until VM marked the pmd dirty. How the emulation work depends on the architecture. In case of arm64, when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to mark the pte dirty via triggering page fault when store access happens. Once the page fault occurs, VM marks the pte dirty and arch code for setting pte will clear PTE_RDONLY for application to proceed. IOW, if VM doesn't mark the pte dirty, application hangs forever by repeated fault(i.e., store op but the pte is PTE_RDONLY). This patch enables dirty-bit emulation for those architectures.
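[Editorial illustration, not part of the thread.] The loop described in the changelog above can be modelled in a few lines of user-space C. This is only a rough sketch under stated assumptions: the TOY_* bits, toy_pmd_t and the toy_* functions are hypothetical names invented for this example, and they do not reproduce the arm64 page-table encoding or the kernel's fault path. The only behaviour taken from the thread is that a software-dirty architecture maps the entry read-only first, and that the read-only bit is cleared when the entry is installed with the dirty bit set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy entry; NOT the real arm64 layout. */
#define TOY_YOUNG  (1u << 0)
#define TOY_DIRTY  (1u << 1)
#define TOY_RDONLY (1u << 2)

typedef uint32_t toy_pmd_t;

/*
 * Stand-in for the arch step that installs the updated entry: the
 * read-only bit is dropped only if the entry being installed is dirty.
 */
static toy_pmd_t toy_set_access_flags(toy_pmd_t entry)
{
	if (entry & TOY_DIRTY)
		entry &= ~TOY_RDONLY;
	return entry;
}

/*
 * Stand-in for the fault handler. With mark_dirty_on_write == false it
 * only emulates the accessed flag (the behaviour the changelog calls
 * the problem); with true it also marks the entry dirty on a write
 * fault, as the patch does.
 */
static toy_pmd_t toy_fault(toy_pmd_t pmd, bool write, bool mark_dirty_on_write)
{
	toy_pmd_t entry = pmd | TOY_YOUNG;

	if (write && mark_dirty_on_write)
		entry |= TOY_DIRTY;
	return toy_set_access_flags(entry);
}

/*
 * Simulate one store instruction. Returns the number of faults taken
 * before the store can proceed, or -1 if the entry is still read-only
 * after max_faults attempts (the "hang" case, cut short).
 */
static int toy_store(toy_pmd_t pmd, bool mark_dirty_on_write, int max_faults)
{
	int faults = 0;

	while (pmd & TOY_RDONLY) {
		if (faults == max_faults)
			return -1;
		pmd = toy_fault(pmd, true, mark_dirty_on_write);
		faults++;
	}
	return faults;
}
```

In this model, toy_store(TOY_RDONLY, true, 10) returns 1 because the first write fault marks the entry dirty and the read-only bit goes away, while toy_store(TOY_RDONLY, false, 10) returns -1 because every fault leaves the entry read-only, which corresponds to the endless refault reported for jemalloc.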
From mboxrd@z Thu Jan 1 00:00:00 1970 From: mhocko@kernel.org (Michal Hocko) Date: Fri, 23 Dec 2016 15:53:05 +0100 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161223140131.GA5724@bbox> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> <20161223115421.GD23109@dhcp22.suse.cz> <20161223140131.GA5724@bbox> Message-ID: <20161223145305.GF23109@dhcp22.suse.cz> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Fri 23-12-16 23:01:31, Minchan Kim wrote: > On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote: > > On Fri 23-12-16 18:53:36, Minchan Kim wrote: [...] > > > stucks until VM marked the pmd dirty. > > > > > > How the emulation work depends on the architecture. In case of arm64, > > > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to > > > mark the pte dirty via triggering page fault when store access happens. > > > Once the page fault occurs, VM marks the pte dirty and arch code for > > > setting pte will clear PTE_RDONLY for application to proceed. > > > > > > IOW, if VM doesn't mark the pte dirty, application hangs forever by > > > repeated fault(i.e., store op but the pte is PTE_RDONLY). > > > > > > This patch enables dirty-bit emulation for those architectures. > > > > Yes this is helpful and much more clear, thank you. One thing that is > > still not clear to me is why cannot we handle that in the arch specific > > code. I mean what is the side effect of doing pmd_mkdirty for > > architectures which do not need it? > > For architecture which supports H/W access/dirty bit, it couldn't be > reached there code path so there is no side effect, I think. ahh, I knew I was missing something. It definitely wasn't obvious to me and my x86 config it simply generates code to call huge_pmd_set_accessed. 
> A thing > I can think of is just increasing code size little bit. Maybe, we > could optimize away some ifdef magic but not sure worth it. it is not -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: minchan@kernel.org (Minchan Kim) Date: Fri, 23 Dec 2016 23:01:31 +0900 Subject: [PATCH] mm: pmd dirty emulation in page fault handler In-Reply-To: <20161223115421.GD23109@dhcp22.suse.cz> References: <1482364101-16204-1-git-send-email-minchan@kernel.org> <20161222081713.GA32480@node.shutemov.name> <20161222145203.GA18970@bbox> <20161223091725.GA23117@dhcp22.suse.cz> <20161223095336.GA5305@bbox> <20161223115421.GD23109@dhcp22.suse.cz> Message-ID: <20161223140131.GA5724@bbox> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote: > On Fri 23-12-16 18:53:36, Minchan Kim wrote: > > Hi, > > > > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote: > > > On Thu 22-12-16 23:52:03, Minchan Kim wrote: > > > [...] > > > > From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001 > > > > From: Minchan Kim > > > > Date: Thu, 22 Dec 2016 23:43:49 +0900 > > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler > > > > > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > > > > > The problem is page fault handler supports only accessed flag emulation > > > > for THP page of SW-dirty/accessed architecture. > > > > > > > > This patch enables dirty-bit emulation for those architectures. > > > > Without it, MADV_FREE makes application hang by repeated fault forever. > > > > > > The changelog is rather terse and considering the issue is rather subtle > > > and it aims the stable tree I think it could see more information. How > > > do we end up looping in the page fault and why the dirty pmd stops it.
> > > Could you update the changelog to be more verbose, please? I am still > > > digesting this patch but I believe it is correct fwiw... > > > > > > > How about this? Feel free to suggest better wording. > > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64. > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de > > > > The problem is currently page fault handler doesn't supports dirty bit > > emulation of pte for non-HW dirty-bit architecture so that application > > s@pte@pmd@ ? It would be more clear. Will update with it. > > > stucks until VM marked the pmd dirty. > > > > How the emulation work depends on the architecture. In case of arm64, > > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to > > mark the pte dirty via triggering page fault when store access happens. > > Once the page fault occurs, VM marks the pte dirty and arch code for > > setting pte will clear PTE_RDONLY for application to proceed. > > > > IOW, if VM doesn't mark the pte dirty, application hangs forever by > > repeated fault(i.e., store op but the pte is PTE_RDONLY). > > > > This patch enables dirty-bit emulation for those architectures. > > Yes this is helpful and much more clear, thank you. One thing that is > still not clear to me is why cannot we handle that in the arch specific > code. I mean what is the side effect of doing pmd_mkdirty for > architectures which do not need it? For architecture which supports H/W access/dirty bit, it couldn't be reached there code path so there is no side effect, I think. A thing I can think of is just increasing code size little bit. Maybe, we could optimize away some ifdef magic but not sure worth it. We have been same way pte(not pmd) emulation handling for several decades. Anyway, it should be off-topic, I think. Thanks. > > -- > Michal Hocko > SUSE Labs