From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yb1-f197.google.com (mail-yb1-f197.google.com [209.85.219.197]) by kanga.kvack.org (Postfix) with ESMTP id 398A78E0001 for ; Fri, 21 Sep 2018 11:01:43 -0400 (EDT)
Received: by mail-yb1-f197.google.com with SMTP id h10-v6so2216965ybm.3 for ; Fri, 21 Sep 2018 08:01:43 -0700 (PDT)
Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id 9-v6sor3002815yby.44.2018.09.21.08.01.41 for (Google Transport Security); Fri, 21 Sep 2018 08:01:41 -0700 (PDT)
MIME-Version: 1.0
From: Chulmin Kim
Date: Sat, 22 Sep 2018 00:01:30 +0900
Message-ID:
Subject: Question about a pte with PTE_PROT_NONE and !PTE_VALID on !PROT_NONE vma
Content-Type: multipart/alternative; boundary="000000000000be7814057662ea05"
Sender: owner-linux-mm@kvack.org
List-ID:
To: "linux-mm@kvack.org"

--000000000000be7814057662ea05
Content-Type: text/plain; charset="UTF-8"

Hi all.
I am developing an Android smartphone.

I am facing a problem where a thread loops in the page fault routine forever.
(The kernel version is around v4.4, though it may differ slightly from mainline, as the problem occurs on a device being developed at my company.)

The pte corresponding to the fault address has PTE_PROT_NONE set and PTE_VALID clear.
(By the way, the pte maps an anon page (ashmem).)
The weird thing, in my opinion, is that the VMA of the fault address is not PROT_NONE but PROT_READ | PROT_WRITE.
So the page fault routine (handle_pte_fault()) returns 0 and the fault loops forever.

I don't think this is a normal situation.

As I didn't enable NUMA, a pte with PROT_NONE and !PTE_VALID was likely set by mprotect().
1. mprotect(PROT_NONE) -> vma split & set pte with PROT_NONE
2. mprotect(PROT_READ | PROT_WRITE) -> vma merge & revert pte
I suspect that the pte revert in #2 somehow didn't work, but I have no clue why.
I googled and found a similar situation (http://linux-kernel.2935.n7.nabble.com/pipe-page-fault-oddness-td953839.html), which is related to NUMA and huge pagetable configs, while my device has nothing to do with those configs.

Am I missing any possible scenario? Or is this an already known bug?
I would appreciate any ideas about this problem.

Thanks.
Chulmin Kim

--000000000000be7814057662ea05
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--000000000000be7814057662ea05--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf1-f200.google.com (mail-pf1-f200.google.com [209.85.210.200]) by kanga.kvack.org (Postfix) with ESMTP id 7939A8E0001 for ; Sat, 22 Sep 2018 00:38:14 -0400 (EDT)
Received: by mail-pf1-f200.google.com with SMTP id b69-v6so7421571pfc.20 for ; Fri, 21 Sep 2018 21:38:14 -0700 (PDT)
Received: from mailout3.samsung.com (mailout3.samsung.com. [203.254.224.33]) by mx.google.com with ESMTPS id x14-v6si6068164pfi.138.2018.09.21.21.38.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 21 Sep 2018 21:38:12 -0700 (PDT)
MIME-version: 1.0
Content-type: text/plain; charset="utf-8"; format="flowed"
Subject: Re: Question about a pte with PTE_PROT_NONE and !PTE_VALID on !PROT_NONE vma
From: Chulmin Kim
Message-id: <10146a73-4788-ba89-001f-f928bbb314f5@samsung.com>
Date: Sat, 22 Sep 2018 13:38:07 +0900
In-reply-to:
Content-transfer-encoding: 8bit
Content-language: en-US
References:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Chulmin Kim , "linux-mm@kvack.org" , aarcange@redhat.com

Dear Arcangeli,

I think this problem is closely related to the race condition described in the commit below.
(e86f15ee64d8, mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk)

I checked that the thread and its child threads are repeatedly calling mprotect(PROT_{NONE or R|W}), although I have not reproduced the problem yet.

Do you think this is one of the phenomena you expected from the race condition described in the above commit?

Thanks.
Chulmin Kim

On 09/22/2018 12:01 AM, Chulmin Kim wrote:
> Hi all.
> I am developing an android smartphone.
>
> I am facing a problem that a thread is looping the page fault routine
> forever.
> (The kernel version is around v4.4 though it may differ from the
> mainline slightly
> as the problem occurs in a device being developed in my company.)
>
> The pte corresponding to the fault address is with PTE_PROT_NONE and
> !PTE_VALID.
> (by the way, the pte is mapped to anon page (ashmem))
> The weird thing, in my opinion, is that
> the VMA of the fault address is not with PROT_NONE but with PROT_READ
> & PROT_WRITE.
> So, the page fault routine (handle_pte_fault()) returns 0 and fault
> loops forever.
>
> I don't think this is a normal situation.
>
> As I didn't enable NUMA, a pte with PROT_NONE and !PTE_VALID is likely
> set by mprotect().
> 1. mprotect(PROT_NONE) -> vma split & set pte with PROT_NONE
> 2. mprotect(PROT_READ & WRITE) -> vma merge & revert pte
> I suspect that the revert pte in #2 didn't work somehow
> but no clue.
>
> I googled and found a similar situation
> (http://linux-kernel.2935.n7.nabble.com/pipe-page-fault-oddness-td953839.html)
> which is relevant to NUMA and huge pagetable configs
> while my device is nothing to do with those configs.
>
> Am I missing any possible scenario? or is it already known BUG?
> It will be pleasure if you can give any idea about this problem.
>
> Thanks.
> Chulmin Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-qt1-f198.google.com (mail-qt1-f198.google.com [209.85.160.198]) by kanga.kvack.org (Postfix) with ESMTP id 6537B8E0041 for ; Mon, 24 Sep 2018 17:08:53 -0400 (EDT)
Received: by mail-qt1-f198.google.com with SMTP id t17-v6so6875255qtq.12 for ; Mon, 24 Sep 2018 14:08:53 -0700 (PDT)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id i25-v6si277835qta.379.2018.09.24.14.08.52 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 24 Sep 2018 14:08:52 -0700 (PDT)
Date: Mon, 24 Sep 2018 17:08:50 -0400
From: Andrea Arcangeli
Subject: Re: Question about a pte with PTE_PROT_NONE and !PTE_VALID on !PROT_NONE vma
Message-ID: <20180924210850.GV28957@redhat.com>
References: <10146a73-4788-ba89-001f-f928bbb314f5@samsung.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <10146a73-4788-ba89-001f-f928bbb314f5@samsung.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Chulmin Kim
Cc: Chulmin Kim , "linux-mm@kvack.org"

Hello,

On Sat, Sep 22, 2018 at 01:38:07PM +0900, Chulmin Kim wrote:
> Dear Arcangeli,
>
>
> I think this problem is very much related with
>
> the race condition shown in the below commit.
>
> (e86f15ee64d8, mm: vma_merge: fix vm_page_prot SMP race condition
> against rmap_walk)
>
>
> I checked that
>
> the thread and its child threads are doing mprotect(PROT_{NONE or
> R|W}) things repeatedly
>
> while I didn't reproduce the problem yet.
>
>
> Do you think this is one of the phenomenon you expected
>
> from the race condition shown in the above commit?

Yes, that commit will fix your problem in a v4.4-based tree that lacks the fix. You just need to cherry-pick it.

Page migration sets the pte to PROT_NONE by mistake because it runs concurrently with the mprotect that transitions an adjacent vma from PROT_NONE to PROT_READ|WRITE. Before the fix, vma_merge temporarily exposed an erratic PROT_NONE vma prot for the virtual range under page migration.

With NUMA disabled, it is likely compaction that triggered page migration for you. Disabling compaction at build time would likely have hidden the problem. Compaction uses migration, and you almost certainly have CONFIG_COMPACTION=y (rightfully so).
On a side note, I suggest cherry-picking the latest upstream commit to mm/vmacache.c too.

Hope this helps,
Andrea

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf1-f198.google.com (mail-pf1-f198.google.com [209.85.210.198]) by kanga.kvack.org (Postfix) with ESMTP id 2F1898E0001 for ; Thu, 27 Sep 2018 01:10:38 -0400 (EDT)
Received: by mail-pf1-f198.google.com with SMTP id s1-v6so1508321pfm.22 for ; Wed, 26 Sep 2018 22:10:38 -0700 (PDT)
Received: from mailout1.samsung.com (mailout1.samsung.com. [203.254.224.24]) by mx.google.com with ESMTPS id o6-v6si1021552plk.31.2018.09.26.22.10.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 26 Sep 2018 22:10:36 -0700 (PDT)
Subject: Re: Question about a pte with PTE_PROT_NONE and !PTE_VALID on !PROT_NONE vma
From: Chulmin Kim
Message-id:
Date: Thu, 27 Sep 2018 14:10:46 +0900
MIME-version: 1.0
In-reply-to: <20180924210850.GV28957@redhat.com>
Content-type: text/plain; charset="utf-8"; format="flowed"
Content-transfer-encoding: 7bit
Content-language: en-US
References: <10146a73-4788-ba89-001f-f928bbb314f5@samsung.com> <20180924210850.GV28957@redhat.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrea Arcangeli , "linux-mm@kvack.org" , Chulmin Kim

Hello.

Thanks for the reply.
We are running a test (a kind of aging test for 3 days) to confirm that this is the fix for the problem.
I will let you know when the test is done.

On 09/25/2018 06:08 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Sat, Sep 22, 2018 at 01:38:07PM +0900, Chulmin Kim wrote:
>> Dear Arcangeli,
>>
>>
>> I think this problem is very much related with
>>
>> the race condition shown in the below commit.
>>
>> (e86f15ee64d8, mm: vma_merge: fix vm_page_prot SMP race condition
>> against rmap_walk)
>>
>>
>> I checked that
>>
>> the thread and its child threads are doing mprotect(PROT_{NONE or
>> R|W}) things repeatedly
>>
>> while I didn't reproduce the problem yet.
>>
>>
>> Do you think this is one of the phenomenon you expected
>>
>> from the race condition shown in the above commit?
> Yes that commit will fix your problem in a v4.4 based tree that misses
> that fix. You just need to cherry-pick that commit to fix the problem.
>
> Page migrate sets the pte to PROT_NONE by mistake because it runs
> concurrently with the mprotect that transitions an adjacent vma from
> PROT_NONE to PROT_READ|WRITE. vma_merge (before the fix) temporarily
> shown an erratic PROT_NONE vma prot for the virtual range under page
> migration.
>
> With NUMA disabled, it's likely compaction that triggered page migrate
> for you. Disabling compaction at build time would have likely hidden
> the problem. Compaction uses migration and you most certainly have
> CONFIG_COMPACTION=y (rightfully so).
>
> On a side note, I suggest to cherry pick the last upstream commit of
> mm/vmacache.c too.

Sorry, but I didn't quite understand this line.
Do you mean the commit 7a9cdebdc (mm: get rid of vmacache_flush_all() entirely)?
Could you elaborate on the point?
Are you saying there is another scenario that could cause the problem I am seeing?

> Hope this helps,
> Andrea
>
>
>

Thanks.
Chulmin Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pl1-f200.google.com (mail-pl1-f200.google.com [209.85.214.200]) by kanga.kvack.org (Postfix) with ESMTP id 6C5ED6B000D for ; Fri, 5 Oct 2018 02:26:06 -0400 (EDT)
Received: by mail-pl1-f200.google.com with SMTP id v4-v6so10132020plz.21 for ; Thu, 04 Oct 2018 23:26:06 -0700 (PDT)
Received: from mailout1.samsung.com (mailout1.samsung.com. [203.254.224.24]) by mx.google.com with ESMTPS id p20-v6si7224081pgm.192.2018.10.04.23.26.04 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 04 Oct 2018 23:26:05 -0700 (PDT)
Subject: Re: Question about a pte with PTE_PROT_NONE and !PTE_VALID on !PROT_NONE vma
From: Chulmin Kim
Message-id:
Date: Fri, 05 Oct 2018 15:26:05 +0900
MIME-version: 1.0
In-reply-to: <20180924210850.GV28957@redhat.com>
Content-type: text/plain; charset="utf-8"; format="flowed"
Content-transfer-encoding: 7bit
Content-language: en-US
References: <10146a73-4788-ba89-001f-f928bbb314f5@samsung.com> <20180924210850.GV28957@redhat.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrea Arcangeli
Cc: Chulmin Kim , "linux-mm@kvack.org"

Dear all,

We have verified, using the problem scenario (repeated execution of Android apps for 2~3 days), that the problem is gone after applying the commit.
- e86f15ee64d8, mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk

Thanks!
Chulmin Kim

On 09/25/2018 06:08 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Sat, Sep 22, 2018 at 01:38:07PM +0900, Chulmin Kim wrote:
>> Dear Arcangeli,
>>
>>
>> I think this problem is very much related with
>>
>> the race condition shown in the below commit.
>>
>> (e86f15ee64d8, mm: vma_merge: fix vm_page_prot SMP race condition
>> against rmap_walk)
>>
>>
>> I checked that
>>
>> the thread and its child threads are doing mprotect(PROT_{NONE or
>> R|W}) things repeatedly
>>
>> while I didn't reproduce the problem yet.
>>
>>
>> Do you think this is one of the phenomenon you expected
>>
>> from the race condition shown in the above commit?
> Yes that commit will fix your problem in a v4.4 based tree that misses
> that fix. You just need to cherry-pick that commit to fix the problem.
>
> Page migrate sets the pte to PROT_NONE by mistake because it runs
> concurrently with the mprotect that transitions an adjacent vma from
> PROT_NONE to PROT_READ|WRITE. vma_merge (before the fix) temporarily
> shown an erratic PROT_NONE vma prot for the virtual range under page
> migration.
>
> With NUMA disabled, it's likely compaction that triggered page migrate
> for you. Disabling compaction at build time would have likely hidden
> the problem. Compaction uses migration and you most certainly have
> CONFIG_COMPACTION=y (rightfully so).
>
> On a side note, I suggest to cherry pick the last upstream commit of
> mm/vmacache.c too.
>
> Hope this helps,
> Andrea
>
>
>
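
For readers who want to see the userspace side of the pattern discussed in this thread, the following is a minimal sketch in C of the two mprotect() steps (PROT_NONE, then back to read/write) applied repeatedly to one page inside a larger anonymous mapping. The function name toggle_protection and the three-page layout are assumptions made for illustration; they do not come from the thread. The sketch uses a plain anonymous mapping rather than ashmem, and on its own it is not expected to reproduce the race, which according to the thread also needs page migration (for example from compaction) running concurrently on a kernel without commit e86f15ee64d8. Note that protection flags are combined with bitwise OR: on Linux, PROT_READ & PROT_WRITE evaluates to 0, which is the value of PROT_NONE.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): toggle the middle page of a
 * three-page anonymous mapping between PROT_NONE and PROT_READ | PROT_WRITE.
 * Returns 0 on success, -1 if a call fails or the page content is lost.
 */
int toggle_protection(size_t iterations)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t len = 3 * (size_t)page;
    unsigned char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    unsigned char *mid = base + page;
    int rc = 0;

    for (size_t i = 0; i < iterations; i++) {
        unsigned char stamp = (unsigned char)i;

        /* Write to the page so that it is populated and a pte exists. */
        mid[0] = stamp;

        /* Step 1: revoke access to the middle page only. */
        if (mprotect(mid, (size_t)page, PROT_NONE) != 0) {
            rc = -1;
            break;
        }

        /* Step 2: restore access; flags are OR-ed together, not AND-ed. */
        if (mprotect(mid, (size_t)page, PROT_READ | PROT_WRITE) != 0) {
            rc = -1;
            break;
        }

        /* After step 2 the page should be readable and hold the old data. */
        if (mid[0] != stamp) {
            rc = -1;
            break;
        }
    }

    munmap(base, len);
    return rc;
}
```

A caller would typically invoke toggle_protection() with a large iteration count from several threads at once, which is closer to the workload described above. On a kernel showing the reported symptom, the read after step 2 would be where the thread keeps faulting instead of returning.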