linux-mm.kvack.org archive mirror
From: Qian Cai <cai@lca.pw>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@kernel.org>, Hugh Dickins <hughd@google.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	 Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 0/2] hugetlbfs: use i_mmap_rwsem for more synchronization
Date: Thu, 12 Mar 2020 11:57:50 -0400
Message-ID: <1584028670.7365.182.camel@lca.pw>
In-Reply-To: <20200305002650.160855-1-mike.kravetz@oracle.com>

On Wed, 2020-03-04 at 16:26 -0800, Mike Kravetz wrote:
> While discussing the issue with huge_pte_offset [1], I remembered that
> there were more outstanding hugetlb races.  These issues are:
> 
> 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
>    invalid via a call to huge_pmd_unshare by another thread.
> 2) hugetlbfs page faults can race with truncation causing invalid global
>    reserve counts and state.
> 
> A previous attempt was made to use i_mmap_rwsem in this manner as described
> at [2].  However, those patches were reverted starting with [3] due to
> locking issues.
> 
> To effectively use i_mmap_rwsem to address the above issues it needs to
> be held (in read mode) during page fault processing.  However, during
> fault processing we need to lock the page we will be adding.  Lock
> ordering requires we take page lock before i_mmap_rwsem.  Waiting until
> after taking the page lock is too late in the fault process for the
> synchronization we want to do.
> 
> To address this lock ordering issue, the following patches change the
> lock ordering for hugetlb pages.  This is not too invasive as hugetlbfs
> processing is done separate from core mm in many places.  However, I
> don't really like this idea.  Much ugliness is contained in the new
> routine hugetlb_page_mapping_lock_write() of patch 1.
> 
> The only other way I can think of to address these issues is by catching
> all the races.  After catching a race, cleanup, backout, retry ... etc,
> as needed.  This can get really ugly, especially for huge page reservations.
> At one time, I started writing some of the reservation backout code for
> page faults and it got so ugly and complicated I went down the path of
> adding synchronization to avoid the races.  Any other suggestions would
> be welcome.

Reverting this series on top of today's linux-next fixed the hang seen with LTP
move_pages12 on both powerpc and arm64:

# /opt/ltp/testcases/bin/move_pages12
tst_test.c:1217: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:263: INFO: Free RAM 260577280 kB
move_pages12.c:281: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:291: INFO: Increasing 2048kB hugepages pool on node 8 to 4
move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 8
<hang>

[ 3948.791155][  T688] INFO: task move_pages12:32930 can't die for more than 3072 seconds.
[ 3948.791181][  T688] move_pages12    D26224 32930      1 0x00040002
[ 3948.791199][  T688] Call Trace:
[ 3948.791210][  T688] [c000200816b4f680] [c0000000010b7a68] cpufreq_update_util_data+0x0/0x8 (unreliable)
[ 3948.791234][  T688] [c000200816b4f860] [c00000000002615c] __switch_to+0x38c/0x520
[ 3948.791247][  T688] [c000200816b4f8d0] [c0000000009a1c94] __schedule+0x4b4/0xba0
[ 3948.791268][  T688] [c000200816b4f9a0] [c0000000009a2428] schedule+0xa8/0x170
[ 3948.791288][  T688] [c000200816b4f9d0] [c0000000009a2d0c] io_schedule+0x2c/0x50
[ 3948.791300][  T688] [c000200816b4fa00] [c000000000331020] __lock_page+0x150/0x3c0
[ 3948.791322][  T688] [c000200816b4fac0] [c000000000420164] hugetlb_no_page+0xb04/0xd40
lock_page at include/linux/pagemap.h:480
(inlined by) hugetlb_no_page at mm/hugetlb.c:4286
[ 3948.791342][  T688] [c000200816b4fc10] [c000000000420bd8] hugetlb_fault+0x738/0xc00
[ 3948.791363][  T688] [c000200816b4fcd0] [c0000000003b9c44] handle_mm_fault+0x444/0x450
[ 3948.791384][  T688] [c000200816b4fd20] [c000000000070b98] __do_page_fault+0x2b8/0xf90
[ 3948.791406][  T688] [c000200816b4fe20] [c00000000000aa88] handle_page_fault+0x10/0x30

> 
> [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
> [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
> [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
> 
> Mike Kravetz (2):
>   hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
>   hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
> 
>  fs/hugetlbfs/inode.c    |  34 +++++---
>  include/linux/fs.h      |   5 ++
>  include/linux/hugetlb.h |   8 ++
>  mm/hugetlb.c            | 175 +++++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c     |  29 ++++++-
>  mm/migrate.c            |  24 +++++-
>  mm/rmap.c               |  17 +++-
>  mm/userfaultfd.c        |  11 ++-
>  8 files changed, 264 insertions(+), 39 deletions(-)
> 


