linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Hugh Dickins <hughd@google.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Michal Hocko <mhocko@kernel.org>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 0/4] hugetlbfs: use hinode_rwsem for pmd sharing synchronization
Date: Thu, 5 Nov 2020 13:59:24 -0800	[thread overview]
Message-ID: <7374a8cc-ab8d-ad25-2b8b-2451ba8f7131@oracle.com> (raw)
In-Reply-To: <20201103002841.273161-1-mike.kravetz@oracle.com>

On 11/2/20 4:28 PM, Mike Kravetz wrote:
> The RFC series reverted all patches where i_mmap_rwsem was used for
> pmd sharing synchronization, and then added code to use hinode_rwsem.
> This series ends up with the same code in the end, but is structured
> as follows:
> 
> - Revert the page fault/truncation race patch which depends on i_mmap_rwsem
>   always being taken in page fault path.
> - Add hinode_rwsem to hugetlbfs specific inode and supporting routines.
> - Convert code from using i_mmap_rwsem for pmd sharing synchronization
>   to using hinode_rwsem.
> - Add code to more robustly deal with page fault/truncation races.
> 
> My hope is that this will be easier to review.

My apologies.  Please do not spend any time on this patch series and it
certainly is not something to be sent to stable.

I have created a patch to address the BUG and propose that for stable [1].

After further thought, the approach in this series also has lock ordering
issues.  Here is a simple summary of the problem.

With pmd sharing, the pte pointer returned by huge_pte_alloc is not
guaranteed to be valid.  This is because another thread could have made
a call to huge_pmd_unshare and 'unshared' the pmd page containing the
pte pointer.

The current i_mmap_rwsem locking and the inode locking proposed in this
series acquire the semaphore in read mode before calling huge_pte_alloc.
Code continues to hold the semaphore until finished with the pte pointer.
Callers of huge_pmd_unshare hold the semaphore in write mode.  Thus, the
semaphore prevents the race.

The problem with this type of approach is lock ordering.  The semaphore is
acquired in read mode during page fault processing.  The first thing this
code does is 'allocate' a pte with huge_pte_alloc.  It will then, find or
allocate a page and lock it.  Finally, it will lock the page table to update
the pte.  Two instances where we may need to take the semaphore in write
mode are page migration and memory failure.  In these cases, the first thing
we need to do is lock the page.  Only after locking the page can we locate
the semaphore which needs to be acquired in write mode.  Hence we end up with
a classic cause for ABBA deadlocks.

I'm starting to think that adding more synchronization is not the best way
to approach this issue.  Rather, we should always validate pte pointers after
acquiring the page table lock.  At the lowest level, pmd sharing is
synchronized by the page table lock.  We already do some validation of the
pte after acquiring the page table lock.  For example, checking if
huge_pte_none() is still true.  Before even checking for none, we would need
to lookup the pte again (huge_pte_offset) and compare to the pte we previously
acquired.  If they are not the same, then we would need to backout and retry.

Unless someone has another suggestion, I'll start exploring this approach.

[1] https://lore.kernel.org/linux-mm/20201105195058.78401-1-mike.kravetz@oracle.com/
-- 
Mike Kravetz


      parent reply	other threads:[~2020-11-05 21:59 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-26 23:31 [RFC PATCH v2 0/4] hugetlbfs: introduce hinode_rwsem for pmd sharing synchronization Mike Kravetz
2020-10-26 23:31 ` [RFC PATCH v2 1/4] hugetlbfs: revert use of i_mmap_rwsem for pmd sharing and more sync Mike Kravetz
2020-10-26 23:31 ` [RFC PATCH v2 2/4] hugetlbfs: add hinode_rwsem to hugetlb specific inode Mike Kravetz
2020-10-26 23:31 ` [RFC PATCH v2 3/4] hugetlbfs: use hinode_rwsem for pmd sharing synchronization Mike Kravetz
2020-10-26 23:31 ` [RFC PATCH v2 4/4] huegtlbfs: handle page fault/truncate races Mike Kravetz
2020-11-03  0:28 ` [PATCH 0/4] hugetlbfs: use hinode_rwsem for pmd sharing synchronization Mike Kravetz
2020-11-03  0:28   ` [PATCH 1/4] Revert hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2020-11-03  0:28   ` [PATCH 2/4] hugetlbfs: add hinode_rwsem to hugetlb specific inode Mike Kravetz
2020-11-03  0:28   ` [PATCH 3/4] hugetlbfs: use hinode_rwsem for pmd sharing synchronization Mike Kravetz
2020-11-03  0:28   ` [PATCH 4/4] huegtlbfs: handle page fault/truncate races Mike Kravetz
2020-11-05 21:59   ` Mike Kravetz [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7374a8cc-ab8d-ad25-2b8b-2451ba8f7131@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=dave@stgolabs.net \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=prakash.sangappa@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).