linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Davidlohr Bueso <dave@stgolabs.net>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Michal Hocko <mhocko@kernel.org>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings
Date: Thu, 11 Apr 2019 11:32:52 -0700	[thread overview]
Message-ID: <0b1d1faf-ff72-a51f-b48a-175c9c5cab53@oracle.com> (raw)
In-Reply-To: <20190328234704.27083-3-mike.kravetz@oracle.com>

On 3/28/19 4:47 PM, Mike Kravetz wrote:
> hugetlb uses a fault mutex hash table to prevent page faults of the
> same pages concurrently.  The key for shared and private mappings is
> different.  Shared keys off address_space and file index.  Private
> keys off mm and virtual address.  Consider a private mappings of a
> populated hugetlbfs file.  A write fault will first map the page from
> the file and then do a COW to map a writable page.

Davidlohr suggested adding the stack trace to the commit log.  When I
originally 'discovered' this issue I was debugging something else.  The
routine remove_inode_hugepages() contains the following:

			 * ...
			 * This race can only happen in the hole punch case.
			 * Getting here in a truncate operation is a bug.
			 */
			if (unlikely(page_mapped(page))) {
				BUG_ON(truncate_op);

				i_mmap_lock_write(mapping);
				hugetlb_vmdelete_list(&mapping->i_mmap,
					index * pages_per_huge_page(h),
					(index + 1) * pages_per_huge_page(h));
				i_mmap_unlock_write(mapping);
			}

			lock_page(page);
			/*
			 * We must free the huge page and remove from page
			 * ...
			 */
			VM_BUG_ON(PagePrivate(page));
			remove_huge_page(page);
			freed++;

I observed that the page could be mapped (again) before the call to lock_page
if we raced with a private write fault.  However, for COW faults the faulting
code is holding the page lock until it unmaps the file page.  Hence, we will
not call remove_huge_page() with the page mapped.  That is good.  However, for
simple read faults the page remains mapped after releasing the page lock and
we can call remove_huge_page with a mapped page and BUG.

Sorry, the original commit message was not completely accurate in describing
the issue.  I was basing the change on behavior experienced during debug of
a another issue.  Actually, it is MUCH easier to BUG by making private read
faults race with hole punch.  As a result, I now think this should go to
stable.

Andrew, below is an updated commit message.  No changes to code.  Would you
like me to send an updated patch?  Also, need to add stable.

hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared keys off address_space and file index.  Private
keys off mm and virtual address.  Consider a private mappings of a
populated hugetlbfs file.  A fault will map the page from the file
and if needed do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could race with this code to map
the file page.  This causes problems (BUG) for the page cache remove
code as it expects the page to be unmapped.  A sample stack is:

page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
kernel BUG at mm/filemap.c:169!
...
RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
...
Call Trace:
__delete_from_page_cache+0x39/0x220
delete_from_page_cache+0x45/0x70
remove_inode_hugepages+0x13c/0x380
? __add_to_page_cache_locked+0x162/0x380
hugetlbfs_fallocate+0x403/0x540
? _cond_resched+0x15/0x30
? __inode_security_revalidate+0x5d/0x70
? selinux_file_permission+0x100/0x130
vfs_fallocate+0x13f/0x270
ksys_fallocate+0x3c/0x80
__x64_sys_fallocate+0x1a/0x20
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

There seems to be another potential COW issue/race with this approach
of different private and shared keys as noted in commit 8382d914ebf7
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.

Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Cc: <stable@vger.kernel.org>

-- 
Mike Kravetz

  reply	other threads:[~2019-04-11 18:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-28 23:47 [PATCH v2 0/2] A couple hugetlbfs fixes Mike Kravetz
2019-03-28 23:47 ` [PATCH v2 1/2] huegtlbfs: on restore reserve error path retain subpool reservation Mike Kravetz
2019-03-28 23:47 ` [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings Mike Kravetz
2019-04-11 18:32   ` Mike Kravetz [this message]
2019-04-12 16:52     ` Davidlohr Bueso
2019-04-03  5:29 ` [PATCH v2 0/2] A couple hugetlbfs fixes Naoya Horiguchi
2019-04-08 19:48 ` Davidlohr Bueso
2019-04-09  3:30   ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0b1d1faf-ff72-a51f-b48a-175c9c5cab53@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@stgolabs.net \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).