All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: hughd@google.com, kirill.shutemov@linux.intel.com,
	aarcange@redhat.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
Date: Mon, 25 Nov 2019 21:33:50 +0300	[thread overview]
Message-ID: <20191125183350.5gmcln6t3ofszbsy@box> (raw)
In-Reply-To: <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>

On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
> 
> 
> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > Currently when truncating shmem file, if the range is partial of THP
> > > (start or end is in the middle of THP), the pages actually will just get
> > > cleared rather than being freed unless the range cover the whole THP.
> > > Even though all the subpages are truncated (randomly or sequentially),
> > > the THP may still be kept in page cache.  This might be fine for some
> > > usecases which prefer preserving THP.
> > > 
> > > But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> > > So, when using shmem THP as memory backend QEMU inflation actually doesn't
> > > work as expected since it doesn't free memory.  But, the inflation
> > > usecase really needs get the memory freed.  Anonymous THP will not get
> > > freed right away too but it will be freed eventually when all subpages are
> > > unmapped, but shmem THP would still stay in page cache.
> > > 
> > > To protect the usecases which may prefer preserving THP, introduce a
> > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
> > > preferred behavior if truncating partial THP.  This mode just makes
> > > sense to tmpfs for the time being.
> > We need to clarify interaction with khugepaged. This implementation
> > doesn't do anything to prevent khugepaged from collapsing the range back
> > to THP just after the split.
> 
> Yes, it doesn't. Will clarify this in the commit log.

Okay, but I'm not sure that documention alone will be enough. We need
proper design.

> > > @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> > >   			}
> > >   			unlock_page(page);
> > >   		}
> > > +rescan_split:
> > >   		pagevec_remove_exceptionals(&pvec);
> > >   		pagevec_release(&pvec);
> > > +
> > > +		if (split && PageTransCompound(page)) {
> > > +			/* The THP may get freed under us */
> > > +			if (!get_page_unless_zero(compound_head(page)))
> > > +				goto rescan_out;
> > > +
> > > +			lock_page(page);
> > > +
> > > +			/*
> > > +			 * The extra pins from page cache lookup have been
> > > +			 * released by pagevec_release().
> > > +			 */
> > > +			if (!split_huge_page(page)) {
> > > +				unlock_page(page);
> > > +				put_page(page);
> > > +				/* Re-look up page cache from current index */
> > > +				goto again;
> > > +			}
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +		}
> > > +rescan_out:
> > >   		index++;
> > >   	}
> > Doing get_page_unless_zero() just after you've dropped the pin for the
> > page looks very suboptimal.
> 
> If I don't drop the pins the THP can't be split. And, there might be more
> than one pins from find_get_entries() if I read the code correctly. For
> example, truncate 8K length in the middle of THP, the THP's refcount would
> get bumpped twice since  two sub pages would be returned.

Pin the page before pagevec_release() and avoid get_page_unless_zero().

Current code is buggy. You need to check that the page is still belong to
the file after speculative lookup.

-- 
 Kirill A. Shutemov

  reply	other threads:[~2019-11-25 18:33 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23  1:05 [RFC PATCH] mm: shmem: allow split THP when truncating THP partially Yang Shi
2019-11-25  2:28 ` kbuild test robot
2019-11-25  9:36 ` Kirill A. Shutemov
2019-11-25 18:24   ` Yang Shi
2019-11-25 18:33     ` Kirill A. Shutemov [this message]
2019-11-25 19:33       ` Yang Shi
2019-11-26 23:34         ` Yang Shi
2019-11-28  3:06           ` Hugh Dickins
2019-11-28  3:06             ` Hugh Dickins
2019-11-28 11:34             ` Kirill A. Shutemov
2019-12-02 23:14               ` Yang Shi
2019-12-02 23:07             ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191125183350.5gmcln6t3ofszbsy@box \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.