linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>,
	kirill.shutemov@linux.intel.com, aarcange@redhat.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
Date: Thu, 28 Nov 2019 14:34:56 +0300	[thread overview]
Message-ID: <20191128113456.5phjhd3ajgky3h3i@box> (raw)
In-Reply-To: <alpine.LSU.2.11.1911271718130.652@eggly.anvils>

On Wed, Nov 27, 2019 at 07:06:01PM -0800, Hugh Dickins wrote:
> On Tue, 26 Nov 2019, Yang Shi wrote:
> > On 11/25/19 11:33 AM, Yang Shi wrote:
> > > On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
> > > > On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
> > > > > On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > > > > > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > > > > > Currently when truncating shmem file, if the range is partial of
> > > > > > > THP
> > > > > > > (start or end is in the middle of THP), the pages actually will
> > > > > > > just get
> > > > > > > cleared rather than being freed unless the range cover the whole
> > > > > > > THP.
> > > > > > > Even though all the subpages are truncated (randomly or
> > > > > > > sequentially),
> > > > > > > the THP may still be kept in page cache.  This might be fine for
> > > > > > > some
> > > > > > > usecases which prefer preserving THP.
> > > > > > > 
> > > > > > > But, when doing balloon inflation in QEMU, QEMU actually does hole
> > > > > > > punch
> > > > > > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
> > > > > > > used.
> > > > > > > So, when using shmem THP as memory backend QEMU inflation actually
> > > > > > > doesn't
> > > > > > > work as expected since it doesn't free memory.  But, the inflation
> > > > > > > usecase really needs get the memory freed.  Anonymous THP will not
> > > > > > > get
> > > > > > > freed right away too but it will be freed eventually when all
> > > > > > > subpages are
> > > > > > > unmapped, but shmem THP would still stay in page cache.
> > > > > > > 
> > > > > > > To protect the usecases which may prefer preserving THP, introduce
> > > > > > > a
> > > > > > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP
> > > > > > > is
> > > > > > > preferred behavior if truncating partial THP.  This mode just makes
> > > > > > > sense to tmpfs for the time being.
> 
> Sorry, I haven't managed to set aside enough time for this until now.
> 
> First off, let me say that I firmly believe this punch-split behavior
> should be the standard behavior (like in my huge tmpfs implementation),
> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
> But I don't know if I'll be able to persuade Kirill of that.
> 
> If the caller wants to write zeroes into the file, she can do so with the
> write syscall: the caller has asked to punch a hole or truncate the file,
> and in our case, like your QEMU case, hopes that memory and memcg charge
> will be freed by doing so.  I'll be surprised if changing the behavior
> to yours and mine turns out to introduce a regression, but if it does,
> I guess we'll then have to put it behind a sysctl or whatever.
> 
> IIUC the reason that it's currently implemented by clearing the hole
> is because split_huge_page() (unlike in older refcounting days) cannot
> be guaranteed to succeed.  Which is unfortunate, and none of us is very
> keen to build a filesystem on unreliable behavior; but the failure cases
> appear in practice to be rare enough, that it's on balance better to give
> the punch-hole-truncate caller what she asked for whenever possible.

I don't have a firm position here. Maybe you are right and we should try
to split pages right away.

It might be useful to consider case wider than shmem.

On traditional filesystem with a backing storage semantics of the same
punch hole operation is somewhat different. It doesn't have explicit
implications on memory footprint. It's about managing persistent storage.
With shmem/tmpfs it is lumped together.

It might be nice to write down pages that can be discarded under memory
pressure and leave the huge page intact until then...

[ I don't see a problem with your patch as long as we agree that it's
desired semantics for the interface. ]

-- 
 Kirill A. Shutemov

  reply	other threads:[~2019-11-28 11:35 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23  1:05 [RFC PATCH] mm: shmem: allow split THP when truncating THP partially Yang Shi
2019-11-25  9:36 ` Kirill A. Shutemov
2019-11-25 18:24   ` Yang Shi
2019-11-25 18:33     ` Kirill A. Shutemov
2019-11-25 19:33       ` Yang Shi
2019-11-26 23:34         ` Yang Shi
2019-11-28  3:06           ` Hugh Dickins
2019-11-28 11:34             ` Kirill A. Shutemov [this message]
2019-12-02 23:14               ` Yang Shi
2019-12-02 23:07             ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191128113456.5phjhd3ajgky3h3i@box \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).