All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Herrmann <dh.herrmann@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>, Tony Battersby <tonyb@cybernetics.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	John Stultz <john.stultz@linaro.org>,
	Christoph Lameter <clameter@sgi.com>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v2 0/3] File Sealing & memfd_create()
Date: Mon, 26 May 2014 13:44:13 +0200	[thread overview]
Message-ID: <CANq1E4TDDzG+HtBz261_nid3kVRG_jwcWHizzkdZCZZE3BaLgQ@mail.gmail.com> (raw)
In-Reply-To: <alpine.LSU.2.11.1405191403280.969@eggly.anvils>

Hi

(CC migrate.c committers)

On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins <hughd@google.com> wrote:
> On Mon, 19 May 2014, Jan Kara wrote:
>> On Mon 19-05-14 13:44:25, David Herrmann wrote:
>> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
>> > > The aspect which really worries me is this: the maintenance burden.
>> > > This approach would add some peculiar new code, introducing a rare
>> > > special case: which we might get right today, but will very easily
>> > > forget tomorrow when making some other changes to mm.  If we compile
>> > > a list of danger areas in mm, this would surely belong on that list.
>> >
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>
> To be honest, I'm quite glad to hear that: it is still a solution worth
> considering, but I'd rather continue the search for a better solution.

What if we set VM_IO for memory-mappings if a file supports sealing?
That might be a hack and quite restrictive, but we could also add a
VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
also a side-effect of VM_IO). This is basically what we do to protect
PCI BARs from that race during hotplug (well, VM_PFNMAP ist what
protects those, but the code is the same). If we mention in the
man-page that memfd-objects don't support direct-IO, we'd be fine, I
think. Not sure if that hack is better than the page-replacement,
though. It'd be definitely much simpler.

Regarding page-replacement, I tried using migrate_page(), however,
this obviously fails in page_freeze_refs() due to the elevated
ref-count and we cannot account for those, as they might vanish
asynchronously. Now I wonder whether we could just add a new mode
MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
Existing refs would still operate on the old page, but any new access
gets the new page. This way, we could collect pages with elevated
ref-counts in shmem similar to do_move_page_to_node_array() and then
call migrate_pages(). Now migrate_pages() takes good care to prevent
any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
as 'in-migration', so accesses are delayed. Page-faults wait on the
page-lock and retry due to mapping==NULL. lru is disabled beforehand.
Therefore, there cannot be any racing page-lookups as they all stall
on the migration. Moreover, page_freeze_refs() fails only if the page
is pinned by independent users (usually some form of I/O).
Question is what those additional ref-counts might be. Given that
shmem 'owns' its pages, none of these external references should pass
those refs around. All they use it for is I/O. Therefore, we shouldn't
even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
expect those external refs to never pass page-refs around. If that's a
valid assumption (and I haven't found any offenders so far), we should
be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.

Comments?

While skimming over migrate.c I noticed two odd behaviors:
1) migration_entry_wait() is used to wait on a migration to finish,
before accessing PTE entries. However, we call get_page() there, which
increases the ref-count of the old page and causes page_freeze_refs()
to fail. There's no way we can know how many tasks wait on a migration
entry when calling page_freeze_refs(). I have no idea how that's
supposed to work? Why don't we store the new page in the migration-swp
entry so any lookups stall on the new page? We don't care for
ref-counts on that page and if the migration fails, new->mapping is
set to NULL and any lookup is retried. remove_migration_pte() can
restore the old page correctly.
2) remove_migration_pte() calls get_page(new) before writing the PTE.
But who releases the ref of the old page?

Thanks
David

WARNING: multiple messages have this Message-ID (diff)
From: David Herrmann <dh.herrmann@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>, Tony Battersby <tonyb@cybernetics.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	John Stultz <john.stultz@linaro.org>,
	Christoph Lameter <clameter@sgi.com>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v2 0/3] File Sealing & memfd_create()
Date: Mon, 26 May 2014 13:44:13 +0200	[thread overview]
Message-ID: <CANq1E4TDDzG+HtBz261_nid3kVRG_jwcWHizzkdZCZZE3BaLgQ@mail.gmail.com> (raw)
In-Reply-To: <alpine.LSU.2.11.1405191403280.969@eggly.anvils>

Hi

(CC migrate.c committers)

On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins <hughd@google.com> wrote:
> On Mon, 19 May 2014, Jan Kara wrote:
>> On Mon 19-05-14 13:44:25, David Herrmann wrote:
>> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
>> > > The aspect which really worries me is this: the maintenance burden.
>> > > This approach would add some peculiar new code, introducing a rare
>> > > special case: which we might get right today, but will very easily
>> > > forget tomorrow when making some other changes to mm.  If we compile
>> > > a list of danger areas in mm, this would surely belong on that list.
>> >
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>
> To be honest, I'm quite glad to hear that: it is still a solution worth
> considering, but I'd rather continue the search for a better solution.

What if we set VM_IO for memory-mappings if a file supports sealing?
That might be a hack and quite restrictive, but we could also add a
VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
also a side-effect of VM_IO). This is basically what we do to protect
PCI BARs from that race during hotplug (well, VM_PFNMAP ist what
protects those, but the code is the same). If we mention in the
man-page that memfd-objects don't support direct-IO, we'd be fine, I
think. Not sure if that hack is better than the page-replacement,
though. It'd be definitely much simpler.

Regarding page-replacement, I tried using migrate_page(), however,
this obviously fails in page_freeze_refs() due to the elevated
ref-count and we cannot account for those, as they might vanish
asynchronously. Now I wonder whether we could just add a new mode
MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
Existing refs would still operate on the old page, but any new access
gets the new page. This way, we could collect pages with elevated
ref-counts in shmem similar to do_move_page_to_node_array() and then
call migrate_pages(). Now migrate_pages() takes good care to prevent
any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
as 'in-migration', so accesses are delayed. Page-faults wait on the
page-lock and retry due to mapping==NULL. lru is disabled beforehand.
Therefore, there cannot be any racing page-lookups as they all stall
on the migration. Moreover, page_freeze_refs() fails only if the page
is pinned by independent users (usually some form of I/O).
Question is what those additional ref-counts might be. Given that
shmem 'owns' its pages, none of these external references should pass
those refs around. All they use it for is I/O. Therefore, we shouldn't
even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
expect those external refs to never pass page-refs around. If that's a
valid assumption (and I haven't found any offenders so far), we should
be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.

Comments?

While skimming over migrate.c I noticed two odd behaviors:
1) migration_entry_wait() is used to wait on a migration to finish,
before accessing PTE entries. However, we call get_page() there, which
increases the ref-count of the old page and causes page_freeze_refs()
to fail. There's no way we can know how many tasks wait on a migration
entry when calling page_freeze_refs(). I have no idea how that's
supposed to work? Why don't we store the new page in the migration-swp
entry so any lookups stall on the new page? We don't care for
ref-counts on that page and if the migration fails, new->mapping is
set to NULL and any lookup is retried. remove_migration_pte() can
restore the old page correctly.
2) remove_migration_pte() calls get_page(new) before writing the PTE.
But who releases the ref of the old page?

Thanks
David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-05-26 11:44 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-15 18:38 [PATCH v2 0/3] File Sealing & memfd_create() David Herrmann
2014-04-15 18:38 ` David Herrmann
2014-04-15 18:38 ` [PATCH v2 1/3] shm: add sealing API David Herrmann
2014-04-15 18:38   ` David Herrmann
2014-05-20  2:16   ` Hugh Dickins
2014-05-20  2:16     ` Hugh Dickins
2014-05-23 16:37     ` David Herrmann
2014-05-23 16:37       ` David Herrmann
2014-06-02 10:30       ` Hugh Dickins
2014-06-02 10:30         ` Hugh Dickins
2014-04-15 18:38 ` [PATCH v2 2/3] shm: add memfd_create() syscall David Herrmann
2014-04-15 18:38   ` David Herrmann
2014-05-20  2:20   ` Hugh Dickins
2014-05-20  2:20     ` Hugh Dickins
2014-05-23 16:57     ` David Herrmann
2014-05-23 16:57       ` David Herrmann
2014-06-02 10:59       ` Hugh Dickins
2014-06-02 10:59         ` Hugh Dickins
2014-06-02 17:50         ` Andy Lutomirski
2014-06-02 17:50           ` Andy Lutomirski
2014-06-13 10:42         ` David Herrmann
2014-06-13 10:42           ` David Herrmann
2014-05-21 10:50   ` Konstantin Khlebnikov
2014-05-21 10:50     ` Konstantin Khlebnikov
2014-04-15 18:38 ` [PATCH v2 3/3] selftests: add memfd_create() + sealing tests David Herrmann
2014-04-15 18:38   ` David Herrmann
2014-05-20  2:22   ` Hugh Dickins
2014-05-20  2:22     ` Hugh Dickins
2014-05-23 17:06     ` David Herrmann
2014-05-23 17:06       ` David Herrmann
2014-05-14  5:09 ` [PATCH v2 0/3] File Sealing & memfd_create() Hugh Dickins
2014-05-14  5:09   ` Hugh Dickins
2014-05-14 16:15   ` Tony Battersby
2014-05-14 16:15     ` Tony Battersby
2014-05-14 22:35     ` Hugh Dickins
2014-05-14 22:35       ` Hugh Dickins
2014-05-19 11:44       ` David Herrmann
2014-05-19 11:44         ` David Herrmann
2014-05-19 16:09         ` Jan Kara
2014-05-19 16:09           ` Jan Kara
2014-05-19 22:11           ` Hugh Dickins
2014-05-19 22:11             ` Hugh Dickins
2014-05-26 11:44             ` David Herrmann [this message]
2014-05-26 11:44               ` David Herrmann
2014-05-31  4:44               ` Hugh Dickins
2014-05-31  4:44                 ` Hugh Dickins
2014-06-02  4:42         ` Minchan Kim
2014-06-02  4:42           ` Minchan Kim
2014-06-02  9:14           ` Jan Kara
2014-06-02  9:14             ` Jan Kara
2014-06-02 16:04             ` David Herrmann
2014-06-02 16:04               ` David Herrmann
2014-06-03  8:31               ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANq1E4TDDzG+HtBz261_nid3kVRG_jwcWHizzkdZCZZE3BaLgQ@mail.gmail.com \
    --to=dh.herrmann@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=john.stultz@linaro.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=tj@kernel.org \
    --cc=tonyb@cybernetics.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.