All of lore.kernel.org
 help / color / mirror / Atom feed
From: ira.weiny@intel.com
To: Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz>,
	Theodore Ts'o <tytso@mit.edu>, Jeff Layton <jlayton@kernel.org>,
	Dave Chinner <david@fromorbit.com>
Cc: linux-nvdimm@lists.01.org, "John Hubbard" <jhubbard@nvidia.com>,
	linux-kernel@vger.kernel.org,
	"Matthew Wilcox" <willy@infradead.org>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linux-ext4@vger.kernel.org
Subject: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
Date: Wed,  5 Jun 2019 18:45:33 -0700	[thread overview]
Message-ID: <20190606014544.8339-1-ira.weiny@intel.com> (raw)

From: Ira Weiny <ira.weiny@intel.com>

... V1,000,000   ;-)

Pre-requisites:
	John Hubbard's put_user_pages() patch series.[1]
	Jan Kara's ext4_break_layouts() fixes[2]

Based on the feedback from LSFmm and the LWN article which resulted.  I've
decided to take a slightly different tack on this problem.

The real issue is that there is no use case for a user to have RDMA pinn'ed
memory which is then truncated.  So really any solution we present which:

A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong

Should be an acceptable solution.

Because this is slightly new behavior.  And because this is gonig to be
specific to DAX (because of the lack of a page cache) we have made the user
"opt in" to this behavior.

The following patches implement the following solution.

1) The user has to opt in to allowing GUP pins on a file with a layout lease
   (now made visible).
2) GUP will fail (EPERM) if a layout lease is not taken
3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
4) The user has the option of holding the layout lease to receive a SIGIO for
   notification to the original thread that another thread has tried to delete
   their data.  Furthermore this indicates that if the user needs to GUP the
   file again they will need to retake the Layout lease before doing so.


NOTE: If the user releases the layout lease or if it has been broken by another
operation further GUP operations on the file will fail without re-taking the
lease.  This means that if a user would like to register pieces of a file and
continue to register other pieces later they would be advised to keep the
layout lease, get a SIGIO notification, and retake the lease.

NOTE2: Truncation of pages which are not actively pinned will succeed.  Similar
to accessing an mmap to this area GUP pins of that memory may fail.


A general overview follows for background.

It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP).  There are 2 big reasons this may not work.

	1) The hardware being used for RDMA may not support ODP
	2) ODP may be detrimental to the over all network (cluster or cloud)
	   performance

Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.

1) GUP "longterm" users need to inform the other subsystems that they have
   taken a pin on a page which may remain pinned for a very "long time".[3]

2) Any page which is "controlled" by a file system needs to have special
   handling.  The details of the handling depends on if the page is page cache
   fronted or not.

   2a) A page cache fronted page which has been pinned by GUP long term can use a
   bounce buffer to allow the file system to write back snap shots of the page.
   This is handled by the FS recognizing the GUP long term pin and making a copy
   of the page to be written back.
	NOTE: this patch set does not address this path.

   2b) A FS "controlled" page which is not page cache fronted is either easier
   to deal with or harder depending on the operation the filesystem is trying
   to do.

	2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
	FS can no longer use the pages in question until the pin has been
	removed.  This patch set presents a solution to this by introducing
	some reasonable restrictions on user space applications.

	2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
	then there is nothing which need be done.  Data is Read or Written
	directly to the page.  This is an easy case which would currently work
	if not for GUP long term pins being disabled.  Therefore this patch set
	need not change access to the file data but does allow for GUP pins
	after 2ba above is dealt with.


This patch series and presents a solution for problem 2ba)

[1] https://github.com/johnhubbard/linux/tree/gup_dma_core

[2] ext4/dev branch:

- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev

	Specific patches:

	[2a] ext4: wait for outstanding dio during truncate in nojournal mode

	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8

	[2b] ext4: do not delete unlinked inode from orphan list on failed truncate

	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14

	[2c] ext4: gracefully handle ext4_break_layouts() failure during truncate

	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c

[3] The definition of long time is debatable but it has been established
that RDMAs use of pages, minutes or hours after the pin is the extreme case
which makes this problem most severe.


Ira Weiny (10):
  fs/locks: Add trace_leases_conflict
  fs/locks: Export F_LAYOUT lease to user space
  mm/gup: Pass flags down to __gup_device_huge* calls
  mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
  fs/ext4: Teach ext4 to break layout leases
  fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  fs/ext4: Fail truncate if pages are GUP pinned
  fs/xfs: Teach xfs to use new dax_layout_busy_page()
  fs/xfs: Fail truncate if pages are GUP pinned
  mm/gup: Remove FOLL_LONGTERM DAX exclusion

 fs/Kconfig                       |   1 +
 fs/dax.c                         |  38 ++++++---
 fs/ext4/ext4.h                   |   2 +-
 fs/ext4/extents.c                |   6 +-
 fs/ext4/inode.c                  |  26 +++++--
 fs/locks.c                       |  97 ++++++++++++++++++++---
 fs/xfs/xfs_file.c                |  24 ++++--
 fs/xfs/xfs_inode.h               |   5 +-
 fs/xfs/xfs_ioctl.c               |  15 +++-
 fs/xfs/xfs_iops.c                |  14 +++-
 fs/xfs/xfs_pnfs.c                |  14 ++--
 include/linux/dax.h              |   9 ++-
 include/linux/fs.h               |   2 +-
 include/linux/mm.h               |   2 +
 include/trace/events/filelock.h  |  35 +++++++++
 include/uapi/asm-generic/fcntl.h |   3 +
 mm/gup.c                         | 129 ++++++++++++-------------------
 mm/huge_memory.c                 |  12 +++
 18 files changed, 299 insertions(+), 135 deletions(-)

-- 
2.20.1

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

WARNING: multiple messages have this Message-ID (diff)
From: ira.weiny@intel.com
To: Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz>,
	"Theodore Ts'o" <tytso@mit.edu>, Jeff Layton <jlayton@kernel.org>,
	Dave Chinner <david@fromorbit.com>
Cc: "Ira Weiny" <ira.weiny@intel.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	linux-xfs@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nvdimm@lists.01.org, linux-ext4@vger.kernel.org,
	linux-mm@kvack.org
Subject: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
Date: Wed,  5 Jun 2019 18:45:33 -0700	[thread overview]
Message-ID: <20190606014544.8339-1-ira.weiny@intel.com> (raw)

From: Ira Weiny <ira.weiny@intel.com>

... V1,000,000   ;-)

Pre-requisites:
	John Hubbard's put_user_pages() patch series.[1]
	Jan Kara's ext4_break_layouts() fixes[2]

Based on the feedback from LSFmm and the LWN article which resulted.  I've
decided to take a slightly different tack on this problem.

The real issue is that there is no use case for a user to have RDMA pinn'ed
memory which is then truncated.  So really any solution we present which:

A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong

Should be an acceptable solution.

Because this is slightly new behavior.  And because this is gonig to be
specific to DAX (because of the lack of a page cache) we have made the user
"opt in" to this behavior.

The following patches implement the following solution.

1) The user has to opt in to allowing GUP pins on a file with a layout lease
   (now made visible).
2) GUP will fail (EPERM) if a layout lease is not taken
3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
4) The user has the option of holding the layout lease to receive a SIGIO for
   notification to the original thread that another thread has tried to delete
   their data.  Furthermore this indicates that if the user needs to GUP the
   file again they will need to retake the Layout lease before doing so.


NOTE: If the user releases the layout lease or if it has been broken by another
operation further GUP operations on the file will fail without re-taking the
lease.  This means that if a user would like to register pieces of a file and
continue to register other pieces later they would be advised to keep the
layout lease, get a SIGIO notification, and retake the lease.

NOTE2: Truncation of pages which are not actively pinned will succeed.  Similar
to accessing an mmap to this area GUP pins of that memory may fail.


A general overview follows for background.

It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP).  There are 2 big reasons this may not work.

	1) The hardware being used for RDMA may not support ODP
	2) ODP may be detrimental to the over all network (cluster or cloud)
	   performance

Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.

1) GUP "longterm" users need to inform the other subsystems that they have
   taken a pin on a page which may remain pinned for a very "long time".[3]

2) Any page which is "controlled" by a file system needs to have special
   handling.  The details of the handling depends on if the page is page cache
   fronted or not.

   2a) A page cache fronted page which has been pinned by GUP long term can use a
   bounce buffer to allow the file system to write back snap shots of the page.
   This is handled by the FS recognizing the GUP long term pin and making a copy
   of the page to be written back.
	NOTE: this patch set does not address this path.

   2b) A FS "controlled" page which is not page cache fronted is either easier
   to deal with or harder depending on the operation the filesystem is trying
   to do.

	2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
	FS can no longer use the pages in question until the pin has been
	removed.  This patch set presents a solution to this by introducing
	some reasonable restrictions on user space applications.

	2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
	then there is nothing which need be done.  Data is Read or Written
	directly to the page.  This is an easy case which would currently work
	if not for GUP long term pins being disabled.  Therefore this patch set
	need not change access to the file data but does allow for GUP pins
	after 2ba above is dealt with.


This patch series and presents a solution for problem 2ba)

[1] https://github.com/johnhubbard/linux/tree/gup_dma_core

[2] ext4/dev branch:

- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev

	Specific patches:

	[2a] ext4: wait for outstanding dio during truncate in nojournal mode

	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8

	[2b] ext4: do not delete unlinked inode from orphan list on failed truncate

	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14

	[2c] ext4: gracefully handle ext4_break_layouts() failure during truncate

	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c

[3] The definition of long time is debatable but it has been established
that RDMAs use of pages, minutes or hours after the pin is the extreme case
which makes this problem most severe.


Ira Weiny (10):
  fs/locks: Add trace_leases_conflict
  fs/locks: Export F_LAYOUT lease to user space
  mm/gup: Pass flags down to __gup_device_huge* calls
  mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
  fs/ext4: Teach ext4 to break layout leases
  fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  fs/ext4: Fail truncate if pages are GUP pinned
  fs/xfs: Teach xfs to use new dax_layout_busy_page()
  fs/xfs: Fail truncate if pages are GUP pinned
  mm/gup: Remove FOLL_LONGTERM DAX exclusion

 fs/Kconfig                       |   1 +
 fs/dax.c                         |  38 ++++++---
 fs/ext4/ext4.h                   |   2 +-
 fs/ext4/extents.c                |   6 +-
 fs/ext4/inode.c                  |  26 +++++--
 fs/locks.c                       |  97 ++++++++++++++++++++---
 fs/xfs/xfs_file.c                |  24 ++++--
 fs/xfs/xfs_inode.h               |   5 +-
 fs/xfs/xfs_ioctl.c               |  15 +++-
 fs/xfs/xfs_iops.c                |  14 +++-
 fs/xfs/xfs_pnfs.c                |  14 ++--
 include/linux/dax.h              |   9 ++-
 include/linux/fs.h               |   2 +-
 include/linux/mm.h               |   2 +
 include/trace/events/filelock.h  |  35 +++++++++
 include/uapi/asm-generic/fcntl.h |   3 +
 mm/gup.c                         | 129 ++++++++++++-------------------
 mm/huge_memory.c                 |  12 +++
 18 files changed, 299 insertions(+), 135 deletions(-)

-- 
2.20.1


             reply	other threads:[~2019-06-06  1:45 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-06  1:45 ira.weiny [this message]
2019-06-06  1:45 ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal ira.weiny
2019-06-06  1:45 ` [PATCH RFC 01/10] fs/locks: Add trace_leases_conflict ira.weiny
2019-06-09 12:52   ` Jeff Layton
2019-06-06  1:45 ` [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-09 13:00   ` Jeff Layton
2019-06-09 13:00     ` Jeff Layton
2019-06-11 21:38     ` Ira Weiny
2019-06-11 21:38       ` Ira Weiny
2019-06-12  9:46       ` Jan Kara
2019-06-06  1:45 ` [PATCH RFC 03/10] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  6:18   ` Christoph Hellwig
2019-06-06 16:10     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 04/10] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 05/10] fs/ext4: Teach ext4 to break layout leases ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 06/10] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06 10:58   ` Jan Kara
2019-06-06 10:58     ` Jan Kara
2019-06-06 16:17     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 08/10] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 09/10] fs/xfs: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 10/10] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  5:52 ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal John Hubbard
2019-06-06  5:52   ` John Hubbard
2019-06-06 17:11   ` Ira Weiny
2019-06-06 17:11     ` Ira Weiny
2019-06-06 19:46     ` Jason Gunthorpe
2019-06-06 10:42 ` Jan Kara
2019-06-06 15:35   ` Dan Williams
2019-06-06 19:51   ` Jason Gunthorpe
2019-06-06 22:22     ` Ira Weiny
2019-06-07 10:36       ` Jan Kara
2019-06-07 12:17         ` Jason Gunthorpe
2019-06-07 14:52           ` Ira Weiny
2019-06-07 14:52             ` Ira Weiny
2019-06-07 15:10             ` Jason Gunthorpe
2019-06-12 10:29             ` Jan Kara
2019-06-12 10:29               ` Jan Kara
2019-06-12 11:47               ` Jason Gunthorpe
2019-06-12 12:09                 ` Jan Kara
2019-06-12 12:09                   ` Jan Kara
2019-06-12 18:41                   ` Dan Williams
2019-06-13  7:17                     ` Jan Kara
2019-06-13  7:17                       ` Jan Kara
2019-06-12 19:14                   ` Jason Gunthorpe
2019-06-12 22:13                     ` Ira Weiny
2019-06-12 22:54                       ` Dan Williams
2019-06-12 22:54                         ` Dan Williams
2019-06-12 23:33                         ` Ira Weiny
2019-06-12 23:33                           ` Ira Weiny
2019-06-13  1:14                           ` Dan Williams
2019-06-13  1:14                             ` Dan Williams
2019-06-13 15:13                             ` Jason Gunthorpe
2019-06-13 16:25                               ` Dan Williams
2019-06-13 16:25                                 ` Dan Williams
2019-06-13 17:18                                 ` Jason Gunthorpe
2019-06-13 16:53                           ` Dan Williams
2019-06-13 16:53                             ` Dan Williams
2019-06-13 15:12                         ` Jason Gunthorpe
2019-06-13  7:53                       ` Jan Kara
2019-06-13  7:53                         ` Jan Kara
2019-06-12 18:49               ` Dan Williams
2019-06-12 18:49                 ` Dan Williams
2019-06-13  7:43                 ` Jan Kara
2019-06-06 22:03   ` Ira Weiny
2019-06-06 22:03     ` Ira Weiny
2019-06-06 22:26     ` Ira Weiny
2019-06-06 22:28     ` Dave Chinner
2019-06-07 11:04     ` Jan Kara
2019-06-07 18:25       ` Ira Weiny
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:50         ` Jason Gunthorpe
2019-06-08  0:10         ` Dave Chinner
2019-06-08  0:10           ` Dave Chinner
2019-06-09  1:29           ` Ira Weiny
2019-06-09  1:29             ` Ira Weiny
2019-06-09  1:29             ` Ira Weiny
2019-06-12 12:37           ` Matthew Wilcox
2019-06-12 12:37             ` Matthew Wilcox
2019-06-12 12:37             ` Matthew Wilcox
2019-06-12 23:30             ` Ira Weiny
2019-06-12 23:30               ` Ira Weiny
2019-06-12 23:30               ` Ira Weiny
2019-06-13  0:55               ` Dave Chinner
2019-06-13  0:55                 ` Dave Chinner
2019-06-13  0:55                 ` Dave Chinner
2019-06-13 20:34                 ` Ira Weiny
2019-06-13 20:34                   ` Ira Weiny
2019-06-13 20:34                   ` Ira Weiny
2019-06-14  3:42                   ` Dave Chinner
2019-06-13  0:25             ` Dave Chinner
2019-06-13  0:25               ` Dave Chinner
2019-06-13  3:23               ` Matthew Wilcox
2019-06-13  3:23                 ` Matthew Wilcox
2019-06-13  3:23                 ` Matthew Wilcox
2019-06-13  4:36                 ` Dave Chinner
2019-06-13  4:36                   ` Dave Chinner
2019-06-13  4:36                   ` Dave Chinner
2019-06-13 10:47                   ` Matthew Wilcox
2019-06-13 10:47                     ` Matthew Wilcox
2019-06-13 10:47                     ` Matthew Wilcox
2019-06-13 15:29                 ` Jason Gunthorpe
2019-06-13 15:27               ` Matthew Wilcox
2019-06-13 15:27                 ` Matthew Wilcox
2019-06-13 15:27                 ` Matthew Wilcox
2019-06-13 21:13                 ` Ira Weiny
2019-06-13 21:13                   ` Ira Weiny
2019-06-13 23:45                   ` Jason Gunthorpe
2019-06-14  0:00                     ` Ira Weiny
2019-06-14  0:00                       ` Ira Weiny
2019-06-14  2:09                     ` Dave Chinner
2019-06-14  2:09                       ` Dave Chinner
2019-06-14  2:09                       ` Dave Chinner
2019-06-14  2:31                       ` Matthew Wilcox
2019-06-14  2:31                         ` Matthew Wilcox
2019-06-14  3:07                         ` Dave Chinner
2019-06-14  3:07                           ` Dave Chinner
2019-06-14  3:07                           ` Dave Chinner
2019-06-20 14:52                 ` Jan Kara
2019-06-20 14:52                   ` Jan Kara
2019-06-13 20:34               ` Ira Weiny
2019-06-13 20:34                 ` Ira Weiny
2019-06-13 20:34                 ` Ira Weiny
2019-06-14  2:58                 ` Dave Chinner
2019-06-14  2:58                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190606014544.8339-1-ira.weiny@intel.com \
    --to=ira.weiny@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=jlayton@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.