From: Jan Kara <jack@suse.cz>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Linux API <linux-api@vger.kernel.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-xfs@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andy Lutomirski <luto@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
Date: Tue, 15 Aug 2017 10:37:32 +0200
Message-ID: <20170815083732.GB27505@quack2.suse.cz>
In-Reply-To: <CAPcyv4hi_Y5Qj=h_Qf4Bcyv+EWBosa2gQT+-8ro3hPY9VMshSA@mail.gmail.com>

On Mon 14-08-17 09:14:42, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> > That being said I think we absolutely should support RDMA memory
> >> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> >> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> >> > all the blocks are populated and all ptes are set up.  Second we need
> >> > to make sure get_user_page works, which for now means we'll need a
> >> > struct page mapping for the region (which will be really annoying
> >> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> >> > and we need to guarantee that the extent mapping won't change while
> >> > the get_user_pages holds the pages inside it.  I think that is true
> >> > due to side effects even with the current DAX code, but we'll need to
> >> > make it explicit.  And maybe that's where we need to converge -
> >> > "sealing" the extent map makes sense as such a temporary measure
> >> > that is not persisted on disk, which automatically gets released
> >> > when the holding process exits, because we sort of already do this
> >> > implicitly.  It might also make sense to have explicitly breakable
> >> > seals similar to what I do for the pNFS blocks kernel server, as
> >> > any userspace RDMA file server would also need those semantics.
> >>
> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >>
> >>     1/ only succeed if the fault can be satisfied without page cache
> >>
> >>     2/ only install a pte for the fault if it can do so without
> >> triggering block map updates
> >>
> >> So, I think it would still end up setting an inode flag to make
> >> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> >> active. However, it would not record that state in the on-disk
> >> metadata and it would automatically clear at munmap time. That should
> >> be enough to support the host-persistent-memory, and
> >> NVMe-persistent-memory use cases (provided we have struct page for
> >> NVMe). Although, we need more safety infrastructure in the NVMe case
> >> where we would need to software manage I/O coherence.
> >
> > Hum, this proposal (and the problems you are trying to deal with) seem very
> > similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> > the DAX area (and so additionally complicated by the fact that filesystems
> > now have to care). The patch set was not merged due to lack of interest I
> > think but it looked sensible and the proposed API would make sense for more
> > stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
> 
> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
> "no-fault" guarantee and fixes the accounting of locked System RAM.
> MAP_DIRECT still allows faults, and DAX mappings don't consume System
> RAM so the accounting problem is not there for DAX. mm_mpin() also does
> not appear to have a relationship to a file backed memory like mmap
> allows.

So the accounting part is probably uninteresting for DAX purposes, and I
agree there are other differences as well. But mm_mpin() prevented page
migrations, which parallels your requirement of "offset->block mapping
is permanent".  Furthermore, the mm_mpin() work was done for RDMA, to give
it a saner interface for pinning pages than get_user_pages(), and you
mention RDMA and similar technologies as a use case for your work for
similar reasons. So my thought was that we should possibly have the same
API for pinning "storage" for RDMA transfers, regardless of whether the
backing is page cache or pmem, and that the API should be usable by
in-kernel users as well. An mmap flag seems a bit clumsy in this regard, so
maybe a separate syscall - be it mpin(start, len) or some other name -
might be more suitable?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

