* direct_access, pinning and truncation
From: Matthew Wilcox @ 2014-10-08 19:05 UTC (permalink / raw)
  To: linux-fsdevel


One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file.  Right now, it simply doesn't work
because there's no struct page for the memory, so get_user_pages() fails.
Boaz has posted a patch to create struct pages for direct_access files,
which is certainly one way of solving the immediate problem, but it
ignores the deeper problem.

For normal files, get_user_pages() elevates the reference count on
the pages.  If those pages are subsequently truncated from the file,
the underlying file blocks are released to the filesystem's free pool.
The pages are removed from the page cache and the process's address space,
but hang around until the caller of get_user_pages() calls put_page() on
them again at which point they are released into the pool of free pages.
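
For concreteness, the usual pattern looks something like this (a sketch
only; pin_user_buffer() is a made-up helper):

/* Pin, use, release: the pinned DRAM outlives any truncate. */
static int pin_user_buffer(unsigned long uaddr, int nr, struct page **pages)
{
	int i, pinned;

	/* takes a reference on each page backing the user range */
	pinned = get_user_pages_fast(uaddr, nr, 1, pages);
	if (pinned < 0)
		return pinned;

	/* ... DMA or copy to/from the pages; a concurrent truncate
	 * releases the file blocks, but not these pages ... */

	for (i = 0; i < pinned; i++)
		put_page(pages[i]);
	return 0;
}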

Once we have a struct page for (or some other way to handle pinning of)
persistent memory blocks, truncating a file that has pinned pages will
still cause the disk blocks to be released to the free pool.  But there
weren't any pages of DRAM between the filesystem and the application!
So those blocks are "freed" while still referenced.  And that reference
might well be programmed into a piece of hardware that's doing DMA;
it can't be stopped.

I see three solutions here:

1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
the caller with the struct pages of the DRAM.  Modify DAX to handle some
file pages being in the page cache, and make sure that we know whether
the PMEM or DRAM is up to date.  This has the obvious downside that
get_user_pages() becomes slow.

2. Modify filesystems that support DAX to handle pinning blocks.
Some filesystems (that support COW and snapshots) already support
reference-counting individual blocks.  We may be able to do better by
using a tree of pinned extents or something.  This makes it much harder
to modify a filesystem to support DAX, and I don't see patches adding
this capability to ext2 being warmly welcomed.

3. Make truncate() block if it hits a pinned page.  There's really no
good reason to truncate a file that has pinned pages; it's either a bug
or you're trying to be nasty to someone.  We actually already have code
for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
just for O_DIRECT I/Os and other transient users like crypto, it's also
for long-lived things like RDMA, where we could potentially block for
an indefinite time.

Does option 3 open up a new attack surface?  I'm thinking about somebody
opening a large file that's publicly readable, and pinning part of
it by handing part of it to an RDMA card.  That would prevent the owner
from truncating it.

One thing that option 3 doesn't do is affect whether a file can be
removed.  Just having the file mmapped is enough to prevent the file blocks
from being reused, even if all names for that file have been removed.

I'm open to other solutions ...



* Re: direct_access, pinning and truncation
From: Zach Brown @ 2014-10-08 23:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

[... figuring out how g_u_p() references can prevent freeing and
re-using the underlying mapped pmem addresses given the lack of struct
pages for the mapping]

> I see three solutions here:
> 
> 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> the caller with the struct pages of the DRAM.  Modify DAX to handle some
> file pages being in the page cache, and make sure that we know whether
> the PMEM or DRAM is up to date.  This has the obvious downside that
> get_user_pages() becomes slow.

And serialize transitions and fs stores to pmem regions.  And now
storing to dram-fronted pmem goes through all the dirtying and writeback
machinery.  This sounds like a nightmare to me, to be honest.

> 2. Modify filesystems that support DAX to handle pinning blocks.
> Some filesystems (that support COW and snapshots) already support
> reference-counting individual blocks.  We may be able to do better by
> using a tree of pinned extents or something.  This makes it much harder
> to modify a filesystem to support DAX, and I don't see patches adding
> this capability to ext2 being warmly welcomed.

This seems.. doable?  Recording the referenced pmem in free lists in the
fs is fine as long as the pmem isn't modified until the references are
released, right?

Maybe in the allocator you skip otherwise free blocks if they intersect
with the run time structure (rbtree of extents, presumably) that is
taking the place of reference counts in struct page.  There aren't
*that* many allocator entry points.  I guess you'd need to avoid other
modifications of free space like trimming :/.  It still seems reasonably
doable?
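
Something like this, maybe (all names invented):

/* Sketch: consult the in-memory pinned-extent tree before handing out
 * a "free" block; the on-disk free space map is untouched. */
static u64 alloc_block(struct fs_allocator *ac)
{
	u64 blk;

	for (blk = find_first_free(ac); blk != NO_BLOCK;
	     blk = find_next_free(ac, blk + 1)) {
		if (!pinned_extent_overlaps(ac->pin_tree, blk, 1))
			return blk;
		/* still pinned by g_u_p(); skip it, don't reuse it */
	}
	return NO_BLOCK;
}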

And hey, lord knows we love to implement rbtrees of extents in file
systems!  (btrfs: struct extent_state, ext4: struct extent_status)

The tricky part would be maintaining that structure behind g_u_p() and
put_page() calls.  Probably a richer interface that gives callers
something more than just raw page pointers.
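
Totally hand-wavy, but maybe something along these lines:

/* Invented interface: callers get pinned PFN ranges instead of raw
 * struct page pointers, and release them back through the fs. */
struct pinned_range {
	unsigned long	pfn;	/* first frame; no struct page behind it */
	unsigned int	nr;	/* number of contiguous frames */
	void (*unpin)(struct pinned_range *);	/* drops the fs-side pin */
};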

> 3. Make truncate() block if it hits a pinned page.  There's really no
> good reason to truncate a file that has pinned pages; it's either a bug
> or you're trying to be nasty to someone.  We actually already have code
> for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> just for O_DIRECT I/Os and other transient users like crypto, it's also
> for long-lived things like RDMA, where we could potentially block for
> an indefinite time.

I have no concrete examples, but I agree that it sounds like the sort of
thing that would bite us in the ass if we miss some use case :/.

I guess my initial vote is for trying a less-than-perfect prototype of
#2 to see just how hairy the rough outline gets.

- z


* Re: direct_access, pinning and truncation
From: Dave Chinner @ 2014-10-09  1:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Wed, Oct 08, 2014 at 03:05:23PM -0400, Matthew Wilcox wrote:
> 
> One of the things on my todo list is making O_DIRECT work to a
> memory-mapped direct_access file.

I don't understand the motivation or the use case: O_DIRECT is
purely for bypassing the page cache, and DAX already bypasses the
page cache.  What difference is there between the DAX read/write
path and a DAX-based O_DIRECT IO path, and why doesn't just ignoring
O_DIRECT for DAX enabled filesystems simply do what you need?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: direct_access, pinning and truncation
From: Matthew Wilcox @ 2014-10-09 15:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, linux-fsdevel

On Thu, Oct 09, 2014 at 12:10:38PM +1100, Dave Chinner wrote:
> On Wed, Oct 08, 2014 at 03:05:23PM -0400, Matthew Wilcox wrote:
> > 
> > One of the things on my todo list is making O_DIRECT work to a
> > memory-mapped direct_access file.
> 
> I don't understand the motivation or the use case: O_DIRECT is
> purely for bypassing the page cache, and DAX already bypasses the
> page cache.  What difference is there between the DAX read/write
> path and a DAX-based O_DIRECT IO path, and why doesn't just ignoring
> O_DIRECT for DAX enabled filesystems simply do what you need?

There are two filesystems involved ... if both (or neither!) are DAX,
everything's fine.  The problem comes when you do things this way around:

int cachefd = open("/dax/cache", O_RDWR);
int datafd = open("/nfs/bigdata", O_RDWR | O_DIRECT);
void *cache = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
		MAP_SHARED, cachefd, 0);
read(datafd, cache, 1024 * 1024);

The non-DAX filesystem needs to pin pages from the DAX filesystem while
they're under I/O.


Another attempt to solve this problem might be to turn the O_DIRECT
read into a read into a page of DRAM, followed by a copy from DRAM
to PMEM.  Conversely, writes could be done as a copy to DRAM followed
by a page-based write.
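
In userspace terms that fallback amounts to something like this (a sketch;
error handling elided):

void *bounce;

posix_memalign(&bounce, 4096, 1024 * 1024);	/* O_DIRECT wants alignment */
ssize_t n = read(datafd, bounce, 1024 * 1024);	/* DRAM, so gup works */
if (n > 0)
	memcpy(cache, bounce, n);		/* then copy into PMEM */
free(bounce);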


You also elided the paragraphs where I point out that this is an example
of a more general problem; there really are people who want to do RDMA
to DAX memory (the HPC crowd, of course), and we need to not open up
security holes when enabling that.  Since it's a potentially long-duration
and bi-directional mapping, the copy solution isn't going to work here
(without going all the way to solution 1).


* Re: direct_access, pinning and truncation
From: Matthew Wilcox @ 2014-10-09 16:44 UTC (permalink / raw)
  To: Zach Brown; +Cc: Matthew Wilcox, linux-fsdevel

On Wed, Oct 08, 2014 at 04:21:32PM -0700, Zach Brown wrote:
> [... figuring out how g_u_p() references can prevent freeing and
> re-using the underlying mapped pmem addresses given the lack of struct
> pages for the mapping]
> 
> > I see three solutions here:
> > 
> > 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> > the caller with the struct pages of the DRAM.  Modify DAX to handle some
> > file pages being in the page cache, and make sure that we know whether
> > the PMEM or DRAM is up to date.  This has the obvious downside that
> > get_user_pages() becomes slow.
> 
> And serialize transitions and fs stores to pmem regions.  And now
> storing to dram-fronted pmem goes through all the dirtying and writeback
> machinery.  This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really.  It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.

> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks.  We may be able to do better by
> > using a tree of pinned extents or something.  This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
> 
> This seems.. doable?  Recording the referenced pmem in free lists in the
> fs is fine as long as the pmem isn't modified until the references are
> released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

> Maybe in the allocator you skip otherwise free blocks if they intersect
> with the run time structure (rbtree of extents, presumably) that is
> taking the place of reference counts in struct page.  There aren't
> *that* many allocator entry points.  I guess you'd need to avoid other
> modifications of free space like trimming :/.  It still seems reasonably
> doable?

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory.  Nice.

> And hey, lord knows we love to implement rbtrees of extents in file
> systems!  (btrfs: struct extent_state, ext4: struct extent_status)
> 
> The tricky part would be maintaining that structure behind g_u_p() and
> put_page() calls.  Probably a richer interface that gives callers
> something more than just raw page pointers.
> 
> > 3. Make truncate() block if it hits a pinned page.  There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone.  We actually already have code
> > for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
> 
> I have no concrete examples, but I agree that it sounds like the sort of
> thing that would bite us in the ass if we miss some use case :/.
> 
> I guess my initial vote is for trying a less-than-perfect prototype of
> #2 to see just how hairy the rough outline gets.

Thinking about it now, it seems less hairy than I initially thought.  I'll
give it a quick try and see how it goes.



* Re: direct_access, pinning and truncation
From: Zach Brown @ 2014-10-09 19:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

> > Maybe in the allocator you skip otherwise free blocks if they intersect
> > with the run time structure (rbtree of extents, presumably) that is
> > taking the place of reference counts in struct page.  There aren't
> > *that* many allocator entry points.  I guess you'd need to avoid other
> > modifications of free space like trimming :/.  It still seems reasonably
> > doable?
> 
> Ah, so on reboot, the on-disk data structures are all correct, and
> the in-memory data structures went away with the runtime pinning of
> the memory.  Nice.

Yeah, that's what I was picturing.  The part I'm most fuzzy on is how to
get current g_u_p() callers consuming the mappings without full struct
pages.

- z


* Re: direct_access, pinning and truncation
From: Jan Kara @ 2014-10-10 10:01 UTC (permalink / raw)
  To: Zach Brown; +Cc: Matthew Wilcox, linux-fsdevel

On Thu 09-10-14 12:14:02, Zach Brown wrote:
> > > Maybe in the allocator you skip otherwise free blocks if they intersect
> > > with the run time structure (rbtree of extents, presumably) that is
> > > taking the place of reference counts in struct page.  There aren't
> > > *that* many allocator entry points.  I guess you'd need to avoid other
> > > modifications of free space like trimming :/.  It still seems reasonably
> > > doable?
> > 
> > Ah, so on reboot, the on-disk data structures are all correct, and
> > the in-memory data structures went away with the runtime pinning of
> > the memory.  Nice.
> 
> Yeah, that's what I was picturing.  The part I'm most fuzzy on is how to
> get current g_u_p() callers consuming the mappings without full struct
> pages.
  So the direct IO layer could be relatively easily converted to use just
PFNs instead of struct pages. But you'd also have to change the block layer
(bios) to work with PFNs instead of struct pages, and that's going to be
non-trivial IMHO.
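
For reference, every segment of every bio carries a struct page today:

struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

so a PFN-based path means auditing everything that dereferences bv_page.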

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: direct_access, pinning and truncation
From: Jan Kara @ 2014-10-10 13:08 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Wed 08-10-14 15:05:23, Matthew Wilcox wrote:
> 
> One of the things on my todo list is making O_DIRECT work to a
> memory-mapped direct_access file.  Right now, it simply doesn't work
> because there's no struct page for the memory, so get_user_pages() fails.
> Boaz has posted a patch to create struct pages for direct_access files,
> which is certainly one way of solving the immediate problem, but it
> ignores the deeper problem.
  Maybe we can set some terminology - direct IO has two 'endpoints' (I
don't want to talk about source / target because that swaps when talking
about reads / writes). One endpoint is a 'buffer' and the other endpoint is
the 'storage'. Now the 'buffer' may be a memory-mapped file on some
filesystem. In your case, what isn't working is when the 'buffer' is an
mmapped file on a DAX filesystem.

> For normal files, get_user_pages() elevates the reference count on
> the pages.  If those pages are subsequently truncated from the file,
> the underlying file blocks are released to the filesystem's free pool.
> The pages are removed from the page cache and the process's address space,
> but hang around until the caller of get_user_pages() calls put_page() on
> them again at which point they are released into the pool of free pages.
> 
> Once we have a struct page for (or some other way to handle pinning of)
> persistent memory blocks, truncating a file that has pinned pages will
> still cause the disk blocks to be released to the free pool.  But there
> weren't any pages of DRAM between the filesystem and the application!
> So those blocks are "freed" while still referenced.  And that reference
> might well be programmed into a piece of hardware that's doing DMA;
> it can't be stopped.
> 
> I see three solutions here:
> 
> 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> the caller with the struct pages of the DRAM.  Modify DAX to handle some
> file pages being in the page cache, and make sure that we know whether
> the PMEM or DRAM is up to date.  This has the obvious downside that
> get_user_pages() becomes slow.
> 
> 2. Modify filesystems that support DAX to handle pinning blocks.
> Some filesystems (that support COW and snapshots) already support
> reference-counting individual blocks.  We may be able to do better by
> using a tree of pinned extents or something.  This makes it much harder
> to modify a filesystem to support DAX, and I don't see patches adding
> this capability to ext2 being warmly welcomed.
> 
> 3. Make truncate() block if it hits a pinned page.  There's really no
> good reason to truncate a file that has pinned pages; it's either a bug
> or you're trying to be nasty to someone.  We actually already have code
> for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> just for O_DIRECT I/Os and other transient users like crypto, it's also
> for long-lived things like RDMA, where we could potentially block for
> an indefinite time.
  What option 3 seems to implicitly assume is that there are 'struct
pages' to pin. So do you expect to add struct page to PFNs which were the
target of get_user_pages()? And then check whether a PFN is pinned (has a
corresponding struct page) in the truncate code?

Note that inode_dio_wait() isn't really what you're looking for. That waits
for DIO pending against the 'storage'. Currently we don't track in any way
(except for elevated page reference counts) that a 'buffer' is an endpoint
of direct IO.

Thinking about options over and over again, I think trying something like
2) might be good. I'd still attach struct page to pinned PFNs to avoid some
troubles but you could delay freeing of fs blocks if they are pinned by
get_user_pages(). You could just hook into a path where filesystem frees
blocks - e.g. ext4 already does this anyway in ext4_mb_free_metadata()
since we free blocks in in-memory bitmaps only after the current
transaction is committed (changes in in-memory bitmaps happen from
ext4_journal_commit_callback(), which calls ext4_free_data_callback()). So
ext4 already handles the situation where in-memory bitmaps are different
from on disk ones and what you need is no different.
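
The hook could look roughly like this (all names invented, just to show
the shape of the idea):

/* Sketch: defer the in-memory "block is free" transition while
 * get_user_pages() pins are outstanding. */
static void fs_release_blocks(struct super_block *sb, u64 start, u64 count)
{
	if (dax_range_pinned(sb, start, count))
		dax_defer_release(sb, start, count);	/* redone when pins drop */
	else
		mark_free_in_memory(sb, start, count);
}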

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: direct_access, pinning and truncation
From: Matthew Wilcox @ 2014-10-10 14:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Wilcox, linux-fsdevel

On Fri, Oct 10, 2014 at 03:08:05PM +0200, Jan Kara wrote:
> > One of the things on my todo list is making O_DIRECT work to a
> > memory-mapped direct_access file.  Right now, it simply doesn't work
> > because there's no struct page for the memory, so get_user_pages() fails.
> > Boaz has posted a patch to create struct pages for direct_access files,
> > which is certainly one way of solving the immediate problem, but it
> > ignores the deeper problem.
>   Maybe we can set some terminology - direct IO has two 'endpoints' (I
> don't want to talk about source / target because that swaps when talking
> about reads / writes). One endpoint is a 'buffer' and the other endpoint is
> the 'storage'. Now the 'buffer' may be a memory-mapped file on some
> filesystem. In your case, what isn't working is when the 'buffer' is an
> mmapped file on a DAX filesystem.

Good terminology :-)

> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks.  We may be able to do better by
> > using a tree of pinned extents or something.  This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
> > 
> > 3. Make truncate() block if it hits a pinned page.  There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone.  We actually already have code
> > for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
>   What option 3 seems to implicitly assume is that there are 'struct
> pages' to pin. So do you expect to add struct page to PFNs which were the
> target of get_user_pages()? And then check whether a PFN is pinned (has a
> corresponding struct page) in the truncate code?

I'm assuming that we come up with *some* way to solve the missing struct
page problem.  Whether it's restructuring splice, O_DIRECT and RDMA to do
without struct pages, whether it's dynamically allocating struct pages,
whether it's statically allocating struct pages, whether it's coming up
with some other data structure that takes the place of struct page for
DAX ... doesn't matter for this part of the conversation.

> Note that inode_dio_wait() isn't really what you're looking for. That waits
> for DIO pending against the 'storage'. Currently we don't track in any way
> (except for elevated page reference counts) that a 'buffer' is an endpoint
> of direct IO.

Ah, I wasn't clear ... I was proposing incrementing i_dio_count on the
buffer's inode when get_user_pages() was called.
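
Roughly this, hypothetically (and ignoring locking):

/* when pinning a DAX-backed buffer: */
atomic_inc(&inode->i_dio_count);
/* ... the gup'd range is in use, possibly for a long time ... */
/* when the last pin drops: */
inode_dio_done(inode);	/* decrements i_dio_count, wakes inode_dio_wait() */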

> Thinking about options over and over again, I think trying something like
> 2) might be good. I'd still attach struct page to pinned PFNs to avoid some
> troubles but you could delay freeing of fs blocks if they are pinned by
> get_user_pages(). You could just hook into a path where filesystem frees
> blocks - e.g. ext4 already does this anyway in ext4_mb_free_metadata()
> since we free blocks in in-memory bitmaps only after the current
> transaction is committed (changes in in-memory bitmaps happen from
> ext4_journal_commit_callback(), which calls ext4_free_data_callback()). So
> ext4 already handles the situation where in-memory bitmaps are different
> from on disk ones and what you need is no different.

If this is something that (some) filesystems already do, then I feel
much happier about this idea!


* Re: direct_access, pinning and truncation
From: Dave Chinner @ 2014-10-13  1:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Thu, Oct 09, 2014 at 11:25:24AM -0400, Matthew Wilcox wrote:
> On Thu, Oct 09, 2014 at 12:10:38PM +1100, Dave Chinner wrote:
> > On Wed, Oct 08, 2014 at 03:05:23PM -0400, Matthew Wilcox wrote:
> > > 
> > > One of the things on my todo list is making O_DIRECT work to a
> > > memory-mapped direct_access file.
> > 
> > I don't understand the motivation or the use case: O_DIRECT is
> > purely for bypassing the page cache, and DAX already bypasses the
> > page cache.  What difference is there between the DAX read/write
> > path and a DAX-based O_DIRECT IO path, and why doesn't just ignoring
> > O_DIRECT for DAX enabled filesystems simply do what you need?
> 
> There are two filesystems involved ... if both (or neither!) are DAX,
> everything's fine.  The problem comes when you do things this way around:
> 
> int cachefd = open("/dax/cache", O_RDWR);
> int datafd = open("/nfs/bigdata", O_RDWR | O_DIRECT);
> void *cache = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
> 		MAP_SHARED, cachefd, 0);
> read(datafd, cache, 1024 * 1024);
> 
> The non-DAX filesystem needs to pin pages from the DAX filesystem while
> they're under I/O.

OK, that's what I was missing - it's not direct IO into/out of the
DAX filesystem - it's when you use the mmap()d DAX pages as the
source/destination of said direct IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: direct_access, pinning and truncation
From: Boaz Harrosh @ 2014-10-19  9:51 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner; +Cc: linux-fsdevel

On 10/09/2014 06:25 PM, Matthew Wilcox wrote:
> On Thu, Oct 09, 2014 at 12:10:38PM +1100, Dave Chinner wrote:
>> On Wed, Oct 08, 2014 at 03:05:23PM -0400, Matthew Wilcox wrote:
>>>
>>> One of the things on my todo list is making O_DIRECT work to a
>>> memory-mapped direct_access file.
>>
>> I don't understand the motivation or the use case: O_DIRECT is
>> purely for bypassing the page cache, and DAX already bypasses the
>> page cache.  What difference is there between the DAX read/write
>> path and a DAX-based O_DIRECT IO path, and why doesn't just ignoring
>> O_DIRECT for DAX enabled filesystems simply do what you need?
> 
> There are two filesystems involved ... if both (or neither!) are DAX,
> everything's fine.  The problem comes when you do things this way around:
> 
> int cachefd = open("/dax/cache", O_RDWR);
> int datafd = open("/nfs/bigdata", O_RDWR | O_DIRECT);
> void *cache = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
> 		MAP_SHARED, cachefd, 0);
> read(datafd, cache, 1024 * 1024);
> 

This BTW works today. What happens is that get_user_pages() fails, so the
direct IO to NFS above fails, and the VFS just reverts to buffered IO, which
works just fine with a simple memcpy to/from NFS's page cache.

> The non-DAX filesystem needs to pin pages from the DAX filesystem while
> they're under I/O.
> 
> 
> Another attempt to solve this problem might be to turn the O_DIRECT
> read into a read into a page of DRAM, followed by a copy from DRAM
> to PMEM.  Conversely, writes could be done as a copy to DRAM followed
> by a page-based write.
> 

So that's kind of stupid; why not let it be @datafd's page cache, like what
actually happens today?

> 
> You also elided the paragraphs where I point out that this is an example
> of a more general problem; there really are people who want to do RDMA
> to DAX memory (the HPC crowd, of course),

I do not yet see how, in your proposal, you can ever do RDMA without my
page-structs-for-pmem patch. This was exactly my motivation: to enable this,
and to enable direct block-layer access to pmem.

And yes, once the page-struct reference is held, say by RDMA, the block must
be left unallocatable until the refcount drops. This is exactly what we did
in our pmem+pages-based FS.

Today RDMA and/or any other subsystem access is simply not possible, so it
does not have this problem.

> And we need to not open up
> security holes when enabling that.  Since it's a potentially long-duration
> and bi-directional mapping, the copy solution isn't going to work here

I agree we should be careful not to open any holes. If done right it should
be fine. A pmem-aware FS should monitor the reference count of the pmem
page-struct and, if it is still held, must not recycle that block to
free-store but keep it held until the reference drops. It is quite simple
really.

That said, a sane application should not have this problem. There should be
no way for the RDMA to access loosely coupled pages that belong to nothing
(that used to belong to an mmapped file). For example, taking some kind of
flock on the file will make the truncate wait until the file is closed by
the app, and the app does not close it until the RDMA mapping is closed.
Otherwise, what is the point of this app?
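
Something in this spirit (a sketch; note flock is advisory, so the
truncating side has to take the lock too):

int fd = open("/dax/buf", O_RDWR);	/* path invented for the example */
flock(fd, LOCK_EX);		/* cooperating truncators now block */
/* mmap() the file and register the region with the RDMA card */
/* ... lifetime of the RDMA mapping ... */
flock(fd, LOCK_UN);		/* or just close(fd) */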

I agree that exposing pmem to external subsystems, unlike today, might pose
new challenges. But these are doable.

On top of Matthew's DAX patches, a simple API could be established with the
FS, where dax_truncate_page() communicates that a certain block must not be
returned to free-store right after the truncate, but only later on.

Thanks
Boaz



* Re: direct_access, pinning and truncation
From: Boaz Harrosh @ 2014-10-19 11:08 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara; +Cc: linux-fsdevel

On 10/10/2014 05:24 PM, Matthew Wilcox wrote:
<>
> 
> I'm assuming that we come up with *some* way to solve the missing struct
> page problem.  Whether it's restructuring splice, O_DIRECT and RDMA to do
> without struct pages, 

That makes no sense to me; where will it end? You are doubling the size of
the code to have two paths, and there will always be a subsystem you did not
touch that is missing support. And why? struct page was already invented to
do exactly what you want: track the state of a PFN.

> whether it's dynamically allocating struct pages,

I have tried this. It does not work. The PFN <-> page mapping is directly
calculated from the physical/virtual addresses, through the use of the
section object.

struct page is actually just a part of a bigger "section" object. You do
not allocate an individual page-struct; you need to allocate a whole memory
"section".
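
To see why, this is roughly how sparsemem computes the mapping
(include/asm-generic/memory_model.h):

/* pfn_to_page() resolves through the section's mem_map, so struct
 * pages only exist for whole memory sections: */
#define __pfn_to_page(pfn)					\
({	unsigned long __pfn = (pfn);				\
	struct mem_section *__sec = __pfn_to_section(__pfn);	\
	__section_mem_map_addr(__sec) + __pfn;			\
})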

> whether it's statically allocating struct pages, 

The best I came up with was "hotplug" allocation. It is rather static, but
hot-pluggable. Please inspect my patches. This came out very simple: the
minimum code possible that gives you all the above support, without the
need to change any of these subsystems, and it plugs nicely into all the
other subsystems you have not mentioned.

> whether it's coming up
> with some other data structure that takes the place of struct page for
> DAX ... 

Again: why reinvent the wheel when the old one works perfectly and does
everything you want, including the most important aspect: not adding any
new infrastructure and not modifying any code. So why even think about it?

> doesn't matter for this part of the conversation.
> 

I agree, this does not solve the reference problem; in this case DAX will
need a new entry point into the FS to communicate delayed block freeing. But
as Jan pointed out, this is not against current FS structure.

I think lots of current DAX problems and performance shortcomings can be
solved very nicely if we assume we have struct page for pmem. For example,
the use of the page lock instead of the i_mutex we take today.

Thanks
Boaz



* Re: direct_access, pinning and truncation
From: Dave Chinner @ 2014-10-19 23:01 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Matthew Wilcox, Jan Kara, linux-fsdevel

On Sun, Oct 19, 2014 at 02:08:07PM +0300, Boaz Harrosh wrote:
> On 10/10/2014 05:24 PM, Matthew Wilcox wrote:
> <>
> > 
> > I'm assuming that we come up with *some* way to solve the missing struct
> > page problem.  Whether it's restructuring splice, O_DIRECT and RDMA to do
> > without struct pages, 
> 
> That makes no sense to me; where will it end? You are doubling the size of
> the code to have two paths, and there will always be a subsystem you did not
> touch that is missing support. And why? struct page was already invented to
> do exactly what you want: track the state of a PFN.
.....
> > whether it's coming up
> > with some other data structure that takes the place of struct page for
> > DAX ... 
> 
> Again: why reinvent the wheel when the old one works perfectly and does
> everything you want, including the most important aspect: not adding any
> new infrastructure and not modifying any code. So why even think about it?
> 
> > doesn't matter for this part of the conversation.
> > 
> 
> I agree, this does not solve the reference problem; in this case DAX will
> need a new entry point into the FS to communicate delayed block freeing. But
> as Jan pointed out, this is not against current FS structure.
> 
> I think lots of current DAX problems and performance shortcomings can be
> solved very nicely if we assume we have struct page for pmem. For example,
> the use of the page lock instead of the i_mutex we take today.

Which makes me look at what DAX is intended for.

DAX is an enabler, allowing us to get direct access to PMEM with
*existing filesystem technologies*.  I don't want to have to add new
extent management functions to XFS to add temporary references to
allow DAX to hold onto extents after an inode has been freed because
some RDMA app has pinned the PMEM and forgot to let it go. That way
lies madness for existing filesystems - yes, we can add such warts
to them, but it's ugly, nasty and needed only by a very, very small
lunatic fringe of users.

IMO, this proposal is way outside the original DAX-replaces-XIP scope;
I really don't think that requiring extensive modifications to
filesystems to use DAX is a good idea. Apart from it being contrary to the
original architectural goal of DAX (which was "enable direct access
with minimal filesystem implementation impact"), we risk significant
impact on non-DAX users by requiring architectural changes to the
underlying filesystems to support DAX.

So my question is this: at what point do we say "out of scope for
DAX, make this work with a native PMEM filesystem"?  DAX as it
stands fills the "95% of what people need" goal with minimal effort;
our efforts should be focussed on merging what we have, not creeping
the scope and making it harder to implement and get merged.

If we want RDMA into PMEM devices or direct IO to/from persistent
memory, then I'd suggest that this is functionality that belongs in
native PMEM storage devices/filesystems and should be designed to be
efficient in that environment from the ground up.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: direct_access, pinning and truncation
From: Boaz Harrosh @ 2014-10-21  9:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, Jan Kara, linux-fsdevel

On 10/20/2014 02:01 AM, Dave Chinner wrote:
> On Sun, Oct 19, 2014 at 02:08:07PM +0300, Boaz Harrosh wrote:
>> On 10/10/2014 05:24 PM, Matthew Wilcox wrote:
>> <>
>>>
>>> I'm assuming that we come up with *some* way to solve the missing struct
>>> page problem.  Whether it's restructuring splice, O_DIRECT and RDMA to do
>>> without struct pages, 
>>
>> That makes no sense to me; where will it end? You are doubling the size of
>> the code to have two paths, and there will always be a subsystem you did not
>> touch that is missing support. And why? struct page was already invented to
>> do exactly what you want: track the state of a PFN.
> .....
>>> whether it's coming up
>>> with some other data structure that takes the place of struct page for
>>> DAX ... 
>>
>> Again: why reinvent the wheel when the old one works perfectly and does
>> everything you want, including the most important aspect: not adding any
>> new infrastructure and not modifying any code. So why even think about it?
>>
>>> doesn't matter for this part of the conversation.
>>>
>>
>> I agree, this does not solve the reference problem; in this case DAX will
>> need a new entry point into the FS to communicate delayed block freeing. But
>> as Jan pointed out, this is not against current FS structure.
>>
>> I think lots of current DAX problems and performance shortcomings can be
>> solved very nicely if we assume we have struct page for pmem. For example,
>> the use of the page lock instead of the i_mutex we take today.
> 
> Which makes me look at what DAX is intended for.
> 
> DAX is an enabler, allowing us to get direct access to PMEM with
> *existing filesystem technologies*.  I don't want to have to add new
> extent management functions to XFS to add temporary references to
> allow DAX to hold onto extents after an inode has been freed because
> some RDMA app has pinned the PMEM and forgot to let it go. That way
> lies madness for existing filesystems - yes, we can add such warts
> to them, but it's ugly, nasty and needed only by a very, very small
> lunatic fringe of users.
> 

I agree

> IMO, this proposal is way outside the original DAX-replaces-XIP scope;
> I really don't think that requiring extensive modifications to
> filesystems to use DAX is a good idea. Apart from it being contrary to the
> original architectural goal of DAX (which was "enable direct access
> with minimal filesystem implementation impact"), we risk significant
> impact on non-DAX users by requiring architectural changes to the
> underlying filesystems to support DAX.
> 
> So my question is this: at what point do we say "out of scope for
> DAX, make this work with a native PMEM filesystem"?  DAX as it
> stands fills the "95% of what people need" goal with minimal effort;
> our efforts should be focussed on merging what we have, not creeping
> the scope and making it harder to implement and get merged.
> 
> If we want RDMA into PMEM devices or direct IO to/from persistent
> memory, then I'd suggest that this is functionality that belongs in
> native PMEM storage devices/filesystems and should be designed to be
> efficient in that environment from the ground up.
> 

You convinced me. This is out of scope for DAX and is up to the user.
It actually works today, let me explain:

Today, after my patch to pmem, one can just mmap a file and pass the
returned pointer to any RDMA engine one chooses, and it will just work.
With the brd driver and DAX it will just work today, and even with old XIP.
The problem that remains is a truncate while RDMA-mapped. What the user
will need to do is take a lock on the file to ward off any truncates.
For me this is like trashing the block-dev directly while an FS is
mounted (I think - can a non-root user even do this?).
Please note that this scenario is possible today with a brd device.

> Cheers,
> Dave.
> 

Thanks
Boaz

