* [PATCH 1/2] mm: introduce put_user_page(), placeholder version
2018-07-09 8:05 [PATCH 0/2] mm/fs: put_user_page() proposal john.hubbard
@ 2018-07-09 8:05 ` john.hubbard
2018-07-09 10:08 ` kbuild test robot
2018-07-09 15:53 ` Jason Gunthorpe
2018-07-09 8:05 ` [PATCH 2/2] goldfish_pipe/mm: convert to the new put_user_page() call john.hubbard
` (2 subsequent siblings)
3 siblings, 2 replies; 16+ messages in thread
From: john.hubbard @ 2018-07-09 8:05 UTC (permalink / raw)
To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard
From: John Hubbard <jhubbard@nvidia.com>
Introduces put_user_page(), which simply calls put_page().
This provides a safe way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().
Also adds release_user_pages(), a drop-in replacement for
release_pages(). This is intended to be easily grep-able,
for later performance improvements, since release_user_pages
is not batched like release_pages is, and is significantly
slower.
Subsequent patches will add functionality to put_user_page().
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
include/linux/mm.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..db4a211aad79 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -923,6 +923,20 @@ static inline void put_page(struct page *page)
__put_page(page);
}
+/* Placeholder version, until all get_user_pages*() callers are updated. */
+static inline void put_user_page(struct page *page)
+{
+ put_page(page);
+}
+
+/* A drop-in replacement for release_pages(): */
+static inline void release_user_pages(struct page **pages,
+ unsigned long npages)
+{
+ while (npages)
+ put_user_page(pages[--npages]);
+}
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif
--
2.18.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
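[Editor's note] The helpers introduced by the patch can be exercised as a userspace model. This is a sketch, not kernel code: the toy struct page below carries only a bare refcount, and the function bodies mirror the patch hunk above.

```c
#include <assert.h>

/* Toy stand-in for the kernel's struct page; only the refcount matters
 * for modeling the put_page()/put_user_page() contract. */
struct page { int refcount; };

static void put_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Placeholder version, until all get_user_pages*() callers are updated:
 * semantically identical to put_page(), but grep-able. */
static void put_user_page(struct page *page)
{
	put_page(page);
}

/* A drop-in replacement for release_pages(); note it is a plain loop,
 * not batched, which is why the commit message calls it slower. */
static void release_user_pages(struct page **pages, unsigned long npages)
{
	while (npages)
		put_user_page(pages[--npages]);
}
```

A converted caller simply swaps put_page(pages[i]) for put_user_page(pages[i]); behavior is unchanged until later patches give put_user_page() real accounting.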
* Re: [PATCH 1/2] mm: introduce put_user_page(), placeholder version
2018-07-09 8:05 ` [PATCH 1/2] mm: introduce put_user_page(), placeholder version john.hubbard
@ 2018-07-09 10:08 ` kbuild test robot
2018-07-09 18:48 ` John Hubbard
2018-07-09 15:53 ` Jason Gunthorpe
1 sibling, 1 reply; 16+ messages in thread
From: kbuild test robot @ 2018-07-09 10:08 UTC (permalink / raw)
To: john.hubbard
Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
[-- Attachment #1: Type: text/plain, Size: 2741 bytes --]
Hi John,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc4 next-20180709]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-put_user_page-proposal/20180709-173653
config: x86_64-randconfig-x015-201827 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
Note: the linux-review/john-hubbard-gmail-com/mm-fs-put_user_page-proposal/20180709-173653 HEAD 3f7da023c5e08e49489e39be9cde820b0d1ab4d6 builds fine.
It only hurts bisectability.
All errors (new ones prefixed by >>):
>> drivers/platform//goldfish/goldfish_pipe.c:334:13: error: conflicting types for 'release_user_pages'
static void release_user_pages(struct page **pages, int pages_count,
^~~~~~~~~~~~~~~~~~
In file included from include/linux/scatterlist.h:8:0,
from include/linux/dma-mapping.h:11,
from drivers/platform//goldfish/goldfish_pipe.c:62:
include/linux/mm.h:933:20: note: previous definition of 'release_user_pages' was here
static inline void release_user_pages(struct page **pages,
^~~~~~~~~~~~~~~~~~
vim +/release_user_pages +334 drivers/platform//goldfish/goldfish_pipe.c
726ea1a8 Jin Qian 2017-04-20 333
726ea1a8 Jin Qian 2017-04-20 @334 static void release_user_pages(struct page **pages, int pages_count,
726ea1a8 Jin Qian 2017-04-20 335 int is_write, s32 consumed_size)
c89f2750 David 'Digit' Turner 2013-01-21 336 {
726ea1a8 Jin Qian 2017-04-20 337 int i;
c89f2750 David 'Digit' Turner 2013-01-21 338
726ea1a8 Jin Qian 2017-04-20 339 for (i = 0; i < pages_count; i++) {
726ea1a8 Jin Qian 2017-04-20 340 if (!is_write && consumed_size > 0)
726ea1a8 Jin Qian 2017-04-20 341 set_page_dirty(pages[i]);
726ea1a8 Jin Qian 2017-04-20 342 put_page(pages[i]);
726ea1a8 Jin Qian 2017-04-20 343 }
726ea1a8 Jin Qian 2017-04-20 344 }
726ea1a8 Jin Qian 2017-04-20 345
:::::: The code at line 334 was first introduced by commit
:::::: 726ea1a8ea96b2bba34ee2073b58f0770800701c goldfish_pipe: An implementation of more parallel pipe
:::::: TO: Jin Qian <jinqian@android.com>
:::::: CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26781 bytes --]

* Re: [PATCH 1/2] mm: introduce put_user_page(), placeholder version
2018-07-09 10:08 ` kbuild test robot
@ 2018-07-09 18:48 ` John Hubbard
0 siblings, 0 replies; 16+ messages in thread
From: John Hubbard @ 2018-07-09 18:48 UTC (permalink / raw)
To: kbuild test robot, john.hubbard
Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel
On 07/09/2018 03:08 AM, kbuild test robot wrote:
> Hi John,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
...
>
>>> drivers/platform//goldfish/goldfish_pipe.c:334:13: error: conflicting types for 'release_user_pages'
> static void release_user_pages(struct page **pages, int pages_count,
> ^~~~~~~~~~~~~~~~~~
Yes. Patches #1 and #2 need to be combined here. I'll do that in the next version, which will probably include several of the easier put_user_page() conversions, as well.
thanks,
--
John Hubbard
NVIDIA
* Re: [PATCH 1/2] mm: introduce put_user_page(), placeholder version
2018-07-09 8:05 ` [PATCH 1/2] mm: introduce put_user_page(), placeholder version john.hubbard
2018-07-09 10:08 ` kbuild test robot
@ 2018-07-09 15:53 ` Jason Gunthorpe
2018-07-09 16:11 ` Jan Kara
1 sibling, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2018-07-09 15:53 UTC (permalink / raw)
To: john.hubbard
Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
John Hubbard
On Mon, Jul 09, 2018 at 01:05:53AM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>
> Introduces put_user_page(), which simply calls put_page().
> This provides a safe way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
>
> Also adds release_user_pages(), a drop-in replacement for
> release_pages(). This is intended to be easily grep-able,
> for later performance improvements, since release_user_pages
> is not batched like release_pages is, and is significantly
> slower.
>
> Subsequent patches will add functionality to put_user_page().
>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> include/linux/mm.h | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9ffe380..db4a211aad79 100644
> +++ b/include/linux/mm.h
> @@ -923,6 +923,20 @@ static inline void put_page(struct page *page)
> __put_page(page);
> }
>
> +/* Placeholder version, until all get_user_pages*() callers are updated. */
> +static inline void put_user_page(struct page *page)
> +{
> + put_page(page);
> +}
> +
> +/* A drop-in replacement for release_pages(): */
> +static inline void release_user_pages(struct page **pages,
> + unsigned long npages)
> +{
> + while (npages)
> + put_user_page(pages[--npages]);
> +}
> +
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> #define SECTION_IN_PAGE_FLAGS
> #endif
Just a question: do you think it is worthwhile to have a
release_user_page_dirtied() helper as well?
I.e., to indicate that pages that were grabbed under GUP FOLL_WRITE
were actually written to?
It keeps more of these unimportant details out of the drivers.
Jason
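[Editor's note] What Jason is proposing could look roughly like the sketch below. The helper name follows his suggestion, but the signature and all plumbing are assumptions (no such API exists in this series), and it is a userspace model, not kernel code.

```c
#include <assert.h>

/* Toy page model: a refcount plus a dirty flag standing in for the
 * kernel's set_page_dirty(). Not kernel code. */
struct page { int refcount; int dirty; };

static void set_page_dirty(struct page *page) { page->dirty = 1; }
static void put_user_page(struct page *page) { page->refcount--; }

/* Hypothetical helper in the spirit of Jason's suggestion: release a
 * page that was grabbed with FOLL_WRITE and actually written to, so
 * the dirty bookkeeping stays out of drivers. */
static void release_user_page_dirtied(struct page *page)
{
	set_page_dirty(page);
	put_user_page(page);
}

/* Driver side collapses the open-coded dirty-then-put idiom: */
static void driver_release(struct page **pages, int n, int was_written)
{
	int i;

	for (i = 0; i < n; i++) {
		if (was_written)
			release_user_page_dirtied(pages[i]);
		else
			put_user_page(pages[i]);
	}
}
```

Compare this with the goldfish_pipe loop in patch 2, which open-codes exactly this set_page_dirty() + put sequence.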
* Re: [PATCH 1/2] mm: introduce put_user_page(), placeholder version
2018-07-09 15:53 ` Jason Gunthorpe
@ 2018-07-09 16:11 ` Jan Kara
0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-07-09 16:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
Dan Williams, Jan Kara, Al Viro, linux-mm, LKML, linux-rdma,
linux-fsdevel, John Hubbard
On Mon 09-07-18 09:53:57, Jason Gunthorpe wrote:
> On Mon, Jul 09, 2018 at 01:05:53AM -0700, john.hubbard@gmail.com wrote:
> > From: John Hubbard <jhubbard@nvidia.com>
> >
> > Introduces put_user_page(), which simply calls put_page().
> > This provides a safe way to update all get_user_pages*() callers,
> > so that they call put_user_page(), instead of put_page().
> >
> > Also adds release_user_pages(), a drop-in replacement for
> > release_pages(). This is intended to be easily grep-able,
> > for later performance improvements, since release_user_pages
> > is not batched like release_pages is, and is significantly
> > slower.
> >
> > Subsequent patches will add functionality to put_user_page().
> >
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > include/linux/mm.h | 14 ++++++++++++++
> > 1 file changed, 14 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a0fbb9ffe380..db4a211aad79 100644
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,20 @@ static inline void put_page(struct page *page)
> > __put_page(page);
> > }
> >
> > +/* Placeholder version, until all get_user_pages*() callers are updated. */
> > +static inline void put_user_page(struct page *page)
> > +{
> > + put_page(page);
> > +}
> > +
> > +/* A drop-in replacement for release_pages(): */
> > +static inline void release_user_pages(struct page **pages,
> > + unsigned long npages)
> > +{
> > + while (npages)
> > + put_user_page(pages[--npages]);
> > +}
> > +
> > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> > #define SECTION_IN_PAGE_FLAGS
> > #endif
>
> Just a question: do you think it is worthwhile to have a
> release_user_page_dirtied() helper as well?
>
> I.e., to indicate that pages that were grabbed under GUP FOLL_WRITE
> were actually written to?
>
> It keeps more of these unimportant details out of the drivers.
Yeah, I think that would be nice as well.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* [PATCH 2/2] goldfish_pipe/mm: convert to the new put_user_page() call
2018-07-09 8:05 [PATCH 0/2] mm/fs: put_user_page() proposal john.hubbard
2018-07-09 8:05 ` [PATCH 1/2] mm: introduce put_user_page(), placeholder version john.hubbard
@ 2018-07-09 8:05 ` john.hubbard
2018-07-09 8:49 ` [PATCH 0/2] mm/fs: put_user_page() proposal Nicholas Piggin
2018-07-09 16:27 ` Jan Kara
3 siblings, 0 replies; 16+ messages in thread
From: john.hubbard @ 2018-07-09 8:05 UTC (permalink / raw)
To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard
From: John Hubbard <jhubbard@nvidia.com>
For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(),
instead of put_page().
Also: rename release_user_pages(), to avoid a naming
conflict with the new external function of the same name.
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
drivers/platform/goldfish/goldfish_pipe.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index 3e32a4c14d5f..3ab871c22a88 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -331,7 +331,7 @@ static int pin_user_pages(unsigned long first_page, unsigned long last_page,
}
-static void release_user_pages(struct page **pages, int pages_count,
+static void __release_user_pages(struct page **pages, int pages_count,
int is_write, s32 consumed_size)
{
int i;
@@ -339,7 +339,7 @@ static void release_user_pages(struct page **pages, int pages_count,
for (i = 0; i < pages_count; i++) {
if (!is_write && consumed_size > 0)
set_page_dirty(pages[i]);
- put_page(pages[i]);
+ put_user_page(pages[i]);
}
}
@@ -409,7 +409,7 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
- release_user_pages(pages, pages_count, is_write, *consumed_size);
+ __release_user_pages(pages, pages_count, is_write, *consumed_size);
mutex_unlock(&pipe->lock);
--
2.18.0
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 8:05 [PATCH 0/2] mm/fs: put_user_page() proposal john.hubbard
2018-07-09 8:05 ` [PATCH 1/2] mm: introduce put_user_page(), placeholder version john.hubbard
2018-07-09 8:05 ` [PATCH 2/2] goldfish_pipe/mm: convert to the new put_user_page() call john.hubbard
@ 2018-07-09 8:49 ` Nicholas Piggin
2018-07-09 16:08 ` Jan Kara
2018-07-09 16:27 ` Jan Kara
3 siblings, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2018-07-09 8:49 UTC (permalink / raw)
To: john.hubbard
Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
On Mon, 9 Jul 2018 01:05:52 -0700
john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>
> Hi,
>
> With respect to tracking get_user_pages*() pages with page->dma_pinned*
> fields [1], I spent a few days retrofitting most of the get_user_pages*()
> call sites, by adding calls to a new put_user_page() function, in place
> of put_page(), where appropriate. This will work, but it's a large effort.
>
> Design note: I didn't see anything that hinted at a way to fix this
> problem, without actually changing all of the get_user_pages*() call sites,
> so I think it's reasonable to start with that.
>
> Anyway, it's still incomplete, but because this is a large, tree-wide
> change (that will take some time and testing), I'd like to propose a plan,
> before spamming zillions of people with put_user_page() conversion patches.
> So I picked out the first two patches to show where this is going.
>
> Proposed steps:
>
> Step 1:
>
> Start with the patches here, then continue with...dozens more.
> This will eventually convert all of the call sites to use put_user_page().
> This is easy in some places, but complex in others, such as:
>
> -- drivers/gpu/drm/amd
> -- bio
> -- fuse
> -- cifs
> -- anything from:
> git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
>
> The easy ones can be grouped into a single patchset, perhaps, and the
> complex ones probably each need a patchset, in order to get the in-depth
> review they'll need.
>
> Furthermore, some of these areas I hope to attract some help on, once
> this starts going.
>
> Step 2:
>
> In parallel, tidy up the core patchset that was discussed in [1], (version
> 2 has already been reviewed, so I know what to do), and get it perfected
> and reviewed. Don't apply it until step 1 is all done, though.
>
> Step 3:
>
> Activate refcounting of dma-pinned pages (essentially, patch #5, which is
> [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
> mopping up any missed call sites.
>
> Step 4:
>
> After some soak time, actually connect it up (patch #6 of [1]) and start
> taking action based on the new page->dma_pinned* fields.
You can use my decade old patch!
https://lkml.org/lkml/2009/2/17/113
The problem with blocking in clear_page_dirty_for_io is that the fs is
holding the page lock (or locks) and possibly others too. If you
expect to have a bunch of long term references hanging around on the
page, then there will be hangs and deadlocks everywhere. And if you do
not have such long term references, then page lock (or some similar lock
bit) for the duration of the DMA should be about enough?
I think it has to be more fundamental to the filesystem. The filesystem
would get callbacks to register such long term dirtying on its files.
Then it can do locking, resource allocation, -ENOTSUPP, etc.
Thanks,
Nick
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 8:49 ` [PATCH 0/2] mm/fs: put_user_page() proposal Nicholas Piggin
@ 2018-07-09 16:08 ` Jan Kara
2018-07-09 17:16 ` Matthew Wilcox
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2018-07-09 16:08 UTC (permalink / raw)
To: Nicholas Piggin
Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> On Mon, 9 Jul 2018 01:05:52 -0700
> john.hubbard@gmail.com wrote:
>
> > From: John Hubbard <jhubbard@nvidia.com>
> >
> > Hi,
> >
> > With respect to tracking get_user_pages*() pages with page->dma_pinned*
> > fields [1], I spent a few days retrofitting most of the get_user_pages*()
> > call sites, by adding calls to a new put_user_page() function, in place
> > of put_page(), where appropriate. This will work, but it's a large effort.
> >
> > Design note: I didn't see anything that hinted at a way to fix this
> > problem, without actually changing all of the get_user_pages*() call sites,
> > so I think it's reasonable to start with that.
> >
> > Anyway, it's still incomplete, but because this is a large, tree-wide
> > change (that will take some time and testing), I'd like to propose a plan,
> > before spamming zillions of people with put_user_page() conversion patches.
> > So I picked out the first two patches to show where this is going.
> >
> > Proposed steps:
> >
> > Step 1:
> >
> > Start with the patches here, then continue with...dozens more.
> > This will eventually convert all of the call sites to use put_user_page().
> > This is easy in some places, but complex in others, such as:
> >
> > -- drivers/gpu/drm/amd
> > -- bio
> > -- fuse
> > -- cifs
> > -- anything from:
> > git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
> >
> > The easy ones can be grouped into a single patchset, perhaps, and the
> > complex ones probably each need a patchset, in order to get the in-depth
> > review they'll need.
> >
> > Furthermore, some of these areas I hope to attract some help on, once
> > this starts going.
> >
> > Step 2:
> >
> > In parallel, tidy up the core patchset that was discussed in [1], (version
> > 2 has already been reviewed, so I know what to do), and get it perfected
> > and reviewed. Don't apply it until step 1 is all done, though.
> >
> > Step 3:
> >
> > Activate refcounting of dma-pinned pages (essentially, patch #5, which is
> > [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
> > mopping up any missed call sites.
> >
> > Step 4:
> >
> > After some soak time, actually connect it up (patch #6 of [1]) and start
> > taking action based on the new page->dma_pinned* fields.
>
> You can use my decade old patch!
>
> https://lkml.org/lkml/2009/2/17/113
The problem has a longer history than I thought ;)
> The problem with blocking in clear_page_dirty_for_io is that the fs is
> holding the page lock (or locks) and possibly others too. If you
> expect to have a bunch of long term references hanging around on the
> page, then there will be hangs and deadlocks everywhere. And if you do
> not have such long term references, then page lock (or some similar lock
> bit) for the duration of the DMA should be about enough?
There are two separate questions:
1) How to identify pages pinned for DMA? We have no bit in struct page to
use and we cannot reuse page lock as that immediately creates lock
inversions e.g. in direct IO code (which could be fixed but then good luck
with auditing all the other GUP users). Matthew had an idea and John
implemented it based on removing page from LRU and using that space in
struct page. So we at least have a way to identify pages that are pinned
and can track their pin count.
2) What to do when some page is pinned but we need to do e.g.
clear_page_dirty_for_io(). After some more thinking I agree with you that
just blocking waiting for page to unpin will create deadlocks like:
ext4_writepages()                            ext4_direct_IO_write()
                                               __blockdev_direct_IO()
                                                 iov_iter_get_pages()
                                                   - pins page
  handle = ext4_journal_start_with_reserve(inode, ...)
    - starts transaction
  ...
  lock_page(page)
  mpage_submit_page()
    clear_page_dirty_for_io(page) -> blocks on pin
                                                 ext4_dio_get_block_unwritten_sync()
                                                   - called to allocate
                                                     blocks for DIO
                                                   ext4_journal_start()
                                                     - may block and wait
                                                       for transaction
                                                       started by
                                                       ext4_writepages() to
                                                       finish
> I think it has to be more fundamental to the filesystem. Filesystem
> would get callbacks to register such long term dirtying on its files.
> Then it can do locking, resource allocation, -ENOTSUPP, etc.
Well, direct IO would not classify as long term dirtying, I guess, but still:
regardless of how we identify pinned pages, just waiting in
clear_page_dirty_for_io() is going to cause deadlocks. So I agree with you
that the solution (even for short term GUP users) will need filesystem
changes. I don't see a need for fs callbacks at pin time (as I don't see
much fs-specific work to do there), but we will probably need to provide a
way to wait for outstanding pins, and to prevent new ones, for a given mapping
range while writeback / unmapping is running.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
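[Editor's note] Jan's closing idea (wait for outstanding pins and prevent new ones for a given mapping range) can be modeled as a small state machine. This is a userspace toy; every name is invented for illustration and none of it is kernel API.

```c
#include <assert.h>

/* Toy per-mapping-range pin bookkeeping. In the kernel this would need
 * real locking and sleeping; the single-threaded model only shows the
 * state machine: writeback freezes the range, refuses new pins, and
 * proceeds once the pin count drains to zero. */
struct range_pins {
	int pins;	/* outstanding GUP pins in the range */
	int frozen;	/* writeback/unmap is excluding new pins */
};

/* GUP side: take a pin unless the range is frozen. */
static int try_pin(struct range_pins *r)
{
	if (r->frozen)
		return 0;
	r->pins++;
	return 1;
}

static void unpin(struct range_pins *r)
{
	assert(r->pins > 0);
	r->pins--;
}

/* Writeback side: block new pins and report whether old ones have
 * drained (a real version would sleep here instead of polling). */
static int freeze_range(struct range_pins *r)
{
	r->frozen = 1;
	return r->pins == 0;
}

static void unfreeze_range(struct range_pins *r)
{
	r->frozen = 0;
}
```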
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 16:08 ` Jan Kara
@ 2018-07-09 17:16 ` Matthew Wilcox
2018-07-09 19:47 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2018-07-09 17:16 UTC (permalink / raw)
To: Jan Kara
Cc: Nicholas Piggin, john.hubbard, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > holding the page lock (or locks) and possibly others too. If you
> > expect to have a bunch of long term references hanging around on the
> > page, then there will be hangs and deadlocks everywhere. And if you do
> > not have such long term references, then page lock (or some similar lock
> > bit) for the duration of the DMA should be about enough?
>
> There are two separate questions:
>
> 1) How to identify pages pinned for DMA? We have no bit in struct page to
> use and we cannot reuse page lock as that immediately creates lock
> inversions e.g. in direct IO code (which could be fixed but then good luck
> with auditing all the other GUP users). Matthew had an idea and John
> implemented it based on removing page from LRU and using that space in
> struct page. So we at least have a way to identify pages that are pinned
> and can track their pin count.
>
> 2) What to do when some page is pinned but we need to do e.g.
> clear_page_dirty_for_io(). After some more thinking I agree with you that
> just blocking waiting for page to unpin will create deadlocks like:
Why are we trying to writeback a page that is pinned? It's presumed to
be continuously redirtied by its pinner. We can't evict it.
> > ext4_writepages()                            ext4_direct_IO_write()
> >                                                __blockdev_direct_IO()
> >                                                  iov_iter_get_pages()
> >                                                    - pins page
> >   handle = ext4_journal_start_with_reserve(inode, ...)
> >     - starts transaction
> >   ...
> >   lock_page(page)
> >   mpage_submit_page()
> >     clear_page_dirty_for_io(page) -> blocks on pin
I don't think it should block. It should fail.
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 17:16 ` Matthew Wilcox
@ 2018-07-09 19:47 ` Jan Kara
2018-07-09 19:56 ` Jason Gunthorpe
2018-07-09 20:00 ` Matthew Wilcox
0 siblings, 2 replies; 16+ messages in thread
From: Jan Kara @ 2018-07-09 19:47 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jan Kara, Nicholas Piggin, john.hubbard, Michal Hocko,
Christopher Lameter, Jason Gunthorpe, Dan Williams, Al Viro,
linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard
On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > > holding the page lock (or locks) and possibly others too. If you
> > > expect to have a bunch of long term references hanging around on the
> > > page, then there will be hangs and deadlocks everywhere. And if you do
> > > not have such long term references, then page lock (or some similar lock
> > > bit) for the duration of the DMA should be about enough?
> >
> > There are two separate questions:
> >
> > 1) How to identify pages pinned for DMA? We have no bit in struct page to
> > use and we cannot reuse page lock as that immediately creates lock
> > inversions e.g. in direct IO code (which could be fixed but then good luck
> > with auditing all the other GUP users). Matthew had an idea and John
> > implemented it based on removing page from LRU and using that space in
> > struct page. So we at least have a way to identify pages that are pinned
> > and can track their pin count.
> >
> > 2) What to do when some page is pinned but we need to do e.g.
> > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > just blocking waiting for page to unpin will create deadlocks like:
>
> Why are we trying to writeback a page that is pinned? It's presumed to
> be continuously redirtied by its pinner. We can't evict it.
So what should be the result of fsync(file), where some 'file' pages are
pinned, e.g. by running direct IO? If we just skip those pages, we'll lie to
userspace that data was committed while it was not (and it's not only about
data that has landed in those pages via DMA; you can have the first 1k of a page
modified by normal IO in parallel with DMA modifying the second 1k chunk). If
fsync(2) returns an error, it would be really unexpected by userspace and most
apps will just not handle that correctly. So what else can you do but
block?
> > ext4_writepages()                            ext4_direct_IO_write()
> >                                                __blockdev_direct_IO()
> >                                                  iov_iter_get_pages()
> >                                                    - pins page
> >   handle = ext4_journal_start_with_reserve(inode, ...)
> >     - starts transaction
> >   ...
> >   lock_page(page)
> >   mpage_submit_page()
> >     clear_page_dirty_for_io(page) -> blocks on pin
>
> I don't think it should block. It should fail.
See above...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 19:47 ` Jan Kara
@ 2018-07-09 19:56 ` Jason Gunthorpe
2018-07-10 7:51 ` Jan Kara
2018-07-09 20:00 ` Matthew Wilcox
1 sibling, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2018-07-09 19:56 UTC (permalink / raw)
To: Jan Kara
Cc: Matthew Wilcox, Nicholas Piggin, john.hubbard, Michal Hocko,
Christopher Lameter, Dan Williams, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > > > holding the page lock (or locks) and possibly others too. If you
> > > > expect to have a bunch of long term references hanging around on the
> > > > page, then there will be hangs and deadlocks everywhere. And if you do
> > > > not have such long term references, then page lock (or some similar lock
> > > > bit) for the duration of the DMA should be about enough?
> > >
> > > There are two separate questions:
> > >
> > > 1) How to identify pages pinned for DMA? We have no bit in struct page to
> > > use and we cannot reuse page lock as that immediately creates lock
> > > inversions e.g. in direct IO code (which could be fixed but then good luck
> > > with auditing all the other GUP users). Matthew had an idea and John
> > > implemented it based on removing page from LRU and using that space in
> > > struct page. So we at least have a way to identify pages that are pinned
> > > and can track their pin count.
> > >
> > > 2) What to do when some page is pinned but we need to do e.g.
> > > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > > just blocking waiting for page to unpin will create deadlocks like:
> >
> > Why are we trying to writeback a page that is pinned? It's presumed to
> > be continuously redirtied by its pinner. We can't evict it.
>
> So what should be a result of fsync(file), where some 'file' pages are
> pinned e.g. by running direct IO? If we just skip those pages, we'll lie to
> userspace that data was committed while it was not (and it's not only about
> data that has landed in those pages via DMA, you can have first 1k of a page
> modified by normal IO in parallel to DMA modifying second 1k chunk). If
> fsync(2) returns error, it would be really unexpected by userspace and most
> apps will just not handle that correctly. So what else can you do than
> block?
I think as userspace I would expect the 'current content' to be
flushed without waiting.
If you block fsync(), then anyone using an RDMA MR with it will just
deadlock. What happens if two processes open the same file and
one makes an MR and the other calls fsync()? Sounds bad.
Jason
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 19:56 ` Jason Gunthorpe
@ 2018-07-10 7:51 ` Jan Kara
0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-07-10 7:51 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jan Kara, Matthew Wilcox, Nicholas Piggin, john.hubbard,
Michal Hocko, Christopher Lameter, Dan Williams, Al Viro,
linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard
On Mon 09-07-18 13:56:57, Jason Gunthorpe wrote:
> On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > > > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > > > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > > > > holding the page lock (or locks) and possibly others too. If you
> > > > > expect to have a bunch of long term references hanging around on the
> > > > > page, then there will be hangs and deadlocks everywhere. And if you do
> > > > > not have such long term references, then page lock (or some similar lock
> > > > > bit) for the duration of the DMA should be about enough?
> > > >
> > > > There are two separate questions:
> > > >
> > > > 1) How to identify pages pinned for DMA? We have no bit in struct page to
> > > > use and we cannot reuse page lock as that immediately creates lock
> > > > inversions e.g. in direct IO code (which could be fixed but then good luck
> > > > with auditing all the other GUP users). Matthew had an idea and John
> > > > implemented it based on removing page from LRU and using that space in
> > > > struct page. So we at least have a way to identify pages that are pinned
> > > > and can track their pin count.
> > > >
> > > > 2) What to do when some page is pinned but we need to do e.g.
> > > > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > > > just blocking waiting for page to unpin will create deadlocks like:
> > >
> > > Why are we trying to writeback a page that is pinned? It's presumed to
> > > be continuously redirtied by its pinner. We can't evict it.
> >
> > So what should be a result of fsync(file), where some 'file' pages are
> > pinned e.g. by running direct IO? If we just skip those pages, we'll lie to
> > userspace that data was committed while it was not (and it's not only about
> > data that has landed in those pages via DMA, you can have first 1k of a page
> > modified by normal IO in parallel to DMA modifying second 1k chunk). If
> > fsync(2) returns error, it would be really unexpected by userspace and most
> > apps will just not handle that correctly. So what else can you do than
> > block?
>
> I think as a userspace I would expect the 'current content' to be
> flushed without waiting..
Yes, but the problem is that we cannot generally write out a page whose
contents are possibly changing (e.g. RAID5 checksums would then be wrong). But
maybe using bounce pages (and keeping the original page dirty) in such a case
would
be worth it - originally I thought using bounce pages would not bring us
much but now seeing problems with blocking in more detail maybe they are
worth the trouble after all...
> If you block fsync() then anyone using an RDMA MR with it will just
> deadlock. What happens if two processes open the same file and
> one makes an MR and the other calls fsync()? Sounds bad.
Yes, that's one of the reasons why we were discussing revoke mechanisms for
long term pins. But with bounce pages we could possibly avoid that (except
for cases like DAX + truncate where it's really unavoidable, but there it's
new functionality, so mandating revoke and returning an error otherwise is
fine, I guess).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 19:47 ` Jan Kara
2018-07-09 19:56 ` Jason Gunthorpe
@ 2018-07-09 20:00 ` Matthew Wilcox
2018-07-10 8:21 ` Jan Kara
1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2018-07-09 20:00 UTC (permalink / raw)
To: Jan Kara
Cc: Nicholas Piggin, john.hubbard, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > 2) What to do when some page is pinned but we need to do e.g.
> > > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > > just blocking waiting for page to unpin will create deadlocks like:
> >
> > Why are we trying to writeback a page that is pinned? It's presumed to
> > be continuously redirtied by its pinner. We can't evict it.
>
> So what should be a result of fsync(file), where some 'file' pages are
> pinned e.g. by running direct IO? If we just skip those pages, we'll lie to
> userspace that data was committed while it was not (and it's not only about
> data that has landed in those pages via DMA, you can have first 1k of a page
> modified by normal IO in parallel to DMA modifying second 1k chunk). If
> fsync(2) returns error, it would be really unexpected by userspace and most
> apps will just not handle that correctly. So what else can you do than
> block?
I was thinking about writeback, and neglected the fsync case. For fsync,
we could copy the "current" contents of the page to a freshly-allocated
page and write _that_ to disc? As long as we redirty the real page after
the pin is dropped, I think we're fine.
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 20:00 ` Matthew Wilcox
@ 2018-07-10 8:21 ` Jan Kara
0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-07-10 8:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jan Kara, Nicholas Piggin, john.hubbard, Michal Hocko,
Christopher Lameter, Jason Gunthorpe, Dan Williams, Al Viro,
linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard
On Mon 09-07-18 13:00:49, Matthew Wilcox wrote:
> On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > > 2) What to do when some page is pinned but we need to do e.g.
> > > > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > > > just blocking waiting for page to unpin will create deadlocks like:
> > >
> > > Why are we trying to writeback a page that is pinned? It's presumed to
> > > be continuously redirtied by its pinner. We can't evict it.
> >
> > So what should be a result of fsync(file), where some 'file' pages are
> > pinned e.g. by running direct IO? If we just skip those pages, we'll lie to
> > userspace that data was committed while it was not (and it's not only about
> > data that has landed in those pages via DMA, you can have first 1k of a page
> > modified by normal IO in parallel to DMA modifying second 1k chunk). If
> > fsync(2) returns error, it would be really unexpected by userspace and most
> > apps will just not handle that correctly. So what else can you do than
> > block?
>
> I was thinking about writeback, and neglected the fsync case.
For memory cleaning writeback, skipping is certainly the right thing to do,
and that's what we plan to do.
> For fsync, we could copy the "current" contents of the page to a
> freshly-allocated page and write _that_ to disc? As long as we redirty
> the real page after the pin is dropped, I think we're fine.
So for record, this technique is called "bouncing" in block layer
terminology and we do have a support for it there (see block/bounce.c). It
would need some tweaking (e.g. a bio flag to indicate that some page in a
bio needs bouncing if underlying storage requires stable pages) but that is
easy to do - we even had support for something similar some years back, as
ext3 needed it to guarantee that a metadata buffer cannot be modified
while IO is running on it.
I was actually considering using this some time ago but then disregarded it,
as it seemed it wouldn't buy us much compared to blocking / skipping. But now,
seeing the troubles with blocking, using page bouncing
for situations where we cannot just skip page writeout looks indeed
appealing. Thanks for suggesting that!
As a side note I'm not 100% decided whether it is better to keep the
original page dirty all the time while it is pinned or not. I'm more
inclined to keeping it dirty all the time, as it gives mm more accurate
information about the number of really dirty pages, prevents reclaim of the
filesystem's dirtiness / allocation tracking information (buffers or
whatever it has attached to the page), and generally avoids a "surprising"
set_page_dirty() once the page is unpinned (one less dirtying path for
filesystems to care about). OTOH it would make flusher threads always try
to writeback these pages only to skip them, fsync(2) would always write
them, etc...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 0/2] mm/fs: put_user_page() proposal
2018-07-09 8:05 [PATCH 0/2] mm/fs: put_user_page() proposal john.hubbard
` (2 preceding siblings ...)
2018-07-09 8:49 ` [PATCH 0/2] mm/fs: put_user_page() proposal Nicholas Piggin
@ 2018-07-09 16:27 ` Jan Kara
3 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-07-09 16:27 UTC (permalink / raw)
To: john.hubbard
Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
linux-rdma, linux-fsdevel, John Hubbard
Hi,
On Mon 09-07-18 01:05:52, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>
> With respect to tracking get_user_pages*() pages with page->dma_pinned*
> fields [1], I spent a few days retrofitting most of the get_user_pages*()
> call sites, by adding calls to a new put_user_page() function, in place
> of put_page(), where appropriate. This will work, but it's a large effort.
>
> Design note: I didn't see anything that hinted at a way to fix this
> problem, without actually changing all of the get_user_pages*() call sites,
> so I think it's reasonable to start with that.
Agreed.
> Anyway, it's still incomplete, but because this is a large, tree-wide
> change (that will take some time and testing), I'd like to propose a plan,
> before spamming zillions of people with put_user_page() conversion patches.
> So I picked out the first two patches to show where this is going.
>
> Proposed steps:
>
> Step 1:
>
> Start with the patches here, then continue with...dozens more.
> This will eventually convert all of the call sites to use put_user_page().
> This is easy in some places, but complex in others, such as:
>
> -- drivers/gpu/drm/amd
> -- bio
> -- fuse
> -- cifs
> -- anything from:
> git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
>
> The easy ones can be grouped into a single patchset, perhaps, and the
> complex ones probably each need a patchset, in order to get the in-depth
> review they'll need.
Agreed.
> Furthermore, some of these areas I hope to attract some help on, once
> this starts going.
>
> Step 2:
>
> In parallel, tidy up the core patchset that was discussed in [1], (version
> 2 has already been reviewed, so I know what to do), and get it perfected
> and reviewed. Don't apply it until step 1 is all done, though.
>
> Step 3:
>
> Activate refcounting of dma-pinned pages (essentially, patch #5, which is
> [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
> mopping up any missed call sites.
>
> Step 4:
>
> After some soak time, actually connect it up (patch #6 of [1]) and start
> taking action based on the new page->dma_pinned* fields.
>
> [1] https://www.spinics.net/lists/linux-mm/msg156409.html
>
> or, the same thread on LKML if it's working for you:
>
> https://lkml.org/lkml/2018/7/4/368
Yeah, but as Nick pointed out we have some more work to do in step 4 to
avoid deadlocks. Still, there's a lot of work to do where the direction of
progress is clear :).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR