linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
@ 2020-06-24 19:14 Chris Wilson
  2020-06-24 19:21 ` Jason Gunthorpe
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Chris Wilson @ 2020-06-24 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, intel-gfx, Chris Wilson, Andrew Morton, Jan Kara,
	Jérôme Glisse, John Hubbard, Claudio Imbrenda,
	Kirill A . Shutemov, Jason Gunthorpe

A general rule of thumb is that shrinkers should be fast and effective.
They are called from direct reclaim at the most inconvenient of times, when
the caller is waiting for a page. If we attempt to reclaim a page being
pinned for active dma [pin_user_pages()], we will incur far greater
latency than a normal anonymous page mapped multiple times. Worse, the
page may be in use indefinitely by the HW and unable to be reclaimed
in a timely manner.

A side effect of the LRU shrinker not being dma aware is that we will
often attempt to perform direct reclaim on the persistent group of dma
pages while continuing to use the dma HW (an issue as the HW may already
be actively waiting for the next user request), and even attempt to
reclaim a partially allocated dma object in order to satisfy pinning
the next user page for that object.

It is to be expected that such pages are made available for reclaim at
the end of the dma operation [unpin_user_pages()], and for truly
longterm pins to be proactively recovered via device specific shrinkers
[i.e. stop the HW, allow the pages to be returned to the system, and
then compete again for the memory].

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
---
This seems perhaps a little devious and overzealous. Is there a more
appropriate TTU flag? Would there be a way to limit its effect to, say,
FOLL_LONGTERM? Doing the migration first would seem to be sensible if
we disable opportunistic migration for the duration of the pin.
---
 mm/rmap.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..374c6e65551b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1393,6 +1393,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
+	/*
+	 * Try and fail early to revoke a costly DMA pinned page.
+	 *
+	 * Reclaiming an active DMA page requires stopping the hardware
+	 * and flushing access. [Hardware that does support pagefaulting,
+	 * and so can quickly revoke DMA pages at any time, does not need
+	 * to pin the DMA page.] At worst, the page may be indefinitely in
+	 * use by the hardware. Even at best it will take far longer to
+	 * revoke the access via the mmu notifier, forcing that latency
+	 * onto our callers rather than the consumer of the HW. As we are
+	 * called during opportunistic direct reclaim, declare the
+	 * opportunity cost too high and ignore the page.
+	 */
+	if (page_maybe_dma_pinned(page))
+		return true;
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_SPLIT_FREEZE, page);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 19:14 [PATCH] mm: Skip opportunistic reclaim for dma pinned pages Chris Wilson
@ 2020-06-24 19:21 ` Jason Gunthorpe
  2020-06-24 20:23   ` Yang Shi
  2020-06-24 20:47   ` John Hubbard
  2020-06-25  7:57 ` Michal Hocko
  2020-06-25 11:42 ` Matthew Wilcox
  2 siblings, 2 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2020-06-24 19:21 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-mm, linux-kernel, intel-gfx, Andrew Morton, Jan Kara,
	Jérôme Glisse, John Hubbard, Claudio Imbrenda,
	Kirill A . Shutemov

On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> A general rule of thumb is that shrinkers should be fast and effective.
> They are called from direct reclaim at the most inconvenient of times, when
> the caller is waiting for a page. If we attempt to reclaim a page being
> pinned for active dma [pin_user_pages()], we will incur far greater
> latency than a normal anonymous page mapped multiple times. Worse, the
> page may be in use indefinitely by the HW and unable to be reclaimed
> in a timely manner.

A pinned page can't be migrated, discarded or swapped by definition -
it would cause data corruption.

So, how do things even get here and/or work today at all? I think the
explanation is missing something important.

Jason


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 19:21 ` Jason Gunthorpe
@ 2020-06-24 20:23   ` Yang Shi
  2020-06-24 21:02     ` Yang Shi
  2020-06-24 20:47   ` John Hubbard
  1 sibling, 1 reply; 14+ messages in thread
From: Yang Shi @ 2020-06-24 20:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chris Wilson, Linux MM, Linux Kernel Mailing List, intel-gfx,
	Andrew Morton, Jan Kara, Jérôme Glisse, John Hubbard,
	Claudio Imbrenda, Kirill A . Shutemov

On Wed, Jun 24, 2020 at 12:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A general rule of thumb is that shrinkers should be fast and effective.
> > They are called from direct reclaim at the most inconvenient of times, when
> > the caller is waiting for a page. If we attempt to reclaim a page being
> > pinned for active dma [pin_user_pages()], we will incur far greater
> > latency than a normal anonymous page mapped multiple times. Worse, the
> > page may be in use indefinitely by the HW and unable to be reclaimed
> > in a timely manner.
>
> A pinned page can't be migrated, discarded or swapped by definition -
> it would cause data corruption.
>
> So, how do things even get here and/or work today at all? I think the
> explanation is missing something important.

__remove_mapping() will try to freeze the page count if the count is
the expected value, and otherwise will simply not discard the page. I'm
not quite sure why the check is done that late; my wild guess is that
the refcount is checked at the last minute so there might be a chance
the pin gets released right before it.

But I noticed a bug in __remove_mapping() for THP, since a THP's dma
pinned count is recorded in the tail page's hpage_pinned_refcount
instead of refcount. So the refcount freeze might be successful for a
pinned THP. Chris's patch could solve this issue too, but I'm not sure
it is worth bailing out earlier once a dma pinned page is met. If it
is, the follow-up question is why not just skip such pages in the scan
phase?

>
> Jason
>


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 19:21 ` Jason Gunthorpe
  2020-06-24 20:23   ` Yang Shi
@ 2020-06-24 20:47   ` John Hubbard
  2020-06-24 23:20     ` Jason Gunthorpe
  1 sibling, 1 reply; 14+ messages in thread
From: John Hubbard @ 2020-06-24 20:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Chris Wilson
  Cc: linux-mm, linux-kernel, intel-gfx, Andrew Morton, Jan Kara,
	Jérôme Glisse, Claudio Imbrenda, Kirill A . Shutemov

On 2020-06-24 12:21, Jason Gunthorpe wrote:
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
>> A general rule of thumb is that shrinkers should be fast and effective.
>> They are called from direct reclaim at the most inconvenient of times, when
>> the caller is waiting for a page. If we attempt to reclaim a page being
>> pinned for active dma [pin_user_pages()], we will incur far greater
>> latency than a normal anonymous page mapped multiple times. Worse, the
>> page may be in use indefinitely by the HW and unable to be reclaimed
>> in a timely manner.
> 
> A pinned page can't be migrated, discarded or swapped by definition -
> it would cause data corruption.
> 
> So, how do things even get here and/or work today at all? I think the
> explanation is missing something important.
> 

Well, those activities generally try to unmap the page, and
have to be prepared to deal with failure to unmap. From my reading,
it seemed very clear.

What's less clear is why the comment and the commit description
only talk about reclaim, when there are additional things that call
try_to_unmap(), including:

     migrate_vma_unmap()
     split_huge_page_to_list() --> unmap_page()

I do like this code change, though. And I *think* it's actually safe to
do this, as it stays away from writeback or other filesystem activity.
But let me double check that, in case I'm forgetting something.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 20:23   ` Yang Shi
@ 2020-06-24 21:02     ` Yang Shi
  0 siblings, 0 replies; 14+ messages in thread
From: Yang Shi @ 2020-06-24 21:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chris Wilson, Linux MM, Linux Kernel Mailing List, intel-gfx,
	Andrew Morton, Jan Kara, Jérôme Glisse, John Hubbard,
	Claudio Imbrenda, Kirill A . Shutemov

On Wed, Jun 24, 2020 at 1:23 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 12:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective.
> > > They are called from direct reclaim at the most inconvenient of times, when
> > > the caller is waiting for a page. If we attempt to reclaim a page being
> > > pinned for active dma [pin_user_pages()], we will incur far greater
> > > latency than a normal anonymous page mapped multiple times. Worse, the
> > > page may be in use indefinitely by the HW and unable to be reclaimed
> > > in a timely manner.
> >
> > A pinned page can't be migrated, discarded or swapped by definition -
> > it would cause data corruption.
> >
> > So, how do things even get here and/or work today at all? I think the
> > explanation is missing something important.
>
> __remove_mapping() will try to freeze the page count if the count is
> the expected value, and otherwise will simply not discard the page. I'm
> not quite sure why the check is done that late; my wild guess is that
> the refcount is checked at the last minute so there might be a chance
> the pin gets released right before it.
>
> But I noticed a bug in __remove_mapping() for THP, since a THP's dma
> pinned count is recorded in the tail page's hpage_pinned_refcount
> instead of refcount. So the refcount freeze might be successful for a
> pinned THP.  Chris's patch could solve this issue too, but I'm not

This bug is not valid. I just realized try_grab_page() would increase
both refcount and hpage_pinned_refcount.

> sure it is worth bailing out earlier once a dma pinned page is met. If
> it is, the follow-up question is why not just skip such pages in the
> scan phase?
>
> >
> > Jason
> >


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 20:47   ` John Hubbard
@ 2020-06-24 23:20     ` Jason Gunthorpe
  2020-06-25  0:11       ` John Hubbard
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2020-06-24 23:20 UTC (permalink / raw)
  To: John Hubbard
  Cc: Chris Wilson, linux-mm, linux-kernel, intel-gfx, Andrew Morton,
	Jan Kara, Jérôme Glisse, Claudio Imbrenda,
	Kirill A . Shutemov

On Wed, Jun 24, 2020 at 01:47:23PM -0700, John Hubbard wrote:
> On 2020-06-24 12:21, Jason Gunthorpe wrote:
> > On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective.
> > > They are called from direct reclaim at the most inconvenient of times, when
> > > the caller is waiting for a page. If we attempt to reclaim a page being
> > > pinned for active dma [pin_user_pages()], we will incur far greater
> > > latency than a normal anonymous page mapped multiple times. Worse, the
> > > page may be in use indefinitely by the HW and unable to be reclaimed
> > > in a timely manner.
> > 
> > A pinned page can't be migrated, discarded or swapped by definition -
> > it would cause data corruption.
> > 
> > So, how do things even get here and/or work today at all? I think the
> > explanation is missing something important.
> > 
> 
> Well, those activities generally try to unmap the page, and
> have to be prepared to deal with failure to unmap. From my reading,
> it seemed very clear.

I think Yang explained it - the page is removed from the mappings but
freeing it does not happen because page_ref_freeze() does not succeed
due to the pin.

Presumably the mappings can reconnect to the same physical page if
it is re-faulted to avoid any data corruption.

So, the issue here is the mappings are trashed while the page remains
- and trashing the mapping triggers a mmu notifier which upsets i915.

> What's less clear is why the comment and the commit description
> only talk about reclaim, when there are additional things that call
> try_to_unmap(), including:
> 
>     migrate_vma_unmap()
>     split_huge_page_to_list() --> unmap_page()

It looks like the same "unmap first, then abort if the refcount is
still elevated" design as shrink_page_list()?

> I do like this code change, though. And I *think* it's actually safe to
> do this, as it stays away from writeback or other filesystem activity.
> But let me double check that, in case I'm forgetting something.

It would be nice to have an explanation of why it is OK now to change
it.

I don't know, but could it be that try_to_unmap() has to be done
before checking the refcount as each mapping is included in the
refcount? ie we couldn't know a DMA pin was active in advance?

Now that we have your pin stuff we can detect a DMA pin without doing
all the unmaps?

Jason


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 23:20     ` Jason Gunthorpe
@ 2020-06-25  0:11       ` John Hubbard
  2020-06-25 11:24         ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: John Hubbard @ 2020-06-25  0:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chris Wilson, linux-mm, linux-kernel, intel-gfx, Andrew Morton,
	Jan Kara, Jérôme Glisse, Claudio Imbrenda,
	Kirill A . Shutemov

On 2020-06-24 16:20, Jason Gunthorpe wrote:
...
> I think Yang explained it - the page is removed from the mappings but
> freeing it does not happen because page_ref_freeze() does not succeed
> due to the pin.
> 
> Presumably the mappings can reconnect to the same physical page if
> it is re-faulted to avoid any data corruption.
> 
> So, the issue here is the mappings are trashed while the page remains
> - and trashing the mapping triggers a mmu notifier which upsets i915.
> 
>> What's less clear is why the comment and the commit description
>> only talk about reclaim, when there are additional things that call
>> try_to_unmap(), including:
>>
>>      migrate_vma_unmap()
>>      split_huge_page_to_list() --> unmap_page()
> 
> It looks like the same "unmap first, then abort if the refcount is
> still elevated" design as shrink_page_list()?


Yes. I was just wondering why the documentation here seems to ignore the
other, non-reclaim cases. Anyway, though...


> 
>> I do like this code change, though. And I *think* it's actually safe to
>> do this, as it stays away from writeback or other filesystem activity.
>> But let me double check that, in case I'm forgetting something.

...OK, I've checked, and I like it a little bit less now. Mainly for
structural reasons, though. I think it would work correctly. But
here's a concern: try_to_unmap() should only fail to unmap if there is a
reason to not unmap. Having a page be pinned for dma is a reason to not
*free* a page, and it's also a reason to be careful about writeback and
page buffers for writeback and such. But I'm not sure that it's a reason
to fail to remove mappings.

True, most (all?) of the reasons we remove mappings are for things
that are not allowed while a page is dma-pinned...at least,
today. But still, there's nothing fundamental about a mapping that
should prevent it from coming or going while a page is undergoing
dma.

So, it's merely a convenient, now-misnamed location in the call stack
to fail out. That's not great. It might be better, as Jason hints at
below, to fail out a little earlier, instead. That would lead to a few
more places calling page_maybe_dma_pinned(), but that's not a real problem,
because it's still a small number of places.

After writing all of that...I don't feel strongly about it, because
TTU is kind of synonymous with "I'm about to do a dma-pin-unfriendly
operation".

Maybe some of the more experienced fs or mm people have strong opinions
one way or the other?


> 
> It would be nice to have an explanation why it is OK now to change
> it..

Yes. Definitely good to explain that in the commit log. I think
it's triggered by the existence of page_maybe_dma_pinned(). Until
that was added, figuring out if dma was involved required basically
just guesswork. Now we have a way to guess much more accurately. :)

> 
> I don't know, but could it be that try_to_unmap() has to be done
> before checking the refcount as each mapping is included in the
> refcount? ie we couldn't know a DMA pin was active in advance?
> 
> Now that we have your pin stuff we can detect a DMA pin without doing
> all the unmaps?
> 

Once something calls pin_user_page*(), then the pages will be marked
as dma-pinned, yes. So no, there is no need to wait until try_to_unmap()
to find out.

A final note: depending on where page_maybe_dma_pinned() ends up
getting called, this might prevent a fair number of the problems that
Jan originally reported [1], and that I also reported separately!

Well, not all of the problems, and only after the filesystems get
converted to call pin_user_pages() (working on that next), but...I think
it would actually avoid the crash our customer reported back in early
2018. Even though we don't have the full file lease + pin_user_pages()
solution in place.

That's because reclaim is what triggers the problems that we saw. And
with this patch, we bail out of reclaim early.


[1] https://www.spinics.net/lists/linux-mm/msg142700.html


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 19:14 [PATCH] mm: Skip opportunistic reclaim for dma pinned pages Chris Wilson
  2020-06-24 19:21 ` Jason Gunthorpe
@ 2020-06-25  7:57 ` Michal Hocko
       [not found]   ` <159308284703.4527.16058577374955415124@build.alporthouse.com>
  2020-06-25 11:42 ` Matthew Wilcox
  2 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2020-06-25  7:57 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-mm, linux-kernel, intel-gfx, Andrew Morton, Jan Kara,
	Jérôme Glisse, John Hubbard, Claudio Imbrenda,
	Kirill A . Shutemov, Jason Gunthorpe

On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> A general rule of thumb is that shrinkers should be fast and effective.
> They are called from direct reclaim at the most inconvenient of times, when
> the caller is waiting for a page. If we attempt to reclaim a page being
> pinned for active dma [pin_user_pages()], we will incur far greater
> latency than a normal anonymous page mapped multiple times. Worse, the
> page may be in use indefinitely by the HW and unable to be reclaimed
> in a timely manner.
> 
> A side effect of the LRU shrinker not being dma aware is that we will
> often attempt to perform direct reclaim on the persistent group of dma
> pages while continuing to use the dma HW (an issue as the HW may already
> be actively waiting for the next user request), and even attempt to
> reclaim a partially allocated dma object in order to satisfy pinning
> the next user page for that object.

You are talking about direct reclaim, but this path is shared with
background reclaim. This is a bit confusing. Maybe you just want to
outline the latency of the reclaim, which is more noticeable to
userspace in direct reclaim. This would be good to clarify.

How much memory are we talking about here btw?

> It is to be expected that such pages are made available for reclaim at
> the end of the dma operation [unpin_user_pages()], and for truly
> longterm pins to be proactively recovered via device specific shrinkers
> [i.e. stop the HW, allow the pages to be returned to the system, and
> then compete again for the memory].

Is the latter implemented?

Btw. the overall intention of the patch is not really clear to me. Do I get
it right that this is going to reduce the latency of reclaim for pages
that are not reclaimable anyway because they are pinned? If yes, do we
have any numbers for that?

It would also be good to explain why the bail-out is implemented in
try_to_unmap() rather than shrink_page_list().

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> ---
> This seems perhaps a little devious and overzealous. Is there a more
> appropriate TTU flag? Would there be a way to limit its effect to say
> FOLL_LONGTERM? Doing the migration first would seem to be sensible if
> we disable opportunistic migration for the duration of the pin.
> ---
>  mm/rmap.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5fe2dedce1fc..374c6e65551b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1393,6 +1393,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	    is_zone_device_page(page) && !is_device_private_page(page))
>  		return true;
>  
> +	/*
> +	 * Try and fail early to revoke a costly DMA pinned page.
> +	 *
> +	 * Reclaiming an active DMA page requires stopping the hardware
> +	 * and flushing access. [Hardware that does support pagefaulting,
> +	 * and so can quickly revoke DMA pages at any time, does not need
> +	 * to pin the DMA page.] At worst, the page may be indefinitely in
> +	 * use by the hardware. Even at best it will take far longer to
> +	 * revoke the access via the mmu notifier, forcing that latency
> +	 * onto our callers rather than the consumer of the HW. As we are
> +	 * called during opportunistic direct reclaim, declare the
> +	 * opportunity cost too high and ignore the page.
> +	 */
> +	if (page_maybe_dma_pinned(page))
> +		return true;

I do not understand why the page table walk needs to be done. The page
is going to be pinned no matter how many page tables are mapping it
right?

> +
>  	if (flags & TTU_SPLIT_HUGE_PMD) {
>  		split_huge_pmd_address(vma, address,
>  				flags & TTU_SPLIT_FREEZE, page);
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-25  0:11       ` John Hubbard
@ 2020-06-25 11:24         ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2020-06-25 11:24 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Chris Wilson, linux-mm, linux-kernel, intel-gfx,
	Andrew Morton, Jan Kara, Jérôme Glisse,
	Claudio Imbrenda, Kirill A . Shutemov

On Wed 24-06-20 17:11:30, John Hubbard wrote:
> On 2020-06-24 16:20, Jason Gunthorpe wrote:
> > > I do like this code change, though. And I *think* it's actually safe to
> > > do this, as it stays away from writeback or other filesystem activity.
> > > But let me double check that, in case I'm forgetting something.
> 
> ...OK, I've checked, and I like it a little bit less now. Mainly for
> structural reasons, though. I think it would work correctly. But
> here's a concern: try_to_unmap() should only fail to unmap if there is a
> reason to not unmap. Having a page be pinned for dma is a reason to not
> *free* a page, and it's also a reason to be careful about writeback and
> page buffers for writeback and such. But I'm not sure that it's a reason
> to fail to remove mappings.
> 
> True, most (all?) of the reasons that we remove mappings, generally are
> for things that are not allowed while a page is dma-pinned...at least,
> today. But still, there's nothing fundamental about a mapping that
> should prevent it from coming or going while a page is undergoing
> dma.
> 
> So, it's merely a convenient, now-misnamed location in the call stack
> to fail out. That's not great. It might be better, as Jason hints at
> below, to fail out a little earlier, instead. That would lead to a few
> more places calling page_maybe_dma_pinned(), but that's not a real problem,
> because it's still a small number of places.

Agreed, I think that shrink_page_list(), as Michal writes, is a better place
for the page_maybe_dma_pinned() check. But other than that I agree it is
good to avoid all the unnecessary work of preparing a page for reclaim when
we can now check there's no hope of reclaiming it (at this time).

> > I don't know, but could it be that try_to_unmap() has to be done
> > before checking the refcount as each mapping is included in the
> > refcount? ie we couldn't know a DMA pin was active in advance?
> > 
> > Now that we have your pin stuff we can detect a DMA pin without doing
> > all the unmaps?
> > 
> 
> Once something calls pin_user_page*(), then the pages will be marked
> as dma-pinned, yes. So no, there is no need to wait until try_to_unmap()
> to find out.

Yeah, I'm pretty sure the page ref check happens so late because earlier it
was impossible to tell whether there are "external" page references reclaim
cannot get rid of - a filesystem may (but need not!) hold a page reference
for its private data, page tables will hold references as well, etc. Now
with page_maybe_dma_pinned() we can detect at least one class of external
references (i.e. pins) early, so it's stupid not to do that.

> A final note: depending on where page_maybe_dma_pinned() ends up
> getting called, this might prevent a fair number of the problems that
> Jan originally reported [1], and that I also reported separately!
> 
> Well, not all of the problems, and only after the filesystems get
> converted to call pin_user_pages() (working on that next), but...I think
> it would actually avoid the crash our customer reported back in early
> 2018. Even though we don't have the full file lease + pin_user_pages()
> solution in place.
> 
> That's because reclaim is what triggers the problems that we saw. And
> with this patch, we bail out of reclaim early.

I agree that with this change, some races will become much less likely for
some usecases. But as you say, it's not a full solution.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-24 19:14 [PATCH] mm: Skip opportunistic reclaim for dma pinned pages Chris Wilson
  2020-06-24 19:21 ` Jason Gunthorpe
  2020-06-25  7:57 ` Michal Hocko
@ 2020-06-25 11:42 ` Matthew Wilcox
  2020-06-25 13:40   ` Jan Kara
  2020-06-25 16:32   ` Yang Shi
  2 siblings, 2 replies; 14+ messages in thread
From: Matthew Wilcox @ 2020-06-25 11:42 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-mm, linux-kernel, intel-gfx, Andrew Morton, Jan Kara,
	Jérôme Glisse, John Hubbard, Claudio Imbrenda,
	Kirill A . Shutemov, Jason Gunthorpe

On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> A side effect of the LRU shrinker not being dma aware is that we will
> often attempt to perform direct reclaim on the persistent group of dma
> pages while continuing to use the dma HW (an issue as the HW may already
> be actively waiting for the next user request), and even attempt to
> reclaim a partially allocated dma object in order to satisfy pinning
> the next user page for that object.
> 
> It is to be expected that such pages are made available for reclaim at
> the end of the dma operation [unpin_user_pages()], and for truly
> longterm pins to be proactively recovered via device specific shrinkers
> [i.e. stop the HW, allow the pages to be returned to the system, and
> then compete again for the memory].

Why are DMA pinned pages still on the LRU list at all?  I never got an
answer to this that made sense to me.  By definition, a page which is
pinned for DMA is being accessed, and needs to at the very least change
position on the LRU list, so just take it off the list when DMA-pinned
and put it back on the list when DMA-unpinned.

This overly complex lease stuff must have some reason for existing, but
I still don't get it.


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-25 11:42 ` Matthew Wilcox
@ 2020-06-25 13:40   ` Jan Kara
  2020-06-25 16:05     ` Matthew Wilcox
  2020-06-25 16:32   ` Yang Shi
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2020-06-25 13:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris Wilson, linux-mm, linux-kernel, intel-gfx, Andrew Morton,
	Jan Kara, Jérôme Glisse, John Hubbard,
	Claudio Imbrenda, Kirill A . Shutemov, Jason Gunthorpe

On Thu 25-06-20 12:42:09, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A side effect of the LRU shrinker not being dma aware is that we will
> > often attempt to perform direct reclaim on the persistent group of dma
> > pages while continuing to use the dma HW (an issue as the HW may already
> > be actively waiting for the next user request), and even attempt to
> > reclaim a partially allocated dma object in order to satisfy pinning
> > the next user page for that object.
> > 
> > It is to be expected that such pages are made available for reclaim at
> > the end of the dma operation [unpin_user_pages()], and for truly
> > longterm pins to be proactively recovered via device specific shrinkers
> > [i.e. stop the HW, allow the pages to be returned to the system, and
> > then compete again for the memory].
> 
> Why are DMA pinned pages still on the LRU list at all?  I never got an
> answer to this that made sense to me.  By definition, a page which is
> pinned for DMA is being accessed, and needs to at the very least change
> position on the LRU list, so just take it off the list when DMA-pinned
> and put it back on the list when DMA-unpinned.

Well, we do mark_page_accessed() when pinning in GUP. This is not perfect
but it's as good as it gets, with the CPU having no control over when the
page is actually accessed. Also, taking the page off and then back onto the
LRU list would increase contention on the LRU list locks and generally cost
performance, so for short term pins it is not desirable... Otherwise I agree
that conceptually it would make some sense, although I'm not sure some
places wouldn't get confused by e.g. page cache pages not being on the LRU
list.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
       [not found]   ` <159308284703.4527.16058577374955415124@build.alporthouse.com>
@ 2020-06-25 15:12     ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2020-06-25 15:12 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-mm, linux-kernel, intel-gfx, Andrew Morton, Jan Kara,
	Jérôme Glisse, John Hubbard, Claudio Imbrenda,
	Kirill A . Shutemov, Jason Gunthorpe

On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> Quoting Michal Hocko (2020-06-25 08:57:25)
> > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > A general rule of thumb is that shrinkers should be fast and effective.
> > > They are called from direct reclaim at the most inconvenient of times when
> > > the caller is waiting for a page. If we attempt to reclaim a page being
> > > pinned for active dma [pin_user_pages()], we will incur far greater
> > > latency than a normal anonymous page mapped multiple times. Worse, the
> > > page may be in use indefinitely by the HW and unable to be reclaimed
> > > in a timely manner.
> > > 
> > > A side effect of the LRU shrinker not being dma aware is that we will
> > > often attempt to perform direct reclaim on the persistent group of dma
> > > pages while continuing to use the dma HW (an issue as the HW may already
> > > be actively waiting for the next user request), and even attempt to
> > > reclaim a partially allocated dma object in order to satisfy pinning
> > > the next user page for that object.
> > 
> > You are talking about direct reclaim but this path is shared with the
> > background reclaim. This is a bit confusing. Maybe you just want to
> > outline the latency in the reclaim which is more noticeable in the
> > direct reclaim to the userspace. This would be good to be clarified.
> > 
> > How much memory are we talking about here btw?
> 
> It depends. In theory, it is used sparingly. But it is under userspace
> control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If
> all goes to plan the application memory is only pinned for as long as the
> HW is using it, but that is an indefinite period of time and an indefinite
> amount of memory. There are provisions in place to impose upper limits
> on how long an operation can last on the HW, and the mmu-notifier is
> there to ensure we do unpin the memory on demand. However, cancelling a
> HW operation (which will result in data loss and often process
> termination due to an unfortunate sequence of events when userspace
> fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not
> a good choice.

OK, thanks for the clarification. When and how should the MM intervene to
prevent a potential OOM?
  
[...]
> > Btw. overall intention of the patch is not really clear to me. Do I get
> > it right that this is going to reduce latency of the reclaim for pages
> > that are not reclaimable anyway because they are pinned? If yes do we
> > have any numbers for that.
> 
> I can plug it into a microbenchmark a la cyclictest to show the impact...
> Memory filled with 64M gup objects, random utilisation of those with
> the GPU; background process filling the pagecache with find /; reporting
> the difference between a timer's expected expiry time and the actual:
> [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each
> measuring mean latency for 20s -- mean is probably a really bad choice
> here, we need 50/90/95/99]
> 
> direct reclaim calling mmu-notifier:
> gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
> 
> skipping try_to_unmap_one with page_maybe_dma_pinned:
> gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
> 
> Baseline (background find /; application touched all memory, but no HW
> ops)
> gem_syslatency: cycles=0, latency mean=6.695us max=77us
> 
> Compare with the time to allocate a single THP against load:
> 
> Baseline:
> gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> Direct reclaim calling mmu-notifier:
> gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> page_maybe_dma_pinned skip:
> gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
> 
> Take with a massive pinch of salt. I expect, once I find the right
> sequence, to reliably control the induced latency on the RT thread.
> 
> But first, I have to look at why there's a correlation between HW load and
> timer latency, even with steady-state usage. That's quite surprising --
> ah, I had it left at PREEMPT_VOLUNTARY and this machine has to scan
> every request submitted to HW. Just great.
> 
> With PREEMPT:
> Timer:
> Base:    gem_syslatency: cycles=0, latency mean=8.823us max=83us
> Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us max=4805us
> Skip:    gem_syslatency: cycles=2677, latency mean=70.306us max=4720us
> 
> THP:
> Base:    gem_syslatency: cycles=0, latency mean=1993.693us max=201958us
> Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
> 
> Earlier caveats notwithstanding, confidence in the results is still low.
> 
> I also need to refine the testing somewhat, or at the very least gather
> enough samples for credible statistics.

OK, so my understanding is that the overall impact is very low. So what
is the primary motivation for the patch? Preventing pointless work, i.e.
avoiding the notifier invocation?

-- 
Michal Hocko
SUSE Labs
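[An aside on the measurement question raised above: the 50/90/95/99
percentiles Chris mentions are straightforward to report from raw latency
samples. A minimal sketch using the nearest-rank method follows; function
names are illustrative, not from any particular tool.]

```c
#include <stdlib.h>

/* qsort comparator for latency samples in microseconds */
static int cmp_us(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;
	return (x > y) - (x < y);
}

/* Nearest-rank percentile: pct in (0, 100]; sorts samples in place. */
static long percentile_us(long *samples, size_t n, double pct)
{
	size_t rank;

	qsort(samples, n, sizeof(*samples), cmp_us);
	rank = (size_t)(pct / 100.0 * n + 0.5);
	if (rank == 0)
		rank = 1;
	if (rank > n)
		rank = n;
	return samples[rank - 1];
}
```

[Unlike the mean, the p95/p99 figures are not dragged around by a handful of
multi-hundred-millisecond outliers, which is exactly the failure mode in the
THP numbers quoted above.]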


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-25 13:40   ` Jan Kara
@ 2020-06-25 16:05     ` Matthew Wilcox
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2020-06-25 16:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Chris Wilson, linux-mm, linux-kernel, intel-gfx, Andrew Morton,
	Jérôme Glisse, John Hubbard, Claudio Imbrenda,
	Kirill A . Shutemov, Jason Gunthorpe

On Thu, Jun 25, 2020 at 03:40:44PM +0200, Jan Kara wrote:
> On Thu 25-06-20 12:42:09, Matthew Wilcox wrote:
> > Why are DMA pinned pages still on the LRU list at all?  I never got an
> > answer to this that made sense to me.  By definition, a page which is
> > pinned for DMA is being accessed, and needs to at the very least change
> > position on the LRU list, so just take it off the list when DMA-pinned
> > and put it back on the list when DMA-unpinned.
> 
> Well, we do mark_page_accessed() when pinning in GUP. This is not perfect
> but it's as good as it gets with the CPU having no control over when the page
> is actually accessed. Also, taking the page off and then back onto the LRU
> list would increase contention on the LRU list locks and generally cost
> performance, so for short-term pins it is not desirable... Otherwise I agree
> that conceptually it would make some sense, although I'm not sure some
> places wouldn't get confused by e.g. page cache pages not being on the LRU
> list.

We could/should do what we do for mlocked pages:
Documentation/vm/unevictable-lru.rst

I think 'case five' is wrong and needs to be removed.  Pinning is
inappropriate for "I'm going to modify the page myself".
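[The mlock precedent referenced above works by moving pages to a separate
unevictable LRU that reclaim never scans. A toy model of applying the same
bookkeeping to pin/unpin follows; the list names echo the kernel's, but the
transitions are an assumption about how a pin-aware variant could behave,
not existing kernel code.]

```c
#include <stdbool.h>

enum lru_list { LRU_EVICTABLE, LRU_UNEVICTABLE };

struct page_model {
	int pin_count;
	enum lru_list lru;
};

/* On the first pin, move the page to the unevictable list so the
 * shrinker never wastes a scan on it (mirroring mlock handling). */
static void model_pin(struct page_model *p)
{
	if (p->pin_count++ == 0)
		p->lru = LRU_UNEVICTABLE;
}

/* On the last unpin, rescue the page back to the evictable list,
 * analogous to what munlock does for mlocked pages. */
static void model_unpin(struct page_model *p)
{
	if (--p->pin_count == 0)
		p->lru = LRU_EVICTABLE;
}

/* Reclaim only ever scans the evictable list. */
static bool model_reclaimable(const struct page_model *p)
{
	return p->lru == LRU_EVICTABLE;
}
```

[The cost Jan raises is visible even in this sketch: every first-pin and
last-unpin becomes an LRU move, i.e. contended lock traffic for short pins.]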


* Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
  2020-06-25 11:42 ` Matthew Wilcox
  2020-06-25 13:40   ` Jan Kara
@ 2020-06-25 16:32   ` Yang Shi
  1 sibling, 0 replies; 14+ messages in thread
From: Yang Shi @ 2020-06-25 16:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris Wilson, Linux MM, Linux Kernel Mailing List, intel-gfx,
	Andrew Morton, Jan Kara, Jérôme Glisse, John Hubbard,
	Claudio Imbrenda, Kirill A . Shutemov, Jason Gunthorpe

On Thu, Jun 25, 2020 at 4:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jun 24, 2020 at 08:14:17PM +0100, Chris Wilson wrote:
> > A side effect of the LRU shrinker not being dma aware is that we will
> > often attempt to perform direct reclaim on the persistent group of dma
> > pages while continuing to use the dma HW (an issue as the HW may already
> > be actively waiting for the next user request), and even attempt to
> > reclaim a partially allocated dma object in order to satisfy pinning
> > the next user page for that object.
> >
> > It is to be expected that such pages are made available for reclaim at
> > the end of the dma operation [unpin_user_pages()], and for truly
> > longterm pins to be proactively recovered via device specific shrinkers
> > [i.e. stop the HW, allow the pages to be returned to the system, and
> > then compete again for the memory].
>
> Why are DMA pinned pages still on the LRU list at all?  I never got an
> answer to this that made sense to me.  By definition, a page which is
> pinned for DMA is being accessed, and needs to at the very least change
> position on the LRU list, so just take it off the list when DMA-pinned
> and put it back on the list when DMA-unpinned.

Sounds reasonable to me. In an earlier email I suggested skipping
isolation of dma-pinned pages in the scan phase, but if they are
long-term pinned, it seems preferable to put them on the unevictable
lru IMHO.

>
> This overly complex lease stuff must have some reason for existing, but
> I still don't get it.
>

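[Yang Shi's alternative, skipping dma-pinned pages during the LRU scan rather
than failing later in try_to_unmap, can be sketched as a filter in the scan
loop. This is a userspace model; the kernel equivalent would sit around
isolate_lru_pages()/shrink_page_list(), and the names here are illustrative.]

```c
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	bool maybe_dma_pinned;	/* stand-in for page_maybe_dma_pinned() */
	bool reclaimed;
};

/* Walk a batch of LRU candidates, skipping anything that looks pinned
 * before we pay for rmap walks or mmu-notifier invalidations. */
static size_t model_shrink_list(struct page_model *pages, size_t n)
{
	size_t nr_reclaimed = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].maybe_dma_pinned)
			continue;	/* left for unpin_user_pages() to release */
		pages[i].reclaimed = true;
		nr_reclaimed++;
	}
	return nr_reclaimed;
}
```

[The skip keeps the latency benefit of the patch while leaving the pinned
pages on the LRU, sidestepping the lock-contention objection to moving them.]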

end of thread, other threads:[~2020-06-25 16:32 UTC | newest]

Thread overview: 14+ messages
2020-06-24 19:14 [PATCH] mm: Skip opportunistic reclaim for dma pinned pages Chris Wilson
2020-06-24 19:21 ` Jason Gunthorpe
2020-06-24 20:23   ` Yang Shi
2020-06-24 21:02     ` Yang Shi
2020-06-24 20:47   ` John Hubbard
2020-06-24 23:20     ` Jason Gunthorpe
2020-06-25  0:11       ` John Hubbard
2020-06-25 11:24         ` Jan Kara
2020-06-25  7:57 ` Michal Hocko
     [not found]   ` <159308284703.4527.16058577374955415124@build.alporthouse.com>
2020-06-25 15:12     ` Michal Hocko
2020-06-25 11:42 ` Matthew Wilcox
2020-06-25 13:40   ` Jan Kara
2020-06-25 16:05     ` Matthew Wilcox
2020-06-25 16:32   ` Yang Shi
