linux-rdma.vger.kernel.org archive mirror
* [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: John Hubbard @ 2024-05-01  0:31 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, LKML, linux-rdma, linux-mm, John Hubbard,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Alistair Popple, Pak Markthub

pin_user_pages*() occasionally fails due to migrate_pages() failures
that, in turn, are due to temporarily elevated folio refcounts. This
happens because a few years ago, the pin_user_pages*() APIs were
upgraded to automatically migrate pages away from ZONE_MOVABLE, but the
callers were not upgraded to handle any migration failures. And in fact,
they can't easily do so anyway, because the migration return code is
filtered out: -EAGAIN failures from migration are squashed, along with
any other failure, into -ENOMEM, thus hiding the details from the
upper-layer callers.

One failure case that we ran into recently looks like this:

pin_user_pages_fast()
  internal_get_user_pages_fast()
    __gup_longterm_locked()
      check_and_migrate_movable_pages()
        migrate_longterm_unpinnable_pages()
          migrate_pages()
            migrate_pages_batch(..., NR_MAX_MIGRATE_PAGES_RETRY==10)
              migrate_folio_move()
                move_to_new_folio()
                  migrate_folio()
                    migrate_folio_extra()
                      folio_migrate_mapping()
                        if (folio_ref_count(folio) != expected_count)
                          return -EAGAIN;
                            // ...and migrate_longterm_unpinnable_pages()
                            // translates -EAGAIN to -ENOMEM

Although I have not yet pinpointed the cause of these transient
refcount increases, they are sufficiently common (and expected by the
overall design) that I think we have enough information to proceed
directly to a fix. This patch shows my preferred solution, which does
the following:

a) Restore the -EAGAIN return code: pass it unchanged all the way back
to pin_user_pages*() callers.

b) Then, retry pin_user_pages_fast() from ib_umem_get(), because that IB
driver is hitting real failures in the field, and we've confirmed that a
retry at this location will fix those failures. Retrying at this
higher level (as compared to the pre-existing, low-level retry in
migrate_pages_batch()) allows more things to happen, and more time to
elapse, between retries.

Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Artemy Kovalyov <artemyko@nvidia.com>
Cc: Michael Guralnik <michaelgur@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Pak Markthub <pmarkthub@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c | 33 ++++++++++++++++++++++++++-------
 mm/gup.c                       | 14 +++++++++++---
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..7c691faacc8a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -210,15 +210,34 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 
 	while (npages) {
 		cond_resched();
-		pinned = pin_user_pages_fast(cur_base,
-					  min_t(unsigned long, npages,
-						PAGE_SIZE /
-						sizeof(struct page *)),
-					  gup_flags, page_list);
-		if (pinned < 0) {
+		pinned = -ENOMEM;
+		int attempts = 0;
+		/*
+		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
+		 * to gup-slow and then failing to migrate pages out of
+		 * ZONE_MOVABLE due to a transient elevated page refcount.
+		 *
+		 * One retry is enough to avoid this problem, so far, but let's
+		 * use a slightly higher retry count just in case even larger
+		 * systems have a longer-lasting transient refcount problem.
+		 *
+		 */
+		static const int MAX_ATTEMPTS = 3;
+
+		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
+			pinned = pin_user_pages_fast(cur_base,
+						     min_t(unsigned long,
+							npages, PAGE_SIZE /
+							sizeof(struct page *)),
+						     gup_flags, page_list);
 			ret = pinned;
-			goto umem_release;
+			attempts++;
+
+			if (pinned == -EAGAIN)
+				continue;
 		}
+		if (pinned < 0)
+			goto umem_release;
 
 		cur_base += pinned * PAGE_SIZE;
 		npages -= pinned;
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..edb069f937cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2141,15 +2141,23 @@ static int migrate_longterm_unpinnable_pages(
 	}
 
 	if (!list_empty(movable_page_list)) {
+		int rc;
 		struct migration_target_control mtc = {
 			.nid = NUMA_NO_NODE,
 			.gfp_mask = GFP_USER | __GFP_NOWARN,
 		};
 
-		if (migrate_pages(movable_page_list, alloc_migration_target,
+		rc = migrate_pages(movable_page_list, alloc_migration_target,
 				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				  MR_LONGTERM_PIN, NULL)) {
-			ret = -ENOMEM;
+				  MR_LONGTERM_PIN, NULL);
+		if (rc) {
+			/*
+			 * Report any failure *except* -EAGAIN as "not enough
+			 * memory". -EAGAIN is valuable because callers further
+			 * up the call stack can benefit from a retry.
+			 */
+			if (rc != -EAGAIN)
+				ret = -ENOMEM;
 			goto err;
 		}
 	}

base-commit: 18daea77cca626f590fb140fc11e3a43c5d41354
-- 
2.45.0



* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: Christoph Hellwig @ 2024-05-01  5:10 UTC
  To: John Hubbard
  Cc: Andrew Morton, Jason Gunthorpe, LKML, linux-rdma, linux-mm,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Alistair Popple, Pak Markthub

> +		pinned = -ENOMEM;
> +		int attempts = 0;
> +		/*
> +		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
> +		 * to gup-slow and then failing to migrate pages out of
> +		 * ZONE_MOVABLE due to a transient elevated page refcount.
> +		 *
> +		 * One retry is enough to avoid this problem, so far, but let's
> +		 * use a slightly higher retry count just in case even larger
> +		 * systems have a longer-lasting transient refcount problem.
> +		 *
> +		 */
> +		static const int MAX_ATTEMPTS = 3;
> +
> +		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
> +			pinned = pin_user_pages_fast(cur_base,
> +						     min_t(unsigned long,
> +							npages, PAGE_SIZE /
> +							sizeof(struct page *)),
> +						     gup_flags, page_list);
>  			ret = pinned;
> -			goto umem_release;
> +			attempts++;
> +
> +			if (pinned == -EAGAIN)
> +				continue;
>  		}
> +		if (pinned < 0)
> +			goto umem_release;

This doesn't make sense.  IFF a blind retry is all that is needed it
should be done in the core functionality.  I fear it's not that easy,
though.



* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: Jason Gunthorpe @ 2024-05-01 12:10 UTC
  To: Christoph Hellwig
  Cc: John Hubbard, Andrew Morton, LKML, linux-rdma, linux-mm,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Alistair Popple, Pak Markthub

On Tue, Apr 30, 2024 at 10:10:43PM -0700, Christoph Hellwig wrote:
> > +		pinned = -ENOMEM;
> > +		int attempts = 0;
> > +		/*
> > +		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
> > +		 * to gup-slow and then failing to migrate pages out of
> > +		 * ZONE_MOVABLE due to a transient elevated page refcount.
> > +		 *
> > +		 * One retry is enough to avoid this problem, so far, but let's
> > +		 * use a slightly higher retry count just in case even larger
> > +		 * systems have a longer-lasting transient refcount problem.
> > +		 *
> > +		 */
> > +		static const int MAX_ATTEMPTS = 3;
> > +
> > +		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
> > +			pinned = pin_user_pages_fast(cur_base,
> > +						     min_t(unsigned long,
> > +							npages, PAGE_SIZE /
> > +							sizeof(struct page *)),
> > +						     gup_flags, page_list);
> >  			ret = pinned;
> > -			goto umem_release;
> > +			attempts++;
> > +
> > +			if (pinned == -EAGAIN)
> > +				continue;
> >  		}
> > +		if (pinned < 0)
> > +			goto umem_release;
> 
> This doesn't make sense.  IFF a blind retry is all that is needed it
> should be done in the core functionality.  I fear it's not that easy,
> though.

+1

This migration retry weirdness is a GUP issue; it needs to be solved
in the mm, not exposed to every pin_user_pages caller.

If it turns out ZONE_MOVABLE pages can't actually be reliably moved
then it is pretty broken..

Jason


* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: John Hubbard @ 2024-05-01 17:32 UTC
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Andrew Morton, LKML, linux-rdma, linux-mm, Mike Marciniszyn,
	Leon Romanovsky, Artemy Kovalyov, Michael Guralnik,
	Alistair Popple, Pak Markthub

On 5/1/24 5:10 AM, Jason Gunthorpe wrote:
> On Tue, Apr 30, 2024 at 10:10:43PM -0700, Christoph Hellwig wrote:
...
>> This doesn't make sense.  IFF a blind retry is all that is needed it
>> should be done in the core functionality.  I fear it's not that easy,
>> though.

So do I. :)

> 
> +1
> 
> This migration retry weirdness is a GUP issue; it needs to be solved
> in the mm, not exposed to every pin_user_pages caller.
> 
> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
> then it is pretty broken..
> 

OK, I'll work on finding out what is temporarily elevating the refcount
and preventing the migration. And see where that leads.


thanks,
-- 
John Hubbard
NVIDIA



* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: Alistair Popple @ 2024-05-02  1:05 UTC
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, LKML, linux-rdma,
	linux-mm, Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Pak Markthub


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Apr 30, 2024 at 10:10:43PM -0700, Christoph Hellwig wrote:
>> > +		pinned = -ENOMEM;
>> > +		int attempts = 0;
>> > +		/*
>> > +		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
>> > +		 * to gup-slow and then failing to migrate pages out of
>> > +		 * ZONE_MOVABLE due to a transient elevated page refcount.
>> > +		 *
>> > +		 * One retry is enough to avoid this problem, so far, but let's
>> > +		 * use a slightly higher retry count just in case even larger
>> > +		 * systems have a longer-lasting transient refcount problem.
>> > +		 *
>> > +		 */
>> > +		static const int MAX_ATTEMPTS = 3;
>> > +
>> > +		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
>> > +			pinned = pin_user_pages_fast(cur_base,
>> > +						     min_t(unsigned long,
>> > +							npages, PAGE_SIZE /
>> > +							sizeof(struct page *)),
>> > +						     gup_flags, page_list);
>> >  			ret = pinned;
>> > -			goto umem_release;
>> > +			attempts++;
>> > +
>> > +			if (pinned == -EAGAIN)
>> > +				continue;
>> >  		}
>> > +		if (pinned < 0)
>> > +			goto umem_release;
>> 
>> This doesn't make sense.  IFF a blind retry is all that is needed it
>> should be done in the core functionality.  I fear it's not that easy,
>> though.
>
> +1
>
> This migration retry weirdness is a GUP issue; it needs to be solved
> in the mm, not exposed to every pin_user_pages caller.
>
> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
> then it is pretty broken..

I wonder if we should remove the arbitrary retry limit in
migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
they migrate? By definition there should only be transient references on
these pages so why do we need to limit the number of retries in the
first place?

 - Alistair

> Jason



* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: John Hubbard @ 2024-05-02  6:49 UTC
  To: Alistair Popple, Jason Gunthorpe
  Cc: Christoph Hellwig, Andrew Morton, LKML, linux-rdma, linux-mm,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Pak Markthub

On 5/1/24 6:05 PM, Alistair Popple wrote:
> Jason Gunthorpe <jgg@nvidia.com> writes:
>> On Tue, Apr 30, 2024 at 10:10:43PM -0700, Christoph Hellwig wrote:
...
>>> This doesn't make sense.  IFF a blind retry is all that is needed it
>>> should be done in the core functionality.  I fear it's not that easy,
>>> though.
>>
>> +1
>>
>> This migration retry weirdness is a GUP issue; it needs to be solved
>> in the mm, not exposed to every pin_user_pages caller.
>>
>> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
>> then it is pretty broken..
> 
> I wonder if we should remove the arbitrary retry limit in
> migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
> they migrate? By definition there should only be transient references on
> these pages so why do we need to limit the number of retries in the
> first place?
> 

Well, along those lines, I can confirm that this patch also fixes the
symptoms:

diff --git a/mm/migrate.c b/mm/migrate.c
index 73a052a382f1..faa67cc441e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1728,7 +1728,9 @@ static int migrate_pages_batch(struct list_head *from,
  				else
  					goto move;
  			case -EAGAIN:
-				retry++;
+				/* For ZONE_MOVABLE folios, retry forever */
+				if (!folio_is_zone_movable(folio))
+					retry++;
  				thp_retry += is_thp;
  				nr_retry_pages += nr_pages;
  				break;
@@ -1786,7 +1788,9 @@ static int migrate_pages_batch(struct list_head *from,
  			 */
  			switch(rc) {
  			case -EAGAIN:
-				retry++;
+				/* For ZONE_MOVABLE folios, retry forever */
+				if (!folio_is_zone_movable(folio))
+					retry++;
  				thp_retry += is_thp;
  				nr_retry_pages += nr_pages;
  				break;

thanks,
-- 
John Hubbard
NVIDIA



* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: David Hildenbrand @ 2024-05-02  6:56 UTC
  To: Alistair Popple, Jason Gunthorpe
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, LKML, linux-rdma,
	linux-mm, Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Pak Markthub

On 02.05.24 03:05, Alistair Popple wrote:
> 
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
>> On Tue, Apr 30, 2024 at 10:10:43PM -0700, Christoph Hellwig wrote:
>>>> +		pinned = -ENOMEM;
>>>> +		int attempts = 0;
>>>> +		/*
>>>> +		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
>>>> +		 * to gup-slow and then failing to migrate pages out of
>>>> +		 * ZONE_MOVABLE due to a transient elevated page refcount.
>>>> +		 *
>>>> +		 * One retry is enough to avoid this problem, so far, but let's
>>>> +		 * use a slightly higher retry count just in case even larger
>>>> +		 * systems have a longer-lasting transient refcount problem.
>>>> +		 *
>>>> +		 */
>>>> +		static const int MAX_ATTEMPTS = 3;
>>>> +
>>>> +		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
>>>> +			pinned = pin_user_pages_fast(cur_base,
>>>> +						     min_t(unsigned long,
>>>> +							npages, PAGE_SIZE /
>>>> +							sizeof(struct page *)),
>>>> +						     gup_flags, page_list);
>>>>   			ret = pinned;
>>>> -			goto umem_release;
>>>> +			attempts++;
>>>> +
>>>> +			if (pinned == -EAGAIN)
>>>> +				continue;
>>>>   		}
>>>> +		if (pinned < 0)
>>>> +			goto umem_release;
>>>
>>> This doesn't make sense.  IFF a blind retry is all that is needed it
>>> should be done in the core functionality.  I fear it's not that easy,
>>> though.
>>
>> +1
>>
>> This migration retry weirdness is a GUP issue; it needs to be solved
>> in the mm, not exposed to every pin_user_pages caller.
>>
>> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
>> then it is pretty broken..
> 
> I wonder if we should remove the arbitrary retry limit in
> migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
> they migrate? By definition there should only be transient references on
> these pages so why do we need to limit the number of retries in the
> first place?

There are some weird things that still need fixing: vmsplice() is the
example that doesn't use FOLL_LONGTERM.

-- 
Cheers,

David / dhildenb



* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: John Hubbard @ 2024-05-02 18:10 UTC
  To: David Hildenbrand, Alistair Popple, Jason Gunthorpe
  Cc: Christoph Hellwig, Andrew Morton, LKML, linux-rdma, linux-mm,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Pak Markthub

On 5/1/24 11:56 PM, David Hildenbrand wrote:
> On 02.05.24 03:05, Alistair Popple wrote:
>> Jason Gunthorpe <jgg@nvidia.com> writes:
...
>>>> This doesn't make sense.  IFF a blind retry is all that is needed it
>>>> should be done in the core functionality.  I fear it's not that easy,
>>>> though.
>>>
>>> +1
>>>
>>> This migration retry weirdness is a GUP issue; it needs to be solved
>>> in the mm, not exposed to every pin_user_pages caller.
>>>
>>> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
>>> then it is pretty broken..
>>
>> I wonder if we should remove the arbitrary retry limit in
>> migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
>> they migrate? By definition there should only be transient references on
>> these pages so why do we need to limit the number of retries in the
>> first place?
> 
> There are some weird things that still need fixing: vmsplice() is the
> example that doesn't use FOLL_LONGTERM.
> 

Hi David!

Do you have any other call sites in mind? It sounds like one way forward
is to fix each call site...

This is an unhappy story right now: from a user's point of view, the
pin_user_pages*() APIs are significantly worse than before the "migrate
pages away automatically" upgrade, because now the APIs fail
intermittently even for callers who follow the rules--basically,
pin_user_pages() is not fully working yet.

Other ideas, large or small, about how to approach a fix?

thanks,

-- 
John Hubbard
NVIDIA


* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: Jason Gunthorpe @ 2024-05-02 18:34 UTC
  To: John Hubbard
  Cc: David Hildenbrand, Alistair Popple, Christoph Hellwig,
	Andrew Morton, LKML, linux-rdma, linux-mm, Mike Marciniszyn,
	Leon Romanovsky, Artemy Kovalyov, Michael Guralnik, Pak Markthub

On Thu, May 02, 2024 at 11:10:05AM -0700, John Hubbard wrote:
> On 5/1/24 11:56 PM, David Hildenbrand wrote:
> > On 02.05.24 03:05, Alistair Popple wrote:
> > > Jason Gunthorpe <jgg@nvidia.com> writes:
> ...
> > > > > This doesn't make sense.  IFF a blind retry is all that is needed it
> > > > > should be done in the core functionality.  I fear it's not that easy,
> > > > > though.
> > > > 
> > > > +1
> > > > 
> > > > This migration retry weirdness is a GUP issue; it needs to be solved
> > > > in the mm, not exposed to every pin_user_pages caller.
> > > > 
> > > > If it turns out ZONE_MOVABLE pages can't actually be reliably moved
> > > > then it is pretty broken..
> > > 
> > > I wonder if we should remove the arbitrary retry limit in
> > > migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
> > > they migrate? By definition there should only be transient references on
> > > these pages so why do we need to limit the number of retries in the
> > > first place?
> > 
> > There are some weird things that still need fixing: vmsplice() is the
> > example that doesn't use FOLL_LONGTERM.
> > 
> 
> Hi David!
> 
> Do you have any other call sites in mind? It sounds like one way forward
> is to fix each call site...
> 
> This is an unhappy story right now: from a user's point of view, the
> pin_user_pages*() APIs are significantly worse than before the "migrate
> pages away automatically" upgrade, because now the APIs fail
> intermittently even for callers who follow the rules--basically,
> pin_user_pages() is not fully working yet.
>
> Other ideas, large or small, about how to approach a fix?

IMHO pin_user_pages() should sleep and spin in an interruptible sleep
until we get all the migrations done. Not sure how hard it would be to
add some kind of proper waiting event sleep?

If userspace has got itself into knots then pin_user_pages() will
block in the kernel and ctrl-c will rescue it.

Even the temporary pins for something like O_DIRECT are long enough
that we wouldn't want to just spin the CPU.
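
To sketch what I mean (not a real patch, just the shape: roughly the
existing -EAGAIN loop in mm/gup.c's __gup_longterm_locked(), with a
made-up fatal-signal check and a made-up 10ms backoff added for
illustration):

	flags = memalloc_pin_save();
	do {
		nr_pinned_pages = __get_user_pages_locked(mm, start, nr_pages,
							  pages, locked,
							  gup_flags);
		if (nr_pinned_pages <= 0) {
			rc = nr_pinned_pages;
			break;
		}
		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
		if (rc == -EAGAIN) {
			/* Only a fatal signal (e.g. ctrl-c) breaks out. */
			if (fatal_signal_pending(current)) {
				rc = -EINTR;
				break;
			}
			/*
			 * Crude backoff; a proper wait event would be
			 * better than polling like this.
			 */
			msleep(10);
		}
	} while (rc == -EAGAIN);
	memalloc_pin_restore(flags);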

Jason


* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: Matthew Wilcox @ 2024-05-02 18:44 UTC
  To: Jason Gunthorpe
  Cc: John Hubbard, David Hildenbrand, Alistair Popple,
	Christoph Hellwig, Andrew Morton, LKML, linux-rdma, linux-mm,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Pak Markthub

On Thu, May 02, 2024 at 03:34:08PM -0300, Jason Gunthorpe wrote:
> IMHO pin_user_pages() should sleep and spin in an interruptible sleep

killable, not interruptible.  Otherwise SIGWINCH and SIGALRM can
result in an early return.

> until we get all the migrations done. Not sure how hard it would be to
> add some kind of proper waiting event sleep?

ummmmm.  We have a "has waiters" bit in the folio.  So on every call to
folio_put(), we could check that bit and wake up any waiters.  I need to
think about that; right now, we only use it for unlock and end_writeback.
Making folio_put() heavier is, well, quite a lot of call-sites.
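
For concreteness, the shape would be something like this (a sketch
only; folio_wake_refcount_waiters() is a made-up name for whatever
would do the wakeup):

	static inline void folio_put(struct folio *folio)
	{
		if (folio_put_testzero(folio))
			__folio_put(folio);
		else if (unlikely(folio_test_waiters(folio)))
			/* Made-up helper: wake refcount waiters. */
			folio_wake_refcount_waiters(folio);
	}

That extra test in such a hot path is exactly the "heavier" part.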


* Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: David Hildenbrand @ 2024-05-10  7:54 UTC
  To: John Hubbard, Alistair Popple, Jason Gunthorpe
  Cc: Christoph Hellwig, Andrew Morton, LKML, linux-rdma, linux-mm,
	Mike Marciniszyn, Leon Romanovsky, Artemy Kovalyov,
	Michael Guralnik, Pak Markthub, Vivek Kasireddy

On 02.05.24 20:10, John Hubbard wrote:
> On 5/1/24 11:56 PM, David Hildenbrand wrote:
>> On 02.05.24 03:05, Alistair Popple wrote:
>>> Jason Gunthorpe <jgg@nvidia.com> writes:
> ...
>>>>> This doesn't make sense.  IFF a blind retry is all that is needed it
>>>>> should be done in the core functionality.  I fear it's not that easy,
>>>>> though.
>>>>
>>>> +1
>>>>
>>>> This migration retry weirdness is a GUP issue; it needs to be solved
>>>> in the mm, not exposed to every pin_user_pages caller.
>>>>
>>>> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
>>>> then it is pretty broken..
>>>
>>> I wonder if we should remove the arbitrary retry limit in
>>> migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
>>> they migrate? By definition there should only be transient references on
>>> these pages so why do we need to limit the number of retries in the
>>> first place?
>>
>> There are some weird things that still need fixing: vmsplice() is the
>> example that doesn't use FOLL_LONGTERM.
>>
> 
> Hi David!
> 

Sorry for the late reply!

> Do you have any other call sites in mind? It sounds like one way forward
> is to fix each call site...

We also have udmabuf, which is currently getting fixed [1] similarly to
how we handle GUP. Could you and/or Jason also have a look at the
GUP-related bits? I acked it but the patch set does not seem to be
making progress.

https://lkml.kernel.org/r/20240411070157.3318425-1-vivek.kasireddy@intel.com

The sad story is:

(a) vmsplice() is harder to fix (identify all put_page() and replace 
them by unpin_user_page()), but will get fixed at some point.

(b) even !longterm DMA can take double-digit seconds

(c) other wrong code is likely to exist or to appear again and it's
     rather hard to identify+prevent reliably

IMHO we should expect migration to take a longer time and maybe never 
happen in some cases.

Memory offlining (e.g., echo "offline" > 
/sys/devices/system/memory/memory0/state) currently tries forever to
migrate pages and can be killed if started from user space using a fatal 
signal. If memory offlining happens from kernel context (unplugging 
DIMM, ACPI code triggers offlining), we'd much rather want to fail at 
some point instead of looping forever, but it hasn't really popped up as 
a problem so far.

virtio-mem uses alloc_contig_range() for best-effort allocation and will 
skip such temporarily unmovable ranges to try again later. Here, we 
really don't want to loop forever in migration code but rather fail 
earlier and try unplugging another memory block.

So as long as page pinning is triggered from user context where the user 
can simply abort the process (e.g., kill the process), sleep+retry on 
ZONE_MOVABLE + MIGRATE_CMA sounds reasonable.
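
To illustrate the gating test I have in mind (a sketch; the helper name
is made up, and is_migrate_cma_page() evaluates to false without
CONFIG_CMA):

	/*
	 * Made-up helper: only folios that are movable by design would
	 * qualify for an endless (but killable) sleep+retry loop.
	 */
	static bool folio_movable_by_design(struct folio *folio)
	{
		return folio_is_zone_movable(folio) ||
		       is_migrate_cma_page(&folio->page);
	}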

> 
> This is an unhappy story right now: from a user's point of view, the
> pin_user_pages*() APIs are significantly worse than before the "migrate
> pages away automatically" upgrade, because now the APIs fail
> intermittently even for callers who follow the rules--basically,
> pin_user_pages() is not fully working yet.
> 
> Other ideas, large or small, about how to approach a fix?

What Jason says makes sense to me: sleep+retry. My only concern is when 
pin_user_pages_*() is called from non-killable context where failing at 
some point might be more reasonable. But maybe that use case doesn't 
really exist?

-- 
Cheers,

David / dhildenb



* RE: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
From: Kasireddy, Vivek @ 2024-05-11  0:32 UTC
  To: David Hildenbrand, John Hubbard, Alistair Popple,
	Jason Gunthorpe, Andrew Morton
  Cc: Christoph Hellwig, LKML, linux-rdma, linux-mm, Marciniszyn, Mike,
	Leon Romanovsky, Artemy Kovalyov, Michael Guralnik, Pak Markthub

Hi David,

> 
> On 02.05.24 20:10, John Hubbard wrote:
> > On 5/1/24 11:56 PM, David Hildenbrand wrote:
> >> On 02.05.24 03:05, Alistair Popple wrote:
> >>> Jason Gunthorpe <jgg@nvidia.com> writes:
> > ...
> >>>>> This doesn't make sense.  IFF a blind retry is all that is needed
> >>>>> it should be done in the core functionality.  I fear it's not that
> >>>>> easy, though.
> >>>>
> >>>> +1
> >>>>
> >>>> This migration retry weirdness is a GUP issue; it needs to be
> >>>> solved in the mm, not exposed to every pin_user_pages caller.
> >>>>
> >>>> If it turns out ZONE_MOVABLE pages can't actually be reliably
> >>>> moved then it is pretty broken..
> >>>
> >>> I wonder if we should remove the arbitrary retry limit in
> >>> migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
> >>> they migrate? By definition there should only be transient
> >>> references on these pages so why do we need to limit the number of
> >>> retries in the first place?
> >>
> >> There are some weird things that still need fixing: vmsplice() is
> >> the example that doesn't use FOLL_LONGTERM.
> >>
> >
> > Hi David!
> >
> 
> Sorry for the late reply!
> 
> > Do you have any other call sites in mind? It sounds like one way
> > forward is to fix each call site...
> 
> We also have udmabuf, which is currently getting fixed [1] similarly to how we
> handle GUP. Could you and/or Jason also have a look at the GUP-related
> bits? I acked it but the patch set does not seem to be making progress.
> 
> https://lkml.kernel.org/r/20240411070157.3318425-1-vivek.kasireddy@intel.com
I am hoping Andrew will pick this series up for the upcoming 6.10 cycle,
given that most patches have Acks, and I am planning to maintain the
udmabuf bits.

Thanks,
Vivek

> 
> The sad story is:
> 
> (a) vmsplice() is harder to fix (identify all put_page() and replace them by
> unpin_user_page()), but will get fixed at some point.
> 
> (b) even !longterm DMA can take double-digit seconds
> 
> (c) other wrong code is likely to exist or to appear again and it's
>      rather hard to identify+prevent reliably
> 
> IMHO we should expect migration to take a longer time and maybe never
> happen in some cases.
> 
> Memory offlining (e.g., echo "offline" >
> /sys/devices/system/memory/memory0/state) currently tries forever to
> migrate pages and can be killed if started from user space using a fatal signal.
> If memory offlining happens from kernel context (unplugging DIMM, ACPI
> code triggers offlining), we'd much rather want to fail at some point instead
> of looping forever, but it hasn't really popped up as a problem so far.
> 
> virtio-mem uses alloc_contig_range() for best-effort allocation and will skip
> such temporarily unmovable ranges to try again later. Here, we really don't
> want to loop forever in migration code but rather fail earlier and try
> unplugging another memory block.
> 
> So as long as page pinning is triggered from user context where the user can
> simply abort the process (e.g., kill the process), sleep+retry on
> ZONE_MOVABLE + MIGRATE_CMA sounds reasonable.
> 
> >
> > This is an unhappy story right now: from a user's point of view, the
> > pin_user_pages*() APIs are significantly worse than before the
> > "migrate pages away automatically" upgrade, because now the APIs fail
> > intermittently even for callers who follow the rules--basically,
> > pin_user_pages() is not fully working yet.
> >
> > Other ideas, large or small, about how to approach a fix?
> 
> What Jason says makes sense to me: sleep+retry. My only concern is when
> pin_user_pages_*() is called from non-killable context where failing at some
> point might be more reasonable. But maybe that use case doesn't really
> exist?
> 
> --
> Cheers,
> 
> David / dhildenb


