All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate
@ 2022-10-20  7:49 Baolin Wang
  2022-10-20  7:49 ` [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt Baolin Wang
  2022-10-20  8:15 ` [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Huang, Ying
  0 siblings, 2 replies; 11+ messages in thread
From: Baolin Wang @ 2022-10-20  7:49 UTC (permalink / raw)
  To: akpm
  Cc: david, ying.huang, ziy, shy828301, baolin.wang, jingshan,
	linux-mm, linux-kernel

The migrate_pages() will return the number of {normal page, THP, hugetlb}
that were not migrated, or an error code. That means it can still return
the number of failure count, though the pages have been migrated
successfully with several times re-try.

So we should not use the return value of migrate_pages() to determin
if there are pages are failed to migrate. Instead we can validate the
'movable_page_list' to see if there are pages remained in the list,
which are failed to migrate. That can mitigate the failure of longterm
pinning.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/gup.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 5182aba..bd8cfcd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1914,9 +1914,10 @@ static int migrate_longterm_unpinnable_pages(
 			.gfp_mask = GFP_USER | __GFP_NOWARN,
 		};
 
-		if (migrate_pages(movable_page_list, alloc_migration_target,
-				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				  MR_LONGTERM_PIN, NULL)) {
+		ret = migrate_pages(movable_page_list, alloc_migration_target,
+				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+				    MR_LONGTERM_PIN, NULL);
+		if (ret < 0 || !list_empty(movable_page_list)) {
 			ret = -ENOMEM;
 			goto err;
 		}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
  2022-10-20  7:49 [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Baolin Wang
@ 2022-10-20  7:49 ` Baolin Wang
  2022-10-20  8:24   ` Huang, Ying
  2022-10-20  8:15 ` [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Huang, Ying
  1 sibling, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2022-10-20  7:49 UTC (permalink / raw)
  To: akpm
  Cc: david, ying.huang, ziy, shy828301, baolin.wang, jingshan,
	linux-mm, linux-kernel

When creating a virtual machine, we will use memfd_create() to get
a file descriptor which can be used to create share memory mappings
using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
flag to allocate physical pages for the virtual machine.

When allocating physical pages for the guest, the host can fallback to
allocate some CMA pages for the guest when over half of the zone's free
memory is in the CMA area.

In guest os, when the application wants to do some data transaction with
DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
create IOMMU mappings for the DMA pages. However, when calling
VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
failed to longterm-pin sometimes.

After some invetigation, we found the pages used to do DMA mapping can
contain some CMA pages, and these CMA pages will cause a possible
failure of the longterm-pin, due to failed to migrate the CMA pages.
The reason of migration failure may be temporary reference count or
memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
ioctl returns error, which makes the application failed to start.

I observed one migration failure case (which is not easy to reproduce) is
that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
count is also 1.

That means when migrating a THP which is in CMA area, but can not allocate
a new THP due to memory fragmentation, so it will split the THP. However
THP split is also failed, probably the reason is temporary reference count
of this THP. And the temporary reference count can be caused by dropping
page caches (I observed the drop caches operation in the system), but we
can not drop the shmem page caches due to they are already dirty at that time.

Especially for THP split failure, which is caused by temporary reference
count, we can try again to mitigate the failure of migration in this case
according to previous discussion [1].

[1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/huge_memory.c |  4 ++--
 mm/migrate.c     | 18 +++++++++++++++---
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ad17c8d..a79f03b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * split PMDs
 	 */
 	if (!can_split_folio(folio, &extra_pins)) {
-		ret = -EBUSY;
+		ret = -EAGAIN;
 		goto out_unlock;
 	}
 
@@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			xas_unlock(&xas);
 		local_irq_enable();
 		remap_page(folio, folio_nr_pages(folio));
-		ret = -EBUSY;
+		ret = -EAGAIN;
 	}
 
 out_unlock:
diff --git a/mm/migrate.c b/mm/migrate.c
index 8e5eb6e..55c7855 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 				if (is_thp) {
 					nr_thp_failed++;
 					/* THP NUMA faulting doesn't split THP to retry. */
-					if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
-						nr_thp_split++;
-						break;
+					if (!nosplit) {
+						rc = try_split_thp(page, &thp_split_pages);
+						if (!rc) {
+							nr_thp_split++;
+							break;
+						} else if (reason == MR_LONGTERM_PIN &&
+							   rc == -EAGAIN) {
+							/*
+							 * Try again to split THP to mitigate
+							 * the failure of longterm pinning.
+							 */
+							thp_retry++;
+							nr_retry_pages += nr_subpages;
+							break;
+						}
 					}
 				} else if (!no_subpage_counting) {
 					nr_failed++;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate
  2022-10-20  7:49 [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Baolin Wang
  2022-10-20  7:49 ` [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt Baolin Wang
@ 2022-10-20  8:15 ` Huang, Ying
  2022-10-20  9:24   ` Baolin Wang
  1 sibling, 1 reply; 11+ messages in thread
From: Huang, Ying @ 2022-10-20  8:15 UTC (permalink / raw)
  To: Baolin Wang; +Cc: akpm, david, ziy, shy828301, jingshan, linux-mm, linux-kernel

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> The migrate_pages() will return the number of {normal page, THP, hugetlb}
> that were not migrated, or an error code. That means it can still return
> the number of failure count, though the pages have been migrated
> successfully with several times re-try.

If my understanding were correct, if pages are migrated successfully
after several times re-tries, the return value will be 0.  There's one
possibility for migrate_pages() to return non-zero but all pages are
migrated.  That is, when THP is split and all subpages are migrated
successfully.

> So we should not use the return value of migrate_pages() to determin
> if there are pages are failed to migrate. Instead we can validate the
> 'movable_page_list' to see if there are pages remained in the list,
> which are failed to migrate. That can mitigate the failure of longterm
> pinning.

Another choice is to use a special return value for split THP + success
migration.  But I'm fine to use list_empty(return_pages).

> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/gup.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 5182aba..bd8cfcd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1914,9 +1914,10 @@ static int migrate_longterm_unpinnable_pages(
>  			.gfp_mask = GFP_USER | __GFP_NOWARN,
>  		};
>  
> -		if (migrate_pages(movable_page_list, alloc_migration_target,
> -				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
> -				  MR_LONGTERM_PIN, NULL)) {
> +		ret = migrate_pages(movable_page_list, alloc_migration_target,
> +				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
> +				    MR_LONGTERM_PIN, NULL);
> +		if (ret < 0 || !list_empty(movable_page_list)) {

It seems that !list_empty() is sufficient here.

>  			ret = -ENOMEM;

Why change the error code?  I don't think it's a good idea to do that.

>  			goto err;
>  		}

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
  2022-10-20  7:49 ` [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt Baolin Wang
@ 2022-10-20  8:24   ` Huang, Ying
  2022-10-20  9:33     ` Baolin Wang
  0 siblings, 1 reply; 11+ messages in thread
From: Huang, Ying @ 2022-10-20  8:24 UTC (permalink / raw)
  To: Baolin Wang; +Cc: akpm, david, ziy, shy828301, jingshan, linux-mm, linux-kernel

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> When creating a virtual machine, we will use memfd_create() to get
> a file descriptor which can be used to create share memory mappings
> using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
> flag to allocate physical pages for the virtual machine.
>
> When allocating physical pages for the guest, the host can fallback to
> allocate some CMA pages for the guest when over half of the zone's free
> memory is in the CMA area.
>
> In guest os, when the application wants to do some data transaction with
> DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
> create IOMMU mappings for the DMA pages. However, when calling
> VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
> failed to longterm-pin sometimes.
>
> After some invetigation, we found the pages used to do DMA mapping can
> contain some CMA pages, and these CMA pages will cause a possible
> failure of the longterm-pin, due to failed to migrate the CMA pages.
> The reason of migration failure may be temporary reference count or
> memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
> ioctl returns error, which makes the application failed to start.
>
> I observed one migration failure case (which is not easy to reproduce) is
> that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
> count is also 1.
>
> That means when migrating a THP which is in CMA area, but can not allocate
> a new THP due to memory fragmentation, so it will split the THP. However
> THP split is also failed, probably the reason is temporary reference count
> of this THP. And the temporary reference count can be caused by dropping
> page caches (I observed the drop caches operation in the system), but we
> can not drop the shmem page caches due to they are already dirty at that time.
>
> Especially for THP split failure, which is caused by temporary reference
> count, we can try again to mitigate the failure of migration in this case
> according to previous discussion [1].

Does the patch solved your problem?

> [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/huge_memory.c |  4 ++--
>  mm/migrate.c     | 18 +++++++++++++++---
>  2 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ad17c8d..a79f03b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	 * split PMDs
>  	 */
>  	if (!can_split_folio(folio, &extra_pins)) {
> -		ret = -EBUSY;
> +		ret = -EAGAIN;
>  		goto out_unlock;
>  	}
>  
> @@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			xas_unlock(&xas);
>  		local_irq_enable();
>  		remap_page(folio, folio_nr_pages(folio));
> -		ret = -EBUSY;
> +		ret = -EAGAIN;
>  	}
>  
>  out_unlock:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8e5eb6e..55c7855 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  				if (is_thp) {
>  					nr_thp_failed++;
>  					/* THP NUMA faulting doesn't split THP to retry. */
> -					if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
> -						nr_thp_split++;
> -						break;
> +					if (!nosplit) {
> +						rc = try_split_thp(page, &thp_split_pages);
> +						if (!rc) {
> +							nr_thp_split++;
> +							break;
> +						} else if (reason == MR_LONGTERM_PIN &&
> +							   rc == -EAGAIN) {

In case reason != MR_LONGTERM_PIN, you change the return value of
migrate_pages().  So you need to use another variable for return value.

> +							/*
> +							 * Try again to split THP to mitigate
> +							 * the failure of longterm pinning.
> +							 */
> +							thp_retry++;
> +							nr_retry_pages += nr_subpages;
> +							break;
> +						}
>  					}
>  				} else if (!no_subpage_counting) {
>  					nr_failed++;

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate
  2022-10-20  8:15 ` [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Huang, Ying
@ 2022-10-20  9:24   ` Baolin Wang
  2022-10-20 11:43     ` Alistair Popple
  0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2022-10-20  9:24 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, ziy, shy828301, jingshan, linux-mm, linux-kernel



On 10/20/2022 4:15 PM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> The migrate_pages() will return the number of {normal page, THP, hugetlb}
>> that were not migrated, or an error code. That means it can still return
>> the number of failure count, though the pages have been migrated
>> successfully with several times re-try.
> 
> If my understanding were correct, if pages are migrated successfully
> after several times re-tries, the return value will be 0.  There's one
> possibility for migrate_pages() to return non-zero but all pages are
> migrated.  That is, when THP is split and all subpages are migrated
> successfully.

Yeah, that's the case I tested. Thanks for pointing out. I'll re-write 
my incorrect commit message next time.

> 
>> So we should not use the return value of migrate_pages() to determin
>> if there are pages are failed to migrate. Instead we can validate the
>> 'movable_page_list' to see if there are pages remained in the list,
>> which are failed to migrate. That can mitigate the failure of longterm
>> pinning.
> 
> Another choice is to use a special return value for split THP + success
> migration.  But I'm fine to use list_empty(return_pages).

OK. Using list_empty(return_pages) looks more simple.

> 
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/gup.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5182aba..bd8cfcd 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1914,9 +1914,10 @@ static int migrate_longterm_unpinnable_pages(
>>   			.gfp_mask = GFP_USER | __GFP_NOWARN,
>>   		};
>>   
>> -		if (migrate_pages(movable_page_list, alloc_migration_target,
>> -				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>> -				  MR_LONGTERM_PIN, NULL)) {
>> +		ret = migrate_pages(movable_page_list, alloc_migration_target,
>> +				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>> +				    MR_LONGTERM_PIN, NULL);
>> +		if (ret < 0 || !list_empty(movable_page_list)) {
> 
> It seems that !list_empty() is sufficient here.

OK. Drop the 'ret < 0'

>>   			ret = -ENOMEM;
> 
> Why change the error code?  I don't think it's a good idea to do that.

The GUP need a -errno for failure or partial success when migration, and 
we can not return the number of pages failed to migrate. So returning 
-ENOMEM seems suitable for both cases?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
  2022-10-20  8:24   ` Huang, Ying
@ 2022-10-20  9:33     ` Baolin Wang
  2022-10-20 19:21       ` Yang Shi
  0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2022-10-20  9:33 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, ziy, shy828301, jingshan, linux-mm, linux-kernel



On 10/20/2022 4:24 PM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> When creating a virtual machine, we will use memfd_create() to get
>> a file descriptor which can be used to create share memory mappings
>> using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
>> flag to allocate physical pages for the virtual machine.
>>
>> When allocating physical pages for the guest, the host can fallback to
>> allocate some CMA pages for the guest when over half of the zone's free
>> memory is in the CMA area.
>>
>> In guest os, when the application wants to do some data transaction with
>> DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
>> create IOMMU mappings for the DMA pages. However, when calling
>> VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
>> failed to longterm-pin sometimes.
>>
>> After some invetigation, we found the pages used to do DMA mapping can
>> contain some CMA pages, and these CMA pages will cause a possible
>> failure of the longterm-pin, due to failed to migrate the CMA pages.
>> The reason of migration failure may be temporary reference count or
>> memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
>> ioctl returns error, which makes the application failed to start.
>>
>> I observed one migration failure case (which is not easy to reproduce) is
>> that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
>> count is also 1.
>>
>> That means when migrating a THP which is in CMA area, but can not allocate
>> a new THP due to memory fragmentation, so it will split the THP. However
>> THP split is also failed, probably the reason is temporary reference count
>> of this THP. And the temporary reference count can be caused by dropping
>> page caches (I observed the drop caches operation in the system), but we
>> can not drop the shmem page caches due to they are already dirty at that time.
>>
>> Especially for THP split failure, which is caused by temporary reference
>> count, we can try again to mitigate the failure of migration in this case
>> according to previous discussion [1].
> 
> Does the patch solved your problem?

The problem is not easy to reproduce and I will test this patch on our 
products. However I think this is a likely case to fail the migration, 
which need to be addressed to mitigate the failure.

>> [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/huge_memory.c |  4 ++--
>>   mm/migrate.c     | 18 +++++++++++++++---
>>   2 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ad17c8d..a79f03b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>   	 * split PMDs
>>   	 */
>>   	if (!can_split_folio(folio, &extra_pins)) {
>> -		ret = -EBUSY;
>> +		ret = -EAGAIN;
>>   		goto out_unlock;
>>   	}
>>   
>> @@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>   			xas_unlock(&xas);
>>   		local_irq_enable();
>>   		remap_page(folio, folio_nr_pages(folio));
>> -		ret = -EBUSY;
>> +		ret = -EAGAIN;
>>   	}
>>   
>>   out_unlock:
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 8e5eb6e..55c7855 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>   				if (is_thp) {
>>   					nr_thp_failed++;
>>   					/* THP NUMA faulting doesn't split THP to retry. */
>> -					if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
>> -						nr_thp_split++;
>> -						break;
>> +					if (!nosplit) {
>> +						rc = try_split_thp(page, &thp_split_pages);
>> +						if (!rc) {
>> +							nr_thp_split++;
>> +							break;
>> +						} else if (reason == MR_LONGTERM_PIN &&
>> +							   rc == -EAGAIN) {
> 
> In case reason != MR_LONGTERM_PIN, you change the return value of
> migrate_pages().  So you need to use another variable for return value.

Good catch, will fix in next version. Thanks for your comments.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate
  2022-10-20  9:24   ` Baolin Wang
@ 2022-10-20 11:43     ` Alistair Popple
  2022-10-21  0:28       ` Huang, Ying
  2022-10-21  2:51       ` Baolin Wang
  0 siblings, 2 replies; 11+ messages in thread
From: Alistair Popple @ 2022-10-20 11:43 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Huang, Ying, akpm, david, ziy, shy828301, jingshan, linux-mm,
	linux-kernel


Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 10/20/2022 4:15 PM, Huang, Ying wrote:
>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>
>>> The migrate_pages() will return the number of {normal page, THP, hugetlb}
>>> that were not migrated, or an error code. That means it can still return
>>> the number of failure count, though the pages have been migrated
>>> successfully with several times re-try.
>> If my understanding were correct, if pages are migrated successfully
>> after several times re-tries, the return value will be 0.  There's one
>> possibility for migrate_pages() to return non-zero but all pages are
>> migrated.  That is, when THP is split and all subpages are migrated
>> successfully.
>
> Yeah, that's the case I tested. Thanks for pointing out. I'll re-write my
> incorrect commit message next time.

This is confusing to me. So users of move_page() will see an
unsuccessful migration even when all subpages were migrated? Seems like
we should fix the return code of migrate_pages() for this case where all
subpages were successfully migrated.

>>
>>> So we should not use the return value of migrate_pages() to determin
>>> if there are pages are failed to migrate. Instead we can validate the
>>> 'movable_page_list' to see if there are pages remained in the list,
>>> which are failed to migrate. That can mitigate the failure of longterm
>>> pinning.
>> Another choice is to use a special return value for split THP + success
>> migration.  But I'm fine to use list_empty(return_pages).
>
> OK. Using list_empty(return_pages) looks more simple.
>
>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   mm/gup.c | 7 ++++---
>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 5182aba..bd8cfcd 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -1914,9 +1914,10 @@ static int migrate_longterm_unpinnable_pages(
>>>   			.gfp_mask = GFP_USER | __GFP_NOWARN,
>>>   		};
>>>   -		if (migrate_pages(movable_page_list, alloc_migration_target,
>>> -				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>>> -				  MR_LONGTERM_PIN, NULL)) {
>>> +		ret = migrate_pages(movable_page_list, alloc_migration_target,
>>> +				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>>> +				    MR_LONGTERM_PIN, NULL);
>>> +		if (ret < 0 || !list_empty(movable_page_list)) {
>> It seems that !list_empty() is sufficient here.
>
> OK. Drop the 'ret < 0'
>
>>>   			ret = -ENOMEM;
>> Why change the error code?  I don't think it's a good idea to do that.
>
> The GUP need a -errno for failure or partial success when migration, and we can
> not return the number of pages failed to migrate. So returning -ENOMEM seems
> suitable for both cases?

Seem reasonable to me. migrate_pages() might return -EAGAIN which would
cause everything to be re-pinned and tried again which is not what you
want here. See the comment at the start of
check_and_migrate_movable_pages().

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
  2022-10-20  9:33     ` Baolin Wang
@ 2022-10-20 19:21       ` Yang Shi
  2022-10-21  6:15         ` Baolin Wang
  0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2022-10-20 19:21 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Huang, Ying, akpm, david, ziy, jingshan, linux-mm, linux-kernel

On Thu, Oct 20, 2022 at 2:33 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 10/20/2022 4:24 PM, Huang, Ying wrote:
> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >
> >> When creating a virtual machine, we will use memfd_create() to get
> >> a file descriptor which can be used to create share memory mappings
> >> using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
> >> flag to allocate physical pages for the virtual machine.
> >>
> >> When allocating physical pages for the guest, the host can fallback to
> >> allocate some CMA pages for the guest when over half of the zone's free
> >> memory is in the CMA area.
> >>
> >> In guest os, when the application wants to do some data transaction with
> >> DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
> >> create IOMMU mappings for the DMA pages. However, when calling
> >> VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
> >> failed to longterm-pin sometimes.
> >>
> >> After some invetigation, we found the pages used to do DMA mapping can
> >> contain some CMA pages, and these CMA pages will cause a possible
> >> failure of the longterm-pin, due to failed to migrate the CMA pages.
> >> The reason of migration failure may be temporary reference count or
> >> memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
> >> ioctl returns error, which makes the application failed to start.
> >>
> >> I observed one migration failure case (which is not easy to reproduce) is
> >> that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
> >> count is also 1.
> >>
> >> That means when migrating a THP which is in CMA area, but can not allocate
> >> a new THP due to memory fragmentation, so it will split the THP. However
> >> THP split is also failed, probably the reason is temporary reference count
> >> of this THP. And the temporary reference count can be caused by dropping
> >> page caches (I observed the drop caches operation in the system), but we
> >> can not drop the shmem page caches due to they are already dirty at that time.
> >>
> >> Especially for THP split failure, which is caused by temporary reference
> >> count, we can try again to mitigate the failure of migration in this case
> >> according to previous discussion [1].
> >
> > Does the patch solved your problem?
>
> The problem is not easy to reproduce and I will test this patch on our
> products. However I think this is a likely case to fail the migration,
> which need to be addressed to mitigate the failure.

You may try to trace all migrations across your fleet (or just pick
some sample machines, this should make data analysis easier) and
filter the migration by reasons, for example, MR_LONGTERM_PIN, then
compare the migration success rate before and after the patch. It
should be a good justification. But it may need some work on data
aggregation, process and analysis, not sure how feasible it is.

>
> >> [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> ---
> >>   mm/huge_memory.c |  4 ++--
> >>   mm/migrate.c     | 18 +++++++++++++++---
> >>   2 files changed, 17 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index ad17c8d..a79f03b 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >>       * split PMDs
> >>       */
> >>      if (!can_split_folio(folio, &extra_pins)) {
> >> -            ret = -EBUSY;
> >> +            ret = -EAGAIN;
> >>              goto out_unlock;
> >>      }
> >>
> >> @@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >>                      xas_unlock(&xas);
> >>              local_irq_enable();
> >>              remap_page(folio, folio_nr_pages(folio));
> >> -            ret = -EBUSY;
> >> +            ret = -EAGAIN;
> >>      }
> >>
> >>   out_unlock:
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 8e5eb6e..55c7855 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >>                              if (is_thp) {
> >>                                      nr_thp_failed++;
> >>                                      /* THP NUMA faulting doesn't split THP to retry. */
> >> -                                    if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
> >> -                                            nr_thp_split++;
> >> -                                            break;
> >> +                                    if (!nosplit) {
> >> +                                            rc = try_split_thp(page, &thp_split_pages);
> >> +                                            if (!rc) {
> >> +                                                    nr_thp_split++;
> >> +                                                    break;
> >> +                                            } else if (reason == MR_LONGTERM_PIN &&
> >> +                                                       rc == -EAGAIN) {
> >
> > In case reason != MR_LONGTERM_PIN, you change the return value of
> > migrate_pages().  So you need to use another variable for return value.
>
> Good catch, will fix in next version. Thanks for your comments.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate
  2022-10-20 11:43     ` Alistair Popple
@ 2022-10-21  0:28       ` Huang, Ying
  2022-10-21  2:51       ` Baolin Wang
  1 sibling, 0 replies; 11+ messages in thread
From: Huang, Ying @ 2022-10-21  0:28 UTC (permalink / raw)
  To: Baolin Wang, Alistair Popple
  Cc: akpm, david, ziy, shy828301, jingshan, linux-mm, linux-kernel

Alistair Popple <apopple@nvidia.com> writes:

> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>
>> On 10/20/2022 4:15 PM, Huang, Ying wrote:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>>
>>>> The migrate_pages() will return the number of {normal page, THP, hugetlb}
>>>> that were not migrated, or an error code. That means it can still return
>>>> the number of failure count, though the pages have been migrated
>>>> successfully with several times re-try.
>>> If my understanding were correct, if pages are migrated successfully
>>> after several times re-tries, the return value will be 0.  There's one
>>> possibility for migrate_pages() to return non-zero but all pages are
>>> migrated.  That is, when THP is split and all subpages are migrated
>>> successfully.
>>
>> Yeah, that's the case I tested. Thanks for pointing out. I'll re-write my
>> incorrect commit message next time.
>
> This is confusing to me. So users of move_page() will see an
> unsuccessful migration even when all subpages were migrated? Seems like
> we should fix the return code of migrate_pages() for this case where all
> subpages were successfully migrated.
>
>>>
>>>> So we should not use the return value of migrate_pages() to determin
>>>> if there are pages are failed to migrate. Instead we can validate the
>>>> 'movable_page_list' to see if there are pages remained in the list,
>>>> which are failed to migrate. That can mitigate the failure of longterm
>>>> pinning.
>>> Another choice is to use a special return value for split THP + success
>>> migration.  But I'm fine to use list_empty(return_pages).
>>
>> OK. Using list_empty(return_pages) looks more simple.
>>
>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>   mm/gup.c | 7 ++++---
>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 5182aba..bd8cfcd 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -1914,9 +1914,10 @@ static int migrate_longterm_unpinnable_pages(
>>>>   			.gfp_mask = GFP_USER | __GFP_NOWARN,
>>>>   		};
>>>>   -		if (migrate_pages(movable_page_list, alloc_migration_target,
>>>> -				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>>>> -				  MR_LONGTERM_PIN, NULL)) {
>>>> +		ret = migrate_pages(movable_page_list, alloc_migration_target,
>>>> +				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>>>> +				    MR_LONGTERM_PIN, NULL);
>>>> +		if (ret < 0 || !list_empty(movable_page_list)) {
>>> It seems that !list_empty() is sufficient here.
>>
>> OK. Drop the 'ret < 0'
>>
>>>>   			ret = -ENOMEM;
>>> Why change the error code?  I don't think it's a good idea to do that.
>>
>> The GUP need a -errno for failure or partial success when migration, and we can
>> not return the number of pages failed to migrate. So returning -ENOMEM seems
>> suitable for both cases?
>
> Seem reasonable to me. migrate_pages() might return -EAGAIN which would
> cause everything to be re-pinned and tried again which is not what you
> want here. See the comment at the start of
> check_and_migrate_movable_pages().

Yes.  You are right.  The error code of migrate_pages() isn't good for
caller here.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate
  2022-10-20 11:43     ` Alistair Popple
  2022-10-21  0:28       ` Huang, Ying
@ 2022-10-21  2:51       ` Baolin Wang
  1 sibling, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2022-10-21  2:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Huang, Ying, akpm, david, ziy, shy828301, jingshan, linux-mm,
	linux-kernel



On 10/20/2022 7:43 PM, Alistair Popple wrote:
> 
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 10/20/2022 4:15 PM, Huang, Ying wrote:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>>
>>>> The migrate_pages() will return the number of {normal page, THP, hugetlb}
>>>> that were not migrated, or an error code. That means it can still return
>>>> the number of failure count, though the pages have been migrated
>>>> successfully with several times re-try.
>>> If my understanding were correct, if pages are migrated successfully
>>> after several times re-tries, the return value will be 0.  There's one
>>> possibility for migrate_pages() to return non-zero but all pages are
>>> migrated.  That is, when THP is split and all subpages are migrated
>>> successfully.
>>
>> Yeah, that's the case I tested. Thanks for pointing out. I'll re-write my
>> incorrect commit message next time.
> 
> This is confusing to me. So users of move_page() will see an
> unsuccessful migration even when all subpages were migrated? Seems like

Yes.

> we should fix the return code of migrate_pages() for this case where all
> subpages were successfully migrated.

After more investigation, some other callers will also check the return 
value to see of all pages are migrated successfully. So yes, I will 
change the return value in migrate_pages() to fix this issue for all 
callers like you and Ying suggested. Thanks.

>>>> So we should not use the return value of migrate_pages() to determin
>>>> if there are pages are failed to migrate. Instead we can validate the
>>>> 'movable_page_list' to see if there are pages remained in the list,
>>>> which are failed to migrate. That can mitigate the failure of longterm
>>>> pinning.
>>> Another choice is to use a special return value for split THP + success
>>> migration.  But I'm fine to use list_empty(return_pages).
>>
>> OK. Using list_empty(return_pages) looks more simple.
>>
>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/gup.c | 7 ++++---
>>>>    1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 5182aba..bd8cfcd 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -1914,9 +1914,10 @@ static int migrate_longterm_unpinnable_pages(
>>>>    			.gfp_mask = GFP_USER | __GFP_NOWARN,
>>>>    		};
>>>>    -		if (migrate_pages(movable_page_list, alloc_migration_target,
>>>> -				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>>>> -				  MR_LONGTERM_PIN, NULL)) {
>>>> +		ret = migrate_pages(movable_page_list, alloc_migration_target,
>>>> +				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>>>> +				    MR_LONGTERM_PIN, NULL);
>>>> +		if (ret < 0 || !list_empty(movable_page_list)) {
>>> It seems that !list_empty() is sufficient here.
>>
>> OK. Drop the 'ret < 0'
>>
>>>>    			ret = -ENOMEM;
>>> Why change the error code?  I don't think it's a good idea to do that.
>>
>> The GUP need a -errno for failure or partial success when migration, and we can
>> not return the number of pages failed to migrate. So returning -ENOMEM seems
>> suitable for both cases?
> 
> Seem reasonable to me. migrate_pages() might return -EAGAIN which would
> cause everything to be re-pinned and tried again which is not what you
> want here. See the comment at the start of
> check_and_migrate_movable_pages().

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
  2022-10-20 19:21       ` Yang Shi
@ 2022-10-21  6:15         ` Baolin Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Baolin Wang @ 2022-10-21  6:15 UTC (permalink / raw)
  To: Yang Shi; +Cc: Huang, Ying, akpm, david, ziy, jingshan, linux-mm, linux-kernel



On 10/21/2022 3:21 AM, Yang Shi wrote:
> On Thu, Oct 20, 2022 at 2:33 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 10/20/2022 4:24 PM, Huang, Ying wrote:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>>
>>>> When creating a virtual machine, we will use memfd_create() to get
>>>> a file descriptor which can be used to create share memory mappings
>>>> using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
>>>> flag to allocate physical pages for the virtual machine.
>>>>
>>>> When allocating physical pages for the guest, the host can fallback to
>>>> allocate some CMA pages for the guest when over half of the zone's free
>>>> memory is in the CMA area.
>>>>
>>>> In guest os, when the application wants to do some data transaction with
>>>> DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
>>>> create IOMMU mappings for the DMA pages. However, when calling
>>>> VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
>>>> failed to longterm-pin sometimes.
>>>>
>>>> After some invetigation, we found the pages used to do DMA mapping can
>>>> contain some CMA pages, and these CMA pages will cause a possible
>>>> failure of the longterm-pin, due to failed to migrate the CMA pages.
>>>> The reason of migration failure may be temporary reference count or
>>>> memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
>>>> ioctl returns error, which makes the application failed to start.
>>>>
>>>> I observed one migration failure case (which is not easy to reproduce) is
>>>> that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
>>>> count is also 1.
>>>>
>>>> That means when migrating a THP which is in CMA area, but can not allocate
>>>> a new THP due to memory fragmentation, so it will split the THP. However
>>>> THP split is also failed, probably the reason is temporary reference count
>>>> of this THP. And the temporary reference count can be caused by dropping
>>>> page caches (I observed the drop caches operation in the system), but we
>>>> can not drop the shmem page caches due to they are already dirty at that time.
>>>>
>>>> Especially for THP split failure, which is caused by temporary reference
>>>> count, we can try again to mitigate the failure of migration in this case
>>>> according to previous discussion [1].
>>>
>>> Does the patch solved your problem?
>>
>> The problem is not easy to reproduce and I will test this patch on our
>> products. However I think this is a likely case to fail the migration,
>> which need to be addressed to mitigate the failure.
> 
> You may try to trace all migrations across your fleet (or just pick
> some sample machines, this should make data analysis easier) and
> filter the migration by reasons, for example, MR_LONGTERM_PIN, then
> compare the migration success rate before and after the patch. It > should be a good justification. But it may need some work on data
> aggregation, process and analysis, not sure how feasible it is.

IMO the migration of MR_LONGTERM_PIN is very rare in this case, so we 
can obeserve the migraion failure of longterm pin, once obeserved, the 
application will be aborted. However like I said before, the problem is 
not easy to reproduce :(

Anyway we'll test this 2 patches on our products.

>>>> [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/huge_memory.c |  4 ++--
>>>>    mm/migrate.c     | 18 +++++++++++++++---
>>>>    2 files changed, 17 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index ad17c8d..a79f03b 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>>        * split PMDs
>>>>        */
>>>>       if (!can_split_folio(folio, &extra_pins)) {
>>>> -            ret = -EBUSY;
>>>> +            ret = -EAGAIN;
>>>>               goto out_unlock;
>>>>       }
>>>>
>>>> @@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>>                       xas_unlock(&xas);
>>>>               local_irq_enable();
>>>>               remap_page(folio, folio_nr_pages(folio));
>>>> -            ret = -EBUSY;
>>>> +            ret = -EAGAIN;
>>>>       }
>>>>
>>>>    out_unlock:
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 8e5eb6e..55c7855 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>>                               if (is_thp) {
>>>>                                       nr_thp_failed++;
>>>>                                       /* THP NUMA faulting doesn't split THP to retry. */
>>>> -                                    if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
>>>> -                                            nr_thp_split++;
>>>> -                                            break;
>>>> +                                    if (!nosplit) {
>>>> +                                            rc = try_split_thp(page, &thp_split_pages);
>>>> +                                            if (!rc) {
>>>> +                                                    nr_thp_split++;
>>>> +                                                    break;
>>>> +                                            } else if (reason == MR_LONGTERM_PIN &&
>>>> +                                                       rc == -EAGAIN) {
>>>
>>> In case reason != MR_LONGTERM_PIN, you change the return value of
>>> migrate_pages().  So you need to use another variable for return value.
>>
>> Good catch, will fix in next version. Thanks for your comments.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-10-21  6:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-20  7:49 [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Baolin Wang
2022-10-20  7:49 ` [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt Baolin Wang
2022-10-20  8:24   ` Huang, Ying
2022-10-20  9:33     ` Baolin Wang
2022-10-20 19:21       ` Yang Shi
2022-10-21  6:15         ` Baolin Wang
2022-10-20  8:15 ` [PATCH 1/2] mm: gup: Re-pin pages in case of trying several times to migrate Huang, Ying
2022-10-20  9:24   ` Baolin Wang
2022-10-20 11:43     ` Alistair Popple
2022-10-21  0:28       ` Huang, Ying
2022-10-21  2:51       ` Baolin Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.