linux-kernel.vger.kernel.org archive mirror
* [RFC 0/6] migrate_pages(): batch TLB flushing
@ 2022-09-21  6:06 Huang Ying
  2022-09-21  6:06 ` [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration Huang Ying
                   ` (9 more replies)
  0 siblings, 10 replies; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang, Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

From: "Huang, Ying" <ying.huang@intel.com>

Currently, migrate_pages() migrates pages one by one, as in the
following pseudocode,

  for each page
    unmap
    flush TLB
    copy
    restore map

If multiple pages are passed to migrate_pages(), there are
opportunities to batch the TLB flushing and copying.  That is, we can
change the code to something as follows,

  for each page
    unmap
  for each page
    flush TLB
  for each page
    copy
  for each page
    restore map

The total number of TLB flushing IPIs can be reduced considerably, and
a hardware accelerator such as DSA may be used to accelerate the page
copying.

So in this patchset, we refactor the migrate_pages() implementation
and implement batched TLB flushing.  Based on this, hardware-accelerated
page copying can be implemented.

In a naive batched implementation, if too many pages are passed to
migrate_pages(), too many pages may be unmapped at the same time.  This
increases the chance that a task has to wait for its migrated pages to
be mapped again, so the latency may be hurt.  To deal with this issue,
the maximum number of pages unmapped in a batch is restricted to no
more than HPAGE_PMD_NR.  That is, the influence is at the same level as
THP migration.
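
As a rough sketch of the idea (the real implementation is in [RFC
3/6]), migrate_pages() can split its input list into chunks of at most
HPAGE_PMD_NR base pages and batch-migrate one chunk at a time,

  /* Sketch only, simplified from [RFC 3/6]. */
  int nr_pages;
  struct page *page;
  LIST_HEAD(pagelist);

again:
  nr_pages = 0;
  list_for_each_entry(page, from, lru) {
          nr_pages += compound_nr(page);
          if (nr_pages > HPAGE_PMD_NR)
                  break;
  }
  if (nr_pages > HPAGE_PMD_NR)
          /* take only the first chunk off the input list */
          list_cut_before(&pagelist, from, &page->lru);
  else
          list_splice_init(from, &pagelist);
  rc = migrate_pages_batch(&pagelist, get_new_page, put_new_page,
                           private, mode, reason, &ret_succeeded);
  /* repeat until the input list is empty (stop early on -ENOMEM) */
  if (rc != -ENOMEM && !list_empty(from))
          goto again;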

We use the following test to measure the performance impact of the
patchset,

On a 2-socket Intel server,

 - Run pmbench memory accessing benchmark

 - Run `migratepages` to migrate pages of pmbench between node 0 and
   node 1 back and forth.

With the patchset, the number of TLB flushing IPIs is reduced by 99.1%
during the test, and the number of pages migrated successfully per
second increases by 291.7%.

This patchset is based on v6.0-rc5 and the following patchset,

[PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/

The migrate_pages() related code is being converted to folios now, so
this patchset cannot be applied to the recent akpm/mm-unstable branch.
This patchset is intended to check the basic idea.  If it is OK, I will
rebase the patchset on top of the folio changes.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
@ 2022-09-21  6:06 ` Huang Ying
  2022-09-21 15:55   ` Zi Yan
  2022-09-22  6:03   ` Baolin Wang
  2022-09-21  6:06 ` [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

This is a preparation patch to batch the page unmapping and moving for
normal pages and THPs.  Based on that, we can batch the TLB shootdown
during the page migration and make it possible to use some hardware
accelerator for the page copying.

In this patch, the hugetlb page (PageHuge()) migration is separated
from the normal page and THP migration in migrate_pages(), to make it
easy to change the normal page and THP migration implementation.
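
In sketch form (simplified from the diff below), migrate_pages()
becomes two passes over the page list,

  /* Sketch only, simplified from the diff below. */
  /* Pass 1: migrate hugetlb pages. */
  list_for_each_entry_safe(page, page2, from, lru) {
          if (!PageHuge(page))
                  continue;
          rc = unmap_and_move_huge_page(get_new_page, put_new_page,
                                        private, page, pass > 2, mode,
                                        reason, &ret_pages);
          /* rc handling: -ENOSYS, -ENOMEM, -EAGAIN, SUCCESS, other */
  }

  /* Pass 2: migrate normal pages and THPs. */
  list_for_each_entry_safe(page, page2, from, lru) {
          if (PageHuge(page))
                  continue;
          rc = unmap_and_move(get_new_page, put_new_page, private, page,
                              pass > 2, mode, reason, &ret_pages);
          /* rc handling as before */
  }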

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 64 insertions(+), 9 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 571d8c9fd5bc..117134f1c6dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1414,6 +1414,66 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 
 	trace_mm_migrate_pages_start(mode, reason);
 
+	for (pass = 0; pass < 10 && retry; pass++) {
+		retry = 0;
+
+		list_for_each_entry_safe(page, page2, from, lru) {
+			nr_subpages = compound_nr(page);
+			cond_resched();
+
+			if (!PageHuge(page))
+				continue;
+
+			rc = unmap_and_move_huge_page(get_new_page,
+						put_new_page, private, page,
+						pass > 2, mode, reason,
+						&ret_pages);
+			/*
+			 * The rules are:
+			 *	Success: hugetlb page will be put back
+			 *	-EAGAIN: stay on the from list
+			 *	-ENOMEM: stay on the from list
+			 *	-ENOSYS: stay on the from list
+			 *	Other errno: put on ret_pages list then splice to
+			 *		     from list
+			 */
+			switch(rc) {
+			case -ENOSYS:
+				/* Hugetlb migration is unsupported */
+				nr_failed++;
+				nr_failed_pages += nr_subpages;
+				list_move_tail(&page->lru, &ret_pages);
+				break;
+			case -ENOMEM:
+				/*
+				 * When memory is low, don't bother to try to migrate
+				 * other pages, just exit.
+				 */
+				nr_failed++;
+				nr_failed_pages += nr_subpages + nr_retry_pages;
+				goto out;
+			case -EAGAIN:
+				retry++;
+				nr_retry_pages += nr_subpages;
+				break;
+			case MIGRATEPAGE_SUCCESS:
+				nr_succeeded += nr_subpages;
+				break;
+			default:
+				/*
+				 * Permanent failure (-EBUSY, etc.):
+				 * unlike -EAGAIN case, the failed page is
+				 * removed from migration page list and not
+				 * retried in the next outer loop.
+				 */
+				nr_failed++;
+				nr_failed_pages += nr_subpages;
+				break;
+			}
+		}
+	}
+	nr_failed += retry;
+	retry = 1;
 thp_subpage_migration:
 	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
 		retry = 0;
@@ -1431,18 +1491,14 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 			cond_resched();
 
 			if (PageHuge(page))
-				rc = unmap_and_move_huge_page(get_new_page,
-						put_new_page, private, page,
-						pass > 2, mode, reason,
-						&ret_pages);
-			else
-				rc = unmap_and_move(get_new_page, put_new_page,
+				continue;
+
+			rc = unmap_and_move(get_new_page, put_new_page,
 						private, page, pass > 2, mode,
 						reason, &ret_pages);
 			/*
 			 * The rules are:
-			 *	Success: non hugetlb page will be freed, hugetlb
-			 *		 page will be put back
+			 *	Success: page will be freed
 			 *	-EAGAIN: stay on the from list
 			 *	-ENOMEM: stay on the from list
 			 *	-ENOSYS: stay on the from list
@@ -1468,7 +1524,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 						nr_thp_split++;
 						break;
 					}
-				/* Hugetlb migration is unsupported */
 				} else if (!no_subpage_counting) {
 					nr_failed++;
 				}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
  2022-09-21  6:06 ` [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration Huang Ying
@ 2022-09-21  6:06 ` Huang Ying
  2022-09-21 16:08   ` Zi Yan
                     ` (2 more replies)
  2022-09-21  6:06 ` [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch Huang Ying
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

This is a preparation patch to batch the page unmapping and moving
for normal pages and THPs.

In this patch, unmap_and_move() is split into migrate_page_unmap() and
migrate_page_move(), so that _unmap() and _move() can be batched in
different loops later.  To pass some information between unmap and
move, the otherwise unused newpage->mapping and newpage->private
fields are used.
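
With the split, the main loop chains the two halves as in the sketch
below (simplified from the diff); the anon_vma pointer and the
page_was_mapped flag are carried across in the not-yet-exposed new
page,

  /* Sketch only, simplified from the diff below. */
  rc = migrate_page_unmap(get_new_page, put_new_page, private, page,
                          &newpage, pass > 2, mode, reason, &ret_pages);
  /*
   * On MIGRATEPAGE_UNMAP, anon_vma and page_was_mapped have been
   * stashed in newpage->mapping and newpage->private by
   * __migrate_page_record().
   */
  if (rc == MIGRATEPAGE_UNMAP)
          rc = migrate_page_move(put_new_page, private, page, newpage,
                                 mode, reason, &ret_pages);
  /* __migrate_page_extract() reads the stashed state back in _move() */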

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 122 insertions(+), 42 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 117134f1c6dc..4a81e0bfdbcd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
 	return rc;
 }
 
-static int __unmap_and_move(struct page *page, struct page *newpage,
+static void __migrate_page_record(struct page *newpage,
+				  int page_was_mapped,
+				  struct anon_vma *anon_vma)
+{
+	newpage->mapping = (struct address_space *)anon_vma;
+	newpage->private = page_was_mapped;
+}
+
+static void __migrate_page_extract(struct page *newpage,
+				   int *page_was_mappedp,
+				   struct anon_vma **anon_vmap)
+{
+	*anon_vmap = (struct anon_vma *)newpage->mapping;
+	*page_was_mappedp = newpage->private;
+	newpage->mapping = NULL;
+	newpage->private = 0;
+}
+
+#define MIGRATEPAGE_UNMAP		1
+
+static int __migrate_page_unmap(struct page *page, struct page *newpage,
 				int force, enum migrate_mode mode)
 {
 	struct folio *folio = page_folio(page);
-	struct folio *dst = page_folio(newpage);
 	int rc = -EAGAIN;
-	bool page_was_mapped = false;
+	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	bool is_lru = !__PageMovable(page);
 
@@ -1058,8 +1077,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		goto out_unlock;
 
 	if (unlikely(!is_lru)) {
-		rc = move_to_new_folio(dst, folio, mode);
-		goto out_unlock_both;
+		__migrate_page_record(newpage, page_was_mapped, anon_vma);
+		return MIGRATEPAGE_UNMAP;
 	}
 
 	/*
@@ -1085,11 +1104,41 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
 				page);
 		try_to_migrate(folio, 0);
-		page_was_mapped = true;
+		page_was_mapped = 1;
+	}
+
+	if (!page_mapped(page)) {
+		__migrate_page_record(newpage, page_was_mapped, anon_vma);
+		return MIGRATEPAGE_UNMAP;
 	}
 
-	if (!page_mapped(page))
-		rc = move_to_new_folio(dst, folio, mode);
+	if (page_was_mapped)
+		remove_migration_ptes(folio, folio, false);
+
+out_unlock_both:
+	unlock_page(newpage);
+out_unlock:
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma)
+		put_anon_vma(anon_vma);
+	unlock_page(page);
+out:
+
+	return rc;
+}
+
+static int __migrate_page_move(struct page *page, struct page *newpage,
+			       enum migrate_mode mode)
+{
+	struct folio *folio = page_folio(page);
+	struct folio *dst = page_folio(newpage);
+	int rc;
+	int page_was_mapped = 0;
+	struct anon_vma *anon_vma = NULL;
+
+	__migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
+
+	rc = move_to_new_folio(dst, folio, mode);
 
 	/*
 	 * When successful, push newpage to LRU immediately: so that if it
@@ -1110,14 +1159,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		remove_migration_ptes(folio,
 			rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
 
-out_unlock_both:
 	unlock_page(newpage);
-out_unlock:
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 	unlock_page(page);
-out:
 	/*
 	 * If migration is successful, decrease refcount of the newpage,
 	 * which will not free the page because new page owner increased
@@ -1129,18 +1175,31 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	return rc;
 }
 
-/*
- * Obtain the lock on page, remove all ptes and migrate the page
- * to the newly allocated page in newpage.
- */
-static int unmap_and_move(new_page_t get_new_page,
-				   free_page_t put_new_page,
-				   unsigned long private, struct page *page,
-				   int force, enum migrate_mode mode,
-				   enum migrate_reason reason,
-				   struct list_head *ret)
+static void migrate_page_done(struct page *page,
+			      enum migrate_reason reason)
+{
+	/*
+	 * Compaction can migrate also non-LRU pages which are
+	 * not accounted to NR_ISOLATED_*. They can be recognized
+	 * as __PageMovable
+	 */
+	if (likely(!__PageMovable(page)))
+		mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+				    page_is_file_lru(page), -thp_nr_pages(page));
+
+	if (reason != MR_MEMORY_FAILURE)
+		/* We release the page in page_handle_poison. */
+		put_page(page);
+}
+
+/* Obtain the lock on page, remove all ptes. */
+static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
+			      unsigned long private, struct page *page,
+			      struct page **newpagep, int force,
+			      enum migrate_mode mode, enum migrate_reason reason,
+			      struct list_head *ret)
 {
-	int rc = MIGRATEPAGE_SUCCESS;
+	int rc = MIGRATEPAGE_UNMAP;
 	struct page *newpage = NULL;
 
 	if (!thp_migration_supported() && PageTransHuge(page))
@@ -1151,19 +1210,48 @@ static int unmap_and_move(new_page_t get_new_page,
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
 		/* free_pages_prepare() will clear PG_isolated. */
-		goto out;
+		list_del(&page->lru);
+		migrate_page_done(page, reason);
+		return MIGRATEPAGE_SUCCESS;
 	}
 
 	newpage = get_new_page(page, private);
 	if (!newpage)
 		return -ENOMEM;
+	*newpagep = newpage;
 
-	newpage->private = 0;
-	rc = __unmap_and_move(page, newpage, force, mode);
+	rc = __migrate_page_unmap(page, newpage, force, mode);
+	if (rc == MIGRATEPAGE_UNMAP)
+		return rc;
+
+	/*
+	 * A page that has not been migrated will have kept its
+	 * references and be restored.
+	 */
+	/* restore the page to right list. */
+	if (rc != -EAGAIN)
+		list_move_tail(&page->lru, ret);
+
+	if (put_new_page)
+		put_new_page(newpage, private);
+	else
+		put_page(newpage);
+
+	return rc;
+}
+
+/* Migrate the page to the newly allocated page in newpage. */
+static int migrate_page_move(free_page_t put_new_page, unsigned long private,
+			     struct page *page, struct page *newpage,
+			     enum migrate_mode mode, enum migrate_reason reason,
+			     struct list_head *ret)
+{
+	int rc;
+
+	rc = __migrate_page_move(page, newpage, mode);
 	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
 
-out:
 	if (rc != -EAGAIN) {
 		/*
 		 * A page that has been migrated has all references
@@ -1179,20 +1267,7 @@ static int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		/*
-		 * Compaction can migrate also non-LRU pages which are
-		 * not accounted to NR_ISOLATED_*. They can be recognized
-		 * as __PageMovable
-		 */
-		if (likely(!__PageMovable(page)))
-			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_lru(page), -thp_nr_pages(page));
-
-		if (reason != MR_MEMORY_FAILURE)
-			/*
-			 * We release the page in page_handle_poison.
-			 */
-			put_page(page);
+		migrate_page_done(page, reason);
 	} else {
 		if (rc != -EAGAIN)
 			list_add_tail(&page->lru, ret);
@@ -1405,6 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	int pass = 0;
 	bool is_thp = false;
 	struct page *page;
+	struct page *newpage = NULL;
 	struct page *page2;
 	int rc, nr_subpages;
 	LIST_HEAD(ret_pages);
@@ -1493,9 +1569,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 			if (PageHuge(page))
 				continue;
 
-			rc = unmap_and_move(get_new_page, put_new_page,
-						private, page, pass > 2, mode,
+			rc = migrate_page_unmap(get_new_page, put_new_page, private,
+						page, &newpage, pass > 2, mode,
 						reason, &ret_pages);
+			if (rc == MIGRATEPAGE_UNMAP)
+				rc = migrate_page_move(put_new_page, private,
+						       page, newpage, mode,
+						       reason, &ret_pages);
 			/*
 			 * The rules are:
 			 *	Success: page will be freed
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
  2022-09-21  6:06 ` [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration Huang Ying
  2022-09-21  6:06 ` [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
@ 2022-09-21  6:06 ` Huang Ying
  2022-09-21 16:10   ` Zi Yan
  2022-09-21  6:06 ` [RFC 4/6] mm/migrate_pages: batch _unmap and _move Huang Ying
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

This is a preparation patch to batch the page unmapping and moving
for normal pages and THPs.

Once the page unmapping is batched, all pages to be migrated will be
unmapped before their contents and flags are copied.  If the number of
pages passed to migrate_pages() is too large, too many pages will be
unmapped at once, and the execution of the processes that use them may
be stalled for too long.  For example, the migrate_pages() syscall
calls migrate_pages() with all pages of a process.  To avoid this
possible issue, in this patch, we restrict the number of pages migrated
in a batch to no more than HPAGE_PMD_NR.  That is, the influence is at
the same level as THP migration.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/migrate.c | 93 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 67 insertions(+), 26 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 4a81e0bfdbcd..1077af858e36 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1439,32 +1439,7 @@ static inline int try_split_thp(struct page *page, struct list_head *split_pages
 	return rc;
 }
 
-/*
- * migrate_pages - migrate the pages specified in a list, to the free pages
- *		   supplied as the target for the page migration
- *
- * @from:		The list of pages to be migrated.
- * @get_new_page:	The function used to allocate free pages to be used
- *			as the target of the page migration.
- * @put_new_page:	The function used to free target pages if migration
- *			fails, or NULL if no special handling is necessary.
- * @private:		Private data to be passed on to get_new_page()
- * @mode:		The migration mode that specifies the constraints for
- *			page migration, if any.
- * @reason:		The reason for page migration.
- * @ret_succeeded:	Set to the number of normal pages migrated successfully if
- *			the caller passes a non-NULL pointer.
- *
- * The function returns after 10 attempts or if no pages are movable any more
- * because the list has become empty or no retryable pages exist any more.
- * It is caller's responsibility to call putback_movable_pages() to return pages
- * to the LRU or free list only if ret != 0.
- *
- * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
- * an error code. The number of THP splits will be considered as the number of
- * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
- */
-int migrate_pages(struct list_head *from, new_page_t get_new_page,
+static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 		free_page_t put_new_page, unsigned long private,
 		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
@@ -1709,6 +1684,72 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	return rc;
 }
 
+/*
+ * migrate_pages - migrate the pages specified in a list, to the free pages
+ *		   supplied as the target for the page migration
+ *
+ * @from:		The list of pages to be migrated.
+ * @get_new_page:	The function used to allocate free pages to be used
+ *			as the target of the page migration.
+ * @put_new_page:	The function used to free target pages if migration
+ *			fails, or NULL if no special handling is necessary.
+ * @private:		Private data to be passed on to get_new_page()
+ * @mode:		The migration mode that specifies the constraints for
+ *			page migration, if any.
+ * @reason:		The reason for page migration.
+ * @ret_succeeded:	Set to the number of normal pages migrated successfully if
+ *			the caller passes a non-NULL pointer.
+ *
+ * The function returns after 10 attempts or if no pages are movable any more
+ * because the list has become empty or no retryable pages exist any more.
+ * It is caller's responsibility to call putback_movable_pages() to return pages
+ * to the LRU or free list only if ret != 0.
+ *
+ * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
+ * an error code. The number of THP splits will be considered as the number of
+ * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
+ */
+int migrate_pages(struct list_head *from, new_page_t get_new_page,
+		free_page_t put_new_page, unsigned long private,
+		enum migrate_mode mode, int reason, unsigned int *pret_succeeded)
+{
+	int rc, rc_gether = 0;
+	int ret_succeeded, ret_succeeded_gether = 0;
+	int nr_pages;
+	struct page *page;
+	LIST_HEAD(pagelist);
+	LIST_HEAD(ret_pages);
+
+again:
+	nr_pages = 0;
+	list_for_each_entry(page, from, lru) {
+		nr_pages += compound_nr(page);
+		if (nr_pages > HPAGE_PMD_NR)
+			break;
+	}
+	if (nr_pages > HPAGE_PMD_NR)
+		list_cut_before(&pagelist, from, &page->lru);
+	else
+		list_splice_init(from, &pagelist);
+	rc = migrate_pages_batch(&pagelist, get_new_page, put_new_page, private,
+				 mode, reason, &ret_succeeded);
+	ret_succeeded_gether += ret_succeeded;
+	list_splice_tail_init(&pagelist, &ret_pages);
+	if (rc == -ENOMEM) {
+		rc_gether = rc;
+		goto out;
+	}
+	rc_gether += rc;
+	if (!list_empty(from))
+		goto again;
+out:
+	if (pret_succeeded)
+		*pret_succeeded = ret_succeeded_gether;
+	list_splice(&ret_pages, from);
+
+	return rc_gether;
+}
+
 struct page *alloc_migration_target(struct page *page, unsigned long private)
 {
 	struct folio *folio = page_folio(page);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC 4/6] mm/migrate_pages: batch _unmap and _move
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (2 preceding siblings ...)
  2022-09-21  6:06 ` [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch Huang Ying
@ 2022-09-21  6:06 ` Huang Ying
  2022-09-21  6:06 ` [RFC 5/6] mm/migrate_pages: share more code between " Huang Ying
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

In this patch, the _unmap and _move stages of the page migration are
batched.  That is, previously the flow was,

  for each page
    _unmap()
    _move()

Now, it is,

  for each page
    _unmap()
  for each page
    _move()

Based on this, we can batch the TLB flushing and use some hardware
accelerator to copy pages between the batched _unmap and batched _move
stages.
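
In sketch form (simplified from the diff below), successfully unmapped
pages are collected on one list, their target pages on a parallel
list, and the _move stage walks the two lists in lockstep,

  /* Sketch only, simplified from the diff below. */
  LIST_HEAD(unmap_pages);
  LIST_HEAD(new_pages);

  /* _unmap stage */
  list_for_each_entry_safe(page, page2, from, lru) {
          rc = migrate_page_unmap(get_new_page, put_new_page, private,
                                  page, &newpage, pass > 2, mode,
                                  reason, &ret_pages);
          if (rc == MIGRATEPAGE_UNMAP) {
                  list_move_tail(&page->lru, &unmap_pages);
                  list_add_tail(&newpage->lru, &new_pages);
          }
          /* other return codes handled as before */
  }

  /* _move stage: walk unmap_pages and new_pages in lockstep */
  newpage = list_first_entry(&new_pages, struct page, lru);
  newpage2 = list_next_entry(newpage, lru);
  list_for_each_entry_safe(page, page2, &unmap_pages, lru) {
          rc = migrate_page_move(put_new_page, private, page, newpage,
                                 mode, reason, &ret_pages);
          /* rc handling as before */
          newpage = newpage2;
          newpage2 = list_next_entry(newpage, lru);
  }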

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/migrate.c | 155 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 139 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1077af858e36..165cbbc834e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -996,6 +996,32 @@ static void __migrate_page_extract(struct page *newpage,
 
 #define MIGRATEPAGE_UNMAP		1
 
+static void migrate_page_undo_page(struct page *page,
+				   int page_was_mapped,
+				   struct anon_vma *anon_vma,
+				   struct list_head *ret)
+{
+	struct folio *folio = page_folio(page);
+
+	if (page_was_mapped)
+		remove_migration_ptes(folio, folio, false);
+	if (anon_vma)
+		put_anon_vma(anon_vma);
+	unlock_page(page);
+	list_move_tail(&page->lru, ret);
+}
+
+static void migrate_page_undo_newpage(struct page *newpage,
+				      free_page_t put_new_page,
+				      unsigned long private)
+{
+	unlock_page(newpage);
+	if (put_new_page)
+		put_new_page(newpage, private);
+	else
+		put_page(newpage);
+}
+
 static int __migrate_page_unmap(struct page *page, struct page *newpage,
 				int force, enum migrate_mode mode)
 {
@@ -1140,6 +1166,8 @@ static int __migrate_page_move(struct page *page, struct page *newpage,
 
 	rc = move_to_new_folio(dst, folio, mode);
 
+	if (rc != -EAGAIN)
+		list_del(&newpage->lru);
 	/*
 	 * When successful, push newpage to LRU immediately: so that if it
 	 * turns out to be an mlocked page, remove_migration_ptes() will
@@ -1155,6 +1183,11 @@ static int __migrate_page_move(struct page *page, struct page *newpage,
 			lru_add_drain();
 	}
 
+	if (rc == -EAGAIN) {
+		__migrate_page_record(newpage, page_was_mapped, anon_vma);
+		return rc;
+	}
+
 	if (page_was_mapped)
 		remove_migration_ptes(folio,
 			rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
@@ -1220,6 +1253,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
 		return -ENOMEM;
 	*newpagep = newpage;
 
+	newpage->private = 0;
 	rc = __migrate_page_unmap(page, newpage, force, mode);
 	if (rc == MIGRATEPAGE_UNMAP)
 		return rc;
@@ -1258,7 +1292,7 @@ static int migrate_page_move(free_page_t put_new_page, unsigned long private,
 		 * removed and will be freed. A page that has not been
 		 * migrated will have kept its references and be restored.
 		 */
-		list_del(&page->lru);
+		list_del_init(&page->lru);
 	}
 
 	/*
@@ -1268,9 +1302,8 @@ static int migrate_page_move(free_page_t put_new_page, unsigned long private,
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
 		migrate_page_done(page, reason);
-	} else {
-		if (rc != -EAGAIN)
-			list_add_tail(&page->lru, ret);
+	} else if (rc != -EAGAIN) {
+		list_add_tail(&page->lru, ret);
 
 		if (put_new_page)
 			put_new_page(newpage, private);
@@ -1455,11 +1488,13 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	int pass = 0;
 	bool is_thp = false;
 	struct page *page;
-	struct page *newpage = NULL;
+	struct page *newpage = NULL, *newpage2;
 	struct page *page2;
 	int rc, nr_subpages;
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(thp_split_pages);
+	LIST_HEAD(unmap_pages);
+	LIST_HEAD(new_pages);
 	bool nosplit = (reason == MR_NUMA_MISPLACED);
 	bool no_subpage_counting = false;
 
@@ -1541,19 +1576,19 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 			nr_subpages = compound_nr(page);
 			cond_resched();
 
-			if (PageHuge(page))
+			if (PageHuge(page)) {
+				list_move_tail(&page->lru, &ret_pages);
 				continue;
+			}
 
 			rc = migrate_page_unmap(get_new_page, put_new_page, private,
 						page, &newpage, pass > 2, mode,
 						reason, &ret_pages);
-			if (rc == MIGRATEPAGE_UNMAP)
-				rc = migrate_page_move(put_new_page, private,
-						       page, newpage, mode,
-						       reason, &ret_pages);
 			/*
 			 * The rules are:
 			 *	Success: page will be freed
+			 *	Unmap: page will be put on unmap_pages list,
+			 *	       new page put on new_pages list
 			 *	-EAGAIN: stay on the from list
 			 *	-ENOMEM: stay on the from list
 			 *	-ENOSYS: stay on the from list
@@ -1589,7 +1624,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 			case -ENOMEM:
 				/*
 				 * When memory is low, don't bother to try to migrate
-				 * other pages, just exit.
+				 * other pages, move unmapped pages, then exit.
 				 */
 				if (is_thp) {
 					nr_thp_failed++;
@@ -1610,9 +1645,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				 * the caller otherwise the page refcnt will be leaked.
 				 */
 				list_splice_init(&thp_split_pages, from);
-				/* nr_failed isn't updated for not used */
 				nr_thp_failed += thp_retry;
-				goto out;
+				if (list_empty(&unmap_pages))
+					goto out;
+				else
+					goto move;
 			case -EAGAIN:
 				if (is_thp)
 					thp_retry++;
@@ -1625,6 +1662,10 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				if (is_thp)
 					nr_thp_succeeded++;
 				break;
+			case MIGRATEPAGE_UNMAP:
+				list_move_tail(&page->lru, &unmap_pages);
+				list_add_tail(&newpage->lru, &new_pages);
+				break;
 			default:
 				/*
 				 * Permanent failure (-EBUSY, etc.):
@@ -1645,12 +1686,96 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	nr_failed += retry;
 	nr_thp_failed += thp_retry;
 	nr_failed_pages += nr_retry_pages;
+move:
+	retry = 1;
+	thp_retry = 1;
+	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
+		retry = 0;
+		thp_retry = 0;
+		nr_retry_pages = 0;
+
+		newpage = list_first_entry(&new_pages, struct page, lru);
+		newpage2 = list_next_entry(newpage, lru);
+		list_for_each_entry_safe(page, page2, &unmap_pages, lru) {
+			/*
+			 * THP statistics is based on the source huge page.
+			 * Capture required information that might get lost
+			 * during migration.
+			 */
+			is_thp = PageTransHuge(page) && !PageHuge(page);
+			nr_subpages = compound_nr(page);
+			cond_resched();
+
+			rc = migrate_page_move(put_new_page, private,
+					       page, newpage, mode,
+					       reason, &ret_pages);
+			/*
+			 * The rules are:
+			 *	Success: page will be freed
+			 *	-EAGAIN: stay on the unmap_pages list
+			 *	Other errno: put on ret_pages list then splice to
+			 *		     from list
+			 */
+			switch(rc) {
+			case -EAGAIN:
+				if (is_thp)
+					thp_retry++;
+				else if (!no_subpage_counting)
+					retry++;
+				nr_retry_pages += nr_subpages;
+				break;
+			case MIGRATEPAGE_SUCCESS:
+				nr_succeeded += nr_subpages;
+				if (is_thp)
+					nr_thp_succeeded++;
+				break;
+			default:
+				/*
+				 * Permanent failure (-EBUSY, etc.):
+				 * unlike -EAGAIN case, the failed page is
+				 * removed from migration page list and not
+				 * retried in the next outer loop.
+				 */
+				if (is_thp)
+					nr_thp_failed++;
+				else if (!no_subpage_counting)
+					nr_failed++;
+
+				nr_failed_pages += nr_subpages;
+				break;
+			}
+			newpage = newpage2;
+			newpage2 = list_next_entry(newpage, lru);
+		}
+	}
+	nr_failed += retry;
+	nr_thp_failed += thp_retry;
+	nr_failed_pages += nr_retry_pages;
+
+	rc = nr_failed + nr_thp_failed;
+out:
+	/* Cleanup remaining pages */
+	newpage = list_first_entry(&new_pages, struct page, lru);
+	newpage2 = list_next_entry(newpage, lru);
+	list_for_each_entry_safe(page, page2, &unmap_pages, lru) {
+		int page_was_mapped = 0;
+		struct anon_vma *anon_vma = NULL;
+
+		__migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
+		migrate_page_undo_page(page, page_was_mapped, anon_vma,
+				       &ret_pages);
+		list_del(&newpage->lru);
+		migrate_page_undo_newpage(newpage, put_new_page, private);
+		newpage = newpage2;
+		newpage2 = list_next_entry(newpage, lru);
+	}
+
 	/*
 	 * Try to migrate subpages of fail-to-migrate THPs, no nr_failed
 	 * counting in this round, since all subpages of a THP is counted
 	 * as 1 failure in the first round.
 	 */
-	if (!list_empty(&thp_split_pages)) {
+	if (rc >= 0 && !list_empty(&thp_split_pages)) {
 		/*
 		 * Move non-migrated pages (after 10 retries) to ret_pages
 		 * to avoid migrating them again.
@@ -1662,8 +1787,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 		goto thp_subpage_migration;
 	}
 
-	rc = nr_failed + nr_thp_failed;
-out:
 	/*
 	 * Put the permanent failure page back to migration list, they
 	 * will be put back to the right list by the caller.
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC 5/6] mm/migrate_pages: share more code between _unmap and _move
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (3 preceding siblings ...)
  2022-09-21  6:06 ` [RFC 4/6] mm/migrate_pages: batch _unmap and _move Huang Ying
@ 2022-09-21  6:06 ` Huang Ying
  2022-09-21  6:06 ` [RFC 6/6] mm/migrate_pages: batch flushing TLB Huang Ying
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

This is a code cleanup patch to reduce the duplicated code between the
_unmap and _move stages of migrate_pages().  No functionality change
is expected.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/migrate.c | 240 +++++++++++++++++++++------------------------------
 1 file changed, 100 insertions(+), 140 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 165cbbc834e2..042fa147f302 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -999,6 +999,7 @@ static void __migrate_page_extract(struct page *newpage,
 static void migrate_page_undo_page(struct page *page,
 				   int page_was_mapped,
 				   struct anon_vma *anon_vma,
+				   bool locked,
 				   struct list_head *ret)
 {
 	struct folio *folio = page_folio(page);
@@ -1007,30 +1008,77 @@ static void migrate_page_undo_page(struct page *page,
 		remove_migration_ptes(folio, folio, false);
 	if (anon_vma)
 		put_anon_vma(anon_vma);
-	unlock_page(page);
-	list_move_tail(&page->lru, ret);
+	if (locked)
+		unlock_page(page);
+	if (ret)
+		list_move_tail(&page->lru, ret);
 }
 
 static void migrate_page_undo_newpage(struct page *newpage,
+				      bool locked,
 				      free_page_t put_new_page,
 				      unsigned long private)
 {
-	unlock_page(newpage);
+	if (locked)
+		unlock_page(newpage);
 	if (put_new_page)
 		put_new_page(newpage, private);
 	else
 		put_page(newpage);
 }
 
-static int __migrate_page_unmap(struct page *page, struct page *newpage,
-				int force, enum migrate_mode mode)
+static void migrate_page_done(struct page *page,
+			      enum migrate_reason reason)
+{
+	/*
+	 * Compaction can migrate also non-LRU pages which are
+	 * not accounted to NR_ISOLATED_*. They can be recognized
+	 * as __PageMovable
+	 */
+	if (likely(!__PageMovable(page)))
+		mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+				    page_is_file_lru(page), -thp_nr_pages(page));
+
+	if (reason != MR_MEMORY_FAILURE)
+		/* We release the page in page_handle_poison. */
+		put_page(page);
+}
+
+/* Obtain the lock on page, remove all ptes. */
+static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
+			      unsigned long private, struct page *page,
+			      struct page **newpagep, int force,
+			      enum migrate_mode mode, enum migrate_reason reason,
+			      struct list_head *ret)
 {
 	struct folio *folio = page_folio(page);
-	int rc = -EAGAIN;
+	int rc = MIGRATEPAGE_UNMAP;
+	struct page *newpage = NULL;
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	bool is_lru = !__PageMovable(page);
+	bool locked = false;
+	bool newpage_locked = false;
+
+	if (!thp_migration_supported() && PageTransHuge(page))
+		return -ENOSYS;
 
+	if (page_count(page) == 1) {
+		/* Page was freed from under us. So we are done. */
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		/* free_pages_prepare() will clear PG_isolated. */
+		list_del(&page->lru);
+		migrate_page_done(page, reason);
+		return MIGRATEPAGE_SUCCESS;
+	}
+
+	newpage = get_new_page(page, private);
+	if (!newpage)
+		return -ENOMEM;
+	*newpagep = newpage;
+
+	rc = -EAGAIN;
 	if (!trylock_page(page)) {
 		if (!force || mode == MIGRATE_ASYNC)
 			goto out;
@@ -1053,6 +1101,7 @@ static int __migrate_page_unmap(struct page *page, struct page *newpage,
 
 		lock_page(page);
 	}
+	locked = true;
 
 	if (PageWriteback(page)) {
 		/*
@@ -1067,10 +1116,10 @@ static int __migrate_page_unmap(struct page *page, struct page *newpage,
 			break;
 		default:
 			rc = -EBUSY;
-			goto out_unlock;
+			goto out;
 		}
 		if (!force)
-			goto out_unlock;
+			goto out;
 		wait_on_page_writeback(page);
 	}
 
@@ -1100,7 +1149,8 @@ static int __migrate_page_unmap(struct page *page, struct page *newpage,
 	 * This is much like races on refcount of oldpage: just don't BUG().
 	 */
 	if (unlikely(!trylock_page(newpage)))
-		goto out_unlock;
+		goto out;
+	newpage_locked = true;
 
 	if (unlikely(!is_lru)) {
 		__migrate_page_record(newpage, page_was_mapped, anon_vma);
@@ -1123,7 +1173,7 @@ static int __migrate_page_unmap(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(folio);
-			goto out_unlock_both;
+			goto out;
 		}
 	} else if (page_mapped(page)) {
 		/* Establish migration ptes */
@@ -1141,20 +1191,28 @@ static int __migrate_page_unmap(struct page *page, struct page *newpage,
 	if (page_was_mapped)
 		remove_migration_ptes(folio, folio, false);
 
-out_unlock_both:
-	unlock_page(newpage);
-out_unlock:
-	/* Drop an anon_vma reference if we took one */
-	if (anon_vma)
-		put_anon_vma(anon_vma);
-	unlock_page(page);
 out:
+	/*
+	 * A page that has not been migrated will have kept its
+	 * references and be restored.
+	 */
+	/* restore the page to right list. */
+	if (rc != -EAGAIN)
+                ret = NULL;
+
+	migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);
+	if (newpage)
+		migrate_page_undo_newpage(newpage, newpage_locked,
+					  put_new_page, private);
 
 	return rc;
 }
 
-static int __migrate_page_move(struct page *page, struct page *newpage,
-			       enum migrate_mode mode)
+/* Migrate the page to the newly allocated page in newpage. */
+static int migrate_page_move(free_page_t put_new_page, unsigned long private,
+			     struct page *page, struct page *newpage,
+			     enum migrate_mode mode, enum migrate_reason reason,
+			     struct list_head *ret)
 {
 	struct folio *folio = page_folio(page);
 	struct folio *dst = page_folio(newpage);
@@ -1165,9 +1223,10 @@ static int __migrate_page_move(struct page *page, struct page *newpage,
 	__migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
 
 	rc = move_to_new_folio(dst, folio, mode);
+	if (rc)
+		goto out;
 
-	if (rc != -EAGAIN)
-		list_del(&newpage->lru);
+	list_del(&newpage->lru);
 	/*
 	 * When successful, push newpage to LRU immediately: so that if it
 	 * turns out to be an mlocked page, remove_migration_ptes() will
@@ -1177,139 +1236,40 @@ static int __migrate_page_move(struct page *page, struct page *newpage,
 	 * unsuccessful, and other cases when a page has been temporarily
 	 * isolated from the unevictable LRU: but this case is the easiest.
 	 */
-	if (rc == MIGRATEPAGE_SUCCESS) {
-		lru_cache_add(newpage);
-		if (page_was_mapped)
-			lru_add_drain();
-	}
-
-	if (rc == -EAGAIN) {
-		__migrate_page_record(newpage, page_was_mapped, anon_vma);
-		return rc;
-	}
-
+	lru_cache_add(newpage);
 	if (page_was_mapped)
-		remove_migration_ptes(folio,
-			rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
+		lru_add_drain();
 
+	if (page_was_mapped)
+		remove_migration_ptes(folio, dst, false);
 	unlock_page(newpage);
-	/* Drop an anon_vma reference if we took one */
-	if (anon_vma)
-		put_anon_vma(anon_vma);
-	unlock_page(page);
+	set_page_owner_migrate_reason(newpage, reason);
 	/*
 	 * If migration is successful, decrease refcount of the newpage,
 	 * which will not free the page because new page owner increased
 	 * refcounter.
 	 */
-	if (rc == MIGRATEPAGE_SUCCESS)
-		put_page(newpage);
-
-	return rc;
-}
+	put_page(newpage);
 
-static void migrate_page_done(struct page *page,
-			      enum migrate_reason reason)
-{
 	/*
-	 * Compaction can migrate also non-LRU pages which are
-	 * not accounted to NR_ISOLATED_*. They can be recognized
-	 * as __PageMovable
+	 * A page that has been migrated has all references removed
+	 * and will be freed.
 	 */
-	if (likely(!__PageMovable(page)))
-		mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-				    page_is_file_lru(page), -thp_nr_pages(page));
-
-	if (reason != MR_MEMORY_FAILURE)
-		/* We release the page in page_handle_poison. */
-		put_page(page);
-}
-
-/* Obtain the lock on page, remove all ptes. */
-static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
-			      unsigned long private, struct page *page,
-			      struct page **newpagep, int force,
-			      enum migrate_mode mode, enum migrate_reason reason,
-			      struct list_head *ret)
-{
-	int rc = MIGRATEPAGE_UNMAP;
-	struct page *newpage = NULL;
-
-	if (!thp_migration_supported() && PageTransHuge(page))
-		return -ENOSYS;
-
-	if (page_count(page) == 1) {
-		/* Page was freed from under us. So we are done. */
-		ClearPageActive(page);
-		ClearPageUnevictable(page);
-		/* free_pages_prepare() will clear PG_isolated. */
-		list_del(&page->lru);
-		migrate_page_done(page, reason);
-		return MIGRATEPAGE_SUCCESS;
-	}
-
-	newpage = get_new_page(page, private);
-	if (!newpage)
-		return -ENOMEM;
-	*newpagep = newpage;
-
-	newpage->private = 0;
-	rc = __migrate_page_unmap(page, newpage, force, mode);
-	if (rc == MIGRATEPAGE_UNMAP)
-		return rc;
-
-	/*
-	 * A page that has not been migrated will have kept its
-	 * references and be restored.
-	 */
-	/* restore the page to right list. */
-	if (rc != -EAGAIN)
-		list_move_tail(&page->lru, ret);
-
-	if (put_new_page)
-		put_new_page(newpage, private);
-	else
-		put_page(newpage);
+	list_del(&page->lru);
+	migrate_page_undo_page(page, 0, anon_vma, true, NULL);
+	migrate_page_done(page, reason);
 
 	return rc;
-}
 
-/* Migrate the page to the newly allocated page in newpage. */
-static int migrate_page_move(free_page_t put_new_page, unsigned long private,
-			     struct page *page, struct page *newpage,
-			     enum migrate_mode mode, enum migrate_reason reason,
-			     struct list_head *ret)
-{
-	int rc;
-
-	rc = __migrate_page_move(page, newpage, mode);
-	if (rc == MIGRATEPAGE_SUCCESS)
-		set_page_owner_migrate_reason(newpage, reason);
-
-	if (rc != -EAGAIN) {
-		/*
-		 * A page that has been migrated has all references
-		 * removed and will be freed. A page that has not been
-		 * migrated will have kept its references and be restored.
-		 */
-		list_del_init(&page->lru);
+out:
+	if (rc == -EAGAIN) {
+		__migrate_page_record(newpage, page_was_mapped, anon_vma);
+		return rc;
 	}
 
-	/*
-	 * If migration is successful, releases reference grabbed during
-	 * isolation. Otherwise, restore the page to right list unless
-	 * we want to retry.
-	 */
-	if (rc == MIGRATEPAGE_SUCCESS) {
-		migrate_page_done(page, reason);
-	} else if (rc != -EAGAIN) {
-		list_add_tail(&page->lru, ret);
-
-		if (put_new_page)
-			put_new_page(newpage, private);
-		else
-			put_page(newpage);
-	}
+	migrate_page_undo_page(page, page_was_mapped, anon_vma, true, ret);
+	list_del(&newpage->lru);
+	migrate_page_undo_newpage(newpage, true, put_new_page, private);
 
 	return rc;
 }
@@ -1763,9 +1723,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 
 		__migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
 		migrate_page_undo_page(page, page_was_mapped, anon_vma,
-				       &ret_pages);
+				       true, &ret_pages);
 		list_del(&newpage->lru);
-		migrate_page_undo_newpage(newpage, put_new_page, private);
+		migrate_page_undo_newpage(newpage, true, put_new_page, private);
 		newpage = newpage2;
 		newpage2 = list_next_entry(newpage, lru);
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC 6/6] mm/migrate_pages: batch flushing TLB
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (4 preceding siblings ...)
  2022-09-21  6:06 ` [RFC 5/6] mm/migrate_pages: share more code between " Huang Ying
@ 2022-09-21  6:06 ` Huang Ying
  2022-09-21 15:47 ` [RFC 0/6] migrate_pages(): batch TLB flushing Zi Yan
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 50+ messages in thread
From: Huang Ying @ 2022-09-21  6:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Huang Ying, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

The TLB flushing can cost quite some CPU cycles during page migration
in some situations, for example, when migrating pages of a process with
multiple active threads that run on multiple CPUs.  After batching the
_unmap and _move stages in migrate_pages(), the TLB flushing can be
batched easily with the existing TLB flush batching mechanism.  This
patch implements that.
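
In sketch form (simplified from the diff below), the _unmap stage asks
try_to_migrate() to defer the TLB flush, and one flush is issued for
the whole batch before the _move stage starts,

  /* Sketch only, simplified from the diff below. */
  /* _unmap stage: clear the PTEs but defer the TLB flush */
  try_to_migrate(folio, TTU_BATCH_FLUSH);

  /* ... after all pages in the batch have been unmapped ... */

  /* one flush for the whole batch, via the existing TLB flush
   * batching mechanism (try_to_unmap_flush()) */
  try_to_unmap_flush();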

We use the following test case to test the patch.

On a 2-socket Intel server,

- Run pmbench memory accessing benchmark

- Run `migratepages` to migrate pages of pmbench between node 0 and
  node 1 back and forth.

With the patch, the number of TLB flushing IPIs is reduced by 99.1%
during the test, and the number of pages migrated successfully per
second increases by 291.7%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/migrate.c |  4 +++-
 mm/rmap.c    | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 042fa147f302..a0de0d9b4d41 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1179,7 +1179,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
 		/* Establish migration ptes */
 		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
 				page);
-		try_to_migrate(folio, 0);
+		try_to_migrate(folio, TTU_BATCH_FLUSH);
 		page_was_mapped = 1;
 	}
 
@@ -1647,6 +1647,8 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	nr_thp_failed += thp_retry;
 	nr_failed_pages += nr_retry_pages;
 move:
+	try_to_unmap_flush();
+
 	retry = 1;
 	thp_retry = 1;
 	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 93d5a6f793d2..ab88136720dc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1960,8 +1960,24 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 		} else {
 			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-			/* Nuke the page table entry. */
-			pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			/*
+			 * Nuke the page table entry.
+			 */
+			if (should_defer_flush(mm, flags)) {
+				/*
+				 * We clear the PTE but do not flush so potentially
+				 * a remote CPU could still be writing to the folio.
+				 * If the entry was previously clean then the
+				 * architecture must guarantee that a clear->dirty
+				 * transition on a cached TLB entry is written through
+				 * and traps if the PTE is unmapped.
+				 */
+				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+			} else {
+				pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			}
 		}
 
 		/* Set the dirty flag on the folio now the pte is gone. */
@@ -2128,10 +2144,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 
 	/*
 	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
-	 * TTU_SPLIT_HUGE_PMD and TTU_SYNC flags.
+	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC and TTU_BATCH_FLUSH flags.
 	 */
 	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
-					TTU_SYNC)))
+					TTU_SYNC | TTU_BATCH_FLUSH)))
 		return;
 
 	if (folio_is_zone_device(folio) &&
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (5 preceding siblings ...)
  2022-09-21  6:06 ` [RFC 6/6] mm/migrate_pages: batch flushing TLB Huang Ying
@ 2022-09-21 15:47 ` Zi Yan
  2022-09-22  1:45   ` Huang, Ying
  2022-09-22  3:47   ` haoxin
  2022-09-22 12:50 ` Bharata B Rao
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 50+ messages in thread
From: Zi Yan @ 2022-09-21 15:47 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

On 21 Sep 2022, at 2:06, Huang Ying wrote:

> From: "Huang, Ying" <ying.huang@intel.com>
>
> Currently, migrate_pages() migrates pages one by one, as in the
> following pseudocode,
>
>   for each page
>     unmap
>     flush TLB
>     copy
>     restore map
>
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
>
>   for each page
>     unmap
>   for each page
>     flush TLB
>   for each page
>     copy
>   for each page
>     restore map
>
> The total number of TLB flushing IPIs can be reduced considerably, and
> a hardware accelerator such as DSA may be used to accelerate the page
> copying.
>
> So in this patchset, we refactor the migrate_pages() implementation
> and implement batched TLB flushing.  Based on this, hardware-accelerated
> page copying can be implemented.
>
> In a naive batched implementation, if too many pages are passed to
> migrate_pages(), too many pages may be unmapped at the same time.  This
> increases the chance that a task has to wait for its migrated pages to
> be mapped again, so the latency may be hurt.  To deal with this issue,
> the maximum number of pages unmapped in a batch is restricted to no
> more than HPAGE_PMD_NR.  That is, the influence is at the same level as
> THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
>  - Run pmbench memory accessing benchmark
>
>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>    node 1 back and forth.
>
> With the patchset, the number of TLB flushing IPIs is reduced by 99.1%
> during the test, and the number of pages migrated successfully per
> second increases by 291.7%.

Thank you for the patchset. Batching page migration will definitely
improve its throughput, based on my past experiments[1], and starting
with TLB flushing is a good first step.

BTW, what is the rationale behind the increased page migration
success rate per second?

>
> This patchset is based on v6.0-rc5 and the following patchset,
>
> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>
> The migrate_pages() related code is being converted to folios now, so
> this patchset cannot be applied to the recent akpm/mm-unstable branch.
> This patchset is intended to check the basic idea.  If it is OK, I will
> rebase the patchset on top of the folio changes.
>
> Best Regards,
> Huang, Ying


[1] https://lwn.net/Articles/784925/

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration
  2022-09-21  6:06 ` [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration Huang Ying
@ 2022-09-21 15:55   ` Zi Yan
  2022-09-22  1:14     ` Huang, Ying
  2022-09-22  6:03   ` Baolin Wang
  1 sibling, 1 reply; 50+ messages in thread
From: Zi Yan @ 2022-09-21 15:55 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

On 21 Sep 2022, at 2:06, Huang Ying wrote:

> This is a preparation patch to batch the page unmapping and moving for
> normal pages and THPs.  Based on that, we can batch the TLB shootdown
> during the page migration and make it possible to use some hardware
> accelerator for the page copying.
>
> In this patch, the hugetlb page (PageHuge()) migration is separated
> from the normal page and THP migration in migrate_pages(), to make it
> easy to change the normal page and THP migration implementation.
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  mm/migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 64 insertions(+), 9 deletions(-)

Maybe it would be better to have two subroutines for hugetlb migration
and normal page migration respectively. migrate_pages() becomes very
large at this point.

>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 571d8c9fd5bc..117134f1c6dc 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1414,6 +1414,66 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>
>  	trace_mm_migrate_pages_start(mode, reason);
>
> +	for (pass = 0; pass < 10 && retry; pass++) {
> +		retry = 0;
> +
> +		list_for_each_entry_safe(page, page2, from, lru) {
> +			nr_subpages = compound_nr(page);
> +			cond_resched();
> +
> +			if (!PageHuge(page))
> +				continue;
> +
> +			rc = unmap_and_move_huge_page(get_new_page,
> +						put_new_page, private, page,
> +						pass > 2, mode, reason,
> +						&ret_pages);
> +			/*
> +			 * The rules are:
> +			 *	Success: hugetlb page will be put back
> +			 *	-EAGAIN: stay on the from list
> +			 *	-ENOMEM: stay on the from list
> +			 *	-ENOSYS: stay on the from list
> +			 *	Other errno: put on ret_pages list then splice to
> +			 *		     from list
> +			 */
> +			switch(rc) {
> +			case -ENOSYS:
> +				/* Hugetlb migration is unsupported */
> +				nr_failed++;
> +				nr_failed_pages += nr_subpages;
> +				list_move_tail(&page->lru, &ret_pages);
> +				break;
> +			case -ENOMEM:
> +				/*
> +				 * When memory is low, don't bother to try to migrate
> +				 * other pages, just exit.
> +				 */
> +				nr_failed++;
> +				nr_failed_pages += nr_subpages + nr_retry_pages;
> +				goto out;
> +			case -EAGAIN:
> +				retry++;
> +				nr_retry_pages += nr_subpages;
> +				break;
> +			case MIGRATEPAGE_SUCCESS:
> +				nr_succeeded += nr_subpages;
> +				break;
> +			default:
> +				/*
> +				 * Permanent failure (-EBUSY, etc.):
> +				 * unlike -EAGAIN case, the failed page is
> +				 * removed from migration page list and not
> +				 * retried in the next outer loop.
> +				 */
> +				nr_failed++;
> +				nr_failed_pages += nr_subpages;
> +				break;
> +			}
> +		}
> +	}
> +	nr_failed += retry;
> +	retry = 1;
>  thp_subpage_migration:
>  	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
>  		retry = 0;
> @@ -1431,18 +1491,14 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  			cond_resched();
>
>  			if (PageHuge(page))
> -				rc = unmap_and_move_huge_page(get_new_page,
> -						put_new_page, private, page,
> -						pass > 2, mode, reason,
> -						&ret_pages);
> -			else
> -				rc = unmap_and_move(get_new_page, put_new_page,
> +				continue;
> +
> +			rc = unmap_and_move(get_new_page, put_new_page,
>  						private, page, pass > 2, mode,
>  						reason, &ret_pages);
>  			/*
>  			 * The rules are:
> -			 *	Success: non hugetlb page will be freed, hugetlb
> -			 *		 page will be put back
> +			 *	Success: page will be freed
>  			 *	-EAGAIN: stay on the from list
>  			 *	-ENOMEM: stay on the from list
>  			 *	-ENOSYS: stay on the from list
> @@ -1468,7 +1524,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  						nr_thp_split++;
>  						break;
>  					}
> -				/* Hugetlb migration is unsupported */
>  				} else if (!no_subpage_counting) {
>  					nr_failed++;
>  				}
> -- 
> 2.35.1


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-21  6:06 ` [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
@ 2022-09-21 16:08   ` Zi Yan
  2022-09-22  1:15     ` Huang, Ying
  2022-09-22  6:36   ` Baolin Wang
  2022-09-26  9:28   ` Alistair Popple
  2 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2022-09-21 16:08 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 9243 bytes --]

On 21 Sep 2022, at 2:06, Huang Ying wrote:

> This is a preparation patch to batch the page unmapping and moving
> for the normal pages and THP.
>
> In this patch, unmap_and_move() is split to migrate_page_unmap() and
> migrate_page_move().  So, we can batch _unmap() and _move() in
> different loops later.  To pass some information between unmap and
> move, the original unused newpage->mapping and newpage->private are
> used.
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 122 insertions(+), 42 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 117134f1c6dc..4a81e0bfdbcd 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  	return rc;
>  }
>
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> +static void __migrate_page_record(struct page *newpage,
> +				  int page_was_mapped,
> +				  struct anon_vma *anon_vma)
> +{
> +	newpage->mapping = (struct address_space *)anon_vma;
> +	newpage->private = page_was_mapped;
> +}
> +
> +static void __migrate_page_extract(struct page *newpage,
> +				   int *page_was_mappedp,
> +				   struct anon_vma **anon_vmap)
> +{
> +	*anon_vmap = (struct anon_vma *)newpage->mapping;
> +	*page_was_mappedp = newpage->private;
> +	newpage->mapping = NULL;
> +	newpage->private = 0;
> +}
> +
> +#define MIGRATEPAGE_UNMAP		1

It is better to move this to migrate.h with MIGRATEPAGE_SUCCESS and
make them an enum.
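
Something along these lines, assuming MIGRATEPAGE_SUCCESS keeps its
current value of 0 (only a sketch of the idea, not a tested change):

	/* include/linux/migrate.h */
	enum {
		MIGRATEPAGE_SUCCESS = 0,	/* page fully migrated (or freed) */
		MIGRATEPAGE_UNMAP = 1,		/* page unmapped, move still pending */
	};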

> +
> +static int __migrate_page_unmap(struct page *page, struct page *newpage,
>  				int force, enum migrate_mode mode)
>  {
>  	struct folio *folio = page_folio(page);
> -	struct folio *dst = page_folio(newpage);
>  	int rc = -EAGAIN;
> -	bool page_was_mapped = false;
> +	int page_was_mapped = 0;
>  	struct anon_vma *anon_vma = NULL;
>  	bool is_lru = !__PageMovable(page);
>
> @@ -1058,8 +1077,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		goto out_unlock;
>
>  	if (unlikely(!is_lru)) {
> -		rc = move_to_new_folio(dst, folio, mode);
> -		goto out_unlock_both;
> +		__migrate_page_record(newpage, page_was_mapped, anon_vma);
> +		return MIGRATEPAGE_UNMAP;
>  	}
>
>  	/*
> @@ -1085,11 +1104,41 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
>  				page);
>  		try_to_migrate(folio, 0);
> -		page_was_mapped = true;
> +		page_was_mapped = 1;
> +	}
> +
> +	if (!page_mapped(page)) {
> +		__migrate_page_record(newpage, page_was_mapped, anon_vma);
> +		return MIGRATEPAGE_UNMAP;
>  	}
>
> -	if (!page_mapped(page))
> -		rc = move_to_new_folio(dst, folio, mode);
> +	if (page_was_mapped)
> +		remove_migration_ptes(folio, folio, false);
> +
> +out_unlock_both:
> +	unlock_page(newpage);
> +out_unlock:
> +	/* Drop an anon_vma reference if we took one */
> +	if (anon_vma)
> +		put_anon_vma(anon_vma);
> +	unlock_page(page);
> +out:
> +
> +	return rc;
> +}
> +
> +static int __migrate_page_move(struct page *page, struct page *newpage,
> +			       enum migrate_mode mode)
> +{
> +	struct folio *folio = page_folio(page);
> +	struct folio *dst = page_folio(newpage);
> +	int rc;
> +	int page_was_mapped = 0;
> +	struct anon_vma *anon_vma = NULL;
> +
> +	__migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
> +
> +	rc = move_to_new_folio(dst, folio, mode);
>
>  	/*
>  	 * When successful, push newpage to LRU immediately: so that if it
> @@ -1110,14 +1159,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		remove_migration_ptes(folio,
>  			rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
>
> -out_unlock_both:
>  	unlock_page(newpage);
> -out_unlock:
>  	/* Drop an anon_vma reference if we took one */
>  	if (anon_vma)
>  		put_anon_vma(anon_vma);
>  	unlock_page(page);
> -out:
>  	/*
>  	 * If migration is successful, decrease refcount of the newpage,
>  	 * which will not free the page because new page owner increased
> @@ -1129,18 +1175,31 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	return rc;
>  }
>
> -/*
> - * Obtain the lock on page, remove all ptes and migrate the page
> - * to the newly allocated page in newpage.
> - */
> -static int unmap_and_move(new_page_t get_new_page,
> -				   free_page_t put_new_page,
> -				   unsigned long private, struct page *page,
> -				   int force, enum migrate_mode mode,
> -				   enum migrate_reason reason,
> -				   struct list_head *ret)
> +static void migrate_page_done(struct page *page,
> +			      enum migrate_reason reason)
> +{
> +	/*
> +	 * Compaction can migrate also non-LRU pages which are
> +	 * not accounted to NR_ISOLATED_*. They can be recognized
> +	 * as __PageMovable
> +	 */
> +	if (likely(!__PageMovable(page)))
> +		mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> +				    page_is_file_lru(page), -thp_nr_pages(page));
> +
> +	if (reason != MR_MEMORY_FAILURE)
> +		/* We release the page in page_handle_poison. */
> +		put_page(page);
> +}
> +
> +/* Obtain the lock on page, remove all ptes. */
> +static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
> +			      unsigned long private, struct page *page,
> +			      struct page **newpagep, int force,
> +			      enum migrate_mode mode, enum migrate_reason reason,
> +			      struct list_head *ret)
>  {
> -	int rc = MIGRATEPAGE_SUCCESS;
> +	int rc = MIGRATEPAGE_UNMAP;
>  	struct page *newpage = NULL;
>
>  	if (!thp_migration_supported() && PageTransHuge(page))
> @@ -1151,19 +1210,48 @@ static int unmap_and_move(new_page_t get_new_page,
>  		ClearPageActive(page);
>  		ClearPageUnevictable(page);
>  		/* free_pages_prepare() will clear PG_isolated. */
> -		goto out;
> +		list_del(&page->lru);
> +		migrate_page_done(page, reason);
> +		return MIGRATEPAGE_SUCCESS;
>  	}
>
>  	newpage = get_new_page(page, private);
>  	if (!newpage)
>  		return -ENOMEM;
> +	*newpagep = newpage;
>
> -	newpage->private = 0;
> -	rc = __unmap_and_move(page, newpage, force, mode);
> +	rc = __migrate_page_unmap(page, newpage, force, mode);
> +	if (rc == MIGRATEPAGE_UNMAP)
> +		return rc;
> +
> +	/*
> +	 * A page that has not been migrated will have kept its
> +	 * references and be restored.
> +	 */
> +	/* restore the page to right list. */
> +	if (rc != -EAGAIN)
> +		list_move_tail(&page->lru, ret);
> +
> +	if (put_new_page)
> +		put_new_page(newpage, private);
> +	else
> +		put_page(newpage);
> +
> +	return rc;
> +}
> +
> +/* Migrate the page to the newly allocated page in newpage. */
> +static int migrate_page_move(free_page_t put_new_page, unsigned long private,
> +			     struct page *page, struct page *newpage,
> +			     enum migrate_mode mode, enum migrate_reason reason,
> +			     struct list_head *ret)
> +{
> +	int rc;
> +
> +	rc = __migrate_page_move(page, newpage, mode);
>  	if (rc == MIGRATEPAGE_SUCCESS)
>  		set_page_owner_migrate_reason(newpage, reason);
>
> -out:
>  	if (rc != -EAGAIN) {
>  		/*
>  		 * A page that has been migrated has all references
> @@ -1179,20 +1267,7 @@ static int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		/*
> -		 * Compaction can migrate also non-LRU pages which are
> -		 * not accounted to NR_ISOLATED_*. They can be recognized
> -		 * as __PageMovable
> -		 */
> -		if (likely(!__PageMovable(page)))
> -			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> -					page_is_file_lru(page), -thp_nr_pages(page));
> -
> -		if (reason != MR_MEMORY_FAILURE)
> -			/*
> -			 * We release the page in page_handle_poison.
> -			 */
> -			put_page(page);
> +		migrate_page_done(page, reason);
>  	} else {
>  		if (rc != -EAGAIN)
>  			list_add_tail(&page->lru, ret);
> @@ -1405,6 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	int pass = 0;
>  	bool is_thp = false;
>  	struct page *page;
> +	struct page *newpage = NULL;
>  	struct page *page2;
>  	int rc, nr_subpages;
>  	LIST_HEAD(ret_pages);
> @@ -1493,9 +1569,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  			if (PageHuge(page))
>  				continue;
>
> -			rc = unmap_and_move(get_new_page, put_new_page,
> -						private, page, pass > 2, mode,
> +			rc = migrate_page_unmap(get_new_page, put_new_page, private,
> +						page, &newpage, pass > 2, mode,
>  						reason, &ret_pages);
> +			if (rc == MIGRATEPAGE_UNMAP)
> +				rc = migrate_page_move(put_new_page, private,
> +						       page, newpage, mode,
> +						       reason, &ret_pages);
>  			/*
>  			 * The rules are:
>  			 *	Success: page will be freed
> -- 
> 2.35.1


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch
  2022-09-21  6:06 ` [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch Huang Ying
@ 2022-09-21 16:10   ` Zi Yan
  2022-09-21 16:15     ` Zi Yan
  2022-09-22  1:15     ` Huang, Ying
  0 siblings, 2 replies; 50+ messages in thread
From: Zi Yan @ 2022-09-21 16:10 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 6112 bytes --]

On 21 Sep 2022, at 2:06, Huang Ying wrote:

> This is a preparation patch to batch the page unmapping and moving
> for the normal pages and THP.
>
> If we had batched the page unmapping, all pages to be migrated would
> be unmapped before copying the contents and flags of the pages.  If
> the number of pages that were passed to migrate_pages() was too large,
> too many pages would be unmapped.  Then, the execution of their
> processes would be stopped for too long time.  For example,
> migrate_pages() syscall will call migrate_pages() with all pages of a
> process.  To avoid this possible issue, in this patch, we restrict the
> number of pages to be migrated to be no more than HPAGE_PMD_NR.  That
> is, the influence is at the same level of THP migration.
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  mm/migrate.c | 93 +++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 67 insertions(+), 26 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4a81e0bfdbcd..1077af858e36 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1439,32 +1439,7 @@ static inline int try_split_thp(struct page *page, struct list_head *split_pages
>  	return rc;
>  }
>
> -/*
> - * migrate_pages - migrate the pages specified in a list, to the free pages
> - *		   supplied as the target for the page migration
> - *
> - * @from:		The list of pages to be migrated.
> - * @get_new_page:	The function used to allocate free pages to be used
> - *			as the target of the page migration.
> - * @put_new_page:	The function used to free target pages if migration
> - *			fails, or NULL if no special handling is necessary.
> - * @private:		Private data to be passed on to get_new_page()
> - * @mode:		The migration mode that specifies the constraints for
> - *			page migration, if any.
> - * @reason:		The reason for page migration.
> - * @ret_succeeded:	Set to the number of normal pages migrated successfully if
> - *			the caller passes a non-NULL pointer.
> - *
> - * The function returns after 10 attempts or if no pages are movable any more
> - * because the list has become empty or no retryable pages exist any more.
> - * It is caller's responsibility to call putback_movable_pages() to return pages
> - * to the LRU or free list only if ret != 0.
> - *
> - * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
> - * an error code. The number of THP splits will be considered as the number of
> - * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
> - */
> -int migrate_pages(struct list_head *from, new_page_t get_new_page,
> +static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  		free_page_t put_new_page, unsigned long private,
>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>  {
> @@ -1709,6 +1684,72 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	return rc;
>  }
>
> +/*
> + * migrate_pages - migrate the pages specified in a list, to the free pages
> + *		   supplied as the target for the page migration
> + *
> + * @from:		The list of pages to be migrated.
> + * @get_new_page:	The function used to allocate free pages to be used
> + *			as the target of the page migration.
> + * @put_new_page:	The function used to free target pages if migration
> + *			fails, or NULL if no special handling is necessary.
> + * @private:		Private data to be passed on to get_new_page()
> + * @mode:		The migration mode that specifies the constraints for
> + *			page migration, if any.
> + * @reason:		The reason for page migration.
> + * @ret_succeeded:	Set to the number of normal pages migrated successfully if
> + *			the caller passes a non-NULL pointer.
> + *
> + * The function returns after 10 attempts or if no pages are movable any more
> + * because the list has become empty or no retryable pages exist any more.
> + * It is caller's responsibility to call putback_movable_pages() to return pages
> + * to the LRU or free list only if ret != 0.
> + *
> + * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
> + * an error code. The number of THP splits will be considered as the number of
> + * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
> + */
> +int migrate_pages(struct list_head *from, new_page_t get_new_page,
> +		free_page_t put_new_page, unsigned long private,
> +		enum migrate_mode mode, int reason, unsigned int *pret_succeeded)
> +{
> +	int rc, rc_gether = 0;
> +	int ret_succeeded, ret_succeeded_gether = 0;
> +	int nr_pages;
> +	struct page *page;
> +	LIST_HEAD(pagelist);
> +	LIST_HEAD(ret_pages);
> +
> +again:
> +	nr_pages = 0;
> +	list_for_each_entry(page, from, lru) {
> +		nr_pages += compound_nr(page);
> +		if (nr_pages > HPAGE_PMD_NR)

It would be better to define a new macro like NR_MAX_BATCHED_MIGRATION as
HPAGE_PMD_NR. It makes the code easier to understand and change.
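
For example, something like the following sketch (the name and the exact
placement in mm/migrate.c are just a suggestion, not tested):

	#define NR_MAX_BATCHED_MIGRATION	HPAGE_PMD_NR

	...
		nr_pages += compound_nr(page);
		if (nr_pages > NR_MAX_BATCHED_MIGRATION)
			break;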

> +			break;
> +	}
> +	if (nr_pages > HPAGE_PMD_NR)
> +		list_cut_before(&pagelist, from, &page->lru);
> +	else
> +		list_splice_init(from, &pagelist);
> +	rc = migrate_pages_batch(&pagelist, get_new_page, put_new_page, private,
> +				 mode, reason, &ret_succeeded);
> +	ret_succeeded_gether += ret_succeeded;
> +	list_splice_tail_init(&pagelist, &ret_pages);
> +	if (rc == -ENOMEM) {
> +		rc_gether = rc;
> +		goto out;
> +	}
> +	rc_gether += rc;
> +	if (!list_empty(from))
> +		goto again;
> +out:
> +	if (pret_succeeded)
> +		*pret_succeeded = ret_succeeded_gether;
> +	list_splice(&ret_pages, from);
> +
> +	return rc_gether;
> +}
> +
>  struct page *alloc_migration_target(struct page *page, unsigned long private)
>  {
>  	struct folio *folio = page_folio(page);
> -- 
> 2.35.1


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch
  2022-09-21 16:10   ` Zi Yan
@ 2022-09-21 16:15     ` Zi Yan
  2022-09-22  1:15     ` Huang, Ying
  1 sibling, 0 replies; 50+ messages in thread
From: Zi Yan @ 2022-09-21 16:15 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 6608 bytes --]

On 21 Sep 2022, at 12:10, Zi Yan wrote:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> This is a preparation patch to batch the page unmapping and moving
>> for the normal pages and THP.
>>
>> If we had batched the page unmapping, all pages to be migrated would
>> be unmapped before copying the contents and flags of the pages.  If
>> the number of pages that were passed to migrate_pages() was too large,
>> too many pages would be unmapped.  Then, the execution of their
>> processes would be stopped for too long time.  For example,
>> migrate_pages() syscall will call migrate_pages() with all pages of a
>> process.  To avoid this possible issue, in this patch, we restrict the
>> number of pages to be migrated to be no more than HPAGE_PMD_NR.  That
>> is, the influence is at the same level of THP migration.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> ---
>>  mm/migrate.c | 93 +++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 67 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 4a81e0bfdbcd..1077af858e36 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1439,32 +1439,7 @@ static inline int try_split_thp(struct page *page, struct list_head *split_pages
>>  	return rc;
>>  }
>>
>> -/*
>> - * migrate_pages - migrate the pages specified in a list, to the free pages
>> - *		   supplied as the target for the page migration
>> - *
>> - * @from:		The list of pages to be migrated.
>> - * @get_new_page:	The function used to allocate free pages to be used
>> - *			as the target of the page migration.
>> - * @put_new_page:	The function used to free target pages if migration
>> - *			fails, or NULL if no special handling is necessary.
>> - * @private:		Private data to be passed on to get_new_page()
>> - * @mode:		The migration mode that specifies the constraints for
>> - *			page migration, if any.
>> - * @reason:		The reason for page migration.
>> - * @ret_succeeded:	Set to the number of normal pages migrated successfully if
>> - *			the caller passes a non-NULL pointer.
>> - *
>> - * The function returns after 10 attempts or if no pages are movable any more
>> - * because the list has become empty or no retryable pages exist any more.
>> - * It is caller's responsibility to call putback_movable_pages() to return pages
>> - * to the LRU or free list only if ret != 0.
>> - *
>> - * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
>> - * an error code. The number of THP splits will be considered as the number of
>> - * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
>> - */
>> -int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> +static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  		free_page_t put_new_page, unsigned long private,
>>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)

We are not batching hugetlb page migration, right? migrate_pages_batch() should
not include the hugetlb page migration code. migrate_pages() should look like:

migrate_pages()
{
        migrate hugetlb pages if they exist;
        migrate_pages_batch();
}
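
That is, roughly like the sketch below (the migrate_hugetlbs() helper is
hypothetical and the failure accounting is omitted here, so this is only
an illustration of the suggested structure):

int migrate_pages(struct list_head *from, new_page_t get_new_page,
		  free_page_t put_new_page, unsigned long private,
		  enum migrate_mode mode, int reason,
		  unsigned int *ret_succeeded)
{
	int rc;

	/* Pass 1: migrate hugetlb pages one by one, as today. */
	rc = migrate_hugetlbs(from, get_new_page, put_new_page,
			      private, mode, reason);
	if (rc == -ENOMEM)
		return rc;

	/* Pass 2: batched unmap/move for the normal pages and THPs. */
	return migrate_pages_batch(from, get_new_page, put_new_page,
				   private, mode, reason, ret_succeeded);
}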

>>  {
>> @@ -1709,6 +1684,72 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  	return rc;
>>  }
>>
>> +/*
>> + * migrate_pages - migrate the pages specified in a list, to the free pages
>> + *		   supplied as the target for the page migration
>> + *
>> + * @from:		The list of pages to be migrated.
>> + * @get_new_page:	The function used to allocate free pages to be used
>> + *			as the target of the page migration.
>> + * @put_new_page:	The function used to free target pages if migration
>> + *			fails, or NULL if no special handling is necessary.
>> + * @private:		Private data to be passed on to get_new_page()
>> + * @mode:		The migration mode that specifies the constraints for
>> + *			page migration, if any.
>> + * @reason:		The reason for page migration.
>> + * @ret_succeeded:	Set to the number of normal pages migrated successfully if
>> + *			the caller passes a non-NULL pointer.
>> + *
>> + * The function returns after 10 attempts or if no pages are movable any more
>> + * because the list has become empty or no retryable pages exist any more.
>> + * It is caller's responsibility to call putback_movable_pages() to return pages
>> + * to the LRU or free list only if ret != 0.
>> + *
>> + * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
>> + * an error code. The number of THP splits will be considered as the number of
>> + * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
>> + */
>> +int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> +		free_page_t put_new_page, unsigned long private,
>> +		enum migrate_mode mode, int reason, unsigned int *pret_succeeded)
>> +{
>> +	int rc, rc_gether = 0;
>> +	int ret_succeeded, ret_succeeded_gether = 0;
>> +	int nr_pages;
>> +	struct page *page;
>> +	LIST_HEAD(pagelist);
>> +	LIST_HEAD(ret_pages);
>> +
>> +again:
>> +	nr_pages = 0;
>> +	list_for_each_entry(page, from, lru) {
>> +		nr_pages += compound_nr(page);
>> +		if (nr_pages > HPAGE_PMD_NR)
>
> It would be better to define a new macro like NR_MAX_BATCHED_MIGRATION as
> HPAGE_PMD_NR. It makes the code easier to understand and change.
>
>> +			break;
>> +	}
>> +	if (nr_pages > HPAGE_PMD_NR)
>> +		list_cut_before(&pagelist, from, &page->lru);
>> +	else
>> +		list_splice_init(from, &pagelist);
>> +	rc = migrate_pages_batch(&pagelist, get_new_page, put_new_page, private,
>> +				 mode, reason, &ret_succeeded);
>> +	ret_succeeded_gether += ret_succeeded;
>> +	list_splice_tail_init(&pagelist, &ret_pages);
>> +	if (rc == -ENOMEM) {
>> +		rc_gether = rc;
>> +		goto out;
>> +	}
>> +	rc_gether += rc;
>> +	if (!list_empty(from))
>> +		goto again;
>> +out:
>> +	if (pret_succeeded)
>> +		*pret_succeeded = ret_succeeded_gether;
>> +	list_splice(&ret_pages, from);
>> +
>> +	return rc_gether;
>> +}
>> +
>>  struct page *alloc_migration_target(struct page *page, unsigned long private)
>>  {
>>  	struct folio *folio = page_folio(page);
>> -- 
>> 2.35.1
>
>
> --
> Best Regards,
> Yan, Zi


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration
  2022-09-21 15:55   ` Zi Yan
@ 2022-09-22  1:14     ` Huang, Ying
  0 siblings, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-22  1:14 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

Hi, Zi,

Thank you for the comments!

Zi Yan <ziy@nvidia.com> writes:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> This is a preparation patch to batch the page unmapping and moving for
>> the normal pages and THPs.  Based on that we can batch the TLB
>> shootdown during the page migration and make it possible to use some
>> hardware accelerator for the page copying.
>>
>> In this patch the huge page (PageHuge()) and normal page and THP
>> migration is separated in migrate_pages() to make it easy to change
>> the normal page and THP migration implementation.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> ---
>>  mm/migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 64 insertions(+), 9 deletions(-)
>
> Maybe it would be better to have two subroutines for hugetlb migration
> and normal page migration respectively. migrate_pages() becomes very
> large at this point.

Yes.  migrate_pages() becomes even larger with this patchset.  I will
think more about how to deal with that, and I will also try the method
from your comments on [3/6].

Best Regards,
Huang, Ying

>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 571d8c9fd5bc..117134f1c6dc 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1414,6 +1414,66 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>
>>  	trace_mm_migrate_pages_start(mode, reason);
>>
>> +	for (pass = 0; pass < 10 && retry; pass++) {
>> +		retry = 0;
>> +
>> +		list_for_each_entry_safe(page, page2, from, lru) {
>> +			nr_subpages = compound_nr(page);
>> +			cond_resched();
>> +
>> +			if (!PageHuge(page))
>> +				continue;
>> +
>> +			rc = unmap_and_move_huge_page(get_new_page,
>> +						put_new_page, private, page,
>> +						pass > 2, mode, reason,
>> +						&ret_pages);
>> +			/*
>> +			 * The rules are:
>> +			 *	Success: hugetlb page will be put back
>> +			 *	-EAGAIN: stay on the from list
>> +			 *	-ENOMEM: stay on the from list
>> +			 *	-ENOSYS: stay on the from list
>> +			 *	Other errno: put on ret_pages list then splice to
>> +			 *		     from list
>> +			 */
>> +			switch(rc) {
>> +			case -ENOSYS:
>> +				/* Hugetlb migration is unsupported */
>> +				nr_failed++;
>> +				nr_failed_pages += nr_subpages;
>> +				list_move_tail(&page->lru, &ret_pages);
>> +				break;
>> +			case -ENOMEM:
>> +				/*
>> +				 * When memory is low, don't bother to try to migrate
>> +				 * other pages, just exit.
>> +				 */
>> +				nr_failed++;
>> +				nr_failed_pages += nr_subpages + nr_retry_pages;
>> +				goto out;
>> +			case -EAGAIN:
>> +				retry++;
>> +				nr_retry_pages += nr_subpages;
>> +				break;
>> +			case MIGRATEPAGE_SUCCESS:
>> +				nr_succeeded += nr_subpages;
>> +				break;
>> +			default:
>> +				/*
>> +				 * Permanent failure (-EBUSY, etc.):
>> +				 * unlike -EAGAIN case, the failed page is
>> +				 * removed from migration page list and not
>> +				 * retried in the next outer loop.
>> +				 */
>> +				nr_failed++;
>> +				nr_failed_pages += nr_subpages;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	nr_failed += retry;
>> +	retry = 1;
>>  thp_subpage_migration:
>>  	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
>>  		retry = 0;
>> @@ -1431,18 +1491,14 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  			cond_resched();
>>
>>  			if (PageHuge(page))
>> -				rc = unmap_and_move_huge_page(get_new_page,
>> -						put_new_page, private, page,
>> -						pass > 2, mode, reason,
>> -						&ret_pages);
>> -			else
>> -				rc = unmap_and_move(get_new_page, put_new_page,
>> +				continue;
>> +
>> +			rc = unmap_and_move(get_new_page, put_new_page,
>>  						private, page, pass > 2, mode,
>>  						reason, &ret_pages);
>>  			/*
>>  			 * The rules are:
>> -			 *	Success: non hugetlb page will be freed, hugetlb
>> -			 *		 page will be put back
>> +			 *	Success: page will be freed
>>  			 *	-EAGAIN: stay on the from list
>>  			 *	-ENOMEM: stay on the from list
>>  			 *	-ENOSYS: stay on the from list
>> @@ -1468,7 +1524,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  						nr_thp_split++;
>>  						break;
>>  					}
>> -				/* Hugetlb migration is unsupported */
>>  				} else if (!no_subpage_counting) {
>>  					nr_failed++;
>>  				}
>> -- 
>> 2.35.1
>
>
> --
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-21 16:08   ` Zi Yan
@ 2022-09-22  1:15     ` Huang, Ying
  0 siblings, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-22  1:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

Zi Yan <ziy@nvidia.com> writes:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> This is a preparation patch to batch the page unmapping and moving
>> for the normal pages and THP.
>>
>> In this patch, unmap_and_move() is split to migrate_page_unmap() and
>> migrate_page_move().  So, we can batch _unmap() and _move() in
>> different loops later.  To pass some information between unmap and
>> move, the original unused newpage->mapping and newpage->private are
>> used.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> ---
>>  mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 122 insertions(+), 42 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 117134f1c6dc..4a81e0bfdbcd 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>>  	return rc;
>>  }
>>
>> -static int __unmap_and_move(struct page *page, struct page *newpage,
>> +static void __migrate_page_record(struct page *newpage,
>> +				  int page_was_mapped,
>> +				  struct anon_vma *anon_vma)
>> +{
>> +	newpage->mapping = (struct address_space *)anon_vma;
>> +	newpage->private = page_was_mapped;
>> +}
>> +
>> +static void __migrate_page_extract(struct page *newpage,
>> +				   int *page_was_mappedp,
>> +				   struct anon_vma **anon_vmap)
>> +{
>> +	*anon_vmap = (struct anon_vma *)newpage->mapping;
>> +	*page_was_mappedp = newpage->private;
>> +	newpage->mapping = NULL;
>> +	newpage->private = 0;
>> +}
>> +
>> +#define MIGRATEPAGE_UNMAP		1
>
> It is better to move this to migrate.h with MIGRATEPAGE_SUCCESS and
> make them an enum.

Makes sense!  Will do this in the next version.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch
  2022-09-21 16:10   ` Zi Yan
  2022-09-21 16:15     ` Zi Yan
@ 2022-09-22  1:15     ` Huang, Ying
  1 sibling, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-22  1:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

Zi Yan <ziy@nvidia.com> writes:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> This is a preparation patch to batch the page unmapping and moving
>> for the normal pages and THP.
>>
>> If we had batched the page unmapping, all pages to be migrated would
>> be unmapped before copying the contents and flags of the pages.  If
>> the number of pages that were passed to migrate_pages() was too large,
>> too many pages would be unmapped.  Then, the execution of their
>> processes would be stopped for too long time.  For example,
>> migrate_pages() syscall will call migrate_pages() with all pages of a
>> process.  To avoid this possible issue, in this patch, we restrict the
>> number of pages to be migrated to be no more than HPAGE_PMD_NR.  That
>> is, the influence is at the same level of THP migration.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> ---
>>  mm/migrate.c | 93 +++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 67 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 4a81e0bfdbcd..1077af858e36 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1439,32 +1439,7 @@ static inline int try_split_thp(struct page *page, struct list_head *split_pages
>>  	return rc;
>>  }
>>
>> -/*
>> - * migrate_pages - migrate the pages specified in a list, to the free pages
>> - *		   supplied as the target for the page migration
>> - *
>> - * @from:		The list of pages to be migrated.
>> - * @get_new_page:	The function used to allocate free pages to be used
>> - *			as the target of the page migration.
>> - * @put_new_page:	The function used to free target pages if migration
>> - *			fails, or NULL if no special handling is necessary.
>> - * @private:		Private data to be passed on to get_new_page()
>> - * @mode:		The migration mode that specifies the constraints for
>> - *			page migration, if any.
>> - * @reason:		The reason for page migration.
>> - * @ret_succeeded:	Set to the number of normal pages migrated successfully if
>> - *			the caller passes a non-NULL pointer.
>> - *
>> - * The function returns after 10 attempts or if no pages are movable any more
>> - * because the list has become empty or no retryable pages exist any more.
>> - * It is caller's responsibility to call putback_movable_pages() to return pages
>> - * to the LRU or free list only if ret != 0.
>> - *
>> - * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
>> - * an error code. The number of THP splits will be considered as the number of
>> - * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
>> - */
>> -int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> +static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  		free_page_t put_new_page, unsigned long private,
>>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>>  {
>> @@ -1709,6 +1684,72 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  	return rc;
>>  }
>>
>> +/*
>> + * migrate_pages - migrate the pages specified in a list, to the free pages
>> + *		   supplied as the target for the page migration
>> + *
>> + * @from:		The list of pages to be migrated.
>> + * @get_new_page:	The function used to allocate free pages to be used
>> + *			as the target of the page migration.
>> + * @put_new_page:	The function used to free target pages if migration
>> + *			fails, or NULL if no special handling is necessary.
>> + * @private:		Private data to be passed on to get_new_page()
>> + * @mode:		The migration mode that specifies the constraints for
>> + *			page migration, if any.
>> + * @reason:		The reason for page migration.
>> + * @ret_succeeded:	Set to the number of normal pages migrated successfully if
>> + *			the caller passes a non-NULL pointer.
>> + *
>> + * The function returns after 10 attempts or if no pages are movable any more
>> + * because the list has become empty or no retryable pages exist any more.
>> + * It is caller's responsibility to call putback_movable_pages() to return pages
>> + * to the LRU or free list only if ret != 0.
>> + *
>> + * Returns the number of {normal page, THP, hugetlb} that were not migrated, or
>> + * an error code. The number of THP splits will be considered as the number of
>> + * non-migrated THP, no matter how many subpages of the THP are migrated successfully.
>> + */
>> +int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> +		free_page_t put_new_page, unsigned long private,
>> +		enum migrate_mode mode, int reason, unsigned int *pret_succeeded)
>> +{
>> +	int rc, rc_gether = 0;
>> +	int ret_succeeded, ret_succeeded_gether = 0;
>> +	int nr_pages;
>> +	struct page *page;
>> +	LIST_HEAD(pagelist);
>> +	LIST_HEAD(ret_pages);
>> +
>> +again:
>> +	nr_pages = 0;
>> +	list_for_each_entry(page, from, lru) {
>> +		nr_pages += compound_nr(page);
>> +		if (nr_pages > HPAGE_PMD_NR)
>
> It would be better to define a new macro like NR_MAX_BATCHED_MIGRATION as
> HPAGE_PMD_NR. It makes the code easier to understand and change.

OK.  Will do that.

Best Regards,
Huang, Ying

>> +			break;
>> +	}
>> +	if (nr_pages > HPAGE_PMD_NR)
>> +		list_cut_before(&pagelist, from, &page->lru);
>> +	else
>> +		list_splice_init(from, &pagelist);
>> +	rc = migrate_pages_batch(&pagelist, get_new_page, put_new_page, private,
>> +				 mode, reason, &ret_succeeded);
>> +	ret_succeeded_gether += ret_succeeded;
>> +	list_splice_tail_init(&pagelist, &ret_pages);
>> +	if (rc == -ENOMEM) {
>> +		rc_gether = rc;
>> +		goto out;
>> +	}
>> +	rc_gether += rc;
>> +	if (!list_empty(from))
>> +		goto again;
>> +out:
>> +	if (pret_succeeded)
>> +		*pret_succeeded = ret_succeeded_gether;
>> +	list_splice(&ret_pages, from);
>> +
>> +	return rc_gether;
>> +}
>> +
>>  struct page *alloc_migration_target(struct page *page, unsigned long private)
>>  {
>>  	struct folio *folio = page_folio(page);
>> -- 
>> 2.35.1
>
>
> --
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-21 15:47 ` [RFC 0/6] migrate_pages(): batch TLB flushing Zi Yan
@ 2022-09-22  1:45   ` Huang, Ying
  2022-09-22  3:47   ` haoxin
  1 sibling, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-22  1:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

Zi Yan <ziy@nvidia.com> writes:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>   for each page
>>     unmap
>>     flush TLB
>>     copy
>>     restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>   for each page
>>     unmap
>>   for each page
>>     flush TLB
>>   for each page
>>     copy
>>   for each page
>>     restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>  - Run pmbench memory accessing benchmark
>>
>>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>>    node 1 back and forth.
>>
>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> the number of pages migrated successfully per second increases 291.7%.
>
> Thank you for the patchset. Batching page migration will definitely
> improve its throughput from my past experiments[1] and starting with
> TLB flushing is a good first step.

Thanks for the pointer; the patch description already provides valuable
information for me!

> BTW, what is the rationale behind the increased page migration
> success rate per second?

From perf profiling data, in the base kernel,

  migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages:	2.87
  ptep_clear_flush.try_to_migrate_one.rmap_walk_anon.try_to_migrate.__unmap_and_move:           2.39

Because pmbench runs in the system too, migrate_pages() accounts for
about 2.87% of the CPU cycles, while TLB flushing accounts for 2.39%.
That is, 2.39/2.87 = 83.3% of the CPU cycles of migrate_pages() are
spent on TLB flushing.

After batching the TLB flushing, the perf profiling data becomes,

  migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages:	2.77
  move_to_new_folio.migrate_pages_batch.migrate_pages.migrate_to_node.do_migrate_pages:         1.68
  copy_page.folio_copy.migrate_folio.move_to_new_folio.migrate_pages_batch:                     1.21

Now 1.21/2.77 = 43.7% of the CPU cycles of migrate_pages() are spent on
page copying.

  try_to_migrate_one:	0.23

The CPU cycles for unmapping and TLB flushing become 0.23/2.77 = 8.3% of
migrate_pages().

All in all, after the optimization we do much less TLB flushing, which
consumed a lot of CPU cycles before the optimization.  So the throughput
of migrate_pages() increases greatly.

I will add these data to the next version of the patchset.

Best Regards,
Huang, Ying

>>
>> This patchset is based on v6.0-rc5 and the following patchset,
>>
>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>>
>> The migrate_pages() related code is converting to folio now. So this
>> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
>> is used to check the basic idea.  If it is OK, I will rebase the
>> patchset on top of folio changes.
>>
>> Best Regards,
>> Huang, Ying
>
>
> [1] https://lwn.net/Articles/784925/
>
> --
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-21 15:47 ` [RFC 0/6] migrate_pages(): batch TLB flushing Zi Yan
  2022-09-22  1:45   ` Huang, Ying
@ 2022-09-22  3:47   ` haoxin
  2022-09-22  4:36     ` Huang, Ying
  1 sibling, 1 reply; 50+ messages in thread
From: haoxin @ 2022-09-22  3:47 UTC (permalink / raw)
  To: Zi Yan, Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

Hi Huang,

This is an exciting change, but on ARM64 machines TLB flushing is not
done through IPIs; it relies on the 'vale1is' instruction. So I'm
wondering whether there is also a benefit on arm64, and I'm going to
test it on an ARM64 machine.


On 2022/9/21 11:47 PM, Zi Yan wrote:
> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>    for each page
>>      unmap
>>      flush TLB
>>      copy
>>      restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>    for each page
>>      unmap
>>    for each page
>>      flush TLB
>>    for each page
>>      copy
>>    for each page
>>      restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>   - Run pmbench memory accessing benchmark
>>
>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>     node 1 back and forth.
>>
>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> the number of pages migrated successfully per second increases 291.7%.
> Thank you for the patchset. Batching page migration will definitely
> improve its throughput from my past experiments[1] and starting with
> TLB flushing is a good first step.
>
> BTW, what is the rationale behind the increased page migration
> success rate per second?
>
>> This patchset is based on v6.0-rc5 and the following patchset,
>>
>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>>
>> The migrate_pages() related code is converting to folio now. So this
>> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
>> is used to check the basic idea.  If it is OK, I will rebase the
>> patchset on top of folio changes.
>>
>> Best Regards,
>> Huang, Ying
>
> [1] https://lwn.net/Articles/784925/
>
> --
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-22  3:47   ` haoxin
@ 2022-09-22  4:36     ` Huang, Ying
  0 siblings, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-22  4:36 UTC (permalink / raw)
  To: haoxin
  Cc: Zi Yan, linux-mm, linux-kernel, Andrew Morton, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

haoxin <xhao@linux.alibaba.com> writes:

> Hi Huang,
>
> This is an exciting change, but on ARM64 machines TLB flushing is not
> done through IPIs; it relies on the 'vale1is' instruction. So I'm
> wondering whether there is also a benefit on arm64, and I'm going to
> test it on an ARM64 machine.

We have no arm64 machine to test on, and I know very little about arm64.
Thanks for the information and for testing.

Best Regards,
Huang, Ying

>
> ( 2022/9/21 \vH11:47, Zi Yan S:
>> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>>
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>> follows,
>>>
>>>    for each page
>>>      unmap
>>>      flush TLB
>>>      copy
>>>      restore map
>>>
>>> If multiple pages are passed to migrate_pages(), there are
>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>> change the code to something as follows,
>>>
>>>    for each page
>>>      unmap
>>>    for each page
>>>      flush TLB
>>>    for each page
>>>      copy
>>>    for each page
>>>      restore map
>>>
>>> The total number of TLB flushing IPI can be reduced considerably.  And
>>> we may use some hardware accelerator such as DSA to accelerate the
>>> page copying.
>>>
>>> So in this patch, we refactor the migrate_pages() implementation and
>>> implement the TLB flushing batching.  Base on this, hardware
>>> accelerated page copying can be implemented.
>>>
>>> If too many pages are passed to migrate_pages(), in the naive batched
>>> implementation, we may unmap too many pages at the same time.  The
>>> possibility for a task to wait for the migrated pages to be mapped
>>> again increases.  So the latency may be hurt.  To deal with this
>>> issue, the max number of pages be unmapped in batch is restricted to
>>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>> level of THP migration.
>>>
>>> We use the following test to measure the performance impact of the
>>> patchset,
>>>
>>> On a 2-socket Intel server,
>>>
>>>   - Run pmbench memory accessing benchmark
>>>
>>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>     node 1 back and forth.
>>>
>>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>>> the number of pages migrated successfully per second increases 291.7%.
>> Thank you for the patchset. Batching page migration will definitely
>> improve its throughput from my past experiments[1] and starting with
>> TLB flushing is a good first step.
>>
>> BTW, what is the rationale behind the increased page migration
>> success rate per second?
>>
>>> This patchset is based on v6.0-rc5 and the following patchset,
>>>
>>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>>>
>>> The migrate_pages() related code is converting to folio now. So this
>>> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
>>> is used to check the basic idea.  If it is OK, I will rebase the
>>> patchset on top of folio changes.
>>>
>>> Best Regards,
>>> Huang, Ying
>>
>> [1] https://lwn.net/Articles/784925/
>>
>> --
>> Best Regards,
>> Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration
  2022-09-21  6:06 ` [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration Huang Ying
  2022-09-21 15:55   ` Zi Yan
@ 2022-09-22  6:03   ` Baolin Wang
  2022-09-22  6:22     ` Huang, Ying
  1 sibling, 1 reply; 50+ messages in thread
From: Baolin Wang @ 2022-09-22  6:03 UTC (permalink / raw)
  To: Huang Ying, linux-mm
  Cc: linux-kernel, Andrew Morton, Zi Yan, Yang Shi, Oscar Salvador,
	Matthew Wilcox



On 9/21/2022 2:06 PM, Huang Ying wrote:
> This is a preparation patch to batch the page unmapping and moving for
> the normal pages and THPs.  Based on that we can batch the TLB
> shootdown during the page migration and make it possible to use some
> hardware accelerator for the page copying.
> 
> In this patch the huge page (PageHuge()) and normal page and THP
> migration is separated in migrate_pages() to make it easy to change
> the normal page and THP migration implementation.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>   mm/migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-------
>   1 file changed, 64 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 571d8c9fd5bc..117134f1c6dc 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1414,6 +1414,66 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   
>   	trace_mm_migrate_pages_start(mode, reason);
>   
> +	for (pass = 0; pass < 10 && retry; pass++) {
> +		retry = 0;
> +
> +		list_for_each_entry_safe(page, page2, from, lru) {
> +			nr_subpages = compound_nr(page);
> +			cond_resched();
> +
> +			if (!PageHuge(page))
> +				continue;
> +
> +			rc = unmap_and_move_huge_page(get_new_page,
> +						put_new_page, private, page,
> +						pass > 2, mode, reason,
> +						&ret_pages);
> +			/*
> +			 * The rules are:
> +			 *	Success: hugetlb page will be put back
> +			 *	-EAGAIN: stay on the from list
> +			 *	-ENOMEM: stay on the from list
> +			 *	-ENOSYS: stay on the from list
> +			 *	Other errno: put on ret_pages list then splice to
> +			 *		     from list
> +			 */
> +			switch(rc) {
> +			case -ENOSYS:
> +				/* Hugetlb migration is unsupported */
> +				nr_failed++;
> +				nr_failed_pages += nr_subpages;
> +				list_move_tail(&page->lru, &ret_pages);
> +				break;
> +			case -ENOMEM:
> +				/*
> +				 * When memory is low, don't bother to try to migrate
> +				 * other pages, just exit.
> +				 */
> +				nr_failed++;
> +				nr_failed_pages += nr_subpages + nr_retry_pages;
> +				goto out;
> +			case -EAGAIN:
> +				retry++;
> +				nr_retry_pages += nr_subpages;
> +				break;
> +			case MIGRATEPAGE_SUCCESS:
> +				nr_succeeded += nr_subpages;
> +				break;
> +			default:
> +				/*
> +				 * Permanent failure (-EBUSY, etc.):
> +				 * unlike -EAGAIN case, the failed page is
> +				 * removed from migration page list and not
> +				 * retried in the next outer loop.
> +				 */
> +				nr_failed++;
> +				nr_failed_pages += nr_subpages;
> +				break;
> +			}
> +		}
> +	}
> +	nr_failed += retry;

It seems we should also record nr_retry_pages here, since the second
loop will reset nr_retry_pages:

nr_failed_pages += nr_retry_pages;
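
For example, something like this in the hunk above (just a sketch, the
exact placement is up to you):

	nr_failed += retry;
	nr_failed_pages += nr_retry_pages;
	retry = 1;
 thp_subpage_migration: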

Besides, I also agree with Zi Yan's comment about simplifying this large
function.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration
  2022-09-22  6:03   ` Baolin Wang
@ 2022-09-22  6:22     ` Huang, Ying
  0 siblings, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-22  6:22 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Oscar Salvador, Matthew Wilcox

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 9/21/2022 2:06 PM, Huang Ying wrote:
>> This is a preparation patch to batch the page unmapping and moving for
>> the normal pages and THPs.  Based on that we can batch the TLB
>> shootdown during the page migration and make it possible to use some
>> hardware accelerator for the page copying.
>> In this patch the huge page (PageHuge()) and normal page and THP
>> migration is separated in migrate_pages() to make it easy to change
>> the normal page and THP migration implementation.
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> ---
>>   mm/migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 64 insertions(+), 9 deletions(-)
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 571d8c9fd5bc..117134f1c6dc 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1414,6 +1414,66 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>     	trace_mm_migrate_pages_start(mode, reason);
>>   +	for (pass = 0; pass < 10 && retry; pass++) {
>> +		retry = 0;
>> +
>> +		list_for_each_entry_safe(page, page2, from, lru) {
>> +			nr_subpages = compound_nr(page);
>> +			cond_resched();
>> +
>> +			if (!PageHuge(page))
>> +				continue;
>> +
>> +			rc = unmap_and_move_huge_page(get_new_page,
>> +						put_new_page, private, page,
>> +						pass > 2, mode, reason,
>> +						&ret_pages);
>> +			/*
>> +			 * The rules are:
>> +			 *	Success: hugetlb page will be put back
>> +			 *	-EAGAIN: stay on the from list
>> +			 *	-ENOMEM: stay on the from list
>> +			 *	-ENOSYS: stay on the from list
>> +			 *	Other errno: put on ret_pages list then splice to
>> +			 *		     from list
>> +			 */
>> +			switch(rc) {
>> +			case -ENOSYS:
>> +				/* Hugetlb migration is unsupported */
>> +				nr_failed++;
>> +				nr_failed_pages += nr_subpages;
>> +				list_move_tail(&page->lru, &ret_pages);
>> +				break;
>> +			case -ENOMEM:
>> +				/*
>> +				 * When memory is low, don't bother to try to migrate
>> +				 * other pages, just exit.
>> +				 */
>> +				nr_failed++;
>> +				nr_failed_pages += nr_subpages + nr_retry_pages;
>> +				goto out;
>> +			case -EAGAIN:
>> +				retry++;
>> +				nr_retry_pages += nr_subpages;
>> +				break;
>> +			case MIGRATEPAGE_SUCCESS:
>> +				nr_succeeded += nr_subpages;
>> +				break;
>> +			default:
>> +				/*
>> +				 * Permanent failure (-EBUSY, etc.):
>> +				 * unlike -EAGAIN case, the failed page is
>> +				 * removed from migration page list and not
>> +				 * retried in the next outer loop.
>> +				 */
>> +				nr_failed++;
>> +				nr_failed_pages += nr_subpages;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	nr_failed += retry;
>
> It seems we should also record nr_retry_pages here, since the second
> loop will reset nr_retry_pages:
>
> nr_failed_pages += nr_retry_pages;

Good catch!  Will do that in the next version.

> Besides, I also agree with Zi Yan's comment about simplifying this large
> function.

Yes.  I think so too.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-21  6:06 ` [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
  2022-09-21 16:08   ` Zi Yan
@ 2022-09-22  6:36   ` Baolin Wang
  2022-09-26  9:28   ` Alistair Popple
  2 siblings, 0 replies; 50+ messages in thread
From: Baolin Wang @ 2022-09-22  6:36 UTC (permalink / raw)
  To: Huang Ying, linux-mm
  Cc: linux-kernel, Andrew Morton, Zi Yan, Yang Shi, Oscar Salvador,
	Matthew Wilcox



On 9/21/2022 2:06 PM, Huang Ying wrote:
> This is a preparation patch to batch the page unmapping and moving
> for the normal pages and THP.
> 
> In this patch, unmap_and_move() is split to migrate_page_unmap() and
> migrate_page_move().  So, we can batch _unmap() and _move() in
> different loops later.  To pass some information between unmap and
> move, the original unused newpage->mapping and newpage->private are
> used.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---

Looks good to me.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (6 preceding siblings ...)
  2022-09-21 15:47 ` [RFC 0/6] migrate_pages(): batch TLB flushing Zi Yan
@ 2022-09-22 12:50 ` Bharata B Rao
  2022-09-23  7:52   ` Huang, Ying
  2022-09-26  9:11 ` Alistair Popple
  2022-09-27 11:21 ` haoxin
  9 siblings, 1 reply; 50+ messages in thread
From: Bharata B Rao @ 2022-09-22 12:50 UTC (permalink / raw)
  To: Huang Ying, linux-mm
  Cc: linux-kernel, Andrew Morton, Zi Yan, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox

On 9/21/2022 11:36 AM, Huang Ying wrote:
> From: "Huang, Ying" <ying.huang@intel.com>
> 
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
> 
>   for each page
>     unmap
>     flush TLB
>     copy
>     restore map
> 
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
> 
>   for each page
>     unmap
>   for each page
>     flush TLB
>   for each page
>     copy
>   for each page
>     restore map
> 
> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
> 
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
> 
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.

Thanks for the patchset. I find it hitting the following BUG() when
running mmtests/autonumabench:

kernel BUG at mm/migrate.c:2432!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 7150 Comm: numa01 Not tainted 6.0.0-rc5+ #171
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.5.6 10/06/2021
RIP: 0010:migrate_misplaced_page+0x670/0x830 
Code: 36 48 8b 3c c5 e0 7a 19 8d e8 dc 10 f7 ff 4c 89 e7 e8 f4 43 f5 ff 8b 55 bc 85 d2 75 6f 48 8b 45 c0 4c 39 e8 0f 84 b0 fb ff ff <0f> 0b 48 8b 7d 90 e9 ec fc ff ff 48 83 e8 01 e9 48 fa ff ff 48 83
RSP: 0000:ffffb1b29ec3fd38 EFLAGS: 00010202
RAX: ffffe946460f8248 RBX: 0000000000000001 RCX: ffffe946460f8248
RDX: 0000000000000000 RSI: ffffe946460f8248 RDI: ffffb1b29ec3fce0
RBP: ffffb1b29ec3fda8 R08: 0000000000000000 R09: 0000000000000005
R10: 0000000000000001 R11: 0000000000000000 R12: ffffe946460f8240
R13: ffffb1b29ec3fd68 R14: 0000000000000001 R15: ffff9698beed5000
FS:  00007fcc31fee640(0000) GS:ffff9697b0000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc3a3a5000 CR3: 000000016e89c002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 __handle_mm_fault+0xb87/0xff0
 handle_mm_fault+0x126/0x3c0
 do_user_addr_fault+0x1ed/0x690
 exc_page_fault+0x84/0x2c0
 asm_exc_page_fault+0x27/0x30 
RIP: 0033:0x7fccfa1a1180
Code: 81 fa 80 00 00 00 76 d2 c5 fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa 00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 <c5> fd 7f 02 c5 fd 7f 42 20 c5 fd 7f 42 40 c5 fd 7f 42 60 48 83 ea
RSP: 002b:00007fcc31fede38 EFLAGS: 00010283
RAX: 00007fcc39fff010 RBX: 000000000000002c RCX: 00007fccfa11ea3d
RDX: 00007fcc3a3a5000 RSI: 0000000000000000 RDI: 00007fccf9ffef90
RBP: 00007fcc39fff010 R08: 00007fcc31fee640 R09: 00007fcc31fee640
R10: 00007ffdecef614f R11: 0000000000000246 R12: 00000000c0000000
R13: 0000000000000000 R14: 00007fccfa094850 R15: 00007ffdecef6190

This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-22 12:50 ` Bharata B Rao
@ 2022-09-23  7:52   ` Huang, Ying
  2022-09-27 10:46     ` Bharata B Rao
  0 siblings, 1 reply; 50+ messages in thread
From: Huang, Ying @ 2022-09-23  7:52 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

Bharata B Rao <bharata@amd.com> writes:

> On 9/21/2022 11:36 AM, Huang Ying wrote:
>> From: "Huang, Ying" <ying.huang@intel.com>
>> 
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>> 
>>   for each page
>>     unmap
>>     flush TLB
>>     copy
>>     restore map
>> 
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>> 
>>   for each page
>>     unmap
>>   for each page
>>     flush TLB
>>   for each page
>>     copy
>>   for each page
>>     restore map
>> 
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>> 
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>> 
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>
> Thanks for the patchset. I find it hitting the following BUG() when
> running mmtests/autonumabench:
>
> kernel BUG at mm/migrate.c:2432!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 7 PID: 7150 Comm: numa01 Not tainted 6.0.0-rc5+ #171
> Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.5.6 10/06/2021
> RIP: 0010:migrate_misplaced_page+0x670/0x830 
> Code: 36 48 8b 3c c5 e0 7a 19 8d e8 dc 10 f7 ff 4c 89 e7 e8 f4 43 f5 ff 8b 55 bc 85 d2 75 6f 48 8b 45 c0 4c 39 e8 0f 84 b0 fb ff ff <0f> 0b 48 8b 7d 90 e9 ec fc ff ff 48 83 e8 01 e9 48 fa ff ff 48 83
> RSP: 0000:ffffb1b29ec3fd38 EFLAGS: 00010202
> RAX: ffffe946460f8248 RBX: 0000000000000001 RCX: ffffe946460f8248
> RDX: 0000000000000000 RSI: ffffe946460f8248 RDI: ffffb1b29ec3fce0
> RBP: ffffb1b29ec3fda8 R08: 0000000000000000 R09: 0000000000000005
> R10: 0000000000000001 R11: 0000000000000000 R12: ffffe946460f8240
> R13: ffffb1b29ec3fd68 R14: 0000000000000001 R15: ffff9698beed5000
> FS:  00007fcc31fee640(0000) GS:ffff9697b0000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcc3a3a5000 CR3: 000000016e89c002 CR4: 0000000000770ee0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  __handle_mm_fault+0xb87/0xff0
>  handle_mm_fault+0x126/0x3c0
>  do_user_addr_fault+0x1ed/0x690
>  exc_page_fault+0x84/0x2c0
>  asm_exc_page_fault+0x27/0x30 
> RIP: 0033:0x7fccfa1a1180
> Code: 81 fa 80 00 00 00 76 d2 c5 fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa 00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 <c5> fd 7f 02 c5 fd 7f 42 20 c5 fd 7f 42 40 c5 fd 7f 42 60 48 83 ea
> RSP: 002b:00007fcc31fede38 EFLAGS: 00010283
> RAX: 00007fcc39fff010 RBX: 000000000000002c RCX: 00007fccfa11ea3d
> RDX: 00007fcc3a3a5000 RSI: 0000000000000000 RDI: 00007fccf9ffef90
> RBP: 00007fcc39fff010 R08: 00007fcc31fee640 R09: 00007fcc31fee640
> R10: 00007ffdecef614f R11: 0000000000000246 R12: 00000000c0000000
> R13: 0000000000000000 R14: 00007fccfa094850 R15: 00007ffdecef6190
>
> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().

Thank you very much for reporting!  I haven't reproduced this yet.  But
I will pay special attention to this when developing the next version,
even if I cannot reproduce it in the end.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (7 preceding siblings ...)
  2022-09-22 12:50 ` Bharata B Rao
@ 2022-09-26  9:11 ` Alistair Popple
  2022-09-27 11:21 ` haoxin
  9 siblings, 0 replies; 50+ messages in thread
From: Alistair Popple @ 2022-09-26  9:11 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox


Huang Ying <ying.huang@intel.com> writes:

> From: "Huang, Ying" <ying.huang@intel.com>
>
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
>
>   for each page
>     unmap
>     flush TLB
>     copy
>     restore map
>
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
>
>   for each page
>     unmap
>   for each page
>     flush TLB
>   for each page
>     copy
>   for each page
>     restore map

We use a very similar sequence for the migrate_vma_*() set of calls. It
would be good if we could one day consolidate the two. I believe the
biggest hindrance to that is migrate_vma_*() operates on arrays of pfns
rather than a list of pages. The reason for that is it needs to migrate
non-lru pages and hence can't use page->lru to create a list of pages to
migrate.

So from my perspective I think this direction is good as it would help
with that. One thing to watch out for is deadlocking if locking multiple
pages though.

> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
>
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
>  - Run pmbench memory accessing benchmark
>
>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>    node 1 back and forth.
>
> With the patch, the TLB flushing IPI reduces 99.1% during the test and
> the number of pages migrated successfully per second increases 291.7%.
>
> This patchset is based on v6.0-rc5 and the following patchset,
>
> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>
> The migrate_pages() related code is converting to folio now. So this
> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
> is used to check the basic idea.  If it is OK, I will rebase the
> patchset on top of folio changes.
>
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-21  6:06 ` [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
  2022-09-21 16:08   ` Zi Yan
  2022-09-22  6:36   ` Baolin Wang
@ 2022-09-26  9:28   ` Alistair Popple
  2022-09-26 18:06     ` Yang Shi
  2 siblings, 1 reply; 50+ messages in thread
From: Alistair Popple @ 2022-09-26  9:28 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox


Huang Ying <ying.huang@intel.com> writes:

> This is a preparation patch to batch the page unmapping and moving
> for the normal pages and THP.
>
> In this patch, unmap_and_move() is split to migrate_page_unmap() and
> migrate_page_move().  So, we can batch _unmap() and _move() in
> different loops later.  To pass some information between unmap and
> move, the original unused newpage->mapping and newpage->private are
> used.

This looks like it could cause a deadlock between two threads migrating
the same pages if force == true && mode != MIGRATE_ASYNC as
migrate_page_unmap() will call lock_page() while holding the lock on
other pages in the list. Therefore the two threads could deadlock if the
pages are in a different order.
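
For illustration (hypothetical callers, not code from this patch), the
ordering hazard I mean looks like this: each thread holds a page lock
from its own list and then sleeps in lock_page() on a page the other
thread already holds.

	#include <linux/pagemap.h>	/* lock_page()/unlock_page() */

	/* Thread A's list order: { p1, p2 }.  Thread B's list order: { p2, p1 }. */
	static void unmap_batch_a(struct page *p1, struct page *p2)
	{
		lock_page(p1);		/* A now holds p1 */
		lock_page(p2);		/* sleeps: B holds p2 */
		/* ... unmap both, batch the TLB flush ... */
		unlock_page(p2);
		unlock_page(p1);
	}

	static void unmap_batch_b(struct page *p1, struct page *p2)
	{
		lock_page(p2);		/* B now holds p2 */
		lock_page(p1);		/* sleeps: A holds p1 -> ABBA deadlock */
		/* ... */
		unlock_page(p1);
		unlock_page(p2);
	}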

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 122 insertions(+), 42 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 117134f1c6dc..4a81e0bfdbcd 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  	return rc;
>  }
>
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> +static void __migrate_page_record(struct page *newpage,
> +				  int page_was_mapped,
> +				  struct anon_vma *anon_vma)
> +{
> +	newpage->mapping = (struct address_space *)anon_vma;
> +	newpage->private = page_was_mapped;
> +}
> +
> +static void __migrate_page_extract(struct page *newpage,
> +				   int *page_was_mappedp,
> +				   struct anon_vma **anon_vmap)
> +{
> +	*anon_vmap = (struct anon_vma *)newpage->mapping;
> +	*page_was_mappedp = newpage->private;
> +	newpage->mapping = NULL;
> +	newpage->private = 0;
> +}
> +
> +#define MIGRATEPAGE_UNMAP		1
> +
> +static int __migrate_page_unmap(struct page *page, struct page *newpage,
>  				int force, enum migrate_mode mode)
>  {
>  	struct folio *folio = page_folio(page);
> -	struct folio *dst = page_folio(newpage);
>  	int rc = -EAGAIN;
> -	bool page_was_mapped = false;
> +	int page_was_mapped = 0;
>  	struct anon_vma *anon_vma = NULL;
>  	bool is_lru = !__PageMovable(page);
>
> @@ -1058,8 +1077,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		goto out_unlock;
>
>  	if (unlikely(!is_lru)) {
> -		rc = move_to_new_folio(dst, folio, mode);
> -		goto out_unlock_both;
> +		__migrate_page_record(newpage, page_was_mapped, anon_vma);
> +		return MIGRATEPAGE_UNMAP;
>  	}
>
>  	/*
> @@ -1085,11 +1104,41 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
>  				page);
>  		try_to_migrate(folio, 0);
> -		page_was_mapped = true;
> +		page_was_mapped = 1;
> +	}
> +
> +	if (!page_mapped(page)) {
> +		__migrate_page_record(newpage, page_was_mapped, anon_vma);
> +		return MIGRATEPAGE_UNMAP;
>  	}
>
> -	if (!page_mapped(page))
> -		rc = move_to_new_folio(dst, folio, mode);
> +	if (page_was_mapped)
> +		remove_migration_ptes(folio, folio, false);
> +
> +out_unlock_both:
> +	unlock_page(newpage);
> +out_unlock:
> +	/* Drop an anon_vma reference if we took one */
> +	if (anon_vma)
> +		put_anon_vma(anon_vma);
> +	unlock_page(page);
> +out:
> +
> +	return rc;
> +}
> +
> +static int __migrate_page_move(struct page *page, struct page *newpage,
> +			       enum migrate_mode mode)
> +{
> +	struct folio *folio = page_folio(page);
> +	struct folio *dst = page_folio(newpage);
> +	int rc;
> +	int page_was_mapped = 0;
> +	struct anon_vma *anon_vma = NULL;
> +
> +	__migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
> +
> +	rc = move_to_new_folio(dst, folio, mode);
>
>  	/*
>  	 * When successful, push newpage to LRU immediately: so that if it
> @@ -1110,14 +1159,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		remove_migration_ptes(folio,
>  			rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
>
> -out_unlock_both:
>  	unlock_page(newpage);
> -out_unlock:
>  	/* Drop an anon_vma reference if we took one */
>  	if (anon_vma)
>  		put_anon_vma(anon_vma);
>  	unlock_page(page);
> -out:
>  	/*
>  	 * If migration is successful, decrease refcount of the newpage,
>  	 * which will not free the page because new page owner increased
> @@ -1129,18 +1175,31 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	return rc;
>  }
>
> -/*
> - * Obtain the lock on page, remove all ptes and migrate the page
> - * to the newly allocated page in newpage.
> - */
> -static int unmap_and_move(new_page_t get_new_page,
> -				   free_page_t put_new_page,
> -				   unsigned long private, struct page *page,
> -				   int force, enum migrate_mode mode,
> -				   enum migrate_reason reason,
> -				   struct list_head *ret)
> +static void migrate_page_done(struct page *page,
> +			      enum migrate_reason reason)
> +{
> +	/*
> +	 * Compaction can migrate also non-LRU pages which are
> +	 * not accounted to NR_ISOLATED_*. They can be recognized
> +	 * as __PageMovable
> +	 */
> +	if (likely(!__PageMovable(page)))
> +		mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> +				    page_is_file_lru(page), -thp_nr_pages(page));
> +
> +	if (reason != MR_MEMORY_FAILURE)
> +		/* We release the page in page_handle_poison. */
> +		put_page(page);
> +}
> +
> +/* Obtain the lock on page, remove all ptes. */
> +static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
> +			      unsigned long private, struct page *page,
> +			      struct page **newpagep, int force,
> +			      enum migrate_mode mode, enum migrate_reason reason,
> +			      struct list_head *ret)
>  {
> -	int rc = MIGRATEPAGE_SUCCESS;
> +	int rc = MIGRATEPAGE_UNMAP;
>  	struct page *newpage = NULL;
>
>  	if (!thp_migration_supported() && PageTransHuge(page))
> @@ -1151,19 +1210,48 @@ static int unmap_and_move(new_page_t get_new_page,
>  		ClearPageActive(page);
>  		ClearPageUnevictable(page);
>  		/* free_pages_prepare() will clear PG_isolated. */
> -		goto out;
> +		list_del(&page->lru);
> +		migrate_page_done(page, reason);
> +		return MIGRATEPAGE_SUCCESS;
>  	}
>
>  	newpage = get_new_page(page, private);
>  	if (!newpage)
>  		return -ENOMEM;
> +	*newpagep = newpage;
>
> -	newpage->private = 0;
> -	rc = __unmap_and_move(page, newpage, force, mode);
> +	rc = __migrate_page_unmap(page, newpage, force, mode);
> +	if (rc == MIGRATEPAGE_UNMAP)
> +		return rc;
> +
> +	/*
> +	 * A page that has not been migrated will have kept its
> +	 * references and be restored.
> +	 */
> +	/* restore the page to right list. */
> +	if (rc != -EAGAIN)
> +		list_move_tail(&page->lru, ret);
> +
> +	if (put_new_page)
> +		put_new_page(newpage, private);
> +	else
> +		put_page(newpage);
> +
> +	return rc;
> +}
> +
> +/* Migrate the page to the newly allocated page in newpage. */
> +static int migrate_page_move(free_page_t put_new_page, unsigned long private,
> +			     struct page *page, struct page *newpage,
> +			     enum migrate_mode mode, enum migrate_reason reason,
> +			     struct list_head *ret)
> +{
> +	int rc;
> +
> +	rc = __migrate_page_move(page, newpage, mode);
>  	if (rc == MIGRATEPAGE_SUCCESS)
>  		set_page_owner_migrate_reason(newpage, reason);
>
> -out:
>  	if (rc != -EAGAIN) {
>  		/*
>  		 * A page that has been migrated has all references
> @@ -1179,20 +1267,7 @@ static int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		/*
> -		 * Compaction can migrate also non-LRU pages which are
> -		 * not accounted to NR_ISOLATED_*. They can be recognized
> -		 * as __PageMovable
> -		 */
> -		if (likely(!__PageMovable(page)))
> -			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> -					page_is_file_lru(page), -thp_nr_pages(page));
> -
> -		if (reason != MR_MEMORY_FAILURE)
> -			/*
> -			 * We release the page in page_handle_poison.
> -			 */
> -			put_page(page);
> +		migrate_page_done(page, reason);
>  	} else {
>  		if (rc != -EAGAIN)
>  			list_add_tail(&page->lru, ret);
> @@ -1405,6 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	int pass = 0;
>  	bool is_thp = false;
>  	struct page *page;
> +	struct page *newpage = NULL;
>  	struct page *page2;
>  	int rc, nr_subpages;
>  	LIST_HEAD(ret_pages);
> @@ -1493,9 +1569,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  			if (PageHuge(page))
>  				continue;
>
> -			rc = unmap_and_move(get_new_page, put_new_page,
> -						private, page, pass > 2, mode,
> +			rc = migrate_page_unmap(get_new_page, put_new_page, private,
> +						page, &newpage, pass > 2, mode,
>  						reason, &ret_pages);
> +			if (rc == MIGRATEPAGE_UNMAP)
> +				rc = migrate_page_move(put_new_page, private,
> +						       page, newpage, mode,
> +						       reason, &ret_pages);
>  			/*
>  			 * The rules are:
>  			 *	Success: page will be freed

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-26  9:28   ` Alistair Popple
@ 2022-09-26 18:06     ` Yang Shi
  2022-09-27  0:02       ` Alistair Popple
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Shi @ 2022-09-26 18:06 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Huang Ying, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

On Mon, Sep 26, 2022 at 2:37 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Huang Ying <ying.huang@intel.com> writes:
>
> > This is a preparation patch to batch the page unmapping and moving
> > for the normal pages and THP.
> >
> > In this patch, unmap_and_move() is split to migrate_page_unmap() and
> > migrate_page_move().  So, we can batch _unmap() and _move() in
> > different loops later.  To pass some information between unmap and
> > move, the original unused newpage->mapping and newpage->private are
> > used.
>
> This looks like it could cause a deadlock between two threads migrating
> the same pages if force == true && mode != MIGRATE_ASYNC as
> migrate_page_unmap() will call lock_page() while holding the lock on
> other pages in the list. Therefore the two threads could deadlock if the
> pages are in a different order.

It seems unlikely to me since the page has to be isolated from the LRU
before migration. The LRU isolation is atomic, so the two threads are
unlikely to see the same pages on both lists.

But there might be other cases which may incur deadlock, for example,
filesystem writeback IIUC. Some filesystems may lock a bunch of pages
and then write them back in a batch. The same pages may be on the
migration list while they are also dirty and seen by writeback. I'm not
sure whether I am missing something that could prevent such a deadlock
from happening.
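
A hypothetical sketch of the pattern I mean (the writeback helper is made
up, this is not taken from any particular filesystem): writeback gathers
and locks a run of dirty pages before submitting them, so it can sleep on
a page lock while already holding others, just like the batched unmap.

	#include <linux/pagemap.h>

	void submit_batched_writeback(struct page **pages, int nr);	/* hypothetical */

	static void writeback_batch(struct page **pages, int nr)
	{
		int i;

		/* may sleep on each lock while holding the earlier ones */
		for (i = 0; i < nr; i++)
			lock_page(pages[i]);

		submit_batched_writeback(pages, nr);

		for (i = 0; i < nr; i++)
			unlock_page(pages[i]);
	}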

>
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Yang Shi <shy828301@gmail.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > ---
> >  mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 122 insertions(+), 42 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 117134f1c6dc..4a81e0bfdbcd 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
> >       return rc;
> >  }
> >
> > -static int __unmap_and_move(struct page *page, struct page *newpage,
> > +static void __migrate_page_record(struct page *newpage,
> > +                               int page_was_mapped,
> > +                               struct anon_vma *anon_vma)
> > +{
> > +     newpage->mapping = (struct address_space *)anon_vma;
> > +     newpage->private = page_was_mapped;
> > +}
> > +
> > +static void __migrate_page_extract(struct page *newpage,
> > +                                int *page_was_mappedp,
> > +                                struct anon_vma **anon_vmap)
> > +{
> > +     *anon_vmap = (struct anon_vma *)newpage->mapping;
> > +     *page_was_mappedp = newpage->private;
> > +     newpage->mapping = NULL;
> > +     newpage->private = 0;
> > +}
> > +
> > +#define MIGRATEPAGE_UNMAP            1
> > +
> > +static int __migrate_page_unmap(struct page *page, struct page *newpage,
> >                               int force, enum migrate_mode mode)
> >  {
> >       struct folio *folio = page_folio(page);
> > -     struct folio *dst = page_folio(newpage);
> >       int rc = -EAGAIN;
> > -     bool page_was_mapped = false;
> > +     int page_was_mapped = 0;
> >       struct anon_vma *anon_vma = NULL;
> >       bool is_lru = !__PageMovable(page);
> >
> > @@ -1058,8 +1077,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >               goto out_unlock;
> >
> >       if (unlikely(!is_lru)) {
> > -             rc = move_to_new_folio(dst, folio, mode);
> > -             goto out_unlock_both;
> > +             __migrate_page_record(newpage, page_was_mapped, anon_vma);
> > +             return MIGRATEPAGE_UNMAP;
> >       }
> >
> >       /*
> > @@ -1085,11 +1104,41 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >               VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
> >                               page);
> >               try_to_migrate(folio, 0);
> > -             page_was_mapped = true;
> > +             page_was_mapped = 1;
> > +     }
> > +
> > +     if (!page_mapped(page)) {
> > +             __migrate_page_record(newpage, page_was_mapped, anon_vma);
> > +             return MIGRATEPAGE_UNMAP;
> >       }
> >
> > -     if (!page_mapped(page))
> > -             rc = move_to_new_folio(dst, folio, mode);
> > +     if (page_was_mapped)
> > +             remove_migration_ptes(folio, folio, false);
> > +
> > +out_unlock_both:
> > +     unlock_page(newpage);
> > +out_unlock:
> > +     /* Drop an anon_vma reference if we took one */
> > +     if (anon_vma)
> > +             put_anon_vma(anon_vma);
> > +     unlock_page(page);
> > +out:
> > +
> > +     return rc;
> > +}
> > +
> > +static int __migrate_page_move(struct page *page, struct page *newpage,
> > +                            enum migrate_mode mode)
> > +{
> > +     struct folio *folio = page_folio(page);
> > +     struct folio *dst = page_folio(newpage);
> > +     int rc;
> > +     int page_was_mapped = 0;
> > +     struct anon_vma *anon_vma = NULL;
> > +
> > +     __migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
> > +
> > +     rc = move_to_new_folio(dst, folio, mode);
> >
> >       /*
> >        * When successful, push newpage to LRU immediately: so that if it
> > @@ -1110,14 +1159,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >               remove_migration_ptes(folio,
> >                       rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
> >
> > -out_unlock_both:
> >       unlock_page(newpage);
> > -out_unlock:
> >       /* Drop an anon_vma reference if we took one */
> >       if (anon_vma)
> >               put_anon_vma(anon_vma);
> >       unlock_page(page);
> > -out:
> >       /*
> >        * If migration is successful, decrease refcount of the newpage,
> >        * which will not free the page because new page owner increased
> > @@ -1129,18 +1175,31 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >       return rc;
> >  }
> >
> > -/*
> > - * Obtain the lock on page, remove all ptes and migrate the page
> > - * to the newly allocated page in newpage.
> > - */
> > -static int unmap_and_move(new_page_t get_new_page,
> > -                                free_page_t put_new_page,
> > -                                unsigned long private, struct page *page,
> > -                                int force, enum migrate_mode mode,
> > -                                enum migrate_reason reason,
> > -                                struct list_head *ret)
> > +static void migrate_page_done(struct page *page,
> > +                           enum migrate_reason reason)
> > +{
> > +     /*
> > +      * Compaction can migrate also non-LRU pages which are
> > +      * not accounted to NR_ISOLATED_*. They can be recognized
> > +      * as __PageMovable
> > +      */
> > +     if (likely(!__PageMovable(page)))
> > +             mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> > +                                 page_is_file_lru(page), -thp_nr_pages(page));
> > +
> > +     if (reason != MR_MEMORY_FAILURE)
> > +             /* We release the page in page_handle_poison. */
> > +             put_page(page);
> > +}
> > +
> > +/* Obtain the lock on page, remove all ptes. */
> > +static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
> > +                           unsigned long private, struct page *page,
> > +                           struct page **newpagep, int force,
> > +                           enum migrate_mode mode, enum migrate_reason reason,
> > +                           struct list_head *ret)
> >  {
> > -     int rc = MIGRATEPAGE_SUCCESS;
> > +     int rc = MIGRATEPAGE_UNMAP;
> >       struct page *newpage = NULL;
> >
> >       if (!thp_migration_supported() && PageTransHuge(page))
> > @@ -1151,19 +1210,48 @@ static int unmap_and_move(new_page_t get_new_page,
> >               ClearPageActive(page);
> >               ClearPageUnevictable(page);
> >               /* free_pages_prepare() will clear PG_isolated. */
> > -             goto out;
> > +             list_del(&page->lru);
> > +             migrate_page_done(page, reason);
> > +             return MIGRATEPAGE_SUCCESS;
> >       }
> >
> >       newpage = get_new_page(page, private);
> >       if (!newpage)
> >               return -ENOMEM;
> > +     *newpagep = newpage;
> >
> > -     newpage->private = 0;
> > -     rc = __unmap_and_move(page, newpage, force, mode);
> > +     rc = __migrate_page_unmap(page, newpage, force, mode);
> > +     if (rc == MIGRATEPAGE_UNMAP)
> > +             return rc;
> > +
> > +     /*
> > +      * A page that has not been migrated will have kept its
> > +      * references and be restored.
> > +      */
> > +     /* restore the page to right list. */
> > +     if (rc != -EAGAIN)
> > +             list_move_tail(&page->lru, ret);
> > +
> > +     if (put_new_page)
> > +             put_new_page(newpage, private);
> > +     else
> > +             put_page(newpage);
> > +
> > +     return rc;
> > +}
> > +
> > +/* Migrate the page to the newly allocated page in newpage. */
> > +static int migrate_page_move(free_page_t put_new_page, unsigned long private,
> > +                          struct page *page, struct page *newpage,
> > +                          enum migrate_mode mode, enum migrate_reason reason,
> > +                          struct list_head *ret)
> > +{
> > +     int rc;
> > +
> > +     rc = __migrate_page_move(page, newpage, mode);
> >       if (rc == MIGRATEPAGE_SUCCESS)
> >               set_page_owner_migrate_reason(newpage, reason);
> >
> > -out:
> >       if (rc != -EAGAIN) {
> >               /*
> >                * A page that has been migrated has all references
> > @@ -1179,20 +1267,7 @@ static int unmap_and_move(new_page_t get_new_page,
> >        * we want to retry.
> >        */
> >       if (rc == MIGRATEPAGE_SUCCESS) {
> > -             /*
> > -              * Compaction can migrate also non-LRU pages which are
> > -              * not accounted to NR_ISOLATED_*. They can be recognized
> > -              * as __PageMovable
> > -              */
> > -             if (likely(!__PageMovable(page)))
> > -                     mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> > -                                     page_is_file_lru(page), -thp_nr_pages(page));
> > -
> > -             if (reason != MR_MEMORY_FAILURE)
> > -                     /*
> > -                      * We release the page in page_handle_poison.
> > -                      */
> > -                     put_page(page);
> > +             migrate_page_done(page, reason);
> >       } else {
> >               if (rc != -EAGAIN)
> >                       list_add_tail(&page->lru, ret);
> > @@ -1405,6 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >       int pass = 0;
> >       bool is_thp = false;
> >       struct page *page;
> > +     struct page *newpage = NULL;
> >       struct page *page2;
> >       int rc, nr_subpages;
> >       LIST_HEAD(ret_pages);
> > @@ -1493,9 +1569,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >                       if (PageHuge(page))
> >                               continue;
> >
> > -                     rc = unmap_and_move(get_new_page, put_new_page,
> > -                                             private, page, pass > 2, mode,
> > +                     rc = migrate_page_unmap(get_new_page, put_new_page, private,
> > +                                             page, &newpage, pass > 2, mode,
> >                                               reason, &ret_pages);
> > +                     if (rc == MIGRATEPAGE_UNMAP)
> > +                             rc = migrate_page_move(put_new_page, private,
> > +                                                    page, newpage, mode,
> > +                                                    reason, &ret_pages);
> >                       /*
> >                        * The rules are:
> >                        *      Success: page will be freed

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-26 18:06     ` Yang Shi
@ 2022-09-27  0:02       ` Alistair Popple
  2022-09-27  1:51         ` Huang, Ying
  2022-09-27 20:54         ` Yang Shi
  0 siblings, 2 replies; 50+ messages in thread
From: Alistair Popple @ 2022-09-27  0:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: Huang Ying, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox


Yang Shi <shy828301@gmail.com> writes:

> On Mon, Sep 26, 2022 at 2:37 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Huang Ying <ying.huang@intel.com> writes:
>>
>> > This is a preparation patch to batch the page unmapping and moving
>> > for the normal pages and THP.
>> >
>> > In this patch, unmap_and_move() is split to migrate_page_unmap() and
>> > migrate_page_move().  So, we can batch _unmap() and _move() in
>> > different loops later.  To pass some information between unmap and
>> > move, the original unused newpage->mapping and newpage->private are
>> > used.
>>
>> This looks like it could cause a deadlock between two threads migrating
>> the same pages if force == true && mode != MIGRATE_ASYNC as
>> migrate_page_unmap() will call lock_page() while holding the lock on
>> other pages in the list. Therefore the two threads could deadlock if the
>> pages are in a different order.
>
> It seems unlikely to me since the page has to be isolated from lru
> before migration. The isolating from lru is atomic, so the two threads
> unlikely see the same pages on both lists.

Oh thanks! That is a good point, and I agree that since LRU isolation is
atomic the two threads won't see the same pages. migrate_vma_setup()
does LRU isolation after locking the page, which is why the potential
exists there. We could potentially switch that around, but given that
ZONE_DEVICE pages aren't on an LRU it wouldn't help much.

> But there might be other cases which may incur deadlock, for example,
> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
> then write them back in a batch. The same pages may be on the
> migration list and they are also dirty and seen by writeback. I'm not
> sure whether I miss something that could prevent such a deadlock from
> happening.

I'm not overly familiar with that area but I would assume any filesystem
code doing this would already have to deal with deadlock potential.

>>
>> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> > Cc: Zi Yan <ziy@nvidia.com>
>> > Cc: Yang Shi <shy828301@gmail.com>
>> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> > Cc: Oscar Salvador <osalvador@suse.de>
>> > Cc: Matthew Wilcox <willy@infradead.org>
>> > ---
>> >  mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
>> >  1 file changed, 122 insertions(+), 42 deletions(-)
>> >
>> > diff --git a/mm/migrate.c b/mm/migrate.c
>> > index 117134f1c6dc..4a81e0bfdbcd 100644
>> > --- a/mm/migrate.c
>> > +++ b/mm/migrate.c
>> > @@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>> >       return rc;
>> >  }
>> >
>> > -static int __unmap_and_move(struct page *page, struct page *newpage,
>> > +static void __migrate_page_record(struct page *newpage,
>> > +                               int page_was_mapped,
>> > +                               struct anon_vma *anon_vma)
>> > +{
>> > +     newpage->mapping = (struct address_space *)anon_vma;
>> > +     newpage->private = page_was_mapped;
>> > +}
>> > +
>> > +static void __migrate_page_extract(struct page *newpage,
>> > +                                int *page_was_mappedp,
>> > +                                struct anon_vma **anon_vmap)
>> > +{
>> > +     *anon_vmap = (struct anon_vma *)newpage->mapping;
>> > +     *page_was_mappedp = newpage->private;
>> > +     newpage->mapping = NULL;
>> > +     newpage->private = 0;
>> > +}
>> > +
>> > +#define MIGRATEPAGE_UNMAP            1
>> > +
>> > +static int __migrate_page_unmap(struct page *page, struct page *newpage,
>> >                               int force, enum migrate_mode mode)
>> >  {
>> >       struct folio *folio = page_folio(page);
>> > -     struct folio *dst = page_folio(newpage);
>> >       int rc = -EAGAIN;
>> > -     bool page_was_mapped = false;
>> > +     int page_was_mapped = 0;
>> >       struct anon_vma *anon_vma = NULL;
>> >       bool is_lru = !__PageMovable(page);
>> >
>> > @@ -1058,8 +1077,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> >               goto out_unlock;
>> >
>> >       if (unlikely(!is_lru)) {
>> > -             rc = move_to_new_folio(dst, folio, mode);
>> > -             goto out_unlock_both;
>> > +             __migrate_page_record(newpage, page_was_mapped, anon_vma);
>> > +             return MIGRATEPAGE_UNMAP;
>> >       }
>> >
>> >       /*
>> > @@ -1085,11 +1104,41 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> >               VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
>> >                               page);
>> >               try_to_migrate(folio, 0);
>> > -             page_was_mapped = true;
>> > +             page_was_mapped = 1;
>> > +     }
>> > +
>> > +     if (!page_mapped(page)) {
>> > +             __migrate_page_record(newpage, page_was_mapped, anon_vma);
>> > +             return MIGRATEPAGE_UNMAP;
>> >       }
>> >
>> > -     if (!page_mapped(page))
>> > -             rc = move_to_new_folio(dst, folio, mode);
>> > +     if (page_was_mapped)
>> > +             remove_migration_ptes(folio, folio, false);
>> > +
>> > +out_unlock_both:
>> > +     unlock_page(newpage);
>> > +out_unlock:
>> > +     /* Drop an anon_vma reference if we took one */
>> > +     if (anon_vma)
>> > +             put_anon_vma(anon_vma);
>> > +     unlock_page(page);
>> > +out:
>> > +
>> > +     return rc;
>> > +}
>> > +
>> > +static int __migrate_page_move(struct page *page, struct page *newpage,
>> > +                            enum migrate_mode mode)
>> > +{
>> > +     struct folio *folio = page_folio(page);
>> > +     struct folio *dst = page_folio(newpage);
>> > +     int rc;
>> > +     int page_was_mapped = 0;
>> > +     struct anon_vma *anon_vma = NULL;
>> > +
>> > +     __migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
>> > +
>> > +     rc = move_to_new_folio(dst, folio, mode);
>> >
>> >       /*
>> >        * When successful, push newpage to LRU immediately: so that if it
>> > @@ -1110,14 +1159,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> >               remove_migration_ptes(folio,
>> >                       rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
>> >
>> > -out_unlock_both:
>> >       unlock_page(newpage);
>> > -out_unlock:
>> >       /* Drop an anon_vma reference if we took one */
>> >       if (anon_vma)
>> >               put_anon_vma(anon_vma);
>> >       unlock_page(page);
>> > -out:
>> >       /*
>> >        * If migration is successful, decrease refcount of the newpage,
>> >        * which will not free the page because new page owner increased
>> > @@ -1129,18 +1175,31 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> >       return rc;
>> >  }
>> >
>> > -/*
>> > - * Obtain the lock on page, remove all ptes and migrate the page
>> > - * to the newly allocated page in newpage.
>> > - */
>> > -static int unmap_and_move(new_page_t get_new_page,
>> > -                                free_page_t put_new_page,
>> > -                                unsigned long private, struct page *page,
>> > -                                int force, enum migrate_mode mode,
>> > -                                enum migrate_reason reason,
>> > -                                struct list_head *ret)
>> > +static void migrate_page_done(struct page *page,
>> > +                           enum migrate_reason reason)
>> > +{
>> > +     /*
>> > +      * Compaction can migrate also non-LRU pages which are
>> > +      * not accounted to NR_ISOLATED_*. They can be recognized
>> > +      * as __PageMovable
>> > +      */
>> > +     if (likely(!__PageMovable(page)))
>> > +             mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
>> > +                                 page_is_file_lru(page), -thp_nr_pages(page));
>> > +
>> > +     if (reason != MR_MEMORY_FAILURE)
>> > +             /* We release the page in page_handle_poison. */
>> > +             put_page(page);
>> > +}
>> > +
>> > +/* Obtain the lock on page, remove all ptes. */
>> > +static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
>> > +                           unsigned long private, struct page *page,
>> > +                           struct page **newpagep, int force,
>> > +                           enum migrate_mode mode, enum migrate_reason reason,
>> > +                           struct list_head *ret)
>> >  {
>> > -     int rc = MIGRATEPAGE_SUCCESS;
>> > +     int rc = MIGRATEPAGE_UNMAP;
>> >       struct page *newpage = NULL;
>> >
>> >       if (!thp_migration_supported() && PageTransHuge(page))
>> > @@ -1151,19 +1210,48 @@ static int unmap_and_move(new_page_t get_new_page,
>> >               ClearPageActive(page);
>> >               ClearPageUnevictable(page);
>> >               /* free_pages_prepare() will clear PG_isolated. */
>> > -             goto out;
>> > +             list_del(&page->lru);
>> > +             migrate_page_done(page, reason);
>> > +             return MIGRATEPAGE_SUCCESS;
>> >       }
>> >
>> >       newpage = get_new_page(page, private);
>> >       if (!newpage)
>> >               return -ENOMEM;
>> > +     *newpagep = newpage;
>> >
>> > -     newpage->private = 0;
>> > -     rc = __unmap_and_move(page, newpage, force, mode);
>> > +     rc = __migrate_page_unmap(page, newpage, force, mode);
>> > +     if (rc == MIGRATEPAGE_UNMAP)
>> > +             return rc;
>> > +
>> > +     /*
>> > +      * A page that has not been migrated will have kept its
>> > +      * references and be restored.
>> > +      */
>> > +     /* restore the page to right list. */
>> > +     if (rc != -EAGAIN)
>> > +             list_move_tail(&page->lru, ret);
>> > +
>> > +     if (put_new_page)
>> > +             put_new_page(newpage, private);
>> > +     else
>> > +             put_page(newpage);
>> > +
>> > +     return rc;
>> > +}
>> > +
>> > +/* Migrate the page to the newly allocated page in newpage. */
>> > +static int migrate_page_move(free_page_t put_new_page, unsigned long private,
>> > +                          struct page *page, struct page *newpage,
>> > +                          enum migrate_mode mode, enum migrate_reason reason,
>> > +                          struct list_head *ret)
>> > +{
>> > +     int rc;
>> > +
>> > +     rc = __migrate_page_move(page, newpage, mode);
>> >       if (rc == MIGRATEPAGE_SUCCESS)
>> >               set_page_owner_migrate_reason(newpage, reason);
>> >
>> > -out:
>> >       if (rc != -EAGAIN) {
>> >               /*
>> >                * A page that has been migrated has all references
>> > @@ -1179,20 +1267,7 @@ static int unmap_and_move(new_page_t get_new_page,
>> >        * we want to retry.
>> >        */
>> >       if (rc == MIGRATEPAGE_SUCCESS) {
>> > -             /*
>> > -              * Compaction can migrate also non-LRU pages which are
>> > -              * not accounted to NR_ISOLATED_*. They can be recognized
>> > -              * as __PageMovable
>> > -              */
>> > -             if (likely(!__PageMovable(page)))
>> > -                     mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
>> > -                                     page_is_file_lru(page), -thp_nr_pages(page));
>> > -
>> > -             if (reason != MR_MEMORY_FAILURE)
>> > -                     /*
>> > -                      * We release the page in page_handle_poison.
>> > -                      */
>> > -                     put_page(page);
>> > +             migrate_page_done(page, reason);
>> >       } else {
>> >               if (rc != -EAGAIN)
>> >                       list_add_tail(&page->lru, ret);
>> > @@ -1405,6 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> >       int pass = 0;
>> >       bool is_thp = false;
>> >       struct page *page;
>> > +     struct page *newpage = NULL;
>> >       struct page *page2;
>> >       int rc, nr_subpages;
>> >       LIST_HEAD(ret_pages);
>> > @@ -1493,9 +1569,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> >                       if (PageHuge(page))
>> >                               continue;
>> >
>> > -                     rc = unmap_and_move(get_new_page, put_new_page,
>> > -                                             private, page, pass > 2, mode,
>> > +                     rc = migrate_page_unmap(get_new_page, put_new_page, private,
>> > +                                             page, &newpage, pass > 2, mode,
>> >                                               reason, &ret_pages);
>> > +                     if (rc == MIGRATEPAGE_UNMAP)
>> > +                             rc = migrate_page_move(put_new_page, private,
>> > +                                                    page, newpage, mode,
>> > +                                                    reason, &ret_pages);
>> >                       /*
>> >                        * The rules are:
>> >                        *      Success: page will be freed

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-27  0:02       ` Alistair Popple
@ 2022-09-27  1:51         ` Huang, Ying
  2022-09-27 20:34           ` John Hubbard
  2022-09-27 20:56           ` Yang Shi
  2022-09-27 20:54         ` Yang Shi
  1 sibling, 2 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-27  1:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Yang Shi, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

Alistair Popple <apopple@nvidia.com> writes:

> Yang Shi <shy828301@gmail.com> writes:
>
>> On Mon, Sep 26, 2022 at 2:37 AM Alistair Popple <apopple@nvidia.com> wrote:
>>>
>>>
>>> Huang Ying <ying.huang@intel.com> writes:
>>>
>>> > This is a preparation patch to batch the page unmapping and moving
>>> > for the normal pages and THP.
>>> >
>>> > In this patch, unmap_and_move() is split to migrate_page_unmap() and
>>> > migrate_page_move().  So, we can batch _unmap() and _move() in
>>> > different loops later.  To pass some information between unmap and
>>> > move, the original unused newpage->mapping and newpage->private are
>>> > used.
>>>
>>> This looks like it could cause a deadlock between two threads migrating
>>> the same pages if force == true && mode != MIGRATE_ASYNC as
>>> migrate_page_unmap() will call lock_page() while holding the lock on
>>> other pages in the list. Therefore the two threads could deadlock if the
>>> pages are in a different order.
>>
>> It seems unlikely to me since the page has to be isolated from lru
>> before migration. The isolating from lru is atomic, so the two threads
>> unlikely see the same pages on both lists.
>
> Oh thanks! That is a good point and I agree since lru isolation is
> atomic the two threads won't see the same pages. migrate_vma_setup()
> does LRU isolation after locking the page which is why the potential
> exists there. We could potentially switch that around but given
> ZONE_DEVICE pages aren't on an lru it wouldn't help much.
>
>> But there might be other cases which may incur deadlock, for example,
>> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
>> then write them back in a batch. The same pages may be on the
>> migration list and they are also dirty and seen by writeback. I'm not
>> sure whether I miss something that could prevent such a deadlock from
>> happening.
>
> I'm not overly familiar with that area but I would assume any filesystem
> code doing this would already have to deal with deadlock potential.

Thank you very much for pointing this out.  I think the deadlock is a
real issue.  Anyway, we shouldn't forbid other places in the kernel from
locking 2 pages at the same time.

The simplest solution is to batch page migration only if mode ==
MIGRATE_ASYNC.  Then we may consider falling back to non-batch mode if
mode != MIGRATE_ASYNC and trylock on a page fails.
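
A rough sketch of that fallback, as I picture it inside migrate_pages()
(the helper names migrate_pages_batch() and migrate_pages_sync() are
hypothetical, just to make the idea concrete; this is not the actual
patch):

	if (mode == MIGRATE_ASYNC)
		/*
		 * Page locks are only tried, never slept on, so holding
		 * several page locks at once cannot ABBA-deadlock.
		 */
		rc = migrate_pages_batch(from, get_new_page, put_new_page,
					 private, mode, reason, &ret_pages);
	else
		/* migrate one page at a time, as before */
		rc = migrate_pages_sync(from, get_new_page, put_new_page,
					private, mode, reason, &ret_pages);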

Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-23  7:52   ` Huang, Ying
@ 2022-09-27 10:46     ` Bharata B Rao
  2022-09-28  1:46       ` Huang, Ying
  0 siblings, 1 reply; 50+ messages in thread
From: Bharata B Rao @ 2022-09-27 10:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

On 9/23/2022 1:22 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>>
>> Thanks for the patchset. I find it hitting the following BUG() when
>> running mmtests/autonumabench:
>>
>> kernel BUG at mm/migrate.c:2432!
>>
>> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().
> 
> Thank you very much for reporting!  I haven't reproduced this yet.  But
> I will pay special attention to this when develop the next version, even
> if I cannot reproduce this finally.

The following change fixes the above reported BUG_ON().

diff --git a/mm/migrate.c b/mm/migrate.c
index a0de0d9b4d41..c11dd82245e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1197,7 +1197,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
         * references and be restored.
         */
        /* restore the page to right list. */
-       if (rc != -EAGAIN)
+       if (rc == -EAGAIN)
                 ret = NULL;
 
        migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);

With the original check, pages that returned from the unmapping stage
with -EAGAIN ended up on the "ret" list rather than staying on the
"from" list to be retried.

Regards,
Bharata.

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
                   ` (8 preceding siblings ...)
  2022-09-26  9:11 ` Alistair Popple
@ 2022-09-27 11:21 ` haoxin
  2022-09-28  2:01   ` Huang, Ying
  2022-11-01 14:49   ` Hesham Almatary
  9 siblings, 2 replies; 50+ messages in thread
From: haoxin @ 2022-09-27 11:21 UTC (permalink / raw)
  To: Huang Ying, linux-mm
  Cc: linux-kernel, Andrew Morton, Zi Yan, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox, yangyicong, v-songbaohua,
	21cnbao

Hi, Huang

On 2022/9/21 2:06 PM, Huang Ying wrote:
> From: "Huang, Ying" <ying.huang@intel.com>
>
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
>
>    for each page
>      unmap
>      flush TLB
>      copy
>      restore map
>
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
>
>    for each page
>      unmap
>    for each page
>      flush TLB
>    for each page
>      copy
>    for each page
>      restore map
>
> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
>
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
>   - Run pmbench memory accessing benchmark
>
>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>     node 1 back and forth.
>
As pmbench cannot run on an arm64 machine, I used lmbench instead.
My test case is like this (I am not sure whether it is reasonable, but it seems to work):
./bw_mem -N10000 10000m rd &
time migratepages pid node0 node1

w/o patch    		w/ patch
real	0m0.035s  	real	0m0.024s
user	0m0.000s  	user	0m0.000s
sys	0m0.035s        sys	0m0.024s

The migratepages time is reduced by about 31%.

But there is a problem. I see the batch flush is called via
migrate_pages_batch
	try_to_unmap_flush
		arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batch flush really works

But on arm64, arch_tlbbatch_flush() is not supported, because arm64 does not select CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.

So the TLB batch flush does not actually flush anything; it is an empty function.
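
For reference, this is roughly how the stubs look in mm/internal.h as I
understand it (paraphrased from memory, not from this patchset), which is
why the batching becomes a no-op when the config option is not selected:

	#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
	void try_to_unmap_flush(void);
	void try_to_unmap_flush_dirty(void);
	#else
	static inline void try_to_unmap_flush(void)
	{
	}
	static inline void try_to_unmap_flush_dirty(void)
	{
	}
	#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */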

Maybe this patch can help solve this problem.
https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-27  1:51         ` Huang, Ying
@ 2022-09-27 20:34           ` John Hubbard
  2022-09-27 20:57             ` Yang Shi
  2022-09-27 20:56           ` Yang Shi
  1 sibling, 1 reply; 50+ messages in thread
From: John Hubbard @ 2022-09-27 20:34 UTC (permalink / raw)
  To: Huang, Ying, Alistair Popple
  Cc: Yang Shi, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

On 9/26/22 18:51, Huang, Ying wrote:
>>> But there might be other cases which may incur deadlock, for example,
>>> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
>>> then write them back in a batch. The same pages may be on the
>>> migration list and they are also dirty and seen by writeback. I'm not
>>> sure whether I miss something that could prevent such a deadlock from
>>> happening.
>>
>> I'm not overly familiar with that area but I would assume any filesystem
>> code doing this would already have to deal with deadlock potential.
> 
> Thank you very much for pointing this out.  I think the deadlock is a
> real issue.  Anyway, we shouldn't forbid other places in kernel to lock
> 2 pages at the same time.
> 

I also agree that we cannot make any rules such as "do not lock > 1 page
at the same time, elsewhere in the kernel", because it is already
happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
(15) pages per batch [1].

The only deadlock prevention convention that I see is the convention of
locking the pages in order of ascending address. That only helps if
everything does it that way, and migrate code definitely does not.
However... I thought that, up until now at least, the migrate code relied
on trylock (which can fail, and so migration can fail, too) to avoid
deadlock. Is that changing somehow? I didn't see it.
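
As an illustration of that convention (a hypothetical helper, not
existing kernel code), ordered locking would look something like:

/*
 * Hypothetical illustration: if every path that needs two page locks
 * takes them in ascending address order, an ABBA deadlock between such
 * paths cannot form.
 */
static void lock_two_pages_ordered(struct page *p1, struct page *p2)
{
	if (p1 > p2)
		swap(p1, p2);		/* lock the lower-addressed page first */
	lock_page(p1);
	if (p2 != p1)
		lock_page(p2);
}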


[1] https://elixir.bootlin.com/linux/latest/source/mm/page-writeback.c#L2296

thanks,

-- 
John Hubbard
NVIDIA

> The simplest solution is to batch page migration only if mode ==
> MIGRATE_ASYNC.  Then we may consider to fall back to non-batch mode if
> mode != MIGRATE_ASYNC and trylock page fails.
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-27  0:02       ` Alistair Popple
  2022-09-27  1:51         ` Huang, Ying
@ 2022-09-27 20:54         ` Yang Shi
  1 sibling, 0 replies; 50+ messages in thread
From: Yang Shi @ 2022-09-27 20:54 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Huang Ying, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

On Mon, Sep 26, 2022 at 5:14 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Mon, Sep 26, 2022 at 2:37 AM Alistair Popple <apopple@nvidia.com> wrote:
> >>
> >>
> >> Huang Ying <ying.huang@intel.com> writes:
> >>
> >> > This is a preparation patch to batch the page unmapping and moving
> >> > for the normal pages and THP.
> >> >
> >> > In this patch, unmap_and_move() is split to migrate_page_unmap() and
> >> > migrate_page_move().  So, we can batch _unmap() and _move() in
> >> > different loops later.  To pass some information between unmap and
> >> > move, the original unused newpage->mapping and newpage->private are
> >> > used.
> >>
> >> This looks like it could cause a deadlock between two threads migrating
> >> the same pages if force == true && mode != MIGRATE_ASYNC as
> >> migrate_page_unmap() will call lock_page() while holding the lock on
> >> other pages in the list. Therefore the two threads could deadlock if the
> >> pages are in a different order.
> >
> > It seems unlikely to me since the page has to be isolated from lru
> > before migration. The isolating from lru is atomic, so the two threads
> > unlikely see the same pages on both lists.
>
> Oh thanks! That is a good point and I agree since lru isolation is
> atomic the two threads won't see the same pages. migrate_vma_setup()
> does LRU isolation after locking the page which is why the potential
> exists there. We could potentially switch that around but given
> ZONE_DEVICE pages aren't on an lru it wouldn't help much.

Aha, I see. That path takes the page lock and does the isolation in a different order from regular pages.

>
> > But there might be other cases which may incur deadlock, for example,
> > filesystem writeback IIUC. Some filesystems may lock a bunch of pages
> > then write them back in a batch. The same pages may be on the
> > migration list and they are also dirty and seen by writeback. I'm not
> > sure whether I miss something that could prevent such a deadlock from
> > happening.
>
> I'm not overly familiar with that area but I would assume any filesystem
> code doing this would already have to deal with deadlock potential.

AFAIK, actually not, IIUC. For example, writeback simply looks up pages
in the page cache and locks them one by one.

>
> >>
> >> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> > Cc: Zi Yan <ziy@nvidia.com>
> >> > Cc: Yang Shi <shy828301@gmail.com>
> >> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> > Cc: Oscar Salvador <osalvador@suse.de>
> >> > Cc: Matthew Wilcox <willy@infradead.org>
> >> > ---
> >> >  mm/migrate.c | 164 ++++++++++++++++++++++++++++++++++++++-------------
> >> >  1 file changed, 122 insertions(+), 42 deletions(-)
> >> >
> >> > diff --git a/mm/migrate.c b/mm/migrate.c
> >> > index 117134f1c6dc..4a81e0bfdbcd 100644
> >> > --- a/mm/migrate.c
> >> > +++ b/mm/migrate.c
> >> > @@ -976,13 +976,32 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
> >> >       return rc;
> >> >  }
> >> >
> >> > -static int __unmap_and_move(struct page *page, struct page *newpage,
> >> > +static void __migrate_page_record(struct page *newpage,
> >> > +                               int page_was_mapped,
> >> > +                               struct anon_vma *anon_vma)
> >> > +{
> >> > +     newpage->mapping = (struct address_space *)anon_vma;
> >> > +     newpage->private = page_was_mapped;
> >> > +}
> >> > +
> >> > +static void __migrate_page_extract(struct page *newpage,
> >> > +                                int *page_was_mappedp,
> >> > +                                struct anon_vma **anon_vmap)
> >> > +{
> >> > +     *anon_vmap = (struct anon_vma *)newpage->mapping;
> >> > +     *page_was_mappedp = newpage->private;
> >> > +     newpage->mapping = NULL;
> >> > +     newpage->private = 0;
> >> > +}
> >> > +
> >> > +#define MIGRATEPAGE_UNMAP            1
> >> > +
> >> > +static int __migrate_page_unmap(struct page *page, struct page *newpage,
> >> >                               int force, enum migrate_mode mode)
> >> >  {
> >> >       struct folio *folio = page_folio(page);
> >> > -     struct folio *dst = page_folio(newpage);
> >> >       int rc = -EAGAIN;
> >> > -     bool page_was_mapped = false;
> >> > +     int page_was_mapped = 0;
> >> >       struct anon_vma *anon_vma = NULL;
> >> >       bool is_lru = !__PageMovable(page);
> >> >
> >> > @@ -1058,8 +1077,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >> >               goto out_unlock;
> >> >
> >> >       if (unlikely(!is_lru)) {
> >> > -             rc = move_to_new_folio(dst, folio, mode);
> >> > -             goto out_unlock_both;
> >> > +             __migrate_page_record(newpage, page_was_mapped, anon_vma);
> >> > +             return MIGRATEPAGE_UNMAP;
> >> >       }
> >> >
> >> >       /*
> >> > @@ -1085,11 +1104,41 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >> >               VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
> >> >                               page);
> >> >               try_to_migrate(folio, 0);
> >> > -             page_was_mapped = true;
> >> > +             page_was_mapped = 1;
> >> > +     }
> >> > +
> >> > +     if (!page_mapped(page)) {
> >> > +             __migrate_page_record(newpage, page_was_mapped, anon_vma);
> >> > +             return MIGRATEPAGE_UNMAP;
> >> >       }
> >> >
> >> > -     if (!page_mapped(page))
> >> > -             rc = move_to_new_folio(dst, folio, mode);
> >> > +     if (page_was_mapped)
> >> > +             remove_migration_ptes(folio, folio, false);
> >> > +
> >> > +out_unlock_both:
> >> > +     unlock_page(newpage);
> >> > +out_unlock:
> >> > +     /* Drop an anon_vma reference if we took one */
> >> > +     if (anon_vma)
> >> > +             put_anon_vma(anon_vma);
> >> > +     unlock_page(page);
> >> > +out:
> >> > +
> >> > +     return rc;
> >> > +}
> >> > +
> >> > +static int __migrate_page_move(struct page *page, struct page *newpage,
> >> > +                            enum migrate_mode mode)
> >> > +{
> >> > +     struct folio *folio = page_folio(page);
> >> > +     struct folio *dst = page_folio(newpage);
> >> > +     int rc;
> >> > +     int page_was_mapped = 0;
> >> > +     struct anon_vma *anon_vma = NULL;
> >> > +
> >> > +     __migrate_page_extract(newpage, &page_was_mapped, &anon_vma);
> >> > +
> >> > +     rc = move_to_new_folio(dst, folio, mode);
> >> >
> >> >       /*
> >> >        * When successful, push newpage to LRU immediately: so that if it
> >> > @@ -1110,14 +1159,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >> >               remove_migration_ptes(folio,
> >> >                       rc == MIGRATEPAGE_SUCCESS ? dst : folio, false);
> >> >
> >> > -out_unlock_both:
> >> >       unlock_page(newpage);
> >> > -out_unlock:
> >> >       /* Drop an anon_vma reference if we took one */
> >> >       if (anon_vma)
> >> >               put_anon_vma(anon_vma);
> >> >       unlock_page(page);
> >> > -out:
> >> >       /*
> >> >        * If migration is successful, decrease refcount of the newpage,
> >> >        * which will not free the page because new page owner increased
> >> > @@ -1129,18 +1175,31 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >> >       return rc;
> >> >  }
> >> >
> >> > -/*
> >> > - * Obtain the lock on page, remove all ptes and migrate the page
> >> > - * to the newly allocated page in newpage.
> >> > - */
> >> > -static int unmap_and_move(new_page_t get_new_page,
> >> > -                                free_page_t put_new_page,
> >> > -                                unsigned long private, struct page *page,
> >> > -                                int force, enum migrate_mode mode,
> >> > -                                enum migrate_reason reason,
> >> > -                                struct list_head *ret)
> >> > +static void migrate_page_done(struct page *page,
> >> > +                           enum migrate_reason reason)
> >> > +{
> >> > +     /*
> >> > +      * Compaction can migrate also non-LRU pages which are
> >> > +      * not accounted to NR_ISOLATED_*. They can be recognized
> >> > +      * as __PageMovable
> >> > +      */
> >> > +     if (likely(!__PageMovable(page)))
> >> > +             mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> >> > +                                 page_is_file_lru(page), -thp_nr_pages(page));
> >> > +
> >> > +     if (reason != MR_MEMORY_FAILURE)
> >> > +             /* We release the page in page_handle_poison. */
> >> > +             put_page(page);
> >> > +}
> >> > +
> >> > +/* Obtain the lock on page, remove all ptes. */
> >> > +static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
> >> > +                           unsigned long private, struct page *page,
> >> > +                           struct page **newpagep, int force,
> >> > +                           enum migrate_mode mode, enum migrate_reason reason,
> >> > +                           struct list_head *ret)
> >> >  {
> >> > -     int rc = MIGRATEPAGE_SUCCESS;
> >> > +     int rc = MIGRATEPAGE_UNMAP;
> >> >       struct page *newpage = NULL;
> >> >
> >> >       if (!thp_migration_supported() && PageTransHuge(page))
> >> > @@ -1151,19 +1210,48 @@ static int unmap_and_move(new_page_t get_new_page,
> >> >               ClearPageActive(page);
> >> >               ClearPageUnevictable(page);
> >> >               /* free_pages_prepare() will clear PG_isolated. */
> >> > -             goto out;
> >> > +             list_del(&page->lru);
> >> > +             migrate_page_done(page, reason);
> >> > +             return MIGRATEPAGE_SUCCESS;
> >> >       }
> >> >
> >> >       newpage = get_new_page(page, private);
> >> >       if (!newpage)
> >> >               return -ENOMEM;
> >> > +     *newpagep = newpage;
> >> >
> >> > -     newpage->private = 0;
> >> > -     rc = __unmap_and_move(page, newpage, force, mode);
> >> > +     rc = __migrate_page_unmap(page, newpage, force, mode);
> >> > +     if (rc == MIGRATEPAGE_UNMAP)
> >> > +             return rc;
> >> > +
> >> > +     /*
> >> > +      * A page that has not been migrated will have kept its
> >> > +      * references and be restored.
> >> > +      */
> >> > +     /* restore the page to right list. */
> >> > +     if (rc != -EAGAIN)
> >> > +             list_move_tail(&page->lru, ret);
> >> > +
> >> > +     if (put_new_page)
> >> > +             put_new_page(newpage, private);
> >> > +     else
> >> > +             put_page(newpage);
> >> > +
> >> > +     return rc;
> >> > +}
> >> > +
> >> > +/* Migrate the page to the newly allocated page in newpage. */
> >> > +static int migrate_page_move(free_page_t put_new_page, unsigned long private,
> >> > +                          struct page *page, struct page *newpage,
> >> > +                          enum migrate_mode mode, enum migrate_reason reason,
> >> > +                          struct list_head *ret)
> >> > +{
> >> > +     int rc;
> >> > +
> >> > +     rc = __migrate_page_move(page, newpage, mode);
> >> >       if (rc == MIGRATEPAGE_SUCCESS)
> >> >               set_page_owner_migrate_reason(newpage, reason);
> >> >
> >> > -out:
> >> >       if (rc != -EAGAIN) {
> >> >               /*
> >> >                * A page that has been migrated has all references
> >> > @@ -1179,20 +1267,7 @@ static int unmap_and_move(new_page_t get_new_page,
> >> >        * we want to retry.
> >> >        */
> >> >       if (rc == MIGRATEPAGE_SUCCESS) {
> >> > -             /*
> >> > -              * Compaction can migrate also non-LRU pages which are
> >> > -              * not accounted to NR_ISOLATED_*. They can be recognized
> >> > -              * as __PageMovable
> >> > -              */
> >> > -             if (likely(!__PageMovable(page)))
> >> > -                     mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> >> > -                                     page_is_file_lru(page), -thp_nr_pages(page));
> >> > -
> >> > -             if (reason != MR_MEMORY_FAILURE)
> >> > -                     /*
> >> > -                      * We release the page in page_handle_poison.
> >> > -                      */
> >> > -                     put_page(page);
> >> > +             migrate_page_done(page, reason);
> >> >       } else {
> >> >               if (rc != -EAGAIN)
> >> >                       list_add_tail(&page->lru, ret);
> >> > @@ -1405,6 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >> >       int pass = 0;
> >> >       bool is_thp = false;
> >> >       struct page *page;
> >> > +     struct page *newpage = NULL;
> >> >       struct page *page2;
> >> >       int rc, nr_subpages;
> >> >       LIST_HEAD(ret_pages);
> >> > @@ -1493,9 +1569,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >> >                       if (PageHuge(page))
> >> >                               continue;
> >> >
> >> > -                     rc = unmap_and_move(get_new_page, put_new_page,
> >> > -                                             private, page, pass > 2, mode,
> >> > +                     rc = migrate_page_unmap(get_new_page, put_new_page, private,
> >> > +                                             page, &newpage, pass > 2, mode,
> >> >                                               reason, &ret_pages);
> >> > +                     if (rc == MIGRATEPAGE_UNMAP)
> >> > +                             rc = migrate_page_move(put_new_page, private,
> >> > +                                                    page, newpage, mode,
> >> > +                                                    reason, &ret_pages);
> >> >                       /*
> >> >                        * The rules are:
> >> >                        *      Success: page will be freed

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-27  1:51         ` Huang, Ying
  2022-09-27 20:34           ` John Hubbard
@ 2022-09-27 20:56           ` Yang Shi
  1 sibling, 0 replies; 50+ messages in thread
From: Yang Shi @ 2022-09-27 20:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Alistair Popple, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

On Mon, Sep 26, 2022 at 6:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Alistair Popple <apopple@nvidia.com> writes:
>
> > Yang Shi <shy828301@gmail.com> writes:
> >
> >> On Mon, Sep 26, 2022 at 2:37 AM Alistair Popple <apopple@nvidia.com> wrote:
> >>>
> >>>
> >>> Huang Ying <ying.huang@intel.com> writes:
> >>>
> >>> > This is a preparation patch to batch the page unmapping and moving
> >>> > for the normal pages and THP.
> >>> >
> >>> > In this patch, unmap_and_move() is split to migrate_page_unmap() and
> >>> > migrate_page_move().  So, we can batch _unmap() and _move() in
> >>> > different loops later.  To pass some information between unmap and
> >>> > move, the original unused newpage->mapping and newpage->private are
> >>> > used.
> >>>
> >>> This looks like it could cause a deadlock between two threads migrating
> >>> the same pages if force == true && mode != MIGRATE_ASYNC as
> >>> migrate_page_unmap() will call lock_page() while holding the lock on
> >>> other pages in the list. Therefore the two threads could deadlock if the
> >>> pages are in a different order.
> >>
> >> It seems unlikely to me since the page has to be isolated from lru
> >> before migration. The isolating from lru is atomic, so the two threads
> >> unlikely see the same pages on both lists.
> >
> > Oh thanks! That is a good point and I agree since lru isolation is
> > atomic the two threads won't see the same pages. migrate_vma_setup()
> > does LRU isolation after locking the page which is why the potential
> > exists there. We could potentially switch that around but given
> > ZONE_DEVICE pages aren't on an lru it wouldn't help much.
> >
> >> But there might be other cases which may incur deadlock, for example,
> >> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
> >> then write them back in a batch. The same pages may be on the
> >> migration list and they are also dirty and seen by writeback. I'm not
> >> sure whether I miss something that could prevent such a deadlock from
> >> happening.
> >
> > I'm not overly familiar with that area but I would assume any filesystem
> > code doing this would already have to deal with deadlock potential.
>
> Thank you very much for pointing this out.  I think the deadlock is a
> real issue.  Anyway, we shouldn't forbid other places in kernel to lock
> 2 pages at the same time.
>
> The simplest solution is to batch page migration only if mode ==
> MIGRATE_ASYNC.  Then we may consider to fall back to non-batch mode if
> mode != MIGRATE_ASYNC and trylock page fails.

Seems like so...

>
> Best Regards,
> Huang, Ying
>
> [snip]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-27 20:34           ` John Hubbard
@ 2022-09-27 20:57             ` Yang Shi
  2022-09-28  0:59               ` Alistair Popple
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Shi @ 2022-09-27 20:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On Tue, Sep 27, 2022 at 1:35 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 9/26/22 18:51, Huang, Ying wrote:
> >>> But there might be other cases which may incur deadlock, for example,
> >>> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
> >>> then write them back in a batch. The same pages may be on the
> >>> migration list and they are also dirty and seen by writeback. I'm not
> >>> sure whether I miss something that could prevent such a deadlock from
> >>> happening.
> >>
> >> I'm not overly familiar with that area but I would assume any filesystem
> >> code doing this would already have to deal with deadlock potential.
> >
> > Thank you very much for pointing this out.  I think the deadlock is a
> > real issue.  Anyway, we shouldn't forbid other places in kernel to lock
> > 2 pages at the same time.
> >
>
> I also agree that we cannot make any rules such as "do not lock > 1 page
> at the same time, elsewhere in the kernel", because it is already
> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
> (15) pages per batch [1].
>
> The only deadlock prevention convention that I see is the convention of
> locking the pages in order of ascending address. That only helps if
> everything does it that way, and migrate code definitely does not.
> However...I thought that up until now, at least, the migrate code relied
> on trylock (which can fail, and so migration can fail, too), to avoid
> deadlock. Is that changing somehow, I didn't see it?

Trylock is used by async mode, which tries to avoid blocking, but sync
mode does use lock_page(). The current implementation of migration
migrates one page at a time, so it is not a problem.
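
For reference, the existing single-page locking step follows roughly
this pattern (paraphrased from __unmap_and_move() and wrapped in a
hypothetical helper for clarity; not the literal code):

static bool migrate_lock_src_page(struct page *page, int force,
				  enum migrate_mode mode)
{
	if (trylock_page(page))
		return true;
	/* Async mode (and the early, non-forced passes) never block. */
	if (!force || mode == MIGRATE_ASYNC)
		return false;
	/*
	 * Sync modes may sleep here; today that is safe because only one
	 * page is migrated at a time, so no other page lock is held.
	 */
	lock_page(page);
	return true;
}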

>
>
> [1] https://elixir.bootlin.com/linux/latest/source/mm/page-writeback.c#L2296
>
> thanks,
>
> --
> John Hubbard
> NVIDIA
>
> > The simplest solution is to batch page migration only if mode ==
> > MIGRATE_ASYNC.  Then we may consider to fall back to non-batch mode if
> > mode != MIGRATE_ASYNC and trylock page fails.
> >
>
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-27 20:57             ` Yang Shi
@ 2022-09-28  0:59               ` Alistair Popple
  2022-09-28  1:41                 ` Huang, Ying
  0 siblings, 1 reply; 50+ messages in thread
From: Alistair Popple @ 2022-09-28  0:59 UTC (permalink / raw)
  To: Yang Shi
  Cc: John Hubbard, Huang, Ying, linux-mm, linux-kernel, Andrew Morton,
	Zi Yan, Baolin Wang, Oscar Salvador, Matthew Wilcox


Yang Shi <shy828301@gmail.com> writes:

> On Tue, Sep 27, 2022 at 1:35 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 9/26/22 18:51, Huang, Ying wrote:
>> >>> But there might be other cases which may incur deadlock, for example,
>> >>> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
>> >>> then write them back in a batch. The same pages may be on the
>> >>> migration list and they are also dirty and seen by writeback. I'm not
>> >>> sure whether I miss something that could prevent such a deadlock from
>> >>> happening.
>> >>
>> >> I'm not overly familiar with that area but I would assume any filesystem
>> >> code doing this would already have to deal with deadlock potential.
>> >
>> > Thank you very much for pointing this out.  I think the deadlock is a
>> > real issue.  Anyway, we shouldn't forbid other places in kernel to lock
>> > 2 pages at the same time.
>> >
>>
>> I also agree that we cannot make any rules such as "do not lock > 1 page
>> at the same time, elsewhere in the kernel", because it is already
>> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
>> (15) pages per batch [1].

That's not really the case though. The inner loop of write_cache_pages()
only ever locks one page at a time, either directly via the
unlock_page() on L2338 (those gotos are amazing) or indirectly via
(*writepage)() on L2359.

So there's no deadlock potential there because unlocking any previously
locked page(s) doesn't depend on obtaining the lock for another page.
Unless I've missed something?
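
For reference, the shape of that inner loop is roughly the following
(heavily simplified from mm/page-writeback.c; the tag lookup, range
handling and error paths are omitted, pvec, wbc, data and the locals
come from the surrounding function, and details vary by kernel version):

	for (i = 0; i < pagevec_count(&pvec); i++) {
		struct page *page = pvec.pages[i];

		lock_page(page);
		if (!PageDirty(page) || !clear_page_dirty_for_io(page)) {
			unlock_page(page);	/* nothing to write back */
			continue;
		}
		error = (*writepage)(page, wbc, data);	/* unlocks the page */
	}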

>> The only deadlock prevention convention that I see is the convention of
>> locking the pages in order of ascending address. That only helps if
>> everything does it that way, and migrate code definitely does not.
>> However...I thought that up until now, at least, the migrate code relied
>> on trylock (which can fail, and so migration can fail, too), to avoid
>> deadlock. Is that changing somehow, I didn't see it?
>
> The trylock is used by async mode which does try to avoid blocking.
> But sync mode does use lock. The current implementation of migration
> does migrate one page at a time, so it is not a problem.
>
>>
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/mm/page-writeback.c#L2296
>>
>> thanks,
>>
>> --
>> John Hubbard
>> NVIDIA
>>
>> > The simplest solution is to batch page migration only if mode ==
>> > MIGRATE_ASYNC.  Then we may consider to fall back to non-batch mode if
>> > mode != MIGRATE_ASYNC and trylock page fails.
>> >
>>
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  0:59               ` Alistair Popple
@ 2022-09-28  1:41                 ` Huang, Ying
  2022-09-28  1:44                   ` John Hubbard
  0 siblings, 1 reply; 50+ messages in thread
From: Huang, Ying @ 2022-09-28  1:41 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Yang Shi, John Hubbard, linux-mm, linux-kernel, Andrew Morton,
	Zi Yan, Baolin Wang, Oscar Salvador, Matthew Wilcox

Alistair Popple <apopple@nvidia.com> writes:

> Yang Shi <shy828301@gmail.com> writes:
>
>> On Tue, Sep 27, 2022 at 1:35 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>
>>> On 9/26/22 18:51, Huang, Ying wrote:
>>> >>> But there might be other cases which may incur deadlock, for example,
>>> >>> filesystem writeback IIUC. Some filesystems may lock a bunch of pages
>>> >>> then write them back in a batch. The same pages may be on the
>>> >>> migration list and they are also dirty and seen by writeback. I'm not
>>> >>> sure whether I miss something that could prevent such a deadlock from
>>> >>> happening.
>>> >>
>>> >> I'm not overly familiar with that area but I would assume any filesystem
>>> >> code doing this would already have to deal with deadlock potential.
>>> >
>>> > Thank you very much for pointing this out.  I think the deadlock is a
>>> > real issue.  Anyway, we shouldn't forbid other places in kernel to lock
>>> > 2 pages at the same time.
>>> >
>>>
>>> I also agree that we cannot make any rules such as "do not lock > 1 page
>>> at the same time, elsewhere in the kernel", because it is already
>>> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
>>> (15) pages per batch [1].
>
> That's not really the case though. The inner loop of write_cache_page()
> only ever locks one page at a time, either directly via the
> unlock_page() on L2338 (those goto's are amazing) or indirectly via
> (*writepage)() on L2359.
>
> So there's no deadlock potential there because unlocking any previously
> locked page(s) doesn't depend on obtaining the lock for another page.
> Unless I've missed something?

Yes.  This is my understanding too after checking ext4_writepage().

Best Regards,
Huang, Ying

>>> The only deadlock prevention convention that I see is the convention of
>>> locking the pages in order of ascending address. That only helps if
>>> everything does it that way, and migrate code definitely does not.
>>> However...I thought that up until now, at least, the migrate code relied
>>> on trylock (which can fail, and so migration can fail, too), to avoid
>>> deadlock. Is that changing somehow, I didn't see it?
>>
>> The trylock is used by async mode which does try to avoid blocking.
>> But sync mode does use lock. The current implementation of migration
>> does migrate one page at a time, so it is not a problem.
>>
>>>
>>>
>>> [1] https://elixir.bootlin.com/linux/latest/source/mm/page-writeback.c#L2296
>>>
>>> thanks,
>>>
>>> --
>>> John Hubbard
>>> NVIDIA
>>>
>>> > The simplest solution is to batch page migration only if mode ==
>>> > MIGRATE_ASYNC.  Then we may consider to fall back to non-batch mode if
>>> > mode != MIGRATE_ASYNC and trylock page fails.
>>> >
>>>
>>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  1:41                 ` Huang, Ying
@ 2022-09-28  1:44                   ` John Hubbard
  2022-09-28  1:49                     ` Yang Shi
  0 siblings, 1 reply; 50+ messages in thread
From: John Hubbard @ 2022-09-28  1:44 UTC (permalink / raw)
  To: Huang, Ying, Alistair Popple
  Cc: Yang Shi, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

On 9/27/22 18:41, Huang, Ying wrote:
>>>> I also agree that we cannot make any rules such as "do not lock > 1 page
>>>> at the same time, elsewhere in the kernel", because it is already
>>>> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
>>>> (15) pages per batch [1].
>>
>> That's not really the case though. The inner loop of write_cache_page()
>> only ever locks one page at a time, either directly via the
>> unlock_page() on L2338 (those goto's are amazing) or indirectly via
>> (*writepage)() on L2359.
>>
>> So there's no deadlock potential there because unlocking any previously
>> locked page(s) doesn't depend on obtaining the lock for another page.
>> Unless I've missed something?
> 
> Yes.  This is my understanding too after checking ext4_writepage().
> 

Yes, I missed the ".writepage() shall unlock the page" design point. Now
it seems much more reasonable and safer. :)

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-27 10:46     ` Bharata B Rao
@ 2022-09-28  1:46       ` Huang, Ying
  0 siblings, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-28  1:46 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox

Bharata B Rao <bharata@amd.com> writes:

> On 9/23/2022 1:22 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>>
>>> Thanks for the patchset. I find it hitting the following BUG() when
>>> running mmtests/autonumabench:
>>>
>>> kernel BUG at mm/migrate.c:2432!
>>>
>>> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().
>> 
>> Thank you very much for reporting!  I haven't reproduced this yet.  But
>> I will pay special attention to this when develop the next version, even
>> if I cannot reproduce this finally.
>
> The following change fixes the above reported BUG_ON().
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a0de0d9b4d41..c11dd82245e5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1197,7 +1197,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
>          * references and be restored.
>          */
>         /* restore the page to right list. */
> -       if (rc != -EAGAIN)
> +       if (rc == -EAGAIN)
>                  ret = NULL;
>  
>         migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);
>
> The pages that returned from unmapping stage with -EAGAIN used to
> end up on "ret" list rather than continuing on the "from" list.

Yes.  You are right.  Thank you very much!

Digging into the history, I found that the code was correct in previous
versions but became wrong by mistake during code rebasing.  I will be
more careful in the future and try to organize the patchset better to
make the changes easier to review.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  1:44                   ` John Hubbard
@ 2022-09-28  1:49                     ` Yang Shi
  2022-09-28  1:56                       ` John Hubbard
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Shi @ 2022-09-28  1:49 UTC (permalink / raw)
  To: John Hubbard
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On Tue, Sep 27, 2022 at 6:45 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 9/27/22 18:41, Huang, Ying wrote:
> >>>> I also agree that we cannot make any rules such as "do not lock > 1 page
> >>>> at the same time, elsewhere in the kernel", because it is already
> >>>> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
> >>>> (15) pages per batch [1].
> >>
> >> That's not really the case though. The inner loop of write_cache_page()
> >> only ever locks one page at a time, either directly via the
> >> unlock_page() on L2338 (those goto's are amazing) or indirectly via
> >> (*writepage)() on L2359.
> >>
> >> So there's no deadlock potential there because unlocking any previously
> >> locked page(s) doesn't depend on obtaining the lock for another page.
> >> Unless I've missed something?
> >
> > Yes.  This is my understanding too after checking ext4_writepage().
> >
>
> Yes, I missed the ".writepage() shall unlock the page" design point. Now
> it seems much more reasonable and safer. :)

.writepage is deprecated (see
https://lore.kernel.org/linux-fsdevel/20220719041311.709250-1-hch@lst.de/);
writeback actually uses .writepages.

>
> thanks,
>
> --
> John Hubbard
> NVIDIA
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  1:49                     ` Yang Shi
@ 2022-09-28  1:56                       ` John Hubbard
  2022-09-28  2:14                         ` Yang Shi
  0 siblings, 1 reply; 50+ messages in thread
From: John Hubbard @ 2022-09-28  1:56 UTC (permalink / raw)
  To: Yang Shi
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On 9/27/22 18:49, Yang Shi wrote:
> On Tue, Sep 27, 2022 at 6:45 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 9/27/22 18:41, Huang, Ying wrote:
>>>>>> I also agree that we cannot make any rules such as "do not lock > 1 page
>>>>>> at the same time, elsewhere in the kernel", because it is already
>>>>>> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
>>>>>> (15) pages per batch [1].
>>>>
>>>> That's not really the case though. The inner loop of write_cache_page()
>>>> only ever locks one page at a time, either directly via the
>>>> unlock_page() on L2338 (those goto's are amazing) or indirectly via
>>>> (*writepage)() on L2359.
>>>>
>>>> So there's no deadlock potential there because unlocking any previously
>>>> locked page(s) doesn't depend on obtaining the lock for another page.
>>>> Unless I've missed something?
>>>
>>> Yes.  This is my understanding too after checking ext4_writepage().
>>>
>>
>> Yes, I missed the ".writepage() shall unlock the page" design point. Now
>> it seems much more reasonable and safer. :)
> 
> .writepage is deprecated (see
> https://lore.kernel.org/linux-fsdevel/20220719041311.709250-1-hch@lst.de/),
> write back actually uses .writepages.

write_cache_pages() seems to directly call it, though:

generic_writepages()
  write_cache_pages(__writepage)
    __writepage()
      mapping->a_ops->writepage(page, wbc)

So it seems like it's still alive and well. And in any case, it is definitely
passing one page at a time from write_cache_pages(), right?


 thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-27 11:21 ` haoxin
@ 2022-09-28  2:01   ` Huang, Ying
  2022-09-28  3:33     ` haoxin
  2022-11-01 14:49   ` Hesham Almatary
  1 sibling, 1 reply; 50+ messages in thread
From: Huang, Ying @ 2022-09-28  2:01 UTC (permalink / raw)
  To: haoxin
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox, yangyicong,
	v-songbaohua, 21cnbao

haoxin <xhao@linux.alibaba.com> writes:

> Hi, Huang
>
> On 2022/9/21 2:06 PM, Huang Ying wrote:
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>    for each page
>>      unmap
>>      flush TLB
>>      copy
>>      restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>    for each page
>>      unmap
>>    for each page
>>      flush TLB
>>    for each page
>>      copy
>>    for each page
>>      restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>   - Run pmbench memory accessing benchmark
>>
>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>     node 1 back and forth.
>>
> As the pmbench can not run on arm64 machine, so i use lmbench instead.
> I test case like this:  (i am not sure whether it is reasonable, but it seems worked)
> ./bw_mem -N10000 10000m rd &
> time migratepages pid node0 node1
>
> o/patch      		w/patch
> real	0m0.035s  	real	0m0.024s
> user	0m0.000s  	user	0m0.000s
> sys	0m0.035s        sys	0m0.024s
>
> the migratepages time is reduced above 32%.
>
> But there has a problem, i see the batch flush is called by
> migrate_pages_batch
> 	try_to_unmap_flush
> 		arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>
> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>
> So, the tlb batch flush means no any flush is did, it is a empty func.

Yes.  And should_defer_flush() will always return false too.  That is,
the TLB will still be flushed, but will not be batched.
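
Concretely, the per-PTE decision during unmap looks roughly like this
(paraphrased from try_to_unmap_one()/try_to_migrate_one() in mm/rmap.c;
mm, vma, address, flags and pvmw come from the surrounding function):

	if (should_defer_flush(mm, flags)) {
		/* clear the PTE now and record that a flush is pending */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		/* clear the PTE and flush the TLB immediately */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}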

> Maybe this patch can help solve this problem.
> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/

Yes.  This will bring TLB flush batching to ARM64.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  1:56                       ` John Hubbard
@ 2022-09-28  2:14                         ` Yang Shi
  2022-09-28  2:57                           ` John Hubbard
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Shi @ 2022-09-28  2:14 UTC (permalink / raw)
  To: John Hubbard
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On Tue, Sep 27, 2022 at 6:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 9/27/22 18:49, Yang Shi wrote:
> > On Tue, Sep 27, 2022 at 6:45 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 9/27/22 18:41, Huang, Ying wrote:
> >>>>>> I also agree that we cannot make any rules such as "do not lock > 1 page
> >>>>>> at the same time, elsewhere in the kernel", because it is already
> >>>>>> happening, for example in page-writeback.c, which locks PAGEVEC_SIZE
> >>>>>> (15) pages per batch [1].
> >>>>
> >>>> That's not really the case though. The inner loop of write_cache_page()
> >>>> only ever locks one page at a time, either directly via the
> >>>> unlock_page() on L2338 (those goto's are amazing) or indirectly via
> >>>> (*writepage)() on L2359.
> >>>>
> >>>> So there's no deadlock potential there because unlocking any previously
> >>>> locked page(s) doesn't depend on obtaining the lock for another page.
> >>>> Unless I've missed something?
> >>>
> >>> Yes.  This is my understanding too after checking ext4_writepage().
> >>>
> >>
> >> Yes, I missed the ".writepage() shall unlock the page" design point. Now
> >> it seems much more reasonable and safer. :)
> >
> > .writepage is deprecated (see
> > https://lore.kernel.org/linux-fsdevel/20220719041311.709250-1-hch@lst.de/),
> > write back actually uses .writepages.
>
> write_cache_pages() seems to directly call it, though:
>
> generic_writepages()
>   write_cache_pages(__writepage)
>     __writepage()
>       mapping->a_ops->writepage(page, wbc)
>
> So it seems like it's still alive and well. And in any case, it is definitely
> passing one page at a time from write_cache_pages(), right?

IIRC, writeback may not call generic_writepages(). On my ext4
filesystem, the writeback call stack looks like this:

@[
    ext4_writepages+1
    do_writepages+191
    __writeback_single_inode+65
    writeback_sb_inodes+477
    __writeback_inodes_wb+76
    wb_writeback+457
    wb_workfn+680
    process_one_work+485
    worker_thread+80
    kthread+231
    ret_from_fork+34
]: 2


>
>
>  thanks,
>
> --
> John Hubbard
> NVIDIA
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  2:14                         ` Yang Shi
@ 2022-09-28  2:57                           ` John Hubbard
  2022-09-28  3:25                             ` Yang Shi
  0 siblings, 1 reply; 50+ messages in thread
From: John Hubbard @ 2022-09-28  2:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On 9/27/22 19:14, Yang Shi wrote:
> IIRC, the writeback may not call generic_writepages. On my ext4
> filesystem, the writeback call stack looks like:
> 
> @[
>     ext4_writepages+1
>     do_writepages+191
>     __writeback_single_inode+65
>     writeback_sb_inodes+477
>     __writeback_inodes_wb+76
>     wb_writeback+457
>     wb_workfn+680
>     process_one_work+485
>     worker_thread+80
>     kthread+231
>     ret_from_fork+34
> ]: 2
> 

Sure, that's fine for ext4, in that particular case, but

a) not all filesystems have .writepages hooked up. That's why
do_writepages() checks for .writepages(), and falls back to
.writepage():

	if (mapping->a_ops->writepages)
		ret = mapping->a_ops->writepages(mapping, wbc);
	else
		ret = generic_writepages(mapping, wbc);

, and

b) there are also a lot of places and ways to invoke writebacks. There
are periodic writebacks, data integrity (WB_SYNC_ALL) writebacks, and
other places where a page needs to be cleaned--so, a lot of call sites,
some of which will still land on a .writepage() sometimes, even now.

For just one example, I see migrate.c's writeout() function directly
calling writepage():

	rc = mapping->a_ops->writepage(&folio->page, &wbc);


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  2:57                           ` John Hubbard
@ 2022-09-28  3:25                             ` Yang Shi
  2022-09-28  3:39                               ` Yang Shi
  0 siblings, 1 reply; 50+ messages in thread
From: Yang Shi @ 2022-09-28  3:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On Tue, Sep 27, 2022 at 7:57 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 9/27/22 19:14, Yang Shi wrote:
> > IIRC, the writeback may not call generic_writepages. On my ext4
> > filesystem, the writeback call stack looks like:
> >
> > @[
> >     ext4_writepages+1
> >     do_writepages+191
> >     __writeback_single_inode+65
> >     writeback_sb_inodes+477
> >     __writeback_inodes_wb+76
> >     wb_writeback+457
> >     wb_workfn+680
> >     process_one_work+485
> >     worker_thread+80
> >     kthread+231
> >     ret_from_fork+34
> > ]: 2
> >
>
> Sure, that's fine for ext4, in that particular case, but
>
> a) not all filesystems have .writepages hooked up. That's why
> do_writepages() checks for .writepages(), and falls back to
> .writepage():
>
>         if (mapping->a_ops->writepages)
>                 ret = mapping->a_ops->writepages(mapping, wbc);
>         else
>                 ret = generic_writepages(mapping, wbc);
>
> , and
>
> b) there are also a lot of places and ways to invoke writebacks. There
> are periodic writebacks, and there are data integrity (WB_SYNC_ALL)
> writebacks, and other places where a page needs to be cleaned--so, a lot
> of call sites. Some of which will land on a .writepage() sometimes, even
> now.
>
> For just one example, I see migrate.c's writeout() function directly
> calling writepage():
>
>         rc = mapping->a_ops->writepage(&folio->page, &wbc);

Thanks, John. You are right. I think "deprecated" is inaccurate;
.writepage is actually no longer used in the memory reclaim context,
but it is still used in other contexts.

Anyway, what we tried to figure out in the first place is whether
filesystem writeback could deadlock with the new batch migration. I
think the conclusion didn't change.

>
>
> thanks,
>
> --
> John Hubbard
> NVIDIA
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-28  2:01   ` Huang, Ying
@ 2022-09-28  3:33     ` haoxin
  2022-09-28  4:53       ` Huang, Ying
  0 siblings, 1 reply; 50+ messages in thread
From: haoxin @ 2022-09-28  3:33 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox, yangyicong,
	v-songbaohua, 21cnbao


On 2022/9/28 10:01 AM, Huang, Ying wrote:
> haoxin <xhao@linux.alibaba.com> writes:
>
>> Hi, Huang
>>
>> On 2022/9/21 2:06 PM, Huang Ying wrote:
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>> follows,
>>>
>>>     for each page
>>>       unmap
>>>       flush TLB
>>>       copy
>>>       restore map
>>>
>>> If multiple pages are passed to migrate_pages(), there are
>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>> change the code to something as follows,
>>>
>>>     for each page
>>>       unmap
>>>     for each page
>>>       flush TLB
>>>     for each page
>>>       copy
>>>     for each page
>>>       restore map
>>>
>>> The total number of TLB flushing IPI can be reduced considerably.  And
>>> we may use some hardware accelerator such as DSA to accelerate the
>>> page copying.
>>>
>>> So in this patch, we refactor the migrate_pages() implementation and
>>> implement the TLB flushing batching.  Base on this, hardware
>>> accelerated page copying can be implemented.
>>>
>>> If too many pages are passed to migrate_pages(), in the naive batched
>>> implementation, we may unmap too many pages at the same time.  The
>>> possibility for a task to wait for the migrated pages to be mapped
>>> again increases.  So the latency may be hurt.  To deal with this
>>> issue, the max number of pages be unmapped in batch is restricted to
>>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>> level of THP migration.
>>>
>>> We use the following test to measure the performance impact of the
>>> patchset,
>>>
>>> On a 2-socket Intel server,
>>>
>>>    - Run pmbench memory accessing benchmark
>>>
>>>    - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>      node 1 back and forth.
>>>
>> As the pmbench can not run on arm64 machine, so i use lmbench instead.
>> I test case like this:  (i am not sure whether it is reasonable, but it seems worked)
>> ./bw_mem -N10000 10000m rd &
>> time migratepages pid node0 node1
>>
>> o/patch      		w/patch
>> real	0m0.035s  	real	0m0.024s
>> user	0m0.000s  	user	0m0.000s
>> sys	0m0.035s        sys	0m0.024s
>>
>> the migratepages time is reduced above 32%.
>>
>> But there has a problem, i see the batch flush is called by
>> migrate_pages_batch
>> 	try_to_unmap_flush
>> 		arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>>
>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>
>> So, the tlb batch flush means no any flush is did, it is a empty func.
> Yes.  And should_defer_flush() will always return false too.  That is,
> the TLB will still be flushed, but will not be batched.
Oh, yes, I missed this, thank you.
>
>> Maybe this patch can help solve this problem.
>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
> Yes.  This will bring TLB flush batching to ARM64.
Next time, I will combine it with this patch and do some tests again.
Do you have any suggestions about the benchmark?
>
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move()
  2022-09-28  3:25                             ` Yang Shi
@ 2022-09-28  3:39                               ` Yang Shi
  0 siblings, 0 replies; 50+ messages in thread
From: Yang Shi @ 2022-09-28  3:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Huang, Ying, Alistair Popple, linux-mm, linux-kernel,
	Andrew Morton, Zi Yan, Baolin Wang, Oscar Salvador,
	Matthew Wilcox

On Tue, Sep 27, 2022 at 8:25 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Sep 27, 2022 at 7:57 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 9/27/22 19:14, Yang Shi wrote:
> > > IIRC, the writeback may not call generic_writepages. On my ext4
> > > filesystem, the writeback call stack looks like:
> > >
> > > @[
> > >     ext4_writepages+1
> > >     do_writepages+191
> > >     __writeback_single_inode+65
> > >     writeback_sb_inodes+477
> > >     __writeback_inodes_wb+76
> > >     wb_writeback+457
> > >     wb_workfn+680
> > >     process_one_work+485
> > >     worker_thread+80
> > >     kthread+231
> > >     ret_from_fork+34
> > > ]: 2
> > >
> >
> > Sure, that's fine for ext4, in that particular case, but
> >
> > a) not all filesystems have .writepages hooked up. That's why
> > do_writepages() checks for .writepages(), and falls back to
> > .writepage():
> >
> >         if (mapping->a_ops->writepages)
> >                 ret = mapping->a_ops->writepages(mapping, wbc);
> >         else
> >                 ret = generic_writepages(mapping, wbc);
> >
> > , and
> >
> > b) there are also a lot of places and ways to invoke writebacks. There
> > are periodic writebacks, and there are data integrity (WB_SYNC_ALL)
> > writebacks, and other places where a page needs to be cleaned--so, a lot
> > of call sites. Some of which will land on a .writepage() sometimes, even
> > now.
> >
> > For just one example, I see migrate.c's writeout() function directly
> > calling writepage():
> >
> >         rc = mapping->a_ops->writepage(&folio->page, &wbc);
>
> Thanks, John. You are right. I think "deprecated" is inaccurate,
> .writepage is actually no longer used in memory reclaim context, but
> it is still used for other contexts.

Hmm... it is definitely used currently, but it seems like the plan is
to eventually deprecate ->writepage, IIUC. See
https://lore.kernel.org/linux-mm/YvQYjpDHH5KckCrw@casper.infradead.org/

>
> Anyway I think what we tried to figure out in the first place is
> whether it is possible that filesystem writeback have dead lock with
> the new batch migration or not. I think the conclusion didn't change.
>
> >
> >
> > thanks,
> >
> > --
> > John Hubbard
> > NVIDIA
> >

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-28  3:33     ` haoxin
@ 2022-09-28  4:53       ` Huang, Ying
  0 siblings, 0 replies; 50+ messages in thread
From: Huang, Ying @ 2022-09-28  4:53 UTC (permalink / raw)
  To: haoxin
  Cc: linux-mm, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox, yangyicong,
	v-songbaohua, 21cnbao

haoxin <xhao@linux.alibaba.com> writes:

> On 2022/9/28 10:01 AM, Huang, Ying wrote:
>> haoxin <xhao@linux.alibaba.com> writes:
>>
>>> Hi, Huang
>>>
>>> On 2022/9/21 2:06 PM, Huang Ying wrote:
>>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>>
>>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>>> follows,
>>>>
>>>>     for each page
>>>>       unmap
>>>>       flush TLB
>>>>       copy
>>>>       restore map
>>>>
>>>> If multiple pages are passed to migrate_pages(), there are
>>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>>> change the code to something as follows,
>>>>
>>>>     for each page
>>>>       unmap
>>>>     for each page
>>>>       flush TLB
>>>>     for each page
>>>>       copy
>>>>     for each page
>>>>       restore map
>>>>
>>>> The total number of TLB flushing IPI can be reduced considerably.  And
>>>> we may use some hardware accelerator such as DSA to accelerate the
>>>> page copying.
>>>>
>>>> So in this patch, we refactor the migrate_pages() implementation and
>>>> implement the TLB flushing batching.  Base on this, hardware
>>>> accelerated page copying can be implemented.
>>>>
>>>> If too many pages are passed to migrate_pages(), in the naive batched
>>>> implementation, we may unmap too many pages at the same time.  The
>>>> possibility for a task to wait for the migrated pages to be mapped
>>>> again increases.  So the latency may be hurt.  To deal with this
>>>> issue, the max number of pages be unmapped in batch is restricted to
>>>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>>> level of THP migration.
>>>>
>>>> We use the following test to measure the performance impact of the
>>>> patchset,
>>>>
>>>> On a 2-socket Intel server,
>>>>
>>>>    - Run pmbench memory accessing benchmark
>>>>
>>>>    - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>>      node 1 back and forth.
>>>>
>>> As the pmbench can not run on arm64 machine, so i use lmbench instead.
>>> I test case like this:  (i am not sure whether it is reasonable, but it seems worked)
>>> ./bw_mem -N10000 10000m rd &
>>> time migratepages pid node0 node1
>>>
>>> o/patch      		w/patch
>>> real	0m0.035s  	real	0m0.024s
>>> user	0m0.000s  	user	0m0.000s
>>> sys	0m0.035s        sys	0m0.024s
>>>
>>> the migratepages time is reduced above 32%.
>>>
>>> But there has a problem, i see the batch flush is called by
>>> migrate_pages_batch
>>> 	try_to_unmap_flush
>>> 		arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>>>
>>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>>
>>> So, the tlb batch flush means no any flush is did, it is a empty func.
>> Yes.  And should_defer_flush() will always return false too.  That is,
>> the TLB will still be flushed, but will not be batched.
> Oh, yes, i ignore this, thank you.
>>
>>> Maybe this patch can help solve this problem.
>>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
>> Yes.  This will bring TLB flush batching to ARM64.
> Next time, i will combine with this patch, and do some test again,
> do you have any suggestion about benchmark ?

I think your benchmark should be OK.  If multiple threads are used, the
effect of the patchset will be better.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-09-27 11:21 ` haoxin
  2022-09-28  2:01   ` Huang, Ying
@ 2022-11-01 14:49   ` Hesham Almatary
  2022-11-02  3:14     ` Huang, Ying
  1 sibling, 1 reply; 50+ messages in thread
From: Hesham Almatary @ 2022-11-01 14:49 UTC (permalink / raw)
  To: haoxin, Huang Ying
  Cc: linux-kernel, Andrew Morton, Zi Yan, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox, yangyicong, v-songbaohua,
	21cnbao, linux-mm


On 9/27/2022 12:21 PM, haoxin wrote:
> Hi, Huang
>
> On 2022/9/21 2:06 PM, Huang Ying wrote:
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>    for each page
>>      unmap
>>      flush TLB
>>      copy
>>      restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>    for each page
>>      unmap
>>    for each page
>>      flush TLB
>>    for each page
>>      copy
>>    for each page
>>      restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time. The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>   - Run pmbench memory accessing benchmark
>>
>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>     node 1 back and forth.
>>
> As pmbench cannot run on an arm64 machine, I used lmbench instead.
> My test case looks like this (I am not sure whether it is reasonable, but it
> seems to work):
> ./bw_mem -N10000 10000m rd &
> time migratepages pid node0 node1
>
FYI, I have ported pmbench to AArch64 [1]. The project seems to be abandoned
on Bitbucket.

I wonder if it makes sense to fork it elsewhere and push the pending PRs there.


[1] https://bitbucket.org/jisooy/pmbench/pull-requests/5

> o/patch               w/patch
> real    0m0.035s      real    0m0.024s
> user    0m0.000s      user    0m0.000s
> sys    0m0.035s        sys    0m0.024s
>
> The migratepages time is reduced by about 32%.
>
> But there is a problem. I see the batched flush is called via
> migrate_pages_batch
>     try_to_unmap_flush
>         arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batched flush actually happens
>
> But on arm64, arch_tlbbatch_flush() is not supported, because arm64 does not
> select CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>
> So the TLB batch flush does not perform any flush at all; it is an empty function.
>
> Maybe this patch can help solve this problem.
> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ 
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-11-01 14:49   ` Hesham Almatary
@ 2022-11-02  3:14     ` Huang, Ying
  2022-11-02 14:13       ` Hesham Almatary
  0 siblings, 1 reply; 50+ messages in thread
From: Huang, Ying @ 2022-11-02  3:14 UTC (permalink / raw)
  To: Hesham Almatary
  Cc: haoxin, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox, yangyicong,
	v-songbaohua, 21cnbao, linux-mm


Hesham Almatary <hesham.almatary@huawei.com> writes:

> On 9/27/2022 12:21 PM, haoxin wrote:
>> Hi, Huang
>>
>> On 2022/9/21 2:06 PM, Huang Ying wrote:
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>> follows,
>>>
>>>   for each page
>>>    unmap
>>>    flush TLB
>>>    copy
>>>    restore map
>>>
>>> If multiple pages are passed to migrate_pages(), there are
>>> opportunities to batch the TLB flushing and copying. That is, we can
>>> change the code to something as follows,
>>>
>>>   for each page
>>>    unmap
>>>   for each page
>>>    flush TLB
>>>   for each page
>>>    copy
>>>   for each page
>>>    restore map
>>>
>>> The total number of TLB flushing IPI can be reduced considerably. And
>>> we may use some hardware accelerator such as DSA to accelerate the
>>> page copying.
>>>
>>> So in this patch, we refactor the migrate_pages() implementation and
>>> implement the TLB flushing batching. Base on this, hardware
>>> accelerated page copying can be implemented.
>>>
>>> If too many pages are passed to migrate_pages(), in the naive batched
>>> implementation, we may unmap too many pages at the same time. The
>>> possibility for a task to wait for the migrated pages to be mapped
>>> again increases. So the latency may be hurt. To deal with this
>>> issue, the max number of pages be unmapped in batch is restricted to
>>> no more than HPAGE_PMD_NR. That is, the influence is at the same
>>> level of THP migration.
>>>
>>> We use the following test to measure the performance impact of the
>>> patchset,
>>>
>>> On a 2-socket Intel server,
>>>
>>>  - Run pmbench memory accessing benchmark
>>>
>>>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>   node 1 back and forth.
>>>
>> As pmbench cannot run on an arm64 machine, I used lmbench instead.
>> My test case looks like this (I am not sure whether it is reasonable,
>> but it seems to work):
>> ./bw_mem -N10000 10000m rd &
>> time migratepages pid node0 node1
>>
> FYI, I have ported pmbench to AArch64 [1]. The project seems to be
> abandoned on bitbucket,
>
> I wonder if it makes sense to fork it elsewhere and push the pending PRs there.
>
>
> [1] https://bitbucket.org/jisooy/pmbench/pull-requests/5

Maybe try contacting the original author by email first?

Best Regards,
Huang, Ying

>> o/patch          w/patch
>> real  0m0.035s   real  0m0.024s
>> user  0m0.000s   user  0m0.000s
>> sys  0m0.035s    sys  0m0.024s
>>
>> The migratepages time is reduced by about 32%.
>>
>> But there is a problem. I see the batched flush is called via
>> migrate_pages_batch
>>   try_to_unmap_flush
>>     arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batched flush actually happens
>>
>> But on arm64, arch_tlbbatch_flush() is not supported, because arm64 does
>> not select CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>
>> So the TLB batch flush does not perform any flush at all; it is an empty function.
>>
>> Maybe this patch can help solve this problem.
>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ 
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC 0/6] migrate_pages(): batch TLB flushing
  2022-11-02  3:14     ` Huang, Ying
@ 2022-11-02 14:13       ` Hesham Almatary
  0 siblings, 0 replies; 50+ messages in thread
From: Hesham Almatary @ 2022-11-02 14:13 UTC (permalink / raw)
  To: Huang, Ying
  Cc: haoxin, linux-kernel, Andrew Morton, Zi Yan, Yang Shi,
	Baolin Wang, Oscar Salvador, Matthew Wilcox, yangyicong,
	v-songbaohua, 21cnbao, linux-mm


On 11/2/2022 3:14 AM, Huang, Ying wrote:
> Hesham Almatary <hesham.almatary@huawei.com> writes:
>
>> On 9/27/2022 12:21 PM, haoxin wrote:
>>> Hi, Huang
>>>
>>> On 2022/9/21 2:06 PM, Huang Ying wrote:
>>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>>
>>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>>> follows,
>>>>
>>>>   for each page
>>>>     unmap
>>>>     flush TLB
>>>>     copy
>>>>     restore map
>>>>
>>>> If multiple pages are passed to migrate_pages(), there are
>>>> opportunities to batch the TLB flushing and copying. That is, we can
>>>> change the code to something as follows,
>>>>
>>>>   for each page
>>>>     unmap
>>>>   for each page
>>>>     flush TLB
>>>>   for each page
>>>>     copy
>>>>   for each page
>>>>     restore map
>>>>
>>>> The total number of TLB flushing IPI can be reduced considerably. And
>>>> we may use some hardware accelerator such as DSA to accelerate the
>>>> page copying.
>>>>
>>>> So in this patch, we refactor the migrate_pages() implementation and
>>>> implement the TLB flushing batching. Base on this, hardware
>>>> accelerated page copying can be implemented.
>>>>
>>>> If too many pages are passed to migrate_pages(), in the naive batched
>>>> implementation, we may unmap too many pages at the same time. The
>>>> possibility for a task to wait for the migrated pages to be mapped
>>>> again increases. So the latency may be hurt. To deal with this
>>>> issue, the max number of pages be unmapped in batch is restricted to
>>>> no more than HPAGE_PMD_NR. That is, the influence is at the same
>>>> level of THP migration.
>>>>
>>>> We use the following test to measure the performance impact of the
>>>> patchset,
>>>>
>>>> On a 2-socket Intel server,
>>>>
>>>>   - Run pmbench memory accessing benchmark
>>>>
>>>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>>     node 1 back and forth.
>>>>
>>> As pmbench cannot run on an arm64 machine, I used lmbench instead.
>>> My test case looks like this (I am not sure whether it is reasonable,
>>> but it seems to work):
>>> ./bw_mem -N10000 10000m rd &
>>> time migratepages pid node0 node1
>>>
>> FYI, I have ported pmbench to AArch64 [1]. The project seems to be
>> abandoned on Bitbucket.
>>
>> I wonder if it makes sense to fork it elsewhere and push the pending PRs there.
>>
>>
>> [1] https://bitbucket.org/jisooy/pmbench/pull-requests/5
> Maybe try contacting the original author by email first?

That's a good idea. I'm not planning to fork/maintain it myself, but if anyone
is interested in doing so, I am happy to help out and submit PRs there.


> Best Regards,
> Huang, Ying
>
>>> o/patch          w/patch
>>> real  0m0.035s   real  0m0.024s
>>> user  0m0.000s   user  0m0.000s
>>> sys   0m0.035s   sys   0m0.024s
>>>
>>> The migratepages time is reduced by about 32%.
>>>
>>> But there is a problem. I see the batched flush is called via
>>> migrate_pages_batch
>>>   try_to_unmap_flush
>>>     arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batched flush actually happens
>>>
>>> But on arm64, arch_tlbbatch_flush() is not supported, because arm64 does
>>> not select CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>>
>>> So the TLB batch flush does not perform any flush at all; it is an empty function.
>>>
>>> Maybe this patch can help solve this problem.
>>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
>>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2022-11-02 14:13 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-21  6:06 [RFC 0/6] migrate_pages(): batch TLB flushing Huang Ying
2022-09-21  6:06 ` [RFC 1/6] mm/migrate_pages: separate huge page and normal pages migration Huang Ying
2022-09-21 15:55   ` Zi Yan
2022-09-22  1:14     ` Huang, Ying
2022-09-22  6:03   ` Baolin Wang
2022-09-22  6:22     ` Huang, Ying
2022-09-21  6:06 ` [RFC 2/6] mm/migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
2022-09-21 16:08   ` Zi Yan
2022-09-22  1:15     ` Huang, Ying
2022-09-22  6:36   ` Baolin Wang
2022-09-26  9:28   ` Alistair Popple
2022-09-26 18:06     ` Yang Shi
2022-09-27  0:02       ` Alistair Popple
2022-09-27  1:51         ` Huang, Ying
2022-09-27 20:34           ` John Hubbard
2022-09-27 20:57             ` Yang Shi
2022-09-28  0:59               ` Alistair Popple
2022-09-28  1:41                 ` Huang, Ying
2022-09-28  1:44                   ` John Hubbard
2022-09-28  1:49                     ` Yang Shi
2022-09-28  1:56                       ` John Hubbard
2022-09-28  2:14                         ` Yang Shi
2022-09-28  2:57                           ` John Hubbard
2022-09-28  3:25                             ` Yang Shi
2022-09-28  3:39                               ` Yang Shi
2022-09-27 20:56           ` Yang Shi
2022-09-27 20:54         ` Yang Shi
2022-09-21  6:06 ` [RFC 3/6] mm/migrate_pages: restrict number of pages to migrate in batch Huang Ying
2022-09-21 16:10   ` Zi Yan
2022-09-21 16:15     ` Zi Yan
2022-09-22  1:15     ` Huang, Ying
2022-09-21  6:06 ` [RFC 4/6] mm/migrate_pages: batch _unmap and _move Huang Ying
2022-09-21  6:06 ` [RFC 5/6] mm/migrate_pages: share more code between " Huang Ying
2022-09-21  6:06 ` [RFC 6/6] mm/migrate_pages: batch flushing TLB Huang Ying
2022-09-21 15:47 ` [RFC 0/6] migrate_pages(): batch TLB flushing Zi Yan
2022-09-22  1:45   ` Huang, Ying
2022-09-22  3:47   ` haoxin
2022-09-22  4:36     ` Huang, Ying
2022-09-22 12:50 ` Bharata B Rao
2022-09-23  7:52   ` Huang, Ying
2022-09-27 10:46     ` Bharata B Rao
2022-09-28  1:46       ` Huang, Ying
2022-09-26  9:11 ` Alistair Popple
2022-09-27 11:21 ` haoxin
2022-09-28  2:01   ` Huang, Ying
2022-09-28  3:33     ` haoxin
2022-09-28  4:53       ` Huang, Ying
2022-11-01 14:49   ` Hesham Almatary
2022-11-02  3:14     ` Huang, Ying
2022-11-02 14:13       ` Hesham Almatary

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).