[PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration


* [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration
@ 2023-02-24 14:11 Huang Ying
  2023-02-24 14:11 ` [PATCH 1/3] migrate_pages: fix deadlock in batched migration Huang Ying
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Huang Ying @ 2023-02-24 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Two deadlock bugs were reported for the migrate_pages() batching
series.  Thanks Hugh and Pengfei.  Analysis shows that if we have
locked folios other than the one we are migrating, it is not safe in
general to wait synchronously, for example, to wait for writeback to
complete or to wait to lock a buffer head.

So 1/3 fixes the deadlock in a simple way: batching support for
synchronous migration is disabled.  The change is straightforward and
easy to understand.  3/3 then re-introduces batching for synchronous
migration by first trying to migrate the batch asynchronously, then
falling back to migrating the fail-to-migrate folios synchronously,
one by one.  Testing shows that this effectively restores the TLB
flush batching performance for synchronous migration.
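
As a rough illustration of the strategy in 3/3, here is a minimal
user-space sketch, not the kernel code.  All names (toy_folio,
toy_migrate_one, toy_migrate_sync) are hypothetical, and "busy" is an
assumed stand-in for any condition that would require waiting, such as
writeback in progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: "busy" means migrating it would
 * require waiting (e.g. for writeback), which async mode refuses to do. */
struct toy_folio {
	bool busy;
	bool migrated;
};

enum toy_mode { TOY_ASYNC, TOY_SYNC };

/* Try to migrate one folio.  Async mode never waits, so a busy folio
 * fails; sync mode "waits" (modelled by clearing busy), then succeeds. */
static bool toy_migrate_one(struct toy_folio *f, enum toy_mode mode)
{
	if (f->busy) {
		if (mode == TOY_ASYNC)
			return false;
		f->busy = false;	/* stands in for a blocking wait */
	}
	f->migrated = true;
	return true;
}

/* Phase 1: one async pass over the whole batch.
 * Phase 2: retry each failed folio alone in sync mode, so nothing else
 * is held while we wait.  Returns how many folios needed phase 2. */
static size_t toy_migrate_sync(struct toy_folio *folios, size_t n)
{
	size_t i, fallbacks = 0;

	assert(folios != NULL || n == 0);

	for (i = 0; i < n; i++)
		toy_migrate_one(&folios[i], TOY_ASYNC);

	for (i = 0; i < n; i++) {
		if (folios[i].migrated)
			continue;
		fallbacks++;
		toy_migrate_one(&folios[i], TOY_SYNC);
	}
	return fallbacks;
}
```

With a batch of four folios of which two are busy, toy_migrate_sync()
migrates two in the async pass and reports two fallbacks.  The sketch
ignores retries, large folio splitting and statistics, all of which the
real patch has to handle.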

Best Regards,
Huang, Ying


* [PATCH 1/3] migrate_pages: fix deadlock in batched migration
  2023-02-24 14:11 [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Huang Ying
@ 2023-02-24 14:11 ` Huang Ying
  2023-02-28  6:13   ` Hugh Dickins
  2023-02-24 14:11 ` [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch() Huang Ying
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Huang Ying @ 2023-02-24 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Two deadlock bugs were reported for the migrate_pages() batching
series.  Thanks Hugh and Pengfei!  For example, in the following
deadlock trace snippet,

 INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
       Not tainted 6.2.0-rc4-kvm+ #1314
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
 Workqueue: loop4 loop_rootcg_workfn
 Call Trace:
  <TASK>
  __schedule+0x43b/0xd00
  schedule+0x6a/0xf0
  io_schedule+0x4a/0x80
  folio_wait_bit_common+0x1b5/0x4e0
  ? __pfx_wake_page_function+0x10/0x10
  __filemap_get_folio+0x73d/0x770
  shmem_get_folio_gfp+0x1fd/0xc80
  shmem_write_begin+0x91/0x220
  generic_perform_write+0x10e/0x2e0
  __generic_file_write_iter+0x17e/0x290
  ? generic_write_checks+0x12b/0x1a0
  generic_file_write_iter+0x97/0x180
  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
  do_iter_readv_writev+0x13c/0x210
  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
  do_iter_write+0xf6/0x330
  vfs_iter_write+0x46/0x70
  loop_process_work+0x723/0xfe0
  loop_rootcg_workfn+0x28/0x40
  process_one_work+0x3cc/0x8d0
  worker_thread+0x66/0x630
  ? __pfx_worker_thread+0x10/0x10
  kthread+0x153/0x190
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x29/0x50
  </TASK>

 INFO: task repro:1023 blocked for more than 147 seconds.
       Not tainted 6.2.0-rc4-kvm+ #1314
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
 Call Trace:
  <TASK>
  __schedule+0x43b/0xd00
  schedule+0x6a/0xf0
  io_schedule+0x4a/0x80
  folio_wait_bit_common+0x1b5/0x4e0
  ? compaction_alloc+0x77/0x1150
  ? __pfx_wake_page_function+0x10/0x10
  folio_wait_bit+0x30/0x40
  folio_wait_writeback+0x2e/0x1e0
  migrate_pages_batch+0x555/0x1ac0
  ? __pfx_compaction_alloc+0x10/0x10
  ? __pfx_compaction_free+0x10/0x10
  ? __this_cpu_preempt_check+0x17/0x20
  ? lock_is_held_type+0xe6/0x140
  migrate_pages+0x100e/0x1180
  ? __pfx_compaction_free+0x10/0x10
  ? __pfx_compaction_alloc+0x10/0x10
  compact_zone+0xe10/0x1b50
  ? lock_is_held_type+0xe6/0x140
  ? check_preemption_disabled+0x80/0xf0
  compact_node+0xa3/0x100
  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
  ? _find_first_bit+0x7b/0x90
  sysctl_compaction_handler+0x5d/0xb0
  proc_sys_call_handler+0x29d/0x420
  proc_sys_write+0x2b/0x40
  vfs_write+0x3a3/0x780
  ksys_write+0xb7/0x180
  __x64_sys_write+0x26/0x30
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
 RIP: 0033:0x7f3a2471f59d
 RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
  </TASK>

The page migration task holds the lock of the shmem folio A and is
waiting for the writeback of the folio B, which belongs to the file
system on the loop block device, to complete.  Meanwhile, the loop
worker task that writes back the folio B is waiting to lock the shmem
folio A, because the folio A backs the folio B in the loop device.
Thus a deadlock is triggered.

In general, if we have locked folios other than the one we are
migrating, it is not safe to wait synchronously, for example, to wait
for writeback to complete or to wait to lock a buffer head.
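
The dependency between the two tasks in the trace can be modelled as a
tiny wait-for graph.  This is only an illustrative sketch with
hypothetical names (toy_on_wait_cycle, TOY_NO_TASK); it is not how the
kernel detects or avoids the deadlock.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_TASK (-1)

/* Hypothetical task ids for the two tasks in the trace. */
enum { TOY_MIGRATION = 0, TOY_LOOP_WORKER = 1 };

/*
 * waits_on[t] is the task that task t is blocked on, or TOY_NO_TASK if
 * it is not blocked.  Follow the chain from "start" and report whether
 * it leads back to "start", i.e. whether "start" sits on a wait cycle.
 */
static bool toy_on_wait_cycle(const int *waits_on, int ntasks, int start)
{
	int t = start;
	int steps;

	assert(start >= 0 && start < ntasks);

	for (steps = 0; steps < ntasks; steps++) {
		t = waits_on[t];
		if (t == TOY_NO_TASK)
			return false;
		if (t == start)
			return true;
	}
	return false;
}
```

In the reported case the migration task waits on the loop worker (to
finish the writeback of B) while the loop worker waits on the migration
task (to unlock A), so waits_on is { 1, 0 } and both tasks are on a
cycle.  If the migration task does not wait, as in MIGRATE_ASYNC mode,
waits_on becomes { TOY_NO_TASK, 0 } and the cycle is gone.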

To fix the deadlock, this patch avoids batching page migration except
in MIGRATE_ASYNC mode, where synchronous waiting is avoided.
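
The effect on how migrate_pages() cuts its input list into batches can
be sketched in user space as below.  The helper toy_next_chunk() and
the constant TOY_MAX_BATCHED are hypothetical; the value 512 is only an
assumption for illustration, and the hugetlb handling of the real loop
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed stand-in for NR_MAX_BATCHED_MIGRATION; value is illustrative. */
#define TOY_MAX_BATCHED 512u

/*
 * nr_pages[i] is the number of pages in the i-th folio of the list.
 * Starting at index "start", return how many folios go into the next
 * batch: folios are added until the page count reaches the batch
 * limit, and the folio that reaches the limit is included.
 */
static size_t toy_next_chunk(const unsigned int *nr_pages, size_t n,
			     size_t start, bool async)
{
	unsigned int batch = async ? TOY_MAX_BATCHED : 1;
	unsigned int total = 0;
	size_t i;

	assert(start <= n);

	for (i = start; i < n; i++) {
		total += nr_pages[i];
		if (total >= batch)
			return i - start + 1;
	}
	/* Limit not reached: take everything that is left. */
	return n - start;
}
```

With three single-page folios, a synchronous caller gets batches of one
folio each, while an asynchronous caller gets all three in one batch.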

The fix can be improved further.  We will do that as soon as possible.

Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/migrate.c | 62 ++++++++++++++++------------------------------------
 1 file changed, 19 insertions(+), 43 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 37865f85df6d..7ac37dbbf307 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1106,7 +1106,7 @@ static void migrate_folio_done(struct folio *src,
 /* Obtain the lock on page, remove all ptes. */
 static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
 			       unsigned long private, struct folio *src,
-			       struct folio **dstp, int force, bool avoid_force_lock,
+			       struct folio **dstp, int force,
 			       enum migrate_mode mode, enum migrate_reason reason,
 			       struct list_head *ret)
 {
@@ -1157,17 +1157,6 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 		if (current->flags & PF_MEMALLOC)
 			goto out;
 
-		/*
-		 * We have locked some folios and are going to wait to lock
-		 * this folio.  To avoid a potential deadlock, let's bail
-		 * out and not do that. The locked folios will be moved and
-		 * unlocked, then we can wait to lock this folio.
-		 */
-		if (avoid_force_lock) {
-			rc = -EDEADLOCK;
-			goto out;
-		}
-
 		folio_lock(src);
 	}
 	locked = true;
@@ -1247,7 +1236,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 		/* Establish migration ptes */
 		VM_BUG_ON_FOLIO(folio_test_anon(src) &&
 			       !folio_test_ksm(src) && !anon_vma, src);
-		try_to_migrate(src, TTU_BATCH_FLUSH);
+		try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
 		page_was_mapped = 1;
 	}
 
@@ -1261,7 +1250,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 	 * A folio that has not been unmapped will be restored to
 	 * right list unless we want to retry.
 	 */
-	if (rc == -EAGAIN || rc == -EDEADLOCK)
+	if (rc == -EAGAIN)
 		ret = NULL;
 
 	migrate_folio_undo_src(src, page_was_mapped, anon_vma, locked, ret);
@@ -1634,11 +1623,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	LIST_HEAD(dst_folios);
 	bool nosplit = (reason == MR_NUMA_MISPLACED);
 	bool no_split_folio_counting = false;
-	bool avoid_force_lock;
 
 retry:
 	rc_saved = 0;
-	avoid_force_lock = false;
 	retry = 1;
 	for (pass = 0;
 	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
@@ -1683,15 +1670,14 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 			}
 
 			rc = migrate_folio_unmap(get_new_page, put_new_page, private,
-						 folio, &dst, pass > 2, avoid_force_lock,
-						 mode, reason, ret_folios);
+						 folio, &dst, pass > 2, mode,
+						 reason, ret_folios);
 			/*
 			 * The rules are:
 			 *	Success: folio will be freed
 			 *	Unmap: folio will be put on unmap_folios list,
 			 *	       dst folio put on dst_folios list
 			 *	-EAGAIN: stay on the from list
-			 *	-EDEADLOCK: stay on the from list
 			 *	-ENOMEM: stay on the from list
 			 *	Other errno: put on ret_folios list
 			 */
@@ -1743,14 +1729,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 					goto out;
 				else
 					goto move;
-			case -EDEADLOCK:
-				/*
-				 * The folio cannot be locked for potential deadlock.
-				 * Go move (and unlock) all locked folios.  Then we can
-				 * try again.
-				 */
-				rc_saved = rc;
-				goto move;
 			case -EAGAIN:
 				if (is_large) {
 					large_retry++;
@@ -1765,11 +1743,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				stats->nr_thp_succeeded += is_thp;
 				break;
 			case MIGRATEPAGE_UNMAP:
-				/*
-				 * We have locked some folios, don't force lock
-				 * to avoid deadlock.
-				 */
-				avoid_force_lock = true;
 				list_move_tail(&folio->lru, &unmap_folios);
 				list_add_tail(&dst->lru, &dst_folios);
 				break;
@@ -1894,17 +1867,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 		 */
 		list_splice_init(from, ret_folios);
 		list_splice_init(&split_folios, from);
+		/*
+		 * Force async mode to avoid to wait lock or bit when we have
+		 * locked more than one folios.
+		 */
+		mode = MIGRATE_ASYNC;
 		no_split_folio_counting = true;
 		goto retry;
 	}
 
-	/*
-	 * We have unlocked all locked folios, so we can force lock now, let's
-	 * try again.
-	 */
-	if (rc == -EDEADLOCK)
-		goto retry;
-
 	return rc;
 }
 
@@ -1939,7 +1910,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
 	int rc, rc_gather;
-	int nr_pages;
+	int nr_pages, batch;
 	struct folio *folio, *folio2;
 	LIST_HEAD(folios);
 	LIST_HEAD(ret_folios);
@@ -1953,6 +1924,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 				     mode, reason, &stats, &ret_folios);
 	if (rc_gather < 0)
 		goto out;
+
+	if (mode == MIGRATE_ASYNC)
+		batch = NR_MAX_BATCHED_MIGRATION;
+	else
+		batch = 1;
 again:
 	nr_pages = 0;
 	list_for_each_entry_safe(folio, folio2, from, lru) {
@@ -1963,11 +1939,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		}
 
 		nr_pages += folio_nr_pages(folio);
-		if (nr_pages > NR_MAX_BATCHED_MIGRATION)
+		if (nr_pages >= batch)
 			break;
 	}
-	if (nr_pages > NR_MAX_BATCHED_MIGRATION)
-		list_cut_before(&folios, from, &folio->lru);
+	if (nr_pages >= batch)
+		list_cut_before(&folios, from, &folio2->lru);
 	else
 		list_splice_init(from, &folios);
 	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
-- 
2.39.1



* [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch()
  2023-02-24 14:11 [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Huang Ying
  2023-02-24 14:11 ` [PATCH 1/3] migrate_pages: fix deadlock in batched migration Huang Ying
@ 2023-02-24 14:11 ` Huang Ying
  2023-03-01  2:23   ` Baolin Wang
  2023-02-24 14:11 ` [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly Huang Ying
  2023-02-26  4:55 ` [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Andrew Morton
  3 siblings, 1 reply; 22+ messages in thread
From: Huang Ying @ 2023-02-24 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

This simplifies the code logic and reduces the number of lines.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/migrate.c | 76 ++++++++++++++++++----------------------------------
 1 file changed, 26 insertions(+), 50 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 7ac37dbbf307..91198b487e49 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1605,9 +1605,10 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
 static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 		free_page_t put_new_page, unsigned long private,
 		enum migrate_mode mode, int reason, struct list_head *ret_folios,
-		struct migrate_pages_stats *stats)
+		struct list_head *split_folios, struct migrate_pages_stats *stats,
+		int nr_pass)
 {
-	int retry;
+	int retry = 1;
 	int large_retry = 1;
 	int thp_retry = 1;
 	int nr_failed = 0;
@@ -1617,19 +1618,12 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	bool is_large = false;
 	bool is_thp = false;
 	struct folio *folio, *folio2, *dst = NULL, *dst2;
-	int rc, rc_saved, nr_pages;
-	LIST_HEAD(split_folios);
+	int rc, rc_saved = 0, nr_pages;
 	LIST_HEAD(unmap_folios);
 	LIST_HEAD(dst_folios);
 	bool nosplit = (reason == MR_NUMA_MISPLACED);
-	bool no_split_folio_counting = false;
 
-retry:
-	rc_saved = 0;
-	retry = 1;
-	for (pass = 0;
-	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
-	     pass++) {
+	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
 		retry = 0;
 		large_retry = 0;
 		thp_retry = 0;
@@ -1660,7 +1654,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 			if (!thp_migration_supported() && is_thp) {
 				nr_large_failed++;
 				stats->nr_thp_failed++;
-				if (!try_split_folio(folio, &split_folios)) {
+				if (!try_split_folio(folio, split_folios)) {
 					stats->nr_thp_split++;
 					continue;
 				}
@@ -1692,7 +1686,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 					stats->nr_thp_failed += is_thp;
 					/* Large folio NUMA faulting doesn't split to retry. */
 					if (!nosplit) {
-						int ret = try_split_folio(folio, &split_folios);
+						int ret = try_split_folio(folio, split_folios);
 
 						if (!ret) {
 							stats->nr_thp_split += is_thp;
@@ -1709,18 +1703,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 							break;
 						}
 					}
-				} else if (!no_split_folio_counting) {
+				} else {
 					nr_failed++;
 				}
 
 				stats->nr_failed_pages += nr_pages + nr_retry_pages;
-				/*
-				 * There might be some split folios of fail-to-migrate large
-				 * folios left in split_folios list. Move them to ret_folios
-				 * list so that they could be put back to the right list by
-				 * the caller otherwise the folio refcnt will be leaked.
-				 */
-				list_splice_init(&split_folios, ret_folios);
 				/* nr_failed isn't updated for not used */
 				nr_large_failed += large_retry;
 				stats->nr_thp_failed += thp_retry;
@@ -1733,7 +1720,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				if (is_large) {
 					large_retry++;
 					thp_retry += is_thp;
-				} else if (!no_split_folio_counting) {
+				} else {
 					retry++;
 				}
 				nr_retry_pages += nr_pages;
@@ -1756,7 +1743,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				if (is_large) {
 					nr_large_failed++;
 					stats->nr_thp_failed += is_thp;
-				} else if (!no_split_folio_counting) {
+				} else {
 					nr_failed++;
 				}
 
@@ -1774,9 +1761,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	try_to_unmap_flush();
 
 	retry = 1;
-	for (pass = 0;
-	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
-	     pass++) {
+	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
 		retry = 0;
 		large_retry = 0;
 		thp_retry = 0;
@@ -1805,7 +1790,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				if (is_large) {
 					large_retry++;
 					thp_retry += is_thp;
-				} else if (!no_split_folio_counting) {
+				} else {
 					retry++;
 				}
 				nr_retry_pages += nr_pages;
@@ -1818,7 +1803,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 				if (is_large) {
 					nr_large_failed++;
 					stats->nr_thp_failed += is_thp;
-				} else if (!no_split_folio_counting) {
+				} else {
 					nr_failed++;
 				}
 
@@ -1855,27 +1840,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 		dst2 = list_next_entry(dst, lru);
 	}
 
-	/*
-	 * Try to migrate split folios of fail-to-migrate large folios, no
-	 * nr_failed counting in this round, since all split folios of a
-	 * large folio is counted as 1 failure in the first round.
-	 */
-	if (rc >= 0 && !list_empty(&split_folios)) {
-		/*
-		 * Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
-		 * retries) to ret_folios to avoid migrating them again.
-		 */
-		list_splice_init(from, ret_folios);
-		list_splice_init(&split_folios, from);
-		/*
-		 * Force async mode to avoid to wait lock or bit when we have
-		 * locked more than one folios.
-		 */
-		mode = MIGRATE_ASYNC;
-		no_split_folio_counting = true;
-		goto retry;
-	}
-
 	return rc;
 }
 
@@ -1914,6 +1878,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	struct folio *folio, *folio2;
 	LIST_HEAD(folios);
 	LIST_HEAD(ret_folios);
+	LIST_HEAD(split_folios);
 	struct migrate_pages_stats stats;
 
 	trace_mm_migrate_pages_start(mode, reason);
@@ -1947,12 +1912,23 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	else
 		list_splice_init(from, &folios);
 	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
-				 mode, reason, &ret_folios, &stats);
+				 mode, reason, &ret_folios, &split_folios, &stats,
+				 NR_MAX_MIGRATE_PAGES_RETRY);
 	list_splice_tail_init(&folios, &ret_folios);
 	if (rc < 0) {
 		rc_gather = rc;
+		list_splice_tail(&split_folios, &ret_folios);
 		goto out;
 	}
+	if (!list_empty(&split_folios)) {
+		/*
+		 * Failure isn't counted since all split folios of a large folio
+		 * is counted as 1 failure already.
+		 */
+		migrate_pages_batch(&split_folios, get_new_page, put_new_page, private,
+				    MIGRATE_ASYNC, reason, &ret_folios, NULL, &stats, 1);
+		list_splice_tail_init(&split_folios, &ret_folios);
+	}
 	rc_gather += rc;
 	if (!list_empty(from))
 		goto again;
-- 
2.39.1



* [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-02-24 14:11 [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Huang Ying
  2023-02-24 14:11 ` [PATCH 1/3] migrate_pages: fix deadlock in batched migration Huang Ying
  2023-02-24 14:11 ` [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch() Huang Ying
@ 2023-02-24 14:11 ` Huang Ying
  2023-02-28  6:36   ` Hugh Dickins
  2023-03-01  3:08   ` Baolin Wang
  2023-02-26  4:55 ` [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Andrew Morton
  3 siblings, 2 replies; 22+ messages in thread
From: Huang Ying @ 2023-02-24 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

When we have locked more than one folio, we cannot wait synchronously
for a lock or bit (e.g., page lock, buffer head lock, writeback bit).
Otherwise a deadlock may be triggered.  This makes it hard to batch
synchronous migration directly.

This patch re-enables batching for synchronous migration by first
trying to migrate the batch asynchronously.  Any folios that fail to
migrate asynchronously are then migrated synchronously, one by one.

Testing shows that this effectively restores the TLB flush batching
performance for synchronous migration.
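
To give a feel for why the fallback count matters, here is a simplified
counting model, not a measurement.  It assumes one TLB flush for all
folios unmapped in the async batch and one flush per folio migrated in
the synchronous fallback; the function names are hypothetical and the
real flush behaviour depends on the architecture and workload.

```c
#include <assert.h>

/* Without batching, assume every migrated folio costs one flush. */
static unsigned int toy_flushes_unbatched(unsigned int nfolios)
{
	return nfolios;
}

/*
 * Async-first: folios migrated in the async batch are assumed to share
 * a single flush, and each folio that falls back to synchronous
 * migration is assumed to cost one flush of its own.
 */
static unsigned int toy_flushes_async_first(unsigned int nfolios,
					    unsigned int nfallback)
{
	unsigned int batched;

	assert(nfallback <= nfolios);

	batched = nfolios - nfallback;
	return (batched ? 1 : 0) + nfallback;
}
```

Under these assumptions, a batch of 512 folios with 3 fallbacks costs 4
flushes instead of 512, so the saving stays large as long as most
folios migrate in the async pass.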

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 55 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 91198b487e49..c17ce5ee8d92 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
 	return rc;
 }
 
+static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
+		free_page_t put_new_page, unsigned long private,
+		enum migrate_mode mode, int reason, struct list_head *ret_folios,
+		struct list_head *split_folios, struct migrate_pages_stats *stats)
+{
+	int rc, nr_failed = 0;
+	LIST_HEAD(folios);
+	struct migrate_pages_stats astats;
+
+	memset(&astats, 0, sizeof(astats));
+	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
+	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
+				 reason, &folios, split_folios, &astats,
+				 NR_MAX_MIGRATE_PAGES_RETRY);
+	stats->nr_succeeded += astats.nr_succeeded;
+	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
+	stats->nr_thp_split += astats.nr_thp_split;
+	if (rc < 0) {
+		stats->nr_failed_pages += astats.nr_failed_pages;
+		stats->nr_thp_failed += astats.nr_thp_failed;
+		list_splice_tail(&folios, ret_folios);
+		return rc;
+	}
+	stats->nr_thp_failed += astats.nr_thp_split;
+	nr_failed += astats.nr_thp_split;
+	/*
+	 * Fall back to migrate all failed folios one by one synchronously. All
+	 * failed folios except split THPs will be retried, so their failure
+	 * isn't counted
+	 */
+	list_splice_tail_init(&folios, from);
+	while (!list_empty(from)) {
+		list_move(from->next, &folios);
+		rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
+					 private, mode, reason, ret_folios,
+					 split_folios, stats, NR_MAX_MIGRATE_PAGES_RETRY);
+		list_splice_tail_init(&folios, ret_folios);
+		if (rc < 0)
+			return rc;
+		nr_failed += rc;
+	}
+
+	return nr_failed;
+}
+
 /*
  * migrate_pages - migrate the folios specified in a list, to the free folios
  *		   supplied as the target for the page migration
@@ -1874,7 +1919,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
 	int rc, rc_gather;
-	int nr_pages, batch;
+	int nr_pages;
 	struct folio *folio, *folio2;
 	LIST_HEAD(folios);
 	LIST_HEAD(ret_folios);
@@ -1890,10 +1935,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	if (rc_gather < 0)
 		goto out;
 
-	if (mode == MIGRATE_ASYNC)
-		batch = NR_MAX_BATCHED_MIGRATION;
-	else
-		batch = 1;
 again:
 	nr_pages = 0;
 	list_for_each_entry_safe(folio, folio2, from, lru) {
@@ -1904,16 +1945,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		}
 
 		nr_pages += folio_nr_pages(folio);
-		if (nr_pages >= batch)
+		if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
 			break;
 	}
-	if (nr_pages >= batch)
+	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
 		list_cut_before(&folios, from, &folio2->lru);
 	else
 		list_splice_init(from, &folios);
-	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
-				 mode, reason, &ret_folios, &split_folios, &stats,
-				 NR_MAX_MIGRATE_PAGES_RETRY);
+	if (mode == MIGRATE_ASYNC)
+		rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
+					 mode, reason, &ret_folios, &split_folios, &stats,
+					 NR_MAX_MIGRATE_PAGES_RETRY);
+	else
+		rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
+					mode, reason, &ret_folios, &split_folios, &stats);
 	list_splice_tail_init(&folios, &ret_folios);
 	if (rc < 0) {
 		rc_gather = rc;
-- 
2.39.1



* Re: [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration
  2023-02-24 14:11 [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Huang Ying
                   ` (2 preceding siblings ...)
  2023-02-24 14:11 ` [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly Huang Ying
@ 2023-02-26  4:55 ` Andrew Morton
  2023-02-27  1:25   ` Huang, Ying
  3 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2023-02-26  4:55 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

On Fri, 24 Feb 2023 22:11:42 +0800 Huang Ying <ying.huang@intel.com> wrote:

> Two deadlock bugs were reported for the migrate_pages() batching
> series.

"migrate_pages(): batch TLB flushing"

>  Thanks Hugh and Pengfei.  Analysis shows that if we have
> locked some other folios except the one we are migrating, it's not
> safe in general to wait synchronously, for example, to wait the
> writeback to complete or wait to lock the buffer head.
> 
> So 1/3 fixes the deadlock in a simple way, where the batching support
> for the synchronous migration is disabled.  The change is
> straightforward and easy to be understood.  While 3/3 re-introduce the
> batching for synchronous migration via trying to migrate
> asynchronously in batch optimistically, then fall back to migrate
> synchronously one by one for fail-to-migrate folios.  Test shows that
> this can restore the TLB flushing batching performance for synchronous
> migration effectively.

If anyone backports the "migrate_pages(): batch TLB flushing" series
into their kernels, they will want to know about such fixes.  So we can
help them by providing suitable Link: tags.

Such a Link: may also be helpful to people who are performing git
bisection searches for some issue but who keep stumbling over the
issues which this series addresses.

Being lazy, I slapped

Fixes: 6f7d760e86fa ("migrate_pages: move THP/hugetlb migration support check to simplify code")

on all three, as this was the final patch in that series.  Inaccurate,
but it means that these fixes will land in a suitable place if anyone
needs them.


* Re: [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration
  2023-02-26  4:55 ` [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Andrew Morton
@ 2023-02-27  1:25   ` Huang, Ying
  0 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2023-02-27  1:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 24 Feb 2023 22:11:42 +0800 Huang Ying <ying.huang@intel.com> wrote:
>
>> Two deadlock bugs were reported for the migrate_pages() batching
>> series.
>
> "migrate_pages(): batch TLB flushing"

Yes.  I should have written it that way.

>>  Thanks Hugh and Pengfei.  Analysis shows that if we have
>> locked some other folios except the one we are migrating, it's not
>> safe in general to wait synchronously, for example, to wait the
>> writeback to complete or wait to lock the buffer head.
>> 
>> So 1/3 fixes the deadlock in a simple way, where the batching support
>> for the synchronous migration is disabled.  The change is
>> straightforward and easy to be understood.  While 3/3 re-introduce the
>> batching for synchronous migration via trying to migrate
>> asynchronously in batch optimistically, then fall back to migrate
>> synchronously one by one for fail-to-migrate folios.  Test shows that
>> this can restore the TLB flushing batching performance for synchronous
>> migration effectively.
>
> If anyone backports the "migrate_pages(): batch TLB flushing" series
> into their kernels, they will want to know about such fixes.  So we can
> help them by providing suitable Link: tags.
>
> Such a Link: may also be helpful to people who are performing git
> bisection searches for some issue but who keep stumbling over the
> issues which this series addresses.
>
> Being lazy, I slapped
>
> Fixes: 6f7d760e86fa ("migrate_pages: move THP/hugetlb migration support check to simplify code")
>
> on all three, as this was the final patch in that series.  Inaccurate,
> but it means that these fixes will land in a suitable place if anyone
> needs them.

Sorry.  I should have added the "Fixes:" tag.  I will be more careful
in the future.  And I will add proper "Link:" tags too.

Best Regards,
Huang, Ying


* Re: [PATCH 1/3] migrate_pages: fix deadlock in batched migration
  2023-02-24 14:11 ` [PATCH 1/3] migrate_pages: fix deadlock in batched migration Huang Ying
@ 2023-02-28  6:13   ` Hugh Dickins
  2023-02-28  7:22     ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2023-02-28  6:13 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

On Fri, 24 Feb 2023, Huang Ying wrote:

> Two deadlock bugs were reported for the migrate_pages() batching
> series.  Thanks Hugh and Pengfei!  For example, in the following
> deadlock trace snippet,
> 
>  INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
>        Not tainted 6.2.0-rc4-kvm+ #1314
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
>  Workqueue: loop4 loop_rootcg_workfn
>  Call Trace:
>   <TASK>
>   __schedule+0x43b/0xd00
>   schedule+0x6a/0xf0
>   io_schedule+0x4a/0x80
>   folio_wait_bit_common+0x1b5/0x4e0
>   ? __pfx_wake_page_function+0x10/0x10
>   __filemap_get_folio+0x73d/0x770
>   shmem_get_folio_gfp+0x1fd/0xc80
>   shmem_write_begin+0x91/0x220
>   generic_perform_write+0x10e/0x2e0
>   __generic_file_write_iter+0x17e/0x290
>   ? generic_write_checks+0x12b/0x1a0
>   generic_file_write_iter+0x97/0x180
>   ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>   do_iter_readv_writev+0x13c/0x210
>   ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>   do_iter_write+0xf6/0x330
>   vfs_iter_write+0x46/0x70
>   loop_process_work+0x723/0xfe0
>   loop_rootcg_workfn+0x28/0x40
>   process_one_work+0x3cc/0x8d0
>   worker_thread+0x66/0x630
>   ? __pfx_worker_thread+0x10/0x10
>   kthread+0x153/0x190
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x29/0x50
>   </TASK>
> 
>  INFO: task repro:1023 blocked for more than 147 seconds.
>        Not tainted 6.2.0-rc4-kvm+ #1314
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
>  Call Trace:
>   <TASK>
>   __schedule+0x43b/0xd00
>   schedule+0x6a/0xf0
>   io_schedule+0x4a/0x80
>   folio_wait_bit_common+0x1b5/0x4e0
>   ? compaction_alloc+0x77/0x1150
>   ? __pfx_wake_page_function+0x10/0x10
>   folio_wait_bit+0x30/0x40
>   folio_wait_writeback+0x2e/0x1e0
>   migrate_pages_batch+0x555/0x1ac0
>   ? __pfx_compaction_alloc+0x10/0x10
>   ? __pfx_compaction_free+0x10/0x10
>   ? __this_cpu_preempt_check+0x17/0x20
>   ? lock_is_held_type+0xe6/0x140
>   migrate_pages+0x100e/0x1180
>   ? __pfx_compaction_free+0x10/0x10
>   ? __pfx_compaction_alloc+0x10/0x10
>   compact_zone+0xe10/0x1b50
>   ? lock_is_held_type+0xe6/0x140
>   ? check_preemption_disabled+0x80/0xf0
>   compact_node+0xa3/0x100
>   ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
>   ? _find_first_bit+0x7b/0x90
>   sysctl_compaction_handler+0x5d/0xb0
>   proc_sys_call_handler+0x29d/0x420
>   proc_sys_write+0x2b/0x40
>   vfs_write+0x3a3/0x780
>   ksys_write+0xb7/0x180
>   __x64_sys_write+0x26/0x30
>   do_syscall_64+0x3b/0x90
>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>  RIP: 0033:0x7f3a2471f59d
>  RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
>  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
>  RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
>  R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
>  R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
> 
> The page migration task has held the lock of the shmem folio A, and is
> waiting the writeback of the folio B of the file system on the loop
> block device to complete.  While the loop worker task which writes
> back the folio B is waiting to lock the shmem folio A, because the
> folio A backs the folio B in the loop device.  Thus deadlock is
> triggered.
> 
> In general, if we have locked some other folios except the one we are
> migrating, it's not safe to wait synchronously, for example, to wait
> the writeback to complete or wait to lock the buffer head.
> 
> To fix the deadlock, in this patch, we avoid to batch the page
> migration except for MIGRATE_ASYNC mode.  In MIGRATE_ASYNC mode,
> synchronous waiting is avoided.
> 
> The fix can be improved further.  We will do that as soon as possible.
> 
> Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
> Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Reported-by: Hugh Dickins <hughd@google.com>
> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Xin Hao <xhao@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/migrate.c | 62 ++++++++++++++++------------------------------------
>  1 file changed, 19 insertions(+), 43 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 37865f85df6d..7ac37dbbf307 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1106,7 +1106,7 @@ static void migrate_folio_done(struct folio *src,
>  /* Obtain the lock on page, remove all ptes. */
>  static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
>  			       unsigned long private, struct folio *src,
> -			       struct folio **dstp, int force, bool avoid_force_lock,
> +			       struct folio **dstp, int force,
>  			       enum migrate_mode mode, enum migrate_reason reason,
>  			       struct list_head *ret)
>  {
> @@ -1157,17 +1157,6 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>  		if (current->flags & PF_MEMALLOC)
>  			goto out;
>  
> -		/*
> -		 * We have locked some folios and are going to wait to lock
> -		 * this folio.  To avoid a potential deadlock, let's bail
> -		 * out and not do that. The locked folios will be moved and
> -		 * unlocked, then we can wait to lock this folio.
> -		 */
> -		if (avoid_force_lock) {
> -			rc = -EDEADLOCK;
> -			goto out;
> -		}
> -
>  		folio_lock(src);
>  	}
>  	locked = true;
> @@ -1247,7 +1236,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>  		/* Establish migration ptes */
>  		VM_BUG_ON_FOLIO(folio_test_anon(src) &&
>  			       !folio_test_ksm(src) && !anon_vma, src);
> -		try_to_migrate(src, TTU_BATCH_FLUSH);
> +		try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);

Why that change, I wonder?  The TTU_BATCH_FLUSH can still be useful for
gathering multiple cross-CPU TLB flushes into one, even when it's only
a single page in the batch.


>  		page_was_mapped = 1;
>  	}
>  
> @@ -1261,7 +1250,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>  	 * A folio that has not been unmapped will be restored to
>  	 * right list unless we want to retry.
>  	 */
> -	if (rc == -EAGAIN || rc == -EDEADLOCK)
> +	if (rc == -EAGAIN)
>  		ret = NULL;
>  
>  	migrate_folio_undo_src(src, page_was_mapped, anon_vma, locked, ret);
> @@ -1634,11 +1623,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  	LIST_HEAD(dst_folios);
>  	bool nosplit = (reason == MR_NUMA_MISPLACED);
>  	bool no_split_folio_counting = false;
> -	bool avoid_force_lock;
>  
>  retry:
>  	rc_saved = 0;
> -	avoid_force_lock = false;
>  	retry = 1;
>  	for (pass = 0;
>  	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
> @@ -1683,15 +1670,14 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  			}
>  
>  			rc = migrate_folio_unmap(get_new_page, put_new_page, private,
> -						 folio, &dst, pass > 2, avoid_force_lock,
> -						 mode, reason, ret_folios);
> +						 folio, &dst, pass > 2, mode,
> +						 reason, ret_folios);
>  			/*
>  			 * The rules are:
>  			 *	Success: folio will be freed
>  			 *	Unmap: folio will be put on unmap_folios list,
>  			 *	       dst folio put on dst_folios list
>  			 *	-EAGAIN: stay on the from list
> -			 *	-EDEADLOCK: stay on the from list
>  			 *	-ENOMEM: stay on the from list
>  			 *	Other errno: put on ret_folios list
>  			 */
> @@ -1743,14 +1729,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  					goto out;
>  				else
>  					goto move;
> -			case -EDEADLOCK:
> -				/*
> -				 * The folio cannot be locked for potential deadlock.
> -				 * Go move (and unlock) all locked folios.  Then we can
> -				 * try again.
> -				 */
> -				rc_saved = rc;
> -				goto move;
>  			case -EAGAIN:
>  				if (is_large) {
>  					large_retry++;
> @@ -1765,11 +1743,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  				stats->nr_thp_succeeded += is_thp;
>  				break;
>  			case MIGRATEPAGE_UNMAP:
> -				/*
> -				 * We have locked some folios, don't force lock
> -				 * to avoid deadlock.
> -				 */
> -				avoid_force_lock = true;
>  				list_move_tail(&folio->lru, &unmap_folios);
>  				list_add_tail(&dst->lru, &dst_folios);
>  				break;
> @@ -1894,17 +1867,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  		 */
>  		list_splice_init(from, ret_folios);
>  		list_splice_init(&split_folios, from);
> +		/*
> +		 * Force async mode to avoid to wait lock or bit when we have
> +		 * locked more than one folios.
> +		 */
> +		mode = MIGRATE_ASYNC;

It goes away in a later patch anyway, but I didn't understand that change -
I thought this was a point at which no locks are held.  Oh, perhaps I get
it now: because the batch of 1 is here becoming a batch of HPAGE_PMD_NR?

>  		no_split_folio_counting = true;
>  		goto retry;
>  	}
>  
> -	/*
> -	 * We have unlocked all locked folios, so we can force lock now, let's
> -	 * try again.
> -	 */
> -	if (rc == -EDEADLOCK)
> -		goto retry;
> -
>  	return rc;
>  }
>  
> @@ -1939,7 +1910,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>  {
>  	int rc, rc_gather;
> -	int nr_pages;
> +	int nr_pages, batch;
>  	struct folio *folio, *folio2;
>  	LIST_HEAD(folios);
>  	LIST_HEAD(ret_folios);
> @@ -1953,6 +1924,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  				     mode, reason, &stats, &ret_folios);
>  	if (rc_gather < 0)
>  		goto out;
> +
> +	if (mode == MIGRATE_ASYNC)
> +		batch = NR_MAX_BATCHED_MIGRATION;
> +	else
> +		batch = 1;
>  again:
>  	nr_pages = 0;
>  	list_for_each_entry_safe(folio, folio2, from, lru) {
> @@ -1963,11 +1939,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		}
>  
>  		nr_pages += folio_nr_pages(folio);
> -		if (nr_pages > NR_MAX_BATCHED_MIGRATION)
> +		if (nr_pages >= batch)
>  			break;

Yes, the off-by-one fixes look good.

>  	}
> -	if (nr_pages > NR_MAX_BATCHED_MIGRATION)
> -		list_cut_before(&folios, from, &folio->lru);
> +	if (nr_pages >= batch)
> +		list_cut_before(&folios, from, &folio2->lru);
>  	else
>  		list_splice_init(from, &folios);
>  	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> -- 
> 2.39.1


* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-02-24 14:11 ` [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly Huang Ying
@ 2023-02-28  6:36   ` Hugh Dickins
  2023-02-28  7:45     ` Huang, Ying
  2023-03-01  3:08   ` Baolin Wang
  1 sibling, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2023-02-28  6:36 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

On Fri, 24 Feb 2023, Huang Ying wrote:

> When we have locked more than one folios, we cannot wait the lock or
> bit (e.g., page lock, buffer head lock, writeback bit) synchronously.
> Otherwise deadlock may be triggered.  This make it hard to batch the
> synchronous migration directly.
> 
> This patch re-enables batching synchronous migration via trying to
> migrate in batch asynchronously firstly.  And any folios that are
> failed to be migrated asynchronously will be migrated synchronously
> one by one.
> 
> Test shows that this can restore the TLB flushing batching performance
> for synchronous migration effectively.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Hugh Dickins <hughd@google.com>

I'm not sure whether my 48 hours on two machines counts for a
Tested-by: Hugh Dickins <hughd@google.com>
or not; but it certainly looks like you've fixed my deadlock.

> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Xin Hao <xhao@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 55 insertions(+), 10 deletions(-)

I was initially disappointed, that this was more complicated than I had
thought it should be; but came to understand why.  My "change the mode
to MIGRATE_ASYNC after the first" model would have condemned most of the
MIGRATE_SYNC batch of pages to be handled as lightly as MIGRATE_ASYNC:
not good enough, you're right to be trying harder here.
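
For readers following along, the two-phase shape described in the commit
message (try the whole batch without waiting, then retry only the
leftovers one at a time with waiting allowed) can be modelled in a few
lines of standalone C.  This is only a sketch: struct sketch_folio, its
fields, and the sketch_* functions are hypothetical stand-ins and are not
the kernel's types or the code in mm/migrate.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a folio; not the kernel's struct folio. */
struct sketch_folio {
	int needs_wait;	/* migrating it would require waiting on a lock/bit */
	int unmovable;	/* cannot be migrated even when waiting is allowed */
	int migrated;	/* set once migration succeeded */
};

/* Async attempt: never waits, so a folio that needs waiting is left alone. */
static int sketch_try_async(struct sketch_folio *f)
{
	if (f->needs_wait || f->unmovable)
		return 0;
	f->migrated = 1;
	return 1;
}

/* Sync attempt on one folio: waiting is allowed, nothing else is locked. */
static int sketch_try_sync(struct sketch_folio *f)
{
	if (f->unmovable)
		return 0;
	f->migrated = 1;
	return 1;
}

/* Returns the number of folios that could not be migrated at all. */
static size_t sketch_migrate_sync(struct sketch_folio *folios, size_t n)
{
	size_t i, nr_failed = 0;

	assert(folios != NULL || n == 0);

	/* Phase 1: the whole batch, without waiting. */
	for (i = 0; i < n; i++)
		sketch_try_async(&folios[i]);

	/* Phase 2: only the leftovers, one by one, waiting allowed. */
	for (i = 0; i < n; i++) {
		if (folios[i].migrated)
			continue;
		if (!sketch_try_sync(&folios[i]))
			nr_failed++;
	}
	return nr_failed;
}
```

The point of the model is that a folio which merely needs waiting still
gets the full synchronous treatment in phase 2, rather than being given up
on as it would be under a pure MIGRATE_ASYNC pass.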

> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 91198b487e49..c17ce5ee8d92 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>  	return rc;
>  }
>  
> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
> +		free_page_t put_new_page, unsigned long private,
> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
> +{
> +	int rc, nr_failed = 0;
> +	LIST_HEAD(folios);
> +	struct migrate_pages_stats astats;
> +
> +	memset(&astats, 0, sizeof(astats));
> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
> +				 reason, &folios, split_folios, &astats,
> +				 NR_MAX_MIGRATE_PAGES_RETRY);

I wonder if that and below would better be NR_MAX_MIGRATE_PAGES_RETRY / 2.

Though I've never got down to adjusting that number (and it's not a job
to be done in this set of patches), those 10 retries sometimes terrify
me, from a latency point of view.  They can have such different weights:
in the unmapped case, 10 retries is okay; but when a pinned page is mapped
into 1000 processes, the thought of all that unmapping and TLB flushing
and remapping is terrifying.

Since you're retrying below, halve both numbers of retries for now?
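
To put rough numbers on that worry, here is a deliberately crude model.
It assumes that every retry of a folio that never migrates unmaps and
remaps each of its mappings once; the helper name is hypothetical and the
model ignores everything else that contributes to latency.

```c
#include <assert.h>

/*
 * Crude upper bound on per-mapping unmap/remap rounds for one folio that
 * fails on every retry.  Hypothetical helper, not a kernel function.
 */
static long sketch_worst_case_rounds(long nr_mappings, int nr_retries)
{
	assert(nr_mappings >= 0 && nr_retries >= 0);
	return nr_mappings * nr_retries;
}
```

Under that assumption, a page mapped into 1000 processes costs 10000
rounds with 10 retries and 5000 with the halved count.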

> +	stats->nr_succeeded += astats.nr_succeeded;
> +	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
> +	stats->nr_thp_split += astats.nr_thp_split;
> +	if (rc < 0) {
> +		stats->nr_failed_pages += astats.nr_failed_pages;
> +		stats->nr_thp_failed += astats.nr_thp_failed;
> +		list_splice_tail(&folios, ret_folios);
> +		return rc;
> +	}
> +	stats->nr_thp_failed += astats.nr_thp_split;
> +	nr_failed += astats.nr_thp_split;
> +	/*
> +	 * Fall back to migrate all failed folios one by one synchronously. All
> +	 * failed folios except split THPs will be retried, so their failure
> +	 * isn't counted
> +	 */
> +	list_splice_tail_init(&folios, from);
> +	while (!list_empty(from)) {
> +		list_move(from->next, &folios);
> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
> +					 private, mode, reason, ret_folios,
> +					 split_folios, stats, NR_MAX_MIGRATE_PAGES_RETRY);

NR_MAX_MIGRATE_PAGES_RETRY / 2 ?

> +		list_splice_tail_init(&folios, ret_folios);
> +		if (rc < 0)
> +			return rc;
> +		nr_failed += rc;
> +	}
> +
> +	return nr_failed;
> +}
> +
>  /*
>   * migrate_pages - migrate the folios specified in a list, to the free folios
>   *		   supplied as the target for the page migration
> @@ -1874,7 +1919,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>  {
>  	int rc, rc_gather;
> -	int nr_pages, batch;
> +	int nr_pages;
>  	struct folio *folio, *folio2;
>  	LIST_HEAD(folios);
>  	LIST_HEAD(ret_folios);
> @@ -1890,10 +1935,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	if (rc_gather < 0)
>  		goto out;
>  
> -	if (mode == MIGRATE_ASYNC)
> -		batch = NR_MAX_BATCHED_MIGRATION;
> -	else
> -		batch = 1;
>  again:
>  	nr_pages = 0;
>  	list_for_each_entry_safe(folio, folio2, from, lru) {
> @@ -1904,16 +1945,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		}
>  
>  		nr_pages += folio_nr_pages(folio);
> -		if (nr_pages >= batch)
> +		if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>  			break;
>  	}
> -	if (nr_pages >= batch)
> +	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>  		list_cut_before(&folios, from, &folio2->lru);
>  	else
>  		list_splice_init(from, &folios);
> -	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> -				 mode, reason, &ret_folios, &split_folios, &stats,
> -				 NR_MAX_MIGRATE_PAGES_RETRY);
> +	if (mode == MIGRATE_ASYNC)
> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> +					 mode, reason, &ret_folios, &split_folios, &stats,
> +					 NR_MAX_MIGRATE_PAGES_RETRY);
> +	else
> +		rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
> +					mode, reason, &ret_folios, &split_folios, &stats);
>  	list_splice_tail_init(&folios, &ret_folios);
>  	if (rc < 0) {
>  		rc_gather = rc;
> -- 
> 2.39.1


* Re: [PATCH 1/3] migrate_pages: fix deadlock in batched migration
  2023-02-28  6:13   ` Hugh Dickins
@ 2023-02-28  7:22     ` Huang, Ying
  2023-02-28 21:07       ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2023-02-28  7:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Hi, Hugh,

Thank you very much for the review!

Hugh Dickins <hughd@google.com> writes:

> On Fri, 24 Feb 2023, Huang Ying wrote:
>
>> Two deadlock bugs were reported for the migrate_pages() batching
>> series.  Thanks Hugh and Pengfei!  For example, in the following
>> deadlock trace snippet,
>> 
>>  INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
>>        Not tainted 6.2.0-rc4-kvm+ #1314
>>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>  task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
>>  Workqueue: loop4 loop_rootcg_workfn
>>  Call Trace:
>>   <TASK>
>>   __schedule+0x43b/0xd00
>>   schedule+0x6a/0xf0
>>   io_schedule+0x4a/0x80
>>   folio_wait_bit_common+0x1b5/0x4e0
>>   ? __pfx_wake_page_function+0x10/0x10
>>   __filemap_get_folio+0x73d/0x770
>>   shmem_get_folio_gfp+0x1fd/0xc80
>>   shmem_write_begin+0x91/0x220
>>   generic_perform_write+0x10e/0x2e0
>>   __generic_file_write_iter+0x17e/0x290
>>   ? generic_write_checks+0x12b/0x1a0
>>   generic_file_write_iter+0x97/0x180
>>   ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>>   do_iter_readv_writev+0x13c/0x210
>>   ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>>   do_iter_write+0xf6/0x330
>>   vfs_iter_write+0x46/0x70
>>   loop_process_work+0x723/0xfe0
>>   loop_rootcg_workfn+0x28/0x40
>>   process_one_work+0x3cc/0x8d0
>>   worker_thread+0x66/0x630
>>   ? __pfx_worker_thread+0x10/0x10
>>   kthread+0x153/0x190
>>   ? __pfx_kthread+0x10/0x10
>>   ret_from_fork+0x29/0x50
>>   </TASK>
>> 
>>  INFO: task repro:1023 blocked for more than 147 seconds.
>>        Not tainted 6.2.0-rc4-kvm+ #1314
>>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>  task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
>>  Call Trace:
>>   <TASK>
>>   __schedule+0x43b/0xd00
>>   schedule+0x6a/0xf0
>>   io_schedule+0x4a/0x80
>>   folio_wait_bit_common+0x1b5/0x4e0
>>   ? compaction_alloc+0x77/0x1150
>>   ? __pfx_wake_page_function+0x10/0x10
>>   folio_wait_bit+0x30/0x40
>>   folio_wait_writeback+0x2e/0x1e0
>>   migrate_pages_batch+0x555/0x1ac0
>>   ? __pfx_compaction_alloc+0x10/0x10
>>   ? __pfx_compaction_free+0x10/0x10
>>   ? __this_cpu_preempt_check+0x17/0x20
>>   ? lock_is_held_type+0xe6/0x140
>>   migrate_pages+0x100e/0x1180
>>   ? __pfx_compaction_free+0x10/0x10
>>   ? __pfx_compaction_alloc+0x10/0x10
>>   compact_zone+0xe10/0x1b50
>>   ? lock_is_held_type+0xe6/0x140
>>   ? check_preemption_disabled+0x80/0xf0
>>   compact_node+0xa3/0x100
>>   ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
>>   ? _find_first_bit+0x7b/0x90
>>   sysctl_compaction_handler+0x5d/0xb0
>>   proc_sys_call_handler+0x29d/0x420
>>   proc_sys_write+0x2b/0x40
>>   vfs_write+0x3a3/0x780
>>   ksys_write+0xb7/0x180
>>   __x64_sys_write+0x26/0x30
>>   do_syscall_64+0x3b/0x90
>>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>  RIP: 0033:0x7f3a2471f59d
>>  RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
>>  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
>>  RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
>>  R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
>>  R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
>>   </TASK>
>> 
>> The page migration task has held the lock of the shmem folio A, and is
>> waiting the writeback of the folio B of the file system on the loop
>> block device to complete.  While the loop worker task which writes
>> back the folio B is waiting to lock the shmem folio A, because the
>> folio A backs the folio B in the loop device.  Thus deadlock is
>> triggered.
>> 
>> In general, if we have locked some other folios except the one we are
>> migrating, it's not safe to wait synchronously, for example, to wait
>> the writeback to complete or wait to lock the buffer head.
>> 
>> To fix the deadlock, in this patch, we avoid to batch the page
>> migration except for MIGRATE_ASYNC mode.  In MIGRATE_ASYNC mode,
>> synchronous waiting is avoided.
>> 
>> The fix can be improved further.  We will do that as soon as possible.
>> 
>> Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
>> Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Reported-by: Hugh Dickins <hughd@google.com>
>> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Stefan Roesch <shr@devkernel.io>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Xin Hao <xhao@linux.alibaba.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/migrate.c | 62 ++++++++++++++++------------------------------------
>>  1 file changed, 19 insertions(+), 43 deletions(-)
>> 
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 37865f85df6d..7ac37dbbf307 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1106,7 +1106,7 @@ static void migrate_folio_done(struct folio *src,
>>  /* Obtain the lock on page, remove all ptes. */
>>  static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
>>  			       unsigned long private, struct folio *src,
>> -			       struct folio **dstp, int force, bool avoid_force_lock,
>> +			       struct folio **dstp, int force,
>>  			       enum migrate_mode mode, enum migrate_reason reason,
>>  			       struct list_head *ret)
>>  {
>> @@ -1157,17 +1157,6 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>>  		if (current->flags & PF_MEMALLOC)
>>  			goto out;
>>  
>> -		/*
>> -		 * We have locked some folios and are going to wait to lock
>> -		 * this folio.  To avoid a potential deadlock, let's bail
>> -		 * out and not do that. The locked folios will be moved and
>> -		 * unlocked, then we can wait to lock this folio.
>> -		 */
>> -		if (avoid_force_lock) {
>> -			rc = -EDEADLOCK;
>> -			goto out;
>> -		}
>> -
>>  		folio_lock(src);
>>  	}
>>  	locked = true;
>> @@ -1247,7 +1236,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>>  		/* Establish migration ptes */
>>  		VM_BUG_ON_FOLIO(folio_test_anon(src) &&
>>  			       !folio_test_ksm(src) && !anon_vma, src);
>> -		try_to_migrate(src, TTU_BATCH_FLUSH);
>> +		try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
>
> Why that change, I wonder? The TTU_BATCH_FLUSH can still be useful for
> gathering multiple cross-CPU TLB flushes into one, even when it's only
> a single page in the batch.

First, I had thought that we have no opportunity to batch the TLB
flushing now.  But as you pointed out, it is still possible to batch
if mapcount > 1.  Second, without TTU_BATCH_FLUSH, we may flush the
TLB for a single page (with the invlpg instruction); with it, we will
flush the TLB for all pages.  The former is faster and will not
influence other TLB entries of the process.

Or should we use TTU_BATCH_FLUSH only if mapcount > 1?
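
To make the question concrete, the selection could look roughly like the
standalone sketch below.  It is an illustration only: the SKETCH_* flag
values and sketch_pick_ttu_flags() are hypothetical stand-ins, not the
kernel's TTU flags or any existing helper, and whether mapcount is the
right test is exactly what is being asked.

```c
#include <assert.h>

/* Hypothetical stand-ins for the TTU flag bits; values are illustrative. */
enum sketch_ttu_flags {
	SKETCH_TTU_NONE        = 0,
	SKETCH_TTU_BATCH_FLUSH = 1 << 0,
};

/*
 * Batch the TLB flush when migrating asynchronously (many folios per
 * batch) or when the folio has more than one mapping (several flushes
 * to gather).  Otherwise flush the single page directly.
 */
static enum sketch_ttu_flags sketch_pick_ttu_flags(int is_async, int mapcount)
{
	assert(mapcount >= 0);
	if (is_async || mapcount > 1)
		return SKETCH_TTU_BATCH_FLUSH;
	return SKETCH_TTU_NONE;
}
```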

>
>>  		page_was_mapped = 1;
>>  	}
>>  
>> @@ -1261,7 +1250,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>>  	 * A folio that has not been unmapped will be restored to
>>  	 * right list unless we want to retry.
>>  	 */
>> -	if (rc == -EAGAIN || rc == -EDEADLOCK)
>> +	if (rc == -EAGAIN)
>>  		ret = NULL;
>>  
>>  	migrate_folio_undo_src(src, page_was_mapped, anon_vma, locked, ret);
>> @@ -1634,11 +1623,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  	LIST_HEAD(dst_folios);
>>  	bool nosplit = (reason == MR_NUMA_MISPLACED);
>>  	bool no_split_folio_counting = false;
>> -	bool avoid_force_lock;
>>  
>>  retry:
>>  	rc_saved = 0;
>> -	avoid_force_lock = false;
>>  	retry = 1;
>>  	for (pass = 0;
>>  	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
>> @@ -1683,15 +1670,14 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  			}
>>  
>>  			rc = migrate_folio_unmap(get_new_page, put_new_page, private,
>> -						 folio, &dst, pass > 2, avoid_force_lock,
>> -						 mode, reason, ret_folios);
>> +						 folio, &dst, pass > 2, mode,
>> +						 reason, ret_folios);
>>  			/*
>>  			 * The rules are:
>>  			 *	Success: folio will be freed
>>  			 *	Unmap: folio will be put on unmap_folios list,
>>  			 *	       dst folio put on dst_folios list
>>  			 *	-EAGAIN: stay on the from list
>> -			 *	-EDEADLOCK: stay on the from list
>>  			 *	-ENOMEM: stay on the from list
>>  			 *	Other errno: put on ret_folios list
>>  			 */
>> @@ -1743,14 +1729,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  					goto out;
>>  				else
>>  					goto move;
>> -			case -EDEADLOCK:
>> -				/*
>> -				 * The folio cannot be locked for potential deadlock.
>> -				 * Go move (and unlock) all locked folios.  Then we can
>> -				 * try again.
>> -				 */
>> -				rc_saved = rc;
>> -				goto move;
>>  			case -EAGAIN:
>>  				if (is_large) {
>>  					large_retry++;
>> @@ -1765,11 +1743,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  				stats->nr_thp_succeeded += is_thp;
>>  				break;
>>  			case MIGRATEPAGE_UNMAP:
>> -				/*
>> -				 * We have locked some folios, don't force lock
>> -				 * to avoid deadlock.
>> -				 */
>> -				avoid_force_lock = true;
>>  				list_move_tail(&folio->lru, &unmap_folios);
>>  				list_add_tail(&dst->lru, &dst_folios);
>>  				break;
>> @@ -1894,17 +1867,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  		 */
>>  		list_splice_init(from, ret_folios);
>>  		list_splice_init(&split_folios, from);
>> +		/*
>> +		 * Force async mode to avoid to wait lock or bit when we have
>> +		 * locked more than one folios.
>> +		 */
>> +		mode = MIGRATE_ASYNC;
>
> It goes away in a later patch anyway, but I didn't understand that change -
> I thought this was a point at which no locks are held.  Oh, perhaps I get
> it now: because the batch of 1 is here becoming a batch of HPAGE_PMD_NR?

Yes.  Now there are HPAGE_PMD_NR folios in the "from" list.

And, in the later patch, I just move the logic out of this function.
The split_folios list is returned to the caller, and the caller will call
migrate_pages_batch() again to migrate "split_folios" in MIGRATE_ASYNC
mode.

>>  		no_split_folio_counting = true;
>>  		goto retry;
>>  	}
>>  
>> -	/*
>> -	 * We have unlocked all locked folios, so we can force lock now, let's
>> -	 * try again.
>> -	 */
>> -	if (rc == -EDEADLOCK)
>> -		goto retry;
>> -
>>  	return rc;
>>  }
>>  
>> @@ -1939,7 +1910,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>>  {
>>  	int rc, rc_gather;
>> -	int nr_pages;
>> +	int nr_pages, batch;
>>  	struct folio *folio, *folio2;
>>  	LIST_HEAD(folios);
>>  	LIST_HEAD(ret_folios);
>> @@ -1953,6 +1924,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  				     mode, reason, &stats, &ret_folios);
>>  	if (rc_gather < 0)
>>  		goto out;
>> +
>> +	if (mode == MIGRATE_ASYNC)
>> +		batch = NR_MAX_BATCHED_MIGRATION;
>> +	else
>> +		batch = 1;
>>  again:
>>  	nr_pages = 0;
>>  	list_for_each_entry_safe(folio, folio2, from, lru) {
>> @@ -1963,11 +1939,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  		}
>>  
>>  		nr_pages += folio_nr_pages(folio);
>> -		if (nr_pages > NR_MAX_BATCHED_MIGRATION)
>> +		if (nr_pages >= batch)
>>  			break;
>
> Yes, the off-by-one fixes look good.

Thanks!

>>  	}
>> -	if (nr_pages > NR_MAX_BATCHED_MIGRATION)
>> -		list_cut_before(&folios, from, &folio->lru);
>> +	if (nr_pages >= batch)
>> +		list_cut_before(&folios, from, &folio2->lru);
>>  	else
>>  		list_splice_init(from, &folios);
>>  	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>> -- 

Best Regards,
Huang, Ying


* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-02-28  6:36   ` Hugh Dickins
@ 2023-02-28  7:45     ` Huang, Ying
  2023-02-28 21:22       ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2023-02-28  7:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Hugh Dickins <hughd@google.com> writes:

> On Fri, 24 Feb 2023, Huang Ying wrote:
>
>> When we have locked more than one folios, we cannot wait the lock or
>> bit (e.g., page lock, buffer head lock, writeback bit) synchronously.
>> Otherwise deadlock may be triggered.  This make it hard to batch the
>> synchronous migration directly.
>> 
>> This patch re-enables batching synchronous migration via trying to
>> migrate in batch asynchronously firstly.  And any folios that are
>> failed to be migrated asynchronously will be migrated synchronously
>> one by one.
>> 
>> Test shows that this can restore the TLB flushing batching performance
>> for synchronous migration effectively.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>
> I'm not sure whether my 48 hours on two machines counts for a
> Tested-by: Hugh Dickins <hughd@google.com>
> or not; but it certainly looks like you've fixed my deadlock.

Thank you very much for testing the series!

>> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Stefan Roesch <shr@devkernel.io>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Xin Hao <xhao@linux.alibaba.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 55 insertions(+), 10 deletions(-)
>
> I was initially disappointed, that this was more complicated than I had
> thought it should be; but came to understand why.  My "change the mode
> to MIGRATE_ASYNC after the first" model would have condemned most of the
> MIGRATE_SYNC batch of pages to be handled as lightly as MIGRATE_ASYNC:
> not good enough, you're right to be trying harder here.
>
>> 
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 91198b487e49..c17ce5ee8d92 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>  	return rc;
>>  }
>>  
>> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
>> +		free_page_t put_new_page, unsigned long private,
>> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
>> +{
>> +	int rc, nr_failed = 0;
>> +	LIST_HEAD(folios);
>> +	struct migrate_pages_stats astats;
>> +
>> +	memset(&astats, 0, sizeof(astats));
>> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
>> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
>> +				 reason, &folios, split_folios, &astats,
>> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>
> I wonder if that and below would better be NR_MAX_MIGRATE_PAGES_RETRY / 2.
>
> Though I've never got down to adjusting that number (and it's not a job
> to be done in this set of patches), those 10 retries sometimes terrify
> me, from a latency point of view.  They can have such different weights:
> in the unmapped case, 10 retries is okay; but when a pinned page is mapped
> into 1000 processes, the thought of all that unmapping and TLB flushing
> and remapping is terrifying.
>
> Since you're retrying below, halve both numbers of retries for now?

Yes.  These are reasonable concerns.

And in the original implementation, we only wait for the page lock and
for writeback to complete if pass > 2.  This amounts to trying to
migrate asynchronously 3 times before the real synchronous
migration.  So, should we delete the "force" logic (in
migrate_folio_unmap()), and try to migrate asynchronously 3 times in
batch before migrating synchronously 7 times one by one?
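
As a rough illustration of that 3 + 7 split, here is a toy userspace
model, not kernel code.  Everything prefixed toy_ is hypothetical: a
"folio" is just a counter of how many non-blocking attempts will fail,
and a blocking attempt is assumed to fail only for a pinned folio.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_ASYNC_PASSES 3	/* batched passes, never block */
#define NR_SYNC_PASSES  7	/* per-folio passes, may block */

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct toy_folio {
	int busy_for;	/* non-blocking attempts that will still fail */
	bool pinned;	/* can never be migrated, even when blocking */
	bool migrated;
};

/* One non-blocking attempt: fails while the folio is busy or pinned. */
static bool toy_try_async(struct toy_folio *f)
{
	if (f->pinned)
		return false;
	if (f->busy_for > 0) {
		f->busy_for--;
		return false;
	}
	f->migrated = true;
	return true;
}

/* One blocking attempt: waits out the busy period, fails only if pinned. */
static bool toy_try_sync(struct toy_folio *f)
{
	if (f->pinned)
		return false;
	f->busy_for = 0;
	f->migrated = true;
	return true;
}

/* Returns the number of folios that could not be migrated. */
static int toy_migrate_sync(struct toy_folio *folios, size_t n)
{
	int nr_failed = 0;

	/* Phase 1: up to 3 non-blocking passes over the whole batch. */
	for (int pass = 0; pass < NR_ASYNC_PASSES; pass++) {
		bool retry = false;

		for (size_t i = 0; i < n; i++)
			if (!folios[i].migrated && !toy_try_async(&folios[i]))
				retry = true;
		if (!retry)
			return 0;
	}

	/* Phase 2: leftovers one by one, up to 7 blocking passes each. */
	for (size_t i = 0; i < n; i++) {
		bool ok = folios[i].migrated;

		for (int pass = 0; pass < NR_SYNC_PASSES && !ok; pass++)
			ok = toy_try_sync(&folios[i]);
		if (!ok)
			nr_failed++;
	}
	return nr_failed;
}
```

The point of the split is that blocking only ever happens in phase 2,
where exactly one folio is being worked on at a time.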

>> +	stats->nr_succeeded += astats.nr_succeeded;
>> +	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
>> +	stats->nr_thp_split += astats.nr_thp_split;
>> +	if (rc < 0) {
>> +		stats->nr_failed_pages += astats.nr_failed_pages;
>> +		stats->nr_thp_failed += astats.nr_thp_failed;
>> +		list_splice_tail(&folios, ret_folios);
>> +		return rc;
>> +	}
>> +	stats->nr_thp_failed += astats.nr_thp_split;
>> +	nr_failed += astats.nr_thp_split;
>> +	/*
>> +	 * Fall back to migrate all failed folios one by one synchronously. All
>> +	 * failed folios except split THPs will be retried, so their failure
>> +	 * isn't counted
>> +	 */
>> +	list_splice_tail_init(&folios, from);
>> +	while (!list_empty(from)) {
>> +		list_move(from->next, &folios);
>> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
>> +					 private, mode, reason, ret_folios,
>> +					 split_folios, stats, NR_MAX_MIGRATE_PAGES_RETRY);
>
> NR_MAX_MIGRATE_PAGES_RETRY / 2 ?
>
>> +		list_splice_tail_init(&folios, ret_folios);
>> +		if (rc < 0)
>> +			return rc;
>> +		nr_failed += rc;
>> +	}
>> +
>> +	return nr_failed;
>> +}
>> +
>>  /*
>>   * migrate_pages - migrate the folios specified in a list, to the free folios
>>   *		   supplied as the target for the page migration
>> @@ -1874,7 +1919,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>>  {
>>  	int rc, rc_gather;
>> -	int nr_pages, batch;
>> +	int nr_pages;
>>  	struct folio *folio, *folio2;
>>  	LIST_HEAD(folios);
>>  	LIST_HEAD(ret_folios);
>> @@ -1890,10 +1935,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  	if (rc_gather < 0)
>>  		goto out;
>>  
>> -	if (mode == MIGRATE_ASYNC)
>> -		batch = NR_MAX_BATCHED_MIGRATION;
>> -	else
>> -		batch = 1;
>>  again:
>>  	nr_pages = 0;
>>  	list_for_each_entry_safe(folio, folio2, from, lru) {
>> @@ -1904,16 +1945,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>  		}
>>  
>>  		nr_pages += folio_nr_pages(folio);
>> -		if (nr_pages >= batch)
>> +		if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>>  			break;
>>  	}
>> -	if (nr_pages >= batch)
>> +	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>>  		list_cut_before(&folios, from, &folio2->lru);
>>  	else
>>  		list_splice_init(from, &folios);
>> -	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>> -				 mode, reason, &ret_folios, &split_folios, &stats,
>> -				 NR_MAX_MIGRATE_PAGES_RETRY);
>> +	if (mode == MIGRATE_ASYNC)
>> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>> +					 mode, reason, &ret_folios, &split_folios, &stats,
>> +					 NR_MAX_MIGRATE_PAGES_RETRY);
>> +	else
>> +		rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
>> +					mode, reason, &ret_folios, &split_folios, &stats);
>>  	list_splice_tail_init(&folios, &ret_folios);
>>  	if (rc < 0) {
>>  		rc_gather = rc;
>> -- 
>> 2.39.1

Best Regards,
Huang, Ying

* Re: [PATCH 1/3] migrate_pages: fix deadlock in batched migration
  2023-02-28  7:22     ` Huang, Ying
@ 2023-02-28 21:07       ` Hugh Dickins
  2023-03-01  1:17         ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2023-02-28 21:07 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

On Tue, 28 Feb 2023, Huang, Ying wrote:
> Hugh Dickins <hughd@google.com> writes:
> > On Fri, 24 Feb 2023, Huang Ying wrote:
> >> @@ -1247,7 +1236,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
> >>  		/* Establish migration ptes */
> >>  		VM_BUG_ON_FOLIO(folio_test_anon(src) &&
> >>  			       !folio_test_ksm(src) && !anon_vma, src);
> >> -		try_to_migrate(src, TTU_BATCH_FLUSH);
> >> +		try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
> >
> > Why that change, I wonder? The TTU_BATCH_FLUSH can still be useful for
> > gathering multiple cross-CPU TLB flushes into one, even when it's only
> > a single page in the batch.
> 
> Firstly, I would have thought that we have no opportunities to batch the
> TLB flushing now.  But as you pointed out, it is still possible to batch
> if mapcount > 1.  Secondly, without TTU_BATCH_FLUSH, we may flush the
> TLB for a single page (with invlpg instruction), otherwise, we will
> flush the TLB for all pages.  The former is faster and will not
> influence other TLB entries of the process.
> 
> Or we use TTU_BATCH_FLUSH only if mapcount > 1?

I had not thought at all of the "invlpg" advantage (which I imagine
some other architectures than x86 share) to not delaying the TLB flush
of a single PTE.

Frankly, I just don't have any feeling for the tradeoff between
multiple remote invlpgs versus one remote batched TLB flush of all.
Which presumably depends on number of CPUs, size of TLBs, etc etc.

Your "mapcount > 1" idea might be good, but I cannot tell: I'd say
now that there's no reason to change your "mode == MIGRATE_ASYNC ?
TTU_BATCH_FLUSH : 0" without much more thought, or a quick insight
from someone else.  Some other time maybe.
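
For reference, the two policies being weighed can be written side by
side.  This is only a sketch with toy names: TOY_TTU_BATCH_FLUSH and
enum toy_migrate_mode are stand-ins (the flag value is arbitrary), and
the mapcount variant is the untested idea from this thread, not
anything in the patch.

```c
/* Toy stand-ins for the kernel's migrate mode and TTU flag. */
enum toy_migrate_mode { TOY_MIGRATE_ASYNC, TOY_MIGRATE_SYNC };
#define TOY_TTU_BATCH_FLUSH 0x1u	/* arbitrary toy value */

/* Policy in the posted patch: batch the flush only in async mode. */
static unsigned int toy_ttu_flags_by_mode(enum toy_migrate_mode mode)
{
	return mode == TOY_MIGRATE_ASYNC ? TOY_TTU_BATCH_FLUSH : 0;
}

/*
 * Hypothetical alternative: also batch when the folio is mapped more
 * than once, so several cross-CPU flushes can be gathered into one.
 * A singly mapped folio in sync mode keeps the immediate
 * single-page flush.
 */
static unsigned int toy_ttu_flags_by_mapcount(enum toy_migrate_mode mode,
					       int mapcount)
{
	if (mode == TOY_MIGRATE_ASYNC || mapcount > 1)
		return TOY_TTU_BATCH_FLUSH;
	return 0;
}
```

Which of the two is faster is exactly the open question above; the
sketch only shows where the decision would sit.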

Hugh

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-02-28  7:45     ` Huang, Ying
@ 2023-02-28 21:22       ` Hugh Dickins
  2023-03-01  6:08         ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2023-02-28 21:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

On Tue, 28 Feb 2023, Huang, Ying wrote:
> Hugh Dickins <hughd@google.com> writes:
> > On Fri, 24 Feb 2023, Huang Ying wrote:
> >> 
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 91198b487e49..c17ce5ee8d92 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> >>  	return rc;
> >>  }
> >>  
> >> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
> >> +		free_page_t put_new_page, unsigned long private,
> >> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
> >> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
> >> +{
> >> +	int rc, nr_failed = 0;
> >> +	LIST_HEAD(folios);
> >> +	struct migrate_pages_stats astats;
> >> +
> >> +	memset(&astats, 0, sizeof(astats));
> >> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
> >> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
> >> +				 reason, &folios, split_folios, &astats,
> >> +				 NR_MAX_MIGRATE_PAGES_RETRY);
> >
> > I wonder if that and below would better be NR_MAX_MIGRATE_PAGES_RETRY / 2.
> >
> > Though I've never got down to adjusting that number (and it's not a job
> > to be done in this set of patches), those 10 retries sometimes terrify
> > me, from a latency point of view.  They can have such different weights:
> > in the unmapped case, 10 retries is okay; but when a pinned page is mapped
> > into 1000 processes, the thought of all that unmapping and TLB flushing
> > and remapping is terrifying.
> >
> > Since you're retrying below, halve both numbers of retries for now?
> 
> Yes.  These are reasonable concerns.
> 
> And in the original implementation, we only wait to lock page and wait
> the writeback to complete if pass > 2.  This is kind of trying to
> migrate asynchronously for 3 times before the real synchronous
> migration.  So, should we delete the "force" logic (in
> migrate_folio_unmap()), and try to migrate asynchronously for 3 times in
> batch before migrating synchronously for 7 times one by one?

Oh, that's a good idea (but please don't imagine I've thought it through):
I hadn't realized the way in which your migrate_pages_sync() addition is
kind of duplicating the way that the "force" argument conditions behaviour.
It would be very appealing to delete the "force" argument now if you can.

But aside from that, you've also made me wonder (again, please remember I
don't have a good picture of the new migrate_pages() sequence in my head)
whether you have already made a *great* strike against my 10 retries
terror.  Am I reading it right, that the unmapping is now done on the
first try, and the remove_migration_ptes after the last try (all the
pages involved having remained locked throughout)?

Hugh

* Re: [PATCH 1/3] migrate_pages: fix deadlock in batched migration
  2023-02-28 21:07       ` Hugh Dickins
@ 2023-03-01  1:17         ` Huang, Ying
  0 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2023-03-01  1:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Hugh Dickins <hughd@google.com> writes:

> On Tue, 28 Feb 2023, Huang, Ying wrote:
>> Hugh Dickins <hughd@google.com> writes:
>> > On Fri, 24 Feb 2023, Huang Ying wrote:
>> >> @@ -1247,7 +1236,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>> >>  		/* Establish migration ptes */
>> >>  		VM_BUG_ON_FOLIO(folio_test_anon(src) &&
>> >>  			       !folio_test_ksm(src) && !anon_vma, src);
>> >> -		try_to_migrate(src, TTU_BATCH_FLUSH);
>> >> +		try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
>> >
>> > Why that change, I wonder? The TTU_BATCH_FLUSH can still be useful for
>> > gathering multiple cross-CPU TLB flushes into one, even when it's only
>> > a single page in the batch.
>> 
>> Firstly, I would have thought that we have no opportunities to batch the
>> TLB flushing now.  But as you pointed out, it is still possible to batch
>> if mapcount > 1.  Secondly, without TTU_BATCH_FLUSH, we may flush the
>> TLB for a single page (with invlpg instruction), otherwise, we will
>> flush the TLB for all pages.  The former is faster and will not
>> influence other TLB entries of the process.
>> 
>> Or we use TTU_BATCH_FLUSH only if mapcount > 1?
>
> I had not thought at all of the "invlpg" advantage (which I imagine
> some other architectures than x86 share) to not delaying the TLB flush
> of a single PTE.
>
> Frankly, I just don't have any feeling for the tradeoff between
> multiple remote invlpgs versus one remote batched TLB flush of all.
> Which presumably depends on number of CPUs, size of TLBs, etc etc.
>
> Your "mapcount > 1" idea might be good, but I cannot tell: I'd say
> now that there's no reason to change your "mode == MIGRATE_ASYNC ?
> TTU_BATCH_FLUSH : 0" without much more thought, or a quick insight
> from someone else.  Some other time maybe.

Yes.  I think that this is reasonable.  We can revisit this later.

Best Regards,
Huang, Ying

* Re: [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch()
  2023-02-24 14:11 ` [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch() Huang Ying
@ 2023-03-01  2:23   ` Baolin Wang
  2023-03-01  6:35     ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2023-03-01  2:23 UTC (permalink / raw)
  To: Huang Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Matthew Wilcox, Mike Kravetz



On 2/24/2023 10:11 PM, Huang Ying wrote:
> To simplify the code logic and reduce the line number.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Xin Hao <xhao@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/migrate.c | 76 ++++++++++++++++++----------------------------------
>   1 file changed, 26 insertions(+), 50 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7ac37dbbf307..91198b487e49 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1605,9 +1605,10 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
>   static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   		free_page_t put_new_page, unsigned long private,
>   		enum migrate_mode mode, int reason, struct list_head *ret_folios,
> -		struct migrate_pages_stats *stats)
> +		struct list_head *split_folios, struct migrate_pages_stats *stats,
> +		int nr_pass)
>   {
> -	int retry;
> +	int retry = 1;
>   	int large_retry = 1;
>   	int thp_retry = 1;
>   	int nr_failed = 0;
> @@ -1617,19 +1618,12 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   	bool is_large = false;
>   	bool is_thp = false;
>   	struct folio *folio, *folio2, *dst = NULL, *dst2;
> -	int rc, rc_saved, nr_pages;
> -	LIST_HEAD(split_folios);
> +	int rc, rc_saved = 0, nr_pages;
>   	LIST_HEAD(unmap_folios);
>   	LIST_HEAD(dst_folios);
>   	bool nosplit = (reason == MR_NUMA_MISPLACED);
> -	bool no_split_folio_counting = false;
>   
> -retry:
> -	rc_saved = 0;
> -	retry = 1;
> -	for (pass = 0;
> -	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
> -	     pass++) {
> +	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
>   		retry = 0;
>   		large_retry = 0;
>   		thp_retry = 0;
> @@ -1660,7 +1654,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   			if (!thp_migration_supported() && is_thp) {
>   				nr_large_failed++;
>   				stats->nr_thp_failed++;
> -				if (!try_split_folio(folio, &split_folios)) {
> +				if (!try_split_folio(folio, split_folios)) {
>   					stats->nr_thp_split++;
>   					continue;
>   				}
> @@ -1692,7 +1686,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   					stats->nr_thp_failed += is_thp;
>   					/* Large folio NUMA faulting doesn't split to retry. */
>   					if (!nosplit) {
> -						int ret = try_split_folio(folio, &split_folios);
> +						int ret = try_split_folio(folio, split_folios);
>   
>   						if (!ret) {
>   							stats->nr_thp_split += is_thp;
> @@ -1709,18 +1703,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   							break;
>   						}
>   					}
> -				} else if (!no_split_folio_counting) {
> +				} else {
>   					nr_failed++;
>   				}
>   
>   				stats->nr_failed_pages += nr_pages + nr_retry_pages;
> -				/*
> -				 * There might be some split folios of fail-to-migrate large
> -				 * folios left in split_folios list. Move them to ret_folios
> -				 * list so that they could be put back to the right list by
> -				 * the caller otherwise the folio refcnt will be leaked.
> -				 */
> -				list_splice_init(&split_folios, ret_folios);
>   				/* nr_failed isn't updated for not used */
>   				nr_large_failed += large_retry;
>   				stats->nr_thp_failed += thp_retry;
> @@ -1733,7 +1720,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   				if (is_large) {
>   					large_retry++;
>   					thp_retry += is_thp;
> -				} else if (!no_split_folio_counting) {
> +				} else {
>   					retry++;
>   				}
>   				nr_retry_pages += nr_pages;
> @@ -1756,7 +1743,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   				if (is_large) {
>   					nr_large_failed++;
>   					stats->nr_thp_failed += is_thp;
> -				} else if (!no_split_folio_counting) {
> +				} else {
>   					nr_failed++;
>   				}
>   
> @@ -1774,9 +1761,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   	try_to_unmap_flush();
>   
>   	retry = 1;
> -	for (pass = 0;
> -	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
> -	     pass++) {
> +	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
>   		retry = 0;
>   		large_retry = 0;
>   		thp_retry = 0;
> @@ -1805,7 +1790,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   				if (is_large) {
>   					large_retry++;
>   					thp_retry += is_thp;
> -				} else if (!no_split_folio_counting) {
> +				} else {
>   					retry++;
>   				}
>   				nr_retry_pages += nr_pages;
> @@ -1818,7 +1803,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   				if (is_large) {
>   					nr_large_failed++;
>   					stats->nr_thp_failed += is_thp;
> -				} else if (!no_split_folio_counting) {
> +				} else {
>   					nr_failed++;
>   				}
>   
> @@ -1855,27 +1840,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   		dst2 = list_next_entry(dst, lru);
>   	}
>   
> -	/*
> -	 * Try to migrate split folios of fail-to-migrate large folios, no
> -	 * nr_failed counting in this round, since all split folios of a
> -	 * large folio is counted as 1 failure in the first round.
> -	 */
> -	if (rc >= 0 && !list_empty(&split_folios)) {
> -		/*
> -		 * Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
> -		 * retries) to ret_folios to avoid migrating them again.
> -		 */
> -		list_splice_init(from, ret_folios);
> -		list_splice_init(&split_folios, from);
> -		/*
> -		 * Force async mode to avoid to wait lock or bit when we have
> -		 * locked more than one folios.
> -		 */
> -		mode = MIGRATE_ASYNC;
> -		no_split_folio_counting = true;
> -		goto retry;
> -	}
> -
>   	return rc;
>   }
>   
> @@ -1914,6 +1878,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   	struct folio *folio, *folio2;
>   	LIST_HEAD(folios);
>   	LIST_HEAD(ret_folios);
> +	LIST_HEAD(split_folios);
>   	struct migrate_pages_stats stats;
>   
>   	trace_mm_migrate_pages_start(mode, reason);
> @@ -1947,12 +1912,23 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   	else
>   		list_splice_init(from, &folios);
>   	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> -				 mode, reason, &ret_folios, &stats);
> +				 mode, reason, &ret_folios, &split_folios, &stats,
> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>   	list_splice_tail_init(&folios, &ret_folios);
>   	if (rc < 0) {
>   		rc_gather = rc;
> +		list_splice_tail(&split_folios, &ret_folios);

Can we still keep the original comment?  It helps to understand the
case, at least for me :)
  /*
   * There might be some split folios of fail-to-migrate large
   * folios left in split_folios list. Move them to ret_folios
   * list so that they could be put back to the right list by
   * the caller otherwise the folio refcnt will be leaked.
   */

>   		goto out;
>   	}
> +	if (!list_empty(&split_folios)) {
> +		/*
> +		 * Failure isn't counted since all split folios of a large folio
> +		 * is counted as 1 failure already.
> +		 */
> +		migrate_pages_batch(&split_folios, get_new_page, put_new_page, private,
> +				    MIGRATE_ASYNC, reason, &ret_folios, NULL, &stats, 1);

Better to copy the original comment explaining why MIGRATE_ASYNC mode
is forced for split folios.

Thanks for the simplification, and please feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-02-24 14:11 ` [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly Huang Ying
  2023-02-28  6:36   ` Hugh Dickins
@ 2023-03-01  3:08   ` Baolin Wang
  2023-03-01  6:18     ` Huang, Ying
  1 sibling, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2023-03-01  3:08 UTC (permalink / raw)
  To: Huang Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Matthew Wilcox, Mike Kravetz



On 2/24/2023 10:11 PM, Huang Ying wrote:
> When we have locked more than one folio, we cannot wait for a lock or
> bit (e.g., page lock, buffer head lock, writeback bit) synchronously.
> Otherwise deadlock may be triggered.  This makes it hard to batch the
> synchronous migration directly.
> 
> This patch re-enables batching synchronous migration via trying to
> migrate in batch asynchronously first.  Any folios that fail to be
> migrated asynchronously will then be migrated synchronously one by
> one.
> 
> Test shows that this can restore the TLB flushing batching performance
> for synchronous migration effectively.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Xin Hao <xhao@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 55 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 91198b487e49..c17ce5ee8d92 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>   	return rc;
>   }
>   
> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
> +		free_page_t put_new_page, unsigned long private,
> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
> +{
> +	int rc, nr_failed = 0;
> +	LIST_HEAD(folios);
> +	struct migrate_pages_stats astats;
> +
> +	memset(&astats, 0, sizeof(astats));
> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
> +				 reason, &folios, split_folios, &astats,
> +				 NR_MAX_MIGRATE_PAGES_RETRY);
> +	stats->nr_succeeded += astats.nr_succeeded;
> +	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
> +	stats->nr_thp_split += astats.nr_thp_split;
> +	if (rc < 0) {
> +		stats->nr_failed_pages += astats.nr_failed_pages;
> +		stats->nr_thp_failed += astats.nr_thp_failed;
> +		list_splice_tail(&folios, ret_folios);
> +		return rc;
> +	}
> +	stats->nr_thp_failed += astats.nr_thp_split;
> +	nr_failed += astats.nr_thp_split;
> +	/*
> +	 * Fall back to migrate all failed folios one by one synchronously. All
> +	 * failed folios except split THPs will be retried, so their failure
> +	 * isn't counted
> +	 */
> +	list_splice_tail_init(&folios, from);
> +	while (!list_empty(from)) {
> +		list_move(from->next, &folios);
> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
> +					 private, mode, reason, ret_folios,
> +					 split_folios, stats, NR_MAX_MIGRATE_PAGES_RETRY);
> +		list_splice_tail_init(&folios, ret_folios);
> +		if (rc < 0)
> +			return rc;
> +		nr_failed += rc;
> +	}
> +
> +	return nr_failed;
> +}
> +
>   /*
>    * migrate_pages - migrate the folios specified in a list, to the free folios
>    *		   supplied as the target for the page migration
> @@ -1874,7 +1919,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>   {
>   	int rc, rc_gather;
> -	int nr_pages, batch;
> +	int nr_pages;
>   	struct folio *folio, *folio2;
>   	LIST_HEAD(folios);
>   	LIST_HEAD(ret_folios);
> @@ -1890,10 +1935,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   	if (rc_gather < 0)
>   		goto out;
>   
> -	if (mode == MIGRATE_ASYNC)
> -		batch = NR_MAX_BATCHED_MIGRATION;
> -	else
> -		batch = 1;
>   again:
>   	nr_pages = 0;
>   	list_for_each_entry_safe(folio, folio2, from, lru) {
> @@ -1904,16 +1945,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   		}
>   
>   		nr_pages += folio_nr_pages(folio);
> -		if (nr_pages >= batch)
> +		if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>   			break;
>   	}
> -	if (nr_pages >= batch)
> +	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>   		list_cut_before(&folios, from, &folio2->lru);
>   	else
>   		list_splice_init(from, &folios);
> -	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> -				 mode, reason, &ret_folios, &split_folios, &stats,
> -				 NR_MAX_MIGRATE_PAGES_RETRY);
> +	if (mode == MIGRATE_ASYNC)
> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> +					 mode, reason, &ret_folios, &split_folios, &stats,
> +					 NR_MAX_MIGRATE_PAGES_RETRY);
> +	else
> +		rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
> +					mode, reason, &ret_folios, &split_folios, &stats);

For split folios, it also seems reasonable to use migrate_pages_sync()
instead of always using a fixed MIGRATE_ASYNC mode?

>   	list_splice_tail_init(&folios, &ret_folios);
>   	if (rc < 0) {
>   		rc_gather = rc;

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-02-28 21:22       ` Hugh Dickins
@ 2023-03-01  6:08         ` Huang, Ying
  2023-03-01  6:46           ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2023-03-01  6:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Hugh Dickins <hughd@google.com> writes:

> On Tue, 28 Feb 2023, Huang, Ying wrote:
>> Hugh Dickins <hughd@google.com> writes:
>> > On Fri, 24 Feb 2023, Huang Ying wrote:
>> >> 
>> >> diff --git a/mm/migrate.c b/mm/migrate.c
>> >> index 91198b487e49..c17ce5ee8d92 100644
>> >> --- a/mm/migrate.c
>> >> +++ b/mm/migrate.c
>> >> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>> >>  	return rc;
>> >>  }
>> >>  
>> >> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
>> >> +		free_page_t put_new_page, unsigned long private,
>> >> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>> >> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
>> >> +{
>> >> +	int rc, nr_failed = 0;
>> >> +	LIST_HEAD(folios);
>> >> +	struct migrate_pages_stats astats;
>> >> +
>> >> +	memset(&astats, 0, sizeof(astats));
>> >> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
>> >> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
>> >> +				 reason, &folios, split_folios, &astats,
>> >> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>> >
>> > I wonder if that and below would better be NR_MAX_MIGRATE_PAGES_RETRY / 2.
>> >
>> > Though I've never got down to adjusting that number (and it's not a job
>> > to be done in this set of patches), those 10 retries sometimes terrify
>> > me, from a latency point of view.  They can have such different weights:
>> > in the unmapped case, 10 retries is okay; but when a pinned page is mapped
>> > into 1000 processes, the thought of all that unmapping and TLB flushing
>> > and remapping is terrifying.
>> >
>> > Since you're retrying below, halve both numbers of retries for now?
>> 
>> Yes.  These are reasonable concerns.
>> 
>> And in the original implementation, we only wait to lock page and wait
>> the writeback to complete if pass > 2.  This is kind of trying to
>> migrate asynchronously for 3 times before the real synchronous
>> migration.  So, should we delete the "force" logic (in
>> migrate_folio_unmap()), and try to migrate asynchronously for 3 times in
>> batch before migrating synchronously for 7 times one by one?
>
> Oh, that's a good idea (but please don't imagine I've thought it through):
> I hadn't realized the way in which your migrate_pages_sync() addition is
> kind of duplicating the way that the "force" argument conditions behaviour,
> It would be very appealing to delete the "force" argument now if you can.

Sure.  Will do that in the next version.

> But aside from that, you've also made me wonder (again, please remember I
> don't have a good picture of the new migrate_pages() sequence in my head)
> whether you have already made a *great* strike against my 10 retries
> terror.  Am I reading it right, that the unmapping is now done on the
> first try, and the remove_migration_ptes after the last try (all the
> pages involved having remained locked throughout)?

Yes.  You are right.  Now, unmapping and moving are two separate steps,
and they are retried separately.  After a folio has been unmapped
successfully, we will not remap/unmap it 10 times if the folio is pinned
and therefore fails to move (in migrate_folio_move()).  So the latency
caused by retrying is much lower now.  But I still tend to keep the
total retry number as before.  Do you agree?
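
The difference can be put in numbers with a toy accounting model (plain
userspace C, not kernel code).  All names are hypothetical, NR_PASSES
mirrors the 10 retries discussed above, and the "combined" variant is
only a reading of the flow Hugh was worried about, where every retry
unmaps and remaps again.  Unmapping is assumed to succeed on the first
try.

```c
#include <stdbool.h>

#define NR_PASSES 10	/* mirrors the 10 retries discussed above */

struct toy_counts {
	int unmaps;
	int remaps;
	int move_attempts;
};

/* Every pass unmaps, tries to move, and remaps again on failure. */
static struct toy_counts toy_combined_retry(bool pinned)
{
	struct toy_counts c = { 0, 0, 0 };

	for (int pass = 0; pass < NR_PASSES; pass++) {
		c.unmaps++;
		c.move_attempts++;
		if (!pinned)
			break;		/* move succeeded */
		c.remaps++;		/* restore mappings before retrying */
	}
	return c;
}

/* Unmap once, retry only the move, remap once if it never succeeds. */
static struct toy_counts toy_split_retry(bool pinned)
{
	struct toy_counts c = { 1, 0, 0 };	/* single unmap up front */

	for (int pass = 0; pass < NR_PASSES; pass++) {
		c.move_attempts++;
		if (!pinned)
			return c;	/* move succeeded */
	}
	c.remaps = 1;			/* give up: restore mappings once */
	return c;
}
```

For a pinned folio the combined variant counts 10 unmaps and 10 remaps,
while the split variant counts 1 of each with the same 10 move
attempts.  With a folio mapped into many processes, each unmap/remap is
the expensive part, so this is where the latency win comes from.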

Best Regards,
Huang, Ying

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-03-01  3:08   ` Baolin Wang
@ 2023-03-01  6:18     ` Huang, Ying
  2023-03-01 11:03       ` Baolin Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2023-03-01  6:18 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Matthew Wilcox, Mike Kravetz

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 2/24/2023 10:11 PM, Huang Ying wrote:
>> When we have locked more than one folios, we cannot wait the lock or
>> bit (e.g., page lock, buffer head lock, writeback bit) synchronously.
>> Otherwise deadlock may be triggered.  This make it hard to batch the
>> synchronous migration directly.
>> This patch re-enables batching synchronous migration via trying to
>> migrate in batch asynchronously firstly.  And any folios that are
>> failed to be migrated asynchronously will be migrated synchronously
>> one by one.
>> Test shows that this can restore the TLB flushing batching performance
>> for synchronous migration effectively.
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Stefan Roesch <shr@devkernel.io>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Xin Hao <xhao@linux.alibaba.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 55 insertions(+), 10 deletions(-)
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 91198b487e49..c17ce5ee8d92 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   	return rc;
>>   }
>>
>> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
>> +		free_page_t put_new_page, unsigned long private,
>> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
>> +{
>> +	int rc, nr_failed = 0;
>> +	LIST_HEAD(folios);
>> +	struct migrate_pages_stats astats;
>> +
>> +	memset(&astats, 0, sizeof(astats));
>> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
>> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
>> +				 reason, &folios, split_folios, &astats,
>> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>> +	stats->nr_succeeded += astats.nr_succeeded;
>> +	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
>> +	stats->nr_thp_split += astats.nr_thp_split;
>> +	if (rc < 0) {
>> +		stats->nr_failed_pages += astats.nr_failed_pages;
>> +		stats->nr_thp_failed += astats.nr_thp_failed;
>> +		list_splice_tail(&folios, ret_folios);
>> +		return rc;
>> +	}
>> +	stats->nr_thp_failed += astats.nr_thp_split;
>> +	nr_failed += astats.nr_thp_split;
>> +	/*
>> +	 * Fall back to migrate all failed folios one by one synchronously. All
>> +	 * failed folios except split THPs will be retried, so their failure
>> +	 * isn't counted
>> +	 */
>> +	list_splice_tail_init(&folios, from);
>> +	while (!list_empty(from)) {
>> +		list_move(from->next, &folios);
>> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
>> +					 private, mode, reason, ret_folios,
>> +					 split_folios, stats, NR_MAX_MIGRATE_PAGES_RETRY);
>> +		list_splice_tail_init(&folios, ret_folios);
>> +		if (rc < 0)
>> +			return rc;
>> +		nr_failed += rc;
>> +	}
>> +
>> +	return nr_failed;
>> +}
>> +
>>   /*
>>    * migrate_pages - migrate the folios specified in a list, to the free folios
>>    *		   supplied as the target for the page migration
>> @@ -1874,7 +1919,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>   		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>>   {
>>   	int rc, rc_gather;
>> -	int nr_pages, batch;
>> +	int nr_pages;
>>   	struct folio *folio, *folio2;
>>   	LIST_HEAD(folios);
>>   	LIST_HEAD(ret_folios);
>> @@ -1890,10 +1935,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>   	if (rc_gather < 0)
>>   		goto out;
>>   -	if (mode == MIGRATE_ASYNC)
>> -		batch = NR_MAX_BATCHED_MIGRATION;
>> -	else
>> -		batch = 1;
>>   again:
>>   	nr_pages = 0;
>>   	list_for_each_entry_safe(folio, folio2, from, lru) {
>> @@ -1904,16 +1945,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>   		}
>>     		nr_pages += folio_nr_pages(folio);
>> -		if (nr_pages >= batch)
>> +		if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>>   			break;
>>   	}
>> -	if (nr_pages >= batch)
>> +	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>>   		list_cut_before(&folios, from, &folio2->lru);
>>   	else
>>   		list_splice_init(from, &folios);
>> -	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>> -				 mode, reason, &ret_folios, &split_folios, &stats,
>> -				 NR_MAX_MIGRATE_PAGES_RETRY);
>> +	if (mode == MIGRATE_ASYNC)
>> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>> +					 mode, reason, &ret_folios, &split_folios, &stats,
>> +					 NR_MAX_MIGRATE_PAGES_RETRY);
>> +	else
>> +		rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
>> +					mode, reason, &ret_folios, &split_folios, &stats);
>
> For split folios, it seems also reasonable to use migrate_pages_sync()
> instead of always using fixed MIGRATE_ASYNC mode?

For split folios, we only try to migrate them with minimal effort.
Previously, we decreased the number of retries from 10 to 1.  Now, I
think that it's reasonable to change the migration mode to MIGRATE_ASYNC
as well, to reduce latency.  They have already been counted as failures
anyway.
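
As a summary of how much effort each kind of migrate_pages_batch() call
gets in this series, here is a small userspace table (a sketch, not
kernel code).  The enum and struct names are invented for illustration,
and the value 10 for NR_MAX_MIGRATE_PAGES_RETRY is taken from the
"10 retries" discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Taken from the "10 retries" mentioned in the discussion. */
#define NR_MAX_MIGRATE_PAGES_RETRY 10

/* Local stand-in for the kernel's enum migrate_mode. */
enum sketch_mode { SKETCH_ASYNC, SKETCH_SYNC_LIGHT, SKETCH_SYNC };

/* Hypothetical labels for the three kinds of batch call. */
enum sketch_call {
	CALL_BATCH_FIRST_TRY,	/* whole batch, optimistic attempt */
	CALL_SYNC_FALLBACK,	/* folios that failed the async attempt */
	CALL_SPLIT_FOLIOS,	/* pieces of split large folios */
};

struct sketch_policy {
	enum sketch_mode mode;	/* mode passed to the batch function */
	int nr_pass;		/* maximum number of passes */
	bool one_by_one;	/* true if each call handles a single folio */
};

static struct sketch_policy pick_policy(enum sketch_call call,
					enum sketch_mode caller_mode)
{
	struct sketch_policy p;

	switch (call) {
	case CALL_SYNC_FALLBACK:
		/* Only reached when the caller asked for a sync mode. */
		assert(caller_mode != SKETCH_ASYNC);
		p.mode = caller_mode;
		p.nr_pass = NR_MAX_MIGRATE_PAGES_RETRY;
		p.one_by_one = true;
		break;
	case CALL_SPLIT_FOLIOS:
		/* Minimal effort: already counted as a failure. */
		p.mode = SKETCH_ASYNC;
		p.nr_pass = 1;
		p.one_by_one = false;
		break;
	case CALL_BATCH_FIRST_TRY:
	default:
		/* Batched attempts never wait for a lock or bit. */
		p.mode = SKETCH_ASYNC;
		p.nr_pass = NR_MAX_MIGRATE_PAGES_RETRY;
		p.one_by_one = false;
		break;
	}
	return p;
}
```

For example, pick_policy(CALL_SPLIT_FOLIOS, SKETCH_SYNC) yields async
mode with a single pass, regardless of the mode the caller requested.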

>>   	list_splice_tail_init(&folios, &ret_folios);
>>   	if (rc < 0) {
>>   		rc_gather = rc;

Best Regards,
Huang, Ying

* Re: [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch()
  2023-03-01  2:23   ` Baolin Wang
@ 2023-03-01  6:35     ` Huang, Ying
  2023-03-01 11:07       ` Baolin Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2023-03-01  6:35 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Matthew Wilcox, Mike Kravetz

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 2/24/2023 10:11 PM, Huang Ying wrote:
>> To simplify the code logic and reduce the line number.
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Stefan Roesch <shr@devkernel.io>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Xin Hao <xhao@linux.alibaba.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   mm/migrate.c | 76 ++++++++++++++++++----------------------------------
>>   1 file changed, 26 insertions(+), 50 deletions(-)
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 7ac37dbbf307..91198b487e49 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1605,9 +1605,10 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
>>   static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   		free_page_t put_new_page, unsigned long private,
>>   		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>> -		struct migrate_pages_stats *stats)
>> +		struct list_head *split_folios, struct migrate_pages_stats *stats,
>> +		int nr_pass)
>>   {
>> -	int retry;
>> +	int retry = 1;
>>   	int large_retry = 1;
>>   	int thp_retry = 1;
>>   	int nr_failed = 0;
>> @@ -1617,19 +1618,12 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   	bool is_large = false;
>>   	bool is_thp = false;
>>   	struct folio *folio, *folio2, *dst = NULL, *dst2;
>> -	int rc, rc_saved, nr_pages;
>> -	LIST_HEAD(split_folios);
>> +	int rc, rc_saved = 0, nr_pages;
>>   	LIST_HEAD(unmap_folios);
>>   	LIST_HEAD(dst_folios);
>>   	bool nosplit = (reason == MR_NUMA_MISPLACED);
>> -	bool no_split_folio_counting = false;
>>   -retry:
>> -	rc_saved = 0;
>> -	retry = 1;
>> -	for (pass = 0;
>> -	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
>> -	     pass++) {
>> +	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
>>   		retry = 0;
>>   		large_retry = 0;
>>   		thp_retry = 0;
>> @@ -1660,7 +1654,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   			if (!thp_migration_supported() && is_thp) {
>>   				nr_large_failed++;
>>   				stats->nr_thp_failed++;
>> -				if (!try_split_folio(folio, &split_folios)) {
>> +				if (!try_split_folio(folio, split_folios)) {
>>   					stats->nr_thp_split++;
>>   					continue;
>>   				}
>> @@ -1692,7 +1686,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   					stats->nr_thp_failed += is_thp;
>>   					/* Large folio NUMA faulting doesn't split to retry. */
>>   					if (!nosplit) {
>> -						int ret = try_split_folio(folio, &split_folios);
>> +						int ret = try_split_folio(folio, split_folios);
>>     						if (!ret) {
>>   							stats->nr_thp_split += is_thp;
>> @@ -1709,18 +1703,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   							break;
>>   						}
>>   					}
>> -				} else if (!no_split_folio_counting) {
>> +				} else {
>>   					nr_failed++;
>>   				}
>>     				stats->nr_failed_pages += nr_pages + nr_retry_pages;
>> -				/*
>> -				 * There might be some split folios of fail-to-migrate large
>> -				 * folios left in split_folios list. Move them to ret_folios
>> -				 * list so that they could be put back to the right list by
>> -				 * the caller otherwise the folio refcnt will be leaked.
>> -				 */
>> -				list_splice_init(&split_folios, ret_folios);
>>   				/* nr_failed isn't updated for not used */
>>   				nr_large_failed += large_retry;
>>   				stats->nr_thp_failed += thp_retry;
>> @@ -1733,7 +1720,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   				if (is_large) {
>>   					large_retry++;
>>   					thp_retry += is_thp;
>> -				} else if (!no_split_folio_counting) {
>> +				} else {
>>   					retry++;
>>   				}
>>   				nr_retry_pages += nr_pages;
>> @@ -1756,7 +1743,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   				if (is_large) {
>>   					nr_large_failed++;
>>   					stats->nr_thp_failed += is_thp;
>> -				} else if (!no_split_folio_counting) {
>> +				} else {
>>   					nr_failed++;
>>   				}
>>
>> @@ -1774,9 +1761,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   	try_to_unmap_flush();
>>     	retry = 1;
>> -	for (pass = 0;
>> -	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
>> -	     pass++) {
>> +	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
>>   		retry = 0;
>>   		large_retry = 0;
>>   		thp_retry = 0;
>> @@ -1805,7 +1790,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   				if (is_large) {
>>   					large_retry++;
>>   					thp_retry += is_thp;
>> -				} else if (!no_split_folio_counting) {
>> +				} else {
>>   					retry++;
>>   				}
>>   				nr_retry_pages += nr_pages;
>> @@ -1818,7 +1803,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   				if (is_large) {
>>   					nr_large_failed++;
>>   					stats->nr_thp_failed += is_thp;
>> -				} else if (!no_split_folio_counting) {
>> +				} else {
>>   					nr_failed++;
>>   				}
>>
>> @@ -1855,27 +1840,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>   		dst2 = list_next_entry(dst, lru);
>>   	}
>>   -	/*
>> -	 * Try to migrate split folios of fail-to-migrate large folios, no
>> -	 * nr_failed counting in this round, since all split folios of a
>> -	 * large folio is counted as 1 failure in the first round.
>> -	 */
>> -	if (rc >= 0 && !list_empty(&split_folios)) {
>> -		/*
>> -		 * Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
>> -		 * retries) to ret_folios to avoid migrating them again.
>> -		 */
>> -		list_splice_init(from, ret_folios);
>> -		list_splice_init(&split_folios, from);
>> -		/*
>> -		 * Force async mode to avoid to wait lock or bit when we have
>> -		 * locked more than one folios.
>> -		 */
>> -		mode = MIGRATE_ASYNC;
>> -		no_split_folio_counting = true;
>> -		goto retry;
>> -	}
>> -
>>   	return rc;
>>   }
>>
>> @@ -1914,6 +1878,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>   	struct folio *folio, *folio2;
>>   	LIST_HEAD(folios);
>>   	LIST_HEAD(ret_folios);
>> +	LIST_HEAD(split_folios);
>>   	struct migrate_pages_stats stats;
>>     	trace_mm_migrate_pages_start(mode, reason);
>> @@ -1947,12 +1912,23 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>   	else
>>   		list_splice_init(from, &folios);
>>   	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>> -				 mode, reason, &ret_folios, &stats);
>> +				 mode, reason, &ret_folios, &split_folios, &stats,
>> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>>   	list_splice_tail_init(&folios, &ret_folios);
>>   	if (rc < 0) {
>>   		rc_gather = rc;
>> +		list_splice_tail(&split_folios, &ret_folios);
>
> Can we still keep the original comments? Which can help to understand
> the case, at least for me:)
>  /*
>   * There might be some split folios of fail-to-migrate large
>   * folios left in split_folios list. Move them to ret_folios
>   * list so that they could be put back to the right list by
>   * the caller otherwise the folio refcnt will be leaked.
>   */

Previously, the cleanup code was buried in a corner of a much more
complex code path, so the comments were necessary.  Now it is an
explicit and simple code path.  And the rule is clear: every folio list
needs to be cleaned up before returning: folios, split_folios, then
ret_folios.  We do this in several places in the series.
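
The rule can be sketched in userspace as follows (not kernel code).  The
list helpers are local re-implementations that only borrow the kernel
names, and drain_lists_before_return() is a hypothetical helper.  The
final hand-back of ret_folios to the caller's "from" list is an
assumption of this sketch; it is not part of the hunks quoted here.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Append all entries of @list to the tail of @head and empty @list. */
static void list_splice_tail_init(struct list_head *list,
				  struct list_head *head)
{
	struct list_head *first = list->next;
	struct list_head *last = list->prev;
	struct list_head *tail = head->prev;

	if (first == list)
		return;		/* nothing to move */
	tail->next = first;
	first->prev = tail;
	last->next = head;
	head->prev = last;
	INIT_LIST_HEAD(list);
}

static size_t list_count(const struct list_head *head)
{
	const struct list_head *pos;
	size_t n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Drain every local list before returning so no folio is left behind. */
static void drain_lists_before_return(struct list_head *folios,
				      struct list_head *split_folios,
				      struct list_head *ret_folios,
				      struct list_head *from)
{
	list_splice_tail_init(folios, ret_folios);
	list_splice_tail_init(split_folios, ret_folios);
	/* Assumption: the caller gets everything back on "from". */
	list_splice_tail_init(ret_folios, from);
	assert(folios->next == folios);
	assert(split_folios->next == split_folios);
}
```

After the call, all three local lists are empty and every entry is
reachable from "from", so nothing can be forgotten on a local list.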

>>   		goto out;
>>   	}
>> +	if (!list_empty(&split_folios)) {
>> +		/*
>> +		 * Failure isn't counted since all split folios of a large folio
>> +		 * is counted as 1 failure already.
>> +		 */
>> +		migrate_pages_batch(&split_folios, get_new_page, put_new_page, private,
>> +				    MIGRATE_ASYNC, reason, &ret_folios, NULL, &stats, 1);
>
> Better to copy the original comments to explain why force to
> MIGRATE_ASYNC mode for split folios.

Yes.  It's a good idea to explain that.  Also, the rule for calling
migrate_pages_batch() has now changed: if mode != MIGRATE_ASYNC, the
length of "from" must be <= 1.  I will add a VM_WARN_ON() for that at
the beginning of migrate_pages_batch().  And I would rather add the
comments to the header of migrate_pages().  Other callers of the
function need to follow that rule too.
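
A userspace sketch of the condition such a check could test (not the
actual patch).  The list helpers below are local re-implementations that
borrow the kernel names, the enum is a stand-in for enum migrate_mode,
and batch_rule_violated() and batch_precheck() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* True if the list holds exactly one entry. */
static bool list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && head->next == head->prev;
}

/* Local stand-in for the kernel's enum migrate_mode. */
enum sketch_mode { SKETCH_ASYNC, SKETCH_SYNC_LIGHT, SKETCH_SYNC };

/* Rule: a non-async call may be given at most one folio. */
static bool batch_rule_violated(enum sketch_mode mode,
				const struct list_head *from)
{
	return mode != SKETCH_ASYNC &&
	       !list_empty(from) && !list_is_singular(from);
}

/*
 * Stand-in for the planned check.  Note that assert() aborts the
 * program, whereas a kernel warning macro would only print a warning.
 */
static void batch_precheck(enum sketch_mode mode,
			   const struct list_head *from)
{
	assert(!batch_rule_violated(mode, from));
}
```

An empty or single-entry list passes in any mode; a longer list passes
only in async mode.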

> Thanks for the simplification, and please feel free to add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Thank you very much for the review!

Best Regards,
Huang, Ying

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-03-01  6:08         ` Huang, Ying
@ 2023-03-01  6:46           ` Hugh Dickins
  2023-03-01  7:10             ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2023-03-01  6:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

On Wed, 1 Mar 2023, Huang, Ying wrote:
> Hugh Dickins <hughd@google.com> writes:
> > On Tue, 28 Feb 2023, Huang, Ying wrote:
> >> Hugh Dickins <hughd@google.com> writes:
> >> > On Fri, 24 Feb 2023, Huang Ying wrote:
> >> >> 
> >> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> >> index 91198b487e49..c17ce5ee8d92 100644
> >> >> --- a/mm/migrate.c
> >> >> +++ b/mm/migrate.c
> >> >> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> >> >>  	return rc;
> >> >>  }
> >> >>  
> >> >> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
> >> >> +		free_page_t put_new_page, unsigned long private,
> >> >> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
> >> >> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
> >> >> +{
> >> >> +	int rc, nr_failed = 0;
> >> >> +	LIST_HEAD(folios);
> >> >> +	struct migrate_pages_stats astats;
> >> >> +
> >> >> +	memset(&astats, 0, sizeof(astats));
> >> >> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
> >> >> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
> >> >> +				 reason, &folios, split_folios, &astats,
> >> >> +				 NR_MAX_MIGRATE_PAGES_RETRY);
> >> >
> >> > I wonder if that and below would better be NR_MAX_MIGRATE_PAGES_RETRY / 2.
> >> >
> >> > Though I've never got down to adjusting that number (and it's not a job
> >> > to be done in this set of patches), those 10 retries sometimes terrify
> >> > me, from a latency point of view.  They can have such different weights:
> >> > in the unmapped case, 10 retries is okay; but when a pinned page is mapped
> >> > into 1000 processes, the thought of all that unmapping and TLB flushing
> >> > and remapping is terrifying.
> >> >
> >> > Since you're retrying below, halve both numbers of retries for now?
> >> 
> >> Yes.  These are reasonable concerns.
> >> 
> >> And in the original implementation, we only wait to lock page and wait
> >> the writeback to complete if pass > 2.  This is kind of trying to
> >> migrate asynchronously for 3 times before the real synchronous
> >> migration.  So, should we delete the "force" logic (in
> >> migrate_folio_unmap()), and try to migrate asynchronously for 3 times in
> >> batch before migrating synchronously for 7 times one by one?
> >
> > Oh, that's a good idea (but please don't imagine I've thought it through):
> > I hadn't realized the way in which your migrate_pages_sync() addition is
> > kind of duplicating the way that the "force" argument conditions behaviour,
> > It would be very appealing to delete the "force" argument now if you can.
> 
> Sure.  Will do that in the next version.
> 
> > But aside from that, you've also made me wonder (again, please remember I
> > don't have a good picture of the new migrate_pages() sequence in my head)
> > whether you have already made a *great* strike against my 10 retries
> > terror.  Am I reading it right, that the unmapping is now done on the
> > first try, and the remove_migration_ptes after the last try (all the
> > pages involved having remained locked throughout)?
> 
> Yes.  You are right.  Now, unmapping and moving are two separate steps,
> and they are retried separately.  After a folio has been unmapped
> successfully, we will not remap/unmap it 10 times if the folio is pinned
> so that failed to move (migrate_folio_move()).  So the latency caused by
> retrying is much better now.  But I still tend to keep the total retry
> number as before.  Do you agree?

Yes, I agree, keep the total retry number 10 as before: maybe someone in
future will show that more than 5 is a waste of time, but there's little
need to get into that now: if you've put an end to that 10 times unmapping
and remapping, that's a great step forward, quite apart from the TLB flush
batching itself.

(I did change "no need" to "little need" above: I do have some
anxiety about the increased latencies from keeping folios locked and
migration entries in place for significantly longer than before your
batching: I won't be surprised if the maximum batch size has to be
lowered, if reports of latency spikes come in; and that might extend
to the retry count too.)

Hugh

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-03-01  6:46           ` Hugh Dickins
@ 2023-03-01  7:10             ` Huang, Ying
  0 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2023-03-01  7:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Baolin Wang, Matthew Wilcox, Mike Kravetz

Hugh Dickins <hughd@google.com> writes:

> On Wed, 1 Mar 2023, Huang, Ying wrote:
>> Hugh Dickins <hughd@google.com> writes:
>> > On Tue, 28 Feb 2023, Huang, Ying wrote:
>> >> Hugh Dickins <hughd@google.com> writes:
>> >> > On Fri, 24 Feb 2023, Huang Ying wrote:
>> >> >> 
>> >> >> diff --git a/mm/migrate.c b/mm/migrate.c
>> >> >> index 91198b487e49..c17ce5ee8d92 100644
>> >> >> --- a/mm/migrate.c
>> >> >> +++ b/mm/migrate.c
>> >> >> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>> >> >>  	return rc;
>> >> >>  }
>> >> >>  
>> >> >> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
>> >> >> +		free_page_t put_new_page, unsigned long private,
>> >> >> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>> >> >> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
>> >> >> +{
>> >> >> +	int rc, nr_failed = 0;
>> >> >> +	LIST_HEAD(folios);
>> >> >> +	struct migrate_pages_stats astats;
>> >> >> +
>> >> >> +	memset(&astats, 0, sizeof(astats));
>> >> >> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
>> >> >> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
>> >> >> +				 reason, &folios, split_folios, &astats,
>> >> >> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>> >> >
>> >> > I wonder if that and below would better be NR_MAX_MIGRATE_PAGES_RETRY / 2.
>> >> >
>> >> > Though I've never got down to adjusting that number (and it's not a job
>> >> > to be done in this set of patches), those 10 retries sometimes terrify
>> >> > me, from a latency point of view.  They can have such different weights:
>> >> > in the unmapped case, 10 retries is okay; but when a pinned page is mapped
>> >> > into 1000 processes, the thought of all that unmapping and TLB flushing
>> >> > and remapping is terrifying.
>> >> >
>> >> > Since you're retrying below, halve both numbers of retries for now?
>> >> 
>> >> Yes.  These are reasonable concerns.
>> >> 
>> >> And in the original implementation, we only wait to lock page and wait
>> >> the writeback to complete if pass > 2.  This is kind of trying to
>> >> migrate asynchronously for 3 times before the real synchronous
>> >> migration.  So, should we delete the "force" logic (in
>> >> migrate_folio_unmap()), and try to migrate asynchronously for 3 times in
>> >> batch before migrating synchronously for 7 times one by one?
>> >
>> > Oh, that's a good idea (but please don't imagine I've thought it through):
>> > I hadn't realized the way in which your migrate_pages_sync() addition is
>> > kind of duplicating the way that the "force" argument conditions behaviour,
>> > It would be very appealing to delete the "force" argument now if you can.
>> 
>> Sure.  Will do that in the next version.
>> 
>> > But aside from that, you've also made me wonder (again, please remember I
>> > don't have a good picture of the new migrate_pages() sequence in my head)
>> > whether you have already made a *great* strike against my 10 retries
>> > terror.  Am I reading it right, that the unmapping is now done on the
>> > first try, and the remove_migration_ptes after the last try (all the
>> > pages involved having remained locked throughout)?
>> 
>> Yes.  You are right.  Now, unmapping and moving are two separate steps,
>> and they are retried separately.  After a folio has been unmapped
>> successfully, we will not remap/unmap it 10 times if the folio is pinned
>> so that failed to move (migrate_folio_move()).  So the latency caused by
>> retrying is much better now.  But I still tend to keep the total retry
>> number as before.  Do you agree?
>
> Yes, I agree, keep the total retry number 10 as before: maybe someone in
> future will show that more than 5 is a waste of time, but there's little
> need to get into that now: if you've put an end to that 10 times unmapping
> and remapping, that's a great step forward, quite apart from the TLB flush
> batching itself.
>
> (I did change "no need" to "little need" above: I do have some
> anxiety about the increased latencies from keeping folios locked and
> migration entries in place for significantly longer than before your
> batching: I won't be surprised if the maximum batch size has to be
> lowered, if reports of latency spikes come in; and that might extend
> to the retry count too.)

Yes.  Latency is always a concern with batching.  We may revisit this
when needed.  One good thing now is that we never wait for a lock or
bit in batched mode.  Latency tolerance depends on the caller too; for
example, when we migrate some cold pages from DRAM to CXL memory, we can
tolerate relatively long latency.  If necessary, we can also add a
parameter to migrate_pages() to restrict the batch size and the number
of retries.

Best Regards,
Huang, Ying

* Re: [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly
  2023-03-01  6:18     ` Huang, Ying
@ 2023-03-01 11:03       ` Baolin Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2023-03-01 11:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Matthew Wilcox, Mike Kravetz



On 3/1/2023 2:18 PM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 2/24/2023 10:11 PM, Huang Ying wrote:
>>> When we have locked more than one folios, we cannot wait the lock or
>>> bit (e.g., page lock, buffer head lock, writeback bit) synchronously.
>>> Otherwise deadlock may be triggered.  This make it hard to batch the
>>> synchronous migration directly.
>>> This patch re-enables batching synchronous migration via trying to
>>> migrate in batch asynchronously firstly.  And any folios that are
>>> failed to be migrated asynchronously will be migrated synchronously
>>> one by one.
>>> Test shows that this can restore the TLB flushing batching performance
>>> for synchronous migration effectively.
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
>>> Cc: Christoph Hellwig <hch@lst.de>
>>> Cc: Stefan Roesch <shr@devkernel.io>
>>> Cc: Tejun Heo <tj@kernel.org>
>>> Cc: Xin Hao <xhao@linux.alibaba.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>    mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------
>>>    1 file changed, 55 insertions(+), 10 deletions(-)
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 91198b487e49..c17ce5ee8d92 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1843,6 +1843,51 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    	return rc;
>>>    }
>>>
>>> +static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
>>> +		free_page_t put_new_page, unsigned long private,
>>> +		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>>> +		struct list_head *split_folios, struct migrate_pages_stats *stats)
>>> +{
>>> +	int rc, nr_failed = 0;
>>> +	LIST_HEAD(folios);
>>> +	struct migrate_pages_stats astats;
>>> +
>>> +	memset(&astats, 0, sizeof(astats));
>>> +	/* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
>>> +	rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
>>> +				 reason, &folios, split_folios, &astats,
>>> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>>> +	stats->nr_succeeded += astats.nr_succeeded;
>>> +	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
>>> +	stats->nr_thp_split += astats.nr_thp_split;
>>> +	if (rc < 0) {
>>> +		stats->nr_failed_pages += astats.nr_failed_pages;
>>> +		stats->nr_thp_failed += astats.nr_thp_failed;
>>> +		list_splice_tail(&folios, ret_folios);
>>> +		return rc;
>>> +	}
>>> +	stats->nr_thp_failed += astats.nr_thp_split;
>>> +	nr_failed += astats.nr_thp_split;
>>> +	/*
>>> +	 * Fall back to migrate all failed folios one by one synchronously. All
>>> +	 * failed folios except split THPs will be retried, so their failure
>>> +	 * isn't counted
>>> +	 */
>>> +	list_splice_tail_init(&folios, from);
>>> +	while (!list_empty(from)) {
>>> +		list_move(from->next, &folios);
>>> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
>>> +					 private, mode, reason, ret_folios,
>>> +					 split_folios, stats, NR_MAX_MIGRATE_PAGES_RETRY);
>>> +		list_splice_tail_init(&folios, ret_folios);
>>> +		if (rc < 0)
>>> +			return rc;
>>> +		nr_failed += rc;
>>> +	}
>>> +
>>> +	return nr_failed;
>>> +}
>>> +
>>>    /*
>>>     * migrate_pages - migrate the folios specified in a list, to the free folios
>>>     *		   supplied as the target for the page migration
>>> @@ -1874,7 +1919,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>    		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>>>    {
>>>    	int rc, rc_gather;
>>> -	int nr_pages, batch;
>>> +	int nr_pages;
>>>    	struct folio *folio, *folio2;
>>>    	LIST_HEAD(folios);
>>>    	LIST_HEAD(ret_folios);
>>> @@ -1890,10 +1935,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>    	if (rc_gather < 0)
>>>    		goto out;
>>>    -	if (mode == MIGRATE_ASYNC)
>>> -		batch = NR_MAX_BATCHED_MIGRATION;
>>> -	else
>>> -		batch = 1;
>>>    again:
>>>    	nr_pages = 0;
>>>    	list_for_each_entry_safe(folio, folio2, from, lru) {
>>> @@ -1904,16 +1945,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>    		}
>>>      		nr_pages += folio_nr_pages(folio);
>>> -		if (nr_pages >= batch)
>>> +		if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>>>    			break;
>>>    	}
>>> -	if (nr_pages >= batch)
>>> +	if (nr_pages >= NR_MAX_BATCHED_MIGRATION)
>>>    		list_cut_before(&folios, from, &folio2->lru);
>>>    	else
>>>    		list_splice_init(from, &folios);
>>> -	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>>> -				 mode, reason, &ret_folios, &split_folios, &stats,
>>> -				 NR_MAX_MIGRATE_PAGES_RETRY);
>>> +	if (mode == MIGRATE_ASYNC)
>>> +		rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>>> +					 mode, reason, &ret_folios, &split_folios, &stats,
>>> +					 NR_MAX_MIGRATE_PAGES_RETRY);
>>> +	else
>>> +		rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
>>> +					mode, reason, &ret_folios, &split_folios, &stats);
>>
>> For split folios, it seems also reasonable to use migrate_pages_sync()
>> instead of always using fixed MIGRATE_ASYNC mode?
> 
> For split folios, we only try to migrate them with minimal effort.
> Previously, we decrease the retry number from 10 to 1.  Now, I think
> that it's reasonable to change the migration mode to MIGRATE_ASYNC to
> reduce latency.  They have been counted as failure anyway.

Sounds reasonable. Thanks for the explanation. Please feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

* Re: [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch()
  2023-03-01  6:35     ` Huang, Ying
@ 2023-03-01 11:07       ` Baolin Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2023-03-01 11:07 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Xu, Pengfei,
	Christoph Hellwig, Stefan Roesch, Tejun Heo, Xin Hao, Zi Yan,
	Yang Shi, Matthew Wilcox, Mike Kravetz



On 3/1/2023 2:35 PM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 2/24/2023 10:11 PM, Huang Ying wrote:
>>> To simplify the code logic and reduce the number of lines.
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
>>> Cc: Christoph Hellwig <hch@lst.de>
>>> Cc: Stefan Roesch <shr@devkernel.io>
>>> Cc: Tejun Heo <tj@kernel.org>
>>> Cc: Xin Hao <xhao@linux.alibaba.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>    mm/migrate.c | 76 ++++++++++++++++++----------------------------------
>>>    1 file changed, 26 insertions(+), 50 deletions(-)
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 7ac37dbbf307..91198b487e49 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1605,9 +1605,10 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
>>>    static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    		free_page_t put_new_page, unsigned long private,
>>>    		enum migrate_mode mode, int reason, struct list_head *ret_folios,
>>> -		struct migrate_pages_stats *stats)
>>> +		struct list_head *split_folios, struct migrate_pages_stats *stats,
>>> +		int nr_pass)
>>>    {
>>> -	int retry;
>>> +	int retry = 1;
>>>    	int large_retry = 1;
>>>    	int thp_retry = 1;
>>>    	int nr_failed = 0;
>>> @@ -1617,19 +1618,12 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    	bool is_large = false;
>>>    	bool is_thp = false;
>>>    	struct folio *folio, *folio2, *dst = NULL, *dst2;
>>> -	int rc, rc_saved, nr_pages;
>>> -	LIST_HEAD(split_folios);
>>> +	int rc, rc_saved = 0, nr_pages;
>>>    	LIST_HEAD(unmap_folios);
>>>    	LIST_HEAD(dst_folios);
>>>    	bool nosplit = (reason == MR_NUMA_MISPLACED);
>>> -	bool no_split_folio_counting = false;
>>>
>>> -retry:
>>> -	rc_saved = 0;
>>> -	retry = 1;
>>> -	for (pass = 0;
>>> -	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
>>> -	     pass++) {
>>> +	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
>>>    		retry = 0;
>>>    		large_retry = 0;
>>>    		thp_retry = 0;
>>> @@ -1660,7 +1654,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    			if (!thp_migration_supported() && is_thp) {
>>>    				nr_large_failed++;
>>>    				stats->nr_thp_failed++;
>>> -				if (!try_split_folio(folio, &split_folios)) {
>>> +				if (!try_split_folio(folio, split_folios)) {
>>>    					stats->nr_thp_split++;
>>>    					continue;
>>>    				}
>>> @@ -1692,7 +1686,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    					stats->nr_thp_failed += is_thp;
>>>    					/* Large folio NUMA faulting doesn't split to retry. */
>>>    					if (!nosplit) {
>>> -						int ret = try_split_folio(folio, &split_folios);
>>> +						int ret = try_split_folio(folio, split_folios);
>>>      						if (!ret) {
>>>    							stats->nr_thp_split += is_thp;
>>> @@ -1709,18 +1703,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    							break;
>>>    						}
>>>    					}
>>> -				} else if (!no_split_folio_counting) {
>>> +				} else {
>>>    					nr_failed++;
>>>    				}
>>>
>>>    				stats->nr_failed_pages += nr_pages + nr_retry_pages;
>>> -				/*
>>> -				 * There might be some split folios of fail-to-migrate large
>>> -				 * folios left in split_folios list. Move them to ret_folios
>>> -				 * list so that they could be put back to the right list by
>>> -				 * the caller otherwise the folio refcnt will be leaked.
>>> -				 */
>>> -				list_splice_init(&split_folios, ret_folios);
>>>    				/* nr_failed isn't updated for not used */
>>>    				nr_large_failed += large_retry;
>>>    				stats->nr_thp_failed += thp_retry;
>>> @@ -1733,7 +1720,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    				if (is_large) {
>>>    					large_retry++;
>>>    					thp_retry += is_thp;
>>> -				} else if (!no_split_folio_counting) {
>>> +				} else {
>>>    					retry++;
>>>    				}
>>>    				nr_retry_pages += nr_pages;
>>> @@ -1756,7 +1743,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    				if (is_large) {
>>>    					nr_large_failed++;
>>>    					stats->nr_thp_failed += is_thp;
>>> -				} else if (!no_split_folio_counting) {
>>> +				} else {
>>>    					nr_failed++;
>>>    				}
>>>
>>> @@ -1774,9 +1761,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    	try_to_unmap_flush();
>>>      	retry = 1;
>>> -	for (pass = 0;
>>> -	     pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
>>> -	     pass++) {
>>> +	for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
>>>    		retry = 0;
>>>    		large_retry = 0;
>>>    		thp_retry = 0;
>>> @@ -1805,7 +1790,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    				if (is_large) {
>>>    					large_retry++;
>>>    					thp_retry += is_thp;
>>> -				} else if (!no_split_folio_counting) {
>>> +				} else {
>>>    					retry++;
>>>    				}
>>>    				nr_retry_pages += nr_pages;
>>> @@ -1818,7 +1803,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    				if (is_large) {
>>>    					nr_large_failed++;
>>>    					stats->nr_thp_failed += is_thp;
>>> -				} else if (!no_split_folio_counting) {
>>> +				} else {
>>>    					nr_failed++;
>>>    				}
>>>
>>> @@ -1855,27 +1840,6 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
>>>    		dst2 = list_next_entry(dst, lru);
>>>    	}
>>>
>>> -	/*
>>> -	 * Try to migrate split folios of fail-to-migrate large folios, no
>>> -	 * nr_failed counting in this round, since all split folios of a
>>> -	 * large folio is counted as 1 failure in the first round.
>>> -	 */
>>> -	if (rc >= 0 && !list_empty(&split_folios)) {
>>> -		/*
>>> -		 * Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
>>> -		 * retries) to ret_folios to avoid migrating them again.
>>> -		 */
>>> -		list_splice_init(from, ret_folios);
>>> -		list_splice_init(&split_folios, from);
>>> -		/*
>>> -		 * Force async mode to avoid to wait lock or bit when we have
>>> -		 * locked more than one folios.
>>> -		 */
>>> -		mode = MIGRATE_ASYNC;
>>> -		no_split_folio_counting = true;
>>> -		goto retry;
>>> -	}
>>> -
>>>    	return rc;
>>>    }
>>>
>>> @@ -1914,6 +1878,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>    	struct folio *folio, *folio2;
>>>    	LIST_HEAD(folios);
>>>    	LIST_HEAD(ret_folios);
>>> +	LIST_HEAD(split_folios);
>>>    	struct migrate_pages_stats stats;
>>>      	trace_mm_migrate_pages_start(mode, reason);
>>> @@ -1947,12 +1912,23 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>    	else
>>>    		list_splice_init(from, &folios);
>>>    	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
>>> -				 mode, reason, &ret_folios, &stats);
>>> +				 mode, reason, &ret_folios, &split_folios, &stats,
>>> +				 NR_MAX_MIGRATE_PAGES_RETRY);
>>>    	list_splice_tail_init(&folios, &ret_folios);
>>>    	if (rc < 0) {
>>>    		rc_gather = rc;
>>> +		list_splice_tail(&split_folios, &ret_folios);
>>
>> Can we still keep the original comments? They help in understanding
>> the case, at least for me:)
>>   /*
>>    * There might be some split folios of fail-to-migrate large
>>    * folios left in split_folios list. Move them to ret_folios
>>    * list so that they could be put back to the right list by
>>    * the caller otherwise the folio refcnt will be leaked.
>>    */
> 
> Previously, the cleanup code was buried in a corner of a much more
> complex code path, so the comments were necessary.  Now, it is an
> explicit and simple code path.  And the rule is clear: every folio list
> needs to be cleaned up before return: folios, split_folios, then
> ret_folios.  We have done this in several places in the series.

OK. Fair enough.
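
The cleanup rule stated above (folios, then split_folios, all ending up on ret_folios before returning) can be sketched in user space as follows. Lists are modelled as plain counters, and model_list, model_splice_init() and model_finish_round() are hypothetical names; the success path assumes that leftover split folios are also handed back, which the quoted hunks do not show in full.

```c
#include <assert.h>

/* A list is modelled by the number of folios it holds. */
struct model_list {
	int nr;
};

/* Move everything from src to dst and leave src empty. */
static void model_splice_init(struct model_list *src, struct model_list *dst)
{
	dst->nr += src->nr;
	src->nr = 0;
}

/*
 * Clean up after one batch.  rc is the (modelled) batch result.
 * On every return path, folios and split_folios are empty and all
 * remaining folios sit on ret_folios for the caller to put back.
 */
static int model_finish_round(int rc, struct model_list *folios,
			      struct model_list *split_folios,
			      struct model_list *ret_folios)
{
	/* Folios the batch did not migrate go back to the caller. */
	model_splice_init(folios, ret_folios);

	if (rc < 0) {
		/* Error path: split folios are returned untouched. */
		model_splice_init(split_folios, ret_folios);
		return rc;
	}

	/*
	 * Success path: the real code would first retry the split folios
	 * once; here we only model returning whatever is left over.
	 */
	model_splice_init(split_folios, ret_folios);
	return rc;
}
```

The point of the model is the invariant, not the migration itself: no folio stays on a local list when the function returns, so no reference is leaked.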

> 
>>>    		goto out;
>>>    	}
>>> +	if (!list_empty(&split_folios)) {
>>> +		/*
>>> +		 * Failure isn't counted since all split folios of a large folio
>>> +		 * are counted as 1 failure already.
>>> +		 */
>>> +		migrate_pages_batch(&split_folios, get_new_page, put_new_page, private,
>>> +				    MIGRATE_ASYNC, reason, &ret_folios, NULL, &stats, 1);
>>
>> Better to copy the original comments to explain why we force
>> MIGRATE_ASYNC mode for split folios.
> 
> Yes.  It's a good idea to explain that.  And now the rule for calling
> migrate_pages_batch() has changed.  If mode != MIGRATE_ASYNC, the
> length of "from" must be <= 1.  I will add a VM_WARN_ON() for that at
> the beginning of migrate_pages_batch().  And I would rather add the
> comments to the header of migrate_pages().  Other callers of the
> function need to follow that rule too.

Looks reasonable to me. Thanks.
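
The calling rule described above (a non-MIGRATE_ASYNC call may pass at most one folio) can be expressed as a small predicate. This is a user-space sketch under stated assumptions: model_mode, model_folio and model_batch_call_is_valid() are hypothetical names, and the real check would be a VM_WARN_ON() on the kernel's own list type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's migrate modes. */
enum model_mode { MODEL_ASYNC, MODEL_SYNC_LIGHT, MODEL_SYNC };

/* Minimal singly linked node standing in for a folio on a list. */
struct model_folio {
	struct model_folio *next;
};

/* Count list entries, stopping once the count exceeds cap. */
static size_t model_list_len_capped(const struct model_folio *head, size_t cap)
{
	size_t n = 0;

	while (head && n <= cap) {
		n++;
		head = head->next;
	}
	return n;
}

/*
 * Return true if the call obeys the rule from the thread: any number
 * of folios in async mode, at most one folio in every other mode.
 */
static bool model_batch_call_is_valid(enum model_mode mode,
				      const struct model_folio *from)
{
	if (mode == MODEL_ASYNC)
		return true;
	return model_list_len_capped(from, 1) <= 1;
}
```

A caller that needs synchronous behaviour for several folios would therefore have to feed them one at a time, which is what the series does for the fallback path.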

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-03-01 11:07 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-24 14:11 [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Huang Ying
2023-02-24 14:11 ` [PATCH 1/3] migrate_pages: fix deadlock in batched migration Huang Ying
2023-02-28  6:13   ` Hugh Dickins
2023-02-28  7:22     ` Huang, Ying
2023-02-28 21:07       ` Hugh Dickins
2023-03-01  1:17         ` Huang, Ying
2023-02-24 14:11 ` [PATCH 2/3] migrate_pages: move split folios processing out of migrate_pages_batch() Huang Ying
2023-03-01  2:23   ` Baolin Wang
2023-03-01  6:35     ` Huang, Ying
2023-03-01 11:07       ` Baolin Wang
2023-02-24 14:11 ` [PATCH 3/3] migrate_pages: try migrate in batch asynchronously firstly Huang Ying
2023-02-28  6:36   ` Hugh Dickins
2023-02-28  7:45     ` Huang, Ying
2023-02-28 21:22       ` Hugh Dickins
2023-03-01  6:08         ` Huang, Ying
2023-03-01  6:46           ` Hugh Dickins
2023-03-01  7:10             ` Huang, Ying
2023-03-01  3:08   ` Baolin Wang
2023-03-01  6:18     ` Huang, Ying
2023-03-01 11:03       ` Baolin Wang
2023-02-26  4:55 ` [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous migration Andrew Morton
2023-02-27  1:25   ` Huang, Ying
