All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Zi Yan <ziy@nvidia.com>,
	Yang Shi <shy828301@gmail.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Bharata B Rao <bharata@amd.com>,
	Alistair Popple <apopple@nvidia.com>,
	Xin Hao <xhao@linux.alibaba.com>,
	Minchan Kim <minchan@kernel.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	"Xu, Pengfei" <pengfei.xu@intel.com>,
	Christoph Hellwig <hch@lst.de>, Stefan Roesch <shr@devkernel.io>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing
Date: Mon, 20 Feb 2023 18:48:38 -0800 (PST)	[thread overview]
Message-ID: <2ab4b33e-f570-a6ff-6315-7d5a4614a7bd@google.com> (raw)
In-Reply-To: <874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, 20 Feb 2023, Huang, Ying wrote:

> Hi, Hugh,
> 
> Hugh Dickins <hughd@google.com> writes:
> 
> > On Mon, 13 Feb 2023, Huang Ying wrote:
> >
> >> From: "Huang, Ying" <ying.huang@intel.com>
> >> 
> >> Now, migrate_pages() migrate folios one by one, like the fake code as
> >> follows,
> >> 
> >>   for each folio
> >>     unmap
> >>     flush TLB
> >>     copy
> >>     restore map
> >> 
> >> If multiple folios are passed to migrate_pages(), there are
> >> opportunities to batch the TLB flushing and copying.  That is, we can
> >> change the code to something as follows,
> >> 
> >>   for each folio
> >>     unmap
> >>   for each folio
> >>     flush TLB
> >>   for each folio
> >>     copy
> >>   for each folio
> >>     restore map
> >> 
> >> The total number of TLB flushing IPI can be reduced considerably.  And
> >> we may use some hardware accelerator such as DSA to accelerate the
> >> folio copying.
> >> 
> >> So in this patch, we refactor the migrate_pages() implementation and
> >> implement the TLB flushing batching.  Base on this, hardware
> >> accelerated folio copying can be implemented.
> >> 
> >> If too many folios are passed to migrate_pages(), in the naive batched
> >> implementation, we may unmap too many folios at the same time.  The
> >> possibility for a task to wait for the migrated folios to be mapped
> >> again increases.  So the latency may be hurt.  To deal with this
> >> issue, the max number of folios be unmapped in batch is restricted to
> >> no more than HPAGE_PMD_NR in the unit of page.  That is, the influence
> >> is at the same level of THP migration.
> >> 
> >> We use the following test to measure the performance impact of the
> >> patchset,
> >> 
> >> On a 2-socket Intel server,
> >> 
> >>  - Run pmbench memory accessing benchmark
> >> 
> >>  - Run `migratepages` to migrate pages of pmbench between node 0 and
> >>    node 1 back and forth.
> >> 
> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and
> >> the number of pages migrated successfully per second increases 291.7%.
> >> 
> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
> >> 2 NUMA nodes.  Test results show that the page migration performance
> >> increases up to 78%.
> >> 
> >> This patchset is based on mm-unstable 2023-02-10.
> >
> > And back in linux-next this week: I tried next-20230217 overnight.
> >
> > There is a deadlock in this patchset (and in previous versions: sorry
> > it's taken me so long to report), but I think one that's easily solved.
> >
> > I've not bisected to precisely which patch (load can take several hours
> > to hit the deadlock), but it doesn't really matter, and I expect that
> > you can guess.
> >
> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
> > So, plenty of ext4 page cache and buffer_heads.
> >
> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
> > either in kcompactd0 or in khugepaged trying to compact, or in both:
> > it ends up calling __lock_buffer(), and that schedules away, waiting
> > forever to get BH_lock.  I have not identified who is holding BH_lock,
> > but I imagine a jbd2 journalling thread, and presume that it wants one
> > of the folio locks which migrate_pages_batch() is already holding; or
> > maybe it's all more convoluted than that.  Other tasks then back up
> > waiting on those folio locks held in the batch.
> >
> > Never a problem with buffer_migrate_folio(), always with the "more
> > careful" buffer_migrate_folio_norefs().  And the patch below fixes
> > it for me: I've had enough hours with it now, on enough occasions,
> > to be confident of that.
> >
> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> > very well, and I hope can assure us that there is an understandable
> > deadlock here, from holding several random folio locks, then trying
> > to lock buffers.  Cc'ing fsdevel, because there's a risk that mm
> > folk think something is safe, when it's not sufficient to cope with
> > the diversity of filesystems.  I hope nothing more than the below is
> > needed (and I've had no other problems with the patchset: good job),
> > but cannot be sure.
> >
> > [PATCH next] migrate_pages: fix deadlock on buffer heads
> >
> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> >
> > --- next-20230217/mm/migrate.c
> > +++ fixed/mm/migrate.c
> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
> >  	if (folio_ref_count(src) != expected_count)
> >  		return -EAGAIN;
> >  
> > -	if (!buffer_migrate_lock_buffers(head, mode))
> > +	if (!buffer_migrate_lock_buffers(head,
> > +			check_refs ? MIGRATE_ASYNC : mode))
> >  		return -EAGAIN;
> >  
> >  	if (check_refs) {
> 
> Thank you very much for pointing this out and the fix patch.  Today, my
> colleague Pengfei reported a deadlock bug to me.  It seems that we
> cannot wait the writeback to complete when we have locked some folios.
> Below patch can fix that deadlock.  I don't know whether this is related
> to the deadlock you run into.  It appears that we should avoid to
> lock/wait synchronously if we have locked more than one folios.

Thanks, I've checked now, on next-20230217 without my patch but
with your patch below: it took a few hours, but still deadlocks
as I described above, so it's not the same issue.

Yes, that's a good principle, that we should avoid to lock/wait
synchronously once we have locked one folio (hmm, above you say
"more than one": I think we mean the same thing, we're just
stating it differently, given how the code runs at present).

I'm not a great fan of migrate_folio_unmap()'s arguments,
"force" followed by "oh, but don't force" (but applaud the recent
"avoid_force_lock" as much better than the original "force_lock").
I haven't tried, but I wonder if you can avoid both those arguments,
and both of these patches, by passing down an adjusted mode (perhaps
MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first
folio of a batch has been acquired (then restore to the original mode
when starting a new batch).

(My patch is weak in that it trylocks for buffer_head even on the
first folio of a MIGRATE_SYNC norefs batch, although that has never
given a problem historically: adjusting the mode after acquiring
the first folio would correct that weakness.)

Hugh

> 
> Best Regards,
> Huang, Ying
> 
> ------------------------------------8<------------------------------------
> From 0699fa2f80a67e863107d49a25909c92b900a9be Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Mon, 20 Feb 2023 14:56:34 +0800
> Subject: [PATCH] migrate_pages: fix deadlock on waiting writeback
> 
> Pengfei reported a system soft lockup issue with Syzkaller.  The stack
> traces are as follows,
> 
> ...
> [  300.124933] INFO: task kworker/u4:3:73 blocked for more than 147 seconds.
> [  300.125214]       Not tainted 6.2.0-rc4-kvm+ #1314
> [  300.125408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  300.125736] task:kworker/u4:3    state:D stack:0     pid:73    ppid:2      flags:0x00004000
> [  300.126059] Workqueue: writeback wb_workfn (flush-7:3)
> [  300.126282] Call Trace:
> [  300.126378]  <TASK>
> [  300.126464]  __schedule+0x43b/0xd00
> [  300.126601]  ? __blk_flush_plug+0x142/0x180
> [  300.126765]  schedule+0x6a/0xf0
> [  300.126912]  io_schedule+0x4a/0x80
> [  300.127051]  folio_wait_bit_common+0x1b5/0x4e0
> [  300.127227]  ? __pfx_wake_page_function+0x10/0x10
> [  300.127403]  __folio_lock+0x27/0x40
> [  300.127541]  write_cache_pages+0x350/0x870
> [  300.127699]  ? __pfx_iomap_do_writepage+0x10/0x10
> [  300.127889]  iomap_writepages+0x3f/0x80
> [  300.128037]  xfs_vm_writepages+0x94/0xd0
> [  300.128192]  ? __pfx_xfs_vm_writepages+0x10/0x10
> [  300.128370]  do_writepages+0x10a/0x240
> [  300.128514]  ? lock_is_held_type+0xe6/0x140
> [  300.128675]  __writeback_single_inode+0x9f/0xa90
> [  300.128854]  writeback_sb_inodes+0x2fb/0x8d0
> [  300.129030]  __writeback_inodes_wb+0x68/0x150
> [  300.129212]  wb_writeback+0x49c/0x770
> [  300.129357]  wb_workfn+0x6fb/0x9d0
> [  300.129500]  process_one_work+0x3cc/0x8d0
> [  300.129669]  worker_thread+0x66/0x630
> [  300.129824]  ? __pfx_worker_thread+0x10/0x10
> [  300.129989]  kthread+0x153/0x190
> [  300.130116]  ? __pfx_kthread+0x10/0x10
> [  300.130264]  ret_from_fork+0x29/0x50
> [  300.130409]  </TASK>
> [  300.179347] INFO: task repro:1023 blocked for more than 147 seconds.
> [  300.179905]       Not tainted 6.2.0-rc4-kvm+ #1314
> [  300.180317] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  300.180955] task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
> [  300.181660] Call Trace:
> [  300.181879]  <TASK>
> [  300.182085]  __schedule+0x43b/0xd00
> [  300.182407]  schedule+0x6a/0xf0
> [  300.182694]  io_schedule+0x4a/0x80
> [  300.183020]  folio_wait_bit_common+0x1b5/0x4e0
> [  300.183506]  ? compaction_alloc+0x77/0x1150
> [  300.183892]  ? __pfx_wake_page_function+0x10/0x10
> [  300.184304]  folio_wait_bit+0x30/0x40
> [  300.184640]  folio_wait_writeback+0x2e/0x1e0
> [  300.185034]  migrate_pages_batch+0x555/0x1ac0
> [  300.185462]  ? __pfx_compaction_alloc+0x10/0x10
> [  300.185808]  ? __pfx_compaction_free+0x10/0x10
> [  300.186022]  ? __this_cpu_preempt_check+0x17/0x20
> [  300.186234]  ? lock_is_held_type+0xe6/0x140
> [  300.186423]  migrate_pages+0x100e/0x1180
> [  300.186603]  ? __pfx_compaction_free+0x10/0x10
> [  300.186800]  ? __pfx_compaction_alloc+0x10/0x10
> [  300.187011]  compact_zone+0xe10/0x1b50
> [  300.187182]  ? lock_is_held_type+0xe6/0x140
> [  300.187374]  ? check_preemption_disabled+0x80/0xf0
> [  300.187588]  compact_node+0xa3/0x100
> [  300.187755]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [  300.187993]  ? _find_first_bit+0x7b/0x90
> [  300.188171]  sysctl_compaction_handler+0x5d/0xb0
> [  300.188376]  proc_sys_call_handler+0x29d/0x420
> [  300.188583]  proc_sys_write+0x2b/0x40
> [  300.188749]  vfs_write+0x3a3/0x780
> [  300.188912]  ksys_write+0xb7/0x180
> [  300.189070]  __x64_sys_write+0x26/0x30
> [  300.189260]  do_syscall_64+0x3b/0x90
> [  300.189424]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [  300.189654] RIP: 0033:0x7f3a2471f59d
> [  300.189815] RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
> [  300.190137] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
> [  300.190397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
> [  300.190653] RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
> [  300.190910] R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
> [  300.191172] R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
> [  300.191440]  </TASK>
> ...
> 
> To migrate a folio, we may wait the writeback of a folio to complete
> when we already have held the lock of some folios.  But the writeback
> code may wait to lock some folio we held lock.  This causes the
> deadlock.  To fix the issue, we will avoid to wait the writeback to
> complete if we have locked some folios.  After moving the locked
> folios and unlocked, we will retry.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Xin Hao <xhao@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/migrate.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 28b435cdeac8..bc9a8050f1b0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1205,6 +1205,18 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>  		}
>  		if (!force)
>  			goto out;
> +		/*
> +		 * We have locked some folios and are going to wait the
> +		 * writeback of this folio to complete.  But it's possible for
> +		 * the writeback to wait to lock the folios we have locked.  To
> +		 * avoid a potential deadlock, let's bail out and not do that.
> +		 * The locked folios will be moved and unlocked, then we
> +		 * can wait the writeback of this folio.
> +		 */
> +		if (avoid_force_lock) {
> +			rc = -EDEADLOCK;
> +			goto out;
> +		}
>  		folio_wait_writeback(src);
>  	}
>  
> -- 
> 2.39.1
> 
> 

  reply	other threads:[~2023-02-21  2:50 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-13 12:34 [PATCH -v5 0/9] migrate_pages(): batch TLB flushing Huang Ying
2023-02-13 12:34 ` [PATCH -v5 1/9] migrate_pages: organize stats with struct migrate_pages_stats Huang Ying
2023-02-13 12:34 ` [PATCH -v5 2/9] migrate_pages: separate hugetlb folios migration Huang Ying
2023-02-13 12:34 ` [PATCH -v5 3/9] migrate_pages: restrict number of pages to migrate in batch Huang Ying
2023-02-13 12:34 ` [PATCH -v5 4/9] migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
2023-02-13 12:34 ` [PATCH -v5 5/9] migrate_pages: batch _unmap and _move Huang Ying
2023-02-13 12:34 ` [PATCH -v5 6/9] migrate_pages: move migrate_folio_unmap() Huang Ying
2023-02-13 12:34 ` [PATCH -v5 7/9] migrate_pages: share more code between _unmap and _move Huang Ying
2023-02-13 12:34 ` [PATCH -v5 8/9] migrate_pages: batch flushing TLB Huang Ying
2023-02-13 12:34 ` [PATCH -v5 9/9] migrate_pages: move THP/hugetlb migration support check to simplify code Huang Ying
2023-02-17 21:47 ` [PATCH -v5 0/9] migrate_pages(): batch TLB flushing Hugh Dickins
2023-02-20  9:28   ` Huang, Ying
2023-02-21  2:48     ` Hugh Dickins [this message]
2023-02-21  3:34       ` Huang, Ying
2023-02-21 22:15         ` Hugh Dickins
2023-02-21  4:30       ` Matthew Wilcox
2023-02-21  4:38         ` Huang, Ying
2023-02-21 14:04       ` Huang, Ying
2023-02-21 22:25         ` Hugh Dickins
2023-02-22  1:02           ` Huang, Ying
2023-02-27 11:06   ` Jan Kara
2023-02-28  1:13     ` Huang, Ying
2023-02-28  5:59       ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2ab4b33e-f570-a6ff-6315-7d5a4614a7bd@google.com \
    --to=hughd@google.com \
    --cc=42.hyeyoo@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bharata@amd.com \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=osalvador@suse.de \
    --cc=pengfei.xu@intel.com \
    --cc=shr@devkernel.io \
    --cc=shy828301@gmail.com \
    --cc=tj@kernel.org \
    --cc=willy@infradead.org \
    --cc=xhao@linux.alibaba.com \
    --cc=ying.huang@intel.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.