From: "Huang, Ying" <ying.huang@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Zi Yan <ziy@nvidia.com>,
	Yang Shi <shy828301@gmail.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Bharata B Rao <bharata@amd.com>,
	Alistair Popple <apopple@nvidia.com>,
	Xin Hao <xhao@linux.alibaba.com>,
	Minchan Kim <minchan@kernel.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing
Date: Tue, 28 Feb 2023 09:13:26 +0800
Message-ID: <87pm9ubnih.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20230227110614.dngdub2j3exr6dfp@quack3> (Jan Kara's message of "Mon, 27 Feb 2023 12:06:14 +0100")

Hi, Honza,

Jan Kara <jack@suse.cz> writes:

> On Fri 17-02-23 13:47:48, Hugh Dickins wrote:
>> On Mon, 13 Feb 2023, Huang Ying wrote:
>> 
>> > From: "Huang, Ying" <ying.huang@intel.com>
>> > 
>> > Currently, migrate_pages() migrates folios one by one, as in the
>> > following pseudocode,
>> > 
>> >   for each folio
>> >     unmap
>> >     flush TLB
>> >     copy
>> >     restore map
>> > 
>> > If multiple folios are passed to migrate_pages(), there are
>> > opportunities to batch the TLB flushing and the copying.  That is, we
>> > can restructure the code as follows,
>> > 
>> >   for each folio
>> >     unmap
>> >   for each folio
>> >     flush TLB
>> >   for each folio
>> >     copy
>> >   for each folio
>> >     restore map
>> > 
>> > The total number of TLB flush IPIs can be reduced considerably, and we
>> > may use a hardware accelerator such as DSA to accelerate the folio
>> > copying.
>> > 
>> > So in this patchset, we refactor the migrate_pages() implementation
>> > and implement batched TLB flushing.  Based on this, hardware-accelerated
>> > folio copying can be implemented.
>> > 
>> > If too many folios are passed to migrate_pages(), a naive batched
>> > implementation may unmap too many folios at the same time.  That
>> > increases the chance that a task must wait for the migrated folios to
>> > be mapped again, so latency may suffer.  To deal with this issue, the
>> > maximum number of folios unmapped in one batch is restricted to the
>> > equivalent of HPAGE_PMD_NR pages.  That is, the latency impact is at
>> > the same level as that of THP migration.
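>> >
>> > For illustration, a minimal C sketch of the batched structure with this
>> > cap might look as follows.  The migrate_folio_*_one() helpers and the
>> > NR_MAX_BATCHED_MIGRATION name are made up for the sketch;
>> > try_to_unmap_flush() is the existing deferred TLB flush primitive,
>> >
>> >   /* Sketch only: cap each batch at one THP worth of pages. */
>> >   #define NR_MAX_BATCHED_MIGRATION  HPAGE_PMD_NR
>> >
>> >   static void migrate_folios_one_batch(struct list_head *batch)
>> >   {
>> >           struct folio *folio;
>> >
>> >           /* Pass 1: unmap all folios, deferring the TLB flushes. */
>> >           list_for_each_entry(folio, batch, lru)
>> >                   migrate_folio_unmap_one(folio);
>> >
>> >           /* Pass 2: one batched TLB flush (IPI) for the whole batch. */
>> >           try_to_unmap_flush();
>> >
>> >           /* Pass 3: copy contents; a natural offload point (e.g. DSA). */
>> >           list_for_each_entry(folio, batch, lru)
>> >                   migrate_folio_copy_one(folio);
>> >
>> >           /* Pass 4: restore the mappings to point at the new folios. */
>> >           list_for_each_entry(folio, batch, lru)
>> >                   migrate_folio_restore_one(folio);
>> >   }
>> >
>> > The caller would split the folio list into chunks of at most
>> > NR_MAX_BATCHED_MIGRATION pages before calling this, which bounds how
>> > long any folio stays unmapped.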
>> > 
>> > We use the following test to measure the performance impact of the
>> > patchset,
>> > 
>> > On a 2-socket Intel server,
>> > 
>> >  - Run pmbench memory accessing benchmark
>> > 
>> >  - Run `migratepages` to migrate pages of pmbench between node 0 and
>> >    node 1 back and forth.
>> > 
>> > With the patchset, the number of TLB flush IPIs is reduced by 99.1%
>> > during the test, and the number of pages migrated successfully per
>> > second increases by 291.7%.
>> > 
>> > Xin Hao helped to test the patchset on an ARM64 server with 128 cores
>> > and 2 NUMA nodes.  Test results show that the page migration performance
>> > increases by up to 78%.
>> > 
>> > This patchset is based on mm-unstable 2023-02-10.
>> 
>> And back in linux-next this week: I tried next-20230217 overnight.
>> 
>> There is a deadlock in this patchset (and in previous versions: sorry
>> it's taken me so long to report), but I think one that's easily solved.
>> 
>> I've not bisected to precisely which patch (load can take several hours
>> to hit the deadlock), but it doesn't really matter, and I expect that
>> you can guess.
>> 
>> My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
>> and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
>> So, plenty of ext4 page cache and buffer_heads.
>> 
>> Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
>> either in kcompactd0 or in khugepaged trying to compact, or in both:
>> it ends up calling __lock_buffer(), and that schedules away, waiting
>> forever to get BH_lock.  I have not identified who is holding BH_lock,
>> but I imagine a jbd2 journalling thread, and presume that it wants one
>> of the folio locks which migrate_pages_batch() is already holding; or
>> maybe it's all more convoluted than that.  Other tasks then back up
>> waiting on those folio locks held in the batch.
>> 
>> Never a problem with buffer_migrate_folio(), always with the "more
>> careful" buffer_migrate_folio_norefs().  And the patch below fixes
>> it for me: I've had enough hours with it now, on enough occasions,
>> to be confident of that.
>> 
>> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
>> very well, and I hope can assure us that there is an understandable
>> deadlock here, from holding several random folio locks, then trying
>> to lock buffers.  Cc'ing fsdevel, because there's a risk that mm
>> folk think something is safe, when it's not sufficient to cope with
>> the diversity of filesystems.  I hope nothing more than the below is
>> needed (and I've had no other problems with the patchset: good job),
>> but cannot be sure.
>
> I suspect it can indeed be caused by the presence of the loop device as
> Huang Ying has suggested. What filesystems using buffer_heads do is a
> pattern like:
>
> bh = page_buffers(loop device page cache page);
> lock_buffer(bh);
> submit_bh(bh);
> - now on loop dev this ends up doing:
>   lo_write_bvec()
>     vfs_iter_write()
>       ...
>       folio_lock(backing file folio);
>
> So if the migration code holds the "backing file folio" lock and at the
> same time waits for the 'bh' lock (while trying to migrate the loop
> device page cache page), it is a deadlock.
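>
> To make the lock inversion explicit, an illustrative interleaving of the
> two tasks (hypothetical, not a captured trace):
>
>   migration task                       loop device writer
>   --------------                       ------------------
>   folio_lock(backing file folio)       lock_buffer(bh)
>   ...                                  lo_write_bvec()
>   lock_buffer(bh)      <- blocks         vfs_iter_write()
>                                            folio_lock(backing file folio)
>                                                         <- blocks
>
> Each task holds the lock the other one is waiting for.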
>
> The proposed solution of never waiting for locks in batched mode looks
> like a sensible one to me...
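>
> For concreteness, a sketch of that idea (not the actual patch; the
> function name is made up, while trylock_buffer(), lock_buffer() and
> MIGRATE_ASYNC are existing kernel primitives):
>
>   /* In batched (async) mode, never block waiting for BH_lock. */
>   static bool lock_buffers_nonblocking(struct buffer_head *head,
>                                        enum migrate_mode mode)
>   {
>           struct buffer_head *bh = head, *failed;
>
>           do {
>                   if (!trylock_buffer(bh)) {
>                           if (mode != MIGRATE_ASYNC) {
>                                   /* Sync modes may still block. */
>                                   lock_buffer(bh);
>                           } else {
>                                   /* Back off: drop the buffer locks
>                                    * taken so far and let the caller
>                                    * retry outside the batch. */
>                                   failed = bh;
>                                   for (bh = head; bh != failed;
>                                        bh = bh->b_this_page)
>                                           unlock_buffer(bh);
>                                   return false;
>                           }
>                   }
>                   bh = bh->b_this_page;
>           } while (bh != head);
>
>           return true;
>   }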

Thank you very much for the detailed explanation!

Best Regards,
Huang, Ying

Thread overview: 23+ messages
2023-02-13 12:34 [PATCH -v5 0/9] migrate_pages(): batch TLB flushing Huang Ying
2023-02-13 12:34 ` [PATCH -v5 1/9] migrate_pages: organize stats with struct migrate_pages_stats Huang Ying
2023-02-13 12:34 ` [PATCH -v5 2/9] migrate_pages: separate hugetlb folios migration Huang Ying
2023-02-13 12:34 ` [PATCH -v5 3/9] migrate_pages: restrict number of pages to migrate in batch Huang Ying
2023-02-13 12:34 ` [PATCH -v5 4/9] migrate_pages: split unmap_and_move() to _unmap() and _move() Huang Ying
2023-02-13 12:34 ` [PATCH -v5 5/9] migrate_pages: batch _unmap and _move Huang Ying
2023-02-13 12:34 ` [PATCH -v5 6/9] migrate_pages: move migrate_folio_unmap() Huang Ying
2023-02-13 12:34 ` [PATCH -v5 7/9] migrate_pages: share more code between _unmap and _move Huang Ying
2023-02-13 12:34 ` [PATCH -v5 8/9] migrate_pages: batch flushing TLB Huang Ying
2023-02-13 12:34 ` [PATCH -v5 9/9] migrate_pages: move THP/hugetlb migration support check to simplify code Huang Ying
2023-02-17 21:47 ` [PATCH -v5 0/9] migrate_pages(): batch TLB flushing Hugh Dickins
2023-02-20  9:28   ` Huang, Ying
2023-02-21  2:48     ` Hugh Dickins
2023-02-21  3:34       ` Huang, Ying
2023-02-21 22:15         ` Hugh Dickins
2023-02-21  4:30       ` Matthew Wilcox
2023-02-21  4:38         ` Huang, Ying
2023-02-21 14:04       ` Huang, Ying
2023-02-21 22:25         ` Hugh Dickins
2023-02-22  1:02           ` Huang, Ying
2023-02-27 11:06   ` Jan Kara
2023-02-28  1:13     ` Huang, Ying [this message]
2023-02-28  5:59       ` Hugh Dickins
