From: Qian Cai <qcai@redhat.com>
To: Oscar Salvador <osalvador@suse.de>, akpm@linux-foundation.org
Cc: n-horiguchi@ah.jp.nec.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/7] HWPoison: Refactor get page interface
Date: Wed, 02 Dec 2020 08:34:57 -0500
Message-ID: <1ba3d19ab0629e549519fb94b73cabb0b392fb2a.camel@redhat.com>
In-Reply-To: <20201119105716.5962-1-osalvador@suse.de>

On Thu, 2020-11-19 at 11:57 +0100, Oscar Salvador wrote:
> Hi,
> 
> following up on previous fix-ups and refactors, this patchset simplifies
> the get page interface and removes the MF_COUNT_INCREASED trick we have
> for soft offline.

Well, the madvise() EIO is back. I don't understand why we can't test it on a
NUMA system before posting this over and over again.

# git clone https://e.coding.net/cailca/linux/mm
# cd mm; make
# ./ranbug 1 
- start: migrate_huge_offline
- use NUMA nodes 0,3.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 3
madvise: Input/output error

[ 1270.054919][ T7497] Soft offlining pfn 0x1958e00 at process virtual address 0x7f7d9ca00000
[ 1270.067318][ T7497] Soft offlining pfn 0x18d0600 at process virtual address 0x7f7d9c800000
[ 1270.078856][ T7497] Soft offlining pfn 0x1ac800 at process virtual address 0x7f7d9ca00000
[ 1270.091268][ T7497] Soft offlining pfn 0x1e10a00 at process virtual address 0x7f7d9c800000
[ 1270.101946][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1270.111678][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.126133][ T7497] Soft offlining pfn 0x18b5400 at process virtual address 0x7f7d9c800000
[ 1270.136581][ T7497] Soft offlining pfn 0x211c00 at process virtual address 0x7f7d9ca00000
[ 1270.146214][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.160624][ T7497] Soft offlining pfn 0x19bee00 at process virtual address 0x7f7d9c800000
[ 1270.170896][ T7497] Soft offlining pfn 0x1e21a00 at process virtual address 0x7f7d9ca00000
[ 1270.185011][ T7497] Soft offlining pfn 0x1fd1200 at process virtual address 0x7f7d9c800000
[ 1270.195341][ T7497] Soft offlining pfn 0x1882400 at process virtual address 0x7f7d9ca00000
[ 1270.480593][ T7497] Soft offlining pfn 0x18bc000 at process virtual address 0x7f7d9c800000
[ 1270.491961][ T7497] soft offline: 0x18bc000: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.506018][ T7497] Soft offlining pfn 0x1e76a00 at process virtual address 0x7f7d9c800000
[ 1270.590266][ T7497] Soft offlining pfn 0x1b3c00 at process virtual address 0x7f7d9ca00000
[ 1270.600207][ T7497] soft offline: 0x1b3c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.614316][ T7497] Soft offlining pfn 0x1882600 at process virtual address 0x7f7d9c800000
[ 1270.662427][ T7497] Soft offlining pfn 0x1b3c00 at process virtual address 0x7f7d9ca00000
[ 1270.744249][ T7497] Soft offlining pfn 0x18bc000 at process virtual address 0x7f7d9c800000
[ 1270.754314][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1270.765204][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.816653][ T7497] Soft offlining pfn 0x18d0400 at process virtual address 0x7f7d9c800000
[ 1270.827049][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1270.837997][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.852156][ T7497] Soft offlining pfn 0x186ca00 at process virtual address 0x7f7d9c800000
[ 1270.862350][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1270.872922][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.887133][ T7497] Soft offlining pfn 0x18ac200 at process virtual address 0x7f7d9c800000
[ 1270.897450][ T7497] Soft offlining pfn 0x211c00 at process virtual address 0x7f7d9ca00000
[ 1270.907416][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.921365][ T7497] Soft offlining pfn 0x1e1cc00 at process virtual address 0x7f7d9c800000
[ 1270.931700][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1270.941580][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.955649][ T7497] Soft offlining pfn 0x1e6ae00 at process virtual address 0x7f7d9c800000
[ 1270.966063][ T7497] Soft offlining pfn 0x211c00 at process virtual address 0x7f7d9ca00000
[ 1270.975965][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.990059][ T7497] Soft offlining pfn 0x1e72e00 at process virtual address 0x7f7d9c800000
[ 1271.000323][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1271.011006][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.025152][ T7497] Soft offlining pfn 0x1e22200 at process virtual address 0x7f7d9c800000
[ 1271.035395][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1271.045916][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.060159][ T7497] Soft offlining pfn 0x1e6fe00 at process virtual address 0x7f7d9c800000
[ 1271.070695][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1271.080596][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.094725][ T7497] Soft offlining pfn 0x1968200 at process virtual address 0x7f7d9c800000
[ 1271.105006][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1271.115567][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.129775][ T7497] Soft offlining pfn 0x1e1ae00 at process virtual address 0x7f7d9c800000
[ 1271.140285][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1271.150185][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc
[ 1271.468115][ T7497] Soft offlining pfn 0x1de4600 at process virtual address 0x7f7d9c800000
[ 1271.479348][ T7497] Soft offlining pfn 0x145e00 at process virtual address 0x7f7d9ca00000
[ 1271.489928][ T7497] soft offline: 0x145e00: hugepage isolation
[ 1271.538433][ T7497] Soft offlining pfn 0x1fae00 at process virtual address 0x7f7d9c800000
[ 1271.548880][ T7497] Soft offlining pfn 0x1995e00 at process virtual address 0x7f7d9ca00000
[ 1271.558877][ T7497] soft offline: 0x1995e00: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.573055][ T7497] Soft offlining pfn 0x221e00 at process virtual address 0x7f7d9c800000
[ 1271.583453][ T7497] Soft offlining pfn 0x1901800 at process virtual address 0x7f7d9ca00000
[ 1271.593440][ T7497] soft offline: 0x1901800: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.610005][ T7497] Soft offlining pfn 0x232400 at process virtual address 0x7f7d9c800000
[ 1271.620439][ T7497] Soft offlinin
[ 1272.005890][ T7497] Soft offlining pfn 0x230e00 at process virtual address 0x7f7d9c800000
[ 1272.017226][ T7497] Soft offlining pfn 0x185fe00 at process virtual address 0x7f7d9ca00000
[ 1272.029194][ T7497] Soft offlining pfn 0x1f1400 at process virtual address 0x7f7d9c800000
[ 1272.040088][ T7497] Soft offlining pfn 0x1f9e00 at process virtual address 0x7f7d9ca00000
[ 1272.052415][ T7497] Soft offlining pfn 0x1885a00 at process virtual address 0x7f7d9c800000
[ 1272.062510][ T7497] Soft offlining pfn 0x18b6000 at process virtual address 0x7f7d9ca00000
[ 1272.071931][ T7497] soft_offline_page: 0x18b6000: unknown page type: 3bfffc000000000 ((%pG?))

> 
> Please, note that this patchset is on top of [1] and [2].
> 
> This patchset does three things:
> 
>  1) Drops MF_COUNT_INCREASED trick
>  2) Refactors get page interface
>  3) Places a common entry for grabbing a page from both hard offline
>     and soft offline guarded by zone_pcp_{disable/enable}, so we do not
>     have to drain pcplists ourselves and retry.
> 
> Note that the MF_COUNT_INCREASED trick was left because if get_hwpoison_page
> races with put_page (e.g.:)
> 
> CPU0                                              CPU1
> put_page (refcount decremented to 0)
>  __put_single_page
>   free_unref_page
>    free_unref_page_prepare
>     free_pcp_prepare
>      free_pages_prepare                           soft_offline_page
>      :page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP     get_any_page
>                                                     get_hwpoison_page
>    free_unref_page_commit
>     free_one_page
>      __free_one_page (place it in buddy)
> 
> get_hwpoison_page sees that the page has a refcount of 0, but since it has
> not been placed in buddy yet, we cannot really handle it.
> We now have a bounded number of passes in get_any_page, so in case we race
> with either an allocation or a put_page, we retry.
> 
> After an off-list discussion with Naoya, he agreed to proceed.
> 
> [1] https://patchwork.kernel.org/project/linux-mm/list/?series=364009
> [2] https://patchwork.kernel.org/project/linux-mm/list/?series=381903
> 
> Naoya Horiguchi (3):
>   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>   mm,hwpoison: remove MF_COUNT_INCREASED
>   mm,hwpoison: remove flag argument from soft offline functions
> 
> Oscar Salvador (4):
>   mm,hwpoison: Refactor get_any_page
>   mm,hwpoison: Drop pfn parameter
>   mm,hwpoison: Disable pcplists before grabbing a refcount
>   mm,hwpoison: Remove drain_all_pages from shake_page
> 
>  drivers/base/memory.c |   2 +-
>  include/linux/mm.h    |   9 +--
>  mm/madvise.c          |  19 +++--
>  mm/memory-failure.c   | 168 +++++++++++++++++-------------------------
>  4 files changed, 85 insertions(+), 113 deletions(-)
> 
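
As a rough illustration only, the bounded-retry idea the quoted cover letter
describes for get_any_page (retry a limited number of times when the page has
a refcount of 0 but has not reached buddy yet) could be modelled in userspace
like this. Nothing here is kernel code: struct fake_page, enum page_state,
MAX_PASSES and get_page_bounded() are hypothetical names invented for the
sketch, and the bound of 3 is an assumption, since the cover letter does not
state the actual limit.

```c
/* Userspace model only; none of these names exist in the kernel. */
enum page_state { PAGE_IN_USE, PAGE_BEING_FREED, PAGE_IN_BUDDY };

struct fake_page {
	enum page_state state;
	int refcount;
	int settle_after;	/* passes until an in-flight free reaches buddy */
};

#define MAX_PASSES 3		/* assumed bound, not taken from the patchset */

/*
 * Returns 1 if a reference was taken, 0 if the page turned out to be free
 * in buddy, and -1 if the page was still mid-free after MAX_PASSES passes.
 */
static int get_page_bounded(struct fake_page *p)
{
	for (int pass = 0; pass < MAX_PASSES; pass++) {
		if (p->state == PAGE_IN_USE && p->refcount > 0) {
			p->refcount++;		/* page is in use: pin it */
			return 1;
		}
		if (p->state == PAGE_IN_BUDDY)
			return 0;		/* free page: nothing to pin */

		/*
		 * Refcount is 0 but the page is not in buddy yet, i.e. we
		 * raced with a free. Model the other CPU making progress,
		 * then try again on the next pass.
		 */
		if (p->settle_after > 0 && --p->settle_after == 0)
			p->state = PAGE_IN_BUDDY;
	}
	return -1;				/* gave up; caller reports an error */
}
```

In this model a page whose free completes within the pass budget is seen as a
buddy page on a later pass, while one that stays in flight longer makes the
function give up, which is the kind of outcome a caller would have to turn
into an error for userspace.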

