From: Qian Cai <cai@lca.pw>
To: Oscar Salvador <osalvador@suse.de>
Cc: akpm@linux-foundation.org, mhocko@suse.com, linux-mm@kvack.org,
	mike.kravetz@oracle.com, david@redhat.com,
	aneesh.kumar@linux.vnet.ibm.com, naoya.horiguchi@nec.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 00/15] Hwpoison soft-offline rework
Date: Fri, 17 Jul 2020 10:49:28 -0400	[thread overview]
Message-ID: <20200717144913.GA31206@lca.pw> (raw)
In-Reply-To: <20200716123810.25292-1-osalvador@suse.de>

On Thu, Jul 16, 2020 at 02:37:54PM +0200, Oscar Salvador wrote:
> Hi all,
> 
> this is a follow-up version on [1].
> That version had some flaws wrt. handling hugetlb pages, so this version
> fixes it.
> I checked that the case reported by Qian seems to work fine now.

I am still getting EIO from madvise() on some x86 NUMA systems with
next-20200717, which includes this patchset.

# git clone https://gitlab.com/cailca/linux-mm
# cd linux-mm; make

# ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,3.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 3
madvise: Input/output error
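
The failing call boils down to madvise(MADV_SOFT_OFFLINE) on hugetlb-backed
memory bound to one node.  Below is a minimal sketch of that call, not the
actual test from the repo above; the node mask, sizes and error handling are
simplified assumptions, and it assumes 2 MiB hugepages have been reserved:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <numaif.h>		/* mbind(); build with: gcc repro.c -lnuma */

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101	/* from asm-generic/mman-common.h */
#endif

#define HPAGE_SZ (2UL << 20)	/* assumed 2 MiB hugepage size */
#define LEN      (4 * HPAGE_SZ)

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* node 0; adjust to your topology */
	char *p;

	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Bind the range to one node so the hugepages come from there. */
	if (mbind(p, LEN, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	memset(p, 0, LEN);	/* fault the hugepages in */

	/* Soft-offline the first hugepage; needs CAP_SYS_ADMIN. */
	if (madvise(p, HPAGE_SZ, MADV_SOFT_OFFLINE)) {
		perror("madvise");	/* EIO is the failure reported above */
		return 1;
	}

	puts("soft-offline succeeded");
	return 0;
}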

== serial console output ==
[  100.149531][ T1644] Soft offlining pfn 0x3a5000 at process virtual address 0x7f1bf7e00000
[  100.193804][ T1644] Soft offlining pfn 0x3a5200 at process virtual address 0x7f1bf8000000
[  100.263446][ T1644] Soft offlining pfn 0x1fa3e00 at process virtual address 0x7f1bf7e00000
[  100.302745][ T1644] __get_any_page: 0x1fa3e00 free huge page
[  100.330226][ T1644] Soft offlining pfn 0x3bd600 at process virtual address 0x7f1bf8000000
[  100.373717][ T1644] Soft offlining pfn 0x202dc00 at process virtual address 0x7f1bf7e00000
[  100.414605][ T1644] Soft offlining pfn 0x1fa3c00 at process virtual address 0x7f1bf8000000
[  100.457675][ T1644] Soft offlining pfn 0x1fd3a00 at process virtual address 0x7f1bf7e00000
[  100.498519][ T1644] Soft offlining pfn 0x1fd3800 at process virtual address 0x7f1bf8000000
[  100.541750][ T1644] Soft offlining pfn 0x1fa3800 at process virtual address 0x7f1bf7e00000
[  100.582207][ T1644] Soft offlining pfn 0x1ffde00 at process virtual address 0x7f1bf8000000
[  100.625221][ T1644] Soft offlining pfn 0x1f7fe00 at process virtual address 0x7f1bf7e00000
[  100.665768][ T1644] Soft offlining pfn 0x1f7fc00 at process virtual address 0x7f1bf8000000
[  100.712181][ T1644] Soft offlining pfn 0x202d800 at process virtual address 0x7f1bf7e00000
[  100.753815][ T1644] Soft offlining pfn 0x205da00 at process virtual address 0x7f1bf8000000
[  100.796807][ T1644] Soft offlining pfn 0x1f7fa00 at process virtual address 0x7f1bf7e00000
[  100.837403][ T1644] Soft offlining pfn 0x1f7f800 at process virtual address 0x7f1bf8000000
[  100.880442][ T1644] Soft offlining pfn 0x1ffd800 at process virtual address 0x7f1bf7e00000
[  100.921584][ T1644] Soft offlining pfn 0x1fd3600 at process virtual address 0x7f1bf8000000
[  100.964444][ T1644] Soft offlining pfn 0x1fa3600 at process virtual address 0x7f1bf7e00000
[  101.005009][ T1644] Soft offlining pfn 0x1fa3400 at process virtual address 0x7f1bf8000000
[  101.047984][ T1644] Soft offlining pfn 0x1f7f400 at process virtual address 0x7f1bf7e00000
[  101.088665][ T1644] Soft offlining pfn 0x205d600 at process virtual address 0x7f1bf8000000
[  101.131368][ T1644] Soft offlining pfn 0x202d600 at process virtual address 0x7f1bf7e00000
[  101.171717][ T1644] Soft offlining pfn 0x202d400 at process virtual address 0x7f1bf8000000
[  101.216780][ T1644] Soft offlining pfn 0x1fa3000 at process virtual address 0x7f1bf7e00000
[  101.258755][ T1644] Soft offlining pfn 0x1fd3200 at process virtual address 0x7f1bf8000000
[  101.302585][ T1644] Soft offlining pfn 0x1ffd600 at process virtual address 0x7f1bf7e00000
[  101.344729][ T1644] Soft offlining pfn 0x1ffd400 at process virtual address 0x7f1bf8000000
[  101.388958][ T1644] Soft offlining pfn 0x205d000 at process virtual address 0x7f1bf7e00000
[  101.430995][ T1644] Soft offlining pfn 0x1f7f200 at process virtual address 0x7f1bf8000000
[  101.474513][ T1644] Soft offlining pfn 0x1fd2e00 at process virtual address 0x7f1bf7e00000
[  101.515333][ T1644] Soft offlining pfn 0x1fd2c00 at process virtual address 0x7f1bf8000000
[  101.558119][ T1644] Soft offlining pfn 0x1fa2c00 at process virtual address 0x7f1bf7e00000
[  101.600051][ T1644] Soft offlining pfn 0x202d200 at process virtual address 0x7f1bf8000000
[  101.643046][ T1644] Soft offlining pfn 0x1ffd200 at process virtual address 0x7f1bf7e00000
[  101.683842][ T1644] Soft offlining pfn 0x1ffd000 at process virtual address 0x7f1bf8000000
[  101.730551][ T1644] Soft offlining pfn 0x205cc00 at process virtual address 0x7f1bf7e00000
[  101.772575][ T1644] Soft offlining pfn 0x1f7ee00 at process virtual address 0x7f1bf8000000
[  101.818438][ T1644] Soft offlining pfn 0x1fd2a00 at process virtual address 0x7f1bf7e00000
[  101.861488][ T1644] Soft offlining pfn 0x1fd2800 at process virtual address 0x7f1bf8000000
[  101.904410][ T1644] Soft offlining pfn 0x1fa2800 at process virtual address 0x7f1bf7e00000
[  101.946639][ T1644] Soft offlining pfn 0x202ce00 at process virtual address 0x7f1bf8000000
[  101.989523][ T1644] Soft offlining pfn 0x1ffce00 at process virtual address 0x7f1bf7e00000
[  102.030092][ T1644] Soft offlining pfn 0x1ffcc00 at process virtual address 0x7f1bf8000000
[  102.076592][ T1644] Soft offlining pfn 0x437600 at process virtual address 0x7f1bf7e00000
[  102.116941][ T1644] Soft offlining pfn 0x433800 at process virtual address 0x7f1bf8000000
[  102.161314][ T1644] Soft offlining pfn 0x1f7e800 at process virtual address 0x7f1bf7e00000
[  102.200495][ T1644] Soft offlining pfn 0x1fa2a00 at process virtual address 0x7f1bf8000000
[  102.247260][ T1644] Soft offlining pfn 0x1fd2600 at process virtual address 0x7f1bf7e00000
[  102.290189][ T1644] Soft offlining pfn 0x1fd2400 at process virtual address 0x7f1bf8000000
[  102.328558][ T1644] __get_any_page: 0x1fd2400: unknown zero refcount page type 3bfffc000000000

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 15646 MB
node 0 free: 14779 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 31966 MB
node 1 free: 29825 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 32253 MB
node 2 free: 31029 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 32252 MB
node 3 free: 31360 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10

# cat /proc/meminfo 
MemTotal:       114809324 kB
MemFree:        109554164 kB
MemAvailable:   109256800 kB
Buffers:            5376 kB
Cached:           166932 kB
SwapCached:            0 kB
Active:           119804 kB
Inactive:         107788 kB
Active(anon):      56356 kB
Inactive(anon):     9308 kB
Active(file):      63448 kB
Inactive(file):    98480 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:               108 kB
Writeback:             0 kB
AnonPages:         55296 kB
Mapped:            82252 kB
Shmem:             10380 kB
KReclaimable:     114784 kB
Slab:            4401424 kB
SReclaimable:     114784 kB
SUnreclaim:      4286640 kB
KernelStack:       18560 kB
PageTables:         5648 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    61547760 kB
Committed_AS:     286384 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      168220 kB
VmallocChunk:          0 kB
Percpu:            39424 kB
HardwareCorrupted:   196 kB
AnonHugePages:     10240 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:      50
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:       25
Hugepagesize:       2048 kB
Hugetlb:          102400 kB
DirectMap4k:      578044 kB
DirectMap2M:    22358016 kB
DirectMap1G:    113246208 kB

> 
> Cover letter:
> 
> This patchset was initially based on Naoya's hwpoison rework [1], so
> thanks to him for the initial work.
> I would also like to thank Naoya for testing the patchset off-line
> and reporting the issues he found; that was quite helpful.
> 
> This patchset aims to fix some issues lying in the soft-offline handling,
> but it also takes the chance to do some further cleanups and
> refactoring.
> 
> 
>  - Motivation:
> 
>    A customer and I were facing an issue where processes were killed
>    after some of their pages had been soft-offlined.
>    This should not happen, as soft-offline is meant to be non-disruptive.
>    I was able to reproduce the issue by stressing the memory while
>    soft-offlining pages at the same time.
> 
>    After debugging the issue, I saw that the problem was that pages were
>    handed back to user-space after having been offlined properly.
>    So, when those pages were faulted in, the fault handler returned
>    VM_FAULT_HWPOISON all the way back to the arch handler, which simply
>    killed the process.
> 
>    After further analysis, it became clear that the problem was that when
>    kcompactd kicked in to migrate pages over, the compaction_alloc callback
>    was handing poisoned pages to the migrate routine.
> 
>    All this could happen because isolate_freepages_block and
>    fast_isolate_freepages only check whether the page is PageBuddy,
>    and since 1) poisoned pages can be part of a higher-order page
>    and 2) poisoned pages are also PageBuddy, they can sneak in easily.
> 
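As a toy illustration of the check described above (this is NOT kernel code
and NOT what this series implements; the series keeps poisoned pages out of
the allocator entirely), the following user-space model only shows why
looking at the buddy flag alone lets a poisoned subpage of a higher-order
free block be handed out:

#include <stdbool.h>
#include <stdio.h>

#define ORDER    2			/* free block of 1 << ORDER pages */
#define NR_PAGES (1 << ORDER)

struct toy_page {
	bool buddy;			/* stands in for PageBuddy() on the head */
	bool hwpoison;			/* stands in for PageHWPoison()          */
};

/* The insufficient check: look only at the head page's buddy flag. */
static bool isolate_block_buddy_only(const struct toy_page *block)
{
	return block[0].buddy;
}

int main(void)
{
	struct toy_page block[NR_PAGES] = { 0 };
	int i;

	block[0].buddy = true;		/* the whole block sits on a free list... */
	block[2].hwpoison = true;	/* ...but one subpage was soft-offlined    */

	if (isolate_block_buddy_only(block))
		for (i = 0; i < NR_PAGES; i++)
			printf("handing out page %d%s\n", i,
			       block[i].hwpoison ? "  <-- poisoned" : "");
	return 0;
}
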
>    I also saw some other problems with swap pages, but I suspected them
>    to be the same sort of problem, so I did not follow that trail.
> 
>    The above refers to soft-offline.
>    But I also saw problems with hard-offline, especially hugetlb corruption,
>    and some other weird stuff (I could paste the logs).
> 
>    The full explanation referring to the soft-offline case can be found at [2].
> 
>  - Approach:
> 
>    The approach taken is to contain those pages and never let them hit
>    either the pcplists or the buddy freelists.
>    Only when they are completely out of reach do we flag them as poisoned.
> 
>    A full explanation of this can be found in patch#11 and patch#12.
> 
>  - Outcome:
> 
>    With this patchset, I no longer see the issues with soft-offline.
> 
> [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
> 
> Naoya Horiguchi (6):
>   mm,hwpoison: cleanup unused PageHuge() check
>   mm, hwpoison: remove recalculating hpage
>   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>   mm,hwpoison-inject: don't pin for hwpoison_filter
>   mm,hwpoison: remove MF_COUNT_INCREASED
>   mm,hwpoison: remove flag argument from soft offline functions
> 
> Oscar Salvador (9):
>   mm,madvise: Refactor madvise_inject_error
>   mm,hwpoison: Un-export get_hwpoison_page and make it static
>   mm,hwpoison: Kill put_hwpoison_page
>   mm,hwpoison: Unify THP handling for hard and soft offline
>   mm,hwpoison: Rework soft offline for free pages
>   mm,hwpoison: Rework soft offline for in-use pages
>   mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>   mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
>   mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> 
>  drivers/base/memory.c      |   2 +-
>  include/linux/mm.h         |  12 +-
>  include/linux/page-flags.h |   6 +-
>  include/ras/ras_event.h    |   3 +
>  mm/hugetlb.c               |  60 +++++++-
>  mm/hwpoison-inject.c       |  18 +--
>  mm/madvise.c               |  37 ++---
>  mm/memory-failure.c        | 307 +++++++++++++++----------------------
>  mm/migrate.c               |  11 +-
>  mm/page_alloc.c            |  70 +++++++--
>  10 files changed, 270 insertions(+), 256 deletions(-)
> 
> -- 
> 2.26.2
> 
> 
