Question about hwpoison handling of 1GB hugepage

* Question about hwpoison handling of 1GB hugepage
@ 2022-03-31 10:56 Liu Shixin
  2022-04-03 23:42 ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 1 reply; 6+ messages in thread
From: Liu Shixin @ 2022-03-31 10:56 UTC
  To: Naoya Horiguchi, Andrew Morton; +Cc: linux-mm, Linux Kernel Mailing List

Hi,

Recently I ran into a problem with hwpoison handling of 1GB hugepages.
I created a process that maps a 1GB hugepage. The process forks a child,
which writes to and reads from the hugepage, and then I inject hwpoison into
the hugepage. The child triggers the memory failure and is killed, as
expected. After that, the parent forks a new child that does the same thing.
The new child is killed as well, and this repeats in an infinite loop. I found
that this is caused by
commit 31286a8484a8 ("mm: hwpoison: disable memory error handling on 1GB hugepage").

That commit describes a bug in hwpoison handling of 1GB hugepages, so I tried
to reproduce it. With the commit reverted on an earlier kernel version, I could
reproduce the bug. With the commit reverted on the latest version, the bug no
longer reproduces.

I compared the code paths for 1GB and 2MB hugepages on the second
madvise(MADV_HWPOISON). The problem occurs because, in gup_pud_range(),
pud_none() and pud_huge() both return false, which then triggers the bug.
In gup_pmd_range(), by contrast, the pmd_none() check had already been changed
to pmd_present(), which makes the code return directly. I then found that
commit 15494520b776 ("mm: fix gup_pud_range") is the reason the bug does not
reproduce on the latest version. After backporting commit 15494520b776 to the
earlier version, the bug no longer reproduces there either.
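
To make the difference concrete, here is a small userspace model of the two
checks. It is only a sketch under my own assumptions: the toy_* names, the bit
positions and the encoding of the poisoned entry are made up for illustration
and are not the kernel's definitions. It only shows why an entry that is
neither empty nor present slips past a pud_none() check but is caught by a
pud_present() check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for page-table bits (assumed values, not kernel macros). */
#define TOY_PRESENT (1ULL << 0)
#define TOY_PSE     (1ULL << 7)

/* Hypothetical encoding of a poisoned entry: non-zero, PRESENT and PSE clear. */
#define TOY_POISONED_PUD (0x1234ULL << 9)

typedef uint64_t toy_pud_t;

enum toy_result { TOY_BAIL_OUT, TOY_HUGE_PATH, TOY_DESCEND_TO_PMD };

static bool toy_pud_none(toy_pud_t pud)    { return pud == 0; }
static bool toy_pud_present(toy_pud_t pud) { return (pud & TOY_PRESENT) != 0; }
static bool toy_pud_huge(toy_pud_t pud)    { return (pud & TOY_PSE) != 0; }

/* Model of the older check: only a completely empty entry bails out. */
static enum toy_result toy_walk_old(toy_pud_t pud)
{
    if (toy_pud_none(pud))
        return TOY_BAIL_OUT;
    if (toy_pud_huge(pud))
        return TOY_HUGE_PATH;
    /* A poisoned entry lands here and is treated as a pointer to a pmd table. */
    return TOY_DESCEND_TO_PMD;
}

/* Model of the check after 15494520b776: any non-present entry bails out. */
static enum toy_result toy_walk_new(toy_pud_t pud)
{
    if (!toy_pud_present(pud))
        return TOY_BAIL_OUT;
    if (toy_pud_huge(pud))
        return TOY_HUGE_PATH;
    return TOY_DESCEND_TO_PMD;
}
```

In this model, toy_walk_old(TOY_POISONED_PUD) falls through to the pmd level,
while toy_walk_new(TOY_POISONED_PUD) returns early. Both behave the same for an
empty entry and for a normal present huge entry.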

So I'd like to ask: is it time to revert commit 31286a8484a8?
Or would it be sufficient to modify pud_huge() to behave like pmd_huge()?
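
As a rough illustration of the second option, the sketch below contrasts a
PSE-only test with a pmd_huge()-style test in a userspace model. This is based
on my reading of the x86 helpers and is an assumption, not a proposed patch:
the toy_* names, bit values and the poisoned-entry encoding are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for page-table bits (assumed values, not kernel macros). */
#define TOY_PRESENT (1ULL << 0)
#define TOY_PSE     (1ULL << 7)

/* Hypothetical encoding of a poisoned entry: non-zero, PRESENT and PSE clear. */
#define TOY_POISONED_PUD (0x1234ULL << 9)

typedef uint64_t toy_pud_t;

/* Model of a PSE-only test: a non-present poisoned entry is not seen as huge. */
static bool toy_pud_huge_current(toy_pud_t pud)
{
    return (pud & TOY_PSE) != 0;
}

/*
 * Model of a pmd_huge()-style test: every non-empty entry counts as huge
 * unless it is a present entry without PSE, i.e. a pointer to a lower table.
 */
static bool toy_pud_huge_proposed(toy_pud_t pud)
{
    return pud != 0 &&
           (pud & (TOY_PRESENT | TOY_PSE)) != TOY_PRESENT;
}
```

With these definitions, the poisoned entry is reported as huge only by
toy_pud_huge_proposed(), while empty entries, table pointers and normal huge
mappings are classified the same way by both helpers.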

I also noticed there is a TODO comment in memory_failure_hugetlb():
    - conversion of a pud that maps an error hugetlb into hwpoison
      entry properly works, and
    - other mm code walking over page table is aware of pud-aligned
      hwpoison entries. 

I'm not sure whether the fixes above are sufficient. Is there anything else
that needs to be analyzed that I haven't considered?

Thanks,
