linux-kernel.vger.kernel.org archive mirror
From: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
To: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: "osalvador@suse.de" <osalvador@suse.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"mhocko@kernel.org" <mhocko@kernel.org>,
	"mike.kravetz@oracle.com" <mike.kravetz@oracle.com>,
	"n-horiguchi@ah.jp.nec.com" <n-horiguchi@ah.jp.nec.com>,
	"max7255@yandex-team.ru" <max7255@yandex-team.ru>
Subject: Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline
Date: Mon, 15 Jun 2020 06:19:53 +0000
Message-ID: <20200615061951.GA26108@hori.linux.bs1.fc.nec.co.jp>
In-Reply-To: <20200611164319.16860-1-zeil@yandex-team.ru>

Hi Dmitry,

On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
> 
> We have run into similar problems with hwpoisoned pages
> on one of our production clusters after a kernel update to stable 4.19.
> An application that does a lot of memory allocations sometimes received SIGBUS
> with a message in dmesg about a hardware memory corruption fault.
> In the kernel and mce logs we saw messages about soft-offlining pages with
> correctable errors. Those events always happened before the application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
> 
> This issue is difficult to reproduce, but we suspect that the reason for such
> behavior is that compaction does not check whether free pages are poisoned
> while processing them, so valid userspace data gets migrated to bad pages.
> We wrote a simple test:
>   - soft offline the first 4 pages in every 64 contiguous pages in ZONE_NORMAL
>     by writing the pfn to /sys/devices/system/memory/soft_offline_page
>   - force compaction with echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series, after these steps bash becomes unusable
> and every attempt to run any command leads to SIGBUS with a message about a
> hardware memory corruption fault. After applying this series to our kernel
> tree, we can no longer reproduce such SIGBUSes with our test. On upstream
> kernel 5.7 this behavior is still reproducible.
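
The two test steps quoted above could be scripted roughly as follows. This is
only a sketch under stated assumptions, not the script Dmitry's team used: the
function names are hypothetical, the PFN range of ZONE_NORMAL has to be
supplied by the caller (for example from /proc/zoneinfo), and the pfn is
converted to a physical address because the kernel ABI documentation describes
soft_offline_page as taking a physical address. It defaults to a dry run that
only prints the commands; a real run needs root and CONFIG_MEMORY_FAILURE, and
should only be tried on a disposable test machine.

```python
import os

# Paths taken from the test description quoted above.
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"
COMPACT = "/proc/sys/vm/compact_memory"


def pfns_to_offline(start_pfn, end_pfn, stride=64, count=4):
    """Return the first `count` PFNs of every `stride`-page block
    in the half-open range [start_pfn, end_pfn)."""
    pfns = []
    for block in range(start_pfn, end_pfn, stride):
        pfns.extend(range(block, min(block + count, end_pfn)))
    return pfns


def phys_addr(pfn, page_size):
    """Convert a page frame number to the physical address of that page."""
    return pfn * page_size


def run(start_pfn, end_pfn, dry_run=True):
    """Soft-offline the selected pages, then force compaction.

    With dry_run=True (the default) nothing is written; the equivalent
    shell commands are printed instead.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    for pfn in pfns_to_offline(start_pfn, end_pfn):
        addr = hex(phys_addr(pfn, page_size))
        if dry_run:
            print(f"echo {addr} > {SOFT_OFFLINE}")
            continue
        try:
            with open(SOFT_OFFLINE, "w") as f:
                f.write(addr)
        except OSError as err:
            # Some pages cannot be offlined (reserved, busy, ...); keep going.
            print(f"pfn {pfn:#x}: {err}")
    if dry_run:
        print(f"echo 1 > {COMPACT}")
    else:
        with open(COMPACT, "w") as f:
            f.write("1")
```

Calling run(start_pfn, end_pfn) with the zone's PFN range prints the planned
writes without touching the system; passing dry_run=False performs them.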
> 
> So we would like to know why this patchset wasn't merged upstream.
> Are there any problems with this rework of {soft,hard}-offline handling?

No technical reason; I simply didn't have enough capacity to push
this through to being merged. Really sorry about that.

> BTW, this patchset should be updated with upstream changes in mm.

I'm working on this now and it still needs more testing, but I hope
to update and post it for 5.9.

Thanks,
Naoya Horiguchi


Thread overview: 57+ messages
2019-10-17 14:21 [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 01/16] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
2019-10-18 11:48   ` Michal Hocko
2019-10-21  7:00     ` Naoya Horiguchi
2019-10-21 12:16       ` Michal Hocko
2019-11-12 12:22       ` Aneesh Kumar K.V
2019-11-13  6:02         ` Naoya Horiguchi
2019-10-17 14:21 ` [RFC PATCH v2 02/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
2019-10-18 11:52   ` Michal Hocko
2019-10-21  7:02     ` Naoya Horiguchi
2019-10-21 12:20       ` Michal Hocko
2019-10-17 14:21 ` [RFC PATCH v2 03/16] mm,madvise: Refactor madvise_inject_error Oscar Salvador
2019-10-21  7:03   ` Naoya Horiguchi
2019-10-17 14:21 ` [RFC PATCH v2 04/16] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 05/16] mm,hwpoison: Un-export get_hwpoison_page and make it static Oscar Salvador
2019-10-21  7:03   ` Naoya Horiguchi
2019-10-17 14:21 ` [RFC PATCH v2 06/16] mm,hwpoison: Kill put_hwpoison_page Oscar Salvador
2019-10-21  7:04   ` Naoya Horiguchi
2019-10-17 14:21 ` [RFC PATCH v2 07/16] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 08/16] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 09/16] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
2019-10-21  7:04   ` Naoya Horiguchi
2019-10-21  9:51     ` [PATCH 17/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP Naoya Horiguchi
2019-10-22  8:00       ` Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
2019-10-18 12:06   ` Michal Hocko
2019-10-21 12:58     ` Oscar Salvador
2019-10-21 15:41       ` Michal Hocko
2019-10-22  7:46         ` Oscar Salvador
2019-10-22  8:26           ` Michal Hocko
2019-10-22  8:35             ` Oscar Salvador
2019-10-22  9:22               ` Michal Hocko
2019-10-22  9:58                 ` Oscar Salvador
2019-10-22 10:24                   ` Michal Hocko
2019-10-22 10:33                     ` Oscar Salvador
2019-10-23  2:15                       ` Naoya Horiguchi
2019-10-23  2:01                   ` Naoya Horiguchi
2019-10-21  7:45   ` Naoya Horiguchi
2019-10-22  8:00     ` Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 11/16] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
2019-10-18 12:39   ` Michal Hocko
2019-10-21 13:48     ` Oscar Salvador
2019-10-21 14:06       ` Michal Hocko
2019-10-22  7:56         ` Oscar Salvador
2019-10-22  8:30           ` Michal Hocko
2019-10-22  9:40             ` Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 12/16] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 13/16] mm,hwpoison: Take pages off the buddy when hard-offlining Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline Oscar Salvador
2019-10-21  9:20   ` Naoya Horiguchi
2019-10-17 14:21 ` [RFC PATCH v2 15/16] mm/hwpoison-inject: Rip off duplicated checks Oscar Salvador
2019-10-21  9:40   ` David Hildenbrand
2019-10-22  7:57     ` Oscar Salvador
2019-10-17 14:21 ` [RFC PATCH v2 16/16] mm, soft-offline: convert parameter to pfn Oscar Salvador
2019-10-18  8:15   ` David Hildenbrand
2020-06-11 16:43 ` [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline Dmitry Yakunin
2020-06-15  6:19   ` HORIGUCHI NAOYA(堀口 直也) [this message]
