LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Michal Hocko <mhocko@suse.com>,
	Oscar Salvador <osalvador@suse.com>,
	David Hildenbrand <david@redhat.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.9 36/63] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
Date: Fri, 11 Jan 2019 15:14:39 +0100
Message-ID: <20190111131051.242600471@linuxfoundation.org> (raw)
In-Reply-To: <20190111131046.387528003@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko <mhocko@suse.com>

commit b15c87263a69272423771118c653e9a1d0672caa upstream.

We have received a bug report that an injected MCE about faulty memory
prevents memory offline to succeed on 4.4 base kernel.  The underlying
reason was that the HWPoison page has an elevated reference count and the
migration keeps failing.  There are two problems with that.  First of all
it is dubious to migrate the poisoned page because we know that accessing
that memory is possible to fail.  Secondly it doesn't make any sense to
migrate a potentially broken content and preserve the memory corruption
over to a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simply testcase

===

int main(void)
{
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
        if (ret)
                perror("mlock");

        sleep (20);

        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
                perror("madvise");

        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
list
kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80

And that latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count.  It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.

With the upstream kernel the failure is slightly different.  The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails and
do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.

Fix both cases by explicitly checking HWPoisoned pages before we even try
to get reference on the page, try to unmap it if it is still mapped.  As
explained by Naoya:

: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself about
that.  Naoya has noted the following:

: Theoretically no such gurantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also impli=
: es
: that the target page can still have PageLRU flag.
:
:         /*
:          * Torn down by someone else?
:          */
:         if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) {
:                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
:                 res =3D -EBUSY;
:                 goto out;
:         }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.

Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.com>
Debugged-by: Oscar Salvador <osalvador@suse.com>
Tested-by: Oscar Salvador <osalvador@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/memory_hotplug.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
 #include <linux/memblock.h>
 #include <linux/bootmem.h>
 #include <linux/compaction.h>
+#include <linux/rmap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1617,6 +1618,21 @@ do_migrate_range(unsigned long start_pfn
 			continue;
 		}
 
+		/*
+		 * HWPoison pages have elevated reference counts so the migration would
+		 * fail on them. It also doesn't make any sense to migrate them in the
+		 * first place. Still try to unmap such a page in case it is still mapped
+		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+		 * the unmap as the catch all safety net).
+		 */
+		if (PageHWPoison(page)) {
+			if (WARN_ON(PageLRU(page)))
+				isolate_lru_page(page);
+			if (page_mapped(page))
+				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
 		/*



  parent reply index

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-11 14:14 [PATCH 4.9 00/63] 4.9.150-stable review Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 01/63] pinctrl: meson: fix pull enable register calculation Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 02/63] powerpc: Fix COFF zImage booting on old powermacs Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 03/63] ARM: imx: update the cpu power up timing setting on i.mx6sx Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 04/63] ARM: dts: imx7d-nitrogen7: Fix the description of the Wifi clock Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 05/63] Input: restore EV_ABS ABS_RESERVED Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 06/63] checkstack.pl: fix for aarch64 Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 07/63] xfrm: Fix bucket count reported to userspace Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 08/63] netfilter: seqadj: re-load tcp header pointer after possible head reallocation Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 09/63] scsi: bnx2fc: Fix NULL dereference in error handling Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 10/63] Input: omap-keypad - fix idle configuration to not block SoC idle states Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 11/63] netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 12/63] bnx2x: Clear fip MAC when fcoe offload support is disabled Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 13/63] bnx2x: Remove configured vlans as part of unload sequence Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 14/63] bnx2x: Send update-svid ramrod with retry/poll flags enabled Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 15/63] scsi: target: iscsi: cxgbit: fix csk leak Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 16/63] scsi: target: iscsi: cxgbit: add missing spin_lock_init() Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 17/63] drivers: net: xgene: Remove unnecessary forward declarations Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 18/63] w90p910_ether: remove incorrect __init annotation Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 19/63] net: hns: Incorrect offset address used for some registers Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 20/63] net: hns: All ports can not work when insmod hns ko after rmmod Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 21/63] net: hns: Some registers use wrong address according to the datasheet Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 22/63] net: hns: Fixed bug that netdev was opened twice Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 23/63] net: hns: Clean rx fbd when ae stopped Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 24/63] net: hns: Free irq when exit from abnormal branch Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 25/63] net: hns: Avoid net reset caused by pause frames storm Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 26/63] net: hns: Fix ntuple-filters status error Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 27/63] net: hns: Add mac pcs config when enable|disable mac Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 28/63] SUNRPC: Fix a race with XPRT_CONNECTING Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 29/63] lan78xx: Resolve issue with changing MAC address Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 30/63] vxge: ensure data0 is initialized in when fetching firmware version information Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 31/63] net: netxen: fix a missing check and an uninitialized use Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 32/63] serial/sunsu: fix refcount leak Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 33/63] scsi: zfcp: fix posting too many status read buffers leading to adapter shutdown Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 34/63] libceph: fix CEPH_FEATURE_CEPHX_V2 check in calc_signature() Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 35/63] fork: record start_time late Greg Kroah-Hartman
2019-01-11 14:14 ` Greg Kroah-Hartman [this message]
2019-01-11 14:14 ` [PATCH 4.9 37/63] mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 38/63] mm, devm_memremap_pages: kill mapping "System RAM" support Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 39/63] sunrpc: fix cache_head leak due to queued request Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 40/63] sunrpc: use SVC_NET() in svcauth_gss_* functions Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 41/63] MIPS: math-emu: Write-protect delay slot emulation pages Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 42/63] crypto: x86/chacha20 - avoid sleeping with preemption disabled Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 43/63] vhost/vsock: fix uninitialized vhost_vsock->guest_cid Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 44/63] IB/hfi1: Incorrect sizing of sge for PIO will OOPs Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 45/63] ALSA: cs46xx: Potential NULL dereference in probe Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 46/63] ALSA: usb-audio: Avoid access before bLength check in build_audio_procunit() Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 47/63] ALSA: usb-audio: Fix an out-of-bound read in create_composite_quirks Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 48/63] dlm: fixed memory leaks after failed ls_remove_names allocation Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 49/63] dlm: possible memory leak on error path in create_lkb() Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 50/63] dlm: lost put_lkb on error path in receive_convert() and receive_unlock() Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 51/63] dlm: memory leaks on error path in dlm_user_request() Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 52/63] gfs2: Get rid of potential double-freeing in gfs2_create_inode Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 53/63] gfs2: Fix loop in gfs2_rbm_find Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 54/63] b43: Fix error in cordic routine Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 55/63] powerpc/tm: Set MSR[TS] just prior to recheckpoint Greg Kroah-Hartman
2019-01-11 14:14 ` [PATCH 4.9 56/63] 9p/net: put a lower bound on msize Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 57/63] rxe: fix error completion wr_id and qp_num Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 58/63] iommu/vt-d: Handle domain agaw being less than iommu agaw Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 59/63] ceph: dont update importing caps mseq when handing cap export Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 60/63] genwqe: Fix size check Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 61/63] intel_th: msu: Fix an off-by-one in attribute store Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 62/63] power: supply: olpc_battery: correct the temperature units Greg Kroah-Hartman
2019-01-11 14:15 ` [PATCH 4.9 63/63] drm/vc4: Set ->is_yuv to false when num_planes == 1 Greg Kroah-Hartman
2019-01-11 21:42 ` [PATCH 4.9 00/63] 4.9.150-stable review shuah
2019-01-12  8:07 ` Naresh Kamboju
2019-01-12 17:43 ` Guenter Roeck

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190111131051.242600471@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=osalvador@suse.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox