[patch 03/32] mm, compaction: make capture control handling safe wrt interrupts

stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch 03/32] mm, compaction: make capture control handling safe wrt interrupts
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 05/32] ocfs2: avoid inode removal while nfsd is accessing it Andrew Morton
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, alex.shi, hughd, liwang, mgorman, mm-commits, stable,
	torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, compaction: make capture control handling safe wrt interrupts

Hugh reports:

: While stressing compaction, one run oopsed on NULL capc->cc in
: __free_one_page()'s task_capc(zone): compact_zone_order() had been
: interrupted, and a page was being freed in the return from interrupt.
: 
: Though you would not expect it from the source, both gccs I was using (a
: 4.8.1 and a 7.5.0) had chosen to compile compact_zone_order() with the
: ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before callq
: compact_zone - long after the "current->capture_control = &capc".  An
: interrupt in between those finds capc->cc NULL (zeroed by an earlier rep
: stos).
: 
: This could presumably be fixed by a barrier() before setting
: current->capture_control in compact_zone_order(); but would also need more
: care on return from compact_zone(), in order not to risk leaking a page
: captured by interrupt just before capture_control is reset.
: 
: Maybe that is the preferable fix, but I felt safer for task_capc() to
: exclude the rather surprising possibility of capture at interrupt time.

I have checked that gcc10 also behaves the same.

The advantage of fix in compact_zone_order() is that we don't add another
test in the page freeing hot path, and that it might prevent future
problems if we stop exposing pointers to uninitialized structures in
current task.

So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.

Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>	[5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/mm/compaction.c~mm-compaction-make-capture-control-handling-safe-wrt-interrupts
+++ a/mm/compaction.c
@@ -2316,15 +2316,26 @@ static enum compact_result compact_zone_
 		.page = NULL,
 	};

-	current->capture_control = &capc;
+	/*
+	 * Make sure the structs are really initialized before we expose the
+	 * capture control, in case we are interrupted and the interrupt handler
+	 * frees a page.
+	 */
+	barrier();
+	WRITE_ONCE(current->capture_control, &capc);

 	ret = compact_zone(&cc, &capc);

 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));

-	*capture = capc.page;
-	current->capture_control = NULL;
+	/*
+	 * Make sure we hide capture control first before we read the captured
+	 * page pointer, otherwise an interrupt could free and capture a page
+	 * and we would leak it.
+	 */
+	WRITE_ONCE(current->capture_control, NULL);
+	*capture = READ_ONCE(capc.page);

 	return ret;
 }
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 05/32] ocfs2: avoid inode removal while nfsd is accessing it
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
  2020-06-26  3:29 ` [patch 03/32] mm, compaction: make capture control handling safe wrt interrupts Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 06/32] ocfs2: load global_inode_alloc Andrew Morton
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark,
	mm-commits, piaojun, stable, torvalds

From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: avoid inode removal while nfsd is accessing it

Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.

This is a series of patches to fix issues on nfsd over ocfs2.  patch 1 is
to avoid inode removed while nfsd access it patch 2 & 3 is to fix a panic
issue.


This patch (of 4):

When nfsd is getting file dentry using handle or parent dentry of some
dentry, one cluster lock is used to avoid inode removed from other node,
but it still could be removed from local node, so use a rw lock to avoid
this.

Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/dlmglue.c |   17 ++++++++++++++++-
 fs/ocfs2/ocfs2.h   |    1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

--- a/fs/ocfs2/dlmglue.c~ocfs2-avoid-inode-removed-while-nfsd-access-it
+++ a/fs/ocfs2/dlmglue.c
@@ -689,6 +689,12 @@ static void ocfs2_nfs_sync_lock_res_init
 				   &ocfs2_nfs_sync_lops, osb);
 }
 
+static void ocfs2_nfs_sync_lock_init(struct ocfs2_super *osb)
+{
+	ocfs2_nfs_sync_lock_res_init(&osb->osb_nfs_sync_lockres, osb);
+	init_rwsem(&osb->nfs_sync_rwlock);
+}
+
 void ocfs2_trim_fs_lock_res_init(struct ocfs2_super *osb)
 {
 	struct ocfs2_lock_res *lockres = &osb->osb_trim_fs_lockres;
@@ -2855,6 +2861,11 @@ int ocfs2_nfs_sync_lock(struct ocfs2_sup
 	if (ocfs2_is_hard_readonly(osb))
 		return -EROFS;
 
+	if (ex)
+		down_write(&osb->nfs_sync_rwlock);
+	else
+		down_read(&osb->nfs_sync_rwlock);
+
 	if (ocfs2_mount_local(osb))
 		return 0;
 
@@ -2873,6 +2884,10 @@ void ocfs2_nfs_sync_unlock(struct ocfs2_
 	if (!ocfs2_mount_local(osb))
 		ocfs2_cluster_unlock(osb, lockres,
 				     ex ? LKM_EXMODE : LKM_PRMODE);
+	if (ex)
+		up_write(&osb->nfs_sync_rwlock);
+	else
+		up_read(&osb->nfs_sync_rwlock);
 }
 
 int ocfs2_trim_fs_lock(struct ocfs2_super *osb,
@@ -3340,7 +3355,7 @@ int ocfs2_dlm_init(struct ocfs2_super *o
 local:
 	ocfs2_super_lock_res_init(&osb->osb_super_lockres, osb);
 	ocfs2_rename_lock_res_init(&osb->osb_rename_lockres, osb);
-	ocfs2_nfs_sync_lock_res_init(&osb->osb_nfs_sync_lockres, osb);
+	ocfs2_nfs_sync_lock_init(osb);
 	ocfs2_orphan_scan_lock_res_init(&osb->osb_orphan_scan.os_lockres, osb);
 
 	osb->cconn = conn;
--- a/fs/ocfs2/ocfs2.h~ocfs2-avoid-inode-removed-while-nfsd-access-it
+++ a/fs/ocfs2/ocfs2.h
@@ -395,6 +395,7 @@ struct ocfs2_super
 	struct ocfs2_lock_res osb_super_lockres;
 	struct ocfs2_lock_res osb_rename_lockres;
 	struct ocfs2_lock_res osb_nfs_sync_lockres;
+	struct rw_semaphore nfs_sync_rwlock;
 	struct ocfs2_lock_res osb_trim_fs_lockres;
 	struct mutex obs_trim_fs_mutex;
 	struct ocfs2_dlm_debug *osb_dlm_debug;
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 06/32] ocfs2: load global_inode_alloc
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
  2020-06-26  3:29 ` [patch 03/32] mm, compaction: make capture control handling safe wrt interrupts Andrew Morton
  2020-06-26  3:29 ` [patch 05/32] ocfs2: avoid inode removal while nfsd is accessing it Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 07/32] ocfs2: fix panic on nfs server over ocfs2 Andrew Morton
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark,
	mm-commits, piaojun, stable, torvalds

From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: load global_inode_alloc

Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will make
it load during mount.  It can be used to test whether some global/system
inodes are valid.  One use case is that nfsd will test whether root inode
is valid.

Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/ocfs2_fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ocfs2/ocfs2_fs.h~ocfs2-load-global_inode_alloc
+++ a/fs/ocfs2/ocfs2_fs.h
@@ -326,8 +326,8 @@ struct ocfs2_system_inode_info {
 enum {
 	BAD_BLOCK_SYSTEM_INODE = 0,
 	GLOBAL_INODE_ALLOC_SYSTEM_INODE,
+#define OCFS2_FIRST_ONLINE_SYSTEM_INODE GLOBAL_INODE_ALLOC_SYSTEM_INODE
 	SLOT_MAP_SYSTEM_INODE,
-#define OCFS2_FIRST_ONLINE_SYSTEM_INODE SLOT_MAP_SYSTEM_INODE
 	HEARTBEAT_SYSTEM_INODE,
 	GLOBAL_BITMAP_SYSTEM_INODE,
 	USER_QUOTA_SYSTEM_INODE,
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 07/32] ocfs2: fix panic on nfs server over ocfs2
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (2 preceding siblings ...)
  2020-06-26  3:29 ` [patch 06/32] ocfs2: load global_inode_alloc Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 08/32] ocfs2: fix value of OCFS2_INVALID_SLOT Andrew Morton
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark,
	mm-commits, piaojun, stable, torvalds

From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: fix panic on nfs server over ocfs2

The following kernel panic was captured when running nfs server over
ocfs2, at that time ocfs2_test_inode_bit() was checking whether one inode
locating at "blkno" 5 was valid, that is ocfs2 root inode, its
"suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from
//global_inode_alloc, but here it wrongly assumed that it was got from per
slot inode alloctor which would cause array overflow and trigger kernel
panic.

[430033.469151] BUG: unable to handle kernel paging request at
0000000000001088
[430033.469367] IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
[430033.469567] PGD 1e06ba067 PUD 1e9e7d067 PMD 0
[430033.469769] Oops: 0002 [#1] SMP
[430033.469975] Modules linked in: tun nfsd lockd grace nfs_acl auth_rpcgss
ocfs2 xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs
xen_privcmd ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
ocfs2_stackglue configfs bnx2fc fcoe libfcoe libfc sunrpc bridge 8021q mrp
garp stp llc bonding dm_round_robin scsi_dh_emc dm_multipath iTCO_wdt
iTCO_vendor_support pcspkr sb_edac edac_core i2c_i801 i2c_core lpc_ich
mfd_core sg ext4 jbd2 mbcache2 sd_mod ahci libahci lpfc scsi_transport_fc
be2net vxlan udp_tunnel ip6_udp_tunnel mpt3sas scsi_transport_sas raid_class
crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3
mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
dm_mirror dm_region_hash dm_log dm_mod
[430033.472350] CPU: 6 PID: 24873 Comm: nfsd Not tainted
4.1.12-124.36.1.el6uek.x86_64 #2
[430033.472719] Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87
02/02/2018
[430033.472910] task: ffff88005ae98000 ti: ffff88005ae94000 task.ti:
ffff88005ae94000
[430033.473277] RIP: e030:[<ffffffff816f6898>]  [<ffffffff816f6898>]
_raw_spin_lock+0x18/0xf0
[430033.473655] RSP: e02b:ffff88005ae97908  EFLAGS: 00010206
[430033.473850] RAX: ffff88005ae98000 RBX: 0000000000001088 RCX:
0000000000000000
[430033.474205] RDX: 0000000000020000 RSI: 0000000000000009 RDI:
0000000000001088
[430033.474574] RBP: ffff88005ae97928 R08: 0000000000000000 R09:
ffff880212878e00
[430033.474938] R10: 0000000000007ff0 R11: 0000000000000000 R12:
0000000000001088
[430033.475324] R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15:
000000000000ffff
[430033.475721] FS:  0000000000000000(0000) GS:ffff880218180000(0000)
knlGS:ffff880218180000
[430033.476199] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[430033.476390] CR2: 0000000000001088 CR3: 00000002033d0000 CR4:
0000000000042660
[430033.476760] Stack:
[430033.476942]  0000000000001000 0000000000001088 ffff8800063c0aa8
ffff8800650c27d0
[430033.477329]  ffff88005ae97948 ffffffff8122a3de 0000000000000009
ffff8800063c0000
[430033.477718]  ffff88005ae979e8 ffffffffc0714e43 ffff88005ae97968
ffff88019de8f958
[430033.478104] Call Trace:
[430033.478286]  [<ffffffff8122a3de>] igrab+0x1e/0x60
[430033.478494]  [<ffffffffc0714e43>] ocfs2_get_system_file_inode+0x63/0x3a0
[ocfs2]
[430033.478870]  [<ffffffffc06a87df>] ? ocfs2_read_blocks_sync+0x13f/0x3c0
[ocfs2]
[430033.479267]  [<ffffffffc06ff2d8>] ocfs2_test_inode_bit+0x328/0xa00
[ocfs2]
[430033.479498]  [<ffffffffc06bef5a>] ocfs2_get_parent+0xba/0x3e0 [ocfs2]
[430033.479730]  [<ffffffff8129b305>] reconnect_path+0xb5/0x300
[430033.479933]  [<ffffffff8129b646>] exportfs_decode_fh+0xf6/0x2b0
[430033.480124]  [<ffffffffc0814af0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
[430033.480294]  [<ffffffffc081a682>] ? exp_find+0xe2/0x190 [nfsd]
[430033.480461]  [<ffffffff810e5a7e>] ? irq_get_irq_data+0xe/0x10
[430033.480627]  [<ffffffff810ea1a7>] ? __call_rcu_nocb_enqueue+0xd7/0xe0
[430033.480794]  [<ffffffff810eb9e8>] ? __call_rcu+0xe8/0x360
[430033.480959]  [<ffffffffc0815860>] fh_verify+0x350/0x660 [nfsd]
[430033.481134]  [<ffffffffc0535076>] ? cache_check+0x56/0x3a0 [sunrpc]
[430033.481317]  [<ffffffffc0823a4d>] nfsd4_putfh+0x4d/0x60 [nfsd]
[430033.481505]  [<ffffffffc0826003>] nfsd4_proc_compound+0x3d3/0x6f0 [nfsd]
[430033.481730]  [<ffffffffc0811f60>] nfsd_dispatch+0xe0/0x290 [nfsd]
[430033.481950]  [<ffffffffc052b752>] ? svc_tcp_adjust_wspace+0x12/0x30
[sunrpc]
[430033.482152]  [<ffffffffc052a512>] svc_process_common+0x412/0x6a0 [sunrpc]
[430033.482351]  [<ffffffffc052a8c3>] svc_process+0x123/0x210 [sunrpc]
[430033.482550]  [<ffffffffc081190f>] nfsd+0xff/0x170 [nfsd]
[430033.482744]  [<ffffffffc0811810>] ? nfsd_destroy+0x80/0x80 [nfsd]
[430033.482943]  [<ffffffff810a7aeb>] kthread+0xcb/0xf0
[430033.483151]  [<ffffffff816f10ea>] ? __schedule+0x24a/0x810
[430033.483354]  [<ffffffff816f10ea>] ? __schedule+0x24a/0x810
[430033.483553]  [<ffffffff810a7a20>] ? kthread_create_on_node+0x180/0x180
[430033.483777]  [<ffffffff816f72a1>] ret_from_fork+0x61/0x90
[430033.483976]  [<ffffffff810a7a20>] ? kthread_create_on_node+0x180/0x180
[430033.484191] Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40
00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00
<f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6
[430033.485174] RIP  [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
[430033.485370]  RSP <ffff88005ae97908>
[430033.485566] CR2: 0000000000001088
[430033.486223] ---[ end trace 7264463cd1aac8f9 ]---
[430033.666368] Kernel panic - not syncing: Fatal exception

Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/suballoc.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/fs/ocfs2/suballoc.c~ocfs2-fix-panic-on-nfs-server-over-ocfs2
+++ a/fs/ocfs2/suballoc.c
@@ -2825,9 +2825,12 @@ int ocfs2_test_inode_bit(struct ocfs2_su
 		goto bail;
 	}
 
-	inode_alloc_inode =
-		ocfs2_get_system_file_inode(osb, INODE_ALLOC_SYSTEM_INODE,
-					    suballoc_slot);
+	if (suballoc_slot == (u16)OCFS2_INVALID_SLOT)
+		inode_alloc_inode = ocfs2_get_system_file_inode(osb,
+			GLOBAL_INODE_ALLOC_SYSTEM_INODE, suballoc_slot);
+	else
+		inode_alloc_inode = ocfs2_get_system_file_inode(osb,
+			INODE_ALLOC_SYSTEM_INODE, suballoc_slot);
 	if (!inode_alloc_inode) {
 		/* the error code could be inaccurate, but we are not able to
 		 * get the correct one. */
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 08/32] ocfs2: fix value of OCFS2_INVALID_SLOT
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (3 preceding siblings ...)
  2020-06-26  3:29 ` [patch 07/32] ocfs2: fix panic on nfs server over ocfs2 Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 11/32] mm, slab: fix sign conversion problem in memcg_uncharge_slab() Andrew Morton
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark,
	mm-commits, piaojun, stable, torvalds

From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: fix value of OCFS2_INVALID_SLOT

In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2
implementation, slot number is 32 bits.  Usually this will not cause any
issue, because slot number is converted from u16 to u32, but
OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from
disk was obtained, its value was (u16)-1, and it was converted to u32. 
Then the following checking in get_local_system_inode will be always
skipped:

 static struct inode **get_local_system_inode(struct ocfs2_super *osb,
                                               int type,
                                               u32 slot)
 {
 	BUG_ON(slot == OCFS2_INVALID_SLOT);
	...
 }

Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/ocfs2_fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ocfs2/ocfs2_fs.h~ocfs2-fix-value-of-ocfs2_invalid_slot
+++ a/fs/ocfs2/ocfs2_fs.h
@@ -290,7 +290,7 @@
 #define OCFS2_MAX_SLOTS			255
 
 /* Slot map indicator for an empty slot */
-#define OCFS2_INVALID_SLOT		-1
+#define OCFS2_INVALID_SLOT		((u16)-1)
 
 #define OCFS2_VOL_UUID_LEN		16
 #define OCFS2_MAX_VOL_LABEL_LEN		64
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 11/32] mm, slab: fix sign conversion problem in memcg_uncharge_slab()
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (4 preceding siblings ...)
  2020-06-26  3:29 ` [patch 08/32] ocfs2: fix value of OCFS2_INVALID_SLOT Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 12/32] mm/slab: use memzero_explicit() in kzfree() Andrew Morton
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, cl, guro, hannes, iamjoonsoo.kim, longman, mhocko,
	mm-commits, penberg, rientjes, shakeelb, stable, torvalds,
	vdavydov.dev

From: Waiman Long <longman@redhat.com>
Subject: mm, slab: fix sign conversion problem in memcg_uncharge_slab()

It was found that running the LTP test on a PowerPC system could produce
erroneous values in /proc/meminfo, like:

  MemTotal:       531915072 kB
  MemFree:        507962176 kB
  MemAvailable:   1100020596352 kB

Using bisection, the problem is tracked down to commit 9c315e4d7d8c ("mm:
memcg/slab: cache page number in memcg_(un)charge_slab()").

In memcg_uncharge_slab() with a "int order" argument:

  unsigned int nr_pages = 1 << order;
    :
  mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);

The mod_lruvec_state() function will eventually call the
__mod_zone_page_state() which accepts a long argument.  Depending on the
compiler and how inlining is done, "-nr_pages" may be treated as a
negative number or a very large positive number.  Apparently, it was
treated as a large positive number in that PowerPC system leading to
incorrect stat counts.  This problem hasn't been seen in x86-64 yet,
perhaps the gcc compiler there has some slight difference in behavior.

It is fixed by making nr_pages a signed value.  For consistency, a similar
change is applied to memcg_charge_slab() as well.

Link: http://lkml.kernel.org/r/20200620184719.10994-1-longman@redhat.com
Fixes: 9c315e4d7d8c ("mm: memcg/slab: cache page number in memcg_(un)charge_slab()").
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/slab.h~mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab
+++ a/mm/slab.h
@@ -348,7 +348,7 @@ static __always_inline int memcg_charge_
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
-	unsigned int nr_pages = 1 << order;
+	int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	int ret;
@@ -388,7 +388,7 @@ out:
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
-	unsigned int nr_pages = 1 << order;
+	int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 12/32] mm/slab: use memzero_explicit() in kzfree()
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (5 preceding siblings ...)
  2020-06-26  3:29 ` [patch 11/32] mm, slab: fix sign conversion problem in memcg_uncharge_slab() Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:29 ` [patch 14/32] mm: fix swap cache node allocation mask Andrew Morton
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, dan.carpenter, dhowells, hannes, jarkko.sakkinen, Jason,
	jmorris, joe, longman, mhocko, mm-commits, rientjes, serge,
	stable, torvalds, willy

From: Waiman Long <longman@redhat.com>
Subject: mm/slab: use memzero_explicit() in kzfree()

The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back to
the pool.  Memset() is currently used for buffer clearing.  However
unlikely, there is still a non-zero probability that the compiler may
choose to optimize away the memory clearing especially if LTO is being
used in the future.  To make sure that this optimization will never
happen, memzero_explicit(), which is introduced in v3.18, is now used in
kzfree() to future-proof it.

Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab_common.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slab_common.c~mm-slab-use-memzero_explicit-in-kzfree
+++ a/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
 	if (unlikely(ZERO_OR_NULL_PTR(mem)))
 		return;
 	ks = ksize(mem);
-	memset(mem, 0, ks);
+	memzero_explicit(mem, ks);
 	kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 14/32] mm: fix swap cache node allocation mask
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (6 preceding siblings ...)
  2020-06-26  3:29 ` [patch 12/32] mm/slab: use memzero_explicit() in kzfree() Andrew Morton
@ 2020-06-26  3:29 ` Andrew Morton
  2020-06-26  3:30 ` [patch 20/32] mm: memcontrol: handle div0 crash race condition in memory.low Andrew Morton
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:29 UTC (permalink / raw)
  To: akpm, hughd, lists, mm-commits, stable, torvalds, vbabka, willy

From: Hugh Dickins <hughd@google.com>
Subject: mm: fix swap cache node allocation mask

https://bugzilla.kernel.org/show_bug.cgi?id=208085 reports that a slightly
overcommitted load, testing swap and zram along with i915, splats and
keeps on splatting, when it had better fail less noisily:

gnome-shell: page allocation failure: order:0,
mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
nodemask=(null),cpuset=/,mems_allowed=0
CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
Call Trace:
dump_stack+0x64/0x88
warn_alloc.cold+0x75/0xd9
__alloc_pages_slowpath.constprop.0+0xcfa/0xd30
__alloc_pages_nodemask+0x2df/0x320
alloc_slab_page+0x195/0x310
allocate_slab+0x3c5/0x440
___slab_alloc+0x40c/0x5f0
__slab_alloc+0x1c/0x30
kmem_cache_alloc+0x20e/0x220
xas_nomem+0x28/0x70
add_to_swap_cache+0x321/0x400
__read_swap_cache_async+0x105/0x240
swap_cluster_readahead+0x22c/0x2e0
shmem_swapin+0x8e/0xc0
shmem_swapin_page+0x196/0x740
shmem_getpage_gfp+0x3a2/0xa60
shmem_read_mapping_page_gfp+0x32/0x60
shmem_get_pages+0x155/0x5e0 [i915]
__i915_gem_object_get_pages+0x68/0xa0 [i915]
i915_vma_pin+0x3fe/0x6c0 [i915]
eb_add_vma+0x10b/0x2c0 [i915]
i915_gem_do_execbuffer+0x704/0x3430 [i915]
i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
drm_ioctl_kernel+0x86/0xd0 [drm]
drm_ioctl+0x206/0x390 [drm]
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported on 5.7, but it goes back really to 3.1: when
shmem_read_mapping_page_gfp() was implemented for use by i915, and
allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
__read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Chris Murphy <lists@colorremedies.com>
Analyzed-by: Vlastimil Babka <vbabka@suse.cz>
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Chris Murphy <lists@colorremedies.com>
Cc: <stable@vger.kernel.org>	[3.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap_state.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/swap_state.c~mm-fix-swap-cache-node-allocation-mask
+++ a/mm/swap_state.c
@@ -21,7 +21,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
-
+#include "internal.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -429,7 +429,7 @@ struct page *__read_swap_cache_async(swp
 	__SetPageSwapBacked(page);
 
 	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
+	if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK)) {
 		put_swap_page(page, entry);
 		goto fail_unlock;
 	}
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 20/32] mm: memcontrol: handle div0 crash race condition in memory.low
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (7 preceding siblings ...)
  2020-06-26  3:29 ` [patch 14/32] mm: fix swap cache node allocation mask Andrew Morton
@ 2020-06-26  3:30 ` Andrew Morton
  2020-06-26  3:30 ` [patch 21/32] mm/memcontrol.c: add missed css_put() Andrew Morton
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:30 UTC (permalink / raw)
  To: akpm, chris, guro, hannes, mhocko, mm-commits, stable, tj, torvalds

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: handle div0 crash race condition in memory.low

Tejun reports seeing rare div0 crashes in memory.low stress testing:

[37228.504582] RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
[37228.505059] Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
[37228.506254] RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
[37228.506769] RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
[37228.507319] RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
[37228.507869] RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
[37228.508376] R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
[37228.508881] R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
[37228.509397] FS:  0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
[37228.509917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37228.510442] CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
[37228.511076] Call Trace:
[37228.511561]  shrink_node+0x1e5/0x6c0
[37228.512044]  balance_pgdat+0x32d/0x5f0
[37228.512521]  kswapd+0x1d7/0x3d0
[37228.513346]  ? wait_woken+0x80/0x80
[37228.514170]  kthread+0x11c/0x160
[37228.514983]  ? balance_pgdat+0x5f0/0x5f0
[37228.515797]  ? kthread_park+0x90/0x90
[37228.516593]  ret_from_fork+0x1f/0x30

This happens when parent_usage == siblings_protected.  We check that usage
is bigger than protected, which should imply parent_usage being bigger
than siblings_protected.  However, we don't read (or even update) these
values atomically, and they can be out of sync as the memory state changes
under us.  A bit of fluctuation around the target protection isn't a big
deal, but we need to handle the div0 case.

Check the parent state explicitly to make sure we have a reasonable
positive value for the divisor.

Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-handle-div0-crash-race-condition-in-memorylow
+++ a/mm/memcontrol.c
@@ -6360,11 +6360,16 @@ static unsigned long effective_protectio
 	 * We're using unprotected memory for the weight so that if
 	 * some cgroups DO claim explicit protection, we don't protect
 	 * the same bytes twice.
+	 *
+	 * Check both usage and parent_usage against the respective
+	 * protected values. One should imply the other, but they
+	 * aren't read atomically - make sure the division is sane.
 	 */
 	if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT))
 		return ep;
-
-	if (parent_effective > siblings_protected && usage > protected) {
+	if (parent_effective > siblings_protected &&
+	    parent_usage > siblings_protected &&
+	    usage > protected) {
 		unsigned long unclaimed;
 
 		unclaimed = parent_effective - siblings_protected;
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 21/32] mm/memcontrol.c: add missed css_put()
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (8 preceding siblings ...)
  2020-06-26  3:30 ` [patch 20/32] mm: memcontrol: handle div0 crash race condition in memory.low Andrew Morton
@ 2020-06-26  3:30 ` Andrew Morton
  2020-06-26  3:30 ` [patch 31/32] mm/memory_hotplug.c: fix false softlockup during pfn range removal Andrew Morton
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:30 UTC (permalink / raw)
  To: akpm, cai, guro, hannes, mhocko, mm-commits, songmuchun, stable,
	torvalds, vdavydov.dev

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm/memcontrol.c: add missed css_put()

We should put the css reference when memory allocation failed.

Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com
Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/memcontrol.c~mm-memcontrol-fix-do-not-put-the-css-reference
+++ a/mm/memcontrol.c
@@ -2772,8 +2772,10 @@ static void memcg_schedule_kmem_cache_cr
 		return;
 
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
-	if (!cw)
+	if (!cw) {
+		css_put(&memcg->css);
 		return;
+	}
 
 	cw->memcg = memcg;
 	cw->cachep = cachep;
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 31/32] mm/memory_hotplug.c: fix false softlockup during pfn range removal
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (9 preceding siblings ...)
  2020-06-26  3:30 ` [patch 21/32] mm/memcontrol.c: add missed css_put() Andrew Morton
@ 2020-06-26  3:30 ` Andrew Morton
  2020-06-27  3:33 ` [merged] mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch removed from -mm tree Andrew Morton
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-26  3:30 UTC (permalink / raw)
  To: akpm, ben.widawsky, dan.j.williams, david, mm-commits, stable,
	steve.scargall, torvalds, vishal.l.verma

From: Ben Widawsky <ben.widawsky@intel.com>
Subject: mm/memory_hotplug.c: fix false softlockup during pfn range removal

When working with very large nodes, poisoning the struct pages (for which
there will be very many) can take a very long time.  If the system is
using voluntary preemptions, the software watchdog will not be able to
detect forward progress.  This patch addresses this issue by offering to
give up time like __remove_pages() does.  This behavior was introduced in
v5.6 with: commit d33695b16a9f ("mm/memory_hotplug: poison memmap in
remove_pfn_range_from_zone()")

Alternately, init_page_poison could do this cond_resched(), but it seems
to me that the caller of init_page_poison() is what actually knows whether
or not it should relax its own priority.

Based on Dan's notes, I think this is perfectly safe: commit f931ab479dd2
("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")

Aside from fixing the lockup, it is also a friendlier thing to do on lower
core systems that might wipe out large chunks of hotplug memory (probably
not a very common case).

Fixes this kind of splat:
[  352.142079] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
[  352.150067] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_umad vfat fat kmem intel_rapl_msr intel_rapl_common rpcrdma sunrpc rdma_ucm ib_iser isst_if_common rdma_cm skx_edac iw_cm ib_cm x86_pkg_temp_thermal libiscsi intel_powerclamp scsi_transport_iscsi coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i40iw intel_cstate iTCO_wdt ib_uverbs iTCO_vendor_support device_dax ib_core joydev intel_uncore i2c_i801 lpc_ich ipmi_ssif ioatdma dca wmi ipmi_si ipmi_devintf ipmi_msghandler dax_pmem
[  352.150123]  dax_pmem_core acpi_pad acpi_power_meter drm ip_tables xfs libcrc32c nd_pmem nd_btt i40e e1000e crc32c_intel nfit
[  352.260774] irq event stamp: 138450
[  352.264692] hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
[  352.275134] hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
[  352.285662] softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
[  352.295233] softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
[  352.304210] CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
[  352.313283] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
[  352.324308] RIP: 0010:memset_erms+0x9/0x10
[  352.328905] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
[  352.349953] RSP: 0018:ffffc90021b5fd50 EFLAGS: 00010202 ORIG_RAX: ffffffffffffff13
[  352.358450] RAX: 00000000000000ff RBX: ffff88983ffd5000 RCX: 0000000065df0e40
[  352.366457] RDX: 00000003a0000000 RSI: 00000000ffffffff RDI: ffffea039b20f1c0
[  352.374465] RBP: ffff88983ffd6c00 R08: 0000000000000000 R09: ffffea0061000000
[  352.382473] R10: 0000000000000001 R11: 0000000000000000 R12: ffffea006f808000
[  352.390480] R13: 0000000001840000 R14: 000000000e800000 R15: ffff8997d7b764e0
[  352.398487] FS:  00007f724ef81780(0000) GS:ffff8997df100000(0000) knlGS:0000000000000000
[  352.407562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.414011] CR2: 00007ffcd4070768 CR3: 000001178c722002 CR4: 00000000003606e0
[  352.422056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  352.430092] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  352.438115] Call Trace:
[  352.440865]  remove_pfn_range_from_zone+0x3a/0x380
[  352.446244]  ? cpumask_next+0x17/0x20
[  352.450361]  memunmap_pages+0x17f/0x280
[  352.454670]  release_nodes+0x22a/0x260
[  352.458888]  __device_release_driver+0x172/0x220
[  352.464070]  device_driver_detach+0x3e/0xa0
[  352.468753]  unbind_store+0x113/0x130
[  352.472868]  kernfs_fop_write+0xdc/0x1c0
[  352.477273]  vfs_write+0xde/0x1d0
[  352.482218]  ksys_write+0x58/0xd0
[  352.487207]  do_syscall_64+0x5a/0x120
[  352.492529]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  352.499446] RIP: 0033:0x7f724f40b5e7
[  352.504673] Code: Bad RIP value.
[  352.509484] RSP: 002b:00007ffcd40738f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  352.519213] RAX: ffffffffffffffda RBX: 00007f724ef816a8 RCX: 00007f724f40b5e7
[  352.528410] RDX: 0000000000000007 RSI: 00005617d7cd1277 RDI: 0000000000000003
[  352.537573] RBP: 0000000000000003 R08: 00000000ffffffff R09: 00007ffcd40737d0
[  352.546764] R10: 0000000000000000 R11: 0000000000000246 R12: 00005617d7cd1277
[  352.555929] R13: 0000000000000000 R14: 0000000000000007 R15: 00005617d7cd1230
[  370.353742] Built 2 zonelists, mobility grouping on.  Total pages: 49050381
[  370.373317] Policy zone: Normal
[  374.948164] Built 3 zonelists, mobility grouping on.  Total pages: 49312525
[  375.017496] Policy zone: Normal

David said: "It really only is an issue for devmem.  Ordinary
hotplugged system memory is not affected (onlined/offlined in memory
block granularity)."

Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
Fixes: commit d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
Reported-by: Ben Widawsky <ben.widawsky@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

--- a/mm/memory_hotplug.c~mm-fix-false-softlockup-during-pfn-range-removal
+++ a/mm/memory_hotplug.c
@@ -471,11 +471,20 @@ void __ref remove_pfn_range_from_zone(st
 				      unsigned long start_pfn,
 				      unsigned long nr_pages)
 {
+	const unsigned long end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
-	unsigned long flags;
+	unsigned long pfn, cur_nr_pages, flags;
 
 	/* Poison struct pages because they are now uninitialized again. */
-	page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
+	for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
+		cond_resched();
+
+		/* Select all remaining pages up to the next section boundary */
+		cur_nr_pages =
+			min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
+		page_init_poison(pfn_to_page(pfn),
+				 sizeof(struct page) * cur_nr_pages);
+	}
 
 #ifdef CONFIG_ZONE_DEVICE
 	/*
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (10 preceding siblings ...)
  2020-06-26  3:30 ` [patch 31/32] mm/memory_hotplug.c: fix false softlockup during pfn range removal Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] ocfs2-avoid-inode-removed-while-nfsd-access-it.patch " Andrew Morton
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: alex.shi, hughd, liwang, mgorman, mm-commits, stable, vbabka


The patch titled
     Subject: mm, compaction: make capture control handling safe wrt interrupts
has been removed from the -mm tree.  Its filename was
     mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, compaction: make capture control handling safe wrt interrupts

Hugh reports:

: While stressing compaction, one run oopsed on NULL capc->cc in
: __free_one_page()'s task_capc(zone): compact_zone_order() had been
: interrupted, and a page was being freed in the return from interrupt.
: 
: Though you would not expect it from the source, both gccs I was using (a
: 4.8.1 and a 7.5.0) had chosen to compile compact_zone_order() with the
: ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before callq
: compact_zone - long after the "current->capture_control = &capc".  An
: interrupt in between those finds capc->cc NULL (zeroed by an earlier rep
: stos).
: 
: This could presumably be fixed by a barrier() before setting
: current->capture_control in compact_zone_order(); but would also need more
: care on return from compact_zone(), in order not to risk leaking a page
: captured by interrupt just before capture_control is reset.
: 
: Maybe that is the preferable fix, but I felt safer for task_capc() to
: exclude the rather surprising possibility of capture at interrupt time.

I have checked that gcc10 also behaves the same.

The advantage of fix in compact_zone_order() is that we don't add another
test in the page freeing hot path, and that it might prevent future
problems if we stop exposing pointers to uninitialized structures in
current task.

So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.

Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>	[5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/mm/compaction.c~mm-compaction-make-capture-control-handling-safe-wrt-interrupts
+++ a/mm/compaction.c
@@ -2316,15 +2316,26 @@ static enum compact_result compact_zone_
 		.page = NULL,
 	};
 
-	current->capture_control = &capc;
+	/*
+	 * Make sure the structs are really initialized before we expose the
+	 * capture control, in case we are interrupted and the interrupt handler
+	 * frees a page.
+	 */
+	barrier();
+	WRITE_ONCE(current->capture_control, &capc);
 
 	ret = compact_zone(&cc, &capc);
 
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
-	*capture = capc.page;
-	current->capture_control = NULL;
+	/*
+	 * Make sure we hide capture control first before we read the captured
+	 * page pointer, otherwise an interrupt could free and capture a page
+	 * and we would leak it.
+	 */
+	WRITE_ONCE(current->capture_control, NULL);
+	*capture = READ_ONCE(capc.page);
 
 	return ret;
 }
_

Patches currently in -mm which might be from vbabka@suse.cz are

mm-slub-extend-slub_debug-syntax-for-multiple-blocks.patch
mm-slub-make-some-slub_debug-related-attributes-read-only.patch
mm-slub-remove-runtime-allocation-order-changes.patch
mm-slub-make-remaining-slub_debug-related-attributes-read-only.patch
mm-slub-make-reclaim_account-attribute-read-only.patch
mm-slub-introduce-static-key-for-slub_debug.patch
mm-slub-introduce-kmem_cache_debug_flags.patch
mm-slub-introduce-kmem_cache_debug_flags-fix.patch
mm-slub-extend-checks-guarded-by-slub_debug-static-key.patch
mm-slab-slub-move-and-improve-cache_from_obj.patch
mm-slab-slub-improve-error-reporting-and-overhead-of-cache_from_obj.patch
mm-slab-slub-improve-error-reporting-and-overhead-of-cache_from_obj-fix.patch
mm-page_alloc-use-unlikely-in-task_capc.patch


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] ocfs2-avoid-inode-removed-while-nfsd-access-it.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (11 preceding siblings ...)
  2020-06-27  3:33 ` [merged] mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch removed from -mm tree Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] ocfs2-load-global_inode_alloc.patch " Andrew Morton
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark, mm-commits,
	piaojun, stable


The patch titled
     Subject: ocfs2: avoid inode removal while nfsd is accessing it
has been removed from the -mm tree.  Its filename was
     ocfs2-avoid-inode-removed-while-nfsd-access-it.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: avoid inode removal while nfsd is accessing it

Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.

This is a series of patches to fix issues on nfsd over ocfs2.  patch 1 is
to avoid inode removed while nfsd access it patch 2 & 3 is to fix a panic
issue.


This patch (of 4):

When nfsd is getting file dentry using handle or parent dentry of some
dentry, one cluster lock is used to avoid inode removed from other node,
but it still could be removed from local node, so use a rw lock to avoid
this.

Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/dlmglue.c |   17 ++++++++++++++++-
 fs/ocfs2/ocfs2.h   |    1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

--- a/fs/ocfs2/dlmglue.c~ocfs2-avoid-inode-removed-while-nfsd-access-it
+++ a/fs/ocfs2/dlmglue.c
@@ -689,6 +689,12 @@ static void ocfs2_nfs_sync_lock_res_init
 				   &ocfs2_nfs_sync_lops, osb);
 }
 
+static void ocfs2_nfs_sync_lock_init(struct ocfs2_super *osb)
+{
+	ocfs2_nfs_sync_lock_res_init(&osb->osb_nfs_sync_lockres, osb);
+	init_rwsem(&osb->nfs_sync_rwlock);
+}
+
 void ocfs2_trim_fs_lock_res_init(struct ocfs2_super *osb)
 {
 	struct ocfs2_lock_res *lockres = &osb->osb_trim_fs_lockres;
@@ -2855,6 +2861,11 @@ int ocfs2_nfs_sync_lock(struct ocfs2_sup
 	if (ocfs2_is_hard_readonly(osb))
 		return -EROFS;
 
+	if (ex)
+		down_write(&osb->nfs_sync_rwlock);
+	else
+		down_read(&osb->nfs_sync_rwlock);
+
 	if (ocfs2_mount_local(osb))
 		return 0;
 
@@ -2873,6 +2884,10 @@ void ocfs2_nfs_sync_unlock(struct ocfs2_
 	if (!ocfs2_mount_local(osb))
 		ocfs2_cluster_unlock(osb, lockres,
 				     ex ? LKM_EXMODE : LKM_PRMODE);
+	if (ex)
+		up_write(&osb->nfs_sync_rwlock);
+	else
+		up_read(&osb->nfs_sync_rwlock);
 }
 
 int ocfs2_trim_fs_lock(struct ocfs2_super *osb,
@@ -3340,7 +3355,7 @@ int ocfs2_dlm_init(struct ocfs2_super *o
 local:
 	ocfs2_super_lock_res_init(&osb->osb_super_lockres, osb);
 	ocfs2_rename_lock_res_init(&osb->osb_rename_lockres, osb);
-	ocfs2_nfs_sync_lock_res_init(&osb->osb_nfs_sync_lockres, osb);
+	ocfs2_nfs_sync_lock_init(osb);
 	ocfs2_orphan_scan_lock_res_init(&osb->osb_orphan_scan.os_lockres, osb);
 
 	osb->cconn = conn;
--- a/fs/ocfs2/ocfs2.h~ocfs2-avoid-inode-removed-while-nfsd-access-it
+++ a/fs/ocfs2/ocfs2.h
@@ -395,6 +395,7 @@ struct ocfs2_super
 	struct ocfs2_lock_res osb_super_lockres;
 	struct ocfs2_lock_res osb_rename_lockres;
 	struct ocfs2_lock_res osb_nfs_sync_lockres;
+	struct rw_semaphore nfs_sync_rwlock;
 	struct ocfs2_lock_res osb_trim_fs_lockres;
 	struct mutex obs_trim_fs_mutex;
 	struct ocfs2_dlm_debug *osb_dlm_debug;
_

Patches currently in -mm which might be from junxiao.bi@oracle.com are



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] ocfs2-load-global_inode_alloc.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (12 preceding siblings ...)
  2020-06-27  3:33 ` [merged] ocfs2-avoid-inode-removed-while-nfsd-access-it.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] ocfs2-fix-panic-on-nfs-server-over-ocfs2.patch " Andrew Morton
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark, mm-commits,
	piaojun, stable


The patch titled
     Subject: ocfs2: load global_inode_alloc
has been removed from the -mm tree.  Its filename was
     ocfs2-load-global_inode_alloc.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: load global_inode_alloc

Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will make
it load during mount.  It can be used to test whether some global/system
inodes are valid.  One use case is that nfsd will test whether root inode
is valid.

Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/ocfs2_fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ocfs2/ocfs2_fs.h~ocfs2-load-global_inode_alloc
+++ a/fs/ocfs2/ocfs2_fs.h
@@ -326,8 +326,8 @@ struct ocfs2_system_inode_info {
 enum {
 	BAD_BLOCK_SYSTEM_INODE = 0,
 	GLOBAL_INODE_ALLOC_SYSTEM_INODE,
+#define OCFS2_FIRST_ONLINE_SYSTEM_INODE GLOBAL_INODE_ALLOC_SYSTEM_INODE
 	SLOT_MAP_SYSTEM_INODE,
-#define OCFS2_FIRST_ONLINE_SYSTEM_INODE SLOT_MAP_SYSTEM_INODE
 	HEARTBEAT_SYSTEM_INODE,
 	GLOBAL_BITMAP_SYSTEM_INODE,
 	USER_QUOTA_SYSTEM_INODE,
_

Patches currently in -mm which might be from junxiao.bi@oracle.com are



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] ocfs2-fix-panic-on-nfs-server-over-ocfs2.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (13 preceding siblings ...)
  2020-06-27  3:33 ` [merged] ocfs2-load-global_inode_alloc.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] ocfs2-fix-value-of-ocfs2_invalid_slot.patch " Andrew Morton
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark, mm-commits,
	piaojun, stable


The patch titled
     Subject: ocfs2: fix panic on nfs server over ocfs2
has been removed from the -mm tree.  Its filename was
     ocfs2-fix-panic-on-nfs-server-over-ocfs2.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: fix panic on nfs server over ocfs2

The following kernel panic was captured when running nfs server over
ocfs2, at that time ocfs2_test_inode_bit() was checking whether one inode
locating at "blkno" 5 was valid, that is ocfs2 root inode, its
"suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from
//global_inode_alloc, but here it wrongly assumed that it was got from per
slot inode alloctor which would cause array overflow and trigger kernel
panic.

[430033.469151] BUG: unable to handle kernel paging request at
0000000000001088
[430033.469367] IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
[430033.469567] PGD 1e06ba067 PUD 1e9e7d067 PMD 0
[430033.469769] Oops: 0002 [#1] SMP
[430033.469975] Modules linked in: tun nfsd lockd grace nfs_acl auth_rpcgss
ocfs2 xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs
xen_privcmd ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
ocfs2_stackglue configfs bnx2fc fcoe libfcoe libfc sunrpc bridge 8021q mrp
garp stp llc bonding dm_round_robin scsi_dh_emc dm_multipath iTCO_wdt
iTCO_vendor_support pcspkr sb_edac edac_core i2c_i801 i2c_core lpc_ich
mfd_core sg ext4 jbd2 mbcache2 sd_mod ahci libahci lpfc scsi_transport_fc
be2net vxlan udp_tunnel ip6_udp_tunnel mpt3sas scsi_transport_sas raid_class
crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3
mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
dm_mirror dm_region_hash dm_log dm_mod
[430033.472350] CPU: 6 PID: 24873 Comm: nfsd Not tainted
4.1.12-124.36.1.el6uek.x86_64 #2
[430033.472719] Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87
02/02/2018
[430033.472910] task: ffff88005ae98000 ti: ffff88005ae94000 task.ti:
ffff88005ae94000
[430033.473277] RIP: e030:[<ffffffff816f6898>]  [<ffffffff816f6898>]
_raw_spin_lock+0x18/0xf0
[430033.473655] RSP: e02b:ffff88005ae97908  EFLAGS: 00010206
[430033.473850] RAX: ffff88005ae98000 RBX: 0000000000001088 RCX:
0000000000000000
[430033.474205] RDX: 0000000000020000 RSI: 0000000000000009 RDI:
0000000000001088
[430033.474574] RBP: ffff88005ae97928 R08: 0000000000000000 R09:
ffff880212878e00
[430033.474938] R10: 0000000000007ff0 R11: 0000000000000000 R12:
0000000000001088
[430033.475324] R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15:
000000000000ffff
[430033.475721] FS:  0000000000000000(0000) GS:ffff880218180000(0000)
knlGS:ffff880218180000
[430033.476199] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[430033.476390] CR2: 0000000000001088 CR3: 00000002033d0000 CR4:
0000000000042660
[430033.476760] Stack:
[430033.476942]  0000000000001000 0000000000001088 ffff8800063c0aa8
ffff8800650c27d0
[430033.477329]  ffff88005ae97948 ffffffff8122a3de 0000000000000009
ffff8800063c0000
[430033.477718]  ffff88005ae979e8 ffffffffc0714e43 ffff88005ae97968
ffff88019de8f958
[430033.478104] Call Trace:
[430033.478286]  [<ffffffff8122a3de>] igrab+0x1e/0x60
[430033.478494]  [<ffffffffc0714e43>] ocfs2_get_system_file_inode+0x63/0x3a0
[ocfs2]
[430033.478870]  [<ffffffffc06a87df>] ? ocfs2_read_blocks_sync+0x13f/0x3c0
[ocfs2]
[430033.479267]  [<ffffffffc06ff2d8>] ocfs2_test_inode_bit+0x328/0xa00
[ocfs2]
[430033.479498]  [<ffffffffc06bef5a>] ocfs2_get_parent+0xba/0x3e0 [ocfs2]
[430033.479730]  [<ffffffff8129b305>] reconnect_path+0xb5/0x300
[430033.479933]  [<ffffffff8129b646>] exportfs_decode_fh+0xf6/0x2b0
[430033.480124]  [<ffffffffc0814af0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
[430033.480294]  [<ffffffffc081a682>] ? exp_find+0xe2/0x190 [nfsd]
[430033.480461]  [<ffffffff810e5a7e>] ? irq_get_irq_data+0xe/0x10
[430033.480627]  [<ffffffff810ea1a7>] ? __call_rcu_nocb_enqueue+0xd7/0xe0
[430033.480794]  [<ffffffff810eb9e8>] ? __call_rcu+0xe8/0x360
[430033.480959]  [<ffffffffc0815860>] fh_verify+0x350/0x660 [nfsd]
[430033.481134]  [<ffffffffc0535076>] ? cache_check+0x56/0x3a0 [sunrpc]
[430033.481317]  [<ffffffffc0823a4d>] nfsd4_putfh+0x4d/0x60 [nfsd]
[430033.481505]  [<ffffffffc0826003>] nfsd4_proc_compound+0x3d3/0x6f0 [nfsd]
[430033.481730]  [<ffffffffc0811f60>] nfsd_dispatch+0xe0/0x290 [nfsd]
[430033.481950]  [<ffffffffc052b752>] ? svc_tcp_adjust_wspace+0x12/0x30
[sunrpc]
[430033.482152]  [<ffffffffc052a512>] svc_process_common+0x412/0x6a0 [sunrpc]
[430033.482351]  [<ffffffffc052a8c3>] svc_process+0x123/0x210 [sunrpc]
[430033.482550]  [<ffffffffc081190f>] nfsd+0xff/0x170 [nfsd]
[430033.482744]  [<ffffffffc0811810>] ? nfsd_destroy+0x80/0x80 [nfsd]
[430033.482943]  [<ffffffff810a7aeb>] kthread+0xcb/0xf0
[430033.483151]  [<ffffffff816f10ea>] ? __schedule+0x24a/0x810
[430033.483354]  [<ffffffff816f10ea>] ? __schedule+0x24a/0x810
[430033.483553]  [<ffffffff810a7a20>] ? kthread_create_on_node+0x180/0x180
[430033.483777]  [<ffffffff816f72a1>] ret_from_fork+0x61/0x90
[430033.483976]  [<ffffffff810a7a20>] ? kthread_create_on_node+0x180/0x180
[430033.484191] Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40
00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00
<f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6
[430033.485174] RIP  [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
[430033.485370]  RSP <ffff88005ae97908>
[430033.485566] CR2: 0000000000001088
[430033.486223] ---[ end trace 7264463cd1aac8f9 ]---
[430033.666368] Kernel panic - not syncing: Fatal exception

Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/suballoc.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/fs/ocfs2/suballoc.c~ocfs2-fix-panic-on-nfs-server-over-ocfs2
+++ a/fs/ocfs2/suballoc.c
@@ -2825,9 +2825,12 @@ int ocfs2_test_inode_bit(struct ocfs2_su
 		goto bail;
 	}
 
-	inode_alloc_inode =
-		ocfs2_get_system_file_inode(osb, INODE_ALLOC_SYSTEM_INODE,
-					    suballoc_slot);
+	if (suballoc_slot == (u16)OCFS2_INVALID_SLOT)
+		inode_alloc_inode = ocfs2_get_system_file_inode(osb,
+			GLOBAL_INODE_ALLOC_SYSTEM_INODE, suballoc_slot);
+	else
+		inode_alloc_inode = ocfs2_get_system_file_inode(osb,
+			INODE_ALLOC_SYSTEM_INODE, suballoc_slot);
 	if (!inode_alloc_inode) {
 		/* the error code could be inaccurate, but we are not able to
 		 * get the correct one. */
_

Patches currently in -mm which might be from junxiao.bi@oracle.com are



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] ocfs2-fix-value-of-ocfs2_invalid_slot.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (14 preceding siblings ...)
  2020-06-27  3:33 ` [merged] ocfs2-fix-panic-on-nfs-server-over-ocfs2.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab.patch " Andrew Morton
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, mark, mm-commits,
	piaojun, stable


The patch titled
     Subject: ocfs2: fix value of OCFS2_INVALID_SLOT
has been removed from the -mm tree.  Its filename was
     ocfs2-fix-value-of-ocfs2_invalid_slot.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: fix value of OCFS2_INVALID_SLOT

In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2
implementation, slot number is 32 bits.  Usually this will not cause any
issue, because slot number is converted from u16 to u32, but
OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from
disk was obtained, its value was (u16)-1, and it was converted to u32. 
Then the following checking in get_local_system_inode will be always
skipped:

 static struct inode **get_local_system_inode(struct ocfs2_super *osb,
                                               int type,
                                               u32 slot)
 {
 	BUG_ON(slot == OCFS2_INVALID_SLOT);
	...
 }

Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/ocfs2_fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ocfs2/ocfs2_fs.h~ocfs2-fix-value-of-ocfs2_invalid_slot
+++ a/fs/ocfs2/ocfs2_fs.h
@@ -290,7 +290,7 @@
 #define OCFS2_MAX_SLOTS			255
 
 /* Slot map indicator for an empty slot */
-#define OCFS2_INVALID_SLOT		-1
+#define OCFS2_INVALID_SLOT		((u16)-1)
 
 #define OCFS2_VOL_UUID_LEN		16
 #define OCFS2_MAX_VOL_LABEL_LEN		64
_

Patches currently in -mm which might be from junxiao.bi@oracle.com are



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (15 preceding siblings ...)
  2020-06-27  3:33 ` [merged] ocfs2-fix-value-of-ocfs2_invalid_slot.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] mm-slab-use-memzero_explicit-in-kzfree.patch " Andrew Morton
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: cl, guro, hannes, iamjoonsoo.kim, longman, mhocko, mm-commits,
	penberg, rientjes, shakeelb, stable, vdavydov.dev


The patch titled
     Subject: mm, slab: fix sign conversion problem in memcg_uncharge_slab()
has been removed from the -mm tree.  Its filename was
     mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Waiman Long <longman@redhat.com>
Subject: mm, slab: fix sign conversion problem in memcg_uncharge_slab()

It was found that running the LTP test on a PowerPC system could produce
erroneous values in /proc/meminfo, like:

  MemTotal:       531915072 kB
  MemFree:        507962176 kB
  MemAvailable:   1100020596352 kB

Using bisection, the problem is tracked down to commit 9c315e4d7d8c ("mm:
memcg/slab: cache page number in memcg_(un)charge_slab()").

In memcg_uncharge_slab() with a "int order" argument:

  unsigned int nr_pages = 1 << order;
    :
  mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);

The mod_lruvec_state() function will eventually call the
__mod_zone_page_state() which accepts a long argument.  Depending on the
compiler and how inlining is done, "-nr_pages" may be treated as a
negative number or a very large positive number.  Apparently, it was
treated as a large positive number in that PowerPC system leading to
incorrect stat counts.  This problem hasn't been seen in x86-64 yet,
perhaps the gcc compiler there has some slight difference in behavior.

It is fixed by making nr_pages a signed value.  For consistency, a similar
change is applied to memcg_charge_slab() as well.

Link: http://lkml.kernel.org/r/20200620184719.10994-1-longman@redhat.com
Fixes: 9c315e4d7d8c ("mm: memcg/slab: cache page number in memcg_(un)charge_slab()").
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/slab.h~mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab
+++ a/mm/slab.h
@@ -348,7 +348,7 @@ static __always_inline int memcg_charge_
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
-	unsigned int nr_pages = 1 << order;
+	int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	int ret;
@@ -388,7 +388,7 @@ out:
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
-	unsigned int nr_pages = 1 << order;
+	int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
_

Patches currently in -mm which might be from longman@redhat.com are

mm-treewide-rename-kzfree-to-kfree_sensitive.patch
sched-mm-optimize-current_gfp_context.patch


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-slab-use-memzero_explicit-in-kzfree.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (16 preceding siblings ...)
  2020-06-27  3:33 ` [merged] mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] mm-fix-swap-cache-node-allocation-mask.patch " Andrew Morton
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: dan.carpenter, dhowells, hannes, jarkko.sakkinen, Jason, jmorris,
	joe, longman, mhocko, mm-commits, rientjes, serge, stable, willy


The patch titled
     Subject: mm/slab: use memzero_explicit() in kzfree()
has been removed from the -mm tree.  Its filename was
     mm-slab-use-memzero_explicit-in-kzfree.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Waiman Long <longman@redhat.com>
Subject: mm/slab: use memzero_explicit() in kzfree()

The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back to
the pool.  Memset() is currently used for buffer clearing.  However
unlikely, there is still a non-zero probability that the compiler may
choose to optimize away the memory clearing especially if LTO is being
used in the future.  To make sure that this optimization will never
happen, memzero_explicit(), which is introduced in v3.18, is now used in
kzfree() to future-proof it.

Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab_common.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slab_common.c~mm-slab-use-memzero_explicit-in-kzfree
+++ a/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
 	if (unlikely(ZERO_OR_NULL_PTR(mem)))
 		return;
 	ks = ksize(mem);
-	memset(mem, 0, ks);
+	memzero_explicit(mem, ks);
 	kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
_

Patches currently in -mm which might be from longman@redhat.com are

mm-treewide-rename-kzfree-to-kfree_sensitive.patch
sched-mm-optimize-current_gfp_context.patch


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-fix-swap-cache-node-allocation-mask.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (17 preceding siblings ...)
  2020-06-27  3:33 ` [merged] mm-slab-use-memzero_explicit-in-kzfree.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch " Andrew Morton
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: hughd, lists, mm-commits, stable, vbabka, willy


The patch titled
     Subject: mm: fix swap cache node allocation mask
has been removed from the -mm tree.  Its filename was
     mm-fix-swap-cache-node-allocation-mask.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Hugh Dickins <hughd@google.com>
Subject: mm: fix swap cache node allocation mask

https://bugzilla.kernel.org/show_bug.cgi?id=208085 reports that a slightly
overcommitted load, testing swap and zram along with i915, splats and
keeps on splatting, when it had better fail less noisily:

gnome-shell: page allocation failure: order:0,
mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
nodemask=(null),cpuset=/,mems_allowed=0
CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
Call Trace:
dump_stack+0x64/0x88
warn_alloc.cold+0x75/0xd9
__alloc_pages_slowpath.constprop.0+0xcfa/0xd30
__alloc_pages_nodemask+0x2df/0x320
alloc_slab_page+0x195/0x310
allocate_slab+0x3c5/0x440
___slab_alloc+0x40c/0x5f0
__slab_alloc+0x1c/0x30
kmem_cache_alloc+0x20e/0x220
xas_nomem+0x28/0x70
add_to_swap_cache+0x321/0x400
__read_swap_cache_async+0x105/0x240
swap_cluster_readahead+0x22c/0x2e0
shmem_swapin+0x8e/0xc0
shmem_swapin_page+0x196/0x740
shmem_getpage_gfp+0x3a2/0xa60
shmem_read_mapping_page_gfp+0x32/0x60
shmem_get_pages+0x155/0x5e0 [i915]
__i915_gem_object_get_pages+0x68/0xa0 [i915]
i915_vma_pin+0x3fe/0x6c0 [i915]
eb_add_vma+0x10b/0x2c0 [i915]
i915_gem_do_execbuffer+0x704/0x3430 [i915]
i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
drm_ioctl_kernel+0x86/0xd0 [drm]
drm_ioctl+0x206/0x390 [drm]
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported on 5.7, but it goes back really to 3.1: when
shmem_read_mapping_page_gfp() was implemented for use by i915, and
allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
__read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Chris Murphy <lists@colorremedies.com>
Analyzed-by: Vlastimil Babka <vbabka@suse.cz>
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Chris Murphy <lists@colorremedies.com>
Cc: <stable@vger.kernel.org>	[3.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap_state.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/swap_state.c~mm-fix-swap-cache-node-allocation-mask
+++ a/mm/swap_state.c
@@ -21,7 +21,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
-
+#include "internal.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -429,7 +429,7 @@ struct page *__read_swap_cache_async(swp
 	__SetPageSwapBacked(page);
 
 	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
+	if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK)) {
 		put_swap_page(page, entry);
 		goto fail_unlock;
 	}
_

Patches currently in -mm which might be from hughd@google.com are

mm-vmstat-add-events-for-pmd-based-thp-migration-without-split-fix.patch


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (18 preceding siblings ...)
  2020-06-27  3:33 ` [merged] mm-fix-swap-cache-node-allocation-mask.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:33 ` [merged] mm-memcontrol-fix-do-not-put-the-css-reference.patch " Andrew Morton
  2020-06-27  3:34 ` [merged] mm-fix-false-softlockup-during-pfn-range-removal.patch " Andrew Morton
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: chris, guro, hannes, mhocko, mm-commits, stable, tj


The patch titled
     Subject: mm: memcontrol: handle div0 crash race condition in memory.low
has been removed from the -mm tree.  Its filename was
     mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: handle div0 crash race condition in memory.low

Tejun reports seeing rare div0 crashes in memory.low stress testing:

[37228.504582] RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
[37228.505059] Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
[37228.506254] RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
[37228.506769] RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
[37228.507319] RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
[37228.507869] RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
[37228.508376] R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
[37228.508881] R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
[37228.509397] FS:  0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
[37228.509917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37228.510442] CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
[37228.511076] Call Trace:
[37228.511561]  shrink_node+0x1e5/0x6c0
[37228.512044]  balance_pgdat+0x32d/0x5f0
[37228.512521]  kswapd+0x1d7/0x3d0
[37228.513346]  ? wait_woken+0x80/0x80
[37228.514170]  kthread+0x11c/0x160
[37228.514983]  ? balance_pgdat+0x5f0/0x5f0
[37228.515797]  ? kthread_park+0x90/0x90
[37228.516593]  ret_from_fork+0x1f/0x30

This happens when parent_usage == siblings_protected.  We check that usage
is bigger than protected, which should imply parent_usage being bigger
than siblings_protected.  However, we don't read (or even update) these
values atomically, and they can be out of sync as the memory state changes
under us.  A bit of fluctuation around the target protection isn't a big
deal, but we need to handle the div0 case.

Check the parent state explicitly to make sure we have a reasonable
positive value for the divisor.

Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-handle-div0-crash-race-condition-in-memorylow
+++ a/mm/memcontrol.c
@@ -6360,11 +6360,16 @@ static unsigned long effective_protectio
 	 * We're using unprotected memory for the weight so that if
 	 * some cgroups DO claim explicit protection, we don't protect
 	 * the same bytes twice.
+	 *
+	 * Check both usage and parent_usage against the respective
+	 * protected values. One should imply the other, but they
+	 * aren't read atomically - make sure the division is sane.
 	 */
 	if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT))
 		return ep;
-
-	if (parent_effective > siblings_protected && usage > protected) {
+	if (parent_effective > siblings_protected &&
+	    parent_usage > siblings_protected &&
+	    usage > protected) {
 		unsigned long unclaimed;
 
 		unclaimed = parent_effective - siblings_protected;
_

Patches currently in -mm which might be from hannes@cmpxchg.org are

mm-memcontrol-decouple-reference-counting-from-page-accounting.patch


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-memcontrol-fix-do-not-put-the-css-reference.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (19 preceding siblings ...)
  2020-06-27  3:33 ` [merged] mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch " Andrew Morton
@ 2020-06-27  3:33 ` Andrew Morton
  2020-06-27  3:34 ` [merged] mm-fix-false-softlockup-during-pfn-range-removal.patch " Andrew Morton
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:33 UTC (permalink / raw)
  To: cai, guro, hannes, mhocko, mm-commits, songmuchun, stable, vdavydov.dev


The patch titled
     Subject: mm/memcontrol.c: add missed css_put()
has been removed from the -mm tree.  Its filename was
     mm-memcontrol-fix-do-not-put-the-css-reference.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Muchun Song <songmuchun@bytedance.com>
Subject: mm/memcontrol.c: add missed css_put()

We should put the css reference when memory allocation failed.

Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com
Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/memcontrol.c~mm-memcontrol-fix-do-not-put-the-css-reference
+++ a/mm/memcontrol.c
@@ -2772,8 +2772,10 @@ static void memcg_schedule_kmem_cache_cr
 		return;
 
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
-	if (!cw)
+	if (!cw) {
+		css_put(&memcg->css);
 		return;
+	}
 
 	cw->memcg = memcg;
 	cw->cachep = cachep;
_

Patches currently in -mm which might be from songmuchun@bytedance.com are



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [merged] mm-fix-false-softlockup-during-pfn-range-removal.patch removed from -mm tree
       [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
                   ` (20 preceding siblings ...)
  2020-06-27  3:33 ` [merged] mm-memcontrol-fix-do-not-put-the-css-reference.patch " Andrew Morton
@ 2020-06-27  3:34 ` Andrew Morton
  21 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-06-27  3:34 UTC (permalink / raw)
  To: ben.widawsky, dan.j.williams, david, mm-commits, stable,
	steve.scargall, vishal.l.verma


The patch titled
     Subject: mm/memory_hotplug.c: fix false softlockup during pfn range removal
has been removed from the -mm tree.  Its filename was
     mm-fix-false-softlockup-during-pfn-range-removal.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Ben Widawsky <ben.widawsky@intel.com>
Subject: mm/memory_hotplug.c: fix false softlockup during pfn range removal

When working with very large nodes, poisoning the struct pages (for which
there will be very many) can take a very long time.  If the system is
using voluntary preemptions, the software watchdog will not be able to
detect forward progress.  This patch addresses this issue by offering to
give up time like __remove_pages() does.  This behavior was introduced in
v5.6 with: commit d33695b16a9f ("mm/memory_hotplug: poison memmap in
remove_pfn_range_from_zone()")

Alternately, init_page_poison could do this cond_resched(), but it seems
to me that the caller of init_page_poison() is what actually knows whether
or not it should relax its own priority.

Based on Dan's notes, I think this is perfectly safe: commit f931ab479dd2
("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")

Aside from fixing the lockup, it is also a friendlier thing to do on lower
core systems that might wipe out large chunks of hotplug memory (probably
not a very common case).

Fixes this kind of splat:
[  352.142079] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
[  352.150067] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_umad vfat fat kmem intel_rapl_msr intel_rapl_common rpcrdma sunrpc rdma_ucm ib_iser isst_if_common rdma_cm skx_edac iw_cm ib_cm x86_pkg_temp_thermal libiscsi intel_powerclamp scsi_transport_iscsi coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i40iw intel_cstate iTCO_wdt ib_uverbs iTCO_vendor_support device_dax ib_core joydev intel_uncore i2c_i801 lpc_ich ipmi_ssif ioatdma dca wmi ipmi_si ipmi_devintf ipmi_msghandler dax_pmem
[  352.150123]  dax_pmem_core acpi_pad acpi_power_meter drm ip_tables xfs libcrc32c nd_pmem nd_btt i40e e1000e crc32c_intel nfit
[  352.260774] irq event stamp: 138450
[  352.264692] hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
[  352.275134] hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
[  352.285662] softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
[  352.295233] softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
[  352.304210] CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
[  352.313283] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
[  352.324308] RIP: 0010:memset_erms+0x9/0x10
[  352.328905] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
[  352.349953] RSP: 0018:ffffc90021b5fd50 EFLAGS: 00010202 ORIG_RAX: ffffffffffffff13
[  352.358450] RAX: 00000000000000ff RBX: ffff88983ffd5000 RCX: 0000000065df0e40
[  352.366457] RDX: 00000003a0000000 RSI: 00000000ffffffff RDI: ffffea039b20f1c0
[  352.374465] RBP: ffff88983ffd6c00 R08: 0000000000000000 R09: ffffea0061000000
[  352.382473] R10: 0000000000000001 R11: 0000000000000000 R12: ffffea006f808000
[  352.390480] R13: 0000000001840000 R14: 000000000e800000 R15: ffff8997d7b764e0
[  352.398487] FS:  00007f724ef81780(0000) GS:ffff8997df100000(0000) knlGS:0000000000000000
[  352.407562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.414011] CR2: 00007ffcd4070768 CR3: 000001178c722002 CR4: 00000000003606e0
[  352.422056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  352.430092] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  352.438115] Call Trace:
[  352.440865]  remove_pfn_range_from_zone+0x3a/0x380
[  352.446244]  ? cpumask_next+0x17/0x20
[  352.450361]  memunmap_pages+0x17f/0x280
[  352.454670]  release_nodes+0x22a/0x260
[  352.458888]  __device_release_driver+0x172/0x220
[  352.464070]  device_driver_detach+0x3e/0xa0
[  352.468753]  unbind_store+0x113/0x130
[  352.472868]  kernfs_fop_write+0xdc/0x1c0
[  352.477273]  vfs_write+0xde/0x1d0
[  352.482218]  ksys_write+0x58/0xd0
[  352.487207]  do_syscall_64+0x5a/0x120
[  352.492529]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  352.499446] RIP: 0033:0x7f724f40b5e7
[  352.504673] Code: Bad RIP value.
[  352.509484] RSP: 002b:00007ffcd40738f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  352.519213] RAX: ffffffffffffffda RBX: 00007f724ef816a8 RCX: 00007f724f40b5e7
[  352.528410] RDX: 0000000000000007 RSI: 00005617d7cd1277 RDI: 0000000000000003
[  352.537573] RBP: 0000000000000003 R08: 00000000ffffffff R09: 00007ffcd40737d0
[  352.546764] R10: 0000000000000000 R11: 0000000000000246 R12: 00005617d7cd1277
[  352.555929] R13: 0000000000000000 R14: 0000000000000007 R15: 00005617d7cd1230
[  370.353742] Built 2 zonelists, mobility grouping on.  Total pages: 49050381
[  370.373317] Policy zone: Normal
[  374.948164] Built 3 zonelists, mobility grouping on.  Total pages: 49312525
[  375.017496] Policy zone: Normal

David said: "It really only is an issue for devmem.  Ordinary
hotplugged system memory is not affected (onlined/offlined in memory
block granularity)."

Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
Fixes: commit d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
Reported-by: Ben Widawsky <ben.widawsky@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

--- a/mm/memory_hotplug.c~mm-fix-false-softlockup-during-pfn-range-removal
+++ a/mm/memory_hotplug.c
@@ -471,11 +471,20 @@ void __ref remove_pfn_range_from_zone(st
 				      unsigned long start_pfn,
 				      unsigned long nr_pages)
 {
+	const unsigned long end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
-	unsigned long flags;
+	unsigned long pfn, cur_nr_pages, flags;
 
 	/* Poison struct pages because they are now uninitialized again. */
-	page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
+	for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
+		cond_resched();
+
+		/* Select all remaining pages up to the next section boundary */
+		cur_nr_pages =
+			min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
+		page_init_poison(pfn_to_page(pfn),
+				 sizeof(struct page) * cur_nr_pages);
+	}
 
 #ifdef CONFIG_ZONE_DEVICE
 	/*
_

Patches currently in -mm which might be from ben.widawsky@intel.com are



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-06-27  3:34 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200625202807.b630829d6fa55388148bee7d@linux-foundation.org>
2020-06-26  3:29 ` [patch 03/32] mm, compaction: make capture control handling safe wrt interrupts Andrew Morton
2020-06-26  3:29 ` [patch 05/32] ocfs2: avoid inode removal while nfsd is accessing it Andrew Morton
2020-06-26  3:29 ` [patch 06/32] ocfs2: load global_inode_alloc Andrew Morton
2020-06-26  3:29 ` [patch 07/32] ocfs2: fix panic on nfs server over ocfs2 Andrew Morton
2020-06-26  3:29 ` [patch 08/32] ocfs2: fix value of OCFS2_INVALID_SLOT Andrew Morton
2020-06-26  3:29 ` [patch 11/32] mm, slab: fix sign conversion problem in memcg_uncharge_slab() Andrew Morton
2020-06-26  3:29 ` [patch 12/32] mm/slab: use memzero_explicit() in kzfree() Andrew Morton
2020-06-26  3:29 ` [patch 14/32] mm: fix swap cache node allocation mask Andrew Morton
2020-06-26  3:30 ` [patch 20/32] mm: memcontrol: handle div0 crash race condition in memory.low Andrew Morton
2020-06-26  3:30 ` [patch 21/32] mm/memcontrol.c: add missed css_put() Andrew Morton
2020-06-26  3:30 ` [patch 31/32] mm/memory_hotplug.c: fix false softlockup during pfn range removal Andrew Morton
2020-06-27  3:33 ` [merged] mm-compaction-make-capture-control-handling-safe-wrt-interrupts.patch removed from -mm tree Andrew Morton
2020-06-27  3:33 ` [merged] ocfs2-avoid-inode-removed-while-nfsd-access-it.patch " Andrew Morton
2020-06-27  3:33 ` [merged] ocfs2-load-global_inode_alloc.patch " Andrew Morton
2020-06-27  3:33 ` [merged] ocfs2-fix-panic-on-nfs-server-over-ocfs2.patch " Andrew Morton
2020-06-27  3:33 ` [merged] ocfs2-fix-value-of-ocfs2_invalid_slot.patch " Andrew Morton
2020-06-27  3:33 ` [merged] mm-slab-fix-sign-conversion-problem-in-memcg_uncharge_slab.patch " Andrew Morton
2020-06-27  3:33 ` [merged] mm-slab-use-memzero_explicit-in-kzfree.patch " Andrew Morton
2020-06-27  3:33 ` [merged] mm-fix-swap-cache-node-allocation-mask.patch " Andrew Morton
2020-06-27  3:33 ` [merged] mm-memcontrol-handle-div0-crash-race-condition-in-memorylow.patch " Andrew Morton
2020-06-27  3:33 ` [merged] mm-memcontrol-fix-do-not-put-the-css-reference.patch " Andrew Morton
2020-06-27  3:34 ` [merged] mm-fix-false-softlockup-during-pfn-range-removal.patch " Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).