mm-commits.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* incoming
@ 2020-11-14  6:51 Andrew Morton
  2020-11-14  6:51 ` [patch 01/14] mm/compaction: count pages and stop correctly during page isolation Andrew Morton
                   ` (13 more replies)
  0 siblings, 14 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm, mm-commits

14 patches, based on 9e6a39eae450b81c8b2c8cbbfbdf8218e9b40c81.

Subsystems affected by this patch series:

  mm/migration
  mm/vmscan
  mailmap
  mm/slub
  mm/gup
  kbuild
  reboot
  kernel/watchdog
  mm/memcg
  mm/hugetlbfs
  panic
  ocfs2

Subsystem: mm/migration

    Zi Yan <ziy@nvidia.com>:
      mm/compaction: count pages and stop correctly during page isolation
      mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate

Subsystem: mm/vmscan

    Nicholas Piggin <npiggin@gmail.com>:
      mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit

Subsystem: mailmap

    Dmitry Baryshkov <dbaryshkov@gmail.com>:
      mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov

Subsystem: mm/slub

    Laurent Dufour <ldufour@linux.ibm.com>:
      mm/slub: fix panic in slab_alloc_node()

Subsystem: mm/gup

    Jason Gunthorpe <jgg@nvidia.com>:
      mm/gup: use unpin_user_pages() in __gup_longterm_locked()

Subsystem: kbuild

    Arvind Sankar <nivedita@alum.mit.edu>:
      compiler.h: fix barrier_data() on clang

Subsystem: reboot

    Matteo Croce <mcroce@microsoft.com>:
    Patch series "fix parsing of reboot= cmdline", v3:
      Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
      reboot: fix overflow parsing reboot cpu number

Subsystem: kernel/watchdog

    Santosh Sivaraj <santosh@fossix.org>:
      kernel/watchdog: fix watchdog_allowed_mask not used warning

Subsystem: mm/memcg

    Muchun Song <songmuchun@bytedance.com>:
      mm: memcontrol: fix missing wakeup polling thread

Subsystem: mm/hugetlbfs

    Mike Kravetz <mike.kravetz@oracle.com>:
      hugetlbfs: fix anon huge page migration race

Subsystem: panic

    Christophe Leroy <christophe.leroy@csgroup.eu>:
      panic: don't dump stack twice on warn

Subsystem: ocfs2

    Wengang Wang <wen.gang.wang@oracle.com>:
      ocfs2: initialize ip_next_orphan

 .mailmap                       |    5 +-
 fs/ocfs2/super.c               |    1 
 include/asm-generic/barrier.h  |    1 
 include/linux/compiler-clang.h |    6 --
 include/linux/compiler-gcc.h   |   19 --------
 include/linux/compiler.h       |   18 +++++++-
 include/linux/memcontrol.h     |   11 ++++-
 kernel/panic.c                 |    3 -
 kernel/reboot.c                |   28 ++++++------
 kernel/watchdog.c              |    4 -
 mm/compaction.c                |   12 +++--
 mm/gup.c                       |   14 ++++--
 mm/hugetlb.c                   |   90 ++---------------------------------------
 mm/memory-failure.c            |   36 +++++++---------
 mm/migrate.c                   |   46 +++++++++++---------
 mm/rmap.c                      |    5 --
 mm/slub.c                      |    2 
 mm/vmscan.c                    |    5 +-
 18 files changed, 119 insertions(+), 187 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 01/14] mm/compaction: count pages and stop correctly during page isolation
  2020-11-14  6:51 incoming Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14  6:51 ` [patch 02/14] mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate Andrew Morton
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: akpm, linux-mm, mgorman, mhocko, mm-commits, riel, shy828301,
	stable, torvalds, vbabka, ziy

From: Zi Yan <ziy@nvidia.com>
Subject: mm/compaction: count pages and stop correctly during page isolation

In isolate_migratepages_block, when cc->alloc_contig is true, we are able
to isolate compound pages, nr_migratepages and nr_isolated did not count
compound pages correctly, causing us to isolate more pages than we
thought.  Count compound pages as the number of base pages they contain. 
Otherwise, we might be trapped in too_many_isolated while loop, since the
actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384, where
COMPACT_CLUSTER_MAX is 32, since we stop isolation after
cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.

In addition, after we fix the issue above, cc->nr_migratepages could never
be equal to COMPACT_CLUSTER_MAX if compound pages are isolated, thus page
isolation could not stop as we intended.  Change the isolation stop
condition to >=.

The issue can be triggered as follows: In a system with 16GB memory and an
8GB CMA region reserved by hugetlb_cma, if we first allocate 10GB THPs and
mlock them (so some THPs are allocated in the CMA region and mlocked),
reserving 6 1GB hugetlb pages via
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get stuck
(looping in too_many_isolated function) until we kill either task.  With
the patch applied, oom will kill the application with 10GB THPs and let
hugetlb page reservation finish.

[ziy@nvidia.com: v3]
  Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
Fixes: 1da2f328fa64 ("cmm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/compaction.c~mm-compaction-count-pages-and-stop-correctly-during-page-isolation
+++ a/mm/compaction.c
@@ -1012,8 +1012,8 @@ isolate_migratepages_block(struct compac
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
-		cc->nr_migratepages++;
-		nr_isolated++;
+		cc->nr_migratepages += compound_nr(page);
+		nr_isolated += compound_nr(page);
 
 		/*
 		 * Avoid isolating too much unless this block is being
@@ -1021,7 +1021,7 @@ isolate_success:
 		 * or a lock is contended. For contention, isolate quickly to
 		 * potentially remove one source of contention.
 		 */
-		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX &&
+		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
 		    !cc->rescan && !cc->contended) {
 			++low_pfn;
 			break;
@@ -1132,7 +1132,7 @@ isolate_migratepages_range(struct compac
 		if (!pfn)
 			break;
 
-		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
 			break;
 	}
 
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 02/14] mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
  2020-11-14  6:51 incoming Andrew Morton
  2020-11-14  6:51 ` [patch 01/14] mm/compaction: count pages and stop correctly during page isolation Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14  6:51 ` [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit Andrew Morton
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: akpm, linux-mm, mgorman, mhocko, mm-commits, riel, shy828301,
	stable, torvalds, vbabka, ziy

From: Zi Yan <ziy@nvidia.com>
Subject: mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate

In isolate_migratepages_block, if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we have without
wasting time on isolating.

In theory it's possible that multiple parallel compactions will cause
too_many_isolated() to become true even if each has isolated less than
COMPACT_CLUSTER_MAX, and loop forever in the while loop.  Bailing
immediately prevents that.

[vbabka@suse.cz: changelog addition]
Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
Fixes: 1da2f328fa64 (“mm,thp,compaction,cma: allow THP migration for CMA allocations”)
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/compaction.c~mm-compaction-stop-isolation-if-too-many-pages-are-isolated-and-we-have-pages-to-migrate
+++ a/mm/compaction.c
@@ -817,6 +817,10 @@ isolate_migratepages_block(struct compac
 	 * delay for some time until fewer pages are isolated
 	 */
 	while (unlikely(too_many_isolated(pgdat))) {
+		/* stop isolation if there are still pages not migrated */
+		if (cc->nr_migratepages)
+			return 0;
+
 		/* async migration should just abort */
 		if (cc->mode == MIGRATE_ASYNC)
 			return 0;
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
  2020-11-14  6:51 incoming Andrew Morton
  2020-11-14  6:51 ` [patch 01/14] mm/compaction: count pages and stop correctly during page isolation Andrew Morton
  2020-11-14  6:51 ` [patch 02/14] mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14 21:39   ` Linus Torvalds
  2020-11-14  6:51 ` [patch 04/14] mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov Andrew Morton
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: a.sahrawat, akpm, linux-mm, maninder1.s, mgorman, mhocko,
	mm-commits, npiggin, stable, torvalds, v.narang, vbabka

From: Nicholas Piggin <npiggin@gmail.com>
Subject: mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit

Previously the negated unsigned long would be cast back to signed long
which would have the correct negative value.  After commit 730ec8c01a2b
("mm/vmscan.c: change prototype for shrink_page_list"), the large unsigned
int converts to a large positive signed long.

Symptoms include CMA allocations hanging forever holding the cma_mutex due
to alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".

[akpm@linux-foundation.org: fix -stat.nr_lazyfree_fail as well, per Michal]
Link: https://lkml.kernel.org/r/20201029032320.1448441-1-npiggin@gmail.com
Fixes: 730ec8c01a2b ("mm/vmscan.c: change prototype for shrink_page_list")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vaneet Narang <v.narang@samsung.com>
Cc: Maninder Singh <maninder1.s@samsung.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-fix-nr_isolated_file-corruption-on-64-bit
+++ a/mm/vmscan.c
@@ -1516,7 +1516,8 @@ unsigned int reclaim_clean_pages_from_li
 	nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_IGNORE_ACCESS, &stat, true);
 	list_splice(&clean_pages, page_list);
-	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -nr_reclaimed);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE,
+			    -(long)nr_reclaimed);
 	/*
 	 * Since lazyfree pages are isolated from file LRU from the beginning,
 	 * they will rotate back to anonymous LRU in the end if it failed to
@@ -1526,7 +1527,7 @@ unsigned int reclaim_clean_pages_from_li
 	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON,
 			    stat.nr_lazyfree_fail);
 	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE,
-			    -stat.nr_lazyfree_fail);
+			    -(long)stat.nr_lazyfree_fail);
 	return nr_reclaimed;
 }
 
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 04/14] mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov
  2020-11-14  6:51 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2020-11-14  6:51 ` [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14  6:51 ` [patch 05/14] mm/slub: fix panic in slab_alloc_node() Andrew Morton
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: akpm, dbaryshkov, linux-mm, mm-commits, torvalds

From: Dmitry Baryshkov <dbaryshkov@gmail.com>
Subject: mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov

Change back surname to new (old) one.  Dmitry Baryshkov -> Dmitry
Eremin-Solenikov -> Dmitry Baryshkov.  Map several odd entries to main
identity.

Link: https://lkml.kernel.org/r/20201103005158.1181426-1-dmitry.baryshkov@linaro.org
Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 .mailmap |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/.mailmap~mailmap-fix-entry-for-dmitry-baryshkov-eremin-solenikov
+++ a/.mailmap
@@ -82,7 +82,10 @@ Dengcheng Zhu <dzhu@wavecomp.com> <dengc
 Dengcheng Zhu <dzhu@wavecomp.com> <dengcheng.zhu@imgtec.com>
 Dengcheng Zhu <dzhu@wavecomp.com> <dengcheng.zhu@mips.com>
 <dev.kurt@vandijck-laurijssen.be> <kurt.van.dijck@eia.be>
-Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
+Dmitry Baryshkov <dbaryshkov@gmail.com>
+Dmitry Baryshkov <dbaryshkov@gmail.com> <[dbaryshkov@gmail.com]>
+Dmitry Baryshkov <dbaryshkov@gmail.com> <dmitry_baryshkov@mentor.com>
+Dmitry Baryshkov <dbaryshkov@gmail.com> <dmitry_eremin@mentor.com>
 Dmitry Safonov <0x7f454c46@gmail.com> <dima@arista.com>
 Dmitry Safonov <0x7f454c46@gmail.com> <d.safonov@partner.samsung.com>
 Dmitry Safonov <0x7f454c46@gmail.com> <dsafonov@virtuozzo.com>
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 05/14] mm/slub: fix panic in slab_alloc_node()
  2020-11-14  6:51 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2020-11-14  6:51 ` [patch 04/14] mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14  6:51 ` [patch 06/14] mm/gup: use unpin_user_pages() in __gup_longterm_locked() Andrew Morton
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: akpm, cheloha, cl, iamjoonsoo.kim, ldufour, linux-mm, mhocko,
	mm-commits, nathanl, penberg, richard.weiyang, rientjes, stable,
	torvalds, vbabka

From: Laurent Dufour <ldufour@linux.ibm.com>
Subject: mm/slub: fix panic in slab_alloc_node()

While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs
with 11TB of ram, I hit the following panic:

BUG: Kernel NULL pointer dereference on read at 0x00000007
Faulting instruction address: 0xc000000000456048
Oops: Kernel access of bad area, sig: 11 [#2]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp
CPU: 160 PID: 1 Comm: systemd Tainted: G      D           5.9.0 #1
NIP:  c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
REGS: c00006028d1b77a0 TRAP: 0300   Tainted: G      D            (5.9.0)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004228  XER: 00000000
CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
NIP [c000000000456048] __kmalloc_node+0x108/0x790
LR [c000000000455fd4] __kmalloc_node+0x94/0x790
Call Trace:
[c00006028d1b7a30] [c00007c51af92000] 0xc00007c51af92000 (unreliable)
[c00006028d1b7aa0] [c0000000003c0ff8] kvmalloc_node+0x58/0x110
[c00006028d1b7ae0] [c00000000047b45c] mem_cgroup_css_online+0x10c/0x270
[c00006028d1b7b30] [c000000000241fd8] online_css+0x48/0xd0
[c00006028d1b7b60] [c00000000024af14] cgroup_apply_control_enable+0x2c4/0x470
[c00006028d1b7c40] [c00000000024e838] cgroup_mkdir+0x408/0x5f0
[c00006028d1b7cb0] [c0000000005a4ef0] kernfs_iop_mkdir+0x90/0x100
[c00006028d1b7cf0] [c0000000004b8168] vfs_mkdir+0x138/0x250
[c00006028d1b7d40] [c0000000004baf04] do_mkdirat+0x154/0x1c0
[c00006028d1b7dc0] [c000000000032b38] system_call_exception+0xf8/0x200
[c00006028d1b7e20] [c00000000000c740] system_call_common+0xf0/0x27c
Instruction dump:
e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78

This pointing to the following code:

mm/slub.c:2851
        if (unlikely(!object || !node_match(page, node))) {
c000000000456038:       00 00 bc 2f     cmpdi   cr7,r28,0
c00000000045603c:       18 00 9e 41     beq     cr7,c000000000456054 <__kmalloc_node+0x114>
node_match():
mm/slub.c:2491
        if (node != NUMA_NO_NODE && page_to_nid(page) != node)
c000000000456040:       30 02 92 41     beq     cr4,c000000000456270 <__kmalloc_node+0x330>
page_to_nid():
include/linux/mm.h:1294
c000000000456044:       10 00 27 e9     ld      r9,16(r7)
c000000000456048:       07 00 29 89     lbz     r9,7(r9)	<<<< r9 = NULL
node_match():
mm/slub.c:2491
c00000000045604c:       00 48 99 7f     cmpw    cr7,r25,r9
c000000000456050:       20 02 9e 41     beq     cr7,c000000000456270 <__kmalloc_node+0x330>

The panic occurred in slab_alloc_node() when checking for the page's node:
	object = c->freelist;
	page = c->page;
	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
		stat(s, ALLOC_SLOWPATH);

The issue is that object is not NULL while page is NULL which is odd but
may happen if the cache flush happened after loading object but before
loading page.  Thus checking for the page pointer is required too.

The cache flush is done through an inter processor interrupt when a piece
of memory is off-lined.  That interrupt is triggered when a memory
hot-unplug operation is initiated and offline_pages() is calling the
slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback() which
is calling flush_cpu_slab().  If that interrupt is caught between the
reading of c->freelist and the reading of c->page, this could lead to such
a situation.  That situation is expected and the later call to
this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
the whole operation.

In commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in
node_match()") check on the page pointer has been removed assuming that
page is always valid when it is called.  It happens that this is not true
in that particular case, so check for page before calling node_match()
here.

Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-fix-panic-in-slab_alloc_node
+++ a/mm/slub.c
@@ -2852,7 +2852,7 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node))) {
+	if (unlikely(!object || !page || !node_match(page, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 06/14] mm/gup: use unpin_user_pages() in __gup_longterm_locked()
  2020-11-14  6:51 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2020-11-14  6:51 ` [patch 05/14] mm/slub: fix panic in slab_alloc_node() Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14  6:51 ` [patch 07/14] compiler.h: fix barrier_data() on clang Andrew Morton
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, ira.weiny, jgg, jhubbard,
	linux-mm, mm-commits, stable, torvalds

From: Jason Gunthorpe <jgg@nvidia.com>
Subject: mm/gup: use unpin_user_pages() in __gup_longterm_locked()

When FOLL_PIN is passed to __get_user_pages() the page list must be put
back using unpin_user_pages() otherwise the page pin reference persists in
a corrupted state.

There are two places in the unwind of __gup_longterm_locked() that put the
pages back without checking.  Normally on error this function would return
the partial page list making this the caller's responsibility, but in
these two cases the caller is not allowed to see these pages at all.

Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reported-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/mm/gup.c~mm-gup-use-unpin_user_pages-in-__gup_longterm_locked
+++ a/mm/gup.c
@@ -1647,8 +1647,11 @@ check_again:
 		/*
 		 * drop the above get_user_pages reference.
 		 */
-		for (i = 0; i < nr_pages; i++)
-			put_page(pages[i]);
+		if (gup_flags & FOLL_PIN)
+			unpin_user_pages(pages, nr_pages);
+		else
+			for (i = 0; i < nr_pages; i++)
+				put_page(pages[i]);
 
 		if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
 			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
@@ -1728,8 +1731,11 @@ static long __gup_longterm_locked(struct
 			goto out;
 
 		if (check_dax_vmas(vmas_tmp, rc)) {
-			for (i = 0; i < rc; i++)
-				put_page(pages[i]);
+			if (gup_flags & FOLL_PIN)
+				unpin_user_pages(pages, rc);
+			else
+				for (i = 0; i < rc; i++)
+					put_page(pages[i]);
 			rc = -EOPNOTSUPP;
 			goto out;
 		}
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 07/14] compiler.h: fix barrier_data() on clang
  2020-11-14  6:51 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2020-11-14  6:51 ` [patch 06/14] mm/gup: use unpin_user_pages() in __gup_longterm_locked() Andrew Morton
@ 2020-11-14  6:51 ` Andrew Morton
  2020-11-14  6:52 ` [patch 08/14] Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint" Andrew Morton
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:51 UTC (permalink / raw)
  To: akpm, keescook, linux-mm, mm-commits, ndesaulniers, nivedita,
	rdunlap, stable, torvalds

From: Arvind Sankar <nivedita@alum.mit.edu>
Subject: compiler.h: fix barrier_data() on clang

Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h
mutually exclusive") neglected to copy barrier_data() from compiler-gcc.h
into compiler-clang.h.  The definition in compiler-gcc.h was really to
work around clang's more aggressive optimization, so this broke
barrier_data() on clang, and consequently memzero_explicit() as well.

For example, this results in at least the memzero_explicit() call in
lib/crypto/sha256.c:sha256_transform() being optimized away by clang.

Fix this by moving the definition of barrier_data() into compiler.h.

Also move the gcc/clang definition of barrier() into compiler.h,
__memory_barrier() is icc-specific (and barrier() is already defined using
it in compiler-intel.h) and doesn't belong in compiler.h.

[rdunlap@infradead.org: fix ALPHA builds when SMP is not enabled]
  Link: https://lkml.kernel.org/r/20201101231835.4589-1-rdunlap@infradead.org
Link: https://lkml.kernel.org/r/20201014212631.207844-1-nivedita@alum.mit.edu
Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/barrier.h  |    1 +
 include/linux/compiler-clang.h |    6 ------
 include/linux/compiler-gcc.h   |   19 -------------------
 include/linux/compiler.h       |   18 ++++++++++++++++--
 4 files changed, 17 insertions(+), 27 deletions(-)

--- a/include/asm-generic/barrier.h~compilerh-fix-barrier_data-on-clang
+++ a/include/asm-generic/barrier.h
@@ -13,6 +13,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <linux/compiler.h>
 #include <asm/rwonce.h>
 
 #ifndef nop
--- a/include/linux/compiler-clang.h~compilerh-fix-barrier_data-on-clang
+++ a/include/linux/compiler-clang.h
@@ -60,12 +60,6 @@
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
 
-/* The following are for compatibility with GCC, from compiler-gcc.h,
- * and may be redefined here because they should not be shared with other
- * compilers, like ICC.
- */
-#define barrier() __asm__ __volatile__("" : : : "memory")
-
 #if __has_feature(shadow_call_stack)
 # define __noscs	__attribute__((__no_sanitize__("shadow-call-stack")))
 #endif
--- a/include/linux/compiler-gcc.h~compilerh-fix-barrier_data-on-clang
+++ a/include/linux/compiler-gcc.h
@@ -15,25 +15,6 @@
 # error Sorry, your version of GCC is too old - please use 4.9 or newer.
 #endif
 
-/* Optimization barrier */
-
-/* The "volatile" is due to gcc bugs */
-#define barrier() __asm__ __volatile__("": : :"memory")
-/*
- * This version is i.e. to prevent dead stores elimination on @ptr
- * where gcc and llvm may behave differently when otherwise using
- * normal barrier(): while gcc behavior gets along with a normal
- * barrier(), llvm needs an explicit input variable to be assumed
- * clobbered. The issue is as follows: while the inline asm might
- * access any memory it wants, the compiler could have fit all of
- * @ptr into memory registers instead, and since @ptr never escaped
- * from that, it proved that the inline asm wasn't touching any of
- * it. This version works well with both compilers, i.e. we're telling
- * the compiler that the inline asm absolutely may see the contents
- * of @ptr. See also: https://llvm.org/bugs/show_bug.cgi?id=15495
- */
-#define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")
-
 /*
  * This macro obfuscates arithmetic on a variable address so that gcc
  * shouldn't recognize the original var, and make assumptions about it.
--- a/include/linux/compiler.h~compilerh-fix-barrier_data-on-clang
+++ a/include/linux/compiler.h
@@ -80,11 +80,25 @@ void ftrace_likely_update(struct ftrace_
 
 /* Optimization barrier */
 #ifndef barrier
-# define barrier() __memory_barrier()
+/* The "volatile" is due to gcc bugs */
+# define barrier() __asm__ __volatile__("": : :"memory")
 #endif
 
 #ifndef barrier_data
-# define barrier_data(ptr) barrier()
+/*
+ * This version is i.e. to prevent dead stores elimination on @ptr
+ * where gcc and llvm may behave differently when otherwise using
+ * normal barrier(): while gcc behavior gets along with a normal
+ * barrier(), llvm needs an explicit input variable to be assumed
+ * clobbered. The issue is as follows: while the inline asm might
+ * access any memory it wants, the compiler could have fit all of
+ * @ptr into memory registers instead, and since @ptr never escaped
+ * from that, it proved that the inline asm wasn't touching any of
+ * it. This version works well with both compilers, i.e. we're telling
+ * the compiler that the inline asm absolutely may see the contents
+ * of @ptr. See also: https://llvm.org/bugs/show_bug.cgi?id=15495
+ */
+# define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")
 #endif
 
 /* workaround for GCC PR82365 if needed */
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 08/14] Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
  2020-11-14  6:51 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2020-11-14  6:51 ` [patch 07/14] compiler.h: fix barrier_data() on clang Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  2020-11-14  6:52 ` [patch 09/14] reboot: fix overflow parsing reboot cpu number Andrew Morton
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: akpm, arnd, fabf, gregkh, keescook, linux-mm, linux, mcroce,
	mm-commits, pasha.tatashin, pmladek, robinmholt, rppt, stable,
	torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"

Patch series "fix parsing of reboot= cmdline", v3.

The parsing of the reboot= cmdline has two major errors:

- a missing bound check can crash the system on reboot
- parsing of the cpu number only works if specified last

Fix both.


This patch (of 2):

This reverts commit 616feab753972b97.

kstrtoint() and simple_strtoul() have a subtle difference which makes them
non interchangeable: if a non digit character is found amid the parsing,
the former will return an error, while the latter will just stop parsing,
e.g.  simple_strtoul("123xyx") = 123.

The kernel cmdline reboot= argument allows to specify the CPU used for
rebooting, with the syntax `s####` among the other flags, e.g. 
"reboot=warm,s31,force", so if this flag is not the last given, it's
silently ignored as well as the subsequent ones.

Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com
Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/reboot.c |   21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

--- a/kernel/reboot.c~revert-kernel-rebootc-convert-simple_strtoul-to-kstrtoint
+++ a/kernel/reboot.c
@@ -551,22 +551,15 @@ static int __init reboot_setup(char *str
 			break;
 
 		case 's':
-		{
-			int rc;
-
-			if (isdigit(*(str+1))) {
-				rc = kstrtoint(str+1, 0, &reboot_cpu);
-				if (rc)
-					return rc;
-			} else if (str[1] == 'm' && str[2] == 'p' &&
-				   isdigit(*(str+3))) {
-				rc = kstrtoint(str+3, 0, &reboot_cpu);
-				if (rc)
-					return rc;
-			} else
+			if (isdigit(*(str+1)))
+				reboot_cpu = simple_strtoul(str+1, NULL, 0);
+			else if (str[1] == 'm' && str[2] == 'p' &&
+							isdigit(*(str+3)))
+				reboot_cpu = simple_strtoul(str+3, NULL, 0);
+			else
 				*mode = REBOOT_SOFT;
 			break;
-		}
+
 		case 'g':
 			*mode = REBOOT_GPIO;
 			break;
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 09/14] reboot: fix overflow parsing reboot cpu number
  2020-11-14  6:51 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2020-11-14  6:52 ` [patch 08/14] Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint" Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  2020-11-14  6:52 ` [patch 10/14] kernel/watchdog: fix watchdog_allowed_mask not used warning Andrew Morton
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: akpm, arnd, fabf, gregkh, keescook, linux-mm, linux, mcroce,
	mm-commits, pasha.tatashin, pmladek, robinmholt, rppt, stable,
	torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: reboot: fix overflow parsing reboot cpu number

Limit the CPU number to num_possible_cpus(), because setting it to a value
lower than INT_MAX but higher than NR_CPUS produces the following error on
reboot and shutdown:

    BUG: unable to handle page fault for address: ffffffff90ab1bb0
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1c09067 P4D 1c09067 PUD 1c0a063 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 5.9.0-rc8-kvm #110
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    RIP: 0010:migrate_to_reboot_cpu+0xe/0x60
    Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da <48> 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c
    RSP: 0018:ffffc90000013e08 EFLAGS: 00010246
    RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000
    RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0
    RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000
    FS:  00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __do_sys_reboot.cold+0x34/0x5b
    ? vfs_writev+0x92/0xc0
    ? do_writev+0x52/0xd0
    do_syscall_64+0x2d/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f32bfaaecd3
    Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 89 81 0c 00 f7 d8
    RSP: 002b:00007fff6265fb58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f32bfaaecd3
    RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
    RBP: 0000000000000000 R08: 0000000000008020 R09: 00007fff6265ef60
    R10: 00007f32bedf8830 R11: 0000000000000202 R12: 0000000000000000
    R13: 0000557bba2c51c0 R14: 0000000000000000 R15: 00007fff6265fbc8
    CR2: ffffffff90ab1bb0
    ---[ end trace b813e80157136563 ]---
    RIP: 0010:migrate_to_reboot_cpu+0xe/0x60
    Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da <48> 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c
    RSP: 0018:ffffc90000013e08 EFLAGS: 00010246
    RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000
    RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0
    RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000
    FS:  00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
    Kernel Offset: disabled
    ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---

Link: https://lkml.kernel.org/r/20201103214025.116799-3-mcroce@linux.microsoft.com
Fixes: 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/reboot.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/reboot.c~reboot-fix-overflow-parsing-reboot-cpu-number
+++ a/kernel/reboot.c
@@ -558,6 +558,13 @@ static int __init reboot_setup(char *str
 				reboot_cpu = simple_strtoul(str+3, NULL, 0);
 			else
 				*mode = REBOOT_SOFT;
+			if (reboot_cpu >= num_possible_cpus()) {
+				pr_err("Ignoring the CPU number in reboot= option. "
+				       "CPU %d exceeds possible cpu number %d\n",
+				       reboot_cpu, num_possible_cpus());
+				reboot_cpu = 0;
+				break;
+			}
 			break;
 
 		case 'g':
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 10/14] kernel/watchdog: fix watchdog_allowed_mask not used warning
  2020-11-14  6:51 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2020-11-14  6:52 ` [patch 09/14] reboot: fix overflow parsing reboot cpu number Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  2020-11-14  6:52 ` [patch 11/14] mm: memcontrol: fix missing wakeup polling thread Andrew Morton
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, pmladek, santosh, tglx, torvalds

From: Santosh Sivaraj <santosh@fossix.org>
Subject: kernel/watchdog: fix watchdog_allowed_mask not used warning

Define watchdog_allowed_mask only when SOFTLOCKUP_DETECTOR is enabled.

Link: https://lkml.kernel.org/r/20201106015025.1281561-1-santosh@fossix.org
Fixes: 7feeb9cd4f5b ("watchdog/sysctl: Clean up sysctl variable name space")
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/watchdog.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/watchdog.c~kernel-watchdog-fix-watchdog_allowed_mask-not-used-warning
+++ a/kernel/watchdog.c
@@ -44,8 +44,6 @@ int __read_mostly soft_watchdog_user_ena
 int __read_mostly watchdog_thresh = 10;
 static int __read_mostly nmi_watchdog_available;
 
-static struct cpumask watchdog_allowed_mask __read_mostly;
-
 struct cpumask watchdog_cpumask __read_mostly;
 unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 
@@ -162,6 +160,8 @@ static void lockup_detector_update_enabl
 int __read_mostly sysctl_softlockup_all_cpu_backtrace;
 #endif
 
+static struct cpumask watchdog_allowed_mask __read_mostly;
+
 /* Global variables, exported for sysctl */
 unsigned int __read_mostly softlockup_panic =
 			CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 11/14] mm: memcontrol: fix missing wakeup polling thread
  2020-11-14  6:51 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2020-11-14  6:52 ` [patch 10/14] kernel/watchdog: fix watchdog_allowed_mask not used warning Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  2020-11-14  6:52 ` [patch 12/14] hugetlbfs: fix anon huge page migration race Andrew Morton
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: akpm, chris, guro, hannes, laoar.shao, linux-mm, mhocko,
	mm-commits, shakeelb, songmuchun, tj, torvalds

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: memcontrol: fix missing wakeup polling thread

When we poll the swap.events, we can miss being woken up when the swap
event occurs.  Because we didn't notify.

Link: https://lkml.kernel.org/r/20201105161936.98312-1-songmuchun@bytedance.com
Fixes: f3a53a3a1e5b ("mm, memcontrol: implement memory.swap.events")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcontrol-fix-missing-wakeup-polling-thread
+++ a/include/linux/memcontrol.h
@@ -900,12 +900,19 @@ static inline void count_memcg_event_mm(
 static inline void memcg_memory_event(struct mem_cgroup *memcg,
 				      enum memcg_memory_event event)
 {
+	bool swap_event = event == MEMCG_SWAP_HIGH || event == MEMCG_SWAP_MAX ||
+			  event == MEMCG_SWAP_FAIL;
+
 	atomic_long_inc(&memcg->memory_events_local[event]);
-	cgroup_file_notify(&memcg->events_local_file);
+	if (!swap_event)
+		cgroup_file_notify(&memcg->events_local_file);
 
 	do {
 		atomic_long_inc(&memcg->memory_events[event]);
-		cgroup_file_notify(&memcg->events_file);
+		if (swap_event)
+			cgroup_file_notify(&memcg->swap_events_file);
+		else
+			cgroup_file_notify(&memcg->events_file);
 
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 			break;
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 12/14] hugetlbfs: fix anon huge page migration race
  2020-11-14  6:51 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2020-11-14  6:52 ` [patch 11/14] mm: memcontrol: fix missing wakeup polling thread Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  2020-11-14  6:52 ` [patch 13/14] panic: don't dump stack twice on warn Andrew Morton
  2020-11-14  6:52 ` [patch 14/14] ocfs2: initialize ip_next_orphan Andrew Morton
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: akpm, cai, hughd, linux-mm, mike.kravetz, mm-commits,
	naoya.horiguchi, stable, torvalds

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlbfs: fix anon huge page migration race

Qian Cai reported the following BUG in [1]

[ 6147.019063][T45242] LTP: starting move_pages12
[ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
...
[ 6147.525866][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
avc_start_pgoff at mm/interval_tree.c:63
[ 6147.620914][T64921] Call Trace:
[ 6147.624078][T64921]  rmap_walk_anon+0x141/0xa30
rmap_walk_anon at mm/rmap.c:1864
[ 6147.628639][T64921]  try_to_unmap+0x209/0x2d0
try_to_unmap at mm/rmap.c:1763
[ 6147.633026][T64921]  ? rmap_walk_locked+0x140/0x140
[ 6147.637936][T64921]  ? page_remove_rmap+0x1190/0x1190
[ 6147.643020][T64921]  ? page_not_mapped+0x10/0x10
[ 6147.647668][T64921]  ? page_get_anon_vma+0x290/0x290
[ 6147.652664][T64921]  ? page_mapcount_is_zero+0x10/0x10
[ 6147.657838][T64921]  ? hugetlb_page_mapping_lock_write+0x97/0x180
[ 6147.663972][T64921]  migrate_pages+0x1005/0x1fb0
[ 6147.668617][T64921]  ? remove_migration_pte+0xac0/0xac0
[ 6147.673875][T64921]  move_pages_and_store_status.isra.47+0xd7/0x1a0
[ 6147.680181][T64921]  ? migrate_pages+0x1fb0/0x1fb0
[ 6147.685002][T64921]  __x64_sys_move_pages+0xa5c/0x1100
[ 6147.690176][T64921]  ? trace_hardirqs_on+0x20/0x1b5
[ 6147.695084][T64921]  ? move_pages_and_store_status.isra.47+0x1a0/0x1a0
[ 6147.701653][T64921]  ? rcu_read_lock_sched_held+0xaa/0xd0
[ 6147.707088][T64921]  ? switch_fpu_return+0x196/0x400
[ 6147.712083][T64921]  ? lockdep_hardirqs_on_prepare+0x38c/0x550
[ 6147.717954][T64921]  ? do_syscall_64+0x24/0x310
[ 6147.722513][T64921]  do_syscall_64+0x5f/0x310
[ 6147.726897][T64921]  ? trace_hardirqs_off+0x12/0x1a0
[ 6147.731894][T64921]  ? asm_exc_page_fault+0x8/0x30
[ 6147.736714][T64921]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickens diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem.   This is wrong for
anon pages as the anon_vma_lock should be held in this case.  Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
keep mapping valid while dropping page lock.

This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
  calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write.  The routine
  will now simply do a 'trylock' while still holding the page lock.  If
  the trylock fails, it will return NULL.  This could impact the callers:
  - migration calling code will receive -EAGAIN and retry up to the hard
    coded limit (10).
  - memory error code will treat the page as BUSY.  This will force
    killing (SIGKILL) instead of SIGBUS any mapping tasks.
  Do note that this change in behavior only happens when there is a race.
  None of the standard kernel testing suites actually hit this race, but
  it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c        |   90 ++----------------------------------------
 mm/memory-failure.c |   36 +++++++---------
 mm/migrate.c        |   46 +++++++++++----------
 mm/rmap.c           |    5 --
 4 files changed, 48 insertions(+), 129 deletions(-)

--- a/mm/hugetlb.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/hugetlb.c
@@ -1568,103 +1568,23 @@ int PageHeadHuge(struct page *page_head)
 }
 
 /*
- * Find address_space associated with hugetlbfs page.
- * Upon entry page is locked and page 'was' mapped although mapped state
- * could change.  If necessary, use anon_vma to find vma and associated
- * address space.  The returned mapping may be stale, but it can not be
- * invalid as page lock (which is held) is required to destroy mapping.
- */
-static struct address_space *_get_hugetlb_page_mapping(struct page *hpage)
-{
-	struct anon_vma *anon_vma;
-	pgoff_t pgoff_start, pgoff_end;
-	struct anon_vma_chain *avc;
-	struct address_space *mapping = page_mapping(hpage);
-
-	/* Simple file based mapping */
-	if (mapping)
-		return mapping;
-
-	/*
-	 * Even anonymous hugetlbfs mappings are associated with an
-	 * underlying hugetlbfs file (see hugetlb_file_setup in mmap
-	 * code).  Find a vma associated with the anonymous vma, and
-	 * use the file pointer to get address_space.
-	 */
-	anon_vma = page_lock_anon_vma_read(hpage);
-	if (!anon_vma)
-		return mapping;  /* NULL */
-
-	/* Use first found vma */
-	pgoff_start = page_to_pgoff(hpage);
-	pgoff_end = pgoff_start + pages_per_huge_page(page_hstate(hpage)) - 1;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
-					pgoff_start, pgoff_end) {
-		struct vm_area_struct *vma = avc->vma;
-
-		mapping = vma->vm_file->f_mapping;
-		break;
-	}
-
-	anon_vma_unlock_read(anon_vma);
-	return mapping;
-}
-
-/*
  * Find and lock address space (mapping) in write mode.
  *
- * Upon entry, the page is locked which allows us to find the mapping
- * even in the case of an anon page.  However, locking order dictates
- * the i_mmap_rwsem be acquired BEFORE the page lock.  This is hugetlbfs
- * specific.  So, we first try to lock the sema while still holding the
- * page lock.  If this works, great!  If not, then we need to drop the
- * page lock and then acquire i_mmap_rwsem and reacquire page lock.  Of
- * course, need to revalidate state along the way.
+ * Upon entry, the page is locked which means that page_mapping() is
+ * stable.  Due to locking order, we can only trylock_write.  If we can
+ * not get the lock, simply return NULL to caller.
  */
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
 {
-	struct address_space *mapping, *mapping2;
+	struct address_space *mapping = page_mapping(hpage);
 
-	mapping = _get_hugetlb_page_mapping(hpage);
-retry:
 	if (!mapping)
 		return mapping;
 
-	/*
-	 * If no contention, take lock and return
-	 */
 	if (i_mmap_trylock_write(mapping))
 		return mapping;
 
-	/*
-	 * Must drop page lock and wait on mapping sema.
-	 * Note:  Once page lock is dropped, mapping could become invalid.
-	 * As a hack, increase map count until we lock page again.
-	 */
-	atomic_inc(&hpage->_mapcount);
-	unlock_page(hpage);
-	i_mmap_lock_write(mapping);
-	lock_page(hpage);
-	atomic_add_negative(-1, &hpage->_mapcount);
-
-	/* verify page is still mapped */
-	if (!page_mapped(hpage)) {
-		i_mmap_unlock_write(mapping);
-		return NULL;
-	}
-
-	/*
-	 * Get address space again and verify it is the same one
-	 * we locked.  If not, drop lock and retry.
-	 */
-	mapping2 = _get_hugetlb_page_mapping(hpage);
-	if (mapping2 != mapping) {
-		i_mmap_unlock_write(mapping);
-		mapping = mapping2;
-		goto retry;
-	}
-
-	return mapping;
+	return NULL;
 }
 
 pgoff_t __basepage_index(struct page *page)
--- a/mm/memory-failure.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/memory-failure.c
@@ -1057,27 +1057,25 @@ static bool hwpoison_user_mappings(struc
 	if (!PageHuge(hpage)) {
 		unmap_success = try_to_unmap(hpage, ttu);
 	} else {
-		/*
-		 * For hugetlb pages, try_to_unmap could potentially call
-		 * huge_pmd_unshare.  Because of this, take semaphore in
-		 * write mode here and set TTU_RMAP_LOCKED to indicate we
-		 * have taken the lock at this higer level.
-		 *
-		 * Note that the call to hugetlb_page_mapping_lock_write
-		 * is necessary even if mapping is already set.  It handles
-		 * ugliness of potentially having to drop page lock to obtain
-		 * i_mmap_rwsem.
-		 */
-		mapping = hugetlb_page_mapping_lock_write(hpage);
-
-		if (mapping) {
-			unmap_success = try_to_unmap(hpage,
+		if (!PageAnon(hpage)) {
+			/*
+			 * For hugetlb pages in shared mappings, try_to_unmap
+			 * could potentially call huge_pmd_unshare.  Because of
+			 * this, take semaphore in write mode here and set
+			 * TTU_RMAP_LOCKED to indicate we have taken the lock
+			 * at this higer level.
+			 */
+			mapping = hugetlb_page_mapping_lock_write(hpage);
+			if (mapping) {
+				unmap_success = try_to_unmap(hpage,
 						     ttu|TTU_RMAP_LOCKED);
-			i_mmap_unlock_write(mapping);
+				i_mmap_unlock_write(mapping);
+			} else {
+				pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn);
+				unmap_success = false;
+			}
 		} else {
-			pr_info("Memory failure: %#lx: could not find mapping for mapped huge page\n",
-				pfn);
-			unmap_success = false;
+			unmap_success = try_to_unmap(hpage, ttu);
 		}
 	}
 	if (!unmap_success)
--- a/mm/migrate.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/migrate.c
@@ -1328,34 +1328,38 @@ static int unmap_and_move_huge_page(new_
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
-		/*
-		 * try_to_unmap could potentially call huge_pmd_unshare.
-		 * Because of this, take semaphore in write mode here and
-		 * set TTU_RMAP_LOCKED to let lower levels know we have
-		 * taken the lock.
-		 */
-		mapping = hugetlb_page_mapping_lock_write(hpage);
-		if (unlikely(!mapping))
-			goto unlock_put_anon;
-
-		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
-			TTU_RMAP_LOCKED);
+		bool mapping_locked = false;
+		enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK|
+					TTU_IGNORE_ACCESS;
+
+		if (!PageAnon(hpage)) {
+			/*
+			 * In shared mappings, try_to_unmap could potentially
+			 * call huge_pmd_unshare.  Because of this, take
+			 * semaphore in write mode here and set TTU_RMAP_LOCKED
+			 * to let lower levels know we have taken the lock.
+			 */
+			mapping = hugetlb_page_mapping_lock_write(hpage);
+			if (unlikely(!mapping))
+				goto unlock_put_anon;
+
+			mapping_locked = true;
+			ttu |= TTU_RMAP_LOCKED;
+		}
+
+		try_to_unmap(hpage, ttu);
 		page_was_mapped = 1;
-		/*
-		 * Leave mapping locked until after subsequent call to
-		 * remove_migration_ptes()
-		 */
+
+		if (mapping_locked)
+			i_mmap_unlock_write(mapping);
 	}
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (page_was_mapped) {
+	if (page_was_mapped)
 		remove_migration_ptes(hpage,
-			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, true);
-		i_mmap_unlock_write(mapping);
-	}
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
 
 unlock_put_anon:
 	unlock_page(new_hpage);
--- a/mm/rmap.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/rmap.c
@@ -1413,9 +1413,6 @@ static bool try_to_unmap_one(struct page
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
-		 *
-		 * If called for a huge page, caller must hold i_mmap_rwsem
-		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &range.start,
 						     &range.end);
@@ -1462,7 +1459,7 @@ static bool try_to_unmap_one(struct page
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
-		if (PageHuge(page)) {
+		if (PageHuge(page) && !PageAnon(page)) {
 			/*
 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
 			 * held in write mode.  Caller needs to explicitly
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 13/14] panic: don't dump stack twice on warn
  2020-11-14  6:51 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2020-11-14  6:52 ` [patch 12/14] hugetlbfs: fix anon huge page migration race Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  2020-11-14  6:52 ` [patch 14/14] ocfs2: initialize ip_next_orphan Andrew Morton
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: aik, akpm, christophe.leroy, linux-mm, mm-commits, torvalds,
	wangkefeng.wang

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: panic: don't dump stack twice on warn

Before commit 3f388f28639f ("panic: dump registers on panic_on_warn"),
__warn() was calling show_regs() when regs was not NULL, and show_stack()
otherwise.

After that commit, show_stack() is called regardless of whether
show_regs() has been called or not, leading to duplicated Call Trace:

[    7.112617] ------------[ cut here ]------------
[    7.117041] WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/nohash/8xx.c:186 mmu_mark_initmem_nx+0x24/0x94
[    7.126021] CPU: 0 PID: 1 Comm: swapper Not tainted 5.10.0-rc2-s3k-dev-01375-gf46ec0d3ecbd-dirty #4092
[    7.135202] NIP:  c00128b4 LR: c0010228 CTR: 00000000
[    7.140205] REGS: c9023e40 TRAP: 0700   Not tainted  (5.10.0-rc2-s3k-dev-01375-gf46ec0d3ecbd-dirty)
[    7.149131] MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 24000424  XER: 00000000
[    7.155760]
[    7.155760] GPR00: c0010228 c9023ef8 c2100000 0074c000 ffffffff 00000000 c2151000 c07b3880
[    7.155760] GPR08: ff000900 0074c000 c8000000 c33b53a8 24000822 00000000 c0003a20 00000000
[    7.155760] GPR16: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[    7.155760] GPR24: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00800000
[    7.191092] NIP [c00128b4] mmu_mark_initmem_nx+0x24/0x94
[    7.196333] LR [c0010228] free_initmem+0x20/0x58
[    7.200855] Call Trace:
[    7.203319] [c9023f18] [c0010228] free_initmem+0x20/0x58
[    7.208564] [c9023f28] [c0003a3c] kernel_init+0x1c/0x114
[    7.213813] [c9023f38] [c000f184] ret_from_kernel_thread+0x14/0x1c
[    7.219869] Instruction dump:
[    7.222805] 7d291850 7d234b78 4e800020 9421ffe0 7c0802a6 bfc10018 3fe0c060 3bff0000
[    7.230462] 3fff4080 3bffffff 90010024 57ff0010 <0fe00000> 392001cd 7c3e0b78 953e0008
[    7.238327] CPU: 0 PID: 1 Comm: swapper Not tainted 5.10.0-rc2-s3k-dev-01375-gf46ec0d3ecbd-dirty #4092
[    7.247500] Call Trace:
[    7.249977] [c9023dc0] [c001e070] __warn+0x8c/0xd8 (unreliable)
[    7.255815] [c9023de0] [c05e0e5c] report_bug+0x11c/0x154
[    7.261085] [c9023e10] [c0009ea4] program_check_exception+0x1dc/0x6e0
[    7.267430] [c9023e30] [c000f43c] ret_from_except_full+0x0/0x4
[    7.273238] --- interrupt: 700 at mmu_mark_initmem_nx+0x24/0x94
[    7.273238]     LR = free_initmem+0x20/0x58
[    7.283155] [c9023ef8] [00000000] 0x0 (unreliable)
[    7.287913] [c9023f18] [c0010228] free_initmem+0x20/0x58
[    7.293160] [c9023f28] [c0003a3c] kernel_init+0x1c/0x114
[    7.298410] [c9023f38] [c000f184] ret_from_kernel_thread+0x14/0x1c
[    7.304479] ---[ end trace 31702cd2a9570752 ]---

Only call show_stack() when regs is NULL.

Link: https://lkml.kernel.org/r/e8c055458b080707f1bc1a98ff8bea79d0cec445.1604748361.git.christophe.leroy@csgroup.eu
Fixes: 3f388f28639f ("panic: dump registers on panic_on_warn")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/panic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/kernel/panic.c~panic-dont-dump-stack-twice-on-warn
+++ a/kernel/panic.c
@@ -605,7 +605,8 @@ void __warn(const char *file, int line,
 		panic("panic_on_warn set ...\n");
 	}
 
-	dump_stack();
+	if (!regs)
+		dump_stack();
 
 	print_irqtrace_events(current);
 
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 14/14] ocfs2: initialize ip_next_orphan
  2020-11-14  6:51 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2020-11-14  6:52 ` [patch 13/14] panic: don't dump stack twice on warn Andrew Morton
@ 2020-11-14  6:52 ` Andrew Morton
  13 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2020-11-14  6:52 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, stable, torvalds, wen.gang.wang

From: Wengang Wang <wen.gang.wang@oracle.com>
Subject: ocfs2: initialize ip_next_orphan

Though problem if found on a lower 4.1.12 kernel, I think upstream has
same issue.

In one node in the cluster, there is the following callback trace:

# cat /proc/21473/stack
[<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
[<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
[<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
[<ffffffff8122b36e>] evict+0xae/0x1a0
[<ffffffff8122bd26>] iput+0x1c6/0x230
[<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
[<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
[<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
[<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
[<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
[<ffffffff810a1399>] process_one_work+0x169/0x4a0
[<ffffffff810a1bcb>] worker_thread+0x5b/0x560
[<ffffffff810a7a2b>] kthread+0xcb/0xf0
[<ffffffff816f5d21>] ret_from_fork+0x61/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

The above stack is not reasonable, the final iput shouldn't happen in
ocfs2_orphan_filldir() function. Looking at the code,

2067         /* Skip inodes which are already added to recover list, since dio may
2068          * happen concurrently with unlink/rename */
2069         if (OCFS2_I(iter)->ip_next_orphan) {
2070                 iput(iter);
2071                 return 0;
2072         }
2073

The logic thinks the inode is already in recover list on seeing
ip_next_orphan is non-NULL, so it skip this inode after dropping a
reference which incremented in ocfs2_iget().

While, if the inode is already in recover list, it should have another
reference and the iput() at line 2070 should not be the final iput
(dropping the last reference).  So I don't think the inode is really in
the recover list (no vmcore to confirm).

Note that ocfs2_queue_orphans(), though not shown up in the call back
trace, is holding cluster lock on the orphan directory when looking up for
unlinked inodes.  The on disk inode eviction could involve a lot of IOs
which may need long time to finish.  That means this node could hold the
cluster lock for very long time, that can lead to the lock requests (from
other nodes) to the orhpan directory hang for long time.

Looking at more on ip_next_orphan, I found it's not initialized when
allocating a new ocfs2_inode_info structure.

This causes te reflink operations from some nodes hang for very long
time waiting for the cluster lock on the orphan directory.

Fix: initialize ip_next_orphan as NULL.

Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/super.c |    1 +
 1 file changed, 1 insertion(+)

--- a/fs/ocfs2/super.c~ocfs2-initialize-ip_next_orphan
+++ a/fs/ocfs2/super.c
@@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *
 
 	oi->ip_blkno = 0ULL;
 	oi->ip_clusters = 0;
+	oi->ip_next_orphan = NULL;
 
 	ocfs2_resv_init_once(&oi->ip_la_data_resv);
 
_

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
  2020-11-14  6:51 ` [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit Andrew Morton
@ 2020-11-14 21:39   ` Linus Torvalds
  2020-11-14 22:14     ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2020-11-14 21:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: a.sahrawat, Linux-MM, maninder1.s, Mel Gorman, Michal Hocko,
	mm-commits, Nick Piggin, stable, v.narang, Vlastimil Babka

I've applied this patch, but I have to say that I absolutely detest it.

This is very fragile, and I think the problem is the nasty interface.

Why don't we have simple wrappers that internally do that "mod", but
actually expose "inc" and "dec" instead, and make it much harder to
have these kinds of problems?

The function name is horrible too: "mod_node_page_state()"?  It's not
like that really even describes what it does. It doesn't modify any
page state what-so-ever, what it does is to update some node counters.

Did somebody mis-understand what "stat" stands for? It's not some odd
shortening of "state" (like "creat"). It's shorthand for "statistics".
Because that completely nonsensical "state" naming shows up in all of
those helper functions too.

And the "page" part of the name is misleading too, since it's not even
about page statistics, and the people who have a page need to
literally do "page_pgdat()" on that page in order to pass in the
"struct pglist_data" that the function wants.

To make matters worse, we have the double underscore versions too,
which have that magical "we know interrupts are disabled"
optimization, so we duplicate all of this horror.

So I really think this whole area would be ripe for some major
interface cleanups, both in naming and in interfaces.

A _small_ step would have been to avoid the "mod" thing.

            Linus

On Fri, Nov 13, 2020 at 10:51 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> From: Nicholas Piggin <npiggin@gmail.com>
> Subject: mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
>
> -       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -nr_reclaimed);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE,
> +                           -(long)nr_reclaimed);
...
>         mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE,
> -                           -stat.nr_lazyfree_fail);
> +                           -(long)stat.nr_lazyfree_fail);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
  2020-11-14 21:39   ` Linus Torvalds
@ 2020-11-14 22:14     ` Matthew Wilcox
  0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2020-11-14 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, a.sahrawat, Linux-MM, maninder1.s, Mel Gorman,
	Michal Hocko, mm-commits, Nick Piggin, stable, v.narang,
	Vlastimil Babka

On Sat, Nov 14, 2020 at 01:39:47PM -0800, Linus Torvalds wrote:
> I've applied this patch, but I have to say that I absolutely detest it.
> 
> This is very fragile, and I think the problem is the nasty interface.

I agree, it's nast.

> Why don't we have simple wrappers that internally do that "mod", but
> actually expose "inc" and "dec" instead, and make it much harder to
> have these kinds of problems?

We actually need 'add' and 'sub', not 'inc' and 'dec'.  Sometimes we inc,
but often we need to add.  But, yes, 'mod' is a bad name.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-11-14 22:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-14  6:51 incoming Andrew Morton
2020-11-14  6:51 ` [patch 01/14] mm/compaction: count pages and stop correctly during page isolation Andrew Morton
2020-11-14  6:51 ` [patch 02/14] mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate Andrew Morton
2020-11-14  6:51 ` [patch 03/14] mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit Andrew Morton
2020-11-14 21:39   ` Linus Torvalds
2020-11-14 22:14     ` Matthew Wilcox
2020-11-14  6:51 ` [patch 04/14] mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov Andrew Morton
2020-11-14  6:51 ` [patch 05/14] mm/slub: fix panic in slab_alloc_node() Andrew Morton
2020-11-14  6:51 ` [patch 06/14] mm/gup: use unpin_user_pages() in __gup_longterm_locked() Andrew Morton
2020-11-14  6:51 ` [patch 07/14] compiler.h: fix barrier_data() on clang Andrew Morton
2020-11-14  6:52 ` [patch 08/14] Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint" Andrew Morton
2020-11-14  6:52 ` [patch 09/14] reboot: fix overflow parsing reboot cpu number Andrew Morton
2020-11-14  6:52 ` [patch 10/14] kernel/watchdog: fix watchdog_allowed_mask not used warning Andrew Morton
2020-11-14  6:52 ` [patch 11/14] mm: memcontrol: fix missing wakeup polling thread Andrew Morton
2020-11-14  6:52 ` [patch 12/14] hugetlbfs: fix anon huge page migration race Andrew Morton
2020-11-14  6:52 ` [patch 13/14] panic: don't dump stack twice on warn Andrew Morton
2020-11-14  6:52 ` [patch 14/14] ocfs2: initialize ip_next_orphan Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).