mm-commits.vger.kernel.org archive mirror
* incoming
@ 2020-11-02  1:06 Andrew Morton
  2020-11-02  1:07 ` [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates Andrew Morton
                   ` (14 more replies)
  0 siblings, 15 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm

15 patches, based on 3cea11cd5e3b00d91caf0b4730194039b45c5891.

Subsystems affected by this patch series:

  mm/memremap
  mm/memcg
  mm/slab-generic
  mm/kasan
  mm/mempolicy
  signals
  lib
  mm/pagecache
  kthread
  mm/oom-kill
  mm/pagemap
  epoll
  core-kernel

Subsystem: mm/memremap

    Ralph Campbell <rcampbell@nvidia.com>:
      mm/mremap_pages: fix static key devmap_managed_key updates

Subsystem: mm/memcg

    Mike Kravetz <mike.kravetz@oracle.com>:
      hugetlb_cgroup: fix reservation accounting

    zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>:
      mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg

    Roman Gushchin <guro@fb.com>:
      mm: memcg: link page counters to root if use_hierarchy is false

Subsystem: mm/slab-generic

Subsystem: mm/kasan

    Andrey Konovalov <andreyknvl@google.com>:
      kasan: adopt KUNIT tests to SW_TAGS mode

Subsystem: mm/mempolicy

    Shijie Luo <luoshijie1@huawei.com>:
      mm: mempolicy: fix potential pte_unmap_unlock pte error

Subsystem: signals

    Oleg Nesterov <oleg@redhat.com>:
      ptrace: fix task_join_group_stop() for the case when current is traced

Subsystem: lib

    Vasily Gorbik <gor@linux.ibm.com>:
      lib/crc32test: remove extra local_irq_disable/enable

Subsystem: mm/pagecache

    Jason Yan <yanaijie@huawei.com>:
      mm/truncate.c: make __invalidate_mapping_pages() static

Subsystem: kthread

    Zqiang <qiang.zhang@windriver.com>:
      kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled

Subsystem: mm/oom-kill

    Charles Haithcock <chaithco@redhat.com>:
      mm, oom: keep oom_adj under or at upper limit when printing

Subsystem: mm/pagemap

    Jason Gunthorpe <jgg@nvidia.com>:
      mm: always have io_remap_pfn_range() set pgprot_decrypted()

Subsystem: epoll

    Soheil Hassas Yeganeh <soheil@google.com>:
      epoll: check ep_events_available() upon timeout
      epoll: add a selftest for epoll timeout race

Subsystem: core-kernel

    Lukas Bulwahn <lukas.bulwahn@gmail.com>:
      kernel/hung_task.c: make type annotations consistent

 fs/eventpoll.c                                                |   16 +
 fs/proc/base.c                                                |    2 
 include/linux/mm.h                                            |    9 
 include/linux/pgtable.h                                       |    4 
 kernel/hung_task.c                                            |    3 
 kernel/kthread.c                                              |    3 
 kernel/signal.c                                               |   19 -
 lib/crc32test.c                                               |    4 
 lib/test_kasan.c                                              |  149 +++++++---
 mm/hugetlb.c                                                  |   20 -
 mm/memcontrol.c                                               |   25 +
 mm/mempolicy.c                                                |    6 
 mm/memremap.c                                                 |   39 +-
 mm/truncate.c                                                 |    2 
 tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c |   95 ++++++
 15 files changed, 290 insertions(+), 106 deletions(-)


* [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates
  2020-11-02  1:06 incoming Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 02/15] hugetlb_cgroup: fix reservation accounting Andrew Morton
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, hch, ira.weiny, jgg,
	linux-mm, mm-commits, rcampbell, sachinp, torvalds

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/mremap_pages: fix static key devmap_managed_key updates

commit 6f42193fd86e ("memremap: don't use a separate devm action for
devmap_managed_enable_get") changed the static key updates such that we
now call devmap_managed_enable_put() without doing the equivalent
devmap_managed_enable_get().

devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
MEMORY_DEVICE_FS_DAX, but memunmap_pages() gets called for other pgmap
types too.  This results in the warning below when switching a devdax
namespace between system-ram and devdax mode.

 jump label: negative count!
 WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 static_key_slow_try_dec+0x88/0xa0
 Modules linked in:
 ....

 NIP [c000000000433318] static_key_slow_try_dec+0x88/0xa0
 LR [c000000000433314] static_key_slow_try_dec+0x84/0xa0
 Call Trace:
 [c000000025c1f660] [c000000000433314] static_key_slow_try_dec+0x84/0xa0
 [c000000025c1f6d0] [c000000000433664] __static_key_slow_dec_cpuslocked+0x34/0xd0
 [c000000025c1f700] [c0000000004337a4] static_key_slow_dec+0x54/0xf0
 [c000000025c1f770] [c00000000059c49c] memunmap_pages+0x36c/0x500
 [c000000025c1f820] [c000000000d91d10] devm_action_release+0x30/0x50
 [c000000025c1f840] [c000000000d92e34] release_nodes+0x2f4/0x3e0
 [c000000025c1f8f0] [c000000000d8b15c] device_release_driver_internal+0x17c/0x280
 [c000000025c1f930] [c000000000d883a4] bus_remove_device+0x124/0x210
 [c000000025c1f9b0] [c000000000d80ef4] device_del+0x1d4/0x530
 [c000000025c1fa70] [c000000000e341e8] unregister_dev_dax+0x48/0xe0
 [c000000025c1fae0] [c000000000d91d10] devm_action_release+0x30/0x50
 [c000000025c1fb00] [c000000000d92e34] release_nodes+0x2f4/0x3e0
 [c000000025c1fbb0] [c000000000d8b15c] device_release_driver_internal+0x17c/0x280
 [c000000025c1fbf0] [c000000000d87000] unbind_store+0x130/0x170
 [c000000025c1fc30] [c000000000d862a0] drv_attr_store+0x40/0x60
 [c000000025c1fc50] [c0000000006d316c] sysfs_kf_write+0x6c/0xb0
 [c000000025c1fc90] [c0000000006d2328] kernfs_fop_write+0x118/0x280
 [c000000025c1fce0] [c0000000005a79f8] vfs_write+0xe8/0x2a0
 [c000000025c1fd30] [c0000000005a7d94] ksys_write+0x84/0x140
 [c000000025c1fd80] [c00000000003a430] system_call_exception+0x120/0x270
 [c000000025c1fe20] [c00000000000c540] system_call_common+0xf0/0x27c

Link: https://lkml.kernel.org/r/20201023183222.13186-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memremap.c |   39 ++++++++++++++++-----------------------
 1 file changed, 16 insertions(+), 23 deletions(-)

--- a/mm/memremap.c~mm-mremap_pages-fix-static-key-devmap_managed_key-updates
+++ a/mm/memremap.c
@@ -41,28 +41,24 @@ EXPORT_SYMBOL_GPL(memremap_compat_align)
 DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
 EXPORT_SYMBOL(devmap_managed_key);
 
-static void devmap_managed_enable_put(void)
+static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
-	static_branch_dec(&devmap_managed_key);
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_FS_DAX)
+		static_branch_dec(&devmap_managed_key);
 }
 
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
+static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
-	    (!pgmap->ops || !pgmap->ops->page_free)) {
-		WARN(1, "Missing page_free method\n");
-		return -EINVAL;
-	}
-
-	static_branch_inc(&devmap_managed_key);
-	return 0;
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_FS_DAX)
+		static_branch_inc(&devmap_managed_key);
 }
 #else
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
+static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-	return -EINVAL;
 }
-static void devmap_managed_enable_put(void)
+static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
@@ -169,7 +165,7 @@ void memunmap_pages(struct dev_pagemap *
 		pageunmap_range(pgmap, i);
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put();
+	devmap_managed_enable_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -307,7 +303,6 @@ void *memremap_pages(struct dev_pagemap
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
-	bool need_devmap_managed = true;
 	int error, i;
 
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
@@ -323,6 +318,10 @@ void *memremap_pages(struct dev_pagemap
 			WARN(1, "Missing migrate_to_ram method\n");
 			return ERR_PTR(-EINVAL);
 		}
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
 		if (!pgmap->owner) {
 			WARN(1, "Missing owner\n");
 			return ERR_PTR(-EINVAL);
@@ -336,11 +335,9 @@ void *memremap_pages(struct dev_pagemap
 		}
 		break;
 	case MEMORY_DEVICE_GENERIC:
-		need_devmap_managed = false;
 		break;
 	case MEMORY_DEVICE_PCI_P2PDMA:
 		params.pgprot = pgprot_noncached(params.pgprot);
-		need_devmap_managed = false;
 		break;
 	default:
 		WARN(1, "Invalid pgmap type %d\n", pgmap->type);
@@ -364,11 +361,7 @@ void *memremap_pages(struct dev_pagemap
 		}
 	}
 
-	if (need_devmap_managed) {
-		error = devmap_managed_enable_get(pgmap);
-		if (error)
-			return ERR_PTR(error);
-	}
+	devmap_managed_enable_get(pgmap);
 
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
_

* [patch 02/15] hugetlb_cgroup: fix reservation accounting
  2020-11-02  1:06 incoming Andrew Morton
  2020-11-02  1:07 ` [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 03/15] mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg Andrew Morton
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, almasrymina, aneesh.kumar, david, linux-mm, mhocko,
	mike.kravetz, mm-commits, mprivozn, mst, songmuchun, stable, tj,
	torvalds

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb_cgroup: fix reservation accounting

Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and hit the warning below.  QEMU with free page hinting
uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
as free by a VM.  The reporting granularity is the pageblock size, so
when the guest reports a 2M chunk, QEMU punches a hole covering one huge
page with fallocate(FALLOC_FL_PUNCH_HOLE).

[  315.251417] ------------[ cut here ]------------
[  315.251424] WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
[  315.251425] Modules linked in: ...
[  315.251466] CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
[  315.251467] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
[  315.251469] RIP: 0010:page_counter_uncharge+0x4b/0x50
...
[  315.251479] Call Trace:
[  315.251485]  hugetlb_cgroup_uncharge_file_region+0x4b/0x80
[  315.251487]  region_del+0x1d3/0x300
[  315.251489]  hugetlb_unreserve_pages+0x39/0xb0
[  315.251492]  remove_inode_hugepages+0x1a8/0x3d0
[  315.251495]  ? tlb_finish_mmu+0x7a/0x1d0
[  315.251497]  hugetlbfs_fallocate+0x3c4/0x5c0
[  315.251519]  ? kvm_arch_vcpu_ioctl_run+0x614/0x1700 [kvm]
[  315.251522]  ? file_has_perm+0xa2/0xb0
[  315.251524]  ? inode_security+0xc/0x60
[  315.251525]  ? selinux_file_permission+0x4e/0x120
[  315.251527]  vfs_fallocate+0x146/0x290
[  315.251529]  __x64_sys_fallocate+0x3e/0x70
[  315.251531]  do_syscall_64+0x33/0x40
[  315.251533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
[  315.251542] ---[ end trace 4c88c62ccb1349c9 ]---

Investigation of the issue uncovered bugs in hugetlb cgroup reservation
accounting.  This patch addresses the issues found.
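
As an illustration only (not part of this patch), a minimal user-space
sketch of the fallocate(FALLOC_FL_PUNCH_HOLE) pattern that reaches this
path on a hugetlbfs file; the mount point, file name and sizes below are
assumptions:

	/*
	 * Hedged sketch: punch one 2M huge page out of a hugetlbfs file,
	 * the same operation QEMU performs for each reported 2M chunk.
	 * Assumes /dev/hugepages is a mounted hugetlbfs instance backed
	 * by 2M huge pages.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const off_t huge_sz = 2 * 1024 * 1024;
		int fd = open("/dev/hugepages/guest-mem", O_RDWR | O_CREAT, 0600);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (ftruncate(fd, 16 * huge_sz))	/* size the backing file */
			perror("ftruncate");

		/* PUNCH_HOLE must be paired with KEEP_SIZE; this discards the
		 * huge page at offset 0 and runs the hugetlb_unreserve_pages()
		 * accounting path shown in the trace above. */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      0, huge_sz))
			perror("fallocate");

		close(fd);
		return 0;
	}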

Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com
Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
Cc: <stable@vger.kernel.org>
Reported-by: Michal Privoznik <mprivozn@redhat.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Michal Privoznik <mprivozn@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

--- a/mm/hugetlb.c~hugetlb_cgroup-fix-reservation-accounting
+++ a/mm/hugetlb.c
@@ -648,6 +648,8 @@ retry:
 			}
 
 			del += t - f;
+			hugetlb_cgroup_uncharge_file_region(
+				resv, rg, t - f);
 
 			/* New entry for end of split region */
 			nrg->from = t;
@@ -660,9 +662,6 @@ retry:
 			/* Original entry is trimmed */
 			rg->to = f;
 
-			hugetlb_cgroup_uncharge_file_region(
-				resv, rg, nrg->to - nrg->from);
-
 			list_add(&nrg->link, &rg->link);
 			nrg = NULL;
 			break;
@@ -678,17 +677,17 @@ retry:
 		}
 
 		if (f <= rg->from) {	/* Trim beginning of region */
-			del += t - rg->from;
-			rg->from = t;
-
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							    t - rg->from);
-		} else {		/* Trim end of region */
-			del += rg->to - f;
-			rg->to = f;
 
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							    rg->to - f);
+
+			del += rg->to - f;
+			rg->to = f;
 		}
 	}
 
@@ -2443,6 +2442,9 @@ struct page *alloc_huge_page(struct vm_a
 
 		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 		hugetlb_acct_memory(h, -rsv_adjust);
+		if (deferred_reserve)
+			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+					pages_per_huge_page(h), page);
 	}
 	return page;
 
_

* [patch 03/15] mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg
  2020-11-02  1:06 incoming Andrew Morton
  2020-11-02  1:07 ` [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates Andrew Morton
  2020-11-02  1:07 ` [patch 02/15] hugetlb_cgroup: fix reservation accounting Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 04/15] mm: memcg: link page counters to root if use_hierarchy is false Andrew Morton
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mhocko, mm-commits, torvalds, zhongjiang-ali

From: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Subject: mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg

memcg_page_state() returns the hierarchical count for the requested item.
When the item is NR_ANON_THPS that count is in THPs, so it should be
multiplied by HPAGE_PMD_NR rather than treated as a number of base pages.
For example, with 4K base pages and 2M THPs (HPAGE_PMD_NR == 512), a
memcg charged with 3 THPs would otherwise be reported as 12K instead of
6M.

[akpm@linux-foundation.org: fix printk warning]
[akpm@linux-foundation.org: use u64 cast, per Michal]
Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-correct-the-nr_anon_thps-counter-of-hierarchical-memcg
+++ a/mm/memcontrol.c
@@ -4110,11 +4110,17 @@ static int memcg_stat_show(struct seq_fi
 			   (u64)memsw * PAGE_SIZE);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
+		unsigned long nr;
+
 		if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
 			continue;
+		nr = memcg_page_state(memcg, memcg1_stats[i]);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (memcg1_stats[i] == NR_ANON_THPS)
+			nr *= HPAGE_PMD_NR;
+#endif
 		seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
-			   (u64)memcg_page_state(memcg, memcg1_stats[i]) *
-			   PAGE_SIZE);
+						(u64)nr * PAGE_SIZE);
 	}
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
_

* [patch 04/15] mm: memcg: link page counters to root if use_hierarchy is false
  2020-11-02  1:06 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2020-11-02  1:07 ` [patch 03/15] mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 05/15] kasan: adopt KUNIT tests to SW_TAGS mode Andrew Morton
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, ltp, mhocko, mkoutny, mm-commits,
	rpalethorpe, shakeelb, stable, torvalds

From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg: link page counters to root if use_hierarchy is false

Richard reported a warning which can be reproduced by running the LTP
madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

The problem occurs because in the non-hierarchical mode non-root page
counters are not linked to root page counters, so the charge is not
propagated to the root memory cgroup.

After the removal of the original memory cgroup and reparenting of the
object cgroup, the root cgroup might be uncharged by draining an objcg
stock, for example.  This leads to an eventual underflow of the charge
and triggers a warning.

Fix it by linking all page counters to corresponding root page counters in
the non-hierarchical mode.

Please note, that in the non-hierarchical mode all objcgs are always
reparented to the root memory cgroup, even if the hierarchy has more than
1 level.  This patch doesn't change it.

The patch also doesn't affect how the hierarchical mode works, which is
the only sane and truly supported mode now.

Thanks to Richard for reporting, debugging and providing an alternative
version of the fix!

Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <guro@fb.com>
Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
Reported-by: <ltp@lists.linux.it>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-link-page-counters-to-root-if-use_hierarchy-is-false
+++ a/mm/memcontrol.c
@@ -5345,17 +5345,22 @@ mem_cgroup_css_alloc(struct cgroup_subsy
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
 	}
-	if (parent && parent->use_hierarchy) {
+	if (!parent) {
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->tcpmem, NULL);
+	} else if (parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
-		page_counter_init(&memcg->memory, NULL);
-		page_counter_init(&memcg->swap, NULL);
-		page_counter_init(&memcg->kmem, NULL);
-		page_counter_init(&memcg->tcpmem, NULL);
+		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
_

* [patch 05/15] kasan: adopt KUNIT tests to SW_TAGS mode
  2020-11-02  1:06 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2020-11-02  1:07 ` [patch 04/15] mm: memcg: link page counters to root if use_hierarchy is false Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 06/15] mm: mempolicy: fix potential pte_unmap_unlock pte error Andrew Morton
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, andreyknvl, davidgow, linux-mm, mm-commits, torvalds

From: Andrey Konovalov <andreyknvl@google.com>
Subject: kasan: adopt KUNIT tests to SW_TAGS mode

Now that we have KASAN-KUNIT tests integration, it's easy to see that some
KASAN tests are not adapted to the SW_TAGS mode and are failing.

Adjust the allocation size for kasan_memchr() and kasan_memcmp() by
rounding it up to OOB_TAG_OFF so the bad access ends up in a separate
memory granule.

Add a new kmalloc_uaf_16() test that relies on a use-after-free, and a new
kasan_bitops_tags() test that is tailored to the tag-based mode, as it's
hard to adapt the existing kmalloc_oob_16() and kasan_bitops_generic()
(renamed from kasan_bitops()) without losing precision.

Disable kasan_global_oob() and kasan_alloca_oob_left/right() as the
SW_TAGS mode doesn't instrument globals or dynamic allocas.

Link: https://lkml.kernel.org/r/76eee17b6531ca8b3ca92b240cb2fd23204aaff7.1603129942.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_kasan.c |  149 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 107 insertions(+), 42 deletions(-)

--- a/lib/test_kasan.c~kasan-adopt-kunit-tests-to-sw_tags-mode
+++ a/lib/test_kasan.c
@@ -216,6 +216,12 @@ static void kmalloc_oob_16(struct kunit
 		u64 words[2];
 	} *ptr1, *ptr2;
 
+	/* This test is specifically crafted for the generic mode. */
+	if (!IS_ENABLED(CONFIG_KASAN_GENERIC)) {
+		kunit_info(test, "CONFIG_KASAN_GENERIC required\n");
+		return;
+	}
+
 	ptr1 = kmalloc(sizeof(*ptr1) - 3, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
 
@@ -227,6 +233,23 @@ static void kmalloc_oob_16(struct kunit
 	kfree(ptr2);
 }
 
+static void kmalloc_uaf_16(struct kunit *test)
+{
+	struct {
+		u64 words[2];
+	} *ptr1, *ptr2;
+
+	ptr1 = kmalloc(sizeof(*ptr1), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
+
+	ptr2 = kmalloc(sizeof(*ptr2), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
+	kfree(ptr2);
+
+	KUNIT_EXPECT_KASAN_FAIL(test, *ptr1 = *ptr2);
+	kfree(ptr1);
+}
+
 static void kmalloc_oob_memset_2(struct kunit *test)
 {
 	char *ptr;
@@ -429,6 +452,12 @@ static void kasan_global_oob(struct kuni
 	volatile int i = 3;
 	char *p = &global_array[ARRAY_SIZE(global_array) + i];
 
+	/* Only generic mode instruments globals. */
+	if (!IS_ENABLED(CONFIG_KASAN_GENERIC)) {
+		kunit_info(test, "CONFIG_KASAN_GENERIC required");
+		return;
+	}
+
 	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
@@ -467,6 +496,12 @@ static void kasan_alloca_oob_left(struct
 	char alloca_array[i];
 	char *p = alloca_array - 1;
 
+	/* Only generic mode instruments dynamic allocas. */
+	if (!IS_ENABLED(CONFIG_KASAN_GENERIC)) {
+		kunit_info(test, "CONFIG_KASAN_GENERIC required");
+		return;
+	}
+
 	if (!IS_ENABLED(CONFIG_KASAN_STACK)) {
 		kunit_info(test, "CONFIG_KASAN_STACK is not enabled");
 		return;
@@ -481,6 +516,12 @@ static void kasan_alloca_oob_right(struc
 	char alloca_array[i];
 	char *p = alloca_array + i;
 
+	/* Only generic mode instruments dynamic allocas. */
+	if (!IS_ENABLED(CONFIG_KASAN_GENERIC)) {
+		kunit_info(test, "CONFIG_KASAN_GENERIC required");
+		return;
+	}
+
 	if (!IS_ENABLED(CONFIG_KASAN_STACK)) {
 		kunit_info(test, "CONFIG_KASAN_STACK is not enabled");
 		return;
@@ -551,6 +592,9 @@ static void kasan_memchr(struct kunit *t
 		return;
 	}
 
+	if (OOB_TAG_OFF)
+		size = round_up(size, OOB_TAG_OFF);
+
 	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
@@ -573,6 +617,9 @@ static void kasan_memcmp(struct kunit *t
 		return;
 	}
 
+	if (OOB_TAG_OFF)
+		size = round_up(size, OOB_TAG_OFF);
+
 	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 	memset(arr, 0, sizeof(arr));
@@ -619,13 +666,50 @@ static void kasan_strings(struct kunit *
 	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strnlen(ptr, 1));
 }
 
-static void kasan_bitops(struct kunit *test)
+static void kasan_bitops_modify(struct kunit *test, int nr, void *addr)
+{
+	KUNIT_EXPECT_KASAN_FAIL(test, set_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __set_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, clear_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, clear_bit_unlock(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit_unlock(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, change_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __change_bit(nr, addr));
+}
+
+static void kasan_bitops_test_and_modify(struct kunit *test, int nr, void *addr)
+{
+	KUNIT_EXPECT_KASAN_FAIL(test, test_and_set_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __test_and_set_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, test_and_set_bit_lock(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, test_and_clear_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __test_and_clear_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, test_and_change_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, __test_and_change_bit(nr, addr));
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = test_bit(nr, addr));
+
+#if defined(clear_bit_unlock_is_negative_byte)
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result =
+				clear_bit_unlock_is_negative_byte(nr, addr));
+#endif
+}
+
+static void kasan_bitops_generic(struct kunit *test)
 {
+	long *bits;
+
+	/* This test is specifically crafted for the generic mode. */
+	if (!IS_ENABLED(CONFIG_KASAN_GENERIC)) {
+		kunit_info(test, "CONFIG_KASAN_GENERIC required\n");
+		return;
+	}
+
 	/*
 	 * Allocate 1 more byte, which causes kzalloc to round up to 16-bytes;
 	 * this way we do not actually corrupt other memory.
 	 */
-	long *bits = kzalloc(sizeof(*bits) + 1, GFP_KERNEL);
+	bits = kzalloc(sizeof(*bits) + 1, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bits);
 
 	/*
@@ -633,55 +717,34 @@ static void kasan_bitops(struct kunit *t
 	 * below accesses are still out-of-bounds, since bitops are defined to
 	 * operate on the whole long the bit is in.
 	 */
-	KUNIT_EXPECT_KASAN_FAIL(test, set_bit(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, __set_bit(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, clear_bit(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, clear_bit_unlock(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit_unlock(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, change_bit(BITS_PER_LONG, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test, __change_bit(BITS_PER_LONG, bits));
+	kasan_bitops_modify(test, BITS_PER_LONG, bits);
 
 	/*
 	 * Below calls try to access bit beyond allocated memory.
 	 */
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		__test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
+	kasan_bitops_test_and_modify(test, BITS_PER_LONG + BITS_PER_BYTE, bits);
 
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		test_and_set_bit_lock(BITS_PER_LONG + BITS_PER_BYTE, bits));
-
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
+	kfree(bits);
+}
 
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		__test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
+static void kasan_bitops_tags(struct kunit *test)
+{
+	long *bits;
 
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
+	/* This test is specifically crafted for the tag-based mode. */
+	if (IS_ENABLED(CONFIG_KASAN_GENERIC)) {
+		kunit_info(test, "CONFIG_KASAN_SW_TAGS required\n");
+		return;
+	}
 
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		__test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
+	/* Allocation size will be rounded to up granule size, which is 16. */
+	bits = kzalloc(sizeof(*bits), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bits);
 
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		kasan_int_result =
-			test_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
+	/* Do the accesses past the 16 allocated bytes. */
+	kasan_bitops_modify(test, BITS_PER_LONG, &bits[1]);
+	kasan_bitops_test_and_modify(test, BITS_PER_LONG + BITS_PER_BYTE, &bits[1]);
 
-#if defined(clear_bit_unlock_is_negative_byte)
-	KUNIT_EXPECT_KASAN_FAIL(test,
-		kasan_int_result = clear_bit_unlock_is_negative_byte(
-			BITS_PER_LONG + BITS_PER_BYTE, bits));
-#endif
 	kfree(bits);
 }
 
@@ -728,6 +791,7 @@ static struct kunit_case kasan_kunit_tes
 	KUNIT_CASE(kmalloc_oob_krealloc_more),
 	KUNIT_CASE(kmalloc_oob_krealloc_less),
 	KUNIT_CASE(kmalloc_oob_16),
+	KUNIT_CASE(kmalloc_uaf_16),
 	KUNIT_CASE(kmalloc_oob_in_memset),
 	KUNIT_CASE(kmalloc_oob_memset_2),
 	KUNIT_CASE(kmalloc_oob_memset_4),
@@ -751,7 +815,8 @@ static struct kunit_case kasan_kunit_tes
 	KUNIT_CASE(kasan_memchr),
 	KUNIT_CASE(kasan_memcmp),
 	KUNIT_CASE(kasan_strings),
-	KUNIT_CASE(kasan_bitops),
+	KUNIT_CASE(kasan_bitops_generic),
+	KUNIT_CASE(kasan_bitops_tags),
 	KUNIT_CASE(kmalloc_double_kzfree),
 	KUNIT_CASE(vmalloc_oob),
 	{}
_

* [patch 06/15] mm: mempolicy: fix potential pte_unmap_unlock pte error
  2020-11-02  1:06 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2020-11-02  1:07 ` [patch 05/15] kasan: adopt KUNIT tests to SW_TAGS mode Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 07/15] ptrace: fix task_join_group_stop() for the case when current is traced Andrew Morton
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, linfeilong, linmiaohe, linux-mm, luoshijie1, mhocko,
	mm-commits, osalvador, stable, torvalds

From: Shijie Luo <luoshijie1@huawei.com>
Subject: mm: mempolicy: fix potential pte_unmap_unlock pte error

When the flags passed to queue_pages_pte_range() don't have the
MPOL_MF_MOVE or MPOL_MF_MOVE_ALL bits set, the code breaks out of the pte
loop early, and passing the original pte - 1 to pte_unmap_unlock() is not
correct.

queue_pages_pte_range() can run in MPOL_MF_MOVE_ALL mode, which doesn't
migrate misplaced pages but returns with EIO when encountering such a
page.  Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO
when MPOL_MF_STRICT is specified"), an early break on the first pte in the
range results in pte_unmap_unlock() being called on an underflowed pte.
This can lead to lockups later on when somebody tries to lock the pte,
resp. the page_table_lock, again.
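
For illustration only (not from the patch), a hedged user-space sketch of
how this path is reached: an mbind() call with MPOL_MF_STRICT and no
MPOL_MF_MOVE* flags, on a multi-node machine where some of the touched
pages sit outside the requested nodemask.  It builds against libnuma
(gcc ... -lnuma); the mapping size and node choice are assumptions:

	/* Hedged sketch: MPOL_MF_STRICT without MPOL_MF_MOVE/MOVE_ALL only
	 * checks placement and returns EIO on a misplaced page, which is the
	 * early-break path whose pte_unmap_unlock() argument this patch fixes. */
	#include <numaif.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 8 * 4096;
		unsigned long nodemask = 1UL;	/* allow node 0 only */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0, len);		/* fault the pages in first */

		if (mbind(p, len, MPOL_BIND, &nodemask,
			  8 * sizeof(nodemask), MPOL_MF_STRICT))
			perror("mbind");	/* EIO if a page is misplaced */
		return 0;
	}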

Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Shijie Luo <luoshijie1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicy-fix-potential-pte_unmap_unlock-pte-error
+++ a/mm/mempolicy.c
@@ -525,7 +525,7 @@ static int queue_pages_pte_range(pmd_t *
 	unsigned long flags = qp->flags;
 	int ret;
 	bool has_unmovable = false;
-	pte_t *pte;
+	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
@@ -539,7 +539,7 @@ static int queue_pages_pte_range(pmd_t *
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
@@ -571,7 +571,7 @@ static int queue_pages_pte_range(pmd_t *
 		} else
 			break;
 	}
-	pte_unmap_unlock(pte - 1, ptl);
+	pte_unmap_unlock(mapped_pte, ptl);
 	cond_resched();
 
 	if (has_unmovable)
_

* [patch 07/15] ptrace: fix task_join_group_stop() for the case when current is traced
  2020-11-02  1:06 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2020-11-02  1:07 ` [patch 06/15] mm: mempolicy: fix potential pte_unmap_unlock pte error Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 08/15] lib/crc32test: remove extra local_irq_disable/enable Andrew Morton
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, axboe, christian, ebiederm, linux-mm, liuzhiqiang26,
	mm-commits, oleg, stable, tj, torvalds

From: Oleg Nesterov <oleg@redhat.com>
Subject: ptrace: fix task_join_group_stop() for the case when current is traced

This testcase

	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <pthread.h>
	#include <assert.h>

	void *tf(void *arg)
	{
		return NULL;
	}

	int main(void)
	{
		int pid = fork();
		if (!pid) {
			kill(getpid(), SIGSTOP);

			pthread_t th;
			pthread_create(&th, NULL, tf, NULL);

			return 0;
		}

		waitpid(pid, NULL, WSTOPPED);

		ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
		waitpid(pid, NULL, 0);

		ptrace(PTRACE_CONT, pid, 0,0);
		waitpid(pid, NULL, 0);

		int status;
		int thread = waitpid(-1, &status, 0);
		assert(thread > 0 && thread != pid);
		assert(status == 0x80137f);

		return 0;
	}

fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().

This is because task_join_group_stop() has 2 problems when current is traced:

	1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee
	   can be woken up by debugger and it can clone another thread which
	   should join the group-stop.

	   We need to check group_stop_count || SIGNAL_STOP_STOPPED.

	2. If SIGNAL_STOP_STOPPED is already set, we should not increment
	   sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
	   should stop without another do_notify_parent_cldstop() report.

To clarify, the problem is very old and we should blame
ptrace_init_task().  But now that we have task_join_group_stop() it makes
more sense to fix this helper to avoid the code duplication.

Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <christian@brauner.io>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/signal.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- a/kernel/signal.c~ptrace-fix-task_join_group_stop-for-the-case-when-current-is-traced
+++ a/kernel/signal.c
@@ -391,16 +391,17 @@ static bool task_participate_group_stop(
 
 void task_join_group_stop(struct task_struct *task)
 {
+	unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
+	struct signal_struct *sig = current->signal;
+
+	if (sig->group_stop_count) {
+		sig->group_stop_count++;
+		mask |= JOBCTL_STOP_CONSUME;
+	} else if (!(sig->flags & SIGNAL_STOP_STOPPED))
+		return;
+
 	/* Have the new thread join an on-going signal group stop */
-	unsigned long jobctl = current->jobctl;
-	if (jobctl & JOBCTL_STOP_PENDING) {
-		struct signal_struct *sig = current->signal;
-		unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK;
-		unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
-		if (task_set_jobctl_pending(task, signr | gstop)) {
-			sig->group_stop_count++;
-		}
-	}
+	task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
 }
 
 /*
_

* [patch 08/15] lib/crc32test: remove extra local_irq_disable/enable
  2020-11-02  1:06 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2020-11-02  1:07 ` [patch 07/15] ptrace: fix task_join_group_stop() for the case when current is traced Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 09/15] mm/truncate.c: make __invalidate_mapping_pages() static Andrew Morton
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, gor, gregkh, linux-mm, mingo, mm-commits, peterz, torvalds

From: Vasily Gorbik <gor@linux.ibm.com>
Subject: lib/crc32test: remove extra local_irq_disable/enable

Commit 4d004099a668 ("lockdep: Fix lockdep recursion") uncovered the
following issue in lib/crc32test reported on s390:

BUG: using __this_cpu_read() in preemptible [00000000] code: swapper/0/1
caller is lockdep_hardirqs_on_prepare+0x48/0x270
CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.9.0-next-20201015-15164-g03d992bd2de6 #19
Hardware name: IBM 3906 M04 704 (LPAR)
Call Trace:
 [<0000000000d54870>] show_stack+0x90/0xf8
 [<0000000000d675d2>] dump_stack+0xa2/0xd8
 [<0000000000d6c9ac>] check_preemption_disabled+0xe4/0xe8
 [<00000000001d6098>] lockdep_hardirqs_on_prepare+0x48/0x270
 [<00000000002ac274>] trace_hardirqs_on+0x9c/0x1b8
 [<0000000001441430>] crc32_test.isra.0+0x170/0x1c0
 [<0000000001441764>] crc32test_init+0x1c/0x40
 [<0000000000100cd0>] do_one_initcall+0x40/0x130
 [<0000000001411586>] do_initcalls+0x126/0x150
 [<0000000001411826>] kernel_init_freeable+0x1f6/0x230
 [<0000000000d6cc92>] kernel_init+0x22/0x150
 [<0000000000d7bcc4>] ret_from_fork+0x24/0x2c
no locks held by swapper/0/1.

Remove the extra local_irq_disable()/local_irq_enable() calls:
local_irq_save() already disables interrupts and local_irq_restore()
restores the previous state, so the additional helpers are redundant.

Link: https://lkml.kernel.org/r/patch.git-4369da00c06e.your-ad-here.call-01602859837-ext-1679@work.hours
Fixes: 5fb7f87408f1 ("lib: add module support to crc32 tests")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/crc32test.c |    4 ----
 1 file changed, 4 deletions(-)

--- a/lib/crc32test.c~lib-crc32test-remove-extra-local_irq_disable-enable
+++ a/lib/crc32test.c
@@ -683,7 +683,6 @@ static int __init crc32c_test(void)
 
 	/* reduce OS noise */
 	local_irq_save(flags);
-	local_irq_disable();
 
 	nsec = ktime_get_ns();
 	for (i = 0; i < 100; i++) {
@@ -694,7 +693,6 @@ static int __init crc32c_test(void)
 	nsec = ktime_get_ns() - nsec;
 
 	local_irq_restore(flags);
-	local_irq_enable();
 
 	pr_info("crc32c: CRC_LE_BITS = %d\n", CRC_LE_BITS);
 
@@ -768,7 +766,6 @@ static int __init crc32_test(void)
 
 	/* reduce OS noise */
 	local_irq_save(flags);
-	local_irq_disable();
 
 	nsec = ktime_get_ns();
 	for (i = 0; i < 100; i++) {
@@ -783,7 +780,6 @@ static int __init crc32_test(void)
 	nsec = ktime_get_ns() - nsec;
 
 	local_irq_restore(flags);
-	local_irq_enable();
 
 	pr_info("crc32: CRC_LE_BITS = %d, CRC_BE BITS = %d\n",
 		 CRC_LE_BITS, CRC_BE_BITS);
_

* [patch 09/15] mm/truncate.c: make __invalidate_mapping_pages() static
  2020-11-02  1:06 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2020-11-02  1:07 ` [patch 08/15] lib/crc32test: remove extra local_irq_disable/enable Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 10/15] kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled Andrew Morton
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, laoar.shao, linux-mm, mm-commits, torvalds, yanaijie

From: Jason Yan <yanaijie@huawei.com>
Subject: mm/truncate.c: make __invalidate_mapping_pages() static

Fix the following sparse warning:

mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was
not declared. Should it be static?

Link: https://lkml.kernel.org/r/20201015054808.2445904-1-yanaijie@huawei.com
Fixes: eb1d7a65f08a ("mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/truncate.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/truncate.c~mm-make-__invalidate_mapping_pages-static
+++ a/mm/truncate.c
@@ -528,7 +528,7 @@ void truncate_inode_pages_final(struct a
 }
 EXPORT_SYMBOL(truncate_inode_pages_final);
 
-unsigned long __invalidate_mapping_pages(struct address_space *mapping,
+static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)
 {
 	pgoff_t indices[PAGEVEC_SIZE];
_

* [patch 10/15] kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
  2020-11-02  1:06 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2020-11-02  1:07 ` [patch 09/15] mm/truncate.c: make __invalidate_mapping_pages() static Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:07 ` [patch 11/15] mm, oom: keep oom_adj under or at upper limit when printing Andrew Morton
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, pmladek, qiang.zhang, stable, tj, torvalds

From: Zqiang <qiang.zhang@windriver.com>
Subject: kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled

There is a small race window when a delayed work is being canceled and the
work still might be queued from the timer_fn:

	CPU0						CPU1
kthread_cancel_delayed_work_sync()
   __kthread_cancel_work_sync()
     __kthread_cancel_work()
        work->canceling++;
					      kthread_delayed_work_timer_fn()
						   kthread_insert_work();

BUG: kthread_insert_work() should not get called when work->canceling is
set.

Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kthread.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/kernel/kthread.c~kthread_worker-prevent-queuing-delayed-work-from-timer_fn-when-it-is-being-canceled
+++ a/kernel/kthread.c
@@ -897,7 +897,8 @@ void kthread_delayed_work_timer_fn(struc
 	/* Move the work from worker->delayed_work_list. */
 	WARN_ON_ONCE(list_empty(&work->node));
 	list_del_init(&work->node);
-	kthread_insert_work(worker, work, &worker->work_list);
+	if (!work->canceling)
+		kthread_insert_work(worker, work, &worker->work_list);
 
 	raw_spin_unlock_irqrestore(&worker->lock, flags);
 }
_

* [patch 11/15] mm, oom: keep oom_adj under or at upper limit when printing
  2020-11-02  1:06 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2020-11-02  1:07 ` [patch 10/15] kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled Andrew Morton
@ 2020-11-02  1:07 ` Andrew Morton
  2020-11-02  1:08 ` [patch 12/15] mm: always have io_remap_pfn_range() set pgprot_decrypted() Andrew Morton
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:07 UTC (permalink / raw)
  To: adobriyan, akpm, chaithco, linux-mm, mhocko, mm-commits, torvalds

From: Charles Haithcock <chaithco@redhat.com>
Subject: mm, oom: keep oom_adj under or at upper limit when printing

For oom_score_adj values in the range [942,999], the current calculation,
(oom_score_adj * -OOM_DISABLE) / OOM_SCORE_ADJ_MAX, yields 16 for oom_adj,
which exceeds the documented upper limit of 15.  Clamp the printed value
so the output stays in line with the documentation.
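
A quick user-space check (an illustrative sketch, not part of the patch)
that exercises the clamping: it raises its own oom_score_adj to 999 and
reads back the legacy oom_adj value:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[16] = "";
		int fd;

		/* Raising one's own oom_score_adj needs no privilege. */
		fd = open("/proc/self/oom_score_adj", O_WRONLY);
		if (fd >= 0) {
			write(fd, "999", 3);
			close(fd);
		}

		fd = open("/proc/self/oom_adj", O_RDONLY);
		if (fd >= 0) {
			read(fd, buf, sizeof(buf) - 1);
			close(fd);
		}

		/* Prints "16" without this patch, "15" (OOM_ADJUST_MAX) with it. */
		printf("oom_adj: %s", buf);
		return 0;
	}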

Link: https://lkml.kernel.org/r/20201020165130.33927-1-chaithco@redhat.com
Signed-off-by: Charles Haithcock <chaithco@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/base.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/fs/proc/base.c~mm-oom-keep-oom_adj-under-or-at-upper-limit-when-printing
+++ a/fs/proc/base.c
@@ -1049,6 +1049,8 @@ static ssize_t oom_adj_read(struct file
 		oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
 			  OOM_SCORE_ADJ_MAX;
 	put_task_struct(task);
+	if (oom_adj > OOM_ADJUST_MAX)
+		oom_adj = OOM_ADJUST_MAX;
 	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
_

* [patch 12/15] mm: always have io_remap_pfn_range() set pgprot_decrypted()
  2020-11-02  1:06 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2020-11-02  1:07 ` [patch 11/15] mm, oom: keep oom_adj under or at upper limit when printing Andrew Morton
@ 2020-11-02  1:08 ` Andrew Morton
  2020-11-02  1:08 ` [patch 13/15] epoll: check ep_events_available() upon timeout Andrew Morton
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:08 UTC (permalink / raw)
  To: akpm, aryabinin, bp, brijesh.singh, corbet, dvyukov, dyoung,
	glider, jgg, konrad.wilk, linux-mm, luto, lwoodman, matt, mingo,
	mm-commits, mst, pbonzini, peterz, riel, stable, tglx,
	thomas.lendacky, torvalds, toshi.kani

From: Jason Gunthorpe <jgg@nvidia.com>
Subject: mm: always have io_remap_pfn_range() set pgprot_decrypted()

The purpose of io_remap_pfn_range() is to map IO memory, such as a memory
mapped IO exposed through a PCI BAR.  IO devices do not understand
encryption, so this memory must always be decrypted.  Automatically call
pgprot_decrypted() as part of the generic implementation.

This fixes a bug where enabling AMD SME breaks subsystems, such as RDMA,
that use io_remap_pfn_range() to expose BAR pages to user space: the CPU
encrypts accesses to those BAR pages instead of passing unencrypted IO
directly to the device.

Places not mapping IO should use remap_pfn_range().
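
For illustration (a hedged sketch, not taken from the patch), a typical
driver ->mmap handler exposing a BAR now gets the decrypted protection
automatically; the mydrv_* names and the stored BAR address are
hypothetical:

	#include <linux/fs.h>
	#include <linux/mm.h>

	/* Hypothetical BAR physical address, e.g. saved from
	 * pci_resource_start() at probe time. */
	static phys_addr_t mydrv_bar_phys;

	static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long size = vma->vm_end - vma->vm_start;

		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
		/* io_remap_pfn_range() now applies pgprot_decrypted() itself,
		 * so no SME-specific handling is needed here. */
		return io_remap_pfn_range(vma, vma->vm_start,
					  mydrv_bar_phys >> PAGE_SHIFT,
					  size, vma->vm_page_prot);
	}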

Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com
Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Dave Young" <dyoung@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h      |    9 +++++++++
 include/linux/pgtable.h |    4 ----
 2 files changed, 9 insertions(+), 4 deletions(-)

--- a/include/linux/mm.h~mm-always-have-io_remap_pfn_range-set-pgprot_decrypted
+++ a/include/linux/mm.h
@@ -2759,6 +2759,15 @@ static inline vm_fault_t vmf_insert_page
 	return VM_FAULT_NOPAGE;
 }
 
+#ifndef io_remap_pfn_range
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+				     unsigned long addr, unsigned long pfn,
+				     unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, addr, pfn, size, pgprot_decrypted(prot));
+}
+#endif
+
 static inline vm_fault_t vmf_error(int err)
 {
 	if (err == -ENOMEM)
--- a/include/linux/pgtable.h~mm-always-have-io_remap_pfn_range-set-pgprot_decrypted
+++ a/include/linux/pgtable.h
@@ -1427,10 +1427,6 @@ typedef unsigned int pgtbl_mod_mask;
 
 #endif /* !__ASSEMBLY__ */
 
-#ifndef io_remap_pfn_range
-#define io_remap_pfn_range remap_pfn_range
-#endif
-
 #ifndef has_transparent_hugepage
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define has_transparent_hugepage() 1
_

* [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02  1:06 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2020-11-02  1:08 ` [patch 12/15] mm: always have io_remap_pfn_range() set pgprot_decrypted() Andrew Morton
@ 2020-11-02  1:08 ` Andrew Morton
  2020-11-02 17:08   ` Linus Torvalds
  2020-11-02  1:08 ` [patch 14/15] epoll: add a selftest for epoll timeout race Andrew Morton
  2020-11-02  1:08 ` [patch 15/15] kernel/hung_task.c: make type annotations consistent Andrew Morton
  14 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:08 UTC (permalink / raw)
  To: akpm, dave, edumazet, guantaol, khazhy, linux-mm, mm-commits,
	soheil, torvalds, viro, willemb

From: Soheil Hassas Yeganeh <soheil@google.com>
Subject: epoll: check ep_events_available() upon timeout

After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
timeout"), we break out of the ep_poll loop upon timeout without checking
whether any new events are available.  Prior to that patch series we
always called ep_events_available() after exiting the loop.

This can cause races and missed wakeups.  For example, consider the
following scenario reported by Guantao Liu:

Suppose we have an eventfd added using EPOLLET to an epollfd.

Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
          event of the eventfd, it will write back on that fd.
Thread 3: Calls epoll_wait with a negative timeout.

Prior to abc610e01c66, it was guaranteed that Thread 3 would be woken up
by either Thread 1 or Thread 2.  After abc610e01c66, Thread 3 can be
blocked indefinitely if Thread 2 sees a timeout right before the write to
the eventfd by Thread 1: Thread 2 is woken up from
schedule_hrtimeout_range() and, with eavail == 0, does not call
ep_send_events().

To fix this issue, while holding the lock, try to remove the thread that
timed out from the wait queue and check whether it was woken up or not.

Link: https://lkml.kernel.org/r/20201028180202.952079-1-soheil.kdev@gmail.com
Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: Guantao Liu <guantaol@google.com>
Tested-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/eventpoll.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- a/fs/eventpoll.c~epoll-check-ep_events_available-upon-timeout
+++ a/fs/eventpoll.c
@@ -1907,7 +1907,21 @@ fetch_events:
 
 		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
 			timed_out = 1;
-			break;
+			__set_current_state(TASK_RUNNING);
+			/*
+			 * Acquire the lock and try to remove this thread from
+			 * the wait queue. If this thread is not on the wait
+			 * queue, it has woken up after its timeout ended
+			 * before it could re-acquire the lock. In that case,
+			 * try to harvest some events.
+			 */
+			write_lock_irq(&ep->lock);
+			if (!list_empty(&wait.entry))
+				__remove_wait_queue(&ep->wq, &wait);
+			else
+				eavail = 1;
+			write_unlock_irq(&ep->lock);
+			goto send_events;
 		}
 
 		/* We were woken up, thus go and try to harvest some events */
_

* [patch 14/15] epoll: add a selftest for epoll timeout race
  2020-11-02  1:06 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2020-11-02  1:08 ` [patch 13/15] epoll: check ep_events_available() upon timeout Andrew Morton
@ 2020-11-02  1:08 ` Andrew Morton
  2020-11-02  1:08 ` [patch 15/15] kernel/hung_task.c: make type annotations consistent Andrew Morton
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:08 UTC (permalink / raw)
  To: akpm, dave, edumazet, guantaol, khazhy, linux-mm, mm-commits,
	soheil, torvalds, viro, willemb

From: Soheil Hassas Yeganeh <soheil@google.com>
Subject: epoll: add a selftest for epoll timeout race

Add a test case to ensure an event is observed by at least one poller when
an epoll timeout is used.

Link: https://lkml.kernel.org/r/20201028180202.952079-2-soheil.kdev@gmail.com
Signed-off-by: Guantao Liu <guantaol@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c |   95 ++++++++++
 1 file changed, 95 insertions(+)

--- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c~epoll-add-a-selftest-for-epoll-timeout-race
+++ a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
@@ -3282,4 +3282,99 @@ TEST(epoll60)
 	close(ctx.epfd);
 }
 
+struct epoll61_ctx {
+	int epfd;
+	int evfd;
+};
+
+static void *epoll61_write_eventfd(void *ctx_)
+{
+	struct epoll61_ctx *ctx = ctx_;
+	int64_t l = 1;
+
+	usleep(10950);
+	write(ctx->evfd, &l, sizeof(l));
+	return NULL;
+}
+
+static void *epoll61_epoll_with_timeout(void *ctx_)
+{
+	struct epoll61_ctx *ctx = ctx_;
+	struct epoll_event events[1];
+	int n;
+
+	n = epoll_wait(ctx->epfd, events, 1, 11);
+	/*
+	 * If epoll returned the eventfd, write on the eventfd to wake up the
+	 * blocking poller.
+	 */
+	if (n == 1) {
+		int64_t l = 1;
+
+		write(ctx->evfd, &l, sizeof(l));
+	}
+	return NULL;
+}
+
+static void *epoll61_blocking_epoll(void *ctx_)
+{
+	struct epoll61_ctx *ctx = ctx_;
+	struct epoll_event events[1];
+
+	epoll_wait(ctx->epfd, events, 1, -1);
+	return NULL;
+}
+
+TEST(epoll61)
+{
+	struct epoll61_ctx ctx;
+	struct epoll_event ev;
+	int i, r;
+
+	ctx.epfd = epoll_create1(0);
+	ASSERT_GE(ctx.epfd, 0);
+	ctx.evfd = eventfd(0, EFD_NONBLOCK);
+	ASSERT_GE(ctx.evfd, 0);
+
+	ev.events = EPOLLIN | EPOLLET | EPOLLERR | EPOLLHUP;
+	ev.data.ptr = NULL;
+	r = epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, ctx.evfd, &ev);
+	ASSERT_EQ(r, 0);
+
+	/*
+	 * We are testing a race.  Repeat the test case 1000 times to make it
+	 * more likely to fail in case of a bug.
+	 */
+	for (i = 0; i < 1000; i++) {
+		pthread_t threads[3];
+		int n;
+
+		/*
+		 * Start 3 threads:
+		 * Thread 1 sleeps for 10.9ms and writes to the eventfd.
+		 * Thread 2 calls epoll with a timeout of 11ms.
+		 * Thread 3 calls epoll with a timeout of -1.
+		 *
+		 * The eventfd write by Thread 1 should either wakeup Thread 2
+		 * or Thread 3.  If it wakes up Thread 2, Thread 2 writes on the
+		 * eventfd to wake up Thread 3.
+		 *
+		 * If no events are missed, all three threads should eventually
+		 * be joinable.
+		 */
+		ASSERT_EQ(pthread_create(&threads[0], NULL,
+					 epoll61_write_eventfd, &ctx), 0);
+		ASSERT_EQ(pthread_create(&threads[1], NULL,
+					 epoll61_epoll_with_timeout, &ctx), 0);
+		ASSERT_EQ(pthread_create(&threads[2], NULL,
+					 epoll61_blocking_epoll, &ctx), 0);
+
+		for (n = 0; n < ARRAY_SIZE(threads); ++n)
+			ASSERT_EQ(pthread_join(threads[n], NULL), 0);
+	}
+
+	close(ctx.epfd);
+	close(ctx.evfd);
+}
+
 TEST_HARNESS_MAIN
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 15/15] kernel/hung_task.c: make type annotations consistent
  2020-11-02  1:06 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2020-11-02  1:08 ` [patch 14/15] epoll: add a selftest for epoll timeout race Andrew Morton
@ 2020-11-02  1:08 ` Andrew Morton
  14 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2020-11-02  1:08 UTC (permalink / raw)
  To: akpm, hch, linux-mm, lukas.bulwahn, mm-commits, penguin-kernel,
	rdna, torvalds, viro

From: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Subject: kernel/hung_task.c: make type annotations consistent

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
removed various __user annotations from function signatures as part of its
refactoring.

It also removed the __user annotation for proc_dohung_task_timeout_secs()
at its declaration in ./sched/sysctl.h, but not at its definition in
./kernel/hung_task.c.

Hence, sparse complains:

  kernel/hung_task.c:271:5: error: \
  symbol 'proc_dohung_task_timeout_secs' redeclared with different type \
  (incompatible argument 3 (different address spaces))

Adjust the annotation at the definition to match that refactoring and make
sparse happy again, which also resolves this warning from sparse:

  kernel/hung_task.c:277:52: warning: incorrect type in argument 3 \
    (different address spaces)
  kernel/hung_task.c:277:52:    expected void *
  kernel/hung_task.c:277:52:    got void [noderef] __user *buffer

No functional change. No change in object code.
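
As a side note, the warning can usually be reproduced with sparse
installed by forcing a re-check of the file, e.g.:

  make C=2 kernel/hung_task.o

(C=2 is the standard kbuild switch for running sparse unconditionally;
it is not something this patch adds.)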

Link: https://lkml.kernel.org/r/20201028130541.20320-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/hung_task.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/hung_task.c~kernel-hung_taskc-make-type-annotations-consistent
+++ a/kernel/hung_task.c
@@ -225,8 +225,7 @@ static long hung_timeout_jiffies(unsigne
  * Process updating of timeout sysctl
  */
 int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
-				  void __user *buffer,
-				  size_t *lenp, loff_t *ppos)
+				  void *buffer, size_t *lenp, loff_t *ppos)
 {
 	int ret;
 
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02  1:08 ` [patch 13/15] epoll: check ep_events_available() upon timeout Andrew Morton
@ 2020-11-02 17:08   ` Linus Torvalds
  2020-11-02 17:48     ` Soheil Hassas Yeganeh
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2020-11-02 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Davidlohr Bueso, Eric Dumazet, guantaol, khazhy, Linux-MM,
	mm-commits, Soheil Hassas Yeganeh, Al Viro, willemb

On Sun, Nov 1, 2020 at 5:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
> timeout"), we break out of the ep_poll loop upon timeout, without checking
> whether there are any new events available.  Prior to that patch series we
> always called ep_events_available() after exiting the loop.

This patch looks overly complicated to me.

It does the exact same thing as the "break" does, except:

 - it does it the non-optimized way without the "avoid spinlock"

 - it sets eavail if there was a wakeup

It would seem like the *much* simpler patch is to just do this oneliner instead:

    diff --git a/fs/eventpoll.c b/fs/eventpoll.c
    index 4df61129566d..29fa770ce1e3 100644
    --- a/fs/eventpoll.c
    +++ b/fs/eventpoll.c
    @@ -1907,6 +1907,7 @@ static int ep_poll(struct eventpoll *ep,
struct epoll_event __user *events,

                if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
                        timed_out = 1;
    +                   eavail = 1;
                        break;
                }

and *boom* you're done. That will mean that after a timeout we'll try
one more time to just do that ep_events_available() thing.

I can see no downside to just setting eavail unconditionally and
keeping the code much simpler. Hmm?

             Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02 17:08   ` Linus Torvalds
@ 2020-11-02 17:48     ` Soheil Hassas Yeganeh
  2020-11-02 18:51       ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-02 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davidlohr Bueso, Eric Dumazet, Guantao Liu,
	Khazhismel Kumykov, Linux-MM, mm-commits, Al Viro,
	Willem de Bruijn

On Mon, Nov 2, 2020 at 12:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Nov 1, 2020 at 5:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
> > timeout"), we break out of the ep_poll loop upon timeout, without checking
> > whether there are any new events available.  Prior to that patch series we
> > always called ep_events_available() after exiting the loop.
>
> This patch looks overly complicated to me.
>
> It does the exact same thing as the "break" does, except:
>
>  - it does it the non-optimized way without the "avoid spinlock"
>
>  - it sets eavail if there was a wakeup
>
> It would seem like the *much* simpler patch is to just do this oneliner instead:
>
>     diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>     index 4df61129566d..29fa770ce1e3 100644
>     --- a/fs/eventpoll.c
>     +++ b/fs/eventpoll.c
>     @@ -1907,6 +1907,7 @@ static int ep_poll(struct eventpoll *ep,
> struct epoll_event __user *events,
>
>                 if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
>                         timed_out = 1;
>     +                   eavail = 1;
>                         break;
>                 }
>
> and *boom* you're done. That will mean that after a timeout we'll try
> one more time to just do that ep_events_available() thing.

Thank you for the suggestion. That was the first version I tried, and
I can confirm it fixes the race because we call ep_send_events() once
more before returning.  Though, I believe, due to timed_out=1, we won't
goto fetch_events to call ep_events_available():

if (!res && eavail &&
   !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
 goto fetch_events;

> I can see no downside to just setting eavail unconditionally and
> keeping the code much simpler. Hmm?

You're spot on that the patch is more complicated than your
suggestion.  However, the downside I observed was a performance
regression for the non-racy case: Suppose there are a few threads with
a similar non-zero timeout and no ready event. They will all
experience a noticeable contention in ep_scan_ready_list, by
unconditionally calling ep_send_events(). The contention was large
because there will be 2 write locks on ep->lock and one mutex lock on
ep->mtx with a large critical section.

FWIW, I also tried the following idea to eliminate that contention but
I couldn't prove that it's correct, because of the gap between calling
ep_events_available and removing this thread from the wait queue.
That was why I resorted to calling __remove_wait_queue() under the
lock.

     diff --git a/fs/eventpoll.c b/fs/eventpoll.c
     index 4df61129566d..29fa770ce1e3 100644
     --- a/fs/eventpoll.c
     +++ b/fs/eventpoll.c
     @@ -1907,6 +1907,7 @@ static int ep_poll(struct eventpoll *ep,
 struct epoll_event __user *events,

                 if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
                         timed_out = 1;
     +                   eavail = ep_events_available(ep);
                         break;
                 }

Also, please note that the non-optimized way without the "avoid
spinlock" is not problematic in any of our benchmarks because, when
the thread times out, it's likely on the wait queue except for this
particular racy scenario.

Thanks!
Soheil


>              Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02 17:48     ` Soheil Hassas Yeganeh
@ 2020-11-02 18:51       ` Linus Torvalds
  2020-11-02 19:38         ` Linus Torvalds
  2020-11-02 19:54         ` Soheil Hassas Yeganeh
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2020-11-02 18:51 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Andrew Morton, Davidlohr Bueso, Eric Dumazet, Guantao Liu,
	Khazhismel Kumykov, Linux-MM, mm-commits, Al Viro,
	Willem de Bruijn

On Mon, Nov 2, 2020 at 9:49 AM Soheil Hassas Yeganeh <soheil@google.com> wrote:
>
> Thank you for the suggestion. That was the first version I tried, and
> I can confirm it fixes the race because we call ep_send_events() once
> more before returning.  Though, I believe, due to timed_out=1, we won't
> goto fetch_events to call ep_events_available():
>
> if (!res && eavail &&
>    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
>  goto fetch_events;

Right. We won't be repeating the loop, but we will do one final send_events.

Which I think is really the point, no?

> You're spot on that the patch is more complicated than your
> suggestion.  However, the downside I observed was a performance
> regression for the non-racy case: Suppose there are a few threads with
> a similar non-zero timeout and no ready event. They will all
> experience a noticeable contention in ep_scan_ready_list, by
> unconditionally calling ep_send_events(). The contention was large
> because there will be 2 write locks on ep->lock and one mutex lock on
> ep->mtx with a large critical section.

Ugh. I really detest the eventpoll code. Afaik, it has no normal users
anywhere, so it gets no real coverage except for the odd cases
(presumably inside google?)

A lot of the work over the last couple of years has been to try to
simplify the code and streamline it, and fix bugs due to recursion.

I really wish we would continue that pattern of trying to simplify
this code rather than add more complexity on top of it, which is why I
reacted so strongly to that patch.

And that whole ep_poll() function is written in just about the most
confusing way possible, with code that looks like loops but aren't
("while (0)") and goto's that _are_ loops ("goto fetch_events").

I'll go stare at it some more.

                   Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02 18:51       ` Linus Torvalds
@ 2020-11-02 19:38         ` Linus Torvalds
  2020-11-02 19:54         ` Soheil Hassas Yeganeh
  1 sibling, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2020-11-02 19:38 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Andrew Morton, Davidlohr Bueso, Eric Dumazet, Guantao Liu,
	Khazhismel Kumykov, Linux-MM, mm-commits, Al Viro,
	Willem de Bruijn

[-- Attachment #1: Type: text/plain, Size: 1713 bytes --]

On Mon, Nov 2, 2020 at 10:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll go stare at it some more.

That code is fundamentally broken in so many ways.

Look at how  ep_poll() does that ep_events_available() without
actually holding the ep->lock (or the ep->mtx) in half the cases.

End result: it works in 99.9% of all cases, but there's a race with
somebody else doing

        WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
        /*
         * Quickly re-inject items left on "txlist".
         */
        list_splice(&txlist, &ep->rdllist);

when that code can see an empty rdllist and an inactive ovflist and
thus decide that there are no events available.

I think the "Quickly re-inject" comment may be because some people
knew of that race.
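
Annotating that window (the comments are my gloss on the snippet above;
the surrounding ep_scan_ready_list() details are from memory, not from
this mail):

        /* tail of the ready-list scan: undelivered events still sit on txlist */
        WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
        /*
         * <-- a lockless ep_events_available() running right here sees an
         *     empty rdllist and an inactive ovflist and reports "no events",
         *     although txlist is re-injected just below.
         */
        list_splice(&txlist, &ep->rdllist);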

The signal handling is also odd and looks broken.

The "short circuit on fatal signals" means that ep_send_events() isn't
actually done on a SIGKILL, but the code also used an exclusive wait,
so nobody else will be woken up either.

Admittedly you can steal wakeups other ways, by simply not caring
about the end result, so maybe that's all just inherent in epoll
anyway. But it looks strange, and it seems pointless: the right thing
to do would seem to be simply to have a regular check for
signal_pending(), and returning -EINTR if so, rather than looping.

And that do { } while (0) is entirely pointless. It seems to exist in
order to use "break" instead of the goto that everything else does,
which I guess is nice, except the whole need for that comes from how
oddly the code is written.

Why doesn't this all do something like the attached instead?

NOTE! I did not bother to fix that ep_events_available() race.

                      Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 4551 bytes --]

 fs/eventpoll.c | 114 +++++++++++++++++++++++++++------------------------------
 1 file changed, 54 insertions(+), 60 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..d4732327f57a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1864,85 +1864,79 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	 */
 	ep_reset_busy_poll_napi_id(ep);
 
-	do {
-		/*
-		 * Internally init_wait() uses autoremove_wake_function(),
-		 * thus wait entry is removed from the wait queue on each
-		 * wakeup. Why it is important? In case of several waiters
-		 * each new wakeup will hit the next waiter, giving it the
-		 * chance to harvest new event. Otherwise wakeup can be
-		 * lost. This is also good performance-wise, because on
-		 * normal wakeup path no need to call __remove_wait_queue()
-		 * explicitly, thus ep->lock is not taken, which halts the
-		 * event delivery.
-		 */
-		init_wait(&wait);
+	if (signal_pending(current))
+		return -EINTR;
 
-		write_lock_irq(&ep->lock);
-		/*
-		 * Barrierless variant, waitqueue_active() is called under
-		 * the same lock on wakeup ep_poll_callback() side, so it
-		 * is safe to avoid an explicit barrier.
-		 */
-		__set_current_state(TASK_INTERRUPTIBLE);
-
-		/*
-		 * Do the final check under the lock. ep_scan_ready_list()
-		 * plays with two lists (->rdllist and ->ovflist) and there
-		 * is always a race when both lists are empty for short
-		 * period of time although events are pending, so lock is
-		 * important.
-		 */
-		eavail = ep_events_available(ep);
-		if (!eavail) {
-			if (signal_pending(current))
-				res = -EINTR;
-			else
-				__add_wait_queue_exclusive(&ep->wq, &wait);
-		}
-		write_unlock_irq(&ep->lock);
-
-		if (eavail || res)
-			break;
+	/*
+	 * Internally init_wait() uses autoremove_wake_function(),
+	 * thus wait entry is removed from the wait queue on each
+	 * wakeup. Why it is important? In case of several waiters
+	 * each new wakeup will hit the next waiter, giving it the
+	 * chance to harvest new event. Otherwise wakeup can be
+	 * lost. This is also good performance-wise, because on
+	 * normal wakeup path no need to call __remove_wait_queue()
+	 * explicitly, thus ep->lock is not taken, which halts the
+	 * event delivery.
+	 */
+	init_wait(&wait);
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
-			timed_out = 1;
-			break;
-		}
+	write_lock_irq(&ep->lock);
+	/*
+	 * Barrierless variant, waitqueue_active() is called under
+	 * the same lock on wakeup ep_poll_callback() side, so it
+	 * is safe to avoid an explicit barrier.
+	 */
+	__set_current_state(TASK_INTERRUPTIBLE);
 
-		/* We were woken up, thus go and try to harvest some events */
-		eavail = 1;
+	/*
+	 * Do the final check under the lock. ep_scan_ready_list()
+	 * plays with two lists (->rdllist and ->ovflist) and there
+	 * is always a race when both lists are empty for short
+	 * period of time although events are pending, so lock is
+	 * important.
+	 */
+	eavail = ep_events_available(ep);
+	if (!eavail)
+		__add_wait_queue_exclusive(&ep->wq, &wait);
+	write_unlock_irq(&ep->lock);
 
-	} while (0);
+	if (!eavail)
+		timed_out = !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS);
 
 	__set_current_state(TASK_RUNNING);
 
+	/*
+	 * If we were woken up, assume there's something available, otherwise
+	 * remove ourselves from the wait queue and check carefully (since we
+	 * hold the lock anyway).
+	 */
+	eavail = 1;
 	if (!list_empty_careful(&wait.entry)) {
 		write_lock_irq(&ep->lock);
 		__remove_wait_queue(&ep->wq, &wait);
+		eavail = ep_events_available(ep);
 		write_unlock_irq(&ep->lock);
 	}
 
 send_events:
-	if (fatal_signal_pending(current)) {
-		/*
-		 * Always short-circuit for fatal signals to allow
-		 * threads to make a timely exit without the chance of
-		 * finding more events available and fetching
-		 * repeatedly.
-		 */
-		res = -EINTR;
+	if (res)
+		return res;
+	if (eavail) {
+		res = ep_send_events(ep, events, maxevents);
+		if (res)
+			return res;
 	}
+	if (signal_pending(current))
+		return -EINTR;
+
 	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
+	 * In case we get 0 events and there's still timeout left over, we
+	 * go trying again in search of more luck.
 	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
+	if (!timed_out)
 		goto fetch_events;
 
-	return res;
+	return 0;
 }
 
 /**

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02 18:51       ` Linus Torvalds
  2020-11-02 19:38         ` Linus Torvalds
@ 2020-11-02 19:54         ` Soheil Hassas Yeganeh
  2020-11-02 20:12           ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-02 19:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davidlohr Bueso, Eric Dumazet, Guantao Liu,
	Khazhismel Kumykov, Linux-MM, mm-commits, Al Viro,
	Willem de Bruijn

On Mon, Nov 2, 2020 at 1:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 2, 2020 at 9:49 AM Soheil Hassas Yeganeh <soheil@google.com> wrote:
> >
> > Thank you for the suggestion. That was the first version I tried, and
> > I can confirm it fixes the race because we call ep_send_events() once
> > more before returning.  Though, I believe, due to timed_out=1, we won't
> > goto fetch_events to call ep_events_available():
> >
> > if (!res && eavail &&
> >    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> >  goto fetch_events;
>
> Right. We won't be repeating the loop, but we will do one final send_events.
>
> Which I think is really the point, no?

Yes, absolutely, that's the point.

> > You're spot on that the patch is more complicated than your
> > suggestion.  However, the downside I observed was a performance
> > regression for the non-racy case: Suppose there are a few threads with
> > a similar non-zero timeout and no ready event. They will all
> > experience a noticeable contention in ep_scan_ready_list, by
> > unconditionally calling ep_send_events(). The contention was large
> > because there will be 2 write locks on ep->lock and one mutex lock on
> > ep->mtx with a large critical section.
>
> Ugh. I really detest the eventpoll code. Afaik, it has no normal users
> anywhere, so it gets no real coverage except for the odd cases
> (presumably inside google?)
>
> A lot of the work over the last couple of years has been to try to
> simplify the code and streamline it, and fix bugs due to recursion.
>
> I really wish we would continue that pattern of trying to simplify
> this code rather than add more complexity on top of it, which is why I
> reacted so strongly to that patch.
>
> And that whole ep_poll() function is written in just about the most
> confusing way possible, with code that looks like loops but aren't
> ("while (0)") and goto's that _are_ loops ("goto fetch_events").

I wholeheartedly agree. Due to its cryptic implementation, it was
really difficult to think about the correctness of the fixes.

On Mon, Nov 2, 2020 at 2:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 2, 2020 at 10:51 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'll go stare at it some more.
>
> That code is fundamentally broken in so many ways.
>
> Look at how  ep_poll() does that ep_events_available() without
> actually holding the ep->lock (or the ep->mtx) in half the cases.
>
> End result: it works in 99.9% of all cases, but there's a race with
> somebody else doing
>
>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>         /*
>          * Quickly re-inject items left on "txlist".
>          */
>         list_splice(&txlist, &ep->rdllist);
>
> when that code can see an empty rdllist and an inactive ovflist and
> thus decide that there are no events available.
>
> I think the "Quickly re-inject" comment may be because some people
> knew of that race.
>
> The signal handling is also odd and looks broken.
>
> The "short circuit on fatal signals" means that ep_send_events() isn't
> actually done on a SIGKILL, but the code also used an exclusive wait,
> so nobody else will be woken up either.
>
> Admittedly you can steal wakeups other ways, by simply not caring
> about the end result, so maybe that's all just inherent in epoll
> anyway. But it looks strange, and it seems pointless: the right thing
> to do would seem to be simply to have a regular check for
> signal_pending(), and returning -EINTR if so, rather than looping.
>
> And that do { } while (0) is entirely pointless. It seems to exist in
> order to use "break" instead of the goto that everything else does,
> which I guess is nice, except the whole need for that comes from how
> oddly the code is written.
>
> Why doesn't this all do something like the attached instead?

This looks really great to me! Thank you!  I'll give it a try and get
back to you soon.

> NOTE! I did not bother to fix that ep_events_available() race.

Given that you're calling ep_events_available() under lock, I think
this should address the inefficiency for the non-racy timeout case I
mentioned above. The remaining races are preexisting and all result in
spurious events, which should be fine.

Thanks again!
Soheil

>                       Linus




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 13/15] epoll: check ep_events_available() upon timeout
  2020-11-02 19:54         ` Soheil Hassas Yeganeh
@ 2020-11-02 20:12           ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2020-11-02 20:12 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Andrew Morton, Davidlohr Bueso, Eric Dumazet, Guantao Liu,
	Khazhismel Kumykov, Linux-MM, mm-commits, Al Viro,
	Willem de Bruijn

On Mon, Nov 2, 2020 at 11:55 AM Soheil Hassas Yeganeh <soheil@google.com> wrote:
>
> Given that you're calling ep_events_available() under lock, I think
> this should address the inefficiency for the non-racy timeout case,  I
> mentioned above. The remaining races are preexisting and all result in
> spurious events, which should be fine.

Well, they might result in not seeing events that are there, but for
that case we end up going the long way around and doing the "wait for
events" code, and that should take care of it, I guess.

So it's racy, but it gets fixed up later regardless of which way it
races, I think.

It might be worth a comment or two.

And that "goto fetch_events" thing might be worth writing as an actual
loop, rather than the pointless one I removed.

I suspect that old pointless loop might have been historical (some
kind of "loop waiting for event"), since that's how those things are
usually written. I didn't go and look at the history.
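
For illustration, the shape such a loop might take (a control-flow sketch
only: wait_for_events() is a made-up placeholder for the
init_wait()/schedule_hrtimeout_range() block, and the locking details
discussed above are ignored):

	while (1) {
		/* check for ready events first; sleep only if there are none */
		eavail = ep_events_available(ep);
		if (!eavail)
			eavail = wait_for_events(ep, to, slack, &timed_out);

		if (eavail) {
			res = ep_send_events(ep, events, maxevents);
			if (res)
				return res;
		}
		if (signal_pending(current))
			return -EINTR;
		if (timed_out)
			return 0;
	}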

Anyway, I've dropped the original 13/15 patch, but I'll apply the
test-case one (14/15) even if it might now fail in the current state.

I hope that you will end up submitting some cleaned-up (and tested!)
version of that patch I cobbled together - I'll archive this thread
for now on the assumption that the resolution of all this will come
back to me later...

                     Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-11-02 20:14 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-02  1:06 incoming Andrew Morton
2020-11-02  1:07 ` [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates Andrew Morton
2020-11-02  1:07 ` [patch 02/15] hugetlb_cgroup: fix reservation accounting Andrew Morton
2020-11-02  1:07 ` [patch 03/15] mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg Andrew Morton
2020-11-02  1:07 ` [patch 04/15] mm: memcg: link page counters to root if use_hierarchy is false Andrew Morton
2020-11-02  1:07 ` [patch 05/15] kasan: adopt KUNIT tests to SW_TAGS mode Andrew Morton
2020-11-02  1:07 ` [patch 06/15] mm: mempolicy: fix potential pte_unmap_unlock pte error Andrew Morton
2020-11-02  1:07 ` [patch 07/15] ptrace: fix task_join_group_stop() for the case when current is traced Andrew Morton
2020-11-02  1:07 ` [patch 08/15] lib/crc32test: remove extra local_irq_disable/enable Andrew Morton
2020-11-02  1:07 ` [patch 09/15] mm/truncate.c: make __invalidate_mapping_pages() static Andrew Morton
2020-11-02  1:07 ` [patch 10/15] kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled Andrew Morton
2020-11-02  1:07 ` [patch 11/15] mm, oom: keep oom_adj under or at upper limit when printing Andrew Morton
2020-11-02  1:08 ` [patch 12/15] mm: always have io_remap_pfn_range() set pgprot_decrypted() Andrew Morton
2020-11-02  1:08 ` [patch 13/15] epoll: check ep_events_available() upon timeout Andrew Morton
2020-11-02 17:08   ` Linus Torvalds
2020-11-02 17:48     ` Soheil Hassas Yeganeh
2020-11-02 18:51       ` Linus Torvalds
2020-11-02 19:38         ` Linus Torvalds
2020-11-02 19:54         ` Soheil Hassas Yeganeh
2020-11-02 20:12           ` Linus Torvalds
2020-11-02  1:08 ` [patch 14/15] epoll: add a selftest for epoll timeout race Andrew Morton
2020-11-02  1:08 ` [patch 15/15] kernel/hung_task.c: make type annotations consistent Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).