* [PATCH 0/2] mm, oom: fix oom_reaper fallouts
@ 2017-08-07 11:38 ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-07 11:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Tetsuo Handa,
	Oleg Nesterov, Wenwei Tao, linux-mm, LKML

Hi,
there are two issues this patch series attempts to fix. The first one is
something that has been broken since the MMF_UNSTABLE flag was introduced,
and I guess we should backport it to stable trees (patch 1). The other
issue has been brought up by Wenwei Tao, and Tetsuo Handa has created
a test case to trigger it very reliably. I am not yet sure this is
stable material because the test case is rather artificial. If there is
a demand for a stable backport I will prepare it, of course.

I hope I've done the second patch correctly but I would definitely
appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
previous attempt with some more context was posted here
http://lkml.kernel.org/r/20170803135902.31977-1-mhocko@kernel.org

My testing didn't show anything unusual with these two applied on top of
the mmotm tree.

^ permalink raw reply	[flat|nested] 58+ messages in thread


* [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
  2017-08-07 11:38 ` Michal Hocko
@ 2017-08-07 11:38   ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-07 11:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Tetsuo Handa,
	Oleg Nesterov, Wenwei Tao, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat:
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. The MMF_UNSTABLE check, however,
rewrites the return code to VM_FAULT_SIGBUS, and callers always expect
the mmap_sem to be held in that path. Fix this by re-taking the mmap_sem
when VM_FAULT_RETRY is returned in the MMF_UNSTABLE path. We cannot
simply add VM_FAULT_SIGBUS to the existing error code because all arch
specific page fault handlers and g-u-p would have to learn a new error
code combination.
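
For context, below is a heavily trimmed, hedged sketch of the pattern the
arch specific page fault handlers follow (loosely modeled on x86's
__do_page_fault; the structure and details are simplified for illustration
and are not the exact code). It shows why rewriting VM_FAULT_RETRY to
VM_FAULT_SIGBUS without re-taking the mmap_sem ends up as a double unlock:

	/*
	 * Simplified illustration only: the arch handler assumes that
	 * handle_mm_fault() keeps the mmap_sem held unless it returns
	 * VM_FAULT_RETRY.
	 */
	down_read(&mm->mmap_sem);
	fault = handle_mm_fault(vma, address, flags);

	if (fault & VM_FAULT_RETRY) {
		/* handle_mm_fault() has already dropped the mmap_sem */
		return;
	}

	/*
	 * Any other return value, including VM_FAULT_SIGBUS, means the
	 * mmap_sem is expected to still be held. If the MMF_UNSTABLE
	 * check rewrote a RETRY to SIGBUS without re-taking the lock,
	 * this up_read() is the second unlock and triggers the lockdep
	 * splat above.
	 */
	up_read(&mm->mmap_sem);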

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

^ permalink raw reply related	[flat|nested] 58+ messages in thread


* [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-07 11:38 ` Michal Hocko
@ 2017-08-07 11:38   ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-07 11:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Tetsuo Handa,
	Oleg Nesterov, Wenwei Tao, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption, namely that the oom
victim is dying and will not make any further visible changes to its
address space so the oom_reaper can safely tear it down, is not entirely
true.

__task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT
is set, but do_group_exit sends SIGKILL to all threads _after_ the
flag is set. So there is a race window during which some threads do not
yet have fatal_signal_pending set while the oom_reaper could already be
unmapping the address space. Moreover, some paths might not check for
fatal signals before each PF/g-u-p/copy_from_user.
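
As an illustration of the window (a hedged sketch; the thread labels below
are invented for the example and are not taken from the code):

	/*
	 * thread A: do_group_exit() sets SIGNAL_GROUP_EXIT
	 * oom path: __task_will_free_mem() sees the flag, the task is
	 *           treated as an exiting victim and the oom_reaper
	 *           starts unmapping its address space
	 * thread B: still running with no fatal_signal_pending; a write
	 *           faults on the reaped range and gets a zero page back
	 * thread A: SIGKILL finally reaches thread B, but the zeroed (or
	 *           otherwise bogus) data has already been made visible
	 */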

We already have protection against oom_reaper vs. PF races by checking
MMF_UNSTABLE. This has, however, been checked only for kernel threads
(use_mm users) which can outlive the oom victim. A simple fix would be
to extend the current check in handle_mm_fault to all tasks, but that
would not be sufficient because the current check assumes that a kernel
thread bails out after getting EFAULT from get_user*/copy_from_user and
never re-reads the same address, which would now succeed because the PF
path has already established the page tables. This seems to be the case
for the only existing use_mm user currently (the virtio driver) but it
is rather fragile in general.

This is even more fragile for more complex paths such as
generic_perform_write, which can re-read the same address multiple times
(e.g. iov_iter_copy_from_user_atomic may fail and then
iov_iter_fault_in_readable is retried). Therefore we have to implement
MMF_UNSTABLE protection in a robust way and never make potentially
corrupted content visible. That requires hooking deeper into the PF
path and checking for the flag _every time_ before a pte for anonymous
memory is established (that means all !VM_SHARED mappings).
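
To make the generic_perform_write fragility concrete, here is a heavily
trimmed sketch of its retry loop (based on the mm/filemap.c code of that
era, with error handling and most details dropped; treat it as an
illustration rather than the exact code):

	do {
		/*
		 * Pre-fault the user page. Once the oom_reaper has
		 * unmapped it, the refault installs a zero page, so
		 * this no longer fails with -EFAULT.
		 */
		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
			status = -EFAULT;
			break;
		}

		status = a_ops->write_begin(file, mapping, pos, bytes,
					    flags, &page, &fsdata);

		/* copies from the (possibly zero-filled) user page */
		copied = iov_iter_copy_from_user_atomic(page, i,
							offset, bytes);

		status = a_ops->write_end(file, mapping, pos, bytes,
					  copied, page, fsdata);

		/*
		 * If the atomic copy faulted, loop around and re-read
		 * the very same address - exactly the re-read the
		 * changelog is worried about.
		 */
		if (unlikely(copied == 0))
			continue;

		pos += copied;
		written += copied;
	} while (iov_iter_count(i));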

The corruption can be triggered artificially [1] but there do not seem
to be any real life bug reports. The race window is presumably too
narrow to trigger most of the time.

Fixes: aac453635549 ("mm, oom: introduce oom reaper")
Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

[1] http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp
---
 include/linux/oom.h | 22 ++++++++++++++++++++++
 mm/huge_memory.c    | 30 ++++++++++++++++++++++--------
 mm/memory.c         | 46 ++++++++++++++++++++--------------------------
 3 files changed, 64 insertions(+), 34 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2be5a6..76aac4ce39bc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -6,6 +6,8 @@
 #include <linux/types.h>
 #include <linux/nodemask.h>
 #include <uapi/linux/oom.h>
+#include <linux/sched/coredump.h> /* MMF_* */
+#include <linux/mm.h> /* VM_FAULT* */
 
 struct zonelist;
 struct notifier_block;
@@ -63,6 +65,26 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
+/*
+ * Checks whether a page fault on the given mm is still reliable.
+ * This is no longer true if the oom reaper started to reap the
+ * address space which is reflected by MMF_UNSTABLE flag set in
+ * the mm. At that moment any !shared mapping would lose the content
+ * and could cause a memory corruption (zero pages instead of the
+ * original content).
+ *
+ * User should call this before establishing a page table entry for
+ * a !shared mapping and under the proper page table lock.
+ *
+ * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
+ */
+static inline int check_stable_address_space(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
+		return VM_FAULT_SIGBUS;
+	return 0;
+}
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..b03cfc0d3141 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -32,6 +32,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
+#include <linux/oom.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -550,6 +551,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	int ret = 0;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +563,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto release;
 	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -576,13 +577,14 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_none(*vmf->pmd))) {
-		spin_unlock(vmf->ptl);
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		pte_free(vma->vm_mm, pgtable);
+		goto unlock_release;
 	} else {
 		pmd_t entry;
 
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock_release;
+
 		/* Deliver the page fault to userland */
 		if (userfaultfd_missing(vma)) {
 			int ret;
@@ -610,6 +612,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	}
 
 	return 0;
+unlock_release:
+	spin_unlock(vmf->ptl);
+release:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
 }
 
 /*
@@ -688,7 +699,10 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			if (userfaultfd_missing(vma)) {
+			ret = check_stable_address_space(vma->vm_mm);
+			if (ret) {
+				spin_unlock(vmf->ptl);
+			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
diff --git a/mm/memory.c b/mm/memory.c
index 4fe5b6254688..1b4504441bd2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -68,6 +68,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/oom.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2864,6 +2865,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mem_cgroup *memcg;
 	struct page *page;
+	int ret = 0;
 	pte_t entry;
 
 	/* File mapping without ->vm_ops ? */
@@ -2896,6 +2898,9 @@ static int do_anonymous_page(struct vm_fault *vmf)
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte))
 			goto unlock;
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2930,6 +2935,10 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (!pte_none(*vmf->pte))
 		goto release;
 
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret)
+		goto release;
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2949,7 +2958,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	return ret;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
@@ -3223,7 +3232,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 int finish_fault(struct vm_fault *vmf)
 {
 	struct page *page;
-	int ret;
+	int ret = 0;
 
 	/* Did we COW the page? */
 	if ((vmf->flags & FAULT_FLAG_WRITE) &&
@@ -3231,7 +3240,15 @@ int finish_fault(struct vm_fault *vmf)
 		page = vmf->cow_page;
 	else
 		page = vmf->page;
-	ret = alloc_set_pte(vmf, vmf->memcg, page);
+
+	/*
+	 * check even for read faults because we might have lost our CoWed
+	 * page
+	 */
+	if (!(vmf->vma->vm_flags & VM_SHARED))
+		ret = check_stable_address_space(vmf->vma->vm_mm);
+	if (!ret)
+		ret = alloc_set_pte(vmf, vmf->memcg, page);
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
@@ -3871,29 +3888,6 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
-	/*
-	 * This mm has been already reaped by the oom reaper and so the
-	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
-	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-
-		/*
-		 * We are going to enforce SIGBUS but the PF path might have
-		 * dropped the mmap_sem already so take it again so that
-		 * we do not break expectations of all arch specific PF paths
-		 * and g-u-p
-		 */
-		if (ret & VM_FAULT_RETRY)
-			down_read(&vma->vm_mm->mmap_sem);
-		ret = VM_FAULT_SIGBUS;
-	}
-
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
-- 
2.13.2

^ permalink raw reply related	[flat|nested] 58+ messages in thread


* Re: [PATCH 0/2] mm, oom: fix oom_reaper fallouts
  2017-08-07 11:38 ` Michal Hocko
@ 2017-08-07 13:28   ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-07 13:28 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> Hi,
> there are two issues this patch series attempts to fix. The first one is
> something that has been broken since the MMF_UNSTABLE flag was introduced,
> and I guess we should backport it to stable trees (patch 1). The other
> issue has been brought up by Wenwei Tao, and Tetsuo Handa has created
> a test case to trigger it very reliably. I am not yet sure this is
> stable material because the test case is rather artificial. If there is
> a demand for a stable backport I will prepare it, of course.
> 
> I hope I've done the second patch correctly but I would definitely
> appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
> previous attempt with some more context was posted here
> http://lkml.kernel.org/r/20170803135902.31977-1-mhocko@kernel.org
> 
> My testing didn't show anything unusual with these two applied on top of
> the mmotm tree.

I really don't like your likely/unlikely speculation.
I can trigger this problem with 8 threads using 4.13.0-rc2-next-20170728.
The written data can be random values (it seems to be a portion of the a.out
memory image). I guess that unexpected information leakage is possible.

----------
$ cat 0804.c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>

#define NUMTHREADS 8
#define STACKSIZE 8192

static int pipe_fd[2] = { EOF, EOF };
static int file_writer(void *i)
{
        static char buffer[1048576];
        int fd;
        char buffer2[64] = { };
        snprintf(buffer2, sizeof(buffer2), "/tmp/file.%lu", (unsigned long) i);
        fd = open(buffer2, O_WRONLY | O_CREAT | O_APPEND, 0600);
        memset(buffer, 0xFF, sizeof(buffer));
        read(pipe_fd[0], buffer, 1);
        while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
        return 0;
}

int main(int argc, char *argv[])
{
        char *buf = NULL;
        unsigned long size;
        unsigned long i;
        char *stack;
        if (pipe(pipe_fd))
                return 1;
        stack = malloc(STACKSIZE * NUMTHREADS);
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        for (i = 0; i < NUMTHREADS; i++)
                if (clone(file_writer, stack + (i + 1) * STACKSIZE,
                          CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS |
                          CLONE_FILES, (void *) i) == -1)
                        break;
        close(pipe_fd[1]);
        /* Will cause OOM due to overcommit; if not use SysRq-f */
        for (i = 0; i < size; i += 4096)
                buf[i] = 0;
        kill(-1, SIGKILL);
        return 0;
}
$ gcc -Wall -O3 0804.c
$ while :; do ./a.out; cat /tmp/file.* | od -b; /bin/rm /tmp/file.*; done
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
6415240000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
4461000000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
3762500000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
4347360000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
6451300000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
3564010000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
3564020000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
5055400000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
5446620000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
6203470000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
4127560000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
5224170000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
5612450000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
4766420000 055 061 061 051 000 000 056 163 171 155 164 141 142 000 056 163
4766420020 164 162 164 141 142 000 056 163 150 163 164 162 164 141 142 000
4766420040 056 151 156 164 145 162 160 000 056 156 157 164 145 056 101 102
4766420060 111 055 164 141 147 000 056 156 157 164 145 056 147 156 165 056
4766420100 142 165 151 154 144 055 151 144 000 056 147 156 165 056 150 141
4766420120 163 150 000 056 144 171 156 163 171 155 000 056 144 171 156 163
4766420140 164 162 000 056 147 156 165 056 166 145 162 163 151 157 156 000
4766420160 056 147 156 165 056 166 145 162 163 151 157 156 137 162 000 056
4766420200 162 145 154 141 056 144 171 156 000 056 162 145 154 141 056 160
4766420220 154 164 000 056 151 156 151 164 000 056 164 145 170 164 000 056
4766420240 146 151 156 151 000 056 162 157 144 141 164 141 000 056 145 150
4766420260 137 146 162 141 155 145 137 150 144 162 000 056 145 150 137 146
4766420300 162 141 155 145 000 056 151 156 151 164 137 141 162 162 141 171
4766420320 000 056 146 151 156 151 137 141 162 162 141 171 000 056 152 143
4766420340 162 000 056 144 171 156 141 155 151 143 000 056 147 157 164 000
4766420360 056 147 157 164 056 160 154 164 000 056 144 141 164 141 000 056
4766420400 142 163 163 000 056 143 157 155 155 145 156 164 000 000 000 000
4766420420 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766420440 000 000 000 000 000 000 000 000 000 000 000 000 003 000 001 000
4766420460 070 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766420500 000 000 000 000 003 000 002 000 124 002 100 000 000 000 000 000
4766420520 000 000 000 000 000 000 000 000 000 000 000 000 003 000 003 000
4766420540 164 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766420560 000 000 000 000 003 000 004 000 230 002 100 000 000 000 000 000
4766420600 000 000 000 000 000 000 000 000 000 000 000 000 003 000 005 000
4766420620 270 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766420640 000 000 000 000 003 000 006 000 010 004 100 000 000 000 000 000
4766420660 000 000 000 000 000 000 000 000 000 000 000 000 003 000 007 000
4766420700 206 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766420720 000 000 000 000 003 000 010 000 250 004 100 000 000 000 000 000
4766420740 000 000 000 000 000 000 000 000 000 000 000 000 003 000 011 000
4766420760 310 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421000 000 000 000 000 003 000 012 000 340 004 100 000 000 000 000 000
4766421020 000 000 000 000 000 000 000 000 000 000 000 000 003 000 013 000
4766421040 030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421060 000 000 000 000 003 000 014 000 100 006 100 000 000 000 000 000
4766421100 000 000 000 000 000 000 000 000 000 000 000 000 003 000 015 000
4766421120 040 007 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421140 000 000 000 000 003 000 016 000 024 012 100 000 000 000 000 000
4766421160 000 000 000 000 000 000 000 000 000 000 000 000 003 000 017 000
4766421200 040 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421220 000 000 000 000 003 000 020 000 100 012 100 000 000 000 000 000
4766421240 000 000 000 000 000 000 000 000 000 000 000 000 003 000 021 000
4766421260 200 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421300 000 000 000 000 003 000 022 000 020 016 140 000 000 000 000 000
4766421320 000 000 000 000 000 000 000 000 000 000 000 000 003 000 023 000
4766421340 030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421360 000 000 000 000 003 000 024 000 040 016 140 000 000 000 000 000
4766421400 000 000 000 000 000 000 000 000 000 000 000 000 003 000 025 000
4766421420 050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421440 000 000 000 000 003 000 026 000 370 017 140 000 000 000 000 000
4766421460 000 000 000 000 000 000 000 000 000 000 000 000 003 000 027 000
4766421500 000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421520 000 000 000 000 003 000 030 000 200 020 140 000 000 000 000 000
4766421540 000 000 000 000 000 000 000 000 000 000 000 000 003 000 031 000
4766421560 240 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421600 000 000 000 000 003 000 032 000 000 000 000 000 000 000 000 000
4766421620 000 000 000 000 000 000 000 000 001 000 000 000 004 000 361 377
4766421640 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421660 010 000 000 000 002 000 015 000 000 011 100 000 000 000 000 000
4766421700 221 000 000 000 000 000 000 000 024 000 000 000 001 000 031 000
4766421720 300 020 140 000 000 000 000 000 000 000 020 000 000 000 000 000
4766421740 040 000 000 000 001 000 030 000 220 020 140 000 000 000 000 000
4766421760 010 000 000 000 000 000 000 000 050 000 000 000 004 000 361 377
4766422000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422020 063 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
4766422040 000 000 000 000 000 000 000 000 100 000 000 000 002 000 015 000
4766422060 100 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422100 125 000 000 000 002 000 015 000 160 010 100 000 000 000 000 000
4766422120 000 000 000 000 000 000 000 000 150 000 000 000 002 000 015 000
4766422140 260 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422160 176 000 000 000 001 000 031 000 240 020 140 000 000 000 000 000
4766422200 001 000 000 000 000 000 000 000 215 000 000 000 001 000 023 000
4766422220 030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422240 264 000 000 000 002 000 015 000 320 010 100 000 000 000 000 000
4766422260 000 000 000 000 000 000 000 000 300 000 000 000 001 000 022 000
4766422300 020 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422320 050 000 000 000 004 000 361 377 000 000 000 000 000 000 000 000
4766422340 000 000 000 000 000 000 000 000 337 000 000 000 001 000 021 000
4766422360 300 013 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422400 355 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
4766422420 000 000 000 000 000 000 000 000 000 000 000 000 004 000 361 377
4766422440 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422460 371 000 000 000 000 000 022 000 030 016 140 000 000 000 000 000
4766422500 000 000 000 000 000 000 000 000 012 001 000 000 001 000 025 000
4766422520 050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422540 023 001 000 000 000 000 022 000 020 016 140 000 000 000 000 000
4766422560 000 000 000 000 000 000 000 000 046 001 000 000 001 000 027 000
4766422600 000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422620 074 001 000 000 022 000 015 000 020 012 100 000 000 000 000 000
4766422640 002 000 000 000 000 000 000 000 114 001 000 000 040 000 000 000
4766422660 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422700 150 001 000 000 040 000 030 000 200 020 140 000 000 000 000 000
4766422720 000 000 000 000 000 000 000 000 163 001 000 000 022 000 000 000
4766422740 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422760 206 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423000 000 000 000 000 000 000 000 000 231 001 000 000 020 000 030 000
4766423020 230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423040 240 001 000 000 022 000 016 000 024 012 100 000 000 000 000 000
4766423060 000 000 000 000 000 000 000 000 246 001 000 000 022 000 000 000
4766423100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423120 274 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423140 000 000 000 000 000 000 000 000 320 001 000 000 022 000 000 000
4766423160 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423200 343 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423220 000 000 000 000 000 000 000 000 365 001 000 000 022 000 000 000
4766423240 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423260 007 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423300 000 000 000 000 000 000 000 000 046 002 000 000 020 000 030 000
4766423320 200 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423340 063 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
4766423360 000 000 000 000 000 000 000 000 102 002 000 000 021 002 017 000
4766423400 050 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423420 117 002 000 000 021 000 017 000 040 012 100 000 000 000 000 000
4766423440 004 000 000 000 000 000 000 000 136 002 000 000 022 000 000 000
4766423460 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423500 160 002 000 000 022 000 015 000 240 011 100 000 000 000 000 000
4766423520 145 000 000 000 000 000 000 000 200 002 000 000 022 000 000 000
4766423540 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423560 224 002 000 000 020 000 031 000 300 020 160 000 000 000 000 000
4766423600 000 000 000 000 000 000 000 000 231 002 000 000 022 000 015 000
4766423620 023 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423640 240 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423660 000 000 000 000 000 000 000 000 265 002 000 000 020 000 031 000
4766423700 230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423720 301 002 000 000 022 000 015 000 040 007 100 000 000 000 000 000
4766423740 363 000 000 000 000 000 000 000 306 002 000 000 022 000 000 000
4766423760 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766424000 330 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
4766424020 000 000 000 000 000 000 000 000 354 002 000 000 021 002 030 000
4766424040 230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766424060 370 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
4766424100 000 000 000 000 000 000 000 000 022 003 000 000 022 000 013 000
4766424120 030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766424140 000 060 070 060 064 056 143 000 146 151 154 145 137 167 162 151
4766424160 164 145 162 000 142 165 146 146 145 162 056 064 067 066 061 000
4766424200 160 151 160 145 137 146 144 000 143 162 164 163 164 165 146 146
4766424220 056 143 000 137 137 112 103 122 137 114 111 123 124 137 137 000
4766424240 144 145 162 145 147 151 163 164 145 162 137 164 155 137 143 154
4766424260 157 156 145 163 000 162 145 147 151 163 164 145 162 137 164 155
4766424300 137 143 154 157 156 145 163 000 137 137 144 157 137 147 154 157
4766424320 142 141 154 137 144 164 157 162 163 137 141 165 170 000 143 157
4766424340 155 160 154 145 164 145 144 056 066 063 064 064 000 137 137 144
4766424360 157 137 147 154 157 142 141 154 137 144 164 157 162 163 137 141
4766424400 165 170 137 146 151 156 151 137 141 162 162 141 171 137 145 156
4766424420 164 162 171 000 146 162 141 155 145 137 144 165 155 155 171 000
4766424440 137 137 146 162 141 155 145 137 144 165 155 155 171 137 151 156
4766424460 151 164 137 141 162 162 141 171 137 145 156 164 162 171 000 137
4766424500 137 106 122 101 115 105 137 105 116 104 137 137 000 137 137 112
4766424520 103 122 137 105 116 104 137 137 000 137 137 151 156 151 164 137
4766424540 141 162 162 141 171 137 145 156 144 000 137 104 131 116 101 115
4766424560 111 103 000 137 137 151 156 151 164 137 141 162 162 141 171 137
4766424600 163 164 141 162 164 000 137 107 114 117 102 101 114 137 117 106
4766424620 106 123 105 124 137 124 101 102 114 105 137 000 137 137 154 151
4766424640 142 143 137 143 163 165 137 146 151 156 151 000 137 111 124 115
4766424660 137 144 145 162 145 147 151 163 164 145 162 124 115 103 154 157
4766424700 156 145 124 141 142 154 145 000 144 141 164 141 137 163 164 141
4766424720 162 164 000 143 154 157 156 145 100 100 107 114 111 102 103 137
4766424740 062 056 062 056 065 000 167 162 151 164 145 100 100 107 114 111
4766424760 102 103 137 062 056 062 056 065 000 137 145 144 141 164 141 000
4766425000 137 146 151 156 151 000 163 156 160 162 151 156 164 146 100 100
4766425020 107 114 111 102 103 137 062 056 062 056 065 000 155 145 155 163
4766425040 145 164 100 100 107 114 111 102 103 137 062 056 062 056 065 000
4766425060 143 154 157 163 145 100 100 107 114 111 102 103 137 062 056 062
4766425100 056 065 000 160 151 160 145 100 100 107 114 111 102 103 137 062
4766425120 056 062 056 065 000 162 145 141 144 100 100 107 114 111 102 103
4766425140 137 062 056 062 056 065 000 137 137 154 151 142 143 137 163 164
4766425160 141 162 164 137 155 141 151 156 100 100 107 114 111 102 103 137
4766425200 062 056 062 056 065 000 137 137 144 141 164 141 137 163 164 141
4766425220 162 164 000 137 137 147 155 157 156 137 163 164 141 162 164 137
4766425240 137 000 137 137 144 163 157 137 150 141 156 144 154 145 000 137
4766425260 111 117 137 163 164 144 151 156 137 165 163 145 144 000 153 151
4766425300 154 154 100 100 107 114 111 102 103 137 062 056 062 056 065 000
4766425320 137 137 154 151 142 143 137 143 163 165 137 151 156 151 164 000
4766425340 155 141 154 154 157 143 100 100 107 114 111 102 103 137 062 056
4766425360 062 056 065 000 137 145 156 144 000 137 163 164 141 162 164 000
4766425400 162 145 141 154 154 157 143 100 100 107 114 111 102 103 137 062
4766425420 056 062 056 065 000 137 137 142 163 163 137 163 164 141 162 164
4766425440 000 155 141 151 156 000 157 160 145 156 100 100 107 114 111 102
4766425460 103 137 062 056 062 056 065 000 137 112 166 137 122 145 147 151
4766425500 163 164 145 162 103 154 141 163 163 145 163 000 137 137 124 115
4766425520 103 137 105 116 104 137 137 000 137 111 124 115 137 162 145 147
4766425540 151 163 164 145 162 124 115 103 154 157 156 145 124 141 142 154
4766425560 145 000 137 151 156 151 164 000 000 000 000 000 000 000 000 000
4766425600 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
4766425660 000 000 000 000 000 000 000 000 033 000 000 000 001 000 000 000
4766425700 002 000 000 000 000 000 000 000 070 002 100 000 000 000 000 000
4766425720 070 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
4766425740 000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
4766425760 000 000 000 000 000 000 000 000 043 000 000 000 007 000 000 000
4766426000 002 000 000 000 000 000 000 000 124 002 100 000 000 000 000 000
4766426020 124 002 000 000 000 000 000 000 040 000 000 000 000 000 000 000
4766426040 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766426060 000 000 000 000 000 000 000 000 061 000 000 000 007 000 000 000
4766426100 002 000 000 000 000 000 000 000 164 002 100 000 000 000 000 000
4766426120 164 002 000 000 000 000 000 000 044 000 000 000 000 000 000 000
4766426140 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766426160 000 000 000 000 000 000 000 000 104 000 000 000 366 377 377 157
4766426200 002 000 000 000 000 000 000 000 230 002 100 000 000 000 000 000
4766426220 230 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
4766426240 005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
4766426260 000 000 000 000 000 000 000 000 116 000 000 000 013 000 000 000
4766426300 002 000 000 000 000 000 000 000 270 002 100 000 000 000 000 000
4766426320 270 002 000 000 000 000 000 000 120 001 000 000 000 000 000 000
4766426340 006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
4766426360 030 000 000 000 000 000 000 000 126 000 000 000 003 000 000 000
4766426400 002 000 000 000 000 000 000 000 010 004 100 000 000 000 000 000
4766426420 010 004 000 000 000 000 000 000 175 000 000 000 000 000 000 000
4766426440 000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
4766426460 000 000 000 000 000 000 000 000 136 000 000 000 377 377 377 157
4766426500 002 000 000 000 000 000 000 000 206 004 100 000 000 000 000 000
4766426520 206 004 000 000 000 000 000 000 034 000 000 000 000 000 000 000
4766426540 005 000 000 000 000 000 000 000 002 000 000 000 000 000 000 000
4766426560 002 000 000 000 000 000 000 000 153 000 000 000 376 377 377 157
4766426600 002 000 000 000 000 000 000 000 250 004 100 000 000 000 000 000
4766426620 250 004 000 000 000 000 000 000 040 000 000 000 000 000 000 000
4766426640 006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
4766426660 000 000 000 000 000 000 000 000 172 000 000 000 004 000 000 000
4766426700 002 000 000 000 000 000 000 000 310 004 100 000 000 000 000 000
4766426720 310 004 000 000 000 000 000 000 030 000 000 000 000 000 000 000
4766426740 005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
4766426760 030 000 000 000 000 000 000 000 204 000 000 000 004 000 000 000
4766427000 102 000 000 000 000 000 000 000 340 004 100 000 000 000 000 000
4766427020 340 004 000 000 000 000 000 000 070 001 000 000 000 000 000 000
4766427040 005 000 000 000 014 000 000 000 010 000 000 000 000 000 000 000
4766427060 030 000 000 000 000 000 000 000 216 000 000 000 001 000 000 000
4766427100 006 000 000 000 000 000 000 000 030 006 100 000 000 000 000 000
4766427120 030 006 000 000 000 000 000 000 032 000 000 000 000 000 000 000
4766427140 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766427160 000 000 000 000 000 000 000 000 211 000 000 000 001 000 000 000
4766427200 006 000 000 000 000 000 000 000 100 006 100 000 000 000 000 000
4766427220 100 006 000 000 000 000 000 000 340 000 000 000 000 000 000 000
4766427240 000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
4766427260 020 000 000 000 000 000 000 000 224 000 000 000 001 000 000 000
4766427300 006 000 000 000 000 000 000 000 040 007 100 000 000 000 000 000
4766427320 040 007 000 000 000 000 000 000 364 002 000 000 000 000 000 000
4766427340 000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
4766427360 000 000 000 000 000 000 000 000 232 000 000 000 001 000 000 000
4766427400 006 000 000 000 000 000 000 000 024 012 100 000 000 000 000 000
4766427420 024 012 000 000 000 000 000 000 011 000 000 000 000 000 000 000
4766427440 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766427460 000 000 000 000 000 000 000 000 240 000 000 000 001 000 000 000
4766427500 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
4766430000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
6006410000
^C
----------

^ permalink raw reply	[flat|nested] 58+ messages in thread

4766421100 000 000 000 000 000 000 000 000 000 000 000 000 003 000 015 000
4766421120 040 007 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421140 000 000 000 000 003 000 016 000 024 012 100 000 000 000 000 000
4766421160 000 000 000 000 000 000 000 000 000 000 000 000 003 000 017 000
4766421200 040 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421220 000 000 000 000 003 000 020 000 100 012 100 000 000 000 000 000
4766421240 000 000 000 000 000 000 000 000 000 000 000 000 003 000 021 000
4766421260 200 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421300 000 000 000 000 003 000 022 000 020 016 140 000 000 000 000 000
4766421320 000 000 000 000 000 000 000 000 000 000 000 000 003 000 023 000
4766421340 030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421360 000 000 000 000 003 000 024 000 040 016 140 000 000 000 000 000
4766421400 000 000 000 000 000 000 000 000 000 000 000 000 003 000 025 000
4766421420 050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421440 000 000 000 000 003 000 026 000 370 017 140 000 000 000 000 000
4766421460 000 000 000 000 000 000 000 000 000 000 000 000 003 000 027 000
4766421500 000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421520 000 000 000 000 003 000 030 000 200 020 140 000 000 000 000 000
4766421540 000 000 000 000 000 000 000 000 000 000 000 000 003 000 031 000
4766421560 240 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421600 000 000 000 000 003 000 032 000 000 000 000 000 000 000 000 000
4766421620 000 000 000 000 000 000 000 000 001 000 000 000 004 000 361 377
4766421640 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766421660 010 000 000 000 002 000 015 000 000 011 100 000 000 000 000 000
4766421700 221 000 000 000 000 000 000 000 024 000 000 000 001 000 031 000
4766421720 300 020 140 000 000 000 000 000 000 000 020 000 000 000 000 000
4766421740 040 000 000 000 001 000 030 000 220 020 140 000 000 000 000 000
4766421760 010 000 000 000 000 000 000 000 050 000 000 000 004 000 361 377
4766422000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422020 063 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
4766422040 000 000 000 000 000 000 000 000 100 000 000 000 002 000 015 000
4766422060 100 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422100 125 000 000 000 002 000 015 000 160 010 100 000 000 000 000 000
4766422120 000 000 000 000 000 000 000 000 150 000 000 000 002 000 015 000
4766422140 260 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422160 176 000 000 000 001 000 031 000 240 020 140 000 000 000 000 000
4766422200 001 000 000 000 000 000 000 000 215 000 000 000 001 000 023 000
4766422220 030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422240 264 000 000 000 002 000 015 000 320 010 100 000 000 000 000 000
4766422260 000 000 000 000 000 000 000 000 300 000 000 000 001 000 022 000
4766422300 020 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422320 050 000 000 000 004 000 361 377 000 000 000 000 000 000 000 000
4766422340 000 000 000 000 000 000 000 000 337 000 000 000 001 000 021 000
4766422360 300 013 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422400 355 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
4766422420 000 000 000 000 000 000 000 000 000 000 000 000 004 000 361 377
4766422440 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422460 371 000 000 000 000 000 022 000 030 016 140 000 000 000 000 000
4766422500 000 000 000 000 000 000 000 000 012 001 000 000 001 000 025 000
4766422520 050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422540 023 001 000 000 000 000 022 000 020 016 140 000 000 000 000 000
4766422560 000 000 000 000 000 000 000 000 046 001 000 000 001 000 027 000
4766422600 000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422620 074 001 000 000 022 000 015 000 020 012 100 000 000 000 000 000
4766422640 002 000 000 000 000 000 000 000 114 001 000 000 040 000 000 000
4766422660 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422700 150 001 000 000 040 000 030 000 200 020 140 000 000 000 000 000
4766422720 000 000 000 000 000 000 000 000 163 001 000 000 022 000 000 000
4766422740 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766422760 206 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423000 000 000 000 000 000 000 000 000 231 001 000 000 020 000 030 000
4766423020 230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423040 240 001 000 000 022 000 016 000 024 012 100 000 000 000 000 000
4766423060 000 000 000 000 000 000 000 000 246 001 000 000 022 000 000 000
4766423100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423120 274 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423140 000 000 000 000 000 000 000 000 320 001 000 000 022 000 000 000
4766423160 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423200 343 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423220 000 000 000 000 000 000 000 000 365 001 000 000 022 000 000 000
4766423240 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423260 007 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423300 000 000 000 000 000 000 000 000 046 002 000 000 020 000 030 000
4766423320 200 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423340 063 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
4766423360 000 000 000 000 000 000 000 000 102 002 000 000 021 002 017 000
4766423400 050 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423420 117 002 000 000 021 000 017 000 040 012 100 000 000 000 000 000
4766423440 004 000 000 000 000 000 000 000 136 002 000 000 022 000 000 000
4766423460 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423500 160 002 000 000 022 000 015 000 240 011 100 000 000 000 000 000
4766423520 145 000 000 000 000 000 000 000 200 002 000 000 022 000 000 000
4766423540 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423560 224 002 000 000 020 000 031 000 300 020 160 000 000 000 000 000
4766423600 000 000 000 000 000 000 000 000 231 002 000 000 022 000 015 000
4766423620 023 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423640 240 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
4766423660 000 000 000 000 000 000 000 000 265 002 000 000 020 000 031 000
4766423700 230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766423720 301 002 000 000 022 000 015 000 040 007 100 000 000 000 000 000
4766423740 363 000 000 000 000 000 000 000 306 002 000 000 022 000 000 000
4766423760 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
4766424000 330 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
4766424020 000 000 000 000 000 000 000 000 354 002 000 000 021 002 030 000
4766424040 230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
4766424060 370 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
4766424100 000 000 000 000 000 000 000 000 022 003 000 000 022 000 013 000
4766424120 030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
4766424140 000 060 070 060 064 056 143 000 146 151 154 145 137 167 162 151
4766424160 164 145 162 000 142 165 146 146 145 162 056 064 067 066 061 000
4766424200 160 151 160 145 137 146 144 000 143 162 164 163 164 165 146 146
4766424220 056 143 000 137 137 112 103 122 137 114 111 123 124 137 137 000
4766424240 144 145 162 145 147 151 163 164 145 162 137 164 155 137 143 154
4766424260 157 156 145 163 000 162 145 147 151 163 164 145 162 137 164 155
4766424300 137 143 154 157 156 145 163 000 137 137 144 157 137 147 154 157
4766424320 142 141 154 137 144 164 157 162 163 137 141 165 170 000 143 157
4766424340 155 160 154 145 164 145 144 056 066 063 064 064 000 137 137 144
4766424360 157 137 147 154 157 142 141 154 137 144 164 157 162 163 137 141
4766424400 165 170 137 146 151 156 151 137 141 162 162 141 171 137 145 156
4766424420 164 162 171 000 146 162 141 155 145 137 144 165 155 155 171 000
4766424440 137 137 146 162 141 155 145 137 144 165 155 155 171 137 151 156
4766424460 151 164 137 141 162 162 141 171 137 145 156 164 162 171 000 137
4766424500 137 106 122 101 115 105 137 105 116 104 137 137 000 137 137 112
4766424520 103 122 137 105 116 104 137 137 000 137 137 151 156 151 164 137
4766424540 141 162 162 141 171 137 145 156 144 000 137 104 131 116 101 115
4766424560 111 103 000 137 137 151 156 151 164 137 141 162 162 141 171 137
4766424600 163 164 141 162 164 000 137 107 114 117 102 101 114 137 117 106
4766424620 106 123 105 124 137 124 101 102 114 105 137 000 137 137 154 151
4766424640 142 143 137 143 163 165 137 146 151 156 151 000 137 111 124 115
4766424660 137 144 145 162 145 147 151 163 164 145 162 124 115 103 154 157
4766424700 156 145 124 141 142 154 145 000 144 141 164 141 137 163 164 141
4766424720 162 164 000 143 154 157 156 145 100 100 107 114 111 102 103 137
4766424740 062 056 062 056 065 000 167 162 151 164 145 100 100 107 114 111
4766424760 102 103 137 062 056 062 056 065 000 137 145 144 141 164 141 000
4766425000 137 146 151 156 151 000 163 156 160 162 151 156 164 146 100 100
4766425020 107 114 111 102 103 137 062 056 062 056 065 000 155 145 155 163
4766425040 145 164 100 100 107 114 111 102 103 137 062 056 062 056 065 000
4766425060 143 154 157 163 145 100 100 107 114 111 102 103 137 062 056 062
4766425100 056 065 000 160 151 160 145 100 100 107 114 111 102 103 137 062
4766425120 056 062 056 065 000 162 145 141 144 100 100 107 114 111 102 103
4766425140 137 062 056 062 056 065 000 137 137 154 151 142 143 137 163 164
4766425160 141 162 164 137 155 141 151 156 100 100 107 114 111 102 103 137
4766425200 062 056 062 056 065 000 137 137 144 141 164 141 137 163 164 141
4766425220 162 164 000 137 137 147 155 157 156 137 163 164 141 162 164 137
4766425240 137 000 137 137 144 163 157 137 150 141 156 144 154 145 000 137
4766425260 111 117 137 163 164 144 151 156 137 165 163 145 144 000 153 151
4766425300 154 154 100 100 107 114 111 102 103 137 062 056 062 056 065 000
4766425320 137 137 154 151 142 143 137 143 163 165 137 151 156 151 164 000
4766425340 155 141 154 154 157 143 100 100 107 114 111 102 103 137 062 056
4766425360 062 056 065 000 137 145 156 144 000 137 163 164 141 162 164 000
4766425400 162 145 141 154 154 157 143 100 100 107 114 111 102 103 137 062
4766425420 056 062 056 065 000 137 137 142 163 163 137 163 164 141 162 164
4766425440 000 155 141 151 156 000 157 160 145 156 100 100 107 114 111 102
4766425460 103 137 062 056 062 056 065 000 137 112 166 137 122 145 147 151
4766425500 163 164 145 162 103 154 141 163 163 145 163 000 137 137 124 115
4766425520 103 137 105 116 104 137 137 000 137 111 124 115 137 162 145 147
4766425540 151 163 164 145 162 124 115 103 154 157 156 145 124 141 142 154
4766425560 145 000 137 151 156 151 164 000 000 000 000 000 000 000 000 000
4766425600 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
4766425660 000 000 000 000 000 000 000 000 033 000 000 000 001 000 000 000
4766425700 002 000 000 000 000 000 000 000 070 002 100 000 000 000 000 000
4766425720 070 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
4766425740 000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
4766425760 000 000 000 000 000 000 000 000 043 000 000 000 007 000 000 000
4766426000 002 000 000 000 000 000 000 000 124 002 100 000 000 000 000 000
4766426020 124 002 000 000 000 000 000 000 040 000 000 000 000 000 000 000
4766426040 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766426060 000 000 000 000 000 000 000 000 061 000 000 000 007 000 000 000
4766426100 002 000 000 000 000 000 000 000 164 002 100 000 000 000 000 000
4766426120 164 002 000 000 000 000 000 000 044 000 000 000 000 000 000 000
4766426140 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766426160 000 000 000 000 000 000 000 000 104 000 000 000 366 377 377 157
4766426200 002 000 000 000 000 000 000 000 230 002 100 000 000 000 000 000
4766426220 230 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
4766426240 005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
4766426260 000 000 000 000 000 000 000 000 116 000 000 000 013 000 000 000
4766426300 002 000 000 000 000 000 000 000 270 002 100 000 000 000 000 000
4766426320 270 002 000 000 000 000 000 000 120 001 000 000 000 000 000 000
4766426340 006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
4766426360 030 000 000 000 000 000 000 000 126 000 000 000 003 000 000 000
4766426400 002 000 000 000 000 000 000 000 010 004 100 000 000 000 000 000
4766426420 010 004 000 000 000 000 000 000 175 000 000 000 000 000 000 000
4766426440 000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
4766426460 000 000 000 000 000 000 000 000 136 000 000 000 377 377 377 157
4766426500 002 000 000 000 000 000 000 000 206 004 100 000 000 000 000 000
4766426520 206 004 000 000 000 000 000 000 034 000 000 000 000 000 000 000
4766426540 005 000 000 000 000 000 000 000 002 000 000 000 000 000 000 000
4766426560 002 000 000 000 000 000 000 000 153 000 000 000 376 377 377 157
4766426600 002 000 000 000 000 000 000 000 250 004 100 000 000 000 000 000
4766426620 250 004 000 000 000 000 000 000 040 000 000 000 000 000 000 000
4766426640 006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
4766426660 000 000 000 000 000 000 000 000 172 000 000 000 004 000 000 000
4766426700 002 000 000 000 000 000 000 000 310 004 100 000 000 000 000 000
4766426720 310 004 000 000 000 000 000 000 030 000 000 000 000 000 000 000
4766426740 005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
4766426760 030 000 000 000 000 000 000 000 204 000 000 000 004 000 000 000
4766427000 102 000 000 000 000 000 000 000 340 004 100 000 000 000 000 000
4766427020 340 004 000 000 000 000 000 000 070 001 000 000 000 000 000 000
4766427040 005 000 000 000 014 000 000 000 010 000 000 000 000 000 000 000
4766427060 030 000 000 000 000 000 000 000 216 000 000 000 001 000 000 000
4766427100 006 000 000 000 000 000 000 000 030 006 100 000 000 000 000 000
4766427120 030 006 000 000 000 000 000 000 032 000 000 000 000 000 000 000
4766427140 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766427160 000 000 000 000 000 000 000 000 211 000 000 000 001 000 000 000
4766427200 006 000 000 000 000 000 000 000 100 006 100 000 000 000 000 000
4766427220 100 006 000 000 000 000 000 000 340 000 000 000 000 000 000 000
4766427240 000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
4766427260 020 000 000 000 000 000 000 000 224 000 000 000 001 000 000 000
4766427300 006 000 000 000 000 000 000 000 040 007 100 000 000 000 000 000
4766427320 040 007 000 000 000 000 000 000 364 002 000 000 000 000 000 000
4766427340 000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
4766427360 000 000 000 000 000 000 000 000 232 000 000 000 001 000 000 000
4766427400 006 000 000 000 000 000 000 000 024 012 100 000 000 000 000 000
4766427420 024 012 000 000 000 000 000 000 011 000 000 000 000 000 000 000
4766427440 000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
4766427460 000 000 000 000 000 000 000 000 240 000 000 000 001 000 000 000
4766427500 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
4766430000
Killed
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
6006410000
^C
----------

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] mm, oom: fix oom_reaper fallouts
  2017-08-07 13:28   ` Tetsuo Handa
@ 2017-08-07 14:04     ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-07 14:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Mon 07-08-17 22:28:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hi,
> > there are two issues this patch series attempts to fix. First one is
> > something that has been broken since MMF_UNSTABLE flag introduction
> > and I guess we should backport it stable trees (patch 1). The other
> > issue has been brought up by Wenwei Tao and Tetsuo Handa has created
> > a test case to trigger it very reliably. I am not yet sure this is a
> > stable material because the test case is rather artificial. If there is
> > a demand for the stable backport I will prepare it, of course, though.
> > 
> > I hope I've done the second patch correctly but I would definitely
> > appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
> > previous attempt with some more context was posted here
> > http://lkml.kernel.org/r/20170803135902.31977-1-mhocko@kernel.org
> > 
> > My testing didn't show anything unusual with these two applied on top of
> > the mmotm tree.
> 
> I really don't like your likely/unlikely speculation.

Have you seen any non-artificial workload triggering this? Look, I am
not going to argue about how likely this is or not. I've said I am
willing to do backports if there is a demand, but please do realize that
this is not a trivial change to backport: pre-4.9 kernels would require
MMF_UNSTABLE to be backported as well. This can all be discussed
after the merge, so can we focus on the review now rather than on
distractions?

Also please note that while writing zeros is certainly bad, any integrity
assumptions are basically off when an application gets killed
unexpectedly while performing I/O.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] mm, oom: fix oom_reaper fallouts
  2017-08-07 14:04     ` Michal Hocko
@ 2017-08-07 15:23       ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-07 15:23 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 07-08-17 22:28:27, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Hi,
> > > there are two issues this patch series attempts to fix. First one is
> > > something that has been broken since MMF_UNSTABLE flag introduction
> > > and I guess we should backport it stable trees (patch 1). The other
> > > issue has been brought up by Wenwei Tao and Tetsuo Handa has created
> > > a test case to trigger it very reliably. I am not yet sure this is a
> > > stable material because the test case is rather artificial. If there is
> > > a demand for the stable backport I will prepare it, of course, though.
> > > 
> > > I hope I've done the second patch correctly but I would definitely
> > > appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
> > > previous attempt with some more context was posted here
> > > http://lkml.kernel.org/r/20170803135902.31977-1-mhocko@kernel.org
> > > 
> > > My testing didn't show anything unusual with these two applied on top of
> > > the mmotm tree.
> > 
> > I really don't like your likely/unlikely speculation.
> 
> Have you seen any non-artificial workload triggering this?

It will take 5 to 10 years from now to know whether a non-artificial
workload triggers this. (I mean, once customers start using RHEL8.)

>                                                            Look, I am
> not going to argue about how likely this is or not. I've said I am
> willing to do backports if there is a demand, but please do realize that
> this is not a trivial change to backport: pre-4.9 kernels would require
> MMF_UNSTABLE to be backported as well. This can all be discussed
> after the merge, so can we focus on the review now rather than on
> distractions?

3f70dc38cec2 was not working as expected; nobody tested that OOM situation.
Therefore, I think we can revert 3f70dc38cec2 and then make it possible to
uniformly apply MMF_UNSTABLE to all 4.6+ kernels.

> 
> Also please note that while writing zeros is certainly bad, any integrity
> assumptions are basically off when an application gets killed
> unexpectedly while performing I/O.

I consider unexpectedly saving the process image (instead of zeros) to a file
to be similar to the fs.suid_dumpable problem (i.e. it could cause a security
problem). I do expect this patch to be backported to RHEL8 (I don't know which
kernel version RHEL8 will choose, but I guess it will be between 4.6 and 4.13).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-07 11:38   ` Michal Hocko
@ 2017-08-08 17:48     ` Andrea Arcangeli
  -1 siblings, 0 replies; 58+ messages in thread
From: Andrea Arcangeli @ 2017-08-08 17:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Tetsuo Handa, Oleg Nesterov,
	Wenwei Tao, linux-mm, LKML, Michal Hocko

Hello,

On Mon, Aug 07, 2017 at 01:38:39PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Wenwei Tao has noticed that our current assumption that the oom victim
> is dying and never doing any visible changes after it dies, and so the
> oom_reaper can tear it down, is not entirely true.
> 
> __task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT
> is set but do_group_exit sends SIGKILL to all threads _after_ the
> flag is set. So there is a race window when some threads won't have
> fatal_signal_pending while the oom_reaper could start unmapping the
> address space. Moreover some paths might not check for fatal signals
> before each PF/g-u-p/copy_from_user.
> 
> We already have a protection for oom_reaper vs. PF races by checking
> MMF_UNSTABLE. This has been, however, checked only for kernel threads
> (use_mm users) which can outlive the oom victim. A simple fix would be
> to extend the current check in handle_mm_fault for all tasks but that
> wouldn't be sufficient because the current check assumes that a kernel
> thread would bail out after EFAULT from get_user*/copy_from_user and
> never re-read the same address which would succeed because the PF path
> has established page tables already. This seems to be the case for the
> only existing use_mm user currently (virtio driver) but it is rather
> fragile in general.
> 
> This is even more fragile in general for more complex paths such as
> generic_perform_write which can re-read the same address more times
> (e.g. iov_iter_copy_from_user_atomic to fail and then
> iov_iter_fault_in_readable on retry). Therefore we have to implement
> MMF_UNSTABLE protection in a robust way and never make a potentially
> corrupted content visible. That requires to hook deeper into the PF
> path and check for the flag _every time_ before a pte for anonymous
> memory is established (that means all !VM_SHARED mappings).
> 
> The corruption can be triggered artificially [1] but there doesn't seem
> to be any real life bug report. The race window should be quite tight
> to trigger most of the time.

The bug corrected by patch 1/2 is the one I pointed out last week while
reviewing other oom reaper fixes, so that looks fine.

However I'd prefer to dump MMF_UNSTABLE for good instead of adding
more of it. The unmap_page_range call in __oom_reap_task_mm can be
replaced with a function that arms a special migration entry, so that
no branches are added to the fast paths and it's all hidden inside the
is_migration_entry slow paths. Instead of triggering a
wait_on_page_bit(TASK_UNINTERRUPTIBLE) when is_migration_entry(entry)
is true, it would do a:

   __set_current_state(TASK_KILLABLE);
   schedule();
   return VM_FAULT_SIGBUS;

Because the SIGKILL is already posted by the time the task is woken, the
SIGBUS handler cannot run (the process will exit before returning to
userland), and the error should prevent GUP from retrying in a loop
(which would happen with a regular migration entry).

It would be a page-less migration entry, so a fake, fixed,
non-page-struct-backed page pointer could be used to create the
migration entry. migration_entry_to_page will not return a page, but
such an entry can be cleared fine during exit_mmap like a regular
migration entry. No pagetable will be established either during those
migration-entry blocking events in do_swap_page.
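
To make the placement concrete, here is a minimal sketch, assuming a
hypothetical is_oom_reaped_entry() helper (not an existing kernel symbol)
that recognizes the page-less entry described above:

   /*
    * Sketch only, under the assumption above: the test hides behind
    * non_swap_entry(), i.e. in the existing do_swap_page slow path,
    * so no new branch is added to the page fault fast path.
    */
   if (unlikely(non_swap_entry(entry)) && is_oom_reaped_entry(entry)) {
           /*
            * SIGKILL is already pending: the task exits before any
            * SIGBUS could be delivered to userland.
            */
           __set_current_state(TASK_KILLABLE);
           schedule();
           return VM_FAULT_SIGBUS;
   }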

The above, however, looks simple compared to core dumping. That is
additional trouble, and not just because core dumping can call
handle_mm_fault without mmap_sem. Regardless of mmap_sem, I wonder what
happens if SIGNAL_GROUP_COREDUMP gets set while __oom_reap_task_mm is
already running. It can't be OK if core dumping can run into those
page-less migration entries; if it does, there's no chance of getting a
coherent coredump after that, because the page contents have already
been freed and reused by then. There should be an explanation of how
this race against coredumping is controlled to be sure the oom reaper
can't start during coredumping (of course there's the check already,
but I'm just wondering whether that check leaves a window for the race,
given there was already a race in the main page faults).
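
The check I have in mind is, if I recall correctly, the
SIGNAL_GROUP_COREDUMP test in __task_will_free_mem(); roughly, as a
paraphrase from memory rather than a verbatim quote:

   /*
    * Paraphrase (from memory) of the existing coredump exclusion: a
    * thread group that is dumping core is not treated as "about to
    * free its memory", so the shortcut that wakes the oom reaper is
    * not taken for it.
    */
   static inline bool __task_will_free_mem(struct task_struct *task)
   {
           struct signal_struct *sig = task->signal;

           if (sig->flags & SIGNAL_GROUP_COREDUMP)
                   return false;   /* core dump in progress */
           if (sig->flags & SIGNAL_GROUP_EXIT)
                   return true;
           if (thread_group_empty(task) && (task->flags & PF_EXITING))
                   return true;
           return false;
   }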

Overall OOM killing to me was reliable also before the oom reaper was
introduced.

I just did a search in bz for RHEL7 and there's a single bug report
related to OOM issues, but it's a hang in a non-ext4 filesystem, not
nested in alloc_pages (it's in wait_for_completion), and it's not
reproducible with ext4. It happens only in a specific artificial
"eatmemory" stress test from QA; there seem to be zero
customer-related bug reports about OOM hangs.

A couple of years ago I could trivially trigger OOM deadlocks on
various ext4 paths that loop or use GFP_NOFAIL, but that was just a
matter of letting GFP_NOIO/NOFS/NOFAIL kinds of allocations go through
memory reserves below the low watermark.

It is also fine to kill a few more processes, in fact. It's not the end
of the world if two tasks are killed because the first one couldn't
reach exit_mmap without oom reaper assistance. The filesystem kind of OOM
hangs in kernel threads are major issues when the whole filesystem (in
the journal or elsewhere) tends to prevent a multitude of tasks from
handling SIGKILL, so that has to be handled with reserves, and it looked
like it was working fine already.

The main point of the oom reaper nowadays is to free memory fast
enough that a second task isn't killed as a false positive, but it's not
like anybody will notice much of a difference if a second task is
killed; it wasn't commonly happening either.

Certainly it's preferable to get two tasks killed than to get corrupted core
dumps or corrupted memory, so if the oom reaper is to stay we need to
document how we guarantee it's mutually exclusive against core dumping,
and it had better not slow down the page fault fast paths, considering
that can be avoided by arming page-less migration entries that wait
for SIGKILL to be delivered in do_swap_page.

It's a big-hammer feature that is nice to have, but doing it safely and
without adding branches to the fast paths is somewhat more complex
than the current code.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-08 17:48     ` Andrea Arcangeli
@ 2017-08-08 23:35       ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-08 23:35 UTC (permalink / raw)
  To: aarcange, mhocko
  Cc: akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

Andrea Arcangeli wrote:
> Overall OOM killing to me was reliable also before the oom reaper was
> introduced.

I don't think so. We spent a lot of time removing possible locations
which could lead to failing to invoke the OOM killer when out_of_memory() is called.

> 
> I just did a search in bz for RHEL7 and there's a single bug report
> related to OOM issues, but it's a hang in a non-ext4 filesystem, not
> nested in alloc_pages (it's in wait_for_completion), and it's not
> reproducible with ext4. It happens only in a specific artificial
> "eatmemory" stress test from QA; there seem to be zero
> customer-related bug reports about OOM hangs.

Since RHEL7 changed the default filesystem from ext4 to xfs, OOM-related problems
became much easier to trigger, because xfs involves many kernel threads among
which TIF_MEMDIE-based access to memory reserves cannot work.

Judging from my experience at a support center, it is too difficult for customers
to report OOM hangs. It requires customers to stand by at the console
twenty-four seven so that we can get SysRq-t etc. whenever an OOM-related problem is
suspected. We can't ask customers for such effort. The absence of reports does not
mean OOM hangs are not occurring outside of artificial memory stress tests.

> 
> A couple of years ago I could trivially trigger OOM deadlocks on
> various ext4 paths that loop or use GFP_NOFAIL, but that was just a
> matter of letting GFP_NOIO/NOFS/NOFAIL kinds of allocations go through
> memory reserves below the low watermark.
> 
> It is also fine to kill a few more processes, in fact. It's not the end
> of the world if two tasks are killed because the first one couldn't
> reach exit_mmap without oom reaper assistance. The filesystem kind of OOM
> hangs in kernel threads are major issues when the whole filesystem (in
> the journal or elsewhere) tends to prevent a multitude of tasks from
> handling SIGKILL, so that has to be handled with reserves, and it looked
> like it was working fine already.
> 
> The main point of the oom reaper nowadays is to free memory fast
> enough that a second task isn't killed as a false positive, but it's not
> like anybody will notice much of a difference if a second task is
> killed; it wasn't commonly happening either.

The OOM reaper does not need to free memory that quickly, because the OOM killer
does not select a second task for kill until either the OOM reaper or
__mmput() sets MMF_OOM_SKIP.
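
Roughly speaking, the selection rule being relied on here is the
following; this is a paraphrase of the oom_evaluate_task() logic of this
era, not a verbatim quote:

   /*
    * Paraphrase, not verbatim kernel code: an existing OOM victim
    * blocks the selection of a second victim until MMF_OOM_SKIP has
    * been set on its mm, either by the oom_reaper or by __mmput().
    */
   if (tsk_is_oom_victim(task)) {
           if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                   goto next;      /* already reaped, keep scanning */
           goto abort;             /* do not pick another victim yet */
   }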

I think that the main open questions around the OOM reaper nowadays are
"how can we allow the OOM reaper to take mmap_sem for read (because
khugepaged might take the OOM victim's mmap_sem for write)"

----------
[  493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice child
[  493.791708] Killed process 3163 (a.out) total-vm:4268108kB, anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
[  494.838382] oom_reaper: unable to reap pid:3163 (a.out)
[  494.847768] 
[  494.847768] Showing all locks held in the system:
[  494.861357] 1 lock held by oom_reaper/59:
[  494.865903]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9f0c202d>] debug_show_all_locks+0x3d/0x1a0
[  494.872934] 1 lock held by khugepaged/63:
[  494.877426]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f1d5a4d>] khugepaged+0x99d/0x1af0
[  494.884165] 3 locks held by kswapd0/75:
[  494.888628]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
[  494.894125]  #1:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.898328]  #2:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03aeafd>] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
[  494.902703] 3 locks held by kworker/u128:31/387:
[  494.905404]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
[  494.909237]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
[  494.913205]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.916954] 1 lock held by xfsaild/sda1/422:
[  494.919288]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8828>] xfs_ilock_nowait+0x148/0x240 [xfs]
[  494.923470] 1 lock held by systemd-journal/491:
[  494.926102]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.929942] 1 lock held by gmain/745:
[  494.932368]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.936505] 1 lock held by tuned/1009:
[  494.938856]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.942824] 2 locks held by agetty/982:
[  494.944900]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff9f78503f>] ldsem_down_read+0x1f/0x30
[  494.948244]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff9f4108bf>] n_tty_read+0xbf/0x8e0
[  494.952118] 1 lock held by sendmail/984:
[  494.954408]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.958370] 5 locks held by a.out/3163:
[  494.960544]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f05ca34>] __do_page_fault+0x154/0x4c0
[  494.964191]  #1:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
[  494.967922]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.971548]  #3:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03ae7fe>] xfs_reclaim_inodes_ag+0xae/0x4d0 [xfs]
[  494.975644]  #4:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8580>] xfs_ilock+0xc0/0x1b0 [xfs]
[  494.979194] 1 lock held by a.out/3164:
[  494.981220]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.984448] 1 lock held by a.out/3165:
[  494.986554]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.989841] 1 lock held by a.out/3166:
[  494.992089]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.995388] 1 lock held by a.out/3167:
[  494.997420]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
----------

  collapse_huge_page at mm/khugepaged.c:1001
   (inlined by) khugepaged_scan_pmd at mm/khugepaged.c:1209
   (inlined by) khugepaged_scan_mm_slot at mm/khugepaged.c:1728
   (inlined by) khugepaged_do_scan at mm/khugepaged.c:1809
   (inlined by) khugepaged at mm/khugepaged.c:1854

and "how can we close race between checking MMF_OOM_SKIP and doing last alloc_page_from_freelist()
attempt (because that race allows needlessly selecting the second task for kill)" in addition to
"how can we close race between unmap_page_range() and the page faults with retry fallback".

> 
> Certainly it's preferable to get two tasks killed than to get corrupted core
> dumps or corrupted memory, so if the oom reaper is to stay we need to
> document how we guarantee it's mutually exclusive against core dumping,
> and it had better not slow down the page fault fast paths, considering
> that can be avoided by arming page-less migration entries that wait
> for SIGKILL to be delivered in do_swap_page.
> 
> It's a big-hammer feature that is nice to have, but doing it safely and
> without adding branches to the fast paths is somewhat more complex
> than the current code.

The subject of this thread is "how can we close race between unmap_page_range()
and the page faults with retry fallback". Are you suggesting that we should remove
the OOM reaper so that we don't need to change page faults and/or __mmput() paths?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-08 23:35       ` Tetsuo Handa
  0 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-08 23:35 UTC (permalink / raw)
  To: aarcange, mhocko
  Cc: akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

Andrea Arcangeli wrote:
> Overall OOM killing to me was reliable also before the oom reaper was
> introduced.

I don't think so. We spent a lot of time in order to remove possible locations
which can lead to failing to invoke the OOM killer when out_of_memory() is called.

> 
> I just did a search in bz for RHEL7 and there's a single bugreport
> related to OOM issues but it's hanging in a non-ext4 filesystem, and
> not nested in alloc_pages (but in wait_for_completion) and it's not
> reproducible with ext4. And it's happening only in an artificial
> specific "eatmemory" stress test from QA, there seems to be zero
> customer related bugreports about OOM hangs.

Since RHEL7 changed default filesystem from ext4 to xfs, OOM related problems
became much easier to occur, for xfs involves many kernel threads where
TIF_MEMDIE based access to memory reserves cannot work among relevant threads.

Judging from my experience at a support center, it is too difficult for customers
to report OOM hangs. It requires customers to stand by in front of the console
twenty-four seven so that we get SysRq-t etc. whenever an OOM related problem is
suspected. We can't ask customers for such effort. There is no report does not mean
OOM hang is not occurring without artificial memory stress tests.

> 
> A couple of years ago I could trivially trigger OOM deadlocks on
> various ext4 paths that loops or use GFP_NOFAIL, but that was just a
> matter of letting GFP_NOIO/NOFS/NOFAIL kind of allocation go through
> memory reserves below the low watermark.
> 
> It is also fine to kill a few more processes in fact. It's not the end
> of the world if two tasks are killed because the first one couldn't
> reach exit_mmap without oom reaper assistance. The fs kind of OOM
> hangs in kernel threads are major issues if the whole filesystem in
> the journal or something tends to prevent a multitude of tasks to
> handle SIGKILL, so it has to be handled with reserves and it looked
> like it was working fine already.
> 
> The main point of the oom reaper nowadays is to free memory fast
> enough so a second task isn't killed as a false positive, but it's not
> like anybody will notice much of a difference if a second task is
> killed, it wasn't commonly happening either.

The OOM reaper does not need to free memory fast enough, for the OOM killer
does not select the second task for kill until the OOM reaper sets
MMF_OOM_SKIP or __mmput() sets MMF_OOM_SKIP.

I think that the main point of the OOM reaper nowadays are that
"how can we allow the OOM reaper to take mmap_sem for read (because
khugepaged might take mmap_sem of the OOM victim for write)"

----------
[  493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice child
[  493.791708] Killed process 3163 (a.out) total-vm:4268108kB, anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
[  494.838382] oom_reaper: unable to reap pid:3163 (a.out)
[  494.847768] 
[  494.847768] Showing all locks held in the system:
[  494.861357] 1 lock held by oom_reaper/59:
[  494.865903]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9f0c202d>] debug_show_all_locks+0x3d/0x1a0
[  494.872934] 1 lock held by khugepaged/63:
[  494.877426]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f1d5a4d>] khugepaged+0x99d/0x1af0
[  494.884165] 3 locks held by kswapd0/75:
[  494.888628]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
[  494.894125]  #1:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.898328]  #2:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03aeafd>] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
[  494.902703] 3 locks held by kworker/u128:31/387:
[  494.905404]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
[  494.909237]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
[  494.913205]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.916954] 1 lock held by xfsaild/sda1/422:
[  494.919288]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8828>] xfs_ilock_nowait+0x148/0x240 [xfs]
[  494.923470] 1 lock held by systemd-journal/491:
[  494.926102]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.929942] 1 lock held by gmain/745:
[  494.932368]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.936505] 1 lock held by tuned/1009:
[  494.938856]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.942824] 2 locks held by agetty/982:
[  494.944900]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff9f78503f>] ldsem_down_read+0x1f/0x30
[  494.948244]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff9f4108bf>] n_tty_read+0xbf/0x8e0
[  494.952118] 1 lock held by sendmail/984:
[  494.954408]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.958370] 5 locks held by a.out/3163:
[  494.960544]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f05ca34>] __do_page_fault+0x154/0x4c0
[  494.964191]  #1:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
[  494.967922]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.971548]  #3:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03ae7fe>] xfs_reclaim_inodes_ag+0xae/0x4d0 [xfs]
[  494.975644]  #4:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8580>] xfs_ilock+0xc0/0x1b0 [xfs]
[  494.979194] 1 lock held by a.out/3164:
[  494.981220]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.984448] 1 lock held by a.out/3165:
[  494.986554]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.989841] 1 lock held by a.out/3166:
[  494.992089]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.995388] 1 lock held by a.out/3167:
[  494.997420]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
----------

  collapse_huge_page at mm/khugepaged.c:1001
   (inlined by) khugepaged_scan_pmd at mm/khugepaged.c:1209
   (inlined by) khugepaged_scan_mm_slot at mm/khugepaged.c:1728
   (inlined by) khugepaged_do_scan at mm/khugepaged.c:1809
   (inlined by) khugepaged at mm/khugepaged.c:1854

and "how can we close race between checking MMF_OOM_SKIP and doing last alloc_page_from_freelist()
attempt (because that race allows needlessly selecting the second task for kill)" in addition to
"how can we close race between unmap_page_range() and the page faults with retry fallback".

> 
> Certainly it's preferable to get two tasks killed than corrupted core
> dumps or corrupted memory, so if oom reaper will stay we need to
> document how  we guarantee it's mutually exclusive against core dumping
> and it'd better not slowdown page fault fast paths considering it's
> possible to do so by arming page-less migration entries that can wait
> for sigkill to be delivered in do_swap_page.
> 
> It's a big hammer feature that is nice to have but doing it safely and
> without adding branches to the fast paths, is somewhat more complex
> than current code.

The subject of this thread is "how can we close race between unmap_page_range()
and the page faults with retry fallback". Are you suggesting that we should remove
the OOM reaper so that we don't need to change page faults and/or __mmput() paths?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-08 23:35       ` Tetsuo Handa
@ 2017-08-09 18:36         ` Andrea Arcangeli
  -1 siblings, 0 replies; 58+ messages in thread
From: Andrea Arcangeli @ 2017-08-09 18:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

On Wed, Aug 09, 2017 at 08:35:36AM +0900, Tetsuo Handa wrote:
> I don't think so. We spent a lot of time in order to remove possible locations
> which can lead to failing to invoke the OOM killer when out_of_memory() is called.

The connection between failing to invoke the OOM killer and the OOM reaper
is not clear to me. I assume you mean failing to kill the task after
the OOM killer has been invoked through out_of_memory().

You should always see "%s: Kill process %d (%s) score %u or sacrifice
child\n" in the logs; the invocation itself should never be an
issue and it's unrelated to the OOM reaper.

> Since RHEL7 changed default filesystem from ext4 to xfs, OOM related problems
> became much easier to occur, for xfs involves many kernel threads where
> TIF_MEMDIE based access to memory reserves cannot work among relevant threads.

I could reproduce similar issues where the TIF_MEMDIE task was hung on
fs locks held by kernel threads in ext4 too, but those should have been
solved by other means.

> Judging from my experience at a support center, it is too difficult for customers
> to report OOM hangs. It requires customers to stand by in front of the console
> twenty-four seven so that we get SysRq-t etc. whenever an OOM related problem is
> suspected. We can't ask customers for such effort. There is no report does not mean
> OOM hang is not occurring without artificial memory stress tests.

The printk above is likely to show up in the logs after reboot, but I
agree that in the cloud a node hanging on OOM is probably hidden, and there
are all sorts of management provisions possible to prevent hitting a
real OOM too. For example memcg.

Still, I think having no apparent customer complaints is significant
because it means they easily tackle the problem by other means, be it
watchdogs or preventing it in the first place with memcg.

I'm not saying it's a minor issue; to me it's totally annoying if my
system hangs on OOM, so it should be reliable in practice. I'm only not
sure if tackling the OOM issues with a big hammer that still cannot
guarantee anything 100% is justified, considering the complexity it
brings to the VM core and that there's still no guarantee of not hanging.

> The OOM reaper does not need to free memory quickly, because the OOM killer
> does not select the second task for kill until the OOM reaper sets
> MMF_OOM_SKIP or __mmput() sets MMF_OOM_SKIP.

Right, there's no need to be fast there.

> I think that the main points of the OOM reaper nowadays are
> "how can we allow the OOM reaper to take mmap_sem for read (because
> khugepaged might take mmap_sem of the OOM victim for write)"

The main point of the OOM reaper is to avoid killing more tasks. Not
just because it would be a false positive, but also because even if we
kill more tasks, they may all be stuck on the same fs locks held by
kernel threads that cannot be killed and loop asking for more memory.

So the OOM reaper tends to reduce the risk of OOM hangs, but it sure
cannot guarantee perfection either.

Incidentally, the OOM reaper still has a timeout after which it gives up
and moves on to kill another task.

khugepaged doesn't allocate memory while holding the mmap_sem for
writing.

It's not exactly clear how khugepaged is the problem in the below dump,
because 3163 is also definitely holding the mmap_sem for reading and
it cannot release it independently of khugepaged. However khugepaged
could try to grab it for writing, and the fairness provisions of the
rwsem would prevent down_read_trylock from going ahead.

There's nothing specific about khugepaged here: you can try to do a
pthread_create() to create a thread in your a.out program and then
call mmap/munmap in a loop (no need to touch any memory). Eventually
you'll get the page fault in your a.out process holding the mmap_sem
for reading and the child thread trying to take it for writing, which
should be enough to block the OOM reaper entirely with the child stuck
in D state.
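
For illustration, a minimal userspace sketch of that scenario could look like
the following. This is not Tetsuo's actual test program; the allocation
pattern and sizes are made up, it only shows the mmap_sem reader/writer
interaction being described:

/*
 * One thread loops on mmap()/munmap(), so it keeps queueing as a writer on
 * mmap_sem, while the main thread keeps touching new anonymous memory and
 * thus page-faults with mmap_sem held for read until the OOM killer fires.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static void *mmap_thread(void *arg)
{
        for (;;) {
                /* Each mmap()/munmap() takes mmap_sem for write. */
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p != MAP_FAILED)
                        munmap(p, 4096);
        }
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, mmap_thread, NULL);
        for (;;) {
                /* Page faults here hold mmap_sem for read. */
                char *p = malloc(1 << 20);
                if (!p)
                        continue;
                memset(p, 1, 1 << 20);  /* keep consuming memory until OOM */
        }
        return 0;
}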

I already have a patch in my tree that lets exit_mmap and the OOM reaper
take down pagetables concurrently, serialized only by the PT lock
(upstream the OOM reaper can only run before exit_mmap starts, while
mm_users is still > 0). This lets the OOM reaper run even if mm_users
of the TIF_MEMDIE task has already reached 0. However, to avoid taking the
mmap_sem in __oom_reap_task_mm for reading you would need to do the
opposite of upstream, and then it would only solve OOM hangs between
the last mmput and exit_mmap.

To zap pagetables without mmap_sem I think quite some overhaul is
needed (likely much bigger than the one required to fix the memory and
coredump corruption). If that is done, it should be done to run
MADV_DONTNEED without mmap_sem, if anything. Increased OOM reaper
accuracy wouldn't be enough of a motivation to justify such an
increase in complexity and constant fast-path overhead (be it to
release vmas with RCU through callbacks with delayed freeing or
anything else required to drop the mmap_sem while still allowing the
OOM reaper to run while mm_users is still > 0). It'd be quite
challenging to do that because the vma bits are also protected by
mmap_sem, and you can only replace rbtree nodes with RCU, not
rebalance the augmented tree.

Assuming we do all that work and slow down the fast paths further, just
for the OOM reaper, what would then happen if the hung process has no
anonymous memory to free and instead runs on shmem only? Would we
be back to square one and hang with the below dump?

What if we fix xfs instead to get rid of the below problem? Wouldn't
the OOM reaper then become irrelevant, whether removed or not?

> ----------
> [  493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice child
> [  493.791708] Killed process 3163 (a.out) total-vm:4268108kB, anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
> [  494.838382] oom_reaper: unable to reap pid:3163 (a.out)
> [  494.847768] 
> [  494.847768] Showing all locks held in the system:
> [  494.861357] 1 lock held by oom_reaper/59:
> [  494.865903]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9f0c202d>] debug_show_all_locks+0x3d/0x1a0
> [  494.872934] 1 lock held by khugepaged/63:
> [  494.877426]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f1d5a4d>] khugepaged+0x99d/0x1af0
> [  494.884165] 3 locks held by kswapd0/75:
> [  494.888628]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
> [  494.894125]  #1:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [  494.898328]  #2:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03aeafd>] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
> [  494.902703] 3 locks held by kworker/u128:31/387:
> [  494.905404]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
> [  494.909237]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
> [  494.913205]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [  494.916954] 1 lock held by xfsaild/sda1/422:
> [  494.919288]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8828>] xfs_ilock_nowait+0x148/0x240 [xfs]
> [  494.923470] 1 lock held by systemd-journal/491:
> [  494.926102]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.929942] 1 lock held by gmain/745:
> [  494.932368]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.936505] 1 lock held by tuned/1009:
> [  494.938856]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.942824] 2 locks held by agetty/982:
> [  494.944900]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff9f78503f>] ldsem_down_read+0x1f/0x30
> [  494.948244]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff9f4108bf>] n_tty_read+0xbf/0x8e0
> [  494.952118] 1 lock held by sendmail/984:
> [  494.954408]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.958370] 5 locks held by a.out/3163:
> [  494.960544]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f05ca34>] __do_page_fault+0x154/0x4c0
> [  494.964191]  #1:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
> [  494.967922]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [  494.971548]  #3:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03ae7fe>] xfs_reclaim_inodes_ag+0xae/0x4d0 [xfs]
> [  494.975644]  #4:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8580>] xfs_ilock+0xc0/0x1b0 [xfs]
> [  494.979194] 1 lock held by a.out/3164:
> [  494.981220]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [  494.984448] 1 lock held by a.out/3165:
> [  494.986554]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [  494.989841] 1 lock held by a.out/3166:
> [  494.992089]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [  494.995388] 1 lock held by a.out/3167:
> [  494.997420]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> ----------
> 
>   collapse_huge_page at mm/khugepaged.c:1001
>    (inlined by) khugepaged_scan_pmd at mm/khugepaged.c:1209
>    (inlined by) khugepaged_scan_mm_slot at mm/khugepaged.c:1728
>    (inlined by) khugepaged_do_scan at mm/khugepaged.c:1809
>    (inlined by) khugepaged at mm/khugepaged.c:1854
> 
> and "how can we close race between checking MMF_OOM_SKIP and doing last alloc_page_from_freelist()
> attempt (because that race allows needlessly selecting the second task for kill)" in addition to
> "how can we close race between unmap_page_range() and the page faults with retry fallback".

Yes. And "how is the OOM reaper guaranteed not to already be running while
coredumping is starting" should be added to the above list of things
to fix or explain.

I'm just questioning whether all this energy isn't better spent fixing
XFS with a memory reserve in xfs_reclaim_inode for kmem_alloc (like we
have mempools for bio) and dropping the OOM reaper, leaving the VM fast
paths alone.

> The subject of this thread is "how can we close race between unmap_page_range()
> and the page faults with retry fallback". Are you suggesting that we should remove
> the OOM reaper so that we don't need to change page faults and/or __mmput() paths?

Well, certainly if it's not fixed I think we'd be better off removing
it, because the risk of a hang is preferable to the risk of memory
corruption or corrupted core dumps.

If it were as simple as it currently is, it would be nice to have, but
doing it safely, without the risk of corrupting memory and coredumps and
without slowing down the VM fast paths, sounds like overkill. Last but not
least, it hides the reproducibility of issues like the above hang you posted,
which I think it can't do anything about even if you remove khugepaged...

... unless we drop the mmap_sem from MADV_DONTNEED, but that's not easily
feasible if unmap_page_range has to run while mm_users may still be
> 0. Doing more VM changes that are OOM reaper specific doesn't
seem attractive to me.

I'd prefer it if we could fix the issues in xfs the old-fashioned
way, one that won't end up in a hang again if, after all that work, the
TIF_MEMDIE task happens to have 0 anon memory allocated in it.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-08 17:48     ` Andrea Arcangeli
@ 2017-08-10  8:21       ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-10  8:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Kirill A. Shutemov, Tetsuo Handa, Oleg Nesterov,
	Wenwei Tao, linux-mm, LKML

On Tue 08-08-17 19:48:55, Andrea Arcangeli wrote:
[...]
> The bug corrected by this patch 1/2 I pointed it out last week while
> reviewing other oom reaper fixes so that looks fine.
> 
> However I'd prefer to dump MMF_UNSTABLE for good instead of adding
> more of it. It can be replaced with unmap_page_range in
> __oom_reap_task_mm with a function that arms a special migration entry
> so that no branchs are added to the fast paths and it's all hidden
> inside is_migration_entry slow paths.

This sounds like an interesting idea but I would like to address the
_correctness_ issue first and optimize on top of it. If for nothing else
backporting a follow up fix sounds easier than a complete rework. There
are quite some callers of is_migration_entry and the patch won't be
trivial either. So can we focus on the fix first please?

[...]

> Overall OOM killing to me was reliable also before the oom reaper was
> introduced.

Yeah, this is the case in my experience as well, but there are others
claiming otherwise, and implementation-wise the code was fragile
enough to support their claims. An unbounded lockup on the TIF_MEMDIE task
just asks for trouble, especially when we have no idea what the oom victim
might be doing. Things are very simple when the victim was kicked out
of userspace, but this all gets very hairy when it was somewhere in
the kernel waiting for locks. It seems that we are mostly lucky in the
global oom situations. We have seen lockups with memcgs and had to move
the memcg oom handling to a lockless PF context. Those two were not too
different except the memcg one was easier to hit.

[...]

> A couple of years ago I could trivially trigger OOM deadlocks on
> various ext4 paths that loops or use GFP_NOFAIL, but that was just a
> matter of letting GFP_NOIO/NOFS/NOFAIL kind of allocation go through
> memory reserves below the low watermark.

You would have to identify the dependency chain to do this properly,
otherwise you simply consume memory reserves and you are back to square
one.

> It is also fine to kill a few more processes in fact.

I strongly disagree. It might be acceptable to kill more tasks if there
is absolutely no other choice. OOM killing is a very disruptive action
and we should _really_ reduce it to the absolute minimum.

[...]
> The main point of the oom reaper nowadays is to free memory fast
> enough so a second task isn't killed as a false positive, but it's not
> like anybody will notice much of a difference if a second task is
> killed, it wasn't commonly happening either.

No, you seem to misunderstand. Adding a kernel thread to optimize a
glacially slow path would be really hard to justify. The sole
purpose of the oom reaper is _reliability_. We do not select another
task from an oom domain if there is an existing oom victim alive, so we
do not need the reaper to prevent another victim selection. All we need
this async context for is to _guarantee_ that somebody tries to reclaim
as much memory of the victim as possible and then allows the oom killer
to continue if the OOM situation is not resolved, because that endless
waiting for a sync context is what causes those lockups.
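
Roughly, and only as a simplified paraphrase of the mm/oom_kill.c code of this
era (retry count, error handling and helpers are trimmed), the reaper side
looks like:

/* Try to reap a victim's address space a bounded number of times, then set
 * MMF_OOM_SKIP so the OOM killer may move on even if reaping failed. */
static void oom_reap_task(struct task_struct *tsk)
{
        int attempts = 0;
        struct mm_struct *mm = tsk->signal->oom_mm;

        /* __oom_reap_task_mm() fails when down_read_trylock(mmap_sem)
         * cannot be taken, e.g. because a writer is queued. */
        while (attempts++ < MAX_OOM_REAP_RETRIES &&
               !__oom_reap_task_mm(tsk, mm))
                schedule_timeout_idle(HZ / 10);

        if (attempts > MAX_OOM_REAP_RETRIES)
                pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
                        task_pid_nr(tsk), tsk->comm);

        /* Whether reaping worked or not, this mm is no longer a reason to
         * hold off selecting a new victim. */
        set_bit(MMF_OOM_SKIP, &mm->flags);
        put_task_struct(tsk);   /* reference taken by wake_oom_reaper() */
}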

> Certainly it's preferable to get two tasks killed than corrupted core
> dumps or corrupted memory, so if oom reaper will stay we need to
> document how we guarantee it's mutually exclusive against core dumping

Corrupted anonymous memory in the core dump was deemed an acceptable
trade-off for more reliable oom handling. If there is a strong
usecase for reliable core dumps then we can work on it, of course, but
system stability comes first IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-10  8:21       ` Michal Hocko
@ 2017-08-10 13:33         ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Kirill A. Shutemov, Tetsuo Handa, Oleg Nesterov,
	Wenwei Tao, linux-mm, LKML

On Thu 10-08-17 10:21:18, Michal Hocko wrote:
> On Tue 08-08-17 19:48:55, Andrea Arcangeli wrote:
> [...]
> > The bug corrected by this patch 1/2 I pointed it out last week while
> > reviewing other oom reaper fixes so that looks fine.
> > 
> > However I'd prefer to dump MMF_UNSTABLE for good instead of adding
> > more of it. It can be replaced with unmap_page_range in
> > __oom_reap_task_mm with a function that arms a special migration entry
> > so that no branchs are added to the fast paths and it's all hidden
> > inside is_migration_entry slow paths.
> 
> This sounds like an interesting idea but I would like to address the
> _correctness_ issue first and optimize on top of it. If for nothing else
> backporting a follow up fix sounds easier than a complete rework. There
> are quite some callers of is_migration_entry and the patch won't be
> trivial either. So can we focus on the fix first please?

Btw, if the overhead is a concern then we can add a jump label and
make the code active only while the OOM is in progress. We already
count all oom victims so we have clear entry and exit points. This
still sounds easier than teaching every is_migration_entry caller about a
new migration entry type and handling it properly, not to mention making
everybody aware of it for future callers of is_migration_entry.
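
As an illustration only (this is not part of the posted patches, and the
victim-count hook points are assumptions), the jump label variant could look
something like:

#include <linux/jump_label.h>
#include <linux/mm.h>
#include <linux/sched/coredump.h>       /* MMF_UNSTABLE */

/* Enabled only while at least one oom victim exists. */
static DEFINE_STATIC_KEY_FALSE(oom_in_progress);

/* Hypothetical hook points, corresponding to where victims start and stop
 * being counted (e.g. mark_oom_victim()/exit_oom_victim()). */
static void oom_victim_inc(void)
{
        static_branch_inc(&oom_in_progress);
}

static void oom_victim_dec(void)
{
        static_branch_dec(&oom_in_progress);
}

static inline int check_stable_address_space(struct mm_struct *mm)
{
        /* A single nop in the fast path unless an OOM kill is in flight. */
        if (static_branch_unlikely(&oom_in_progress) &&
            unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
                return VM_FAULT_SIGBUS;
        return 0;
}
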
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-07 11:38   ` Michal Hocko
@ 2017-08-11  2:28     ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-11  2:28 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> +/*
> + * Checks whether a page fault on the given mm is still reliable.
> + * This is no longer true if the oom reaper started to reap the
> + * address space which is reflected by MMF_UNSTABLE flag set in
> + * the mm. At that moment any !shared mapping would lose the content
> + * and could cause a memory corruption (zero pages instead of the
> + * original content).
> + *
> + * User should call this before establishing a page table entry for
> + * a !shared mapping and under the proper page table lock.
> + *
> + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> + */
> +static inline int check_stable_address_space(struct mm_struct *mm)
> +{
> +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> +		return VM_FAULT_SIGBUS;
> +	return 0;
> +}
> +

Will you explain the mechanism why random values are written instead of zeros
so that this patch can actually fix the race problem? I consider that writing
random values (though they seem to be a portion of the process image) instead
of zeros to a file might cause a security problem, and the patch that fixes it
should be backportable to stable kernels.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  2:28     ` Tetsuo Handa
@ 2017-08-11  7:09       ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-11  7:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > +/*
> > + * Checks whether a page fault on the given mm is still reliable.
> > + * This is no longer true if the oom reaper started to reap the
> > + * address space which is reflected by MMF_UNSTABLE flag set in
> > + * the mm. At that moment any !shared mapping would lose the content
> > + * and could cause a memory corruption (zero pages instead of the
> > + * original content).
> > + *
> > + * User should call this before establishing a page table entry for
> > + * a !shared mapping and under the proper page table lock.
> > + *
> > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > + */
> > +static inline int check_stable_address_space(struct mm_struct *mm)
> > +{
> > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > +		return VM_FAULT_SIGBUS;
> > +	return 0;
> > +}
> > +
> 
> Will you explain the mechanism why random values are written instead of zeros
> so that this patch can actually fix the race problem?

I am not sure what you mean here. Were you able to see a write with an
unexpected content?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  7:09       ` Michal Hocko
@ 2017-08-11  7:54         ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-11  7:54 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > +/*
> > > + * Checks whether a page fault on the given mm is still reliable.
> > > + * This is no longer true if the oom reaper started to reap the
> > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > + * the mm. At that moment any !shared mapping would lose the content
> > > + * and could cause a memory corruption (zero pages instead of the
> > > + * original content).
> > > + *
> > > + * User should call this before establishing a page table entry for
> > > + * a !shared mapping and under the proper page table lock.
> > > + *
> > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > + */
> > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > +{
> > > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > +		return VM_FAULT_SIGBUS;
> > > +	return 0;
> > > +}
> > > +
> > 
> > Will you explain the mechanism why random values are written instead of zeros
> > so that this patch can actually fix the race problem?
> 
> I am not sure what you mean here. Were you able to see a write with an
> unexpected content?

Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  7:54         ` Tetsuo Handa
@ 2017-08-11 10:22           ` Andrea Arcangeli
  -1 siblings, 0 replies; 58+ messages in thread
From: Andrea Arcangeli @ 2017-08-11 10:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri, Aug 11, 2017 at 04:54:36PM +0900, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > +/*
> > > > + * Checks whether a page fault on the given mm is still reliable.
> > > > + * This is no longer true if the oom reaper started to reap the
> > > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > > + * the mm. At that moment any !shared mapping would lose the content
> > > > + * and could cause a memory corruption (zero pages instead of the
> > > > + * original content).
> > > > + *
> > > > + * User should call this before establishing a page table entry for
> > > > + * a !shared mapping and under the proper page table lock.
> > > > + *
> > > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > > + */
> > > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > > +{
> > > > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > > +		return VM_FAULT_SIGBUS;
> > > > +	return 0;
> > > > +}
> > > > +
> > > 
> > > Will you explain the mechanism why random values are written instead of zeros
> > > so that this patch can actually fix the race problem?
> > 
> > I am not sure what you mean here. Were you able to see a write with an
> > unexpected content?
> 
> Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .

The oom reaper depends on userland not possibly running anymore in any
thread associated with the reaped "mm" by the time wake_oom_reaper is
called, and I'm not sure do_send_sig_info comes anywhere close to providing
such a guarantee. The problem is that the reschedule seems async, see
native_smp_send_reschedule invoked by kick_process. So perhaps the
thread is running with a corrupted stack for a little while until the
IPI arrives at its destination. I guess it wouldn't be reproducible
without a large NUMA system.

That said, I looked at the assembly of your program and I don't see
anything in the file_writer that could load data from the stack by the
time it starts to write(), and clearly the sigkill and
smp_send_reschedule() will happen after it's already in the write()
tight loop. The only thing it loads from the user stack after it
reaches the tight loop is the canary, which should then crash it if it
breaks out of the write loop, which still wouldn't cause a write.

So I don't see much of an explanation on the VM side, but perhaps it's
possible this is a filesystem bug that enlarges the i_size before
issuing the write that gets a SIGBUS in copy_from_user because
MMF_UNSTABLE is set at first access? And then it leaves i_size enlarged,
and what you're seeing in od -b is leaked content from an uninitialized
disk block? This would happen on ext4 as well if mounted with -o
journal=data instead of -o journal=ordered, in fact. Perhaps you simply
have a filesystem that isn't mounted with journal=ordered semantics
and this isn't the OOM killer.

Also, why are you using octal output? -x would be more intuitive for the
0xff (377) values, which are to be expected (it should be all zeros or
0xff, and some zeros show up too).

Assuming those values that are neither zero nor 0xff are simply due to
the lack of ordered journaling mode and are deleted file data (you clearly
must not have an ssd with -o discard or it'd be zero there), even if you
only saw zeroes it wouldn't concern me any less.

The non-zeroes and non-0xff values, if they happen beyond the end of the
previous i_size, concern me less because they're at least less
obviously going to create sticky data corruption in an OOM-killed
database. The database could handle it by recording the valid i_size
it successfully expanded the file to, with userland journaling in its
own user metadata.

Those expected zeroes that show up in your dump are the real major
issue here, and they show up as well. A database that hits OOM would
then get persistent sticky memory corruption in user data that
could break the entire userland journaling, and you could notice only
much later too.

An OOM deadlock is certainly preferable here. Rebooting on an OOM hang is
totally ok and a very minor issue, as the user journaling is guaranteed
to be preserved. Writing random zeroes to shared storage may break the
whole thing instead, and you may only notice at the next reboot to upgrade
the kernel that the db journaling fails and nothing starts, and you could
have lost data too.

Back to your previous xfs OOM reaper timeout failure: one way around
it is to implement a down_read_trylock_unfair that obtains a
read lock ignoring any write waiter. That breaks fairness, but done only
in the OOM reaper it would not be a
concern. down_read_trylock_unfair should solve this xfs lockup
involving khugepaged without the need to remove the mmap_sem from the
OOM reaper while mm_users > 0. A problem would then remain if the OOM
selected task is allocating memory and stuck on an xfs lock taken by
shrink_slab while holding the mmap_sem for writing. This is why my
preference would be to dig into xfs and solve the source of the OOM
lockup at its core, as the OOM reaper is kicking the can down the
road, and ultimately if the process runs on pure
MAP_ANONYMOUS|MAP_SHARED kicking the can won't move it one bit, unless
the OOM reaper starts to reap shmem too, expanding even more with more
checks and stuff, when the fix for xfs would ultimately be simpler,
more self-contained and targeted.

I would like it to be possible to tell which kernel thread has
to be allowed to make progress, lowering the wmark for it, to unstick the
TIF_MEMDIE task. For kernel threads this could involve adding a
pf_memalloc_pid dependency that is accessible at OOM time. Workqueues
submitted in PF_MEMALLOC context could set this pf_memalloc_pid
dependency in the worker threads themselves; fs kernel threads would
need the filesystem to set this pid dependency. So if the TIF_MEMDIE pid
matches the current kernel thread's pf_memalloc_pid, the kernel thread's
allocation would inherit PF_MEMALLOC wmark privileges, by artificially
lowering the wmark for the TIF_MEMDIE task.

Or we could simply stop calling shrink_slab for fs-dependent slab
caches in direct reclaim, with a per-shrinker flag, and offload those
to kswapd only. That would be a really simple change, much simpler than
the current OOM reaper, which is unsafe despite being simpler.
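
As a rough sketch of that idea (the flag name and the helper below are made
up, not an existing kernel interface):

#include <linux/shrinker.h>
#include <linux/swap.h>         /* current_is_kswapd() */

#define SHRINKER_KSWAPD_ONLY    (1 << 2)        /* hypothetical flag bit */

static unsigned long shrink_one_slab(struct shrinker *shrinker,
                                     struct shrink_control *sc)
{
        /*
         * In direct reclaim (i.e. not kswapd), skip shrinkers that may block
         * on fs locks the OOM victim could be holding; leave them to kswapd.
         */
        if ((shrinker->flags & SHRINKER_KSWAPD_ONLY) && !current_is_kswapd())
                return 0;

        return shrinker->scan_objects(shrinker, sc);
}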

There are several dozen MB of RAM available when the system
hangs and fails to get rid of the TIF_MEMDIE task; the problem is that they
must be given to the kernel thread that the TIF_MEMDIE task is waiting for,
and we can't rely on lockdep to sort it out, or it's too slow.

Refusing to fix the fs hangs and relying solely on the OOM reaper
ultimately causes the OOM reaper to keep escalating, to the point where not
even down_read_trylock_unfair would suffice anymore and it would need
to zap pagetables without holding the mmap_sem at all (for example in
order to solve your same xfs OOM hang, which would still remain if
shrink_slab runs in direct reclaim inside an mmap_sem-for-writing
section, like while allocating a vma in mmap).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 10:22           ` Andrea Arcangeli
@ 2017-08-11 10:42             ` Andrea Arcangeli
  -1 siblings, 0 replies; 58+ messages in thread
From: Andrea Arcangeli @ 2017-08-11 10:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> disk block? This would happen on ext4 as well if mounted with -o
> journal=data instead of -o journal=ordered in fact, perhaps you simply

Oops, above I meant journal=writeback; journal=data is even stronger
than journal=ordered, of course.

And I shall clarify further that old disk content can only show up
legitimately with journal=writeback after a hard reboot, a crash or in
general an unclean unmount. Even if there's no journaling at all
(e.g. ext2/vfat), old disk content cannot show up at any given time,
no matter what, if there's no unclean unmount that requires a journal
replay.

This theory of a completely unrelated fs bug showing you disk content,
as a result of the OOM-reaper-induced SIGBUS interrupting a
copy_from_user at its very start, is purely motivated by the fact
that, like Michal, I didn't see much explanation on the VM side that
could cause those not-zero not-0xff values to show up in the buffer of
the write syscall. You can try changing the filesystem and see if it
happens again, to rule it out. If it always happens regardless of the
filesystem used, then of course it's likely not an fs bug. You've got
an entire, aligned 4k fs block showing up that data.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 10:42             ` Andrea Arcangeli
@ 2017-08-11 11:53               ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-11 11:53 UTC (permalink / raw)
  To: aarcange; +Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Andrea Arcangeli wrote:
> On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> > disk block? This would happen on ext4 as well if mounted with -o
> > journal=data instead of -o journal=ordered in fact, perhaps you simply
> 
> Oops, above I meant journal=writeback; journal=data is even stronger
> than journal=ordered, of course.
> 
> And I shall clarify further that old disk content can only show up
> legitimately with journal=writeback after a hard reboot, a crash or in
> general an unclean unmount. Even if there's no journaling at all
> (e.g. ext2/vfat), old disk content cannot show up at any given time,
> no matter what, if there's no unclean unmount that requires a journal
> replay.

I'm using XFS on a small non-NUMA system (4 CPUs / 4096MB RAM).

  /dev/sda1 / xfs rw,relatime,attr2,inode64,noquota 0 0

As far as I tested, not-zero not-0xff values did not show up with the
4.6.7 kernel (i.e. all not-0xff bytes were zero), while they do show
up with the 4.13.0-rc4-next-20170811 kernel.

> 
> This theory of a completely unrelated fs bug showing you disk content,
> as a result of the OOM-reaper-induced SIGBUS interrupting a
> copy_from_user at its very start, is purely motivated by the fact
> that, like Michal, I didn't see much explanation on the VM side that
> could cause those not-zero not-0xff values to show up in the buffer of
> the write syscall. You can try changing the filesystem and see if it
> happens again, to rule it out. If it always happens regardless of the
> filesystem used, then of course it's likely not an fs bug. You've got
> an entire, aligned 4k fs block showing up that data.
> 

What is strange is that, as far as I tested, the pattern of not-zero
not-0xff bytes always seems to be the same. Such a thing is unlikely
to happen if old content on the disk were showing up by chance. Maybe
the content written is not random but a specific 4096-byte chunk of
the memory image of an executable file.

$ cat checker.c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char buffer2[64] = { };
        int ret = 0;
        int i;

        /*
         * Scan the files written by a.out.  The test writer is expected
         * to write only 0xff bytes, so any other byte value found in
         * these files indicates corruption; report it, and delete only
         * the clean files so the corrupted ones survive for inspection.
         */
        for (i = 0; i < 1024; i++) {
                int flag = 0;
                int fd;
                unsigned int byte[256];
                int j;

                snprintf(buffer2, sizeof(buffer2), "/tmp/file.%u", i);
                fd = open(buffer2, O_RDONLY);
                if (fd == -1)
                        continue;
                memset(byte, 0, sizeof(byte));
                while (1) {
                        static unsigned char buffer[1048576];
                        int len = read(fd, (char *) buffer, sizeof(buffer));

                        if (len <= 0)
                                break;
                        /* Count every byte value except the expected 0xff. */
                        for (j = 0; j < len; j++)
                                if (buffer[j] != 0xFF)
                                        byte[buffer[j]]++;
                }
                close(fd);
                /* Report every value seen (0xff was already excluded above). */
                for (j = 0; j < 255; j++)
                        if (byte[j]) {
                                printf("ERROR: %u %u in %s\n", byte[j], j, buffer2);
                                flag = 1;
                        }
                if (flag == 0)
                        unlink(buffer2);
                else
                        ret = 1;
        }
        return ret;
}
$ uname -r
4.13.0-rc4-next-20170811
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.4
$ /bin/rm /tmp/file.4
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.6
$ /bin/rm /tmp/file.6
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.0
$ /bin/rm /tmp/file.0
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.4
ERROR: 40 1 in /tmp/file.4
ERROR: 53 2 in /tmp/file.4
ERROR: 29 3 in /tmp/file.4
ERROR: 27 4 in /tmp/file.4
ERROR: 5 5 in /tmp/file.4
ERROR: 14 6 in /tmp/file.4
ERROR: 8 7 in /tmp/file.4
ERROR: 16 8 in /tmp/file.4
ERROR: 4 9 in /tmp/file.4
ERROR: 12 10 in /tmp/file.4
ERROR: 4 11 in /tmp/file.4
ERROR: 2 12 in /tmp/file.4
ERROR: 10 13 in /tmp/file.4
ERROR: 13 14 in /tmp/file.4
ERROR: 4 15 in /tmp/file.4
ERROR: 26 16 in /tmp/file.4
ERROR: 5 17 in /tmp/file.4
ERROR: 23 18 in /tmp/file.4
ERROR: 4 19 in /tmp/file.4
ERROR: 8 20 in /tmp/file.4
ERROR: 2 21 in /tmp/file.4
ERROR: 1 22 in /tmp/file.4
ERROR: 2 23 in /tmp/file.4
ERROR: 17 24 in /tmp/file.4
ERROR: 5 25 in /tmp/file.4
ERROR: 2 26 in /tmp/file.4
ERROR: 1 27 in /tmp/file.4
ERROR: 3 28 in /tmp/file.4
ERROR: 17 32 in /tmp/file.4
ERROR: 1 35 in /tmp/file.4
ERROR: 1 36 in /tmp/file.4
ERROR: 2 38 in /tmp/file.4
ERROR: 5 40 in /tmp/file.4
ERROR: 1 41 in /tmp/file.4
ERROR: 3 45 in /tmp/file.4
ERROR: 65 46 in /tmp/file.4
ERROR: 2 48 in /tmp/file.4
ERROR: 4 49 in /tmp/file.4
ERROR: 24 50 in /tmp/file.4
ERROR: 3 51 in /tmp/file.4
ERROR: 4 52 in /tmp/file.4
ERROR: 12 53 in /tmp/file.4
ERROR: 2 54 in /tmp/file.4
ERROR: 1 55 in /tmp/file.4
ERROR: 5 56 in /tmp/file.4
ERROR: 1 60 in /tmp/file.4
ERROR: 75 64 in /tmp/file.4
ERROR: 5 65 in /tmp/file.4
ERROR: 17 66 in /tmp/file.4
ERROR: 19 67 in /tmp/file.4
ERROR: 5 68 in /tmp/file.4
ERROR: 6 69 in /tmp/file.4
ERROR: 3 70 in /tmp/file.4
ERROR: 13 71 in /tmp/file.4
ERROR: 18 73 in /tmp/file.4
ERROR: 3 74 in /tmp/file.4
ERROR: 17 76 in /tmp/file.4
ERROR: 7 77 in /tmp/file.4
ERROR: 5 78 in /tmp/file.4
ERROR: 4 79 in /tmp/file.4
ERROR: 1 80 in /tmp/file.4
ERROR: 4 82 in /tmp/file.4
ERROR: 2 83 in /tmp/file.4
ERROR: 13 84 in /tmp/file.4
ERROR: 1 85 in /tmp/file.4
ERROR: 1 86 in /tmp/file.4
ERROR: 1 89 in /tmp/file.4
ERROR: 2 94 in /tmp/file.4
ERROR: 118 95 in /tmp/file.4
ERROR: 24 96 in /tmp/file.4
ERROR: 54 97 in /tmp/file.4
ERROR: 14 98 in /tmp/file.4
ERROR: 18 99 in /tmp/file.4
ERROR: 29 100 in /tmp/file.4
ERROR: 57 101 in /tmp/file.4
ERROR: 16 102 in /tmp/file.4
ERROR: 15 103 in /tmp/file.4
ERROR: 9 104 in /tmp/file.4
ERROR: 48 105 in /tmp/file.4
ERROR: 1 106 in /tmp/file.4
ERROR: 2 107 in /tmp/file.4
ERROR: 30 108 in /tmp/file.4
ERROR: 22 109 in /tmp/file.4
ERROR: 43 110 in /tmp/file.4
ERROR: 29 111 in /tmp/file.4
ERROR: 13 112 in /tmp/file.4
ERROR: 56 114 in /tmp/file.4
ERROR: 42 115 in /tmp/file.4
ERROR: 65 116 in /tmp/file.4
ERROR: 14 117 in /tmp/file.4
ERROR: 3 118 in /tmp/file.4
ERROR: 2 119 in /tmp/file.4
ERROR: 3 120 in /tmp/file.4
ERROR: 16 121 in /tmp/file.4
ERROR: 1 122 in /tmp/file.4
ERROR: 1 125 in /tmp/file.4
ERROR: 1 126 in /tmp/file.4
ERROR: 5 128 in /tmp/file.4
ERROR: 1 132 in /tmp/file.4
ERROR: 4 134 in /tmp/file.4
ERROR: 1 137 in /tmp/file.4
ERROR: 1 141 in /tmp/file.4
ERROR: 1 142 in /tmp/file.4
ERROR: 1 144 in /tmp/file.4
ERROR: 1 145 in /tmp/file.4
ERROR: 2 148 in /tmp/file.4
ERROR: 6 152 in /tmp/file.4
ERROR: 2 153 in /tmp/file.4
ERROR: 1 154 in /tmp/file.4
ERROR: 6 160 in /tmp/file.4
ERROR: 1 166 in /tmp/file.4
ERROR: 3 168 in /tmp/file.4
ERROR: 1 176 in /tmp/file.4
ERROR: 1 180 in /tmp/file.4
ERROR: 1 181 in /tmp/file.4
ERROR: 3 184 in /tmp/file.4
ERROR: 1 188 in /tmp/file.4
ERROR: 4 192 in /tmp/file.4
ERROR: 1 193 in /tmp/file.4
ERROR: 1 198 in /tmp/file.4
ERROR: 3 200 in /tmp/file.4
ERROR: 2 208 in /tmp/file.4
ERROR: 1 216 in /tmp/file.4
ERROR: 1 223 in /tmp/file.4
ERROR: 4 224 in /tmp/file.4
ERROR: 1 227 in /tmp/file.4
ERROR: 1 236 in /tmp/file.4
ERROR: 1 237 in /tmp/file.4
ERROR: 4 241 in /tmp/file.4
ERROR: 1 243 in /tmp/file.4
ERROR: 1 244 in /tmp/file.4
ERROR: 1 245 in /tmp/file.4
ERROR: 1 246 in /tmp/file.4
ERROR: 2 248 in /tmp/file.4
ERROR: 1 249 in /tmp/file.4
ERROR: 1 254 in /tmp/file.4
$ od -cb /tmp/file.4
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
        377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
600000000   -   1   1   )  \0  \0   .   s   y   m   t   a   b  \0   .   s
        055 061 061 051 000 000 056 163 171 155 164 141 142 000 056 163
600000020   t   r   t   a   b  \0   .   s   h   s   t   r   t   a   b  \0
        164 162 164 141 142 000 056 163 150 163 164 162 164 141 142 000
600000040   .   i   n   t   e   r   p  \0   .   n   o   t   e   .   A   B
        056 151 156 164 145 162 160 000 056 156 157 164 145 056 101 102
600000060   I   -   t   a   g  \0   .   n   o   t   e   .   g   n   u   .
        111 055 164 141 147 000 056 156 157 164 145 056 147 156 165 056
600000100   b   u   i   l   d   -   i   d  \0   .   g   n   u   .   h   a
        142 165 151 154 144 055 151 144 000 056 147 156 165 056 150 141
600000120   s   h  \0   .   d   y   n   s   y   m  \0   .   d   y   n   s
        163 150 000 056 144 171 156 163 171 155 000 056 144 171 156 163
600000140   t   r  \0   .   g   n   u   .   v   e   r   s   i   o   n  \0
        164 162 000 056 147 156 165 056 166 145 162 163 151 157 156 000
600000160   .   g   n   u   .   v   e   r   s   i   o   n   _   r  \0   .
        056 147 156 165 056 166 145 162 163 151 157 156 137 162 000 056
600000200   r   e   l   a   .   d   y   n  \0   .   r   e   l   a   .   p
        162 145 154 141 056 144 171 156 000 056 162 145 154 141 056 160
600000220   l   t  \0   .   i   n   i   t  \0   .   t   e   x   t  \0   .
        154 164 000 056 151 156 151 164 000 056 164 145 170 164 000 056
600000240   f   i   n   i  \0   .   r   o   d   a   t   a  \0   .   e   h
        146 151 156 151 000 056 162 157 144 141 164 141 000 056 145 150
600000260   _   f   r   a   m   e   _   h   d   r  \0   .   e   h   _   f
        137 146 162 141 155 145 137 150 144 162 000 056 145 150 137 146
600000300   r   a   m   e  \0   .   i   n   i   t   _   a   r   r   a   y
        162 141 155 145 000 056 151 156 151 164 137 141 162 162 141 171
600000320  \0   .   f   i   n   i   _   a   r   r   a   y  \0   .   j   c
        000 056 146 151 156 151 137 141 162 162 141 171 000 056 152 143
600000340   r  \0   .   d   y   n   a   m   i   c  \0   .   g   o   t  \0
        162 000 056 144 171 156 141 155 151 143 000 056 147 157 164 000
600000360   .   g   o   t   .   p   l   t  \0   .   d   a   t   a  \0   .
        056 147 157 164 056 160 154 164 000 056 144 141 164 141 000 056
600000400   b   s   s  \0   .   c   o   m   m   e   n   t  \0  \0  \0  \0
        142 163 163 000 056 143 157 155 155 145 156 164 000 000 000 000
600000420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600000440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 001  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 001 000
600000460   8 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        070 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000500  \0  \0  \0  \0 003  \0 002  \0   T 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 002 000 124 002 100 000 000 000 000 000
600000520  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 003  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 003 000
600000540   t 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        164 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000560  \0  \0  \0  \0 003  \0 004  \0 230 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 004 000 230 002 100 000 000 000 000 000
600000600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 005  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 005 000
600000620 270 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        270 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000640  \0  \0  \0  \0 003  \0 006  \0  \b 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 006 000 010 004 100 000 000 000 000 000
600000660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \a  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 007 000
600000700 206 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000720  \0  \0  \0  \0 003  \0  \b  \0 250 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 010 000 250 004 100 000 000 000 000 000
600000740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \t  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 011 000
600000760 310 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        310 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001000  \0  \0  \0  \0 003  \0  \n  \0 340 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 012 000 340 004 100 000 000 000 000 000
600001020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \v  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 013 000
600001040 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001060  \0  \0  \0  \0 003  \0  \f  \0   @ 006   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 014 000 100 006 100 000 000 000 000 000
600001100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \r  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 015 000
600001120      \a   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 007 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001140  \0  \0  \0  \0 003  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 016 000 024 012 100 000 000 000 000 000
600001160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 017  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 017 000
600001200      \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001220  \0  \0  \0  \0 003  \0 020  \0   @  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 020 000 100 012 100 000 000 000 000 000
600001240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 021  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 021 000
600001260 200  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001300  \0  \0  \0  \0 003  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 022 000 020 016 140 000 000 000 000 000
600001320  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 023  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 023 000
600001340 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001360  \0  \0  \0  \0 003  \0 024  \0     016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 024 000 040 016 140 000 000 000 000 000
600001400  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 025  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 025 000
600001420   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001440  \0  \0  \0  \0 003  \0 026  \0 370 017   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 026 000 370 017 140 000 000 000 000 000
600001460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 027  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 027 000
600001500  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001520  \0  \0  \0  \0 003  \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 030 000 200 020 140 000 000 000 000 000
600001540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 031  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 031 000
600001560 240 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001600  \0  \0  \0  \0 003  \0 032  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 003 000 032 000 000 000 000 000 000 000 000 000
600001620  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 001 000 000 000 004 000 361 377
600001640  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600001660  \b  \0  \0  \0 002  \0  \r  \0  \0  \t   @  \0  \0  \0  \0  \0
        010 000 000 000 002 000 015 000 000 011 100 000 000 000 000 000
600001700 221  \0  \0  \0  \0  \0  \0  \0 024  \0  \0  \0 001  \0 031  \0
        221 000 000 000 000 000 000 000 024 000 000 000 001 000 031 000
600001720 300 020   `  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0
        300 020 140 000 000 000 000 000 000 000 020 000 000 000 000 000
600001740      \0  \0  \0 001  \0 030  \0 220 020   `  \0  \0  \0  \0  \0
        040 000 000 000 001 000 030 000 220 020 140 000 000 000 000 000
600001760  \b  \0  \0  \0  \0  \0  \0  \0   (  \0  \0  \0 004  \0 361 377
        010 000 000 000 000 000 000 000 050 000 000 000 004 000 361 377
600002000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002020   3  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        063 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002040  \0  \0  \0  \0  \0  \0  \0  \0   @  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 100 000 000 000 002 000 015 000
600002060   @  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        100 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002100   U  \0  \0  \0 002  \0  \r  \0   p  \b   @  \0  \0  \0  \0  \0
        125 000 000 000 002 000 015 000 160 010 100 000 000 000 000 000
600002120  \0  \0  \0  \0  \0  \0  \0  \0   h  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 150 000 000 000 002 000 015 000
600002140 260  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        260 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002160   ~  \0  \0  \0 001  \0 031  \0 240 020   `  \0  \0  \0  \0  \0
        176 000 000 000 001 000 031 000 240 020 140 000 000 000 000 000
600002200 001  \0  \0  \0  \0  \0  \0  \0 215  \0  \0  \0 001  \0 023  \0
        001 000 000 000 000 000 000 000 215 000 000 000 001 000 023 000
600002220 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002240 264  \0  \0  \0 002  \0  \r  \0 320  \b   @  \0  \0  \0  \0  \0
        264 000 000 000 002 000 015 000 320 010 100 000 000 000 000 000
600002260  \0  \0  \0  \0  \0  \0  \0  \0 300  \0  \0  \0 001  \0 022  \0
        000 000 000 000 000 000 000 000 300 000 000 000 001 000 022 000
600002300 020 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        020 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002320   (  \0  \0  \0 004  \0 361 377  \0  \0  \0  \0  \0  \0  \0  \0
        050 000 000 000 004 000 361 377 000 000 000 000 000 000 000 000
600002340  \0  \0  \0  \0  \0  \0  \0  \0 337  \0  \0  \0 001  \0 021  \0
        000 000 000 000 000 000 000 000 337 000 000 000 001 000 021 000
600002360 300  \v   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        300 013 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002400 355  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        355 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 000 000 000 000 004 000 361 377
600002440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002460 371  \0  \0  \0  \0  \0 022  \0 030 016   `  \0  \0  \0  \0  \0
        371 000 000 000 000 000 022 000 030 016 140 000 000 000 000 000
600002500  \0  \0  \0  \0  \0  \0  \0  \0  \n 001  \0  \0 001  \0 025  \0
        000 000 000 000 000 000 000 000 012 001 000 000 001 000 025 000
600002520   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002540 023 001  \0  \0  \0  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        023 001 000 000 000 000 022 000 020 016 140 000 000 000 000 000
600002560  \0  \0  \0  \0  \0  \0  \0  \0   & 001  \0  \0 001  \0 027  \0
        000 000 000 000 000 000 000 000 046 001 000 000 001 000 027 000
600002600  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002620   < 001  \0  \0 022  \0  \r  \0 020  \n   @  \0  \0  \0  \0  \0
        074 001 000 000 022 000 015 000 020 012 100 000 000 000 000 000
600002640 002  \0  \0  \0  \0  \0  \0  \0   L 001  \0  \0      \0  \0  \0
        002 000 000 000 000 000 000 000 114 001 000 000 040 000 000 000
600002660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002700   h 001  \0  \0      \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        150 001 000 000 040 000 030 000 200 020 140 000 000 000 000 000
600002720  \0  \0  \0  \0  \0  \0  \0  \0   s 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 163 001 000 000 022 000 000 000
600002740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002760 206 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003000  \0  \0  \0  \0  \0  \0  \0  \0 231 001  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 231 001 000 000 020 000 030 000
600003020 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003040 240 001  \0  \0 022  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        240 001 000 000 022 000 016 000 024 012 100 000 000 000 000 000
600003060  \0  \0  \0  \0  \0  \0  \0  \0 246 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 246 001 000 000 022 000 000 000
600003100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003120 274 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        274 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003140  \0  \0  \0  \0  \0  \0  \0  \0 320 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 320 001 000 000 022 000 000 000
600003160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003200 343 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        343 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003220  \0  \0  \0  \0  \0  \0  \0  \0 365 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 365 001 000 000 022 000 000 000
600003240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003260  \a 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        007 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003300  \0  \0  \0  \0  \0  \0  \0  \0   & 002  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 046 002 000 000 020 000 030 000
600003320 200 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003340   3 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        063 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600003360  \0  \0  \0  \0  \0  \0  \0  \0   B 002  \0  \0 021 002 017  \0
        000 000 000 000 000 000 000 000 102 002 000 000 021 002 017 000
600003400   (  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003420   O 002  \0  \0 021  \0 017  \0      \n   @  \0  \0  \0  \0  \0
        117 002 000 000 021 000 017 000 040 012 100 000 000 000 000 000
600003440 004  \0  \0  \0  \0  \0  \0  \0   ^ 002  \0  \0 022  \0  \0  \0
        004 000 000 000 000 000 000 000 136 002 000 000 022 000 000 000
600003460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003500   p 002  \0  \0 022  \0  \r  \0 240  \t   @  \0  \0  \0  \0  \0
        160 002 000 000 022 000 015 000 240 011 100 000 000 000 000 000
600003520   e  \0  \0  \0  \0  \0  \0  \0 200 002  \0  \0 022  \0  \0  \0
        145 000 000 000 000 000 000 000 200 002 000 000 022 000 000 000
600003540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003560 224 002  \0  \0 020  \0 031  \0 300 020   p  \0  \0  \0  \0  \0
        224 002 000 000 020 000 031 000 300 020 160 000 000 000 000 000
600003600  \0  \0  \0  \0  \0  \0  \0  \0 231 002  \0  \0 022  \0  \r  \0
        000 000 000 000 000 000 000 000 231 002 000 000 022 000 015 000
600003620 023  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        023 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003640 240 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003660  \0  \0  \0  \0  \0  \0  \0  \0 265 002  \0  \0 020  \0 031  \0
        000 000 000 000 000 000 000 000 265 002 000 000 020 000 031 000
600003700 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003720 301 002  \0  \0 022  \0  \r  \0      \a   @  \0  \0  \0  \0  \0
        301 002 000 000 022 000 015 000 040 007 100 000 000 000 000 000
600003740 363  \0  \0  \0  \0  \0  \0  \0 306 002  \0  \0 022  \0  \0  \0
        363 000 000 000 000 000 000 000 306 002 000 000 022 000 000 000
600003760  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600004000 330 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        330 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004020  \0  \0  \0  \0  \0  \0  \0  \0 354 002  \0  \0 021 002 030  \0
        000 000 000 000 000 000 000 000 354 002 000 000 021 002 030 000
600004040 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600004060 370 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        370 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004100  \0  \0  \0  \0  \0  \0  \0  \0 022 003  \0  \0 022  \0  \v  \0
        000 000 000 000 000 000 000 000 022 003 000 000 022 000 013 000
600004120 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600004140  \0   0   8   0   4   .   c  \0   f   i   l   e   _   w   r   i
        000 060 070 060 064 056 143 000 146 151 154 145 137 167 162 151
600004160   t   e   r  \0   b   u   f   f   e   r   .   4   7   6   1  \0
        164 145 162 000 142 165 146 146 145 162 056 064 067 066 061 000
600004200   p   i   p   e   _   f   d  \0   c   r   t   s   t   u   f   f
        160 151 160 145 137 146 144 000 143 162 164 163 164 165 146 146
600004220   .   c  \0   _   _   J   C   R   _   L   I   S   T   _   _  \0
        056 143 000 137 137 112 103 122 137 114 111 123 124 137 137 000
600004240   d   e   r   e   g   i   s   t   e   r   _   t   m   _   c   l
        144 145 162 145 147 151 163 164 145 162 137 164 155 137 143 154
600004260   o   n   e   s  \0   r   e   g   i   s   t   e   r   _   t   m
        157 156 145 163 000 162 145 147 151 163 164 145 162 137 164 155
600004300   _   c   l   o   n   e   s  \0   _   _   d   o   _   g   l   o
        137 143 154 157 156 145 163 000 137 137 144 157 137 147 154 157
600004320   b   a   l   _   d   t   o   r   s   _   a   u   x  \0   c   o
        142 141 154 137 144 164 157 162 163 137 141 165 170 000 143 157
600004340   m   p   l   e   t   e   d   .   6   3   4   4  \0   _   _   d
        155 160 154 145 164 145 144 056 066 063 064 064 000 137 137 144
600004360   o   _   g   l   o   b   a   l   _   d   t   o   r   s   _   a
        157 137 147 154 157 142 141 154 137 144 164 157 162 163 137 141
600004400   u   x   _   f   i   n   i   _   a   r   r   a   y   _   e   n
        165 170 137 146 151 156 151 137 141 162 162 141 171 137 145 156
600004420   t   r   y  \0   f   r   a   m   e   _   d   u   m   m   y  \0
        164 162 171 000 146 162 141 155 145 137 144 165 155 155 171 000
600004440   _   _   f   r   a   m   e   _   d   u   m   m   y   _   i   n
        137 137 146 162 141 155 145 137 144 165 155 155 171 137 151 156
600004460   i   t   _   a   r   r   a   y   _   e   n   t   r   y  \0   _
        151 164 137 141 162 162 141 171 137 145 156 164 162 171 000 137
600004500   _   F   R   A   M   E   _   E   N   D   _   _  \0   _   _   J
        137 106 122 101 115 105 137 105 116 104 137 137 000 137 137 112
600004520   C   R   _   E   N   D   _   _  \0   _   _   i   n   i   t   _
        103 122 137 105 116 104 137 137 000 137 137 151 156 151 164 137
600004540   a   r   r   a   y   _   e   n   d  \0   _   D   Y   N   A   M
        141 162 162 141 171 137 145 156 144 000 137 104 131 116 101 115
600004560   I   C  \0   _   _   i   n   i   t   _   a   r   r   a   y   _
        111 103 000 137 137 151 156 151 164 137 141 162 162 141 171 137
600004600   s   t   a   r   t  \0   _   G   L   O   B   A   L   _   O   F
        163 164 141 162 164 000 137 107 114 117 102 101 114 137 117 106
600004620   F   S   E   T   _   T   A   B   L   E   _  \0   _   _   l   i
        106 123 105 124 137 124 101 102 114 105 137 000 137 137 154 151
600004640   b   c   _   c   s   u   _   f   i   n   i  \0   _   I   T   M
        142 143 137 143 163 165 137 146 151 156 151 000 137 111 124 115
600004660   _   d   e   r   e   g   i   s   t   e   r   T   M   C   l   o
        137 144 145 162 145 147 151 163 164 145 162 124 115 103 154 157
600004700   n   e   T   a   b   l   e  \0   d   a   t   a   _   s   t   a
        156 145 124 141 142 154 145 000 144 141 164 141 137 163 164 141
600004720   r   t  \0   c   l   o   n   e   @   @   G   L   I   B   C   _
        162 164 000 143 154 157 156 145 100 100 107 114 111 102 103 137
600004740   2   .   2   .   5  \0   w   r   i   t   e   @   @   G   L   I
        062 056 062 056 065 000 167 162 151 164 145 100 100 107 114 111
600004760   B   C   _   2   .   2   .   5  \0   _   e   d   a   t   a  \0
        102 103 137 062 056 062 056 065 000 137 145 144 141 164 141 000
600005000   _   f   i   n   i  \0   s   n   p   r   i   n   t   f   @   @
        137 146 151 156 151 000 163 156 160 162 151 156 164 146 100 100
600005020   G   L   I   B   C   _   2   .   2   .   5  \0   m   e   m   s
        107 114 111 102 103 137 062 056 062 056 065 000 155 145 155 163
600005040   e   t   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        145 164 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005060   c   l   o   s   e   @   @   G   L   I   B   C   _   2   .   2
        143 154 157 163 145 100 100 107 114 111 102 103 137 062 056 062
600005100   .   5  \0   p   i   p   e   @   @   G   L   I   B   C   _   2
        056 065 000 160 151 160 145 100 100 107 114 111 102 103 137 062
600005120   .   2   .   5  \0   r   e   a   d   @   @   G   L   I   B   C
        056 062 056 065 000 162 145 141 144 100 100 107 114 111 102 103
600005140   _   2   .   2   .   5  \0   _   _   l   i   b   c   _   s   t
        137 062 056 062 056 065 000 137 137 154 151 142 143 137 163 164
600005160   a   r   t   _   m   a   i   n   @   @   G   L   I   B   C   _
        141 162 164 137 155 141 151 156 100 100 107 114 111 102 103 137
600005200   2   .   2   .   5  \0   _   _   d   a   t   a   _   s   t   a
        062 056 062 056 065 000 137 137 144 141 164 141 137 163 164 141
600005220   r   t  \0   _   _   g   m   o   n   _   s   t   a   r   t   _
        162 164 000 137 137 147 155 157 156 137 163 164 141 162 164 137
600005240   _  \0   _   _   d   s   o   _   h   a   n   d   l   e  \0   _
        137 000 137 137 144 163 157 137 150 141 156 144 154 145 000 137
600005260   I   O   _   s   t   d   i   n   _   u   s   e   d  \0   k   i
        111 117 137 163 164 144 151 156 137 165 163 145 144 000 153 151
600005300   l   l   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        154 154 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005320   _   _   l   i   b   c   _   c   s   u   _   i   n   i   t  \0
        137 137 154 151 142 143 137 143 163 165 137 151 156 151 164 000
600005340   m   a   l   l   o   c   @   @   G   L   I   B   C   _   2   .
        155 141 154 154 157 143 100 100 107 114 111 102 103 137 062 056
600005360   2   .   5  \0   _   e   n   d  \0   _   s   t   a   r   t  \0
        062 056 065 000 137 145 156 144 000 137 163 164 141 162 164 000
600005400   r   e   a   l   l   o   c   @   @   G   L   I   B   C   _   2
        162 145 141 154 154 157 143 100 100 107 114 111 102 103 137 062
600005420   .   2   .   5  \0   _   _   b   s   s   _   s   t   a   r   t
        056 062 056 065 000 137 137 142 163 163 137 163 164 141 162 164
600005440  \0   m   a   i   n  \0   o   p   e   n   @   @   G   L   I   B
        000 155 141 151 156 000 157 160 145 156 100 100 107 114 111 102
600005460   C   _   2   .   2   .   5  \0   _   J   v   _   R   e   g   i
        103 137 062 056 062 056 065 000 137 112 166 137 122 145 147 151
600005500   s   t   e   r   C   l   a   s   s   e   s  \0   _   _   T   M
        163 164 145 162 103 154 141 163 163 145 163 000 137 137 124 115
600005520   C   _   E   N   D   _   _  \0   _   I   T   M   _   r   e   g
        103 137 105 116 104 137 137 000 137 111 124 115 137 162 145 147
600005540   i   s   t   e   r   T   M   C   l   o   n   e   T   a   b   l
        151 163 164 145 162 124 115 103 154 157 156 145 124 141 142 154
600005560   e  \0   _   i   n   i   t  \0  \0  \0  \0  \0  \0  \0  \0  \0
        145 000 137 151 156 151 164 000 000 000 000 000 000 000 000 000
600005600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600005660  \0  \0  \0  \0  \0  \0  \0  \0 033  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 033 000 000 000 001 000 000 000
600005700 002  \0  \0  \0  \0  \0  \0  \0   8 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 070 002 100 000 000 000 000 000
600005720   8 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        070 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600005740  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600005760  \0  \0  \0  \0  \0  \0  \0  \0   #  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 043 000 000 000 007 000 000 000
600006000 002  \0  \0  \0  \0  \0  \0  \0   T 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 124 002 100 000 000 000 000 000
600006020   T 002  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        124 002 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006040  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006060  \0  \0  \0  \0  \0  \0  \0  \0   1  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 061 000 000 000 007 000 000 000
600006100 002  \0  \0  \0  \0  \0  \0  \0   t 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 164 002 100 000 000 000 000 000
600006120   t 002  \0  \0  \0  \0  \0  \0   $  \0  \0  \0  \0  \0  \0  \0
        164 002 000 000 000 000 000 000 044 000 000 000 000 000 000 000
600006140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006160  \0  \0  \0  \0  \0  \0  \0  \0   D  \0  \0  \0 366 377 377   o
        000 000 000 000 000 000 000 000 104 000 000 000 366 377 377 157
600006200 002  \0  \0  \0  \0  \0  \0  \0 230 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 230 002 100 000 000 000 000 000
600006220 230 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        230 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006240 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006260  \0  \0  \0  \0  \0  \0  \0  \0   N  \0  \0  \0  \v  \0  \0  \0
        000 000 000 000 000 000 000 000 116 000 000 000 013 000 000 000
600006300 002  \0  \0  \0  \0  \0  \0  \0 270 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 270 002 100 000 000 000 000 000
600006320 270 002  \0  \0  \0  \0  \0  \0   P 001  \0  \0  \0  \0  \0  \0
        270 002 000 000 000 000 000 000 120 001 000 000 000 000 000 000
600006340 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006360 030  \0  \0  \0  \0  \0  \0  \0   V  \0  \0  \0 003  \0  \0  \0
        030 000 000 000 000 000 000 000 126 000 000 000 003 000 000 000
600006400 002  \0  \0  \0  \0  \0  \0  \0  \b 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 010 004 100 000 000 000 000 000
600006420  \b 004  \0  \0  \0  \0  \0  \0   }  \0  \0  \0  \0  \0  \0  \0
        010 004 000 000 000 000 000 000 175 000 000 000 000 000 000 000
600006440  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600006460  \0  \0  \0  \0  \0  \0  \0  \0   ^  \0  \0  \0 377 377 377   o
        000 000 000 000 000 000 000 000 136 000 000 000 377 377 377 157
600006500 002  \0  \0  \0  \0  \0  \0  \0 206 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 206 004 100 000 000 000 000 000
600006520 206 004  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        206 004 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006540 005  \0  \0  \0  \0  \0  \0  \0 002  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 002 000 000 000 000 000 000 000
600006560 002  \0  \0  \0  \0  \0  \0  \0   k  \0  \0  \0 376 377 377   o
        002 000 000 000 000 000 000 000 153 000 000 000 376 377 377 157
600006600 002  \0  \0  \0  \0  \0  \0  \0 250 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 250 004 100 000 000 000 000 000
600006620 250 004  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        250 004 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006640 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006660  \0  \0  \0  \0  \0  \0  \0  \0   z  \0  \0  \0 004  \0  \0  \0
        000 000 000 000 000 000 000 000 172 000 000 000 004 000 000 000
600006700 002  \0  \0  \0  \0  \0  \0  \0 310 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 310 004 100 000 000 000 000 000
600006720 310 004  \0  \0  \0  \0  \0  \0 030  \0  \0  \0  \0  \0  \0  \0
        310 004 000 000 000 000 000 000 030 000 000 000 000 000 000 000
600006740 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006760 030  \0  \0  \0  \0  \0  \0  \0 204  \0  \0  \0 004  \0  \0  \0
        030 000 000 000 000 000 000 000 204 000 000 000 004 000 000 000
600007000   B  \0  \0  \0  \0  \0  \0  \0 340 004   @  \0  \0  \0  \0  \0
        102 000 000 000 000 000 000 000 340 004 100 000 000 000 000 000
600007020 340 004  \0  \0  \0  \0  \0  \0   8 001  \0  \0  \0  \0  \0  \0
        340 004 000 000 000 000 000 000 070 001 000 000 000 000 000 000
600007040 005  \0  \0  \0  \f  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 014 000 000 000 010 000 000 000 000 000 000 000
600007060 030  \0  \0  \0  \0  \0  \0  \0 216  \0  \0  \0 001  \0  \0  \0
        030 000 000 000 000 000 000 000 216 000 000 000 001 000 000 000
600007100 006  \0  \0  \0  \0  \0  \0  \0 030 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 030 006 100 000 000 000 000 000
600007120 030 006  \0  \0  \0  \0  \0  \0 032  \0  \0  \0  \0  \0  \0  \0
        030 006 000 000 000 000 000 000 032 000 000 000 000 000 000 000
600007140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007160  \0  \0  \0  \0  \0  \0  \0  \0 211  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 211 000 000 000 001 000 000 000
600007200 006  \0  \0  \0  \0  \0  \0  \0   @ 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 100 006 100 000 000 000 000 000
600007220   @ 006  \0  \0  \0  \0  \0  \0 340  \0  \0  \0  \0  \0  \0  \0
        100 006 000 000 000 000 000 000 340 000 000 000 000 000 000 000
600007240  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007260 020  \0  \0  \0  \0  \0  \0  \0 224  \0  \0  \0 001  \0  \0  \0
        020 000 000 000 000 000 000 000 224 000 000 000 001 000 000 000
600007300 006  \0  \0  \0  \0  \0  \0  \0      \a   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 040 007 100 000 000 000 000 000
600007320      \a  \0  \0  \0  \0  \0  \0 364 002  \0  \0  \0  \0  \0  \0
        040 007 000 000 000 000 000 000 364 002 000 000 000 000 000 000
600007340  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007360  \0  \0  \0  \0  \0  \0  \0  \0 232  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 232 000 000 000 001 000 000 000
600007400 006  \0  \0  \0  \0  \0  \0  \0 024  \n   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 024 012 100 000 000 000 000 000
600007420 024  \n  \0  \0  \0  \0  \0  \0  \t  \0  \0  \0  \0  \0  \0  \0
        024 012 000 000 000 000 000 000 011 000 000 000 000 000 000 000
600007440  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007460  \0  \0  \0  \0  \0  \0  \0  \0 240  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 240 000 000 000 001 000 000 000
600007500  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600010000
$ mv /tmp/file.4 /tmp/file.4.old
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.2
ERROR: 40 1 in /tmp/file.2
ERROR: 53 2 in /tmp/file.2
ERROR: 29 3 in /tmp/file.2
ERROR: 27 4 in /tmp/file.2
ERROR: 5 5 in /tmp/file.2
ERROR: 14 6 in /tmp/file.2
ERROR: 8 7 in /tmp/file.2
ERROR: 16 8 in /tmp/file.2
ERROR: 4 9 in /tmp/file.2
ERROR: 12 10 in /tmp/file.2
ERROR: 4 11 in /tmp/file.2
ERROR: 2 12 in /tmp/file.2
ERROR: 10 13 in /tmp/file.2
ERROR: 13 14 in /tmp/file.2
ERROR: 4 15 in /tmp/file.2
ERROR: 26 16 in /tmp/file.2
ERROR: 5 17 in /tmp/file.2
ERROR: 23 18 in /tmp/file.2
ERROR: 4 19 in /tmp/file.2
ERROR: 8 20 in /tmp/file.2
ERROR: 2 21 in /tmp/file.2
ERROR: 1 22 in /tmp/file.2
ERROR: 2 23 in /tmp/file.2
ERROR: 17 24 in /tmp/file.2
ERROR: 5 25 in /tmp/file.2
ERROR: 2 26 in /tmp/file.2
ERROR: 1 27 in /tmp/file.2
ERROR: 3 28 in /tmp/file.2
ERROR: 17 32 in /tmp/file.2
ERROR: 1 35 in /tmp/file.2
ERROR: 1 36 in /tmp/file.2
ERROR: 2 38 in /tmp/file.2
ERROR: 5 40 in /tmp/file.2
ERROR: 1 41 in /tmp/file.2
ERROR: 3 45 in /tmp/file.2
ERROR: 65 46 in /tmp/file.2
ERROR: 2 48 in /tmp/file.2
ERROR: 4 49 in /tmp/file.2
ERROR: 24 50 in /tmp/file.2
ERROR: 3 51 in /tmp/file.2
ERROR: 4 52 in /tmp/file.2
ERROR: 12 53 in /tmp/file.2
ERROR: 2 54 in /tmp/file.2
ERROR: 1 55 in /tmp/file.2
ERROR: 5 56 in /tmp/file.2
ERROR: 1 60 in /tmp/file.2
ERROR: 75 64 in /tmp/file.2
ERROR: 5 65 in /tmp/file.2
ERROR: 17 66 in /tmp/file.2
ERROR: 19 67 in /tmp/file.2
ERROR: 5 68 in /tmp/file.2
ERROR: 6 69 in /tmp/file.2
ERROR: 3 70 in /tmp/file.2
ERROR: 13 71 in /tmp/file.2
ERROR: 18 73 in /tmp/file.2
ERROR: 3 74 in /tmp/file.2
ERROR: 17 76 in /tmp/file.2
ERROR: 7 77 in /tmp/file.2
ERROR: 5 78 in /tmp/file.2
ERROR: 4 79 in /tmp/file.2
ERROR: 1 80 in /tmp/file.2
ERROR: 4 82 in /tmp/file.2
ERROR: 2 83 in /tmp/file.2
ERROR: 13 84 in /tmp/file.2
ERROR: 1 85 in /tmp/file.2
ERROR: 1 86 in /tmp/file.2
ERROR: 1 89 in /tmp/file.2
ERROR: 2 94 in /tmp/file.2
ERROR: 118 95 in /tmp/file.2
ERROR: 24 96 in /tmp/file.2
ERROR: 54 97 in /tmp/file.2
ERROR: 14 98 in /tmp/file.2
ERROR: 18 99 in /tmp/file.2
ERROR: 29 100 in /tmp/file.2
ERROR: 57 101 in /tmp/file.2
ERROR: 16 102 in /tmp/file.2
ERROR: 15 103 in /tmp/file.2
ERROR: 9 104 in /tmp/file.2
ERROR: 48 105 in /tmp/file.2
ERROR: 1 106 in /tmp/file.2
ERROR: 2 107 in /tmp/file.2
ERROR: 30 108 in /tmp/file.2
ERROR: 22 109 in /tmp/file.2
ERROR: 43 110 in /tmp/file.2
ERROR: 29 111 in /tmp/file.2
ERROR: 13 112 in /tmp/file.2
ERROR: 56 114 in /tmp/file.2
ERROR: 42 115 in /tmp/file.2
ERROR: 65 116 in /tmp/file.2
ERROR: 14 117 in /tmp/file.2
ERROR: 3 118 in /tmp/file.2
ERROR: 2 119 in /tmp/file.2
ERROR: 3 120 in /tmp/file.2
ERROR: 16 121 in /tmp/file.2
ERROR: 1 122 in /tmp/file.2
ERROR: 1 125 in /tmp/file.2
ERROR: 1 126 in /tmp/file.2
ERROR: 5 128 in /tmp/file.2
ERROR: 1 132 in /tmp/file.2
ERROR: 4 134 in /tmp/file.2
ERROR: 1 137 in /tmp/file.2
ERROR: 1 141 in /tmp/file.2
ERROR: 1 142 in /tmp/file.2
ERROR: 1 144 in /tmp/file.2
ERROR: 1 145 in /tmp/file.2
ERROR: 2 148 in /tmp/file.2
ERROR: 6 152 in /tmp/file.2
ERROR: 2 153 in /tmp/file.2
ERROR: 1 154 in /tmp/file.2
ERROR: 6 160 in /tmp/file.2
ERROR: 1 166 in /tmp/file.2
ERROR: 3 168 in /tmp/file.2
ERROR: 1 176 in /tmp/file.2
ERROR: 1 180 in /tmp/file.2
ERROR: 1 181 in /tmp/file.2
ERROR: 3 184 in /tmp/file.2
ERROR: 1 188 in /tmp/file.2
ERROR: 4 192 in /tmp/file.2
ERROR: 1 193 in /tmp/file.2
ERROR: 1 198 in /tmp/file.2
ERROR: 3 200 in /tmp/file.2
ERROR: 2 208 in /tmp/file.2
ERROR: 1 216 in /tmp/file.2
ERROR: 1 223 in /tmp/file.2
ERROR: 4 224 in /tmp/file.2
ERROR: 1 227 in /tmp/file.2
ERROR: 1 236 in /tmp/file.2
ERROR: 1 237 in /tmp/file.2
ERROR: 4 241 in /tmp/file.2
ERROR: 1 243 in /tmp/file.2
ERROR: 1 244 in /tmp/file.2
ERROR: 1 245 in /tmp/file.2
ERROR: 1 246 in /tmp/file.2
ERROR: 2 248 in /tmp/file.2
ERROR: 1 249 in /tmp/file.2
ERROR: 1 254 in /tmp/file.2
ERROR: 2549 0 in /tmp/file.7
ERROR: 40 1 in /tmp/file.7
ERROR: 53 2 in /tmp/file.7
ERROR: 29 3 in /tmp/file.7
ERROR: 27 4 in /tmp/file.7
ERROR: 5 5 in /tmp/file.7
ERROR: 14 6 in /tmp/file.7
ERROR: 8 7 in /tmp/file.7
ERROR: 16 8 in /tmp/file.7
ERROR: 4 9 in /tmp/file.7
ERROR: 12 10 in /tmp/file.7
ERROR: 4 11 in /tmp/file.7
ERROR: 2 12 in /tmp/file.7
ERROR: 10 13 in /tmp/file.7
ERROR: 13 14 in /tmp/file.7
ERROR: 4 15 in /tmp/file.7
ERROR: 26 16 in /tmp/file.7
ERROR: 5 17 in /tmp/file.7
ERROR: 23 18 in /tmp/file.7
ERROR: 4 19 in /tmp/file.7
ERROR: 8 20 in /tmp/file.7
ERROR: 2 21 in /tmp/file.7
ERROR: 1 22 in /tmp/file.7
ERROR: 2 23 in /tmp/file.7
ERROR: 17 24 in /tmp/file.7
ERROR: 5 25 in /tmp/file.7
ERROR: 2 26 in /tmp/file.7
ERROR: 1 27 in /tmp/file.7
ERROR: 3 28 in /tmp/file.7
ERROR: 17 32 in /tmp/file.7
ERROR: 1 35 in /tmp/file.7
ERROR: 1 36 in /tmp/file.7
ERROR: 2 38 in /tmp/file.7
ERROR: 5 40 in /tmp/file.7
ERROR: 1 41 in /tmp/file.7
ERROR: 3 45 in /tmp/file.7
ERROR: 65 46 in /tmp/file.7
ERROR: 2 48 in /tmp/file.7
ERROR: 4 49 in /tmp/file.7
ERROR: 24 50 in /tmp/file.7
ERROR: 3 51 in /tmp/file.7
ERROR: 4 52 in /tmp/file.7
ERROR: 12 53 in /tmp/file.7
ERROR: 2 54 in /tmp/file.7
ERROR: 1 55 in /tmp/file.7
ERROR: 5 56 in /tmp/file.7
ERROR: 1 60 in /tmp/file.7
ERROR: 75 64 in /tmp/file.7
ERROR: 5 65 in /tmp/file.7
ERROR: 17 66 in /tmp/file.7
ERROR: 19 67 in /tmp/file.7
ERROR: 5 68 in /tmp/file.7
ERROR: 6 69 in /tmp/file.7
ERROR: 3 70 in /tmp/file.7
ERROR: 13 71 in /tmp/file.7
ERROR: 18 73 in /tmp/file.7
ERROR: 3 74 in /tmp/file.7
ERROR: 17 76 in /tmp/file.7
ERROR: 7 77 in /tmp/file.7
ERROR: 5 78 in /tmp/file.7
ERROR: 4 79 in /tmp/file.7
ERROR: 1 80 in /tmp/file.7
ERROR: 4 82 in /tmp/file.7
ERROR: 2 83 in /tmp/file.7
ERROR: 13 84 in /tmp/file.7
ERROR: 1 85 in /tmp/file.7
ERROR: 1 86 in /tmp/file.7
ERROR: 1 89 in /tmp/file.7
ERROR: 2 94 in /tmp/file.7
ERROR: 118 95 in /tmp/file.7
ERROR: 24 96 in /tmp/file.7
ERROR: 54 97 in /tmp/file.7
ERROR: 14 98 in /tmp/file.7
ERROR: 18 99 in /tmp/file.7
ERROR: 29 100 in /tmp/file.7
ERROR: 57 101 in /tmp/file.7
ERROR: 16 102 in /tmp/file.7
ERROR: 15 103 in /tmp/file.7
ERROR: 9 104 in /tmp/file.7
ERROR: 48 105 in /tmp/file.7
ERROR: 1 106 in /tmp/file.7
ERROR: 2 107 in /tmp/file.7
ERROR: 30 108 in /tmp/file.7
ERROR: 22 109 in /tmp/file.7
ERROR: 43 110 in /tmp/file.7
ERROR: 29 111 in /tmp/file.7
ERROR: 13 112 in /tmp/file.7
ERROR: 56 114 in /tmp/file.7
ERROR: 42 115 in /tmp/file.7
ERROR: 65 116 in /tmp/file.7
ERROR: 14 117 in /tmp/file.7
ERROR: 3 118 in /tmp/file.7
ERROR: 2 119 in /tmp/file.7
ERROR: 3 120 in /tmp/file.7
ERROR: 16 121 in /tmp/file.7
ERROR: 1 122 in /tmp/file.7
ERROR: 1 125 in /tmp/file.7
ERROR: 1 126 in /tmp/file.7
ERROR: 5 128 in /tmp/file.7
ERROR: 1 132 in /tmp/file.7
ERROR: 4 134 in /tmp/file.7
ERROR: 1 137 in /tmp/file.7
ERROR: 1 141 in /tmp/file.7
ERROR: 1 142 in /tmp/file.7
ERROR: 1 144 in /tmp/file.7
ERROR: 1 145 in /tmp/file.7
ERROR: 2 148 in /tmp/file.7
ERROR: 6 152 in /tmp/file.7
ERROR: 2 153 in /tmp/file.7
ERROR: 1 154 in /tmp/file.7
ERROR: 6 160 in /tmp/file.7
ERROR: 1 166 in /tmp/file.7
ERROR: 3 168 in /tmp/file.7
ERROR: 1 176 in /tmp/file.7
ERROR: 1 180 in /tmp/file.7
ERROR: 1 181 in /tmp/file.7
ERROR: 3 184 in /tmp/file.7
ERROR: 1 188 in /tmp/file.7
ERROR: 4 192 in /tmp/file.7
ERROR: 1 193 in /tmp/file.7
ERROR: 1 198 in /tmp/file.7
ERROR: 3 200 in /tmp/file.7
ERROR: 2 208 in /tmp/file.7
ERROR: 1 216 in /tmp/file.7
ERROR: 1 223 in /tmp/file.7
ERROR: 4 224 in /tmp/file.7
ERROR: 1 227 in /tmp/file.7
ERROR: 1 236 in /tmp/file.7
ERROR: 1 237 in /tmp/file.7
ERROR: 4 241 in /tmp/file.7
ERROR: 1 243 in /tmp/file.7
ERROR: 1 244 in /tmp/file.7
ERROR: 1 245 in /tmp/file.7
ERROR: 1 246 in /tmp/file.7
ERROR: 2 248 in /tmp/file.7
ERROR: 1 249 in /tmp/file.7
ERROR: 1 254 in /tmp/file.7

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-11 11:53               ` Tetsuo Handa
  0 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-11 11:53 UTC (permalink / raw)
  To: aarcange; +Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Andrea Arcangeli wrote:
> On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> > disk block? This would happen on ext4 as well if mounted with -o
> > journal=data instead of -o journal=ordered in fact, perhaps you simply
> 
> Oops above I meant journal=writeback, journal=data is even stronger
> than journal=ordered of course.
> 
> And I shall clarify further that old disk content can only show up
> legitimately on journal=writeback after a hard reboot or crash or, in
> general, an unclean unmount. Even if there's no journaling at all
> (i.e. ext2/vfat), old disk content cannot show up at any given time no
> matter what if there's no unclean unmount that requires a journal
> replay.

I'm using XFS on a small non-NUMA system (4 CPUs / 4096MB RAM).

  /dev/sda1 / xfs rw,relatime,attr2,inode64,noquota 0 0

As far as I tested, not-zero not-0xff values did not show up with the 4.6.7
kernel (i.e. all not-0xff bytes were zero), while not-zero not-0xff values
do show up with the 4.13.0-rc4-next-20170811 kernel.

> 
> This theory of a completely unrelated fs bug showing you disk content
> as a result of the OOM reaper induced SIGBUS interrupting a
> copy_from_user at its very start is purely motivated by the fact that,
> like Michal, I didn't see much explanation on the VM side that could
> cause those not-zero not-0xff values showing up in the buffer of the
> write syscall. You can try to change the fs and see if it happens again
> to rule it out. If it always happens regardless of the filesystem used,
> then it's likely not a fs bug of course. You've got an entire, aligned
> 4k fs block showing up as that data.
> 

What is strange is that, as far as I tested, the pattern of not-zero
not-0xff bytes always seems to be the same. Such a thing is unlikely to
happen if old content on the disk were showing up by chance. Maybe the
content written is not random but a specific 4096-byte chunk of the
memory image of an executable file.
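
One way to check that guess is to look for ELF section-name strings (e.g.
".shstrtab") inside the corrupted file; a minimal sketch, assuming the same
/tmp/file.N naming as the checker below:

/* elfscan.c: minimal sketch, not part of the original reproducer.
 * Searches a file for the ".shstrtab" section-name string, which would
 * suggest the stray 4096-byte block is a piece of an ELF image. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	const char *path = argc > 1 ? argv[1] : "/tmp/file.4";
	const char needle[] = ".shstrtab";
	struct stat st;
	char *map, *hit;
	int fd = open(path, O_RDONLY);

	if (fd == -1 || fstat(fd, &st) || st.st_size <= 0)
		return 1;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	hit = memmem(map, st.st_size, needle, sizeof(needle) - 1);
	if (hit)
		printf("ELF-looking string at offset %ld\n", (long)(hit - map));
	munmap(map, st.st_size);
	close(fd);
	return hit ? 0 : 1;
}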

$ cat checker.c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	char buffer2[64] = { };
	int ret = 0;
	int i;

	/* Scan the /tmp/file.0 ... /tmp/file.1023 files left behind by the writer. */
	for (i = 0; i < 1024; i++) {
		int flag = 0;
		int fd;
		unsigned int byte[256];
		int j;

		snprintf(buffer2, sizeof(buffer2), "/tmp/file.%u", i);
		fd = open(buffer2, O_RDONLY);
		if (fd == -1)
			continue;
		memset(byte, 0, sizeof(byte));
		while (1) {
			static unsigned char buffer[1048576];
			int len = read(fd, (char *) buffer, sizeof(buffer));

			if (len <= 0)
				break;
			/* 0xff is the expected fill value; count everything else. */
			for (j = 0; j < len; j++)
				if (buffer[j] != 0xFF)
					byte[buffer[j]]++;
		}
		close(fd);
		/* Report how many times each unexpected byte value occurred. */
		for (j = 0; j < 255; j++)
			if (byte[j]) {
				printf("ERROR: %u %u in %s\n", byte[j], j, buffer2);
				flag = 1;
			}
		if (flag == 0)
			unlink(buffer2);	/* clean file, remove it */
		else
			ret = 1;		/* keep the corrupted file for inspection */
	}
	return ret;
}
$ uname -r
4.13.0-rc4-next-20170811
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.4
$ /bin/rm /tmp/file.4
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.6
$ /bin/rm /tmp/file.6
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.0
$ /bin/rm /tmp/file.0
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.4
ERROR: 40 1 in /tmp/file.4
ERROR: 53 2 in /tmp/file.4
ERROR: 29 3 in /tmp/file.4
ERROR: 27 4 in /tmp/file.4
ERROR: 5 5 in /tmp/file.4
ERROR: 14 6 in /tmp/file.4
ERROR: 8 7 in /tmp/file.4
ERROR: 16 8 in /tmp/file.4
ERROR: 4 9 in /tmp/file.4
ERROR: 12 10 in /tmp/file.4
ERROR: 4 11 in /tmp/file.4
ERROR: 2 12 in /tmp/file.4
ERROR: 10 13 in /tmp/file.4
ERROR: 13 14 in /tmp/file.4
ERROR: 4 15 in /tmp/file.4
ERROR: 26 16 in /tmp/file.4
ERROR: 5 17 in /tmp/file.4
ERROR: 23 18 in /tmp/file.4
ERROR: 4 19 in /tmp/file.4
ERROR: 8 20 in /tmp/file.4
ERROR: 2 21 in /tmp/file.4
ERROR: 1 22 in /tmp/file.4
ERROR: 2 23 in /tmp/file.4
ERROR: 17 24 in /tmp/file.4
ERROR: 5 25 in /tmp/file.4
ERROR: 2 26 in /tmp/file.4
ERROR: 1 27 in /tmp/file.4
ERROR: 3 28 in /tmp/file.4
ERROR: 17 32 in /tmp/file.4
ERROR: 1 35 in /tmp/file.4
ERROR: 1 36 in /tmp/file.4
ERROR: 2 38 in /tmp/file.4
ERROR: 5 40 in /tmp/file.4
ERROR: 1 41 in /tmp/file.4
ERROR: 3 45 in /tmp/file.4
ERROR: 65 46 in /tmp/file.4
ERROR: 2 48 in /tmp/file.4
ERROR: 4 49 in /tmp/file.4
ERROR: 24 50 in /tmp/file.4
ERROR: 3 51 in /tmp/file.4
ERROR: 4 52 in /tmp/file.4
ERROR: 12 53 in /tmp/file.4
ERROR: 2 54 in /tmp/file.4
ERROR: 1 55 in /tmp/file.4
ERROR: 5 56 in /tmp/file.4
ERROR: 1 60 in /tmp/file.4
ERROR: 75 64 in /tmp/file.4
ERROR: 5 65 in /tmp/file.4
ERROR: 17 66 in /tmp/file.4
ERROR: 19 67 in /tmp/file.4
ERROR: 5 68 in /tmp/file.4
ERROR: 6 69 in /tmp/file.4
ERROR: 3 70 in /tmp/file.4
ERROR: 13 71 in /tmp/file.4
ERROR: 18 73 in /tmp/file.4
ERROR: 3 74 in /tmp/file.4
ERROR: 17 76 in /tmp/file.4
ERROR: 7 77 in /tmp/file.4
ERROR: 5 78 in /tmp/file.4
ERROR: 4 79 in /tmp/file.4
ERROR: 1 80 in /tmp/file.4
ERROR: 4 82 in /tmp/file.4
ERROR: 2 83 in /tmp/file.4
ERROR: 13 84 in /tmp/file.4
ERROR: 1 85 in /tmp/file.4
ERROR: 1 86 in /tmp/file.4
ERROR: 1 89 in /tmp/file.4
ERROR: 2 94 in /tmp/file.4
ERROR: 118 95 in /tmp/file.4
ERROR: 24 96 in /tmp/file.4
ERROR: 54 97 in /tmp/file.4
ERROR: 14 98 in /tmp/file.4
ERROR: 18 99 in /tmp/file.4
ERROR: 29 100 in /tmp/file.4
ERROR: 57 101 in /tmp/file.4
ERROR: 16 102 in /tmp/file.4
ERROR: 15 103 in /tmp/file.4
ERROR: 9 104 in /tmp/file.4
ERROR: 48 105 in /tmp/file.4
ERROR: 1 106 in /tmp/file.4
ERROR: 2 107 in /tmp/file.4
ERROR: 30 108 in /tmp/file.4
ERROR: 22 109 in /tmp/file.4
ERROR: 43 110 in /tmp/file.4
ERROR: 29 111 in /tmp/file.4
ERROR: 13 112 in /tmp/file.4
ERROR: 56 114 in /tmp/file.4
ERROR: 42 115 in /tmp/file.4
ERROR: 65 116 in /tmp/file.4
ERROR: 14 117 in /tmp/file.4
ERROR: 3 118 in /tmp/file.4
ERROR: 2 119 in /tmp/file.4
ERROR: 3 120 in /tmp/file.4
ERROR: 16 121 in /tmp/file.4
ERROR: 1 122 in /tmp/file.4
ERROR: 1 125 in /tmp/file.4
ERROR: 1 126 in /tmp/file.4
ERROR: 5 128 in /tmp/file.4
ERROR: 1 132 in /tmp/file.4
ERROR: 4 134 in /tmp/file.4
ERROR: 1 137 in /tmp/file.4
ERROR: 1 141 in /tmp/file.4
ERROR: 1 142 in /tmp/file.4
ERROR: 1 144 in /tmp/file.4
ERROR: 1 145 in /tmp/file.4
ERROR: 2 148 in /tmp/file.4
ERROR: 6 152 in /tmp/file.4
ERROR: 2 153 in /tmp/file.4
ERROR: 1 154 in /tmp/file.4
ERROR: 6 160 in /tmp/file.4
ERROR: 1 166 in /tmp/file.4
ERROR: 3 168 in /tmp/file.4
ERROR: 1 176 in /tmp/file.4
ERROR: 1 180 in /tmp/file.4
ERROR: 1 181 in /tmp/file.4
ERROR: 3 184 in /tmp/file.4
ERROR: 1 188 in /tmp/file.4
ERROR: 4 192 in /tmp/file.4
ERROR: 1 193 in /tmp/file.4
ERROR: 1 198 in /tmp/file.4
ERROR: 3 200 in /tmp/file.4
ERROR: 2 208 in /tmp/file.4
ERROR: 1 216 in /tmp/file.4
ERROR: 1 223 in /tmp/file.4
ERROR: 4 224 in /tmp/file.4
ERROR: 1 227 in /tmp/file.4
ERROR: 1 236 in /tmp/file.4
ERROR: 1 237 in /tmp/file.4
ERROR: 4 241 in /tmp/file.4
ERROR: 1 243 in /tmp/file.4
ERROR: 1 244 in /tmp/file.4
ERROR: 1 245 in /tmp/file.4
ERROR: 1 246 in /tmp/file.4
ERROR: 2 248 in /tmp/file.4
ERROR: 1 249 in /tmp/file.4
ERROR: 1 254 in /tmp/file.4
$ od -cb /tmp/file.4
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
        377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
600000000   -   1   1   )  \0  \0   .   s   y   m   t   a   b  \0   .   s
        055 061 061 051 000 000 056 163 171 155 164 141 142 000 056 163
600000020   t   r   t   a   b  \0   .   s   h   s   t   r   t   a   b  \0
        164 162 164 141 142 000 056 163 150 163 164 162 164 141 142 000
600000040   .   i   n   t   e   r   p  \0   .   n   o   t   e   .   A   B
        056 151 156 164 145 162 160 000 056 156 157 164 145 056 101 102
600000060   I   -   t   a   g  \0   .   n   o   t   e   .   g   n   u   .
        111 055 164 141 147 000 056 156 157 164 145 056 147 156 165 056
600000100   b   u   i   l   d   -   i   d  \0   .   g   n   u   .   h   a
        142 165 151 154 144 055 151 144 000 056 147 156 165 056 150 141
600000120   s   h  \0   .   d   y   n   s   y   m  \0   .   d   y   n   s
        163 150 000 056 144 171 156 163 171 155 000 056 144 171 156 163
600000140   t   r  \0   .   g   n   u   .   v   e   r   s   i   o   n  \0
        164 162 000 056 147 156 165 056 166 145 162 163 151 157 156 000
600000160   .   g   n   u   .   v   e   r   s   i   o   n   _   r  \0   .
        056 147 156 165 056 166 145 162 163 151 157 156 137 162 000 056
600000200   r   e   l   a   .   d   y   n  \0   .   r   e   l   a   .   p
        162 145 154 141 056 144 171 156 000 056 162 145 154 141 056 160
600000220   l   t  \0   .   i   n   i   t  \0   .   t   e   x   t  \0   .
        154 164 000 056 151 156 151 164 000 056 164 145 170 164 000 056
600000240   f   i   n   i  \0   .   r   o   d   a   t   a  \0   .   e   h
        146 151 156 151 000 056 162 157 144 141 164 141 000 056 145 150
600000260   _   f   r   a   m   e   _   h   d   r  \0   .   e   h   _   f
        137 146 162 141 155 145 137 150 144 162 000 056 145 150 137 146
600000300   r   a   m   e  \0   .   i   n   i   t   _   a   r   r   a   y
        162 141 155 145 000 056 151 156 151 164 137 141 162 162 141 171
600000320  \0   .   f   i   n   i   _   a   r   r   a   y  \0   .   j   c
        000 056 146 151 156 151 137 141 162 162 141 171 000 056 152 143
600000340   r  \0   .   d   y   n   a   m   i   c  \0   .   g   o   t  \0
        162 000 056 144 171 156 141 155 151 143 000 056 147 157 164 000
600000360   .   g   o   t   .   p   l   t  \0   .   d   a   t   a  \0   .
        056 147 157 164 056 160 154 164 000 056 144 141 164 141 000 056
600000400   b   s   s  \0   .   c   o   m   m   e   n   t  \0  \0  \0  \0
        142 163 163 000 056 143 157 155 155 145 156 164 000 000 000 000
600000420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600000440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 001  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 001 000
600000460   8 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        070 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000500  \0  \0  \0  \0 003  \0 002  \0   T 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 002 000 124 002 100 000 000 000 000 000
600000520  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 003  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 003 000
600000540   t 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        164 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000560  \0  \0  \0  \0 003  \0 004  \0 230 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 004 000 230 002 100 000 000 000 000 000
600000600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 005  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 005 000
600000620 270 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        270 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000640  \0  \0  \0  \0 003  \0 006  \0  \b 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 006 000 010 004 100 000 000 000 000 000
600000660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \a  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 007 000
600000700 206 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000720  \0  \0  \0  \0 003  \0  \b  \0 250 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 010 000 250 004 100 000 000 000 000 000
600000740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \t  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 011 000
600000760 310 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        310 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001000  \0  \0  \0  \0 003  \0  \n  \0 340 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 012 000 340 004 100 000 000 000 000 000
600001020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \v  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 013 000
600001040 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001060  \0  \0  \0  \0 003  \0  \f  \0   @ 006   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 014 000 100 006 100 000 000 000 000 000
600001100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \r  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 015 000
600001120      \a   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 007 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001140  \0  \0  \0  \0 003  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 016 000 024 012 100 000 000 000 000 000
600001160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 017  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 017 000
600001200      \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001220  \0  \0  \0  \0 003  \0 020  \0   @  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 020 000 100 012 100 000 000 000 000 000
600001240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 021  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 021 000
600001260 200  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001300  \0  \0  \0  \0 003  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 022 000 020 016 140 000 000 000 000 000
600001320  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 023  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 023 000
600001340 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001360  \0  \0  \0  \0 003  \0 024  \0     016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 024 000 040 016 140 000 000 000 000 000
600001400  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 025  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 025 000
600001420   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001440  \0  \0  \0  \0 003  \0 026  \0 370 017   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 026 000 370 017 140 000 000 000 000 000
600001460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 027  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 027 000
600001500  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001520  \0  \0  \0  \0 003  \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 030 000 200 020 140 000 000 000 000 000
600001540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 031  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 031 000
600001560 240 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001600  \0  \0  \0  \0 003  \0 032  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 003 000 032 000 000 000 000 000 000 000 000 000
600001620  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 001 000 000 000 004 000 361 377
600001640  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600001660  \b  \0  \0  \0 002  \0  \r  \0  \0  \t   @  \0  \0  \0  \0  \0
        010 000 000 000 002 000 015 000 000 011 100 000 000 000 000 000
600001700 221  \0  \0  \0  \0  \0  \0  \0 024  \0  \0  \0 001  \0 031  \0
        221 000 000 000 000 000 000 000 024 000 000 000 001 000 031 000
600001720 300 020   `  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0
        300 020 140 000 000 000 000 000 000 000 020 000 000 000 000 000
600001740      \0  \0  \0 001  \0 030  \0 220 020   `  \0  \0  \0  \0  \0
        040 000 000 000 001 000 030 000 220 020 140 000 000 000 000 000
600001760  \b  \0  \0  \0  \0  \0  \0  \0   (  \0  \0  \0 004  \0 361 377
        010 000 000 000 000 000 000 000 050 000 000 000 004 000 361 377
600002000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002020   3  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        063 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002040  \0  \0  \0  \0  \0  \0  \0  \0   @  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 100 000 000 000 002 000 015 000
600002060   @  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        100 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002100   U  \0  \0  \0 002  \0  \r  \0   p  \b   @  \0  \0  \0  \0  \0
        125 000 000 000 002 000 015 000 160 010 100 000 000 000 000 000
600002120  \0  \0  \0  \0  \0  \0  \0  \0   h  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 150 000 000 000 002 000 015 000
600002140 260  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        260 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002160   ~  \0  \0  \0 001  \0 031  \0 240 020   `  \0  \0  \0  \0  \0
        176 000 000 000 001 000 031 000 240 020 140 000 000 000 000 000
600002200 001  \0  \0  \0  \0  \0  \0  \0 215  \0  \0  \0 001  \0 023  \0
        001 000 000 000 000 000 000 000 215 000 000 000 001 000 023 000
600002220 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002240 264  \0  \0  \0 002  \0  \r  \0 320  \b   @  \0  \0  \0  \0  \0
        264 000 000 000 002 000 015 000 320 010 100 000 000 000 000 000
600002260  \0  \0  \0  \0  \0  \0  \0  \0 300  \0  \0  \0 001  \0 022  \0
        000 000 000 000 000 000 000 000 300 000 000 000 001 000 022 000
600002300 020 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        020 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002320   (  \0  \0  \0 004  \0 361 377  \0  \0  \0  \0  \0  \0  \0  \0
        050 000 000 000 004 000 361 377 000 000 000 000 000 000 000 000
600002340  \0  \0  \0  \0  \0  \0  \0  \0 337  \0  \0  \0 001  \0 021  \0
        000 000 000 000 000 000 000 000 337 000 000 000 001 000 021 000
600002360 300  \v   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        300 013 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002400 355  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        355 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 000 000 000 000 004 000 361 377
600002440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002460 371  \0  \0  \0  \0  \0 022  \0 030 016   `  \0  \0  \0  \0  \0
        371 000 000 000 000 000 022 000 030 016 140 000 000 000 000 000
600002500  \0  \0  \0  \0  \0  \0  \0  \0  \n 001  \0  \0 001  \0 025  \0
        000 000 000 000 000 000 000 000 012 001 000 000 001 000 025 000
600002520   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002540 023 001  \0  \0  \0  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        023 001 000 000 000 000 022 000 020 016 140 000 000 000 000 000
600002560  \0  \0  \0  \0  \0  \0  \0  \0   & 001  \0  \0 001  \0 027  \0
        000 000 000 000 000 000 000 000 046 001 000 000 001 000 027 000
600002600  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002620   < 001  \0  \0 022  \0  \r  \0 020  \n   @  \0  \0  \0  \0  \0
        074 001 000 000 022 000 015 000 020 012 100 000 000 000 000 000
600002640 002  \0  \0  \0  \0  \0  \0  \0   L 001  \0  \0      \0  \0  \0
        002 000 000 000 000 000 000 000 114 001 000 000 040 000 000 000
600002660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002700   h 001  \0  \0      \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        150 001 000 000 040 000 030 000 200 020 140 000 000 000 000 000
600002720  \0  \0  \0  \0  \0  \0  \0  \0   s 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 163 001 000 000 022 000 000 000
600002740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002760 206 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003000  \0  \0  \0  \0  \0  \0  \0  \0 231 001  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 231 001 000 000 020 000 030 000
600003020 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003040 240 001  \0  \0 022  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        240 001 000 000 022 000 016 000 024 012 100 000 000 000 000 000
600003060  \0  \0  \0  \0  \0  \0  \0  \0 246 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 246 001 000 000 022 000 000 000
600003100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003120 274 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        274 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003140  \0  \0  \0  \0  \0  \0  \0  \0 320 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 320 001 000 000 022 000 000 000
600003160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003200 343 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        343 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003220  \0  \0  \0  \0  \0  \0  \0  \0 365 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 365 001 000 000 022 000 000 000
600003240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003260  \a 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        007 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003300  \0  \0  \0  \0  \0  \0  \0  \0   & 002  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 046 002 000 000 020 000 030 000
600003320 200 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003340   3 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        063 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600003360  \0  \0  \0  \0  \0  \0  \0  \0   B 002  \0  \0 021 002 017  \0
        000 000 000 000 000 000 000 000 102 002 000 000 021 002 017 000
600003400   (  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003420   O 002  \0  \0 021  \0 017  \0      \n   @  \0  \0  \0  \0  \0
        117 002 000 000 021 000 017 000 040 012 100 000 000 000 000 000
600003440 004  \0  \0  \0  \0  \0  \0  \0   ^ 002  \0  \0 022  \0  \0  \0
        004 000 000 000 000 000 000 000 136 002 000 000 022 000 000 000
600003460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003500   p 002  \0  \0 022  \0  \r  \0 240  \t   @  \0  \0  \0  \0  \0
        160 002 000 000 022 000 015 000 240 011 100 000 000 000 000 000
600003520   e  \0  \0  \0  \0  \0  \0  \0 200 002  \0  \0 022  \0  \0  \0
        145 000 000 000 000 000 000 000 200 002 000 000 022 000 000 000
600003540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003560 224 002  \0  \0 020  \0 031  \0 300 020   p  \0  \0  \0  \0  \0
        224 002 000 000 020 000 031 000 300 020 160 000 000 000 000 000
600003600  \0  \0  \0  \0  \0  \0  \0  \0 231 002  \0  \0 022  \0  \r  \0
        000 000 000 000 000 000 000 000 231 002 000 000 022 000 015 000
600003620 023  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        023 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003640 240 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003660  \0  \0  \0  \0  \0  \0  \0  \0 265 002  \0  \0 020  \0 031  \0
        000 000 000 000 000 000 000 000 265 002 000 000 020 000 031 000
600003700 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003720 301 002  \0  \0 022  \0  \r  \0      \a   @  \0  \0  \0  \0  \0
        301 002 000 000 022 000 015 000 040 007 100 000 000 000 000 000
600003740 363  \0  \0  \0  \0  \0  \0  \0 306 002  \0  \0 022  \0  \0  \0
        363 000 000 000 000 000 000 000 306 002 000 000 022 000 000 000
600003760  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600004000 330 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        330 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004020  \0  \0  \0  \0  \0  \0  \0  \0 354 002  \0  \0 021 002 030  \0
        000 000 000 000 000 000 000 000 354 002 000 000 021 002 030 000
600004040 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600004060 370 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        370 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004100  \0  \0  \0  \0  \0  \0  \0  \0 022 003  \0  \0 022  \0  \v  \0
        000 000 000 000 000 000 000 000 022 003 000 000 022 000 013 000
600004120 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600004140  \0   0   8   0   4   .   c  \0   f   i   l   e   _   w   r   i
        000 060 070 060 064 056 143 000 146 151 154 145 137 167 162 151
600004160   t   e   r  \0   b   u   f   f   e   r   .   4   7   6   1  \0
        164 145 162 000 142 165 146 146 145 162 056 064 067 066 061 000
600004200   p   i   p   e   _   f   d  \0   c   r   t   s   t   u   f   f
        160 151 160 145 137 146 144 000 143 162 164 163 164 165 146 146
600004220   .   c  \0   _   _   J   C   R   _   L   I   S   T   _   _  \0
        056 143 000 137 137 112 103 122 137 114 111 123 124 137 137 000
600004240   d   e   r   e   g   i   s   t   e   r   _   t   m   _   c   l
        144 145 162 145 147 151 163 164 145 162 137 164 155 137 143 154
600004260   o   n   e   s  \0   r   e   g   i   s   t   e   r   _   t   m
        157 156 145 163 000 162 145 147 151 163 164 145 162 137 164 155
600004300   _   c   l   o   n   e   s  \0   _   _   d   o   _   g   l   o
        137 143 154 157 156 145 163 000 137 137 144 157 137 147 154 157
600004320   b   a   l   _   d   t   o   r   s   _   a   u   x  \0   c   o
        142 141 154 137 144 164 157 162 163 137 141 165 170 000 143 157
600004340   m   p   l   e   t   e   d   .   6   3   4   4  \0   _   _   d
        155 160 154 145 164 145 144 056 066 063 064 064 000 137 137 144
600004360   o   _   g   l   o   b   a   l   _   d   t   o   r   s   _   a
        157 137 147 154 157 142 141 154 137 144 164 157 162 163 137 141
600004400   u   x   _   f   i   n   i   _   a   r   r   a   y   _   e   n
        165 170 137 146 151 156 151 137 141 162 162 141 171 137 145 156
600004420   t   r   y  \0   f   r   a   m   e   _   d   u   m   m   y  \0
        164 162 171 000 146 162 141 155 145 137 144 165 155 155 171 000
600004440   _   _   f   r   a   m   e   _   d   u   m   m   y   _   i   n
        137 137 146 162 141 155 145 137 144 165 155 155 171 137 151 156
600004460   i   t   _   a   r   r   a   y   _   e   n   t   r   y  \0   _
        151 164 137 141 162 162 141 171 137 145 156 164 162 171 000 137
600004500   _   F   R   A   M   E   _   E   N   D   _   _  \0   _   _   J
        137 106 122 101 115 105 137 105 116 104 137 137 000 137 137 112
600004520   C   R   _   E   N   D   _   _  \0   _   _   i   n   i   t   _
        103 122 137 105 116 104 137 137 000 137 137 151 156 151 164 137
600004540   a   r   r   a   y   _   e   n   d  \0   _   D   Y   N   A   M
        141 162 162 141 171 137 145 156 144 000 137 104 131 116 101 115
600004560   I   C  \0   _   _   i   n   i   t   _   a   r   r   a   y   _
        111 103 000 137 137 151 156 151 164 137 141 162 162 141 171 137
600004600   s   t   a   r   t  \0   _   G   L   O   B   A   L   _   O   F
        163 164 141 162 164 000 137 107 114 117 102 101 114 137 117 106
600004620   F   S   E   T   _   T   A   B   L   E   _  \0   _   _   l   i
        106 123 105 124 137 124 101 102 114 105 137 000 137 137 154 151
600004640   b   c   _   c   s   u   _   f   i   n   i  \0   _   I   T   M
        142 143 137 143 163 165 137 146 151 156 151 000 137 111 124 115
600004660   _   d   e   r   e   g   i   s   t   e   r   T   M   C   l   o
        137 144 145 162 145 147 151 163 164 145 162 124 115 103 154 157
600004700   n   e   T   a   b   l   e  \0   d   a   t   a   _   s   t   a
        156 145 124 141 142 154 145 000 144 141 164 141 137 163 164 141
600004720   r   t  \0   c   l   o   n   e   @   @   G   L   I   B   C   _
        162 164 000 143 154 157 156 145 100 100 107 114 111 102 103 137
600004740   2   .   2   .   5  \0   w   r   i   t   e   @   @   G   L   I
        062 056 062 056 065 000 167 162 151 164 145 100 100 107 114 111
600004760   B   C   _   2   .   2   .   5  \0   _   e   d   a   t   a  \0
        102 103 137 062 056 062 056 065 000 137 145 144 141 164 141 000
600005000   _   f   i   n   i  \0   s   n   p   r   i   n   t   f   @   @
        137 146 151 156 151 000 163 156 160 162 151 156 164 146 100 100
600005020   G   L   I   B   C   _   2   .   2   .   5  \0   m   e   m   s
        107 114 111 102 103 137 062 056 062 056 065 000 155 145 155 163
600005040   e   t   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        145 164 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005060   c   l   o   s   e   @   @   G   L   I   B   C   _   2   .   2
        143 154 157 163 145 100 100 107 114 111 102 103 137 062 056 062
600005100   .   5  \0   p   i   p   e   @   @   G   L   I   B   C   _   2
        056 065 000 160 151 160 145 100 100 107 114 111 102 103 137 062
600005120   .   2   .   5  \0   r   e   a   d   @   @   G   L   I   B   C
        056 062 056 065 000 162 145 141 144 100 100 107 114 111 102 103
600005140   _   2   .   2   .   5  \0   _   _   l   i   b   c   _   s   t
        137 062 056 062 056 065 000 137 137 154 151 142 143 137 163 164
600005160   a   r   t   _   m   a   i   n   @   @   G   L   I   B   C   _
        141 162 164 137 155 141 151 156 100 100 107 114 111 102 103 137
600005200   2   .   2   .   5  \0   _   _   d   a   t   a   _   s   t   a
        062 056 062 056 065 000 137 137 144 141 164 141 137 163 164 141
600005220   r   t  \0   _   _   g   m   o   n   _   s   t   a   r   t   _
        162 164 000 137 137 147 155 157 156 137 163 164 141 162 164 137
600005240   _  \0   _   _   d   s   o   _   h   a   n   d   l   e  \0   _
        137 000 137 137 144 163 157 137 150 141 156 144 154 145 000 137
600005260   I   O   _   s   t   d   i   n   _   u   s   e   d  \0   k   i
        111 117 137 163 164 144 151 156 137 165 163 145 144 000 153 151
600005300   l   l   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        154 154 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005320   _   _   l   i   b   c   _   c   s   u   _   i   n   i   t  \0
        137 137 154 151 142 143 137 143 163 165 137 151 156 151 164 000
600005340   m   a   l   l   o   c   @   @   G   L   I   B   C   _   2   .
        155 141 154 154 157 143 100 100 107 114 111 102 103 137 062 056
600005360   2   .   5  \0   _   e   n   d  \0   _   s   t   a   r   t  \0
        062 056 065 000 137 145 156 144 000 137 163 164 141 162 164 000
600005400   r   e   a   l   l   o   c   @   @   G   L   I   B   C   _   2
        162 145 141 154 154 157 143 100 100 107 114 111 102 103 137 062
600005420   .   2   .   5  \0   _   _   b   s   s   _   s   t   a   r   t
        056 062 056 065 000 137 137 142 163 163 137 163 164 141 162 164
600005440  \0   m   a   i   n  \0   o   p   e   n   @   @   G   L   I   B
        000 155 141 151 156 000 157 160 145 156 100 100 107 114 111 102
600005460   C   _   2   .   2   .   5  \0   _   J   v   _   R   e   g   i
        103 137 062 056 062 056 065 000 137 112 166 137 122 145 147 151
600005500   s   t   e   r   C   l   a   s   s   e   s  \0   _   _   T   M
        163 164 145 162 103 154 141 163 163 145 163 000 137 137 124 115
600005520   C   _   E   N   D   _   _  \0   _   I   T   M   _   r   e   g
        103 137 105 116 104 137 137 000 137 111 124 115 137 162 145 147
600005540   i   s   t   e   r   T   M   C   l   o   n   e   T   a   b   l
        151 163 164 145 162 124 115 103 154 157 156 145 124 141 142 154
600005560   e  \0   _   i   n   i   t  \0  \0  \0  \0  \0  \0  \0  \0  \0
        145 000 137 151 156 151 164 000 000 000 000 000 000 000 000 000
600005600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600005660  \0  \0  \0  \0  \0  \0  \0  \0 033  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 033 000 000 000 001 000 000 000
600005700 002  \0  \0  \0  \0  \0  \0  \0   8 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 070 002 100 000 000 000 000 000
600005720   8 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        070 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600005740  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600005760  \0  \0  \0  \0  \0  \0  \0  \0   #  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 043 000 000 000 007 000 000 000
600006000 002  \0  \0  \0  \0  \0  \0  \0   T 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 124 002 100 000 000 000 000 000
600006020   T 002  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        124 002 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006040  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006060  \0  \0  \0  \0  \0  \0  \0  \0   1  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 061 000 000 000 007 000 000 000
600006100 002  \0  \0  \0  \0  \0  \0  \0   t 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 164 002 100 000 000 000 000 000
600006120   t 002  \0  \0  \0  \0  \0  \0   $  \0  \0  \0  \0  \0  \0  \0
        164 002 000 000 000 000 000 000 044 000 000 000 000 000 000 000
600006140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006160  \0  \0  \0  \0  \0  \0  \0  \0   D  \0  \0  \0 366 377 377   o
        000 000 000 000 000 000 000 000 104 000 000 000 366 377 377 157
600006200 002  \0  \0  \0  \0  \0  \0  \0 230 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 230 002 100 000 000 000 000 000
600006220 230 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        230 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006240 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006260  \0  \0  \0  \0  \0  \0  \0  \0   N  \0  \0  \0  \v  \0  \0  \0
        000 000 000 000 000 000 000 000 116 000 000 000 013 000 000 000
600006300 002  \0  \0  \0  \0  \0  \0  \0 270 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 270 002 100 000 000 000 000 000
600006320 270 002  \0  \0  \0  \0  \0  \0   P 001  \0  \0  \0  \0  \0  \0
        270 002 000 000 000 000 000 000 120 001 000 000 000 000 000 000
600006340 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006360 030  \0  \0  \0  \0  \0  \0  \0   V  \0  \0  \0 003  \0  \0  \0
        030 000 000 000 000 000 000 000 126 000 000 000 003 000 000 000
600006400 002  \0  \0  \0  \0  \0  \0  \0  \b 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 010 004 100 000 000 000 000 000
600006420  \b 004  \0  \0  \0  \0  \0  \0   }  \0  \0  \0  \0  \0  \0  \0
        010 004 000 000 000 000 000 000 175 000 000 000 000 000 000 000
600006440  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600006460  \0  \0  \0  \0  \0  \0  \0  \0   ^  \0  \0  \0 377 377 377   o
        000 000 000 000 000 000 000 000 136 000 000 000 377 377 377 157
600006500 002  \0  \0  \0  \0  \0  \0  \0 206 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 206 004 100 000 000 000 000 000
600006520 206 004  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        206 004 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006540 005  \0  \0  \0  \0  \0  \0  \0 002  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 002 000 000 000 000 000 000 000
600006560 002  \0  \0  \0  \0  \0  \0  \0   k  \0  \0  \0 376 377 377   o
        002 000 000 000 000 000 000 000 153 000 000 000 376 377 377 157
600006600 002  \0  \0  \0  \0  \0  \0  \0 250 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 250 004 100 000 000 000 000 000
600006620 250 004  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        250 004 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006640 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006660  \0  \0  \0  \0  \0  \0  \0  \0   z  \0  \0  \0 004  \0  \0  \0
        000 000 000 000 000 000 000 000 172 000 000 000 004 000 000 000
600006700 002  \0  \0  \0  \0  \0  \0  \0 310 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 310 004 100 000 000 000 000 000
600006720 310 004  \0  \0  \0  \0  \0  \0 030  \0  \0  \0  \0  \0  \0  \0
        310 004 000 000 000 000 000 000 030 000 000 000 000 000 000 000
600006740 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006760 030  \0  \0  \0  \0  \0  \0  \0 204  \0  \0  \0 004  \0  \0  \0
        030 000 000 000 000 000 000 000 204 000 000 000 004 000 000 000
600007000   B  \0  \0  \0  \0  \0  \0  \0 340 004   @  \0  \0  \0  \0  \0
        102 000 000 000 000 000 000 000 340 004 100 000 000 000 000 000
600007020 340 004  \0  \0  \0  \0  \0  \0   8 001  \0  \0  \0  \0  \0  \0
        340 004 000 000 000 000 000 000 070 001 000 000 000 000 000 000
600007040 005  \0  \0  \0  \f  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 014 000 000 000 010 000 000 000 000 000 000 000
600007060 030  \0  \0  \0  \0  \0  \0  \0 216  \0  \0  \0 001  \0  \0  \0
        030 000 000 000 000 000 000 000 216 000 000 000 001 000 000 000
600007100 006  \0  \0  \0  \0  \0  \0  \0 030 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 030 006 100 000 000 000 000 000
600007120 030 006  \0  \0  \0  \0  \0  \0 032  \0  \0  \0  \0  \0  \0  \0
        030 006 000 000 000 000 000 000 032 000 000 000 000 000 000 000
600007140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007160  \0  \0  \0  \0  \0  \0  \0  \0 211  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 211 000 000 000 001 000 000 000
600007200 006  \0  \0  \0  \0  \0  \0  \0   @ 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 100 006 100 000 000 000 000 000
600007220   @ 006  \0  \0  \0  \0  \0  \0 340  \0  \0  \0  \0  \0  \0  \0
        100 006 000 000 000 000 000 000 340 000 000 000 000 000 000 000
600007240  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007260 020  \0  \0  \0  \0  \0  \0  \0 224  \0  \0  \0 001  \0  \0  \0
        020 000 000 000 000 000 000 000 224 000 000 000 001 000 000 000
600007300 006  \0  \0  \0  \0  \0  \0  \0      \a   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 040 007 100 000 000 000 000 000
600007320      \a  \0  \0  \0  \0  \0  \0 364 002  \0  \0  \0  \0  \0  \0
        040 007 000 000 000 000 000 000 364 002 000 000 000 000 000 000
600007340  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007360  \0  \0  \0  \0  \0  \0  \0  \0 232  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 232 000 000 000 001 000 000 000
600007400 006  \0  \0  \0  \0  \0  \0  \0 024  \n   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 024 012 100 000 000 000 000 000
600007420 024  \n  \0  \0  \0  \0  \0  \0  \t  \0  \0  \0  \0  \0  \0  \0
        024 012 000 000 000 000 000 000 011 000 000 000 000 000 000 000
600007440  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007460  \0  \0  \0  \0  \0  \0  \0  \0 240  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 240 000 000 000 001 000 000 000
600007500  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600010000
$ mv /tmp/file.4 /tmp/file.4.old
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.2
ERROR: 40 1 in /tmp/file.2
ERROR: 53 2 in /tmp/file.2
ERROR: 29 3 in /tmp/file.2
ERROR: 27 4 in /tmp/file.2
ERROR: 5 5 in /tmp/file.2
ERROR: 14 6 in /tmp/file.2
ERROR: 8 7 in /tmp/file.2
ERROR: 16 8 in /tmp/file.2
ERROR: 4 9 in /tmp/file.2
ERROR: 12 10 in /tmp/file.2
ERROR: 4 11 in /tmp/file.2
ERROR: 2 12 in /tmp/file.2
ERROR: 10 13 in /tmp/file.2
ERROR: 13 14 in /tmp/file.2
ERROR: 4 15 in /tmp/file.2
ERROR: 26 16 in /tmp/file.2
ERROR: 5 17 in /tmp/file.2
ERROR: 23 18 in /tmp/file.2
ERROR: 4 19 in /tmp/file.2
ERROR: 8 20 in /tmp/file.2
ERROR: 2 21 in /tmp/file.2
ERROR: 1 22 in /tmp/file.2
ERROR: 2 23 in /tmp/file.2
ERROR: 17 24 in /tmp/file.2
ERROR: 5 25 in /tmp/file.2
ERROR: 2 26 in /tmp/file.2
ERROR: 1 27 in /tmp/file.2
ERROR: 3 28 in /tmp/file.2
ERROR: 17 32 in /tmp/file.2
ERROR: 1 35 in /tmp/file.2
ERROR: 1 36 in /tmp/file.2
ERROR: 2 38 in /tmp/file.2
ERROR: 5 40 in /tmp/file.2
ERROR: 1 41 in /tmp/file.2
ERROR: 3 45 in /tmp/file.2
ERROR: 65 46 in /tmp/file.2
ERROR: 2 48 in /tmp/file.2
ERROR: 4 49 in /tmp/file.2
ERROR: 24 50 in /tmp/file.2
ERROR: 3 51 in /tmp/file.2
ERROR: 4 52 in /tmp/file.2
ERROR: 12 53 in /tmp/file.2
ERROR: 2 54 in /tmp/file.2
ERROR: 1 55 in /tmp/file.2
ERROR: 5 56 in /tmp/file.2
ERROR: 1 60 in /tmp/file.2
ERROR: 75 64 in /tmp/file.2
ERROR: 5 65 in /tmp/file.2
ERROR: 17 66 in /tmp/file.2
ERROR: 19 67 in /tmp/file.2
ERROR: 5 68 in /tmp/file.2
ERROR: 6 69 in /tmp/file.2
ERROR: 3 70 in /tmp/file.2
ERROR: 13 71 in /tmp/file.2
ERROR: 18 73 in /tmp/file.2
ERROR: 3 74 in /tmp/file.2
ERROR: 17 76 in /tmp/file.2
ERROR: 7 77 in /tmp/file.2
ERROR: 5 78 in /tmp/file.2
ERROR: 4 79 in /tmp/file.2
ERROR: 1 80 in /tmp/file.2
ERROR: 4 82 in /tmp/file.2
ERROR: 2 83 in /tmp/file.2
ERROR: 13 84 in /tmp/file.2
ERROR: 1 85 in /tmp/file.2
ERROR: 1 86 in /tmp/file.2
ERROR: 1 89 in /tmp/file.2
ERROR: 2 94 in /tmp/file.2
ERROR: 118 95 in /tmp/file.2
ERROR: 24 96 in /tmp/file.2
ERROR: 54 97 in /tmp/file.2
ERROR: 14 98 in /tmp/file.2
ERROR: 18 99 in /tmp/file.2
ERROR: 29 100 in /tmp/file.2
ERROR: 57 101 in /tmp/file.2
ERROR: 16 102 in /tmp/file.2
ERROR: 15 103 in /tmp/file.2
ERROR: 9 104 in /tmp/file.2
ERROR: 48 105 in /tmp/file.2
ERROR: 1 106 in /tmp/file.2
ERROR: 2 107 in /tmp/file.2
ERROR: 30 108 in /tmp/file.2
ERROR: 22 109 in /tmp/file.2
ERROR: 43 110 in /tmp/file.2
ERROR: 29 111 in /tmp/file.2
ERROR: 13 112 in /tmp/file.2
ERROR: 56 114 in /tmp/file.2
ERROR: 42 115 in /tmp/file.2
ERROR: 65 116 in /tmp/file.2
ERROR: 14 117 in /tmp/file.2
ERROR: 3 118 in /tmp/file.2
ERROR: 2 119 in /tmp/file.2
ERROR: 3 120 in /tmp/file.2
ERROR: 16 121 in /tmp/file.2
ERROR: 1 122 in /tmp/file.2
ERROR: 1 125 in /tmp/file.2
ERROR: 1 126 in /tmp/file.2
ERROR: 5 128 in /tmp/file.2
ERROR: 1 132 in /tmp/file.2
ERROR: 4 134 in /tmp/file.2
ERROR: 1 137 in /tmp/file.2
ERROR: 1 141 in /tmp/file.2
ERROR: 1 142 in /tmp/file.2
ERROR: 1 144 in /tmp/file.2
ERROR: 1 145 in /tmp/file.2
ERROR: 2 148 in /tmp/file.2
ERROR: 6 152 in /tmp/file.2
ERROR: 2 153 in /tmp/file.2
ERROR: 1 154 in /tmp/file.2
ERROR: 6 160 in /tmp/file.2
ERROR: 1 166 in /tmp/file.2
ERROR: 3 168 in /tmp/file.2
ERROR: 1 176 in /tmp/file.2
ERROR: 1 180 in /tmp/file.2
ERROR: 1 181 in /tmp/file.2
ERROR: 3 184 in /tmp/file.2
ERROR: 1 188 in /tmp/file.2
ERROR: 4 192 in /tmp/file.2
ERROR: 1 193 in /tmp/file.2
ERROR: 1 198 in /tmp/file.2
ERROR: 3 200 in /tmp/file.2
ERROR: 2 208 in /tmp/file.2
ERROR: 1 216 in /tmp/file.2
ERROR: 1 223 in /tmp/file.2
ERROR: 4 224 in /tmp/file.2
ERROR: 1 227 in /tmp/file.2
ERROR: 1 236 in /tmp/file.2
ERROR: 1 237 in /tmp/file.2
ERROR: 4 241 in /tmp/file.2
ERROR: 1 243 in /tmp/file.2
ERROR: 1 244 in /tmp/file.2
ERROR: 1 245 in /tmp/file.2
ERROR: 1 246 in /tmp/file.2
ERROR: 2 248 in /tmp/file.2
ERROR: 1 249 in /tmp/file.2
ERROR: 1 254 in /tmp/file.2
ERROR: 2549 0 in /tmp/file.7
ERROR: 40 1 in /tmp/file.7
ERROR: 53 2 in /tmp/file.7
ERROR: 29 3 in /tmp/file.7
ERROR: 27 4 in /tmp/file.7
ERROR: 5 5 in /tmp/file.7
ERROR: 14 6 in /tmp/file.7
ERROR: 8 7 in /tmp/file.7
ERROR: 16 8 in /tmp/file.7
ERROR: 4 9 in /tmp/file.7
ERROR: 12 10 in /tmp/file.7
ERROR: 4 11 in /tmp/file.7
ERROR: 2 12 in /tmp/file.7
ERROR: 10 13 in /tmp/file.7
ERROR: 13 14 in /tmp/file.7
ERROR: 4 15 in /tmp/file.7
ERROR: 26 16 in /tmp/file.7
ERROR: 5 17 in /tmp/file.7
ERROR: 23 18 in /tmp/file.7
ERROR: 4 19 in /tmp/file.7
ERROR: 8 20 in /tmp/file.7
ERROR: 2 21 in /tmp/file.7
ERROR: 1 22 in /tmp/file.7
ERROR: 2 23 in /tmp/file.7
ERROR: 17 24 in /tmp/file.7
ERROR: 5 25 in /tmp/file.7
ERROR: 2 26 in /tmp/file.7
ERROR: 1 27 in /tmp/file.7
ERROR: 3 28 in /tmp/file.7
ERROR: 17 32 in /tmp/file.7
ERROR: 1 35 in /tmp/file.7
ERROR: 1 36 in /tmp/file.7
ERROR: 2 38 in /tmp/file.7
ERROR: 5 40 in /tmp/file.7
ERROR: 1 41 in /tmp/file.7
ERROR: 3 45 in /tmp/file.7
ERROR: 65 46 in /tmp/file.7
ERROR: 2 48 in /tmp/file.7
ERROR: 4 49 in /tmp/file.7
ERROR: 24 50 in /tmp/file.7
ERROR: 3 51 in /tmp/file.7
ERROR: 4 52 in /tmp/file.7
ERROR: 12 53 in /tmp/file.7
ERROR: 2 54 in /tmp/file.7
ERROR: 1 55 in /tmp/file.7
ERROR: 5 56 in /tmp/file.7
ERROR: 1 60 in /tmp/file.7
ERROR: 75 64 in /tmp/file.7
ERROR: 5 65 in /tmp/file.7
ERROR: 17 66 in /tmp/file.7
ERROR: 19 67 in /tmp/file.7
ERROR: 5 68 in /tmp/file.7
ERROR: 6 69 in /tmp/file.7
ERROR: 3 70 in /tmp/file.7
ERROR: 13 71 in /tmp/file.7
ERROR: 18 73 in /tmp/file.7
ERROR: 3 74 in /tmp/file.7
ERROR: 17 76 in /tmp/file.7
ERROR: 7 77 in /tmp/file.7
ERROR: 5 78 in /tmp/file.7
ERROR: 4 79 in /tmp/file.7
ERROR: 1 80 in /tmp/file.7
ERROR: 4 82 in /tmp/file.7
ERROR: 2 83 in /tmp/file.7
ERROR: 13 84 in /tmp/file.7
ERROR: 1 85 in /tmp/file.7
ERROR: 1 86 in /tmp/file.7
ERROR: 1 89 in /tmp/file.7
ERROR: 2 94 in /tmp/file.7
ERROR: 118 95 in /tmp/file.7
ERROR: 24 96 in /tmp/file.7
ERROR: 54 97 in /tmp/file.7
ERROR: 14 98 in /tmp/file.7
ERROR: 18 99 in /tmp/file.7
ERROR: 29 100 in /tmp/file.7
ERROR: 57 101 in /tmp/file.7
ERROR: 16 102 in /tmp/file.7
ERROR: 15 103 in /tmp/file.7
ERROR: 9 104 in /tmp/file.7
ERROR: 48 105 in /tmp/file.7
ERROR: 1 106 in /tmp/file.7
ERROR: 2 107 in /tmp/file.7
ERROR: 30 108 in /tmp/file.7
ERROR: 22 109 in /tmp/file.7
ERROR: 43 110 in /tmp/file.7
ERROR: 29 111 in /tmp/file.7
ERROR: 13 112 in /tmp/file.7
ERROR: 56 114 in /tmp/file.7
ERROR: 42 115 in /tmp/file.7
ERROR: 65 116 in /tmp/file.7
ERROR: 14 117 in /tmp/file.7
ERROR: 3 118 in /tmp/file.7
ERROR: 2 119 in /tmp/file.7
ERROR: 3 120 in /tmp/file.7
ERROR: 16 121 in /tmp/file.7
ERROR: 1 122 in /tmp/file.7
ERROR: 1 125 in /tmp/file.7
ERROR: 1 126 in /tmp/file.7
ERROR: 5 128 in /tmp/file.7
ERROR: 1 132 in /tmp/file.7
ERROR: 4 134 in /tmp/file.7
ERROR: 1 137 in /tmp/file.7
ERROR: 1 141 in /tmp/file.7
ERROR: 1 142 in /tmp/file.7
ERROR: 1 144 in /tmp/file.7
ERROR: 1 145 in /tmp/file.7
ERROR: 2 148 in /tmp/file.7
ERROR: 6 152 in /tmp/file.7
ERROR: 2 153 in /tmp/file.7
ERROR: 1 154 in /tmp/file.7
ERROR: 6 160 in /tmp/file.7
ERROR: 1 166 in /tmp/file.7
ERROR: 3 168 in /tmp/file.7
ERROR: 1 176 in /tmp/file.7
ERROR: 1 180 in /tmp/file.7
ERROR: 1 181 in /tmp/file.7
ERROR: 3 184 in /tmp/file.7
ERROR: 1 188 in /tmp/file.7
ERROR: 4 192 in /tmp/file.7
ERROR: 1 193 in /tmp/file.7
ERROR: 1 198 in /tmp/file.7
ERROR: 3 200 in /tmp/file.7
ERROR: 2 208 in /tmp/file.7
ERROR: 1 216 in /tmp/file.7
ERROR: 1 223 in /tmp/file.7
ERROR: 4 224 in /tmp/file.7
ERROR: 1 227 in /tmp/file.7
ERROR: 1 236 in /tmp/file.7
ERROR: 1 237 in /tmp/file.7
ERROR: 4 241 in /tmp/file.7
ERROR: 1 243 in /tmp/file.7
ERROR: 1 244 in /tmp/file.7
ERROR: 1 245 in /tmp/file.7
ERROR: 1 246 in /tmp/file.7
ERROR: 2 248 in /tmp/file.7
ERROR: 1 249 in /tmp/file.7
ERROR: 1 254 in /tmp/file.7

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  7:54         ` Tetsuo Handa
@ 2017-08-11 12:08           ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-11 12:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > +/*
> > > > + * Checks whether a page fault on the given mm is still reliable.
> > > > + * This is no longer true if the oom reaper started to reap the
> > > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > > + * the mm. At that moment any !shared mapping would lose the content
> > > > + * and could cause a memory corruption (zero pages instead of the
> > > > + * original content).
> > > > + *
> > > > + * User should call this before establishing a page table entry for
> > > > + * a !shared mapping and under the proper page table lock.
> > > > + *
> > > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > > + */
> > > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > > +{
> > > > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > > +		return VM_FAULT_SIGBUS;
> > > > +	return 0;
> > > > +}
> > > > +
> > > 
> > > Will you explain the mechanism why random values are written instead of zeros
> > > so that this patch can actually fix the race problem?
> > 
> > I am not sure what you mean here. Were you able to see a write with an
> > unexpected content?
> 
> Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .

Ahh, I've missed that random part of your output. That is really strange
because AFAICS the oom reaper shouldn't really interact here. We are
only unmapping anonymous memory and even if a refault slips through we
should always get zeros.

Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
get any uninitialized data from a file by missing CoWed content. The
only possible explanations would be that a page fault returned
non-zero data, which would be a bug on its own, or that a file write
extended the file without actually writing to it, which smells like a fs
bug to me.

Anyway, I wasn't able to reproduce this and I was running your use case
in a loop for quite some time (with xfs storage). How reproducible
is this? If you can reproduce it easily, can you simply comment out
unmap_page_range in __oom_reap_task_mm and see if that makes any change,
just to be sure that the oom reaper can be ruled out?
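
For readers of the archive, the experiment being asked for amounts to
something like the sketch below (loosely based on the ~4.13-era
mm/oom_kill.c; locking and vma filtering are elided and the details
differ between trees, so treat it as illustrative only rather than a
literal patch):

        static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
        {
                struct mmu_gather tlb;
                struct vm_area_struct *vma;

                /* mmap_sem locking and MMF_UNSTABLE handling elided */
                tlb_gather_mmu(&tlb, mm, 0, -1);
                for (vma = mm->mmap; vma; vma = vma->vm_next) {
                        /* the real code also skips shared, hugetlb and PFN mappings */
                        /*
                         * Debugging aid only: keep everything else intact but skip
                         * the actual unmapping so that the oom reaper can be ruled
                         * in or out.
                         *
                         * unmap_page_range(&tlb, vma, vma->vm_start,
                         *                  vma->vm_end, NULL);
                         */
                }
                tlb_finish_mmu(&tlb, 0, -1);

                return true;
        }
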
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 12:08           ` Michal Hocko
@ 2017-08-11 15:46             ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-11 15:46 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > Will you explain the mechanism why random values are written instead of zeros
> > > > so that this patch can actually fix the race problem?
> > > 
> > > I am not sure what you mean here. Were you able to see a write with an
> > > unexpected content?
> > 
> > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> 
> Ahh, I've missed that random part of your output. That is really strange
> because AFAICS the oom reaper shouldn't really interact here. We are
> only unmapping anonymous memory and even if a refault slips through we
> should always get zeros.
> 
> Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> get any uninitialized data from a file by missing CoWed content. The
> only possible explanations would be that a page fault returned a
> non-zero data which would be a bug on its own or that a file write
> extend the file without actually writing to it which smells like a fs
> bug to me.

As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
I don't think it is a fs bug.

> 
> Anyway I wasn't able to reproduce this and I was running your usecase
> in the loop for quite some time (with xfs storage). How reproducible
> is this? If you can reproduce easily can you simply comment out
> unmap_page_range in __oom_reap_task_mm and see if that makes any change
> just to be sure that the oom reaper can be ruled out?

The frequency of writing non-zero values is lower than that of writing zero values.
But if I comment out unmap_page_range() in __oom_reap_task_mm(), I can't even
reproduce the writing of zero values. As far as I have tested, writing non-zero
values occurs only if the OOM reaper is involved.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 15:46             ` Tetsuo Handa
@ 2017-08-14 13:59               ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-14 13:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Sat 12-08-17 00:46:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > > Will you explain the mechanism why random values are written instead of zeros
> > > > > so that this patch can actually fix the race problem?
> > > > 
> > > > I am not sure what you mean here. Were you able to see a write with an
> > > > unexpected content?
> > > 
> > > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> > 
> > Ahh, I've missed that random part of your output. That is really strange
> > because AFAICS the oom reaper shouldn't really interact here. We are
> > only unmapping anonymous memory and even if a refault slips through we
> > should always get zeros.
> > 
> > Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> > get any uninitialized data from a file by missing CoWed content. The
> > only possible explanations would be that a page fault returned a
> > non-zero data which would be a bug on its own or that a file write
> > extend the file without actually writing to it which smells like a fs
> > bug to me.
> 
> As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
> I don't think it is a fs bug.

Were you able to reproduce with other filesystems? I wonder what is
different in my testing because I cannot reproduce this at all. Well, I
had to reduce the number of competing writer threads to 128 because I
quickly hit thrashing behavior with more of them (and 4 CPUs). I will
try on a larger machine.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-14 13:59               ` Michal Hocko
  (?)
@ 2017-08-14 22:51               ` Tetsuo Handa
  2017-08-15  6:55                   ` Michal Hocko
  2017-08-15  8:41                   ` Michal Hocko
  -1 siblings, 2 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-14 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, andrea, kirill, oleg, wenwei.tww, linux-mm,
	linux-kernel

Michal Hocko wrote:
> On Sat 12-08-17 00:46:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > > > Will you explain the mechanism why random values are written instead of zeros
> > > > > > so that this patch can actually fix the race problem?
> > > > > 
> > > > > I am not sure what you mean here. Were you able to see a write with an
> > > > > unexpected content?
> > > > 
> > > > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> > > 
> > > Ahh, I've missed that random part of your output. That is really strange
> > > because AFAICS the oom reaper shouldn't really interact here. We are
> > > only unmapping anonymous memory and even if a refault slips through we
> > > should always get zeros.
> > > 
> > > Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> > > get any uninitialized data from a file by missing CoWed content. The
> > > only possible explanations would be that a page fault returned a
> > > non-zero data which would be a bug on its own or that a file write
> > > extend the file without actually writing to it which smells like a fs
> > > bug to me.
> > 
> > As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
> > I don't think it is a fs bug.
> 
> Were you able to reproduce with other filesystems?

Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
on Oracle VM VirtualBox on Windows.

I believe that this is not old data from the disk, because I can reproduce this
problem using a newly attached /dev/sdb which has never had any data written to it
(other than data written by mkfs.xfs and mkfs.ext4).

  /dev/sdb /tmp ext4 rw,seclabel,relatime,data=ordered 0 0
  
The garbage pattern (the last 4096 bytes) is identical for both xfs and ext4.
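
A minimal checker in this spirit might look like the sketch below. It is
not the actual test harness (that one is only linked in this thread) and
it assumes the writers fill the file with 0xFF; it reports, per 4096-byte
block, how many bytes hold anything else, loosely mirroring the
"ERROR: <count> <block> in <file>" lines quoted earlier (the meaning of
the two numbers in the original output is an assumption here):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        #define BLK 4096

        int main(int argc, char *argv[])
        {
                unsigned char buf[BLK];
                ssize_t len;
                long blk = 0;
                int fd;

                if (argc != 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }
                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                while ((len = read(fd, buf, sizeof(buf))) > 0) {
                        long bad = 0;

                        for (ssize_t i = 0; i < len; i++)
                                if (buf[i] != 0xFF)     /* assumed fill pattern */
                                        bad++;
                        if (bad)
                                printf("ERROR: %ld %ld in %s\n", bad, blk, argv[1]);
                        blk++;
                }
                close(fd);
                return 0;
        }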

>                                                    I wonder what is
> different in my testing because I cannot reproduce this at all. Well, I
> had to reduce the number of competing writer threads to 128 because I
> quickly hit the trashing behavior with more of them (and 4 CPUs). I will
> try on a larger machine.

I don't think a larger machine is necessary.
I can reproduce this problem with 8 competing writer threads on 4 CPUs.
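
For context, a rough, hypothetical sketch of the kind of workload being
described (this is not the actual reproducer, which is only referenced by
URL in this thread): eight appending writer threads racing with an
anonymous memory hog that pushes the machine into an OOM kill. Build with
-pthread; the file names and the mapping size are made up for
illustration:

        #include <fcntl.h>
        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define NR_WRITERS 8
        #define BLK 4096

        static void *writer(void *arg)
        {
                char name[64], buf[BLK];
                int fd;

                memset(buf, 0xFF, sizeof(buf));
                snprintf(name, sizeof(name), "/tmp/file.%ld", (long)arg);
                fd = open(name, O_WRONLY | O_CREAT | O_APPEND, 0600);
                if (fd < 0)
                        return NULL;
                for (;;)                        /* keep appending until killed */
                        if (write(fd, buf, sizeof(buf)) < 0)
                                break;
                close(fd);
                return NULL;
        }

        int main(void)
        {
                pthread_t th[NR_WRITERS];
                size_t size = (size_t)8 << 30;  /* assumed to exceed RAM + swap */
                char *p;
                long i;

                for (i = 0; i < NR_WRITERS; i++)
                        pthread_create(&th[i], NULL, writer, (void *)i);

                p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                if (p == MAP_FAILED)
                        return 1;
                /* fault everything in until the OOM killer fires */
                for (size_t off = 0; off < size; off += BLK)
                        p[off] = 1;
                return 0;
        }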

I don't have a native Linux environment. Maybe that is the difference.
Can you try a VMware Workstation Player or Oracle VM VirtualBox environment?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
  2017-08-07 11:38   ` Michal Hocko
@ 2017-08-15  0:49     ` David Rientjes
  -1 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2017-08-15  0:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Argangeli, Kirill A. Shutemov,
	Tetsuo Handa, Oleg Nesterov, Wenwei Tao, linux-mm, LKML,
	Michal Hocko

On Mon, 7 Aug 2017, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
> handle_mm_fault causes a lockdep splat
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] ------------[ cut here ]------------
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
> [   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
> [   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
> [   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
> [   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
> [   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
> [   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
> [   58.657280] Call Trace:
> [   58.659989]  up_read+0x1a/0x40
> [   58.662825]  __do_page_fault+0x28e/0x4c0
> [   58.665946]  do_page_fault+0x30/0x80
> [   58.668911]  page_fault+0x28/0x30
> 
> The reason is that the page fault path might have dropped the mmap_sem
> and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites
> the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
> that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
> the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS to the
> existing error code because all arch specific page fault handlers and
> g-u-p would have to learn a new error code combination.
> 
> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
> Cc: stable # 4.9+
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-14 13:59               ` Michal Hocko
  (?)
  (?)
@ 2017-08-15  5:30               ` Tetsuo Handa
  -1 siblings, 0 replies; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-15  5:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Tetsuo Handa, akpm, andrea, kirill, oleg,
	wenwei.tww, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 12-08-17 00:46:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > > > > Will you explain the mechanism why random values are written instead of zeros
> > > > > > > so that this patch can actually fix the race problem?
> > > > > > 
> > > > > > I am not sure what you mean here. Were you able to see a write with an
> > > > > > unexpected content?
> > > > > 
> > > > > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> > > > 
> > > > Ahh, I've missed that random part of your output. That is really strange
> > > > because AFAICS the oom reaper shouldn't really interact here. We are
> > > > only unmapping anonymous memory and even if a refault slips through we
> > > > should always get zeros.
> > > > 
> > > > Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> > > > get any uninitialized data from a file by missing CoWed content. The
> > > > only possible explanations would be that a page fault returned a
> > > > non-zero data which would be a bug on its own or that a file write
> > > > extend the file without actually writing to it which smells like a fs
> > > > bug to me.
> > > 
> > > As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
> > > I don't think it is a fs bug.
> > 
> > Were you able to reproduce with other filesystems?
> 
> Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> on Oracle VM VirtualBox on Windows.
> 
> I believe that this is not old data from disk, for I can reproduce this problem
> using newly attached /dev/sdb which has never written any data (other than data
> written by mkfs.xfs and mkfs.ext4).
> 
>   /dev/sdb /tmp ext4 rw,seclabel,relatime,data=ordered 0 0
>   
> The garbage pattern (the last 4096 bytes) is identical for both xfs and ext4.

I can reproduce this problem very easily using btrfs on 4.11.11-200.fc25.x86_64
on Oracle VM VirtualBox on Windows.

  /dev/sdb /tmp btrfs rw,seclabel,relatime,space_cache,subvolid=5,subvol=/ 0 0

The garbage pattern is identical for all xfs/ext4/btrfs.
The more complicated things a fs does, the more likely it is to hit this problem?
I tried ntfs, but so far I am not able to reproduce this problem.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-14 22:51               ` Tetsuo Handa
@ 2017-08-15  6:55                   ` Michal Hocko
  2017-08-15  8:41                   ` Michal Hocko
  1 sibling, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-15  6:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Tue 15-08-17 07:51:02, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Were you able to reproduce with other filesystems?
> 
> Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> on Oracle VM VirtualBox on Windows.
> 
> I believe that this is not old data from disk, for I can reproduce this problem
> using newly attached /dev/sdb which has never written any data (other than data
> written by mkfs.xfs and mkfs.ext4).
> 
>   /dev/sdb /tmp ext4 rw,seclabel,relatime,data=ordered 0 0
>   
> The garbage pattern (the last 4096 bytes) is identical for both xfs and ext4.

Thanks a lot for retesting. It is now obvious that the FS doesn't have
anything to do with this issue, which is in line with my investigation
from yesterday and Friday. I simply cannot see any way the file position
would be updated by a zero-length write. So this must be something
else. I have double checked the MM side of the page fault and I couldn't
find anything there either, so this smells like a stray pte left behind
while the underlying page got reused, or something TLB related.
 
> >                                                    I wonder what is
> > different in my testing because I cannot reproduce this at all. Well, I
> > had to reduce the number of competing writer threads to 128 because I
> > quickly hit the trashing behavior with more of them (and 4 CPUs). I will
> > try on a larger machine.
> 
> I don't think a larger machine is necessary.
> I can reproduce this problem with 8 competing writer threads on 4 CPUs.

OK, I will try with fewer writers, which should make it easier to have it
run for a long time without any thrashing.
 
> I don't have native Linux environment. Maybe that is the difference.
> Can you try VMware Workstation Player or Oracle VM VirtualBox environment?

Hmm, I do not have any of those handy for use, unfortunately. I will
keep focusing on the native HW and KVM for today.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-14 22:51               ` Tetsuo Handa
@ 2017-08-15  8:41                   ` Michal Hocko
  2017-08-15  8:41                   ` Michal Hocko
  1 sibling, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-15  8:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Tue 15-08-17 07:51:02, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Were you able to reproduce with other filesystems?
> 
> Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> on Oracle VM VirtualBox on Windows.

Just a quick question.
http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp
mentioned a next-20170811 kernel and this one mentions 4.11. Your original report
as a reply to this thread
http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
mentioned next-20170728. None of them seem to have this fix
http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org so let
me ask again. Have you seen unexpected content written with that
patch applied?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-15  8:41                   ` Michal Hocko
  (?)
@ 2017-08-15 10:06                   ` Tetsuo Handa
  2017-08-15 12:26                       ` Michal Hocko
  -1 siblings, 1 reply; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-15 10:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 15-08-17 07:51:02, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Were you able to reproduce with other filesystems?
> > 
> > Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> > on Oracle VM VirtualBox on Windows.
> 
> Just a quick question.
> http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp
> mentioned next-20170811 kernel and this one 4.11. Your original report
> as a reply to this thread
> http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
> mentioned next-20170728. None of them seem to have this fix
> http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org so let
> me ask again. Have you seen an unexpected content written with that
> patch applied?

No. All of the non-zero, non-0xFF values were observed without that patch applied.
I want to confirm that that patch actually fixes the non-zero, non-0xFF values
(so that we can have a better patch description for that patch).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-15 10:06                   ` Tetsuo Handa
@ 2017-08-15 12:26                       ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-15 12:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Tue 15-08-17 19:06:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 15-08-17 07:51:02, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > Were you able to reproduce with other filesystems?
> > > 
> > > Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> > > on Oracle VM VirtualBox on Windows.
> > 
> > Just a quick question.
> > http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp
> > mentioned next-20170811 kernel and this one 4.11. Your original report
> > as a reply to this thread
> > http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
> > mentioned next-20170728. None of them seem to have this fix
> > http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org so let
> > me ask again. Have you seen an unexpected content written with that
> > patch applied?
> 
> No. All non-zero non-0xFF values are without that patch applied.
> I want to confirm that that patch actually fixes non-zero non-0xFF values
> (so that we can have better patch description for that patch).

OK, so I have clearly misunderstood you. I thought that you could still
see corrupted content with the patch _applied_. Now I see why I couldn't
reproduce this...

Now I also understand what you meant when asking for an explanation. I
can only speculate how we could end up with the non-zero page content
previously, but the closest match would be that the page got unmapped and
reused by a different path and a stale tlb entry would leak the content.
Such a thing would happen if we freed the page _before_ we flushed the
tlb during unmap.

Considering that the oom_reaper relies on unmap_page_range, which seems
to be doing the right thing wrt. flushing vs. freeing ordering (enforced
by the tlb_gather), I am wondering what else could go wrong, but I vaguely
remember there were some races between THP and MADV_DONTNEED in the
past. Maybe we have hit an incarnation of something like that. Anyway,
the oom_reaper doesn't try to be clever and it only calls
unmap_page_range, which should be safe from that context.
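
Schematically, the ordering being referred to is (with the ~4.13-era
function signatures, heavily simplified; an illustration of the contract
rather than a copy of the real code):

        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, mm, start, end);
        /* clears the ptes and queues the underlying pages on the gather */
        unmap_page_range(&tlb, vma, start, end, NULL);
        /* flushes the TLBs first and only then frees the queued pages */
        tlb_finish_mmu(&tlb, start, end);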

The primary bug here was that we allowed refaults of unmapped memory,
and that should be fixed by the patch AFAICS. If there are more issues
we should definitely track those down, but those should be oom_reaper
independent because we really do not do anything special here.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] mm, oom: fix oom_reaper fallouts
  2017-08-07 11:38 ` Michal Hocko
@ 2017-08-15 12:29   ` Michal Hocko
  -1 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-15 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Argangeli, Kirill A. Shutemov, Tetsuo Handa,
	Oleg Nesterov, Wenwei Tao, linux-mm, LKML

On Mon 07-08-17 13:38:37, Michal Hocko wrote:
> Hi,
> there are two issues this patch series attempts to fix. First one is
> something that has been broken since MMF_UNSTABLE flag introduction
> and I guess we should backport it stable trees (patch 1). The other
> issue has been brought up by Wenwei Tao and Tetsuo Handa has created
> a test case to trigger it very reliably. I am not yet sure this is a
> stable material because the test case is rather artificial. If there is
> a demand for the stable backport I will prepare it, of course, though.
> 
> I hope I've done the second patch correctly but I would definitely
> appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
> previous attempt with some more context was posted here
> http://lkml.kernel.org/r/20170803135902.31977-1-mhocko@kernel.org
> 
> My testing didn't show anything unusual with these two applied on top of
> the mmotm tree.

Unless anybody objects, can we have this merged? Whether to push this to
the stable tree is still questionable because it requires a rather
artificial workload to trigger the issue, but if others think it would be
better to have it backported I will prepare backports for all relevant
stable trees.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-15 12:26                       ` Michal Hocko
  (?)
@ 2017-08-15 12:58                       ` Tetsuo Handa
  2017-08-17 13:58                           ` Michal Hocko
  -1 siblings, 1 reply; 58+ messages in thread
From: Tetsuo Handa @ 2017-08-15 12:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

> On Tue 15-08-17 19:06:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 15-08-17 07:51:02, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > [...]
> > > > > Were you able to reproduce with other filesystems?
> > > > 
> > > > Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> > > > on Oracle VM VirtualBox on Windows.
> > > 
> > > Just a quick question.
> > > http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp
> > > mentioned next-20170811 kernel and this one 4.11. Your original report
> > > as a reply to this thread
> > > http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
> > > mentioned next-20170728. None of them seem to have this fix
> > > http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org so let
> > > me ask again. Have you seen an unexpected content written with that
> > > patch applied?
> > 
> > No. All non-zero non-0xFF values are without that patch applied.
> > I want to confirm that that patch actually fixes non-zero non-0xFF values
> > (so that we can have better patch description for that patch).
> 
> OK, so I have clearly misunderstood you. I thought that you can still
> see corrupted content with the patch _applied_. Now I see why I couldn't
> reproduce this...

If I apply this patch, I can no longer reproduce this problem even with btrfs.

-+ * and could cause a memory corruption (zero pages instead of the
-+ * original content).
++ * and could cause a memory corruption (random content instead of the
++ * original content).

Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

> 
> Now I also understand what you meant when asking for an explanation. I
> can only speculate how we could end up with the non-zero page previously
> but the closest match would be that the page got unmapped and reused by
> a different path and a stalled tlb entry would leak the content. Such a
> thing would happen if we freed the page _before_ we flushed the tlb
> during unmap.
> 
> Considering that oom_reaper is relying on unmap_page_range which seems
> to be doing the right thing wrt. flushing vs. freeing ordering (enforced
> by the tlb_gather) I am wondering what else could go wrong but I vaguely
> remember there were some races between THP and MADV_DONTNEED in the
> past. Maybe we have hit an incarnation of something like that. Anyway
> the oom_reaper doesn't try to be clever and it only calls to
> unmap_page_range which should be safe from that context.
> 
> The primary bug here was that we allowed to refault an unmmaped memory
> and that should be fixed by the patch AFAICS. If there are more issues
> we should definitely track those down but those should be oom_reaper
> independent because we really do not do anything special here.
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-15 12:58                       ` Tetsuo Handa
@ 2017-08-17 13:58                           ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-17 13:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Tue 15-08-17 21:58:29, Tetsuo Handa wrote:
[...]
> If I apply this patch, I can no longer reproduce this problem even with btrfs.
> 
> -+ * and could cause a memory corruption (zero pages instead of the
> -+ * original content).
> ++ * and could cause a memory corruption (random content instead of the
> ++ * original content).

If anything, then I would word it this way:

and could cause memory corruption (zero pages for refaults, but even
random content has been observed, though never explained properly)

> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:33 ` [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS Michal Hocko
@ 2017-08-04  8:33     ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never does any visible changes after it dies is not
entirely true. __task_will_free_mem considers a task dying when
SIGNAL_GROUP_EXIT is set, but do_group_exit sends SIGKILL to all threads
_after_ the flag is set. So there is a race window when some threads
won't have fatal_signal_pending set while the oom_reaper could start
unmapping the address space. generic_perform_write could then write a
zero page to the page cache and corrupt data.

The race window is rather small and close to impossible to hit, but it
would be better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfaulting on any page fault after the oom reaper has started its
work. This means that nobody will ever observe potentially corrupted
content. Formerly we cared only about use_mm users because those can
outlive the oom victim quite easily, but having the process itself
protected sounds like a reasonable thing to do as well.

There doesn't seem to be any real-life bug report, so this is merely a
fix for a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4fe5b6254688..e7308e633b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,15 +3874,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-
 		/*
 		 * We are going to enforce SIGBUS but the PF path might have
 		 * dropped the mmap_sem already so take it again so that
-- 
2.13.2

^ permalink raw reply related	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2017-08-17 13:58 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-07 11:38 [PATCH 0/2] mm, oom: fix oom_reaper fallouts Michal Hocko
2017-08-07 11:38 ` Michal Hocko
2017-08-07 11:38 ` [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS Michal Hocko
2017-08-07 11:38   ` Michal Hocko
2017-08-15  0:49   ` David Rientjes
2017-08-15  0:49     ` David Rientjes
2017-08-07 11:38 ` [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-07 11:38   ` Michal Hocko
2017-08-08 17:48   ` Andrea Arcangeli
2017-08-08 17:48     ` Andrea Arcangeli
2017-08-08 23:35     ` Tetsuo Handa
2017-08-08 23:35       ` Tetsuo Handa
2017-08-09 18:36       ` Andrea Arcangeli
2017-08-09 18:36         ` Andrea Arcangeli
2017-08-10  8:21     ` Michal Hocko
2017-08-10  8:21       ` Michal Hocko
2017-08-10 13:33       ` Michal Hocko
2017-08-10 13:33         ` Michal Hocko
2017-08-11  2:28   ` Tetsuo Handa
2017-08-11  2:28     ` Tetsuo Handa
2017-08-11  7:09     ` Michal Hocko
2017-08-11  7:09       ` Michal Hocko
2017-08-11  7:54       ` Tetsuo Handa
2017-08-11  7:54         ` Tetsuo Handa
2017-08-11 10:22         ` Andrea Arcangeli
2017-08-11 10:22           ` Andrea Arcangeli
2017-08-11 10:42           ` Andrea Arcangeli
2017-08-11 10:42             ` Andrea Arcangeli
2017-08-11 11:53             ` Tetsuo Handa
2017-08-11 11:53               ` Tetsuo Handa
2017-08-11 12:08         ` Michal Hocko
2017-08-11 12:08           ` Michal Hocko
2017-08-11 15:46           ` Tetsuo Handa
2017-08-11 15:46             ` Tetsuo Handa
2017-08-14 13:59             ` Michal Hocko
2017-08-14 13:59               ` Michal Hocko
2017-08-14 22:51               ` Tetsuo Handa
2017-08-15  6:55                 ` Michal Hocko
2017-08-15  6:55                   ` Michal Hocko
2017-08-15  8:41                 ` Michal Hocko
2017-08-15  8:41                   ` Michal Hocko
2017-08-15 10:06                   ` Tetsuo Handa
2017-08-15 12:26                     ` Michal Hocko
2017-08-15 12:26                       ` Michal Hocko
2017-08-15 12:58                       ` Tetsuo Handa
2017-08-17 13:58                         ` Michal Hocko
2017-08-17 13:58                           ` Michal Hocko
2017-08-15  5:30               ` Tetsuo Handa
2017-08-07 13:28 ` [PATCH 0/2] mm, oom: fix oom_reaper fallouts Tetsuo Handa
2017-08-07 13:28   ` Tetsuo Handa
2017-08-07 14:04   ` Michal Hocko
2017-08-07 14:04     ` Michal Hocko
2017-08-07 15:23     ` Tetsuo Handa
2017-08-07 15:23       ` Tetsuo Handa
2017-08-15 12:29 ` Michal Hocko
2017-08-15 12:29   ` Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2017-08-04  8:32 Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-04  8:33 ` [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS Michal Hocko
2017-08-04  8:33   ` [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-04  8:33     ` Michal Hocko
