* [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-03 13:59 ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-03 13:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Wenwei Tao, Oleg Nesterov, Tetsuo Handa,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and will not make any further visible changes is not entirely
true. __task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT
is set but do_group_exit sends SIGKILL to all threads _after_ the flag
is set. So there is a race window during which some threads do not yet
have fatal_signal_pending set while the oom_reaper could already start
unmapping the address space. generic_perform_write could then write a
zero page to the page cache and corrupt data.
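
A simplified sequence of events (function names refer to the mainline
sources; the exact interleaving is only illustrative):

  1) a thread of the victim enters do_group_exit() and sets
     SIGNAL_GROUP_EXIT
  2) the oom killer sees task_will_free_mem() == true, so it only marks
     the victim and calls wake_oom_reaper() without sending SIGKILL
  3) the oom_reaper unmaps the victim's anonymous memory
  4) a sibling thread, which doesn't have fatal_signal_pending yet, is
     inside generic_perform_write(); the refault of its already reaped
     user buffer gives a zero page and the zeroes are copied into the
     page cache
  5) only then does zap_other_threads() make SIGKILL pending for the
     sibling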

The race window is rather small and close to impossible to hit but it
is better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfaulting on any page fault after the oom reaper has started its
work. This means that nobody will ever observe potentially corrupted
content. Formerly we cared only about use_mm users because those can
easily outlive the oom victim, but having the process itself protected
sounds like a reasonable thing to do as well.

There doesn't seem to be any real life bug report, so this is merely a
fix for a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
Wenwei has contacted me off list and this is a result of that discussion.
I do not think this is serious enough to warrant a stable backport even
though the description might sound scary. The race is highly unlikely.

 mm/memory.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..3d8bfeaca38a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,13 +3874,9 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
 		ret = VM_FAULT_SIGBUS;
 
-- 
2.13.2

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-03 13:59 ` Michal Hocko
  (?)
@ 2017-08-04  6:46 ` Tetsuo Handa
  2017-08-04  7:42     ` Michal Hocko
  -1 siblings, 1 reply; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-04  6:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, Tetsuo Handa,
	David Rientjes, LKML, Michal Hocko

Michal Hocko wrote:
>                          So there is a race window when some threads
> won't have fatal_signal_pending while the oom_reaper could start
> unmapping the address space. generic_perform_write could then write
> zero page to the page cache and corrupt data.

Oh, simple generic_perform_write() ?

> 
> The race window is rather small and close to impossible to happen but it
> would be better to have it covered.

OK, I confirmed that this problem is easily reproducible using the reproducer below.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>

#define NUMTHREADS 512
#define STACKSIZE 8192

static int pipe_fd[2] = { EOF, EOF };
/* each writer appends 0xFF-filled pages to its own /tmp/file.<id> */
static int file_writer(void *i)
{
	char buffer[4096] = { };
	int fd;
	snprintf(buffer, sizeof(buffer), "/tmp/file.%lu", (unsigned long) i);
	fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	memset(buffer, 0xFF, sizeof(buffer));
	read(pipe_fd[0], buffer, 1); /* wait for main() to close the write end */
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

int main(int argc, char *argv[])
{
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	char *stack;
	if (pipe(pipe_fd))
		return 1;
	stack = malloc(STACKSIZE * NUMTHREADS);
	/* grab as much anonymous memory as the kernel will overcommit */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/* start the writer threads, all sharing this mm */
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(file_writer, stack + (i + 1) * STACKSIZE,
			  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS |
			  CLONE_FILES, (void *) i) == -1)
			break;
	close(pipe_fd[1]); /* releases the writers blocked in read() */
	/* Will cause OOM due to overcommit; if not use SysRq-f */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	kill(-1, SIGKILL);
	return 0;
}
----------
$ cat /tmp/file.* | od -b | head
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
307730000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
307740000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
316600000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
316610000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
----------

Applying your patch seems to avoid this problem, but as far as I tested,
your patch seems to trivially trigger a lock related problem.
Is your patch really safe?

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170804.txt.xz 
and config is at http://I-love.SAKURA.ne.jp/tmp/config-20170804 .

----------
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30
[   58.671629] RIP: 0033:0x40092f
[   58.674221] RSP: 002b:00007fb931f99ff0 EFLAGS: 00010217
[   58.677556] RAX: 0000000000001000 RBX: 00007fb931f99ff0 RCX: 00007fb93224ec90
[   58.681489] RDX: 0000000000001000 RSI: 00007fb931f99ff0 RDI: 0000000000000117
[   58.685297] RBP: 0000000000000117 R08: 00007fb9321ae938 R09: 000000000000000d
[   58.689123] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000100000000
[   58.692879] R13: 00007fb731f59010 R14: 0000000000000000 R15: 00007fb731f59010
[   58.696588] Code: 5e 41 5f 5d c3 e8 df a7 26 00 85 c0 74 1f 8b 35 2d 2f df 01 85 f6 75 15 48 c7 c6 66 7c a0 a3 48 c7 c7 5b 41 a0 a3 e8 6a 14 01 00 <0f> ff 4c 89 fa 4c 89 ee 48 89 df e8 fe c8 ff ff eb 88 48 c7 c7 
[   58.705635] ---[ end trace 91ff0f99e79ee485 ]---
[   58.831028] oom_reaper: reaped process 1056 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
----------

----------
[  187.202689] Out of memory: Kill process 2113 (a.out) score 734 or sacrifice child
[  187.208024] Killed process 2113 (a.out) total-vm:4268108kB, anon-rss:2735276kB, file-rss:0kB, shmem-rss:0kB
[  187.463902] oom_reaper: reaped process 2113 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  188.249973] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  188.249983] ------------[ cut here ]------------
[  188.257247] WARNING: CPU: 7 PID: 2313 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[  188.263282] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp pcspkr vmw_balloon ppdev i2c_piix4 vmw_vmci shpchp sg parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi ahci mptscsih ata_piix libahci e1000 mptbase libata
[  188.295888] CPU: 7 PID: 2313 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[  188.300975] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  188.307049] task: ffff8c4433840040 task.stack: ffff9459c660c000
[  188.311510] RIP: 0010:lock_release+0x172/0x1e0
[  188.315530] RSP: 0000:ffff9459c660fe58 EFLAGS: 00010082
[  188.319895] RAX: 000000000000001f RBX: ffff8c4433840040 RCX: 0000000000000000
[  188.324908] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff810d4ba4
[  188.329894] RBP: ffff9459c660fe98 R08: 0000000000000000 R09: 0000000000000001
[  188.334724] R10: 0000000000000000 R11: 000000000000001f R12: ffff9459c660ff58
[  188.339707] R13: ffff8c4434644c90 R14: 0000000000000000 R15: ffffffff8105cb6e
[  188.344553] FS:  00007f0e2aca8740(0000) GS:ffff8c4439fc0000(0000) knlGS:0000000000000000
[  188.349616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  188.353835] CR2: 00007f0e2a665ff0 CR3: 00000001346a5005 CR4: 00000000000606e0
[  188.358604] Call Trace:
[  188.361539]  up_read+0x1a/0x40
[  188.364661]  __do_page_fault+0x28e/0x4c0
[  188.368059]  do_page_fault+0x30/0x80
[  188.371338]  page_fault+0x28/0x30
[  188.374426] RIP: 0033:0x7f0e2a7c6c90
[  188.377543] RSP: 002b:00007f0e2a46bfe8 EFLAGS: 00010246
[  188.381238] RAX: 0000000000001000 RBX: 00007f0e2a46bff0 RCX: 00007f0e2a7c6c90
[  188.385576] RDX: 0000000000001000 RSI: 00007f0e2a46bff0 RDI: 00000000000000fd
[  188.389886] RBP: 00000000000000fd R08: 00007f0e2a726938 R09: 000000000000000d
[  188.394148] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000100000000
[  188.398339] R13: 00007f0c2a4d1010 R14: 0000000000000000 R15: 00007f0c2a4d1010
[  188.402508] Code: 5e 41 5f 5d c3 e8 df a7 26 00 85 c0 74 1f 8b 35 2d 2f df 01 85 f6 75 15 48 c7 c6 66 7c a0 81 48 c7 c7 5b 41 a0 81 e8 6a 14 01 00 <0f> ff 4c 89 fa 4c 89 ee 48 89 df e8 fe c8 ff ff eb 88 48 c7 c7 
[  188.412704] ---[ end trace d42863c48bb12d0a ]---
----------

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  6:46 ` Tetsuo Handa
@ 2017-08-04  7:42     ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04  7:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> >                          So there is a race window when some threads
> > won't have fatal_signal_pending while the oom_reaper could start
> > unmapping the address space. generic_perform_write could then write
> > zero page to the page cache and corrupt data.
> 
> Oh, simple generic_perform_write() ?
> 
> > 
> > The race window is rather small and close to impossible to happen but it
> > would be better to have it covered.
> 
> OK, I confirmed that this problem is easily reproducible using below reproducer.

Yeah, I can imagine this could be triggered artificially. I am somewhat
more skeptical that real life oom scenarios would trigger it, though.
Anyway, thanks for your test case!
 
> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger something lock related problem.
> Is your patch really safe?

> ----------
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] ------------[ cut here ]------------
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
> [   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
> [   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
> [   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
> [   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
> [   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
> [   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
> [   58.657280] Call Trace:
> [   58.659989]  up_read+0x1a/0x40
> [   58.662825]  __do_page_fault+0x28e/0x4c0
> [   58.665946]  do_page_fault+0x30/0x80
> [   58.668911]  page_fault+0x28/0x30

OK, I know what is going on here. The page fault must have returned
VM_FAULT_RETRY, in which case mmap_sem has already been dropped inside
the fault path. My patch overwrites this error code so the page fault
path doesn't know that the lock is no longer held and releases it
unconditionally. This is a preexisting problem introduced by
3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped
memory"). I should have considered this option.
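
The arch page fault handlers rely on that flag to know whether the lock
is still held; roughly (a simplified sketch of the tail of
__do_page_fault(), see arch/x86/mm/fault.c for the real code):

	fault = handle_mm_fault(vma, address, flags);

	if (unlikely(fault & VM_FAULT_RETRY)) {
		/* mmap_sem has already been dropped by the fault path */
		if (flags & FAULT_FLAG_ALLOW_RETRY) {
			flags |= FAULT_FLAG_TRIED;
			goto retry;
		}
		return;
	}

	up_read(&mm->mmap_sem);	/* assumes the lock is still held */

With MMF_UNSTABLE rewriting VM_FAULT_RETRY to VM_FAULT_SIGBUS we fall
through to that final up_read() even though the lock has already been
dropped, hence the splat above.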

I believe the easiest way around this is the following patch:
---
>From dd31779f763bbe2aa86100f804656ac680c49d35 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 4 Aug 2017 09:36:34 +0200
Subject: [PATCH] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced
 SIGBUS

Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. The MMF_UNSTABLE check however rewrites
the error code to VM_FAULT_SIGBUS and we always expect mmap_sem to be
held in that path. Fix this by re-taking mmap_sem when VM_FAULT_RETRY is
returned in the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS
to the existing error code because all arch specific page fault handlers
and g-u-p would have to learn a new error code combination.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  7:42     ` Michal Hocko
  (?)
@ 2017-08-04  8:25     ` Tetsuo Handa
  2017-08-04  8:32         ` Michal Hocko
  2017-08-04  9:16         ` Michal Hocko
  -1 siblings, 2 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-04  8:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

Well, while the lockdep warning is gone, this problem remains.

diff --git a/mm/memory.c b/mm/memory.c
index edabf6f..1e06c29 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
        /*
         * This mm has been already reaped by the oom reaper and so the
         * refault cannot be trusted in general. Anonymous refaults would
-        * lose data and give a zero page instead e.g. This is especially
-        * problem for use_mm() because regular tasks will just die and
-        * the corrupted data will not be visible anywhere while kthread
-        * will outlive the oom victim and potentially propagate the date
-        * further.
+        * lose data and give a zero page instead e.g.
         */
-       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+       if (unlikely(!(ret & VM_FAULT_ERROR)
+                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+               if (ret & VM_FAULT_RETRY)
+                       down_read(&vma->vm_mm->mmap_sem);
                ret = VM_FAULT_SIGBUS;
+       }

        return ret;
 }

$ cat /tmp/file.* | od -b | head
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
420330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
420340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
457330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
457340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:25     ` Tetsuo Handa
@ 2017-08-04  8:32         ` Michal Hocko
  2017-08-04  9:16         ` Michal Hocko
  1 sibling, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04  8:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.

Ohh, I should have been more specific. Both patches have to be applied.
I have put this one first because it should go to stable. The latter
one needs a trivial conflict resolution. I will send both of them as a
reply to this email!

Thanks for retesting. It matches my testing results.
-- 
Michal Hocko
SUSE Labs

* [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
  2017-08-04  8:32         ` Michal Hocko
@ 2017-08-04  8:33           ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. The MMF_UNSTABLE check however rewrites
the error code to VM_FAULT_SIGBUS and we always expect mmap_sem to be
held in that path. Fix this by re-taking mmap_sem when VM_FAULT_RETRY is
returned in the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS
to the existing error code because all arch specific page fault handlers
and g-u-p would have to learn a new error code combination.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

* [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:33           ` Michal Hocko
@ 2017-08-04  8:33             ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and will not make any further visible changes is not entirely
true. __task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT
is set but do_group_exit sends SIGKILL to all threads _after_ the flag
is set. So there is a race window during which some threads do not yet
have fatal_signal_pending set while the oom_reaper could already start
unmapping the address space. generic_perform_write could then write a
zero page to the page cache and corrupt data.

The race window is rather small and close to impossible to hit but it
is better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfaulting on any page fault after the oom reaper has started its
work. This means that nobody will ever observe potentially corrupted
content. Formerly we cared only about use_mm users because those can
easily outlive the oom victim, but having the process itself protected
sounds like a reasonable thing to do as well.

There doesn't seem to be any real life bug report, so this is merely a
fix for a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4fe5b6254688..e7308e633b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,15 +3874,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-
 		/*
 		 * We are going to enforce SIGBUS but the PF path might have
 		 * dropped the mmap_sem already so take it again so that
-- 
2.13.2

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:25     ` Tetsuo Handa
@ 2017-08-04  9:16         ` Michal Hocko
  2017-08-04  9:16         ` Michal Hocko
  1 sibling, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04  9:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index edabf6f..1e06c29 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         /*
>          * This mm has been already reaped by the oom reaper and so the
>          * refault cannot be trusted in general. Anonymous refaults would
> -        * lose data and give a zero page instead e.g. This is especially
> -        * problem for use_mm() because regular tasks will just die and
> -        * the corrupted data will not be visible anywhere while kthread
> -        * will outlive the oom victim and potentially propagate the date
> -        * further.
> +        * lose data and give a zero page instead e.g.
>          */
> -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> +       if (unlikely(!(ret & VM_FAULT_ERROR)
> +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> +               if (ret & VM_FAULT_RETRY)
> +                       down_read(&vma->vm_mm->mmap_sem);
>                 ret = VM_FAULT_SIGBUS;
> +       }
> 
>         return ret;
>  }

I have re-read your email and I guess I misread it previously. Are you
saying that the data corruption happens with both patches applied?

> 
> $ cat /tmp/file.* | od -b | head
> 0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 420330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 420340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 457330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 457340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  9:16         ` Michal Hocko
@ 2017-08-04 10:41           ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-04 10:41 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

Michal Hocko wrote:
> On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > Well, while lockdep warning is gone, this problem is remaining.
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index edabf6f..1e06c29 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >         /*
> >          * This mm has been already reaped by the oom reaper and so the
> >          * refault cannot be trusted in general. Anonymous refaults would
> > -        * lose data and give a zero page instead e.g. This is especially
> > -        * problem for use_mm() because regular tasks will just die and
> > -        * the corrupted data will not be visible anywhere while kthread
> > -        * will outlive the oom victim and potentially propagate the date
> > -        * further.
> > +        * lose data and give a zero page instead e.g.
> >          */
> > -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > +       if (unlikely(!(ret & VM_FAULT_ERROR)
> > +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > +               if (ret & VM_FAULT_RETRY)
> > +                       down_read(&vma->vm_mm->mmap_sem);
> >                 ret = VM_FAULT_SIGBUS;
> > +       }
> > 
> >         return ret;
> >  }
> 
> I have re-read your email again and I guess I misread previously. Are
> you saying that the data corruption happens with the both patches
> applied?

Yes. Data corruption still happens.

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 10:41           ` Tetsuo Handa
@ 2017-08-04 11:00             ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04 11:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > > Well, while lockdep warning is gone, this problem is remaining.
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index edabf6f..1e06c29 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >         /*
> > >          * This mm has been already reaped by the oom reaper and so the
> > >          * refault cannot be trusted in general. Anonymous refaults would
> > > -        * lose data and give a zero page instead e.g. This is especially
> > > -        * problem for use_mm() because regular tasks will just die and
> > > -        * the corrupted data will not be visible anywhere while kthread
> > > -        * will outlive the oom victim and potentially propagate the date
> > > -        * further.
> > > +        * lose data and give a zero page instead e.g.
> > >          */
> > > -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > +       if (unlikely(!(ret & VM_FAULT_ERROR)
> > > +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > > +               if (ret & VM_FAULT_RETRY)
> > > +                       down_read(&vma->vm_mm->mmap_sem);
> > >                 ret = VM_FAULT_SIGBUS;
> > > +       }
> > > 
> > >         return ret;
> > >  }
> > 
> > I have re-read your email again and I guess I misread previously. Are
> > you saying that the data corruption happens with the both patches
> > applied?
> 
> Yes. Data corruption still happens.

I guess I managed to reproduce finally. Will investigate further.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 11:00             ` Michal Hocko
@ 2017-08-04 14:56               ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-04 14:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
[...]
> > Yes. Data corruption still happens.
> 
> I guess I managed to reproduce finally. Will investigate further.

One limitation of the current MMF_UNSTABLE implementation is that it
still keeps the new page mapped and only sends EFAULT/kill to the
consumer. If somebody tries to re-read the same content nothing will
really happen. I went this way because it was much simpler and memory
consumers usually do not retry on EFAULT. Maybe this is not the case
here.

I've been staring into iov_iter_copy_from_user_atomic, which I believe
should be the common write path that reads the user buffer from which the
corruption caused by the oom_reaper would come.
iov_iter_fault_in_readable should be called before this function. If
this happened after MMF_UNSTABLE was set then we should get EFAULT and
bail out early. Let's assume this wasn't the case. Then we should get
down to iov_iter_copy_from_user_atomic and that one shouldn't copy any
data because __copy_from_user_inatomic says

 * If copying succeeds, the return value must be 0.  If some data cannot be
 * fetched, it is permitted to copy less than had been fetched; the only
 * hard requirement is that not storing anything at all (i.e. returning size)
 * should happen only when nothing could be copied.  In other words, you don't
 * have to squeeze as much as possible - it is allowed, but not necessary.

which should be our case.

I was testing with xfs (but generic_perform_write seems to be doing the
same thing) and that one does
		if (unlikely(copied == 0)) {
			/*
			 * If we were unable to copy any data at all, we must
			 * fall back to a single segment length write.
			 *
			 * If we didn't fallback here, we could livelock
			 * because not all segments in the iov can be copied at
			 * once without a pagefault.
			 */
			bytes = min_t(unsigned long, PAGE_SIZE - offset,
						iov_iter_single_seg_count(i));
			goto again;
		}

and that will go through iov_iter_fault_in_readable again, which will now
succeed.

And that's why we still see the corruption. That, however, means that
the MMF_UNSTABLE implementation has to be more complex and we have to
hook into all anonymous memory fault paths which I hoped I could avoid
previously.
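
To make the failure sequence concrete, here is a heavily simplified,
schematic sketch of how the copied == 0 fallback defeats the single
EFAULT (pseudo-C with shortened, hypothetical helper names; not actual
kernel code):

	while (bytes_left) {
		/* prefault the user buffer; this succeeded while the
		 * mapping was still intact */
		if (fault_in_readable(user_buf, bytes))
			break;		/* the EFAULT we would like to see */

		/* the oom_reaper unmapped the buffer in the meantime, so
		 * the atomic copy cannot fault anything in and copies
		 * nothing */
		copied = copy_from_user_atomic(page, user_buf, bytes);

		if (copied == 0) {
			/* fall back to a single segment and retry; the
			 * prefault above now refaults a zero page, so the
			 * next copy "succeeds" and zeros reach the file */
			bytes = single_seg_count();
			continue;
		}
		/* ... commit the copied bytes to the page cache ... */
	}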

This is a rough first draft that passes the test case from Tetsuo on my
system. It will need many more eyes on it and I will return to it with a
fresh brain next week. I would appreciate as much testing as possible.

Note that this is on top of the previous attempt for the fix but I will
squash the result into one patch because the previous one is not
sufficient.
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..1fbc78d423d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	int ret;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto release;
 	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	} else {
 		pmd_t entry;
 
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			spin_unlock(vmf->ptl);
+			ret = VM_FAULT_SIGBUS;
+			goto release;
+		}
 		/* Deliver the page fault to userland */
 		if (userfaultfd_missing(vma)) {
 			int ret;
@@ -610,6 +619,13 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	}
 
 	return 0;
+release:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
 }
 
 /*
@@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			if (userfaultfd_missing(vma)) {
+			/*
+			 * range could have been already torn down by
+			 * the oom reaper
+			 */
+			if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+				spin_unlock(vmf->ptl);
+				ret = VM_FAULT_SIGBUS;
+			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
diff --git a/mm/memory.c b/mm/memory.c
index e7308e633b52..7de9508e38e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2864,6 +2864,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mem_cgroup *memcg;
 	struct page *page;
+	int ret = 0;
 	pte_t entry;
 
 	/* File mapping without ->vm_ops ? */
@@ -2896,6 +2897,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte))
 			goto unlock;
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2930,6 +2939,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (!pte_none(*vmf->pte))
 		goto release;
 
+	/*
+	 * range could have been already torn down by
+	 * the oom reaper
+	 */
+	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+		ret = VM_FAULT_SIGBUS;
+		goto release;
+	}
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2949,7 +2967,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	return ret;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
@@ -3231,7 +3249,10 @@ int finish_fault(struct vm_fault *vmf)
 		page = vmf->cow_page;
 	else
 		page = vmf->page;
-	ret = alloc_set_pte(vmf, vmf->memcg, page);
+	if (!test_bit(MMF_UNSTABLE, &vmf->vma->vm_mm->flags))
+		ret = alloc_set_pte(vmf, vmf->memcg, page);
+	else
+		ret = VM_FAULT_SIGBUS;
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
@@ -3871,24 +3892,6 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
-	/*
-	 * This mm has been already reaped by the oom reaper and so the
-	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g.
-	 */
-	if (unlikely(!(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-		/*
-		 * We are going to enforce SIGBUS but the PF path might have
-		 * dropped the mmap_sem already so take it again so that
-		 * we do not break expectations of all arch specific PF paths
-		 * and g-u-p
-		 */
-		if (ret & VM_FAULT_RETRY)
-			down_read(&vma->vm_mm->mmap_sem);
-		ret = VM_FAULT_SIGBUS;
-	}
-
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 14:56               ` Michal Hocko
@ 2017-08-04 16:49                 ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-04 16:49 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

Michal Hocko wrote:
> And that's why we still see the corruption. That, however, means that
> the MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.

I don't understand mm internals including pte/ptl etc., but I guess that
the direction is correct. Since the OOM reaper basically does

  Set MMF_UNSTABLE flag on mm_struct.
  For each reapable page in mm_struct {
    Take ptl lock.
    Remove pte.
    Release ptl lock.
  }

the page fault handler will need to check MMF_UNSTABLE with lock held.

  For each faulted page in mm_struct {
    Take ptl lock.
    Add pte only if MMF_UNSTABLE flag is not set.
    Release ptl lock.
  }
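
In C terms, a minimal sketch of that check could look like the following
(illustrative only, not the actual patch; variable names are assumed from
the fault handling context):

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (!pte_none(*vmf->pte))
		goto unlock;		/* lost the race to another fault */
	/* the reaper sets MMF_UNSTABLE before removing ptes, so re-checking
	 * it under the page table lock before installing the pte avoids
	 * re-populating an already reaped range */
	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
		ret = VM_FAULT_SIGBUS;
		goto unlock;
	}
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
	update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return ret;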

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 14:56               ` Michal Hocko
@ 2017-08-05  1:46                 ` 陶文苇
  -1 siblings, 0 replies; 55+ messages in thread
From: 陶文苇 @ 2017-08-05  1:46 UTC (permalink / raw)
  To: 'Michal Hocko', 'Tetsuo Handa'
  Cc: linux-mm, akpm, oleg, rientjes, linux-kernel



> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Friday, August 04, 2017 10:57 PM
> To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: linux-mm@kvack.org; akpm@linux-foundation.org; 陶文苇
> <wenwei.tww@alibaba-inc.com>; oleg@redhat.com; rientjes@google.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm, oom: fix potential data corruption when
> oom_reaper races with writer
> 
> On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> > On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> [...]
> > > Yes. Data corruption still happens.
> >
> > I guess I managed to reproduce finally. Will investigate further.
> 
> One limitation of the current MMF_UNSTABLE implementation is that it still
> keeps the new page mapped and only sends EFAULT/kill to the consumer. If
> somebody tries to re-read the same content nothing will really happen. I
> went this way because it was much simpler and memory consumers usually
> do not retry on EFAULT. Maybe this is not the case here.
> 
> I've been staring into iov_iter_copy_from_user_atomic which I believe
> should be the common write path which reads the user buffer where the
> corruption caused by the oom_reaper would come from.
> iov_iter_fault_in_readable should be called before this function. If this
> happened after MMF_UNSTABLE was set then we should get EFAULT and bail
> out early. Let's assume this wasn't the case. Then we should get down to
> iov_iter_copy_from_user_atomic and that one shouldn't copy any data
> because __copy_from_user_inatomic says
> 
>  * If copying succeeds, the return value must be 0.  If some data cannot
be
>  * fetched, it is permitted to copy less than had been fetched; the only
>  * hard requirement is that not storing anything at all (i.e. returning
size)
>  * should happen only when nothing could be copied.  In other words, you
> don't
>  * have to squeeze as much as possible - it is allowed, but not necessary.
> 
> which should be our case.
> 
> I was testing with xfs (but generic_perform_write seem to be doing the
same
> thing) and that one does
> 		if (unlikely(copied == 0)) {
> 			/*
> 			 * If we were unable to copy any data at all, we
must
> 			 * fall back to a single segment length write.
> 			 *
> 			 * If we didn't fallback here, we could livelock
> 			 * because not all segments in the iov can be copied
at
> 			 * once without a pagefault.
> 			 */
> 			bytes = min_t(unsigned long, PAGE_SIZE - offset,
>
iov_iter_single_seg_count(i));
> 			goto again;
> 		}
> 
> and that again will go through iov_iter_fault_in_readable again and that
will
> succeed now.
> 
Agreed, I didn't notice this case before.

> And that's why we still see the corruption. That, however, means that the
> MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.
> 
> This is a rough first draft that passes the test case from Tetsuo on my
system.
> It will need much more eyes on it and I will return to it with a fresh
brain next
> week. I would appreciate as much testing as possible.
> 
> Note that this is on top of the previous attempt for the fix but I will
squash
> the result into one patch because the previous one is not sufficient.
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c index
> 86975dec0ba1..1fbc78d423d7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct
> vm_fault *vmf, struct page *page,
>  	struct mem_cgroup *memcg;
>  	pgtable_t pgtable;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	int ret;
> 
>  	VM_BUG_ON_PAGE(!PageCompound(page), page);
> 
> @@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct
> vm_fault *vmf, struct page *page,
> 
>  	pgtable = pte_alloc_one(vma->vm_mm, haddr);
>  	if (unlikely(!pgtable)) {
> -		mem_cgroup_cancel_charge(page, memcg, true);
> -		put_page(page);
> -		return VM_FAULT_OOM;
> +		ret = VM_FAULT_OOM;
> +		goto release;
>  	}
> 
>  	clear_huge_page(page, haddr, HPAGE_PMD_NR); @@ -583,6 +583,15
> @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> struct page *page,
>  	} else {
>  		pmd_t entry;
> 
> +		/*
> +		 * range could have been already torn down by
> +		 * the oom reaper
> +		 */
> +		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +			spin_unlock(vmf->ptl);
> +			ret = VM_FAULT_SIGBUS;
> +			goto release;
> +		}
>  		/* Deliver the page fault to userland */
>  		if (userfaultfd_missing(vma)) {
>  			int ret;
> @@ -610,6 +619,13 @@ static int
> __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page
> *page,
>  	}
> 
>  	return 0;
> +release:
> +	if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
> +	mem_cgroup_cancel_charge(page, memcg, true);
> +	put_page(page);
> +	return ret;
> +
>  }
> 
>  /*
> @@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct
> vm_fault *vmf)
>  		ret = 0;
>  		set = false;
>  		if (pmd_none(*vmf->pmd)) {
> -			if (userfaultfd_missing(vma)) {
> +			/*
> +			 * range could have been already torn down by
> +			 * the oom reaper
> +			 */
> +			if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +				spin_unlock(vmf->ptl);
> +				ret = VM_FAULT_SIGBUS;
> +			} else if (userfaultfd_missing(vma)) {
>  				spin_unlock(vmf->ptl);
>  				ret = handle_userfault(vmf,
VM_UFFD_MISSING);
>  				VM_BUG_ON(ret & VM_FAULT_FALLBACK); diff
--git
> a/mm/memory.c b/mm/memory.c index e7308e633b52..7de9508e38e4
> 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2864,6 +2864,7 @@ static int do_anonymous_page(struct vm_fault
> *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mem_cgroup *memcg;
>  	struct page *page;
> +	int ret = 0;
>  	pte_t entry;
> 
>  	/* File mapping without ->vm_ops ? */
> @@ -2896,6 +2897,14 @@ static int do_anonymous_page(struct vm_fault
> *vmf)
>  				vmf->address, &vmf->ptl);
>  		if (!pte_none(*vmf->pte))
>  			goto unlock;
> +		/*
> +		 * range could have been already torn down by
> +		 * the oom reaper
> +		 */
> +		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +			ret = VM_FAULT_SIGBUS;
> +			goto unlock;
> +		}
>  		/* Deliver the page fault to userland, check inside PT lock
*/
>  		if (userfaultfd_missing(vma)) {
>  			pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2930,6
> +2939,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	if (!pte_none(*vmf->pte))
>  		goto release;
> 
> +	/*
> +	 * range could have been already torn down by
> +	 * the oom reaper
> +	 */
> +	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto release;
> +	}
> +
>  	/* Deliver the page fault to userland, check inside PT lock */
>  	if (userfaultfd_missing(vma)) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2949,7 +2967,7 @@
> static int do_anonymous_page(struct vm_fault *vmf)
>  	update_mmu_cache(vma, vmf->address, vmf->pte);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> -	return 0;
> +	return ret;
>  release:
>  	mem_cgroup_cancel_charge(page, memcg, false);
>  	put_page(page);
> @@ -3231,7 +3249,10 @@ int finish_fault(struct vm_fault *vmf)
>  		page = vmf->cow_page;
>  	else
>  		page = vmf->page;
> -	ret = alloc_set_pte(vmf, vmf->memcg, page);
> +	if (!test_bit(MMF_UNSTABLE, &vmf->vma->vm_mm->flags))
> +		ret = alloc_set_pte(vmf, vmf->memcg, page);
> +	else
> +		ret = VM_FAULT_SIGBUS;
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	return ret;
> @@ -3871,24 +3892,6 @@ int handle_mm_fault(struct vm_area_struct
> *vma, unsigned long address,
>  			mem_cgroup_oom_synchronize(false);
>  	}
> 
> -	/*
> -	 * This mm has been already reaped by the oom reaper and so the
> -	 * refault cannot be trusted in general. Anonymous refaults would
> -	 * lose data and give a zero page instead e.g.
> -	 */
> -	if (unlikely(!(ret & VM_FAULT_ERROR)
> -				&& test_bit(MMF_UNSTABLE,
&vma->vm_mm->flags))) {
> -		/*
> -		 * We are going to enforce SIGBUS but the PF path might have
> -		 * dropped the mmap_sem already so take it again so that
> -		 * we do not break expectations of all arch specific PF
paths
> -		 * and g-u-p
> -		 */
> -		if (ret & VM_FAULT_RETRY)
> -			down_read(&vma->vm_mm->mmap_sem);
> -		ret = VM_FAULT_SIGBUS;
> -	}
> -
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-14 13:59               ` Michal Hocko
  (?)
@ 2017-08-15  5:30               ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-15  5:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Tetsuo Handa, akpm, andrea, kirill, oleg,
	wenwei.tww, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 12-08-17 00:46:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > > > > Will you explain the mechanism why random values are written instead of zeros
> > > > > > > so that this patch can actually fix the race problem?
> > > > > > 
> > > > > > I am not sure what you mean here. Were you able to see a write with an
> > > > > > unexpected content?
> > > > > 
> > > > > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> > > > 
> > > > Ahh, I've missed that random part of your output. That is really strange
> > > > because AFAICS the oom reaper shouldn't really interact here. We are
> > > > only unmapping anonymous memory and even if a refault slips through we
> > > > should always get zeros.
> > > > 
> > > > Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> > > > get any uninitialized data from a file by missing CoWed content. The
> > > > only possible explanations would be that a page fault returned a
> > > > non-zero data which would be a bug on its own or that a file write
> > > > extend the file without actually writing to it which smells like a fs
> > > > bug to me.
> > > 
> > > As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
> > > I don't think it is a fs bug.
> > 
> > Were you able to reproduce with other filesystems?
> 
> Yes, I can reproduce this problem using both xfs and ext4 on 4.11.11-200.fc25.x86_64
> on Oracle VM VirtualBox on Windows.
> 
> I believe that this is not old data from disk, for I can reproduce this problem
> using newly attached /dev/sdb which has never written any data (other than data
> written by mkfs.xfs and mkfs.ext4).
> 
>   /dev/sdb /tmp ext4 rw,seclabel,relatime,data=ordered 0 0
>   
> The garbage pattern (the last 4096 bytes) is identical for both xfs and ext4.

I can reproduce this problem very easily using btrfs on 4.11.11-200.fc25.x86_64
on Oracle VM VirtualBox on Windows.

  /dev/sdb /tmp btrfs rw,seclabel,relatime,space_cache,subvolid=5,subvol=/ 0 0

The garbage pattern is identical for all xfs/ext4/btrfs.
The more complicated things a fs does, the more likely it is to hit this problem?
I tried ntfs but so far I am not able to reproduce this problem.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 15:46             ` Tetsuo Handa
@ 2017-08-14 13:59               ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-14 13:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Sat 12-08-17 00:46:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > > Will you explain the mechanism why random values are written instead of zeros
> > > > > so that this patch can actually fix the race problem?
> > > > 
> > > > I am not sure what you mean here. Were you able to see a write with an
> > > > unexpected content?
> > > 
> > > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> > 
> > Ahh, I've missed that random part of your output. That is really strange
> > because AFAICS the oom reaper shouldn't really interact here. We are
> > only unmapping anonymous memory and even if a refault slips through we
> > should always get zeros.
> > 
> > Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> > get any uninitialized data from a file by missing CoWed content. The
> > only possible explanations would be that a page fault returned a
> > non-zero data which would be a bug on its own or that a file write
> > extend the file without actually writing to it which smells like a fs
> > bug to me.
> 
> As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
> I don't think it is a fs bug.

Were you able to reproduce with other filesystems? I wonder what is
different in my testing because I cannot reproduce this at all. Well, I
had to reduce the number of competing writer threads to 128 because I
quickly hit the thrashing behavior with more of them (and 4 CPUs). I will
try on a larger machine.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 12:08           ` Michal Hocko
@ 2017-08-11 15:46             ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-11 15:46 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > Will you explain the mechanism why random values are written instead of zeros
> > > > so that this patch can actually fix the race problem?
> > > 
> > > I am not sure what you mean here. Were you able to see a write with an
> > > unexpected content?
> > 
> > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .
> 
> Ahh, I've missed that random part of your output. That is really strange
> because AFAICS the oom reaper shouldn't really interact here. We are
> only unmapping anonymous memory and even if a refault slips through we
> should always get zeros.
> 
> Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> get any uninitialized data from a file by missing CoWed content. The
> only possible explanations would be that a page fault returned a
> non-zero data which would be a bug on its own or that a file write
> extend the file without actually writing to it which smells like a fs
> bug to me.

As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@I-love.SAKURA.ne.jp ,
I don't think it is a fs bug.

> 
> Anyway I wasn't able to reproduce this and I was running your usecase
> in the loop for quite some time (with xfs storage). How reproducible
> is this? If you can reproduce easily can you simply comment out
> unmap_page_range in __oom_reap_task_mm and see if that makes any change
> just to be sure that the oom reaper can be ruled out?

The frequency of writing not-zero values is lower than that of writing zero values.
But if I comment out unmap_page_range() in __oom_reap_task_mm(), I can't even
reproduce writing zero values. As far as I tested, writing not-zero values occurs
only if the OOM reaper is involved.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  7:54         ` Tetsuo Handa
@ 2017-08-11 12:08           ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-11 12:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > +/*
> > > > + * Checks whether a page fault on the given mm is still reliable.
> > > > + * This is no longer true if the oom reaper started to reap the
> > > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > > + * the mm. At that moment any !shared mapping would lose the content
> > > > + * and could cause a memory corruption (zero pages instead of the
> > > > + * original content).
> > > > + *
> > > > + * User should call this before establishing a page table entry for
> > > > + * a !shared mapping and under the proper page table lock.
> > > > + *
> > > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > > + */
> > > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > > +{
> > > > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > > +		return VM_FAULT_SIGBUS;
> > > > +	return 0;
> > > > +}
> > > > +
> > > 
> > > Will you explain the mechanism why random values are written instead of zeros
> > > so that this patch can actually fix the race problem?
> > 
> > I am not sure what you mean here. Were you able to see a write with an
> > unexpected content?
> 
> Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .

Ahh, I've missed that random part of your output. That is really strange
because AFAICS the oom reaper shouldn't really interact here. We are
only unmapping anonymous memory and even if a refault slips through we
should always get zeros.

Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
get any uninitialized data from a file by missing CoWed content. The
only possible explanations would be that a page fault returned non-zero
data, which would be a bug on its own, or that a file write extended the
file without actually writing to it, which smells like a fs bug to me.

Anyway I wasn't able to reproduce this and I was running your usecase
in the loop for quite some time (with xfs storage). How reproducible
is this? If you can reproduce easily can you simply comment out
unmap_page_range in __oom_reap_task_mm and see if that makes any change
just to be sure that the oom reaper can be ruled out?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 10:42             ` Andrea Arcangeli
@ 2017-08-11 11:53               ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-11 11:53 UTC (permalink / raw)
  To: aarcange; +Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Andrea Arcangeli wrote:
> On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> > disk block? This would happen on ext4 as well if mounted with -o
> > journal=data instead of -o journal=ordered in fact, perhaps you simply
> 
> Oops, above I meant journal=writeback; journal=data is even stronger
> than journal=ordered, of course.
> 
> And I shall clarify further that old disk content can only show up
> legitimately on journal=writeback after a hard reboot or crash or, in
> general, an unclean unmount. Even if there's no journaling at all
> (i.e. ext2/vfat), old disk content cannot show up at any given time, no
> matter what, if there's no unclean unmount that requires a journal
> replay.

I'm using XFS on a small non-NUMA system (4 CPUs / 4096MB RAM).

  /dev/sda1 / xfs rw,relatime,attr2,inode64,noquota 0 0

As far as I tested, not-zero not-0xff values did not show up with the
4.6.7 kernel (i.e. all not-0xff bytes were zero), while not-zero
not-0xff values do show up with the 4.13.0-rc4-next-20170811 kernel.

> 
> This theory of a completely unrelated fs bug showing you disk content
> as a result of the OOM-reaper-induced SIGBUS interrupting a
> copy_from_user at its very start is purely motivated by the fact that,
> like Michal, I didn't see much explanation on the VM side that could
> cause those not-zero not-0xff values to show up in the buffer of the
> write syscall. You can try changing the fs and see if it happens again
> to rule it out. If it always happens regardless of the filesystem used,
> then it's likely not a fs bug of course. You've got an entire, aligned
> 4k fs block showing up with that data.
> 

What is strange is that, as far as I tested, the pattern of not-zero
not-0xff bytes always seems to be the same. Such a thing is unlikely to
happen if old content on the disk were showing up by chance. Maybe the
content written is not random but a specific 4096-byte chunk of the
memory image of an executable file.

$ cat checker.c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char buffer2[64] = { };
        int ret = 0;
        int i;
        for (i = 0; i < 1024; i++) {
                 int flag = 0;
                 int fd;
                 unsigned int byte[256];
                 int j;
                 snprintf(buffer2, sizeof(buffer2), "/tmp/file.%u", i);
                 fd = open(buffer2, O_RDONLY);
                 if (fd == -1) /* open() returns -1 on error, not EOF */
                         continue;
                 memset(byte, 0, sizeof(byte));
                 while (1) {
                         static unsigned char buffer[1048576];
                         int len = read(fd, (char *) buffer, sizeof(buffer));
                         if (len <= 0)
                                 break;
                         for (j = 0; j < len; j++)
                                 if (buffer[j] != 0xFF)
                                         byte[buffer[j]]++;
                 }
                 close(fd);
                 for (j = 0; j < 255; j++)
                         if (byte[j]) {
                                 printf("ERROR: %u %u in %s\n", byte[j], j, buffer2);
                                 flag = 1;
                         }
                 if (flag == 0)
                         unlink(buffer2);
                 else
                         ret = 1;
        }
        return ret;
}
$ uname -r
4.13.0-rc4-next-20170811
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.4
$ /bin/rm /tmp/file.4
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.6
$ /bin/rm /tmp/file.6
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.0
$ /bin/rm /tmp/file.0
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.4
ERROR: 40 1 in /tmp/file.4
ERROR: 53 2 in /tmp/file.4
ERROR: 29 3 in /tmp/file.4
ERROR: 27 4 in /tmp/file.4
ERROR: 5 5 in /tmp/file.4
ERROR: 14 6 in /tmp/file.4
ERROR: 8 7 in /tmp/file.4
ERROR: 16 8 in /tmp/file.4
ERROR: 4 9 in /tmp/file.4
ERROR: 12 10 in /tmp/file.4
ERROR: 4 11 in /tmp/file.4
ERROR: 2 12 in /tmp/file.4
ERROR: 10 13 in /tmp/file.4
ERROR: 13 14 in /tmp/file.4
ERROR: 4 15 in /tmp/file.4
ERROR: 26 16 in /tmp/file.4
ERROR: 5 17 in /tmp/file.4
ERROR: 23 18 in /tmp/file.4
ERROR: 4 19 in /tmp/file.4
ERROR: 8 20 in /tmp/file.4
ERROR: 2 21 in /tmp/file.4
ERROR: 1 22 in /tmp/file.4
ERROR: 2 23 in /tmp/file.4
ERROR: 17 24 in /tmp/file.4
ERROR: 5 25 in /tmp/file.4
ERROR: 2 26 in /tmp/file.4
ERROR: 1 27 in /tmp/file.4
ERROR: 3 28 in /tmp/file.4
ERROR: 17 32 in /tmp/file.4
ERROR: 1 35 in /tmp/file.4
ERROR: 1 36 in /tmp/file.4
ERROR: 2 38 in /tmp/file.4
ERROR: 5 40 in /tmp/file.4
ERROR: 1 41 in /tmp/file.4
ERROR: 3 45 in /tmp/file.4
ERROR: 65 46 in /tmp/file.4
ERROR: 2 48 in /tmp/file.4
ERROR: 4 49 in /tmp/file.4
ERROR: 24 50 in /tmp/file.4
ERROR: 3 51 in /tmp/file.4
ERROR: 4 52 in /tmp/file.4
ERROR: 12 53 in /tmp/file.4
ERROR: 2 54 in /tmp/file.4
ERROR: 1 55 in /tmp/file.4
ERROR: 5 56 in /tmp/file.4
ERROR: 1 60 in /tmp/file.4
ERROR: 75 64 in /tmp/file.4
ERROR: 5 65 in /tmp/file.4
ERROR: 17 66 in /tmp/file.4
ERROR: 19 67 in /tmp/file.4
ERROR: 5 68 in /tmp/file.4
ERROR: 6 69 in /tmp/file.4
ERROR: 3 70 in /tmp/file.4
ERROR: 13 71 in /tmp/file.4
ERROR: 18 73 in /tmp/file.4
ERROR: 3 74 in /tmp/file.4
ERROR: 17 76 in /tmp/file.4
ERROR: 7 77 in /tmp/file.4
ERROR: 5 78 in /tmp/file.4
ERROR: 4 79 in /tmp/file.4
ERROR: 1 80 in /tmp/file.4
ERROR: 4 82 in /tmp/file.4
ERROR: 2 83 in /tmp/file.4
ERROR: 13 84 in /tmp/file.4
ERROR: 1 85 in /tmp/file.4
ERROR: 1 86 in /tmp/file.4
ERROR: 1 89 in /tmp/file.4
ERROR: 2 94 in /tmp/file.4
ERROR: 118 95 in /tmp/file.4
ERROR: 24 96 in /tmp/file.4
ERROR: 54 97 in /tmp/file.4
ERROR: 14 98 in /tmp/file.4
ERROR: 18 99 in /tmp/file.4
ERROR: 29 100 in /tmp/file.4
ERROR: 57 101 in /tmp/file.4
ERROR: 16 102 in /tmp/file.4
ERROR: 15 103 in /tmp/file.4
ERROR: 9 104 in /tmp/file.4
ERROR: 48 105 in /tmp/file.4
ERROR: 1 106 in /tmp/file.4
ERROR: 2 107 in /tmp/file.4
ERROR: 30 108 in /tmp/file.4
ERROR: 22 109 in /tmp/file.4
ERROR: 43 110 in /tmp/file.4
ERROR: 29 111 in /tmp/file.4
ERROR: 13 112 in /tmp/file.4
ERROR: 56 114 in /tmp/file.4
ERROR: 42 115 in /tmp/file.4
ERROR: 65 116 in /tmp/file.4
ERROR: 14 117 in /tmp/file.4
ERROR: 3 118 in /tmp/file.4
ERROR: 2 119 in /tmp/file.4
ERROR: 3 120 in /tmp/file.4
ERROR: 16 121 in /tmp/file.4
ERROR: 1 122 in /tmp/file.4
ERROR: 1 125 in /tmp/file.4
ERROR: 1 126 in /tmp/file.4
ERROR: 5 128 in /tmp/file.4
ERROR: 1 132 in /tmp/file.4
ERROR: 4 134 in /tmp/file.4
ERROR: 1 137 in /tmp/file.4
ERROR: 1 141 in /tmp/file.4
ERROR: 1 142 in /tmp/file.4
ERROR: 1 144 in /tmp/file.4
ERROR: 1 145 in /tmp/file.4
ERROR: 2 148 in /tmp/file.4
ERROR: 6 152 in /tmp/file.4
ERROR: 2 153 in /tmp/file.4
ERROR: 1 154 in /tmp/file.4
ERROR: 6 160 in /tmp/file.4
ERROR: 1 166 in /tmp/file.4
ERROR: 3 168 in /tmp/file.4
ERROR: 1 176 in /tmp/file.4
ERROR: 1 180 in /tmp/file.4
ERROR: 1 181 in /tmp/file.4
ERROR: 3 184 in /tmp/file.4
ERROR: 1 188 in /tmp/file.4
ERROR: 4 192 in /tmp/file.4
ERROR: 1 193 in /tmp/file.4
ERROR: 1 198 in /tmp/file.4
ERROR: 3 200 in /tmp/file.4
ERROR: 2 208 in /tmp/file.4
ERROR: 1 216 in /tmp/file.4
ERROR: 1 223 in /tmp/file.4
ERROR: 4 224 in /tmp/file.4
ERROR: 1 227 in /tmp/file.4
ERROR: 1 236 in /tmp/file.4
ERROR: 1 237 in /tmp/file.4
ERROR: 4 241 in /tmp/file.4
ERROR: 1 243 in /tmp/file.4
ERROR: 1 244 in /tmp/file.4
ERROR: 1 245 in /tmp/file.4
ERROR: 1 246 in /tmp/file.4
ERROR: 2 248 in /tmp/file.4
ERROR: 1 249 in /tmp/file.4
ERROR: 1 254 in /tmp/file.4
$ od -cb /tmp/file.4
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
        377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
600000000   -   1   1   )  \0  \0   .   s   y   m   t   a   b  \0   .   s
        055 061 061 051 000 000 056 163 171 155 164 141 142 000 056 163
600000020   t   r   t   a   b  \0   .   s   h   s   t   r   t   a   b  \0
        164 162 164 141 142 000 056 163 150 163 164 162 164 141 142 000
600000040   .   i   n   t   e   r   p  \0   .   n   o   t   e   .   A   B
        056 151 156 164 145 162 160 000 056 156 157 164 145 056 101 102
600000060   I   -   t   a   g  \0   .   n   o   t   e   .   g   n   u   .
        111 055 164 141 147 000 056 156 157 164 145 056 147 156 165 056
600000100   b   u   i   l   d   -   i   d  \0   .   g   n   u   .   h   a
        142 165 151 154 144 055 151 144 000 056 147 156 165 056 150 141
600000120   s   h  \0   .   d   y   n   s   y   m  \0   .   d   y   n   s
        163 150 000 056 144 171 156 163 171 155 000 056 144 171 156 163
600000140   t   r  \0   .   g   n   u   .   v   e   r   s   i   o   n  \0
        164 162 000 056 147 156 165 056 166 145 162 163 151 157 156 000
600000160   .   g   n   u   .   v   e   r   s   i   o   n   _   r  \0   .
        056 147 156 165 056 166 145 162 163 151 157 156 137 162 000 056
600000200   r   e   l   a   .   d   y   n  \0   .   r   e   l   a   .   p
        162 145 154 141 056 144 171 156 000 056 162 145 154 141 056 160
600000220   l   t  \0   .   i   n   i   t  \0   .   t   e   x   t  \0   .
        154 164 000 056 151 156 151 164 000 056 164 145 170 164 000 056
600000240   f   i   n   i  \0   .   r   o   d   a   t   a  \0   .   e   h
        146 151 156 151 000 056 162 157 144 141 164 141 000 056 145 150
600000260   _   f   r   a   m   e   _   h   d   r  \0   .   e   h   _   f
        137 146 162 141 155 145 137 150 144 162 000 056 145 150 137 146
600000300   r   a   m   e  \0   .   i   n   i   t   _   a   r   r   a   y
        162 141 155 145 000 056 151 156 151 164 137 141 162 162 141 171
600000320  \0   .   f   i   n   i   _   a   r   r   a   y  \0   .   j   c
        000 056 146 151 156 151 137 141 162 162 141 171 000 056 152 143
600000340   r  \0   .   d   y   n   a   m   i   c  \0   .   g   o   t  \0
        162 000 056 144 171 156 141 155 151 143 000 056 147 157 164 000
600000360   .   g   o   t   .   p   l   t  \0   .   d   a   t   a  \0   .
        056 147 157 164 056 160 154 164 000 056 144 141 164 141 000 056
600000400   b   s   s  \0   .   c   o   m   m   e   n   t  \0  \0  \0  \0
        142 163 163 000 056 143 157 155 155 145 156 164 000 000 000 000
600000420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600000440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 001  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 001 000
600000460   8 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        070 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000500  \0  \0  \0  \0 003  \0 002  \0   T 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 002 000 124 002 100 000 000 000 000 000
600000520  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 003  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 003 000
600000540   t 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        164 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000560  \0  \0  \0  \0 003  \0 004  \0 230 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 004 000 230 002 100 000 000 000 000 000
600000600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 005  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 005 000
600000620 270 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        270 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000640  \0  \0  \0  \0 003  \0 006  \0  \b 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 006 000 010 004 100 000 000 000 000 000
600000660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \a  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 007 000
600000700 206 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000720  \0  \0  \0  \0 003  \0  \b  \0 250 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 010 000 250 004 100 000 000 000 000 000
600000740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \t  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 011 000
600000760 310 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        310 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001000  \0  \0  \0  \0 003  \0  \n  \0 340 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 012 000 340 004 100 000 000 000 000 000
600001020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \v  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 013 000
600001040 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001060  \0  \0  \0  \0 003  \0  \f  \0   @ 006   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 014 000 100 006 100 000 000 000 000 000
600001100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \r  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 015 000
600001120      \a   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 007 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001140  \0  \0  \0  \0 003  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 016 000 024 012 100 000 000 000 000 000
600001160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 017  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 017 000
600001200      \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001220  \0  \0  \0  \0 003  \0 020  \0   @  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 020 000 100 012 100 000 000 000 000 000
600001240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 021  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 021 000
600001260 200  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001300  \0  \0  \0  \0 003  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 022 000 020 016 140 000 000 000 000 000
600001320  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 023  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 023 000
600001340 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001360  \0  \0  \0  \0 003  \0 024  \0     016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 024 000 040 016 140 000 000 000 000 000
600001400  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 025  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 025 000
600001420   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001440  \0  \0  \0  \0 003  \0 026  \0 370 017   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 026 000 370 017 140 000 000 000 000 000
600001460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 027  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 027 000
600001500  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001520  \0  \0  \0  \0 003  \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 030 000 200 020 140 000 000 000 000 000
600001540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 031  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 031 000
600001560 240 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001600  \0  \0  \0  \0 003  \0 032  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 003 000 032 000 000 000 000 000 000 000 000 000
600001620  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 001 000 000 000 004 000 361 377
600001640  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600001660  \b  \0  \0  \0 002  \0  \r  \0  \0  \t   @  \0  \0  \0  \0  \0
        010 000 000 000 002 000 015 000 000 011 100 000 000 000 000 000
600001700 221  \0  \0  \0  \0  \0  \0  \0 024  \0  \0  \0 001  \0 031  \0
        221 000 000 000 000 000 000 000 024 000 000 000 001 000 031 000
600001720 300 020   `  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0
        300 020 140 000 000 000 000 000 000 000 020 000 000 000 000 000
600001740      \0  \0  \0 001  \0 030  \0 220 020   `  \0  \0  \0  \0  \0
        040 000 000 000 001 000 030 000 220 020 140 000 000 000 000 000
600001760  \b  \0  \0  \0  \0  \0  \0  \0   (  \0  \0  \0 004  \0 361 377
        010 000 000 000 000 000 000 000 050 000 000 000 004 000 361 377
600002000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002020   3  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        063 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002040  \0  \0  \0  \0  \0  \0  \0  \0   @  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 100 000 000 000 002 000 015 000
600002060   @  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        100 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002100   U  \0  \0  \0 002  \0  \r  \0   p  \b   @  \0  \0  \0  \0  \0
        125 000 000 000 002 000 015 000 160 010 100 000 000 000 000 000
600002120  \0  \0  \0  \0  \0  \0  \0  \0   h  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 150 000 000 000 002 000 015 000
600002140 260  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        260 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002160   ~  \0  \0  \0 001  \0 031  \0 240 020   `  \0  \0  \0  \0  \0
        176 000 000 000 001 000 031 000 240 020 140 000 000 000 000 000
600002200 001  \0  \0  \0  \0  \0  \0  \0 215  \0  \0  \0 001  \0 023  \0
        001 000 000 000 000 000 000 000 215 000 000 000 001 000 023 000
600002220 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002240 264  \0  \0  \0 002  \0  \r  \0 320  \b   @  \0  \0  \0  \0  \0
        264 000 000 000 002 000 015 000 320 010 100 000 000 000 000 000
600002260  \0  \0  \0  \0  \0  \0  \0  \0 300  \0  \0  \0 001  \0 022  \0
        000 000 000 000 000 000 000 000 300 000 000 000 001 000 022 000
600002300 020 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        020 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002320   (  \0  \0  \0 004  \0 361 377  \0  \0  \0  \0  \0  \0  \0  \0
        050 000 000 000 004 000 361 377 000 000 000 000 000 000 000 000
600002340  \0  \0  \0  \0  \0  \0  \0  \0 337  \0  \0  \0 001  \0 021  \0
        000 000 000 000 000 000 000 000 337 000 000 000 001 000 021 000
600002360 300  \v   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        300 013 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002400 355  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        355 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 000 000 000 000 004 000 361 377
600002440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002460 371  \0  \0  \0  \0  \0 022  \0 030 016   `  \0  \0  \0  \0  \0
        371 000 000 000 000 000 022 000 030 016 140 000 000 000 000 000
600002500  \0  \0  \0  \0  \0  \0  \0  \0  \n 001  \0  \0 001  \0 025  \0
        000 000 000 000 000 000 000 000 012 001 000 000 001 000 025 000
600002520   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002540 023 001  \0  \0  \0  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        023 001 000 000 000 000 022 000 020 016 140 000 000 000 000 000
600002560  \0  \0  \0  \0  \0  \0  \0  \0   & 001  \0  \0 001  \0 027  \0
        000 000 000 000 000 000 000 000 046 001 000 000 001 000 027 000
600002600  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002620   < 001  \0  \0 022  \0  \r  \0 020  \n   @  \0  \0  \0  \0  \0
        074 001 000 000 022 000 015 000 020 012 100 000 000 000 000 000
600002640 002  \0  \0  \0  \0  \0  \0  \0   L 001  \0  \0      \0  \0  \0
        002 000 000 000 000 000 000 000 114 001 000 000 040 000 000 000
600002660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002700   h 001  \0  \0      \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        150 001 000 000 040 000 030 000 200 020 140 000 000 000 000 000
600002720  \0  \0  \0  \0  \0  \0  \0  \0   s 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 163 001 000 000 022 000 000 000
600002740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002760 206 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003000  \0  \0  \0  \0  \0  \0  \0  \0 231 001  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 231 001 000 000 020 000 030 000
600003020 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003040 240 001  \0  \0 022  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        240 001 000 000 022 000 016 000 024 012 100 000 000 000 000 000
600003060  \0  \0  \0  \0  \0  \0  \0  \0 246 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 246 001 000 000 022 000 000 000
600003100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003120 274 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        274 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003140  \0  \0  \0  \0  \0  \0  \0  \0 320 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 320 001 000 000 022 000 000 000
600003160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003200 343 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        343 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003220  \0  \0  \0  \0  \0  \0  \0  \0 365 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 365 001 000 000 022 000 000 000
600003240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003260  \a 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        007 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003300  \0  \0  \0  \0  \0  \0  \0  \0   & 002  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 046 002 000 000 020 000 030 000
600003320 200 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003340   3 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        063 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600003360  \0  \0  \0  \0  \0  \0  \0  \0   B 002  \0  \0 021 002 017  \0
        000 000 000 000 000 000 000 000 102 002 000 000 021 002 017 000
600003400   (  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003420   O 002  \0  \0 021  \0 017  \0      \n   @  \0  \0  \0  \0  \0
        117 002 000 000 021 000 017 000 040 012 100 000 000 000 000 000
600003440 004  \0  \0  \0  \0  \0  \0  \0   ^ 002  \0  \0 022  \0  \0  \0
        004 000 000 000 000 000 000 000 136 002 000 000 022 000 000 000
600003460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003500   p 002  \0  \0 022  \0  \r  \0 240  \t   @  \0  \0  \0  \0  \0
        160 002 000 000 022 000 015 000 240 011 100 000 000 000 000 000
600003520   e  \0  \0  \0  \0  \0  \0  \0 200 002  \0  \0 022  \0  \0  \0
        145 000 000 000 000 000 000 000 200 002 000 000 022 000 000 000
600003540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003560 224 002  \0  \0 020  \0 031  \0 300 020   p  \0  \0  \0  \0  \0
        224 002 000 000 020 000 031 000 300 020 160 000 000 000 000 000
600003600  \0  \0  \0  \0  \0  \0  \0  \0 231 002  \0  \0 022  \0  \r  \0
        000 000 000 000 000 000 000 000 231 002 000 000 022 000 015 000
600003620 023  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        023 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003640 240 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003660  \0  \0  \0  \0  \0  \0  \0  \0 265 002  \0  \0 020  \0 031  \0
        000 000 000 000 000 000 000 000 265 002 000 000 020 000 031 000
600003700 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003720 301 002  \0  \0 022  \0  \r  \0      \a   @  \0  \0  \0  \0  \0
        301 002 000 000 022 000 015 000 040 007 100 000 000 000 000 000
600003740 363  \0  \0  \0  \0  \0  \0  \0 306 002  \0  \0 022  \0  \0  \0
        363 000 000 000 000 000 000 000 306 002 000 000 022 000 000 000
600003760  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600004000 330 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        330 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004020  \0  \0  \0  \0  \0  \0  \0  \0 354 002  \0  \0 021 002 030  \0
        000 000 000 000 000 000 000 000 354 002 000 000 021 002 030 000
600004040 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600004060 370 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        370 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004100  \0  \0  \0  \0  \0  \0  \0  \0 022 003  \0  \0 022  \0  \v  \0
        000 000 000 000 000 000 000 000 022 003 000 000 022 000 013 000
600004120 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600004140  \0   0   8   0   4   .   c  \0   f   i   l   e   _   w   r   i
        000 060 070 060 064 056 143 000 146 151 154 145 137 167 162 151
600004160   t   e   r  \0   b   u   f   f   e   r   .   4   7   6   1  \0
        164 145 162 000 142 165 146 146 145 162 056 064 067 066 061 000
600004200   p   i   p   e   _   f   d  \0   c   r   t   s   t   u   f   f
        160 151 160 145 137 146 144 000 143 162 164 163 164 165 146 146
600004220   .   c  \0   _   _   J   C   R   _   L   I   S   T   _   _  \0
        056 143 000 137 137 112 103 122 137 114 111 123 124 137 137 000
600004240   d   e   r   e   g   i   s   t   e   r   _   t   m   _   c   l
        144 145 162 145 147 151 163 164 145 162 137 164 155 137 143 154
600004260   o   n   e   s  \0   r   e   g   i   s   t   e   r   _   t   m
        157 156 145 163 000 162 145 147 151 163 164 145 162 137 164 155
600004300   _   c   l   o   n   e   s  \0   _   _   d   o   _   g   l   o
        137 143 154 157 156 145 163 000 137 137 144 157 137 147 154 157
600004320   b   a   l   _   d   t   o   r   s   _   a   u   x  \0   c   o
        142 141 154 137 144 164 157 162 163 137 141 165 170 000 143 157
600004340   m   p   l   e   t   e   d   .   6   3   4   4  \0   _   _   d
        155 160 154 145 164 145 144 056 066 063 064 064 000 137 137 144
600004360   o   _   g   l   o   b   a   l   _   d   t   o   r   s   _   a
        157 137 147 154 157 142 141 154 137 144 164 157 162 163 137 141
600004400   u   x   _   f   i   n   i   _   a   r   r   a   y   _   e   n
        165 170 137 146 151 156 151 137 141 162 162 141 171 137 145 156
600004420   t   r   y  \0   f   r   a   m   e   _   d   u   m   m   y  \0
        164 162 171 000 146 162 141 155 145 137 144 165 155 155 171 000
600004440   _   _   f   r   a   m   e   _   d   u   m   m   y   _   i   n
        137 137 146 162 141 155 145 137 144 165 155 155 171 137 151 156
600004460   i   t   _   a   r   r   a   y   _   e   n   t   r   y  \0   _
        151 164 137 141 162 162 141 171 137 145 156 164 162 171 000 137
600004500   _   F   R   A   M   E   _   E   N   D   _   _  \0   _   _   J
        137 106 122 101 115 105 137 105 116 104 137 137 000 137 137 112
600004520   C   R   _   E   N   D   _   _  \0   _   _   i   n   i   t   _
        103 122 137 105 116 104 137 137 000 137 137 151 156 151 164 137
600004540   a   r   r   a   y   _   e   n   d  \0   _   D   Y   N   A   M
        141 162 162 141 171 137 145 156 144 000 137 104 131 116 101 115
600004560   I   C  \0   _   _   i   n   i   t   _   a   r   r   a   y   _
        111 103 000 137 137 151 156 151 164 137 141 162 162 141 171 137
600004600   s   t   a   r   t  \0   _   G   L   O   B   A   L   _   O   F
        163 164 141 162 164 000 137 107 114 117 102 101 114 137 117 106
600004620   F   S   E   T   _   T   A   B   L   E   _  \0   _   _   l   i
        106 123 105 124 137 124 101 102 114 105 137 000 137 137 154 151
600004640   b   c   _   c   s   u   _   f   i   n   i  \0   _   I   T   M
        142 143 137 143 163 165 137 146 151 156 151 000 137 111 124 115
600004660   _   d   e   r   e   g   i   s   t   e   r   T   M   C   l   o
        137 144 145 162 145 147 151 163 164 145 162 124 115 103 154 157
600004700   n   e   T   a   b   l   e  \0   d   a   t   a   _   s   t   a
        156 145 124 141 142 154 145 000 144 141 164 141 137 163 164 141
600004720   r   t  \0   c   l   o   n   e   @   @   G   L   I   B   C   _
        162 164 000 143 154 157 156 145 100 100 107 114 111 102 103 137
600004740   2   .   2   .   5  \0   w   r   i   t   e   @   @   G   L   I
        062 056 062 056 065 000 167 162 151 164 145 100 100 107 114 111
600004760   B   C   _   2   .   2   .   5  \0   _   e   d   a   t   a  \0
        102 103 137 062 056 062 056 065 000 137 145 144 141 164 141 000
600005000   _   f   i   n   i  \0   s   n   p   r   i   n   t   f   @   @
        137 146 151 156 151 000 163 156 160 162 151 156 164 146 100 100
600005020   G   L   I   B   C   _   2   .   2   .   5  \0   m   e   m   s
        107 114 111 102 103 137 062 056 062 056 065 000 155 145 155 163
600005040   e   t   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        145 164 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005060   c   l   o   s   e   @   @   G   L   I   B   C   _   2   .   2
        143 154 157 163 145 100 100 107 114 111 102 103 137 062 056 062
600005100   .   5  \0   p   i   p   e   @   @   G   L   I   B   C   _   2
        056 065 000 160 151 160 145 100 100 107 114 111 102 103 137 062
600005120   .   2   .   5  \0   r   e   a   d   @   @   G   L   I   B   C
        056 062 056 065 000 162 145 141 144 100 100 107 114 111 102 103
600005140   _   2   .   2   .   5  \0   _   _   l   i   b   c   _   s   t
        137 062 056 062 056 065 000 137 137 154 151 142 143 137 163 164
600005160   a   r   t   _   m   a   i   n   @   @   G   L   I   B   C   _
        141 162 164 137 155 141 151 156 100 100 107 114 111 102 103 137
600005200   2   .   2   .   5  \0   _   _   d   a   t   a   _   s   t   a
        062 056 062 056 065 000 137 137 144 141 164 141 137 163 164 141
600005220   r   t  \0   _   _   g   m   o   n   _   s   t   a   r   t   _
        162 164 000 137 137 147 155 157 156 137 163 164 141 162 164 137
600005240   _  \0   _   _   d   s   o   _   h   a   n   d   l   e  \0   _
        137 000 137 137 144 163 157 137 150 141 156 144 154 145 000 137
600005260   I   O   _   s   t   d   i   n   _   u   s   e   d  \0   k   i
        111 117 137 163 164 144 151 156 137 165 163 145 144 000 153 151
600005300   l   l   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        154 154 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005320   _   _   l   i   b   c   _   c   s   u   _   i   n   i   t  \0
        137 137 154 151 142 143 137 143 163 165 137 151 156 151 164 000
600005340   m   a   l   l   o   c   @   @   G   L   I   B   C   _   2   .
        155 141 154 154 157 143 100 100 107 114 111 102 103 137 062 056
600005360   2   .   5  \0   _   e   n   d  \0   _   s   t   a   r   t  \0
        062 056 065 000 137 145 156 144 000 137 163 164 141 162 164 000
600005400   r   e   a   l   l   o   c   @   @   G   L   I   B   C   _   2
        162 145 141 154 154 157 143 100 100 107 114 111 102 103 137 062
600005420   .   2   .   5  \0   _   _   b   s   s   _   s   t   a   r   t
        056 062 056 065 000 137 137 142 163 163 137 163 164 141 162 164
600005440  \0   m   a   i   n  \0   o   p   e   n   @   @   G   L   I   B
        000 155 141 151 156 000 157 160 145 156 100 100 107 114 111 102
600005460   C   _   2   .   2   .   5  \0   _   J   v   _   R   e   g   i
        103 137 062 056 062 056 065 000 137 112 166 137 122 145 147 151
600005500   s   t   e   r   C   l   a   s   s   e   s  \0   _   _   T   M
        163 164 145 162 103 154 141 163 163 145 163 000 137 137 124 115
600005520   C   _   E   N   D   _   _  \0   _   I   T   M   _   r   e   g
        103 137 105 116 104 137 137 000 137 111 124 115 137 162 145 147
600005540   i   s   t   e   r   T   M   C   l   o   n   e   T   a   b   l
        151 163 164 145 162 124 115 103 154 157 156 145 124 141 142 154
600005560   e  \0   _   i   n   i   t  \0  \0  \0  \0  \0  \0  \0  \0  \0
        145 000 137 151 156 151 164 000 000 000 000 000 000 000 000 000
600005600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600005660  \0  \0  \0  \0  \0  \0  \0  \0 033  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 033 000 000 000 001 000 000 000
600005700 002  \0  \0  \0  \0  \0  \0  \0   8 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 070 002 100 000 000 000 000 000
600005720   8 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        070 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600005740  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600005760  \0  \0  \0  \0  \0  \0  \0  \0   #  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 043 000 000 000 007 000 000 000
600006000 002  \0  \0  \0  \0  \0  \0  \0   T 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 124 002 100 000 000 000 000 000
600006020   T 002  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        124 002 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006040  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006060  \0  \0  \0  \0  \0  \0  \0  \0   1  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 061 000 000 000 007 000 000 000
600006100 002  \0  \0  \0  \0  \0  \0  \0   t 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 164 002 100 000 000 000 000 000
600006120   t 002  \0  \0  \0  \0  \0  \0   $  \0  \0  \0  \0  \0  \0  \0
        164 002 000 000 000 000 000 000 044 000 000 000 000 000 000 000
600006140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006160  \0  \0  \0  \0  \0  \0  \0  \0   D  \0  \0  \0 366 377 377   o
        000 000 000 000 000 000 000 000 104 000 000 000 366 377 377 157
600006200 002  \0  \0  \0  \0  \0  \0  \0 230 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 230 002 100 000 000 000 000 000
600006220 230 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        230 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006240 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006260  \0  \0  \0  \0  \0  \0  \0  \0   N  \0  \0  \0  \v  \0  \0  \0
        000 000 000 000 000 000 000 000 116 000 000 000 013 000 000 000
600006300 002  \0  \0  \0  \0  \0  \0  \0 270 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 270 002 100 000 000 000 000 000
600006320 270 002  \0  \0  \0  \0  \0  \0   P 001  \0  \0  \0  \0  \0  \0
        270 002 000 000 000 000 000 000 120 001 000 000 000 000 000 000
600006340 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006360 030  \0  \0  \0  \0  \0  \0  \0   V  \0  \0  \0 003  \0  \0  \0
        030 000 000 000 000 000 000 000 126 000 000 000 003 000 000 000
600006400 002  \0  \0  \0  \0  \0  \0  \0  \b 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 010 004 100 000 000 000 000 000
600006420  \b 004  \0  \0  \0  \0  \0  \0   }  \0  \0  \0  \0  \0  \0  \0
        010 004 000 000 000 000 000 000 175 000 000 000 000 000 000 000
600006440  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600006460  \0  \0  \0  \0  \0  \0  \0  \0   ^  \0  \0  \0 377 377 377   o
        000 000 000 000 000 000 000 000 136 000 000 000 377 377 377 157
600006500 002  \0  \0  \0  \0  \0  \0  \0 206 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 206 004 100 000 000 000 000 000
600006520 206 004  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        206 004 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006540 005  \0  \0  \0  \0  \0  \0  \0 002  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 002 000 000 000 000 000 000 000
600006560 002  \0  \0  \0  \0  \0  \0  \0   k  \0  \0  \0 376 377 377   o
        002 000 000 000 000 000 000 000 153 000 000 000 376 377 377 157
600006600 002  \0  \0  \0  \0  \0  \0  \0 250 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 250 004 100 000 000 000 000 000
600006620 250 004  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        250 004 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006640 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006660  \0  \0  \0  \0  \0  \0  \0  \0   z  \0  \0  \0 004  \0  \0  \0
        000 000 000 000 000 000 000 000 172 000 000 000 004 000 000 000
600006700 002  \0  \0  \0  \0  \0  \0  \0 310 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 310 004 100 000 000 000 000 000
600006720 310 004  \0  \0  \0  \0  \0  \0 030  \0  \0  \0  \0  \0  \0  \0
        310 004 000 000 000 000 000 000 030 000 000 000 000 000 000 000
600006740 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006760 030  \0  \0  \0  \0  \0  \0  \0 204  \0  \0  \0 004  \0  \0  \0
        030 000 000 000 000 000 000 000 204 000 000 000 004 000 000 000
600007000   B  \0  \0  \0  \0  \0  \0  \0 340 004   @  \0  \0  \0  \0  \0
        102 000 000 000 000 000 000 000 340 004 100 000 000 000 000 000
600007020 340 004  \0  \0  \0  \0  \0  \0   8 001  \0  \0  \0  \0  \0  \0
        340 004 000 000 000 000 000 000 070 001 000 000 000 000 000 000
600007040 005  \0  \0  \0  \f  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 014 000 000 000 010 000 000 000 000 000 000 000
600007060 030  \0  \0  \0  \0  \0  \0  \0 216  \0  \0  \0 001  \0  \0  \0
        030 000 000 000 000 000 000 000 216 000 000 000 001 000 000 000
600007100 006  \0  \0  \0  \0  \0  \0  \0 030 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 030 006 100 000 000 000 000 000
600007120 030 006  \0  \0  \0  \0  \0  \0 032  \0  \0  \0  \0  \0  \0  \0
        030 006 000 000 000 000 000 000 032 000 000 000 000 000 000 000
600007140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007160  \0  \0  \0  \0  \0  \0  \0  \0 211  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 211 000 000 000 001 000 000 000
600007200 006  \0  \0  \0  \0  \0  \0  \0   @ 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 100 006 100 000 000 000 000 000
600007220   @ 006  \0  \0  \0  \0  \0  \0 340  \0  \0  \0  \0  \0  \0  \0
        100 006 000 000 000 000 000 000 340 000 000 000 000 000 000 000
600007240  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007260 020  \0  \0  \0  \0  \0  \0  \0 224  \0  \0  \0 001  \0  \0  \0
        020 000 000 000 000 000 000 000 224 000 000 000 001 000 000 000
600007300 006  \0  \0  \0  \0  \0  \0  \0      \a   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 040 007 100 000 000 000 000 000
600007320      \a  \0  \0  \0  \0  \0  \0 364 002  \0  \0  \0  \0  \0  \0
        040 007 000 000 000 000 000 000 364 002 000 000 000 000 000 000
600007340  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007360  \0  \0  \0  \0  \0  \0  \0  \0 232  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 232 000 000 000 001 000 000 000
600007400 006  \0  \0  \0  \0  \0  \0  \0 024  \n   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 024 012 100 000 000 000 000 000
600007420 024  \n  \0  \0  \0  \0  \0  \0  \t  \0  \0  \0  \0  \0  \0  \0
        024 012 000 000 000 000 000 000 011 000 000 000 000 000 000 000
600007440  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007460  \0  \0  \0  \0  \0  \0  \0  \0 240  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 240 000 000 000 001 000 000 000
600007500  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600010000
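
One way to test the theory that the written data is a chunk of an
executable image would be a brute-force comparison like the sketch below
(illustrative only, not part of the original test: it takes a flagged
file such as /tmp/file.4 and the writer binary ./a.out as arguments, and
assumes the 4 KiB block size and alignment seen in the od output above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLK 4096

/* read a whole file into memory, NULL on failure */
static unsigned char *slurp(const char *path, long *size)
{
        FILE *fp = fopen(path, "rb");
        unsigned char *buf = NULL;

        if (!fp)
                return NULL;
        if (fseek(fp, 0, SEEK_END) == 0 && (*size = ftell(fp)) > 0) {
                rewind(fp);
                buf = malloc(*size);
                if (buf && fread(buf, 1, *size, fp) != (size_t) *size) {
                        free(buf);
                        buf = NULL;
                }
        }
        fclose(fp);
        return buf;
}

int main(int argc, char *argv[])
{
        unsigned char *file, *bin;
        long fsize = 0, bsize = 0;
        long i, j;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <flagged file> <writer binary>\n",
                        argv[0]);
                return 1;
        }
        file = slurp(argv[1], &fsize);
        bin = slurp(argv[2], &bsize);
        if (!file || !bin)
                return 1;
        /* walk the flagged file in 4 KiB blocks, skipping blocks that only
         * contain the expected 0xff fill or zeroes */
        for (i = 0; i + BLK <= fsize; i += BLK) {
                int interesting = 0;

                for (j = 0; j < BLK; j++)
                        if (file[i + j] != 0xff && file[i + j] != 0)
                                interesting = 1;
                if (!interesting)
                        continue;
                /* does the same 4 KiB block occur verbatim in the binary? */
                for (j = 0; j + BLK <= bsize; j++)
                        if (!memcmp(file + i, bin + j, BLK)) {
                                printf("file offset %ld matches binary offset %ld\n",
                                       i, j);
                                return 0;
                        }
                printf("block at file offset %ld not found in the binary\n", i);
        }
        return 1;
}

If this reports a match, the corrupted block is a verbatim copy of part
of the on-disk binary rather than arbitrary stale disk data.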
$ mv /tmp/file.4 /tmp/file.4.old
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.2
ERROR: 40 1 in /tmp/file.2
ERROR: 53 2 in /tmp/file.2
ERROR: 29 3 in /tmp/file.2
ERROR: 27 4 in /tmp/file.2
ERROR: 5 5 in /tmp/file.2
ERROR: 14 6 in /tmp/file.2
ERROR: 8 7 in /tmp/file.2
ERROR: 16 8 in /tmp/file.2
ERROR: 4 9 in /tmp/file.2
ERROR: 12 10 in /tmp/file.2
ERROR: 4 11 in /tmp/file.2
ERROR: 2 12 in /tmp/file.2
ERROR: 10 13 in /tmp/file.2
ERROR: 13 14 in /tmp/file.2
ERROR: 4 15 in /tmp/file.2
ERROR: 26 16 in /tmp/file.2
ERROR: 5 17 in /tmp/file.2
ERROR: 23 18 in /tmp/file.2
ERROR: 4 19 in /tmp/file.2
ERROR: 8 20 in /tmp/file.2
ERROR: 2 21 in /tmp/file.2
ERROR: 1 22 in /tmp/file.2
ERROR: 2 23 in /tmp/file.2
ERROR: 17 24 in /tmp/file.2
ERROR: 5 25 in /tmp/file.2
ERROR: 2 26 in /tmp/file.2
ERROR: 1 27 in /tmp/file.2
ERROR: 3 28 in /tmp/file.2
ERROR: 17 32 in /tmp/file.2
ERROR: 1 35 in /tmp/file.2
ERROR: 1 36 in /tmp/file.2
ERROR: 2 38 in /tmp/file.2
ERROR: 5 40 in /tmp/file.2
ERROR: 1 41 in /tmp/file.2
ERROR: 3 45 in /tmp/file.2
ERROR: 65 46 in /tmp/file.2
ERROR: 2 48 in /tmp/file.2
ERROR: 4 49 in /tmp/file.2
ERROR: 24 50 in /tmp/file.2
ERROR: 3 51 in /tmp/file.2
ERROR: 4 52 in /tmp/file.2
ERROR: 12 53 in /tmp/file.2
ERROR: 2 54 in /tmp/file.2
ERROR: 1 55 in /tmp/file.2
ERROR: 5 56 in /tmp/file.2
ERROR: 1 60 in /tmp/file.2
ERROR: 75 64 in /tmp/file.2
ERROR: 5 65 in /tmp/file.2
ERROR: 17 66 in /tmp/file.2
ERROR: 19 67 in /tmp/file.2
ERROR: 5 68 in /tmp/file.2
ERROR: 6 69 in /tmp/file.2
ERROR: 3 70 in /tmp/file.2
ERROR: 13 71 in /tmp/file.2
ERROR: 18 73 in /tmp/file.2
ERROR: 3 74 in /tmp/file.2
ERROR: 17 76 in /tmp/file.2
ERROR: 7 77 in /tmp/file.2
ERROR: 5 78 in /tmp/file.2
ERROR: 4 79 in /tmp/file.2
ERROR: 1 80 in /tmp/file.2
ERROR: 4 82 in /tmp/file.2
ERROR: 2 83 in /tmp/file.2
ERROR: 13 84 in /tmp/file.2
ERROR: 1 85 in /tmp/file.2
ERROR: 1 86 in /tmp/file.2
ERROR: 1 89 in /tmp/file.2
ERROR: 2 94 in /tmp/file.2
ERROR: 118 95 in /tmp/file.2
ERROR: 24 96 in /tmp/file.2
ERROR: 54 97 in /tmp/file.2
ERROR: 14 98 in /tmp/file.2
ERROR: 18 99 in /tmp/file.2
ERROR: 29 100 in /tmp/file.2
ERROR: 57 101 in /tmp/file.2
ERROR: 16 102 in /tmp/file.2
ERROR: 15 103 in /tmp/file.2
ERROR: 9 104 in /tmp/file.2
ERROR: 48 105 in /tmp/file.2
ERROR: 1 106 in /tmp/file.2
ERROR: 2 107 in /tmp/file.2
ERROR: 30 108 in /tmp/file.2
ERROR: 22 109 in /tmp/file.2
ERROR: 43 110 in /tmp/file.2
ERROR: 29 111 in /tmp/file.2
ERROR: 13 112 in /tmp/file.2
ERROR: 56 114 in /tmp/file.2
ERROR: 42 115 in /tmp/file.2
ERROR: 65 116 in /tmp/file.2
ERROR: 14 117 in /tmp/file.2
ERROR: 3 118 in /tmp/file.2
ERROR: 2 119 in /tmp/file.2
ERROR: 3 120 in /tmp/file.2
ERROR: 16 121 in /tmp/file.2
ERROR: 1 122 in /tmp/file.2
ERROR: 1 125 in /tmp/file.2
ERROR: 1 126 in /tmp/file.2
ERROR: 5 128 in /tmp/file.2
ERROR: 1 132 in /tmp/file.2
ERROR: 4 134 in /tmp/file.2
ERROR: 1 137 in /tmp/file.2
ERROR: 1 141 in /tmp/file.2
ERROR: 1 142 in /tmp/file.2
ERROR: 1 144 in /tmp/file.2
ERROR: 1 145 in /tmp/file.2
ERROR: 2 148 in /tmp/file.2
ERROR: 6 152 in /tmp/file.2
ERROR: 2 153 in /tmp/file.2
ERROR: 1 154 in /tmp/file.2
ERROR: 6 160 in /tmp/file.2
ERROR: 1 166 in /tmp/file.2
ERROR: 3 168 in /tmp/file.2
ERROR: 1 176 in /tmp/file.2
ERROR: 1 180 in /tmp/file.2
ERROR: 1 181 in /tmp/file.2
ERROR: 3 184 in /tmp/file.2
ERROR: 1 188 in /tmp/file.2
ERROR: 4 192 in /tmp/file.2
ERROR: 1 193 in /tmp/file.2
ERROR: 1 198 in /tmp/file.2
ERROR: 3 200 in /tmp/file.2
ERROR: 2 208 in /tmp/file.2
ERROR: 1 216 in /tmp/file.2
ERROR: 1 223 in /tmp/file.2
ERROR: 4 224 in /tmp/file.2
ERROR: 1 227 in /tmp/file.2
ERROR: 1 236 in /tmp/file.2
ERROR: 1 237 in /tmp/file.2
ERROR: 4 241 in /tmp/file.2
ERROR: 1 243 in /tmp/file.2
ERROR: 1 244 in /tmp/file.2
ERROR: 1 245 in /tmp/file.2
ERROR: 1 246 in /tmp/file.2
ERROR: 2 248 in /tmp/file.2
ERROR: 1 249 in /tmp/file.2
ERROR: 1 254 in /tmp/file.2
ERROR: 2549 0 in /tmp/file.7
ERROR: 40 1 in /tmp/file.7
ERROR: 53 2 in /tmp/file.7
ERROR: 29 3 in /tmp/file.7
ERROR: 27 4 in /tmp/file.7
ERROR: 5 5 in /tmp/file.7
ERROR: 14 6 in /tmp/file.7
ERROR: 8 7 in /tmp/file.7
ERROR: 16 8 in /tmp/file.7
ERROR: 4 9 in /tmp/file.7
ERROR: 12 10 in /tmp/file.7
ERROR: 4 11 in /tmp/file.7
ERROR: 2 12 in /tmp/file.7
ERROR: 10 13 in /tmp/file.7
ERROR: 13 14 in /tmp/file.7
ERROR: 4 15 in /tmp/file.7
ERROR: 26 16 in /tmp/file.7
ERROR: 5 17 in /tmp/file.7
ERROR: 23 18 in /tmp/file.7
ERROR: 4 19 in /tmp/file.7
ERROR: 8 20 in /tmp/file.7
ERROR: 2 21 in /tmp/file.7
ERROR: 1 22 in /tmp/file.7
ERROR: 2 23 in /tmp/file.7
ERROR: 17 24 in /tmp/file.7
ERROR: 5 25 in /tmp/file.7
ERROR: 2 26 in /tmp/file.7
ERROR: 1 27 in /tmp/file.7
ERROR: 3 28 in /tmp/file.7
ERROR: 17 32 in /tmp/file.7
ERROR: 1 35 in /tmp/file.7
ERROR: 1 36 in /tmp/file.7
ERROR: 2 38 in /tmp/file.7
ERROR: 5 40 in /tmp/file.7
ERROR: 1 41 in /tmp/file.7
ERROR: 3 45 in /tmp/file.7
ERROR: 65 46 in /tmp/file.7
ERROR: 2 48 in /tmp/file.7
ERROR: 4 49 in /tmp/file.7
ERROR: 24 50 in /tmp/file.7
ERROR: 3 51 in /tmp/file.7
ERROR: 4 52 in /tmp/file.7
ERROR: 12 53 in /tmp/file.7
ERROR: 2 54 in /tmp/file.7
ERROR: 1 55 in /tmp/file.7
ERROR: 5 56 in /tmp/file.7
ERROR: 1 60 in /tmp/file.7
ERROR: 75 64 in /tmp/file.7
ERROR: 5 65 in /tmp/file.7
ERROR: 17 66 in /tmp/file.7
ERROR: 19 67 in /tmp/file.7
ERROR: 5 68 in /tmp/file.7
ERROR: 6 69 in /tmp/file.7
ERROR: 3 70 in /tmp/file.7
ERROR: 13 71 in /tmp/file.7
ERROR: 18 73 in /tmp/file.7
ERROR: 3 74 in /tmp/file.7
ERROR: 17 76 in /tmp/file.7
ERROR: 7 77 in /tmp/file.7
ERROR: 5 78 in /tmp/file.7
ERROR: 4 79 in /tmp/file.7
ERROR: 1 80 in /tmp/file.7
ERROR: 4 82 in /tmp/file.7
ERROR: 2 83 in /tmp/file.7
ERROR: 13 84 in /tmp/file.7
ERROR: 1 85 in /tmp/file.7
ERROR: 1 86 in /tmp/file.7
ERROR: 1 89 in /tmp/file.7
ERROR: 2 94 in /tmp/file.7
ERROR: 118 95 in /tmp/file.7
ERROR: 24 96 in /tmp/file.7
ERROR: 54 97 in /tmp/file.7
ERROR: 14 98 in /tmp/file.7
ERROR: 18 99 in /tmp/file.7
ERROR: 29 100 in /tmp/file.7
ERROR: 57 101 in /tmp/file.7
ERROR: 16 102 in /tmp/file.7
ERROR: 15 103 in /tmp/file.7
ERROR: 9 104 in /tmp/file.7
ERROR: 48 105 in /tmp/file.7
ERROR: 1 106 in /tmp/file.7
ERROR: 2 107 in /tmp/file.7
ERROR: 30 108 in /tmp/file.7
ERROR: 22 109 in /tmp/file.7
ERROR: 43 110 in /tmp/file.7
ERROR: 29 111 in /tmp/file.7
ERROR: 13 112 in /tmp/file.7
ERROR: 56 114 in /tmp/file.7
ERROR: 42 115 in /tmp/file.7
ERROR: 65 116 in /tmp/file.7
ERROR: 14 117 in /tmp/file.7
ERROR: 3 118 in /tmp/file.7
ERROR: 2 119 in /tmp/file.7
ERROR: 3 120 in /tmp/file.7
ERROR: 16 121 in /tmp/file.7
ERROR: 1 122 in /tmp/file.7
ERROR: 1 125 in /tmp/file.7
ERROR: 1 126 in /tmp/file.7
ERROR: 5 128 in /tmp/file.7
ERROR: 1 132 in /tmp/file.7
ERROR: 4 134 in /tmp/file.7
ERROR: 1 137 in /tmp/file.7
ERROR: 1 141 in /tmp/file.7
ERROR: 1 142 in /tmp/file.7
ERROR: 1 144 in /tmp/file.7
ERROR: 1 145 in /tmp/file.7
ERROR: 2 148 in /tmp/file.7
ERROR: 6 152 in /tmp/file.7
ERROR: 2 153 in /tmp/file.7
ERROR: 1 154 in /tmp/file.7
ERROR: 6 160 in /tmp/file.7
ERROR: 1 166 in /tmp/file.7
ERROR: 3 168 in /tmp/file.7
ERROR: 1 176 in /tmp/file.7
ERROR: 1 180 in /tmp/file.7
ERROR: 1 181 in /tmp/file.7
ERROR: 3 184 in /tmp/file.7
ERROR: 1 188 in /tmp/file.7
ERROR: 4 192 in /tmp/file.7
ERROR: 1 193 in /tmp/file.7
ERROR: 1 198 in /tmp/file.7
ERROR: 3 200 in /tmp/file.7
ERROR: 2 208 in /tmp/file.7
ERROR: 1 216 in /tmp/file.7
ERROR: 1 223 in /tmp/file.7
ERROR: 4 224 in /tmp/file.7
ERROR: 1 227 in /tmp/file.7
ERROR: 1 236 in /tmp/file.7
ERROR: 1 237 in /tmp/file.7
ERROR: 4 241 in /tmp/file.7
ERROR: 1 243 in /tmp/file.7
ERROR: 1 244 in /tmp/file.7
ERROR: 1 245 in /tmp/file.7
ERROR: 1 246 in /tmp/file.7
ERROR: 2 248 in /tmp/file.7
ERROR: 1 249 in /tmp/file.7
ERROR: 1 254 in /tmp/file.7

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-11 11:53               ` Tetsuo Handa
  0 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-11 11:53 UTC (permalink / raw)
  To: aarcange; +Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Andrea Arcangeli wrote:
> On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> > disk block? This would happen on ext4 as well if mounted with -o
> > journal=data instead of -o journal=ordered in fact, perhaps you simply
> 
> Oops, above I meant journal=writeback; journal=data is even stronger
> than journal=ordered, of course.
> 
> And I shall clarify further that old disk content can only show up
> legitimately on journal=writeback after a hard reboot or crash or, in
> general, an unclean unmount. Even if there's no journaling at all
> (i.e. ext2/vfat), old disk content cannot show up at any given time, no
> matter what, if there's no unclean unmount that requires a journal
> replay.

I'm using XFS on a small non-NUMA system (4 CPUs / 4096MB RAM).

  /dev/sda1 / xfs rw,relatime,attr2,inode64,noquota 0 0

As far as I tested, not-zero not-0xff values did not show up with the
4.6.7 kernel (i.e. all not-0xff bytes were zero), while not-zero
not-0xff values do show up with the 4.13.0-rc4-next-20170811 kernel.

> 
> This theory of a completely unrelated fs bug showing you disk content
> as a result of the OOM-reaper-induced SIGBUS interrupting a
> copy_from_user at its very start is purely motivated by the fact that,
> like Michal, I didn't see much explanation on the VM side that could
> cause those not-zero not-0xff values to show up in the buffer of the
> write syscall. You can try changing the fs and see if it happens again
> to rule it out. If it always happens regardless of the filesystem used,
> then it's likely not a fs bug of course. You've got an entire, aligned
> 4k fs block showing up with that data.
> 

What is strange is that, as far as I tested, the pattern of not-zero
not-0xff bytes always seems to be the same. Such a thing is unlikely to
happen if old content on the disk were showing up by chance. Maybe the
content written is not random but a specific 4096-byte chunk of the
memory image of an executable file.

$ cat checker.c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char buffer2[64] = { };
        int ret = 0;
        int i;
        for (i = 0; i < 1024; i++) {
                 int flag = 0;
                 int fd;
                 unsigned int byte[256];
                 int j;
                 snprintf(buffer2, sizeof(buffer2), "/tmp/file.%u", i);
                 fd = open(buffer2, O_RDONLY);
                 if (fd == -1) /* open() returns -1 on error, not EOF */
                         continue;
                 memset(byte, 0, sizeof(byte));
                 while (1) {
                         static unsigned char buffer[1048576];
                         int len = read(fd, (char *) buffer, sizeof(buffer));
                         if (len <= 0)
                                 break;
                         for (j = 0; j < len; j++)
                                 if (buffer[j] != 0xFF)
                                         byte[buffer[j]]++;
                 }
                 close(fd);
                 for (j = 0; j < 255; j++)
                         if (byte[j]) {
                                 printf("ERROR: %u %u in %s\n", byte[j], j, buffer2);
                                 flag = 1;
                         }
                 if (flag == 0)
                         unlink(buffer2);
                 else
                         ret = 1;
        }
        return ret;
}
$ uname -r
4.13.0-rc4-next-20170811
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.4
$ /bin/rm /tmp/file.4
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.6
$ /bin/rm /tmp/file.6
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.0
$ /bin/rm /tmp/file.0
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.4
ERROR: 40 1 in /tmp/file.4
ERROR: 53 2 in /tmp/file.4
ERROR: 29 3 in /tmp/file.4
ERROR: 27 4 in /tmp/file.4
ERROR: 5 5 in /tmp/file.4
ERROR: 14 6 in /tmp/file.4
ERROR: 8 7 in /tmp/file.4
ERROR: 16 8 in /tmp/file.4
ERROR: 4 9 in /tmp/file.4
ERROR: 12 10 in /tmp/file.4
ERROR: 4 11 in /tmp/file.4
ERROR: 2 12 in /tmp/file.4
ERROR: 10 13 in /tmp/file.4
ERROR: 13 14 in /tmp/file.4
ERROR: 4 15 in /tmp/file.4
ERROR: 26 16 in /tmp/file.4
ERROR: 5 17 in /tmp/file.4
ERROR: 23 18 in /tmp/file.4
ERROR: 4 19 in /tmp/file.4
ERROR: 8 20 in /tmp/file.4
ERROR: 2 21 in /tmp/file.4
ERROR: 1 22 in /tmp/file.4
ERROR: 2 23 in /tmp/file.4
ERROR: 17 24 in /tmp/file.4
ERROR: 5 25 in /tmp/file.4
ERROR: 2 26 in /tmp/file.4
ERROR: 1 27 in /tmp/file.4
ERROR: 3 28 in /tmp/file.4
ERROR: 17 32 in /tmp/file.4
ERROR: 1 35 in /tmp/file.4
ERROR: 1 36 in /tmp/file.4
ERROR: 2 38 in /tmp/file.4
ERROR: 5 40 in /tmp/file.4
ERROR: 1 41 in /tmp/file.4
ERROR: 3 45 in /tmp/file.4
ERROR: 65 46 in /tmp/file.4
ERROR: 2 48 in /tmp/file.4
ERROR: 4 49 in /tmp/file.4
ERROR: 24 50 in /tmp/file.4
ERROR: 3 51 in /tmp/file.4
ERROR: 4 52 in /tmp/file.4
ERROR: 12 53 in /tmp/file.4
ERROR: 2 54 in /tmp/file.4
ERROR: 1 55 in /tmp/file.4
ERROR: 5 56 in /tmp/file.4
ERROR: 1 60 in /tmp/file.4
ERROR: 75 64 in /tmp/file.4
ERROR: 5 65 in /tmp/file.4
ERROR: 17 66 in /tmp/file.4
ERROR: 19 67 in /tmp/file.4
ERROR: 5 68 in /tmp/file.4
ERROR: 6 69 in /tmp/file.4
ERROR: 3 70 in /tmp/file.4
ERROR: 13 71 in /tmp/file.4
ERROR: 18 73 in /tmp/file.4
ERROR: 3 74 in /tmp/file.4
ERROR: 17 76 in /tmp/file.4
ERROR: 7 77 in /tmp/file.4
ERROR: 5 78 in /tmp/file.4
ERROR: 4 79 in /tmp/file.4
ERROR: 1 80 in /tmp/file.4
ERROR: 4 82 in /tmp/file.4
ERROR: 2 83 in /tmp/file.4
ERROR: 13 84 in /tmp/file.4
ERROR: 1 85 in /tmp/file.4
ERROR: 1 86 in /tmp/file.4
ERROR: 1 89 in /tmp/file.4
ERROR: 2 94 in /tmp/file.4
ERROR: 118 95 in /tmp/file.4
ERROR: 24 96 in /tmp/file.4
ERROR: 54 97 in /tmp/file.4
ERROR: 14 98 in /tmp/file.4
ERROR: 18 99 in /tmp/file.4
ERROR: 29 100 in /tmp/file.4
ERROR: 57 101 in /tmp/file.4
ERROR: 16 102 in /tmp/file.4
ERROR: 15 103 in /tmp/file.4
ERROR: 9 104 in /tmp/file.4
ERROR: 48 105 in /tmp/file.4
ERROR: 1 106 in /tmp/file.4
ERROR: 2 107 in /tmp/file.4
ERROR: 30 108 in /tmp/file.4
ERROR: 22 109 in /tmp/file.4
ERROR: 43 110 in /tmp/file.4
ERROR: 29 111 in /tmp/file.4
ERROR: 13 112 in /tmp/file.4
ERROR: 56 114 in /tmp/file.4
ERROR: 42 115 in /tmp/file.4
ERROR: 65 116 in /tmp/file.4
ERROR: 14 117 in /tmp/file.4
ERROR: 3 118 in /tmp/file.4
ERROR: 2 119 in /tmp/file.4
ERROR: 3 120 in /tmp/file.4
ERROR: 16 121 in /tmp/file.4
ERROR: 1 122 in /tmp/file.4
ERROR: 1 125 in /tmp/file.4
ERROR: 1 126 in /tmp/file.4
ERROR: 5 128 in /tmp/file.4
ERROR: 1 132 in /tmp/file.4
ERROR: 4 134 in /tmp/file.4
ERROR: 1 137 in /tmp/file.4
ERROR: 1 141 in /tmp/file.4
ERROR: 1 142 in /tmp/file.4
ERROR: 1 144 in /tmp/file.4
ERROR: 1 145 in /tmp/file.4
ERROR: 2 148 in /tmp/file.4
ERROR: 6 152 in /tmp/file.4
ERROR: 2 153 in /tmp/file.4
ERROR: 1 154 in /tmp/file.4
ERROR: 6 160 in /tmp/file.4
ERROR: 1 166 in /tmp/file.4
ERROR: 3 168 in /tmp/file.4
ERROR: 1 176 in /tmp/file.4
ERROR: 1 180 in /tmp/file.4
ERROR: 1 181 in /tmp/file.4
ERROR: 3 184 in /tmp/file.4
ERROR: 1 188 in /tmp/file.4
ERROR: 4 192 in /tmp/file.4
ERROR: 1 193 in /tmp/file.4
ERROR: 1 198 in /tmp/file.4
ERROR: 3 200 in /tmp/file.4
ERROR: 2 208 in /tmp/file.4
ERROR: 1 216 in /tmp/file.4
ERROR: 1 223 in /tmp/file.4
ERROR: 4 224 in /tmp/file.4
ERROR: 1 227 in /tmp/file.4
ERROR: 1 236 in /tmp/file.4
ERROR: 1 237 in /tmp/file.4
ERROR: 4 241 in /tmp/file.4
ERROR: 1 243 in /tmp/file.4
ERROR: 1 244 in /tmp/file.4
ERROR: 1 245 in /tmp/file.4
ERROR: 1 246 in /tmp/file.4
ERROR: 2 248 in /tmp/file.4
ERROR: 1 249 in /tmp/file.4
ERROR: 1 254 in /tmp/file.4
$ od -cb /tmp/file.4
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
        377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
600000000   -   1   1   )  \0  \0   .   s   y   m   t   a   b  \0   .   s
        055 061 061 051 000 000 056 163 171 155 164 141 142 000 056 163
600000020   t   r   t   a   b  \0   .   s   h   s   t   r   t   a   b  \0
        164 162 164 141 142 000 056 163 150 163 164 162 164 141 142 000
600000040   .   i   n   t   e   r   p  \0   .   n   o   t   e   .   A   B
        056 151 156 164 145 162 160 000 056 156 157 164 145 056 101 102
600000060   I   -   t   a   g  \0   .   n   o   t   e   .   g   n   u   .
        111 055 164 141 147 000 056 156 157 164 145 056 147 156 165 056
600000100   b   u   i   l   d   -   i   d  \0   .   g   n   u   .   h   a
        142 165 151 154 144 055 151 144 000 056 147 156 165 056 150 141
600000120   s   h  \0   .   d   y   n   s   y   m  \0   .   d   y   n   s
        163 150 000 056 144 171 156 163 171 155 000 056 144 171 156 163
600000140   t   r  \0   .   g   n   u   .   v   e   r   s   i   o   n  \0
        164 162 000 056 147 156 165 056 166 145 162 163 151 157 156 000
600000160   .   g   n   u   .   v   e   r   s   i   o   n   _   r  \0   .
        056 147 156 165 056 166 145 162 163 151 157 156 137 162 000 056
600000200   r   e   l   a   .   d   y   n  \0   .   r   e   l   a   .   p
        162 145 154 141 056 144 171 156 000 056 162 145 154 141 056 160
600000220   l   t  \0   .   i   n   i   t  \0   .   t   e   x   t  \0   .
        154 164 000 056 151 156 151 164 000 056 164 145 170 164 000 056
600000240   f   i   n   i  \0   .   r   o   d   a   t   a  \0   .   e   h
        146 151 156 151 000 056 162 157 144 141 164 141 000 056 145 150
600000260   _   f   r   a   m   e   _   h   d   r  \0   .   e   h   _   f
        137 146 162 141 155 145 137 150 144 162 000 056 145 150 137 146
600000300   r   a   m   e  \0   .   i   n   i   t   _   a   r   r   a   y
        162 141 155 145 000 056 151 156 151 164 137 141 162 162 141 171
600000320  \0   .   f   i   n   i   _   a   r   r   a   y  \0   .   j   c
        000 056 146 151 156 151 137 141 162 162 141 171 000 056 152 143
600000340   r  \0   .   d   y   n   a   m   i   c  \0   .   g   o   t  \0
        162 000 056 144 171 156 141 155 151 143 000 056 147 157 164 000
600000360   .   g   o   t   .   p   l   t  \0   .   d   a   t   a  \0   .
        056 147 157 164 056 160 154 164 000 056 144 141 164 141 000 056
600000400   b   s   s  \0   .   c   o   m   m   e   n   t  \0  \0  \0  \0
        142 163 163 000 056 143 157 155 155 145 156 164 000 000 000 000
600000420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600000440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 001  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 001 000
600000460   8 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        070 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000500  \0  \0  \0  \0 003  \0 002  \0   T 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 002 000 124 002 100 000 000 000 000 000
600000520  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 003  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 003 000
600000540   t 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        164 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000560  \0  \0  \0  \0 003  \0 004  \0 230 002   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 004 000 230 002 100 000 000 000 000 000
600000600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 005  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 005 000
600000620 270 002   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        270 002 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000640  \0  \0  \0  \0 003  \0 006  \0  \b 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 006 000 010 004 100 000 000 000 000 000
600000660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \a  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 007 000
600000700 206 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600000720  \0  \0  \0  \0 003  \0  \b  \0 250 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 010 000 250 004 100 000 000 000 000 000
600000740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \t  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 011 000
600000760 310 004   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        310 004 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001000  \0  \0  \0  \0 003  \0  \n  \0 340 004   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 012 000 340 004 100 000 000 000 000 000
600001020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \v  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 013 000
600001040 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001060  \0  \0  \0  \0 003  \0  \f  \0   @ 006   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 014 000 100 006 100 000 000 000 000 000
600001100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0  \r  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 015 000
600001120      \a   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 007 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001140  \0  \0  \0  \0 003  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 016 000 024 012 100 000 000 000 000 000
600001160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 017  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 017 000
600001200      \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        040 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001220  \0  \0  \0  \0 003  \0 020  \0   @  \n   @  \0  \0  \0  \0  \0
        000 000 000 000 003 000 020 000 100 012 100 000 000 000 000 000
600001240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 021  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 021 000
600001260 200  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600001300  \0  \0  \0  \0 003  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 022 000 020 016 140 000 000 000 000 000
600001320  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 023  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 023 000
600001340 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001360  \0  \0  \0  \0 003  \0 024  \0     016   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 024 000 040 016 140 000 000 000 000 000
600001400  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 025  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 025 000
600001420   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001440  \0  \0  \0  \0 003  \0 026  \0 370 017   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 026 000 370 017 140 000 000 000 000 000
600001460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 027  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 027 000
600001500  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001520  \0  \0  \0  \0 003  \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        000 000 000 000 003 000 030 000 200 020 140 000 000 000 000 000
600001540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 003  \0 031  \0
        000 000 000 000 000 000 000 000 000 000 000 000 003 000 031 000
600001560 240 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600001600  \0  \0  \0  \0 003  \0 032  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 003 000 032 000 000 000 000 000 000 000 000 000
600001620  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 001 000 000 000 004 000 361 377
600001640  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600001660  \b  \0  \0  \0 002  \0  \r  \0  \0  \t   @  \0  \0  \0  \0  \0
        010 000 000 000 002 000 015 000 000 011 100 000 000 000 000 000
600001700 221  \0  \0  \0  \0  \0  \0  \0 024  \0  \0  \0 001  \0 031  \0
        221 000 000 000 000 000 000 000 024 000 000 000 001 000 031 000
600001720 300 020   `  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0
        300 020 140 000 000 000 000 000 000 000 020 000 000 000 000 000
600001740      \0  \0  \0 001  \0 030  \0 220 020   `  \0  \0  \0  \0  \0
        040 000 000 000 001 000 030 000 220 020 140 000 000 000 000 000
600001760  \b  \0  \0  \0  \0  \0  \0  \0   (  \0  \0  \0 004  \0 361 377
        010 000 000 000 000 000 000 000 050 000 000 000 004 000 361 377
600002000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002020   3  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        063 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002040  \0  \0  \0  \0  \0  \0  \0  \0   @  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 100 000 000 000 002 000 015 000
600002060   @  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        100 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002100   U  \0  \0  \0 002  \0  \r  \0   p  \b   @  \0  \0  \0  \0  \0
        125 000 000 000 002 000 015 000 160 010 100 000 000 000 000 000
600002120  \0  \0  \0  \0  \0  \0  \0  \0   h  \0  \0  \0 002  \0  \r  \0
        000 000 000 000 000 000 000 000 150 000 000 000 002 000 015 000
600002140 260  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        260 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002160   ~  \0  \0  \0 001  \0 031  \0 240 020   `  \0  \0  \0  \0  \0
        176 000 000 000 001 000 031 000 240 020 140 000 000 000 000 000
600002200 001  \0  \0  \0  \0  \0  \0  \0 215  \0  \0  \0 001  \0 023  \0
        001 000 000 000 000 000 000 000 215 000 000 000 001 000 023 000
600002220 030 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002240 264  \0  \0  \0 002  \0  \r  \0 320  \b   @  \0  \0  \0  \0  \0
        264 000 000 000 002 000 015 000 320 010 100 000 000 000 000 000
600002260  \0  \0  \0  \0  \0  \0  \0  \0 300  \0  \0  \0 001  \0 022  \0
        000 000 000 000 000 000 000 000 300 000 000 000 001 000 022 000
600002300 020 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        020 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002320   (  \0  \0  \0 004  \0 361 377  \0  \0  \0  \0  \0  \0  \0  \0
        050 000 000 000 004 000 361 377 000 000 000 000 000 000 000 000
600002340  \0  \0  \0  \0  \0  \0  \0  \0 337  \0  \0  \0 001  \0 021  \0
        000 000 000 000 000 000 000 000 337 000 000 000 001 000 021 000
600002360 300  \v   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        300 013 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600002400 355  \0  \0  \0 001  \0 024  \0     016   `  \0  \0  \0  \0  \0
        355 000 000 000 001 000 024 000 040 016 140 000 000 000 000 000
600002420  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 004  \0 361 377
        000 000 000 000 000 000 000 000 000 000 000 000 004 000 361 377
600002440  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002460 371  \0  \0  \0  \0  \0 022  \0 030 016   `  \0  \0  \0  \0  \0
        371 000 000 000 000 000 022 000 030 016 140 000 000 000 000 000
600002500  \0  \0  \0  \0  \0  \0  \0  \0  \n 001  \0  \0 001  \0 025  \0
        000 000 000 000 000 000 000 000 012 001 000 000 001 000 025 000
600002520   ( 016   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 016 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002540 023 001  \0  \0  \0  \0 022  \0 020 016   `  \0  \0  \0  \0  \0
        023 001 000 000 000 000 022 000 020 016 140 000 000 000 000 000
600002560  \0  \0  \0  \0  \0  \0  \0  \0   & 001  \0  \0 001  \0 027  \0
        000 000 000 000 000 000 000 000 046 001 000 000 001 000 027 000
600002600  \0 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600002620   < 001  \0  \0 022  \0  \r  \0 020  \n   @  \0  \0  \0  \0  \0
        074 001 000 000 022 000 015 000 020 012 100 000 000 000 000 000
600002640 002  \0  \0  \0  \0  \0  \0  \0   L 001  \0  \0      \0  \0  \0
        002 000 000 000 000 000 000 000 114 001 000 000 040 000 000 000
600002660  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002700   h 001  \0  \0      \0 030  \0 200 020   `  \0  \0  \0  \0  \0
        150 001 000 000 040 000 030 000 200 020 140 000 000 000 000 000
600002720  \0  \0  \0  \0  \0  \0  \0  \0   s 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 163 001 000 000 022 000 000 000
600002740  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600002760 206 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        206 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003000  \0  \0  \0  \0  \0  \0  \0  \0 231 001  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 231 001 000 000 020 000 030 000
600003020 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003040 240 001  \0  \0 022  \0 016  \0 024  \n   @  \0  \0  \0  \0  \0
        240 001 000 000 022 000 016 000 024 012 100 000 000 000 000 000
600003060  \0  \0  \0  \0  \0  \0  \0  \0 246 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 246 001 000 000 022 000 000 000
600003100  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003120 274 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        274 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003140  \0  \0  \0  \0  \0  \0  \0  \0 320 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 320 001 000 000 022 000 000 000
600003160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003200 343 001  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        343 001 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003220  \0  \0  \0  \0  \0  \0  \0  \0 365 001  \0  \0 022  \0  \0  \0
        000 000 000 000 000 000 000 000 365 001 000 000 022 000 000 000
600003240  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003260  \a 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        007 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003300  \0  \0  \0  \0  \0  \0  \0  \0   & 002  \0  \0 020  \0 030  \0
        000 000 000 000 000 000 000 000 046 002 000 000 020 000 030 000
600003320 200 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        200 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003340   3 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        063 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600003360  \0  \0  \0  \0  \0  \0  \0  \0   B 002  \0  \0 021 002 017  \0
        000 000 000 000 000 000 000 000 102 002 000 000 021 002 017 000
600003400   (  \n   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        050 012 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003420   O 002  \0  \0 021  \0 017  \0      \n   @  \0  \0  \0  \0  \0
        117 002 000 000 021 000 017 000 040 012 100 000 000 000 000 000
600003440 004  \0  \0  \0  \0  \0  \0  \0   ^ 002  \0  \0 022  \0  \0  \0
        004 000 000 000 000 000 000 000 136 002 000 000 022 000 000 000
600003460  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003500   p 002  \0  \0 022  \0  \r  \0 240  \t   @  \0  \0  \0  \0  \0
        160 002 000 000 022 000 015 000 240 011 100 000 000 000 000 000
600003520   e  \0  \0  \0  \0  \0  \0  \0 200 002  \0  \0 022  \0  \0  \0
        145 000 000 000 000 000 000 000 200 002 000 000 022 000 000 000
600003540  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600003560 224 002  \0  \0 020  \0 031  \0 300 020   p  \0  \0  \0  \0  \0
        224 002 000 000 020 000 031 000 300 020 160 000 000 000 000 000
600003600  \0  \0  \0  \0  \0  \0  \0  \0 231 002  \0  \0 022  \0  \r  \0
        000 000 000 000 000 000 000 000 231 002 000 000 022 000 015 000
600003620 023  \b   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        023 010 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600003640 240 002  \0  \0 022  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        240 002 000 000 022 000 000 000 000 000 000 000 000 000 000 000
600003660  \0  \0  \0  \0  \0  \0  \0  \0 265 002  \0  \0 020  \0 031  \0
        000 000 000 000 000 000 000 000 265 002 000 000 020 000 031 000
600003700 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600003720 301 002  \0  \0 022  \0  \r  \0      \a   @  \0  \0  \0  \0  \0
        301 002 000 000 022 000 015 000 040 007 100 000 000 000 000 000
600003740 363  \0  \0  \0  \0  \0  \0  \0 306 002  \0  \0 022  \0  \0  \0
        363 000 000 000 000 000 000 000 306 002 000 000 022 000 000 000
600003760  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
600004000 330 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        330 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004020  \0  \0  \0  \0  \0  \0  \0  \0 354 002  \0  \0 021 002 030  \0
        000 000 000 000 000 000 000 000 354 002 000 000 021 002 030 000
600004040 230 020   `  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        230 020 140 000 000 000 000 000 000 000 000 000 000 000 000 000
600004060 370 002  \0  \0      \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        370 002 000 000 040 000 000 000 000 000 000 000 000 000 000 000
600004100  \0  \0  \0  \0  \0  \0  \0  \0 022 003  \0  \0 022  \0  \v  \0
        000 000 000 000 000 000 000 000 022 003 000 000 022 000 013 000
600004120 030 006   @  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        030 006 100 000 000 000 000 000 000 000 000 000 000 000 000 000
600004140  \0   0   8   0   4   .   c  \0   f   i   l   e   _   w   r   i
        000 060 070 060 064 056 143 000 146 151 154 145 137 167 162 151
600004160   t   e   r  \0   b   u   f   f   e   r   .   4   7   6   1  \0
        164 145 162 000 142 165 146 146 145 162 056 064 067 066 061 000
600004200   p   i   p   e   _   f   d  \0   c   r   t   s   t   u   f   f
        160 151 160 145 137 146 144 000 143 162 164 163 164 165 146 146
600004220   .   c  \0   _   _   J   C   R   _   L   I   S   T   _   _  \0
        056 143 000 137 137 112 103 122 137 114 111 123 124 137 137 000
600004240   d   e   r   e   g   i   s   t   e   r   _   t   m   _   c   l
        144 145 162 145 147 151 163 164 145 162 137 164 155 137 143 154
600004260   o   n   e   s  \0   r   e   g   i   s   t   e   r   _   t   m
        157 156 145 163 000 162 145 147 151 163 164 145 162 137 164 155
600004300   _   c   l   o   n   e   s  \0   _   _   d   o   _   g   l   o
        137 143 154 157 156 145 163 000 137 137 144 157 137 147 154 157
600004320   b   a   l   _   d   t   o   r   s   _   a   u   x  \0   c   o
        142 141 154 137 144 164 157 162 163 137 141 165 170 000 143 157
600004340   m   p   l   e   t   e   d   .   6   3   4   4  \0   _   _   d
        155 160 154 145 164 145 144 056 066 063 064 064 000 137 137 144
600004360   o   _   g   l   o   b   a   l   _   d   t   o   r   s   _   a
        157 137 147 154 157 142 141 154 137 144 164 157 162 163 137 141
600004400   u   x   _   f   i   n   i   _   a   r   r   a   y   _   e   n
        165 170 137 146 151 156 151 137 141 162 162 141 171 137 145 156
600004420   t   r   y  \0   f   r   a   m   e   _   d   u   m   m   y  \0
        164 162 171 000 146 162 141 155 145 137 144 165 155 155 171 000
600004440   _   _   f   r   a   m   e   _   d   u   m   m   y   _   i   n
        137 137 146 162 141 155 145 137 144 165 155 155 171 137 151 156
600004460   i   t   _   a   r   r   a   y   _   e   n   t   r   y  \0   _
        151 164 137 141 162 162 141 171 137 145 156 164 162 171 000 137
600004500   _   F   R   A   M   E   _   E   N   D   _   _  \0   _   _   J
        137 106 122 101 115 105 137 105 116 104 137 137 000 137 137 112
600004520   C   R   _   E   N   D   _   _  \0   _   _   i   n   i   t   _
        103 122 137 105 116 104 137 137 000 137 137 151 156 151 164 137
600004540   a   r   r   a   y   _   e   n   d  \0   _   D   Y   N   A   M
        141 162 162 141 171 137 145 156 144 000 137 104 131 116 101 115
600004560   I   C  \0   _   _   i   n   i   t   _   a   r   r   a   y   _
        111 103 000 137 137 151 156 151 164 137 141 162 162 141 171 137
600004600   s   t   a   r   t  \0   _   G   L   O   B   A   L   _   O   F
        163 164 141 162 164 000 137 107 114 117 102 101 114 137 117 106
600004620   F   S   E   T   _   T   A   B   L   E   _  \0   _   _   l   i
        106 123 105 124 137 124 101 102 114 105 137 000 137 137 154 151
600004640   b   c   _   c   s   u   _   f   i   n   i  \0   _   I   T   M
        142 143 137 143 163 165 137 146 151 156 151 000 137 111 124 115
600004660   _   d   e   r   e   g   i   s   t   e   r   T   M   C   l   o
        137 144 145 162 145 147 151 163 164 145 162 124 115 103 154 157
600004700   n   e   T   a   b   l   e  \0   d   a   t   a   _   s   t   a
        156 145 124 141 142 154 145 000 144 141 164 141 137 163 164 141
600004720   r   t  \0   c   l   o   n   e   @   @   G   L   I   B   C   _
        162 164 000 143 154 157 156 145 100 100 107 114 111 102 103 137
600004740   2   .   2   .   5  \0   w   r   i   t   e   @   @   G   L   I
        062 056 062 056 065 000 167 162 151 164 145 100 100 107 114 111
600004760   B   C   _   2   .   2   .   5  \0   _   e   d   a   t   a  \0
        102 103 137 062 056 062 056 065 000 137 145 144 141 164 141 000
600005000   _   f   i   n   i  \0   s   n   p   r   i   n   t   f   @   @
        137 146 151 156 151 000 163 156 160 162 151 156 164 146 100 100
600005020   G   L   I   B   C   _   2   .   2   .   5  \0   m   e   m   s
        107 114 111 102 103 137 062 056 062 056 065 000 155 145 155 163
600005040   e   t   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        145 164 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005060   c   l   o   s   e   @   @   G   L   I   B   C   _   2   .   2
        143 154 157 163 145 100 100 107 114 111 102 103 137 062 056 062
600005100   .   5  \0   p   i   p   e   @   @   G   L   I   B   C   _   2
        056 065 000 160 151 160 145 100 100 107 114 111 102 103 137 062
600005120   .   2   .   5  \0   r   e   a   d   @   @   G   L   I   B   C
        056 062 056 065 000 162 145 141 144 100 100 107 114 111 102 103
600005140   _   2   .   2   .   5  \0   _   _   l   i   b   c   _   s   t
        137 062 056 062 056 065 000 137 137 154 151 142 143 137 163 164
600005160   a   r   t   _   m   a   i   n   @   @   G   L   I   B   C   _
        141 162 164 137 155 141 151 156 100 100 107 114 111 102 103 137
600005200   2   .   2   .   5  \0   _   _   d   a   t   a   _   s   t   a
        062 056 062 056 065 000 137 137 144 141 164 141 137 163 164 141
600005220   r   t  \0   _   _   g   m   o   n   _   s   t   a   r   t   _
        162 164 000 137 137 147 155 157 156 137 163 164 141 162 164 137
600005240   _  \0   _   _   d   s   o   _   h   a   n   d   l   e  \0   _
        137 000 137 137 144 163 157 137 150 141 156 144 154 145 000 137
600005260   I   O   _   s   t   d   i   n   _   u   s   e   d  \0   k   i
        111 117 137 163 164 144 151 156 137 165 163 145 144 000 153 151
600005300   l   l   @   @   G   L   I   B   C   _   2   .   2   .   5  \0
        154 154 100 100 107 114 111 102 103 137 062 056 062 056 065 000
600005320   _   _   l   i   b   c   _   c   s   u   _   i   n   i   t  \0
        137 137 154 151 142 143 137 143 163 165 137 151 156 151 164 000
600005340   m   a   l   l   o   c   @   @   G   L   I   B   C   _   2   .
        155 141 154 154 157 143 100 100 107 114 111 102 103 137 062 056
600005360   2   .   5  \0   _   e   n   d  \0   _   s   t   a   r   t  \0
        062 056 065 000 137 145 156 144 000 137 163 164 141 162 164 000
600005400   r   e   a   l   l   o   c   @   @   G   L   I   B   C   _   2
        162 145 141 154 154 157 143 100 100 107 114 111 102 103 137 062
600005420   .   2   .   5  \0   _   _   b   s   s   _   s   t   a   r   t
        056 062 056 065 000 137 137 142 163 163 137 163 164 141 162 164
600005440  \0   m   a   i   n  \0   o   p   e   n   @   @   G   L   I   B
        000 155 141 151 156 000 157 160 145 156 100 100 107 114 111 102
600005460   C   _   2   .   2   .   5  \0   _   J   v   _   R   e   g   i
        103 137 062 056 062 056 065 000 137 112 166 137 122 145 147 151
600005500   s   t   e   r   C   l   a   s   s   e   s  \0   _   _   T   M
        163 164 145 162 103 154 141 163 163 145 163 000 137 137 124 115
600005520   C   _   E   N   D   _   _  \0   _   I   T   M   _   r   e   g
        103 137 105 116 104 137 137 000 137 111 124 115 137 162 145 147
600005540   i   s   t   e   r   T   M   C   l   o   n   e   T   a   b   l
        151 163 164 145 162 124 115 103 154 157 156 145 124 141 142 154
600005560   e  \0   _   i   n   i   t  \0  \0  \0  \0  \0  \0  \0  \0  \0
        145 000 137 151 156 151 164 000 000 000 000 000 000 000 000 000
600005600  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600005660  \0  \0  \0  \0  \0  \0  \0  \0 033  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 033 000 000 000 001 000 000 000
600005700 002  \0  \0  \0  \0  \0  \0  \0   8 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 070 002 100 000 000 000 000 000
600005720   8 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        070 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600005740  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600005760  \0  \0  \0  \0  \0  \0  \0  \0   #  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 043 000 000 000 007 000 000 000
600006000 002  \0  \0  \0  \0  \0  \0  \0   T 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 124 002 100 000 000 000 000 000
600006020   T 002  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        124 002 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006040  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006060  \0  \0  \0  \0  \0  \0  \0  \0   1  \0  \0  \0  \a  \0  \0  \0
        000 000 000 000 000 000 000 000 061 000 000 000 007 000 000 000
600006100 002  \0  \0  \0  \0  \0  \0  \0   t 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 164 002 100 000 000 000 000 000
600006120   t 002  \0  \0  \0  \0  \0  \0   $  \0  \0  \0  \0  \0  \0  \0
        164 002 000 000 000 000 000 000 044 000 000 000 000 000 000 000
600006140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600006160  \0  \0  \0  \0  \0  \0  \0  \0   D  \0  \0  \0 366 377 377   o
        000 000 000 000 000 000 000 000 104 000 000 000 366 377 377 157
600006200 002  \0  \0  \0  \0  \0  \0  \0 230 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 230 002 100 000 000 000 000 000
600006220 230 002  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        230 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006240 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006260  \0  \0  \0  \0  \0  \0  \0  \0   N  \0  \0  \0  \v  \0  \0  \0
        000 000 000 000 000 000 000 000 116 000 000 000 013 000 000 000
600006300 002  \0  \0  \0  \0  \0  \0  \0 270 002   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 270 002 100 000 000 000 000 000
600006320 270 002  \0  \0  \0  \0  \0  \0   P 001  \0  \0  \0  \0  \0  \0
        270 002 000 000 000 000 000 000 120 001 000 000 000 000 000 000
600006340 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006360 030  \0  \0  \0  \0  \0  \0  \0   V  \0  \0  \0 003  \0  \0  \0
        030 000 000 000 000 000 000 000 126 000 000 000 003 000 000 000
600006400 002  \0  \0  \0  \0  \0  \0  \0  \b 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 010 004 100 000 000 000 000 000
600006420  \b 004  \0  \0  \0  \0  \0  \0   }  \0  \0  \0  \0  \0  \0  \0
        010 004 000 000 000 000 000 000 175 000 000 000 000 000 000 000
600006440  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 001 000 000 000 000 000 000 000
600006460  \0  \0  \0  \0  \0  \0  \0  \0   ^  \0  \0  \0 377 377 377   o
        000 000 000 000 000 000 000 000 136 000 000 000 377 377 377 157
600006500 002  \0  \0  \0  \0  \0  \0  \0 206 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 206 004 100 000 000 000 000 000
600006520 206 004  \0  \0  \0  \0  \0  \0 034  \0  \0  \0  \0  \0  \0  \0
        206 004 000 000 000 000 000 000 034 000 000 000 000 000 000 000
600006540 005  \0  \0  \0  \0  \0  \0  \0 002  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 002 000 000 000 000 000 000 000
600006560 002  \0  \0  \0  \0  \0  \0  \0   k  \0  \0  \0 376 377 377   o
        002 000 000 000 000 000 000 000 153 000 000 000 376 377 377 157
600006600 002  \0  \0  \0  \0  \0  \0  \0 250 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 250 004 100 000 000 000 000 000
600006620 250 004  \0  \0  \0  \0  \0  \0      \0  \0  \0  \0  \0  \0  \0
        250 004 000 000 000 000 000 000 040 000 000 000 000 000 000 000
600006640 006  \0  \0  \0 001  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        006 000 000 000 001 000 000 000 010 000 000 000 000 000 000 000
600006660  \0  \0  \0  \0  \0  \0  \0  \0   z  \0  \0  \0 004  \0  \0  \0
        000 000 000 000 000 000 000 000 172 000 000 000 004 000 000 000
600006700 002  \0  \0  \0  \0  \0  \0  \0 310 004   @  \0  \0  \0  \0  \0
        002 000 000 000 000 000 000 000 310 004 100 000 000 000 000 000
600006720 310 004  \0  \0  \0  \0  \0  \0 030  \0  \0  \0  \0  \0  \0  \0
        310 004 000 000 000 000 000 000 030 000 000 000 000 000 000 000
600006740 005  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 000 000 000 000 010 000 000 000 000 000 000 000
600006760 030  \0  \0  \0  \0  \0  \0  \0 204  \0  \0  \0 004  \0  \0  \0
        030 000 000 000 000 000 000 000 204 000 000 000 004 000 000 000
600007000   B  \0  \0  \0  \0  \0  \0  \0 340 004   @  \0  \0  \0  \0  \0
        102 000 000 000 000 000 000 000 340 004 100 000 000 000 000 000
600007020 340 004  \0  \0  \0  \0  \0  \0   8 001  \0  \0  \0  \0  \0  \0
        340 004 000 000 000 000 000 000 070 001 000 000 000 000 000 000
600007040 005  \0  \0  \0  \f  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0  \0
        005 000 000 000 014 000 000 000 010 000 000 000 000 000 000 000
600007060 030  \0  \0  \0  \0  \0  \0  \0 216  \0  \0  \0 001  \0  \0  \0
        030 000 000 000 000 000 000 000 216 000 000 000 001 000 000 000
600007100 006  \0  \0  \0  \0  \0  \0  \0 030 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 030 006 100 000 000 000 000 000
600007120 030 006  \0  \0  \0  \0  \0  \0 032  \0  \0  \0  \0  \0  \0  \0
        030 006 000 000 000 000 000 000 032 000 000 000 000 000 000 000
600007140  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007160  \0  \0  \0  \0  \0  \0  \0  \0 211  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 211 000 000 000 001 000 000 000
600007200 006  \0  \0  \0  \0  \0  \0  \0   @ 006   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 100 006 100 000 000 000 000 000
600007220   @ 006  \0  \0  \0  \0  \0  \0 340  \0  \0  \0  \0  \0  \0  \0
        100 006 000 000 000 000 000 000 340 000 000 000 000 000 000 000
600007240  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007260 020  \0  \0  \0  \0  \0  \0  \0 224  \0  \0  \0 001  \0  \0  \0
        020 000 000 000 000 000 000 000 224 000 000 000 001 000 000 000
600007300 006  \0  \0  \0  \0  \0  \0  \0      \a   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 040 007 100 000 000 000 000 000
600007320      \a  \0  \0  \0  \0  \0  \0 364 002  \0  \0  \0  \0  \0  \0
        040 007 000 000 000 000 000 000 364 002 000 000 000 000 000 000
600007340  \0  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 020 000 000 000 000 000 000 000
600007360  \0  \0  \0  \0  \0  \0  \0  \0 232  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 232 000 000 000 001 000 000 000
600007400 006  \0  \0  \0  \0  \0  \0  \0 024  \n   @  \0  \0  \0  \0  \0
        006 000 000 000 000 000 000 000 024 012 100 000 000 000 000 000
600007420 024  \n  \0  \0  \0  \0  \0  \0  \t  \0  \0  \0  \0  \0  \0  \0
        024 012 000 000 000 000 000 000 011 000 000 000 000 000 000 000
600007440  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 004 000 000 000 000 000 000 000
600007460  \0  \0  \0  \0  \0  \0  \0  \0 240  \0  \0  \0 001  \0  \0  \0
        000 000 000 000 000 000 000 000 240 000 000 000 001 000 000 000
600007500  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
        000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
600010000
$ mv /tmp/file.4 /tmp/file.4.old
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.2
ERROR: 40 1 in /tmp/file.2
ERROR: 53 2 in /tmp/file.2
ERROR: 29 3 in /tmp/file.2
ERROR: 27 4 in /tmp/file.2
ERROR: 5 5 in /tmp/file.2
ERROR: 14 6 in /tmp/file.2
ERROR: 8 7 in /tmp/file.2
ERROR: 16 8 in /tmp/file.2
ERROR: 4 9 in /tmp/file.2
ERROR: 12 10 in /tmp/file.2
ERROR: 4 11 in /tmp/file.2
ERROR: 2 12 in /tmp/file.2
ERROR: 10 13 in /tmp/file.2
ERROR: 13 14 in /tmp/file.2
ERROR: 4 15 in /tmp/file.2
ERROR: 26 16 in /tmp/file.2
ERROR: 5 17 in /tmp/file.2
ERROR: 23 18 in /tmp/file.2
ERROR: 4 19 in /tmp/file.2
ERROR: 8 20 in /tmp/file.2
ERROR: 2 21 in /tmp/file.2
ERROR: 1 22 in /tmp/file.2
ERROR: 2 23 in /tmp/file.2
ERROR: 17 24 in /tmp/file.2
ERROR: 5 25 in /tmp/file.2
ERROR: 2 26 in /tmp/file.2
ERROR: 1 27 in /tmp/file.2
ERROR: 3 28 in /tmp/file.2
ERROR: 17 32 in /tmp/file.2
ERROR: 1 35 in /tmp/file.2
ERROR: 1 36 in /tmp/file.2
ERROR: 2 38 in /tmp/file.2
ERROR: 5 40 in /tmp/file.2
ERROR: 1 41 in /tmp/file.2
ERROR: 3 45 in /tmp/file.2
ERROR: 65 46 in /tmp/file.2
ERROR: 2 48 in /tmp/file.2
ERROR: 4 49 in /tmp/file.2
ERROR: 24 50 in /tmp/file.2
ERROR: 3 51 in /tmp/file.2
ERROR: 4 52 in /tmp/file.2
ERROR: 12 53 in /tmp/file.2
ERROR: 2 54 in /tmp/file.2
ERROR: 1 55 in /tmp/file.2
ERROR: 5 56 in /tmp/file.2
ERROR: 1 60 in /tmp/file.2
ERROR: 75 64 in /tmp/file.2
ERROR: 5 65 in /tmp/file.2
ERROR: 17 66 in /tmp/file.2
ERROR: 19 67 in /tmp/file.2
ERROR: 5 68 in /tmp/file.2
ERROR: 6 69 in /tmp/file.2
ERROR: 3 70 in /tmp/file.2
ERROR: 13 71 in /tmp/file.2
ERROR: 18 73 in /tmp/file.2
ERROR: 3 74 in /tmp/file.2
ERROR: 17 76 in /tmp/file.2
ERROR: 7 77 in /tmp/file.2
ERROR: 5 78 in /tmp/file.2
ERROR: 4 79 in /tmp/file.2
ERROR: 1 80 in /tmp/file.2
ERROR: 4 82 in /tmp/file.2
ERROR: 2 83 in /tmp/file.2
ERROR: 13 84 in /tmp/file.2
ERROR: 1 85 in /tmp/file.2
ERROR: 1 86 in /tmp/file.2
ERROR: 1 89 in /tmp/file.2
ERROR: 2 94 in /tmp/file.2
ERROR: 118 95 in /tmp/file.2
ERROR: 24 96 in /tmp/file.2
ERROR: 54 97 in /tmp/file.2
ERROR: 14 98 in /tmp/file.2
ERROR: 18 99 in /tmp/file.2
ERROR: 29 100 in /tmp/file.2
ERROR: 57 101 in /tmp/file.2
ERROR: 16 102 in /tmp/file.2
ERROR: 15 103 in /tmp/file.2
ERROR: 9 104 in /tmp/file.2
ERROR: 48 105 in /tmp/file.2
ERROR: 1 106 in /tmp/file.2
ERROR: 2 107 in /tmp/file.2
ERROR: 30 108 in /tmp/file.2
ERROR: 22 109 in /tmp/file.2
ERROR: 43 110 in /tmp/file.2
ERROR: 29 111 in /tmp/file.2
ERROR: 13 112 in /tmp/file.2
ERROR: 56 114 in /tmp/file.2
ERROR: 42 115 in /tmp/file.2
ERROR: 65 116 in /tmp/file.2
ERROR: 14 117 in /tmp/file.2
ERROR: 3 118 in /tmp/file.2
ERROR: 2 119 in /tmp/file.2
ERROR: 3 120 in /tmp/file.2
ERROR: 16 121 in /tmp/file.2
ERROR: 1 122 in /tmp/file.2
ERROR: 1 125 in /tmp/file.2
ERROR: 1 126 in /tmp/file.2
ERROR: 5 128 in /tmp/file.2
ERROR: 1 132 in /tmp/file.2
ERROR: 4 134 in /tmp/file.2
ERROR: 1 137 in /tmp/file.2
ERROR: 1 141 in /tmp/file.2
ERROR: 1 142 in /tmp/file.2
ERROR: 1 144 in /tmp/file.2
ERROR: 1 145 in /tmp/file.2
ERROR: 2 148 in /tmp/file.2
ERROR: 6 152 in /tmp/file.2
ERROR: 2 153 in /tmp/file.2
ERROR: 1 154 in /tmp/file.2
ERROR: 6 160 in /tmp/file.2
ERROR: 1 166 in /tmp/file.2
ERROR: 3 168 in /tmp/file.2
ERROR: 1 176 in /tmp/file.2
ERROR: 1 180 in /tmp/file.2
ERROR: 1 181 in /tmp/file.2
ERROR: 3 184 in /tmp/file.2
ERROR: 1 188 in /tmp/file.2
ERROR: 4 192 in /tmp/file.2
ERROR: 1 193 in /tmp/file.2
ERROR: 1 198 in /tmp/file.2
ERROR: 3 200 in /tmp/file.2
ERROR: 2 208 in /tmp/file.2
ERROR: 1 216 in /tmp/file.2
ERROR: 1 223 in /tmp/file.2
ERROR: 4 224 in /tmp/file.2
ERROR: 1 227 in /tmp/file.2
ERROR: 1 236 in /tmp/file.2
ERROR: 1 237 in /tmp/file.2
ERROR: 4 241 in /tmp/file.2
ERROR: 1 243 in /tmp/file.2
ERROR: 1 244 in /tmp/file.2
ERROR: 1 245 in /tmp/file.2
ERROR: 1 246 in /tmp/file.2
ERROR: 2 248 in /tmp/file.2
ERROR: 1 249 in /tmp/file.2
ERROR: 1 254 in /tmp/file.2
ERROR: 2549 0 in /tmp/file.7
ERROR: 40 1 in /tmp/file.7
ERROR: 53 2 in /tmp/file.7
ERROR: 29 3 in /tmp/file.7
ERROR: 27 4 in /tmp/file.7
ERROR: 5 5 in /tmp/file.7
ERROR: 14 6 in /tmp/file.7
ERROR: 8 7 in /tmp/file.7
ERROR: 16 8 in /tmp/file.7
ERROR: 4 9 in /tmp/file.7
ERROR: 12 10 in /tmp/file.7
ERROR: 4 11 in /tmp/file.7
ERROR: 2 12 in /tmp/file.7
ERROR: 10 13 in /tmp/file.7
ERROR: 13 14 in /tmp/file.7
ERROR: 4 15 in /tmp/file.7
ERROR: 26 16 in /tmp/file.7
ERROR: 5 17 in /tmp/file.7
ERROR: 23 18 in /tmp/file.7
ERROR: 4 19 in /tmp/file.7
ERROR: 8 20 in /tmp/file.7
ERROR: 2 21 in /tmp/file.7
ERROR: 1 22 in /tmp/file.7
ERROR: 2 23 in /tmp/file.7
ERROR: 17 24 in /tmp/file.7
ERROR: 5 25 in /tmp/file.7
ERROR: 2 26 in /tmp/file.7
ERROR: 1 27 in /tmp/file.7
ERROR: 3 28 in /tmp/file.7
ERROR: 17 32 in /tmp/file.7
ERROR: 1 35 in /tmp/file.7
ERROR: 1 36 in /tmp/file.7
ERROR: 2 38 in /tmp/file.7
ERROR: 5 40 in /tmp/file.7
ERROR: 1 41 in /tmp/file.7
ERROR: 3 45 in /tmp/file.7
ERROR: 65 46 in /tmp/file.7
ERROR: 2 48 in /tmp/file.7
ERROR: 4 49 in /tmp/file.7
ERROR: 24 50 in /tmp/file.7
ERROR: 3 51 in /tmp/file.7
ERROR: 4 52 in /tmp/file.7
ERROR: 12 53 in /tmp/file.7
ERROR: 2 54 in /tmp/file.7
ERROR: 1 55 in /tmp/file.7
ERROR: 5 56 in /tmp/file.7
ERROR: 1 60 in /tmp/file.7
ERROR: 75 64 in /tmp/file.7
ERROR: 5 65 in /tmp/file.7
ERROR: 17 66 in /tmp/file.7
ERROR: 19 67 in /tmp/file.7
ERROR: 5 68 in /tmp/file.7
ERROR: 6 69 in /tmp/file.7
ERROR: 3 70 in /tmp/file.7
ERROR: 13 71 in /tmp/file.7
ERROR: 18 73 in /tmp/file.7
ERROR: 3 74 in /tmp/file.7
ERROR: 17 76 in /tmp/file.7
ERROR: 7 77 in /tmp/file.7
ERROR: 5 78 in /tmp/file.7
ERROR: 4 79 in /tmp/file.7
ERROR: 1 80 in /tmp/file.7
ERROR: 4 82 in /tmp/file.7
ERROR: 2 83 in /tmp/file.7
ERROR: 13 84 in /tmp/file.7
ERROR: 1 85 in /tmp/file.7
ERROR: 1 86 in /tmp/file.7
ERROR: 1 89 in /tmp/file.7
ERROR: 2 94 in /tmp/file.7
ERROR: 118 95 in /tmp/file.7
ERROR: 24 96 in /tmp/file.7
ERROR: 54 97 in /tmp/file.7
ERROR: 14 98 in /tmp/file.7
ERROR: 18 99 in /tmp/file.7
ERROR: 29 100 in /tmp/file.7
ERROR: 57 101 in /tmp/file.7
ERROR: 16 102 in /tmp/file.7
ERROR: 15 103 in /tmp/file.7
ERROR: 9 104 in /tmp/file.7
ERROR: 48 105 in /tmp/file.7
ERROR: 1 106 in /tmp/file.7
ERROR: 2 107 in /tmp/file.7
ERROR: 30 108 in /tmp/file.7
ERROR: 22 109 in /tmp/file.7
ERROR: 43 110 in /tmp/file.7
ERROR: 29 111 in /tmp/file.7
ERROR: 13 112 in /tmp/file.7
ERROR: 56 114 in /tmp/file.7
ERROR: 42 115 in /tmp/file.7
ERROR: 65 116 in /tmp/file.7
ERROR: 14 117 in /tmp/file.7
ERROR: 3 118 in /tmp/file.7
ERROR: 2 119 in /tmp/file.7
ERROR: 3 120 in /tmp/file.7
ERROR: 16 121 in /tmp/file.7
ERROR: 1 122 in /tmp/file.7
ERROR: 1 125 in /tmp/file.7
ERROR: 1 126 in /tmp/file.7
ERROR: 5 128 in /tmp/file.7
ERROR: 1 132 in /tmp/file.7
ERROR: 4 134 in /tmp/file.7
ERROR: 1 137 in /tmp/file.7
ERROR: 1 141 in /tmp/file.7
ERROR: 1 142 in /tmp/file.7
ERROR: 1 144 in /tmp/file.7
ERROR: 1 145 in /tmp/file.7
ERROR: 2 148 in /tmp/file.7
ERROR: 6 152 in /tmp/file.7
ERROR: 2 153 in /tmp/file.7
ERROR: 1 154 in /tmp/file.7
ERROR: 6 160 in /tmp/file.7
ERROR: 1 166 in /tmp/file.7
ERROR: 3 168 in /tmp/file.7
ERROR: 1 176 in /tmp/file.7
ERROR: 1 180 in /tmp/file.7
ERROR: 1 181 in /tmp/file.7
ERROR: 3 184 in /tmp/file.7
ERROR: 1 188 in /tmp/file.7
ERROR: 4 192 in /tmp/file.7
ERROR: 1 193 in /tmp/file.7
ERROR: 1 198 in /tmp/file.7
ERROR: 3 200 in /tmp/file.7
ERROR: 2 208 in /tmp/file.7
ERROR: 1 216 in /tmp/file.7
ERROR: 1 223 in /tmp/file.7
ERROR: 4 224 in /tmp/file.7
ERROR: 1 227 in /tmp/file.7
ERROR: 1 236 in /tmp/file.7
ERROR: 1 237 in /tmp/file.7
ERROR: 4 241 in /tmp/file.7
ERROR: 1 243 in /tmp/file.7
ERROR: 1 244 in /tmp/file.7
ERROR: 1 245 in /tmp/file.7
ERROR: 1 246 in /tmp/file.7
ERROR: 2 248 in /tmp/file.7
ERROR: 1 249 in /tmp/file.7
ERROR: 1 254 in /tmp/file.7


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11 10:22           ` Andrea Arcangeli
@ 2017-08-11 10:42             ` Andrea Arcangeli
  -1 siblings, 0 replies; 55+ messages in thread
From: Andrea Arcangeli @ 2017-08-11 10:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> disk block? This would happen on ext4 as well if mounted with -o
> journal=data instead of -o journal=ordered in fact, perhaps you simply

Oops, above I meant journal=writeback; journal=data is even stronger
than journal=ordered, of course.

And I shall clarify further that old disk content can only show up
legitimately on journal=writeback after a hard reboot or crash or in
general an unclean unmount. Even if there's no journaling at all
(i.e. ext2/vfat), old disk content cannot show up at any point, no
matter what, if there was no unclean unmount that requires a journal
replay.

This theory of a completely unrelated fs bug showing you disk content
as a result of the OOM-reaper-induced SIGBUS interrupting a
copy_from_user at its very start is purely motivated by the fact
that, like Michal, I didn't see much explanation on the VM side that
could cause those not-zero not-0xff values to show up in the buffer
of the write syscall. You can try to change fs and see if it happens
again to rule it out. If it always happens regardless of the
filesystem used, then it's likely not a fs bug of course. You've got
an entire, aligned 4k fs block showing up with that data.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  7:54         ` Tetsuo Handa
@ 2017-08-11 10:22           ` Andrea Arcangeli
  -1 siblings, 0 replies; 55+ messages in thread
From: Andrea Arcangeli @ 2017-08-11 10:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri, Aug 11, 2017 at 04:54:36PM +0900, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > +/*
> > > > + * Checks whether a page fault on the given mm is still reliable.
> > > > + * This is no longer true if the oom reaper started to reap the
> > > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > > + * the mm. At that moment any !shared mapping would lose the content
> > > > + * and could cause a memory corruption (zero pages instead of the
> > > > + * original content).
> > > > + *
> > > > + * User should call this before establishing a page table entry for
> > > > + * a !shared mapping and under the proper page table lock.
> > > > + *
> > > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > > + */
> > > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > > +{
> > > > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > > +		return VM_FAULT_SIGBUS;
> > > > +	return 0;
> > > > +}
> > > > +
> > > 
> > > Will you explain the mechanism why random values are written instead of zeros
> > > so that this patch can actually fix the race problem?
> > 
> > I am not sure what you mean here. Were you able to see a write with an
> > unexpected content?
> 
> Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .

The oom reaper depends on userland no longer running in any thread
associated with the reaped "mm" by the time wake_oom_reaper is
called, and I'm not sure do_send_sig_info comes anywhere close to
providing such a guarantee. The problem is that the reschedule seems
to be async, see native_smp_send_reschedule invoked by kick_process.
So perhaps the thread keeps running with a corrupted stack for a
little while until the IPI arrives at its destination. I guess it
wouldn't be reproducible without a large NUMA system.

That said, I looked at the assembly of your program and I don't see
anything in the file_writer that could load data from the stack by
the time it starts to write(), and clearly the SIGKILL and
smp_send_reschedule() will happen after it's already in the write()
tight loop. The only thing it loads from the user stack after it
reaches the tight loop is the canary, which would then crash it if it
broke out of the write loop, and even that still wouldn't cause a
write.
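
For reference, a rough reconstruction of the shape of that tight loop,
based only on the symbols visible in the dump above (file_writer, a
static buffer) and not on the actual reproducer source; the point is
just that the write buffer lives in .bss rather than on the user
stack:

#include <string.h>
#include <unistd.h>

static char buffer[4096];	/* write buffer in .bss, not on the user stack */

int main(void)
{
	memset(buffer, 0xFF, sizeof(buffer));
	/* once this loop is entered, nothing but the stack canary is
	 * ever loaded from the user stack again */
	for (;;) {
		if (write(1, buffer, sizeof(buffer)) != sizeof(buffer))
			return 1;
	}
}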

So I don't see much explanation on the VM side, but perhaps it's
possible this is a filesystem bug that enlarges the i_size before
issuing the write that gets SIGBUS in copy_from_user because
MMF_UNSTABLE is set at first access? And then it leaves i_size
enlarged, and what you're seeing in od -b is leaked content from an
uninitialized disk block? This would happen on ext4 as well if
mounted with -o journal=data instead of -o journal=ordered in fact;
perhaps you simply have a filesystem that isn't mounted with
journal=ordered semantics and this isn't the OOM killer.

Also, why are you using octal output? -x would be more intuitive for
the 0xff (377) bytes which are to be expected (the content should be
all zeros or 0xff, and some zeros show up too).

Assuming those not-zero not-0xff values are simply a consequence of
the lack of ordered journaling mode and are deleted file data (you
clearly must not have an SSD with -o discard or it'd be zero there),
even if you only saw zeroes it wouldn't concern me any less.

The non-zero non-0xff values, if they show up beyond the end of the
previous i_size, concern me less because they're at least less
obviously going to create sticky data corruption in an OOM-killed
database. The database could handle it by recording the valid i_size
it successfully expanded the file to, with userland journaling in its
own user metadata.
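
Purely as an illustration of that userland journaling idea (the file
names and the exact scheme below are made up, not something from this
thread): the application appends and fsyncs a block, then records the
new known-good length in a small side file, so that after an OOM kill
anything past the recorded length can be discarded.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
	static char block[4096];
	off_t good_len;
	int fd = open("data.db", O_WRONLY | O_APPEND | O_CREAT, 0644);
	FILE *meta;

	if (fd < 0)
		return 1;
	memset(block, 0xFF, sizeof(block));
	if (write(fd, block, sizeof(block)) != sizeof(block) || fsync(fd))
		return 1;			/* do not advance the mark on failure */
	good_len = lseek(fd, 0, SEEK_CUR);	/* length we now trust */
	meta = fopen("data.db.len", "w");
	if (!meta)
		return 1;
	fprintf(meta, "%lld\n", (long long) good_len);
	fflush(meta);
	fsync(fileno(meta));			/* make the new mark durable too */
	return fclose(meta) ? 1 : 0;
}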

The expected zeroes that show up in your dump are the real major issue
here, and they show up as well. A database that hits OOM would then
end up with persistent sticky corruption in its user data that could
break the entire userland journaling, and you might notice it only
much later.

An OOM deadlock is certainly preferable here. Rebooting on an OOM hang
is totally OK and a very minor issue, as the user journaling is
guaranteed to be preserved. Writing random zeroes on shared storage
may instead break the whole thing, and you may only notice at the next
reboot to upgrade the kernel that the db journaling fails, nothing
starts, and you could have lost data too.

Back to your previous xfs OOM reaper timeout failure, one way around it
is to implement a down_read_trylock_unfair that obtains the read lock
while ignoring any write waiter. That breaks fairness, but if done only
in the OOM reaper it would not be a concern (a call-site sketch is
below). down_read_trylock_unfair should solve this xfs lockup involving
khugepaged without the need to remove the mmap_sem from the OOM reaper
while mm_users > 0. A problem would then remain if the OOM-selected
task is allocating memory and is stuck on an xfs lock taken by
shrink_slab while holding the mmap_sem for writing. This is why my
preference would be to dig into xfs and solve the source of the OOM
lockup at its core, as the OOM reaper is kicking the can down the road,
and ultimately if the process runs on pure MAP_ANONYMOUS|MAP_SHARED
memory kicking the can won't move it one bit, unless the OOM reaper
starts to reap shmem too, expanding even more with more checks and
whatnot, while the fix for xfs will ultimately be simpler, more
self-contained and targeted.
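
A call-site sketch of how that could look in the reaper;
down_read_trylock_unfair() is hypothetical (nothing like it exists
upstream) and the function body is heavily abbreviated:

	static bool __oom_reap_task_mm(struct task_struct *tsk,
				       struct mm_struct *mm)
	{
		/* hypothetical: take the read lock even if a writer (e.g.
		 * khugepaged) is already queued, deliberately breaking
		 * fairness for this one caller */
		if (!down_read_trylock_unfair(&mm->mmap_sem))
			return false;	/* let oom_reap_task() retry */

		/* ... the existing MMF_UNSTABLE setup and the
		 * unmap_page_range() sweep over the anonymous vmas ... */

		up_read(&mm->mmap_sem);
		return true;
	}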

I would like it if it were possible to tell which kernel thread has to
be allowed to make progress by lowering the wmark, in order to unstick
the TIF_MEMDIE task. For kernel threads this could involve adding a
pf_memalloc_pid dependency that is accessible at OOM time. Work
submitted in PF_MEMALLOC context could set this pf_memalloc_pid
dependency in the worker threads themselves; fs kernel threads would
need the filesystem to set this pid dependency. So if the TIF_MEMDIE
pid matches the current kernel thread's pf_memalloc_pid, the kernel
thread's allocations would inherit the PF_MEMALLOC wmark privilege, by
artificially lowering the wmark on behalf of the TIF_MEMDIE task (a
rough sketch of the submission side follows below).
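
A rough sketch of the submission side of that idea; pf_memalloc_pid and
this wrapper are hypothetical (nothing like this exists upstream), and
the allocator-side check that consumes the tag is not shown:

	#include <linux/sched.h>
	#include <linux/workqueue.h>

	/* hypothetical: tag async work with the pid of a PF_MEMALLOC
	 * submitter so the worker can be granted the victim's wmark
	 * privilege while the tag matches the TIF_MEMDIE task */
	struct tagged_work {
		struct work_struct work;
		pid_t pf_memalloc_pid;
	};

	static void queue_tagged_work(struct tagged_work *tw)
	{
		tw->pf_memalloc_pid = (current->flags & PF_MEMALLOC) ?
					current->pid : 0;
		schedule_work(&tw->work);
	}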

Or we could simply stop calling shrink_slab for fs-dependent slab
caches in direct reclaim, using a per-shrinker flag, and offload those
to kswapd only (sketched below). That would be a really simple change,
much simpler than the current OOM reaper, which is simpler but unsafe.
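
A sketch of what that could look like; SHRINKER_KSWAPD_ONLY is made up
for illustration and the exact hook point in shrink_slab() is only
indicative:

	#include <linux/shrinker.h>
	#include <linux/swap.h>

	#define SHRINKER_KSWAPD_ONLY	(1 << 2)	/* hypothetical flag */

	/* called from shrink_slab() before invoking each shrinker */
	static bool skip_in_direct_reclaim(struct shrinker *shrinker)
	{
		/* fs-dependent shrinkers opt out of direct reclaim and
		 * are only walked by kswapd */
		return !current_is_kswapd() &&
		       (shrinker->flags & SHRINKER_KSWAPD_ONLY);
	}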

There are several dozen megabytes of RAM available when the system
hangs and fails to get rid of the TIF_MEMDIE task; the problem is that
they must be given to the kernel thread that the TIF_MEMDIE task is
waiting for, and we can't rely on lockdep to sort that out or it'd be
too slow.

Refusing to fix the fs hangs and relying solely on the OOM reaper
ultimately causes the OOM reaper to keep escalating, to the point where
not even down_read_trylock_unfair would suffice anymore and it would
need to zap pagetables without holding the mmap_sem at all (for example
in order to solve that same xfs OOM hang of yours, which would still
remain if shrink_slab runs in direct reclaim under a
mmap_sem-for-writing section, such as while allocating a vma in mmap).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  7:09       ` Michal Hocko
@ 2017-08-11  7:54         ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-11  7:54 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > +/*
> > > + * Checks whether a page fault on the given mm is still reliable.
> > > + * This is no longer true if the oom reaper started to reap the
> > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > + * the mm. At that moment any !shared mapping would lose the content
> > > + * and could cause a memory corruption (zero pages instead of the
> > > + * original content).
> > > + *
> > > + * User should call this before establishing a page table entry for
> > > + * a !shared mapping and under the proper page table lock.
> > > + *
> > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > + */
> > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > +{
> > > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > +		return VM_FAULT_SIGBUS;
> > > +	return 0;
> > > +}
> > > +
> > 
> > Will you explain the mechanism by which random values are written instead of
> > zeros, so that this patch can actually fix the race problem?
> 
> I am not sure what you mean here. Were you able to see a write with
> unexpected content?

Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp .

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-11  2:28     ` Tetsuo Handa
@ 2017-08-11  7:09       ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-11  7:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel

On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > +/*
> > + * Checks whether a page fault on the given mm is still reliable.
> > + * This is no longer true if the oom reaper started to reap the
> > + * address space which is reflected by MMF_UNSTABLE flag set in
> > + * the mm. At that moment any !shared mapping would lose the content
> > + * and could cause a memory corruption (zero pages instead of the
> > + * original content).
> > + *
> > + * User should call this before establishing a page table entry for
> > + * a !shared mapping and under the proper page table lock.
> > + *
> > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > + */
> > +static inline int check_stable_address_space(struct mm_struct *mm)
> > +{
> > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > +		return VM_FAULT_SIGBUS;
> > +	return 0;
> > +}
> > +
> 
> Will you explain the mechanism by which random values are written instead of
> zeros, so that this patch can actually fix the race problem?

I am not sure what you mean here. Were you able to see a write with
unexpected content?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-07 11:38   ` Michal Hocko
@ 2017-08-11  2:28     ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-11  2:28 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: andrea, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> +/*
> + * Checks whether a page fault on the given mm is still reliable.
> + * This is no longer true if the oom reaper started to reap the
> + * address space which is reflected by MMF_UNSTABLE flag set in
> + * the mm. At that moment any !shared mapping would lose the content
> + * and could cause a memory corruption (zero pages instead of the
> + * original content).
> + *
> + * User should call this before establishing a page table entry for
> + * a !shared mapping and under the proper page table lock.
> + *
> + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> + */
> +static inline int check_stable_address_space(struct mm_struct *mm)
> +{
> +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> +		return VM_FAULT_SIGBUS;
> +	return 0;
> +}
> +

Will you explain the mechanism by which random values are written instead of
zeros, so that this patch can actually fix the race problem? I consider that
writing random values (though they seem to be a portion of the process image)
instead of zeros to a file might cause a security problem, and the patch that
fixes it should be backportable to stable kernels.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-10  8:21       ` Michal Hocko
@ 2017-08-10 13:33         ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-10 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Kirill A. Shutemov, Tetsuo Handa, Oleg Nesterov,
	Wenwei Tao, linux-mm, LKML

On Thu 10-08-17 10:21:18, Michal Hocko wrote:
> On Tue 08-08-17 19:48:55, Andrea Arcangeli wrote:
> [...]
> > The bug corrected by this patch 1/2 I pointed it out last week while
> > reviewing other oom reaper fixes so that looks fine.
> > 
> > However I'd prefer to dump MMF_UNSTABLE for good instead of adding
> > more of it. It can be replaced with unmap_page_range in
> > __oom_reap_task_mm with a function that arms a special migration entry
> > so that no branches are added to the fast paths and it's all hidden
> > inside is_migration_entry slow paths.
> 
> This sounds like an interesting idea but I would like to address the
> _correctness_ issue first and optimize on top of it. If for nothing else
> backporting a follow up fix sounds easier than a complete rework. There
> are quite some callers of is_migration_entry and the patch won't be
> trivial either. So can we focus on the fix first please?

Btw, if the overhead is a concern then we can add a jump label and make
the code active only while the OOM is in progress (a rough sketch is
below). We already count all oom victims so we have clear entry and
exit points. This would still sound easier to do than teaching every
is_migration_entry caller about a new migration entry type and handling
it properly, not to mention making everybody aware of this for future
callers of is_migration_entry.
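
A rough sketch of that suggestion (my reading of it, not Michal's
actual patch): the page fault side check is only evaluated while oom
victims exist, via a static key flipped from mark_oom_victim() and
exit_oom_victim():

	#include <linux/jump_label.h>
	#include <linux/mm.h>

	DEFINE_STATIC_KEY_FALSE(oom_victims_active);

	/* mark_oom_victim():  static_branch_inc(&oom_victims_active);
	 * exit_oom_victim():  static_branch_dec(&oom_victims_active); */

	static inline int check_stable_address_space(struct mm_struct *mm)
	{
		if (static_branch_unlikely(&oom_victims_active) &&
		    unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
			return VM_FAULT_SIGBUS;
		return 0;
	}
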
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-08 17:48     ` Andrea Arcangeli
@ 2017-08-10  8:21       ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-10  8:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Kirill A. Shutemov, Tetsuo Handa, Oleg Nesterov,
	Wenwei Tao, linux-mm, LKML

On Tue 08-08-17 19:48:55, Andrea Arcangeli wrote:
[...]
> The bug corrected by this patch 1/2 I pointed it out last week while
> reviewing other oom reaper fixes so that looks fine.
> 
> However I'd prefer to dump MMF_UNSTABLE for good instead of adding
> more of it. It can be replaced with unmap_page_range in
> __oom_reap_task_mm with a function that arms a special migration entry
> so that no branches are added to the fast paths and it's all hidden
> inside is_migration_entry slow paths.

This sounds like an interesting idea but I would like to address the
_correctness_ issue first and optimize on top of it. If for nothing else
backporting a follow up fix sounds easier than a complete rework. There
are quite some callers of is_migration_entry and the patch won't be
trivial either. So can we focus on the fix first please?

[...]

> Overall OOM killing to me was reliable also before the oom reaper was
> introduced.

Yeah, this is the case in my experience as well, but there are others
claiming otherwise, and implementation-wise the code was fragile enough
to support their claims. An unbounded lockup on the TIF_MEMDIE task
just asks for trouble, especially when we have no idea what the oom
victim might be doing. Things are very simple when the victim was
killed while in userspace, but this all gets very hairy when it was
somewhere in the kernel waiting for locks. It seems that we are mostly
lucky in the global oom situations. We have seen lockups with memcgs
and had to move the memcg oom handling to a lockless PF context. The
two cases are not too different except the memcg one was easier to hit.

[...]

> A couple of years ago I could trivially trigger OOM deadlocks on
> various ext4 paths that loop or use GFP_NOFAIL, but that was just a
> matter of letting GFP_NOIO/NOFS/NOFAIL kind of allocation go through
> memory reserves below the low watermark.

You would have to identify the dependency chain to do this properly,
otherwise you simply consume memory reserves and you are back to square
one.

> It is also fine to kill a few more processes in fact.

I strongly disagree. It might be acceptable to kill more tasks if there
is absolutely no other choice. OOM killing is a very disruptive action
and we should _really_ reduce it to the absolute minimum.

[...]
> The main point of the oom reaper nowadays is to free memory fast
> enough so a second task isn't killed as a false positive, but it's not
> like anybody will notice much of a difference if a second task is
> killed, it wasn't commonly happening either.

No, you seem to misunderstand. Adding a kernel thread to optimize a
glacial kind of slow path would be really hard to justify. The sole
purpose of the oom reaper is _reliability_. We do not select another
task from an oom domain if there is an existing oom victim alive. So we
do not need the reaper to prevent another victim selection. All we need
this async context for is to _guarantee_ that somebody tries to reclaim
as much of the victim's memory as possible and then allows the oom
killer to continue if the OOM situation is not resolved, because that
endless waiting for a sync context is what causes those lockups.

> Certainly it's preferable to get two tasks killed than corrupted core
> dumps or corrupted memory, so if oom reaper will stay we need to
> document how we guarantee it's mutually exclusive against core dumping

Corrupted anonymous memory in the core dump was deemed an acceptable
trade-off to get more reliable oom handling. If there is a strong
usecase for reliable core dumps then we can work on it, of course, but
system stability comes first IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-08 23:35       ` Tetsuo Handa
@ 2017-08-09 18:36         ` Andrea Arcangeli
  -1 siblings, 0 replies; 55+ messages in thread
From: Andrea Arcangeli @ 2017-08-09 18:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

On Wed, Aug 09, 2017 at 08:35:36AM +0900, Tetsuo Handa wrote:
> I don't think so. We spent a lot of time in order to remove possible locations
> which can lead to failing to invoke the OOM killer when out_of_memory() is called.

The connection between failing to invoke the OOM killer and the OOM
reaper is not clear to me. I assume you mean failing to kill the task
after the OOM killer has been invoked through out_of_memory().

You should always see "%s: Kill process %d (%s) score %u or sacrifice
child\n" in the logs; the invocation itself should never have been an
issue and it's unrelated to the OOM reaper.

> Since RHEL7 changed the default filesystem from ext4 to xfs, OOM related problems
> became much easier to hit, for xfs involves many kernel threads where
> TIF_MEMDIE based access to memory reserves cannot work among the relevant threads.

I could reproduce similar issues where the TIF_MEMDIE task was hung on
fs locks held by kernel threads in ext4 too, but those should have been
solved by other means.

> Judging from my experience at a support center, it is too difficult for customers
> to report OOM hangs. It requires customers to stand by in front of the console
> twenty-four seven so that we get SysRq-t etc. whenever an OOM related problem is
> suspected. We can't ask customers for such effort. That there are no reports does
> not mean OOM hangs are not occurring without artificial memory stress tests.

The printk above is likely to show up in the logs after a reboot, but I
agree that in the cloud a node hanging on OOM is probably hidden, and
there are all sorts of management provisions possible to prevent
hitting a real OOM too, for example memcg.

Still, having no apparent customer complaints is I think significant,
because it means they easily tackle the problem by other means, be it
watchdogs or preventing it in the first place with memcg.

I'm not saying it's a minor issue; to me it's totally annoying if my
system hangs on OOM, so it should be reliable in practice. I'm only not
sure if tackling the OOM issues with a big hammer that still cannot
guarantee anything 100% is justified, considering the complexity it
brings to the VM core while there's still no guarantee of not hanging.

> The OOM reaper does not need to free memory fast enough, for the OOM killer
> does not select the second task for kill until the OOM reaper sets
> MMF_OOM_SKIP or __mmput() sets MMF_OOM_SKIP.

Right, there's no need to be fast there.

> I think that the main points of the OOM reaper nowadays are
> "how can we allow the OOM reaper to take mmap_sem for read (because
> khugepaged might take mmap_sem of the OOM victim for write)"

The main point of the OOM reaper is to avoid killing more tasks. Not
just because it would be a false positive, but also because even if we
kill more tasks, they may all be stuck on the same fs locks held by
kernel threads that cannot be killed and that loop asking for more memory.

So the OOM reaper tends to reduce the risk of OOM hangs, but it sure
cannot guarantee perfection either.

Incidentally the OOM reaper still has a timeout where it gives up and
it moves to kill another task after the timeout.

khugepaged doesn't allocate memory while holding the mmap_sem for
writing.

It's not exactly clear how khugepaged is the problem in the below dump,
because 3163 is also definitely holding the mmap_sem for reading and it
cannot release it independently of khugepaged. However, khugepaged
could try to grab it for writing, and the fairness provisions of the
rwsem would prevent down_read_trylock from going ahead.

There's nothing specific about khugepaged here: you can try doing a
pthread_create() to create a thread in your a.out program and then call
mmap/munmap in a loop (no need to touch any memory); a sketch of such a
reproducer is below. Eventually you'll get a page fault in your a.out
process holding the mmap_sem for reading with the child thread trying
to take it for writing, which should be enough to block the OOM reaper
entirely with the child stuck in D state.
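
A minimal sketch of such a reproducer (my own illustration, not an
actual test program from this thread):

	#include <pthread.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* child thread: each mmap/munmap takes mmap_sem for writing, so a
	 * queued writer can make the oom reaper's down_read_trylock()
	 * fail because of the rwsem fairness rules */
	static void *mapper(void *arg)
	{
		for (;;) {
			void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p != MAP_FAILED)
				munmap(p, 4096);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t th;

		pthread_create(&th, NULL, mapper, NULL);
		/* main thread: fault in anonymous memory (page faults hold
		 * mmap_sem for reading) until the OOM killer fires */
		for (;;) {
			char *p = malloc(1 << 20);
			size_t i;

			if (!p)
				continue;
			for (i = 0; i < (1 << 20); i += 4096)
				p[i] = 1;
		}
	}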

I already have a patch in my tree that lets exit_mmap and the OOM
reaper take down pagetables concurrently, serialized only by the PT
lock (upstream the OOM reaper can only run before exit_mmap starts,
while mm_users is still > 0). This lets the OOM reaper run even if
mm_users of the TIF_MEMDIE task has already reached 0. However, to
avoid taking the mmap_sem for reading in __oom_reap_task_mm you would
need to do the opposite of upstream, and then it would only solve OOM
hangs between the last mmput and exit_mmap.

To zap pagetables without the mmap_sem I think quite an overhaul is
needed (likely much bigger than the one required to fix the memory and
coredump corruption). If that is done, it should be done to run
MADV_DONTNEED without the mmap_sem, if anything. Increased OOM reaper
accuracy wouldn't be enough of a motivation to justify such an increase
in complexity and constant fast-path overhead (be it releasing vmas
with RCU through callbacks with delayed freeing, or anything else
required to drop the mmap_sem while still allowing the OOM reaper to
run while mm_users is still > 0). It would be quite challenging to do,
because the vma bits are also protected by the mmap_sem and you can
only replace rbtree nodes under RCU, not rebalance the tree.

Assuming we do all that work and slow down the fast paths further, just
for the OOM reaper, what would then happen if the hung process has no
anonymous memory to free and instead runs on shmem only? Would we be
back to square one and hang with the below dump?

What if we fix xfs instead, to get rid of the below problem? Wouldn't
the OOM reaper then become irrelevant, whether it's removed or not?

> ----------
> [  493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice child
> [  493.791708] Killed process 3163 (a.out) total-vm:4268108kB, anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
> [  494.838382] oom_reaper: unable to reap pid:3163 (a.out)
> [  494.847768] 
> [  494.847768] Showing all locks held in the system:
> [  494.861357] 1 lock held by oom_reaper/59:
> [  494.865903]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9f0c202d>] debug_show_all_locks+0x3d/0x1a0
> [  494.872934] 1 lock held by khugepaged/63:
> [  494.877426]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f1d5a4d>] khugepaged+0x99d/0x1af0
> [  494.884165] 3 locks held by kswapd0/75:
> [  494.888628]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
> [  494.894125]  #1:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [  494.898328]  #2:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03aeafd>] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
> [  494.902703] 3 locks held by kworker/u128:31/387:
> [  494.905404]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
> [  494.909237]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
> [  494.913205]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [  494.916954] 1 lock held by xfsaild/sda1/422:
> [  494.919288]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8828>] xfs_ilock_nowait+0x148/0x240 [xfs]
> [  494.923470] 1 lock held by systemd-journal/491:
> [  494.926102]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.929942] 1 lock held by gmain/745:
> [  494.932368]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.936505] 1 lock held by tuned/1009:
> [  494.938856]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.942824] 2 locks held by agetty/982:
> [  494.944900]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff9f78503f>] ldsem_down_read+0x1f/0x30
> [  494.948244]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff9f4108bf>] n_tty_read+0xbf/0x8e0
> [  494.952118] 1 lock held by sendmail/984:
> [  494.954408]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [  494.958370] 5 locks held by a.out/3163:
> [  494.960544]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f05ca34>] __do_page_fault+0x154/0x4c0
> [  494.964191]  #1:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
> [  494.967922]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [  494.971548]  #3:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03ae7fe>] xfs_reclaim_inodes_ag+0xae/0x4d0 [xfs]
> [  494.975644]  #4:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8580>] xfs_ilock+0xc0/0x1b0 [xfs]
> [  494.979194] 1 lock held by a.out/3164:
> [  494.981220]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [  494.984448] 1 lock held by a.out/3165:
> [  494.986554]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [  494.989841] 1 lock held by a.out/3166:
> [  494.992089]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [  494.995388] 1 lock held by a.out/3167:
> [  494.997420]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> ----------
> 
>   collapse_huge_page at mm/khugepaged.c:1001
>    (inlined by) khugepaged_scan_pmd at mm/khugepaged.c:1209
>    (inlined by) khugepaged_scan_mm_slot at mm/khugepaged.c:1728
>    (inlined by) khugepaged_do_scan at mm/khugepaged.c:1809
>    (inlined by) khugepaged at mm/khugepaged.c:1854
> 
> and "how can we close race between checking MMF_OOM_SKIP and doing last alloc_page_from_freelist()
> attempt (because that race allows needlessly selecting the second task for kill)" in addition to
> "how can we close race between unmap_page_range() and the page faults with retry fallback".

Yes. And "how is the OOM reaper guaranteed not to already be running
while coredumping is starting" should be added to the above list of
things to fix or explain.

I'm just questioning whether all this energy isn't better spent fixing
XFS with a memory reserve in xfs_reclaim_inode for kmem_alloc (like we
have mempools for bios) and dropping the OOM reaper, leaving the VM
fast paths alone.

> The subject of this thread is "how can we close race between unmap_page_range()
> and the page faults with retry fallback". Are you suggesting that we should remove
> the OOM reaper so that we don't need to change page faults and/or __mmput() paths?

Well, certainly if it's not fixed, I think we'd be better off removing
it, because the risk of a hang is preferable to the risk of memory
corruption or corrupted core dumps.

If it were as simple as it currently is, it would be nice to have, but
doing it safely, without the risk of corrupting memory and coredumps
and without slowing down the VM fast paths, sounds like overkill. Last
but not least, it hides the reproducibility of issues like the above
hang you posted, which I think it can't do anything about even if you
remove khugepaged...

... unless we drop the mmap_sem from MADV_DONTNEED, but that's not
easily feasible if unmap_page_range has to run while mm_users may still
be > 0. Doing more VM changes that are OOM-reaper-specific doesn't seem
attractive to me.

I'd rather we fix the issues in xfs the old-fashioned way, one that
won't end up in a hang again if, after all that work, the TIF_MEMDIE
task happens to have 0 anon memory allocated in it.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-08 17:48     ` Andrea Arcangeli
@ 2017-08-08 23:35       ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-08 23:35 UTC (permalink / raw)
  To: aarcange, mhocko
  Cc: akpm, kirill, oleg, wenwei.tww, linux-mm, linux-kernel, mhocko

Andrea Arcangeli wrote:
> Overall OOM killing to me was reliable also before the oom reaper was
> introduced.

I don't think so. We spent a lot of time removing possible locations
that could lead to failing to invoke the OOM killer when out_of_memory()
is called.

> 
> I just did a search in bz for RHEL7 and there's a single bug report
> related to OOM issues, but it's hanging in a non-ext4 filesystem,
> not nested in alloc_pages (but in wait_for_completion), and it's not
> reproducible with ext4. And it's happening only in an artificial,
> specific "eatmemory" stress test from QA; there seem to be zero
> customer-related bug reports about OOM hangs.

Since RHEL7 changed the default filesystem from ext4 to xfs, OOM-related
problems became much easier to trigger, because xfs involves many kernel
threads among which TIF_MEMDIE-based access to memory reserves cannot work.
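
TIF_MEMDIE is a per-thread flag. Roughly (a simplified sketch from
memory, not verbatim mainline code; thread_may_dip_into_reserves() is a
made-up name for illustration), the allocator's check amounts to:

   /*
    * Only the thread that carries TIF_MEMDIE on its own thread_info
    * may dip into the memory reserves.  An xfs worker thread flushing
    * data on behalf of the OOM victim never has this flag set, so it
    * keeps waiting on the watermarks.
    */
   static inline bool thread_may_dip_into_reserves(void)
   {
           return test_thread_flag(TIF_MEMDIE);  /* per thread, not per mm */
   }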

Judging from my experience at a support center, it is too difficult for
customers to report OOM hangs. It requires customers to stand by in
front of the console twenty-four seven so that we get SysRq-t etc.
whenever an OOM-related problem is suspected. We can't ask customers for
such an effort. The absence of reports does not mean that OOM hangs are
not occurring without artificial memory stress tests.

> 
> A couple of years ago I could trivially trigger OOM deadlocks on
> various ext4 paths that loop or use GFP_NOFAIL, but that was just a
> matter of letting GFP_NOIO/NOFS/NOFAIL kinds of allocations go through
> memory reserves below the low watermark.
> 
> It is also fine to kill a few more processes, in fact. It's not the end
> of the world if two tasks are killed because the first one couldn't
> reach exit_mmap without oom reaper assistance. The fs kind of OOM
> hangs in kernel threads are major issues if the whole filesystem, in
> the journal or something, tends to prevent a multitude of tasks from
> handling SIGKILL, so that has to be handled with reserves, and it
> looked like it was working fine already.
> 
> The main point of the oom reaper nowadays is to free memory fast
> enough so a second task isn't killed as a false positive, but it's not
> like anybody will notice much of a difference if a second task is
> killed; it wasn't commonly happening either.

The OOM reaper does not need to free memory quickly, because the OOM
killer does not select a second task for kill until either the OOM
reaper or __mmput() sets MMF_OOM_SKIP.
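
That gate lives in the victim selection path; paraphrased (a simplified
sketch from memory of the mm/oom_kill.c logic of this era, not a
verbatim copy), it amounts to:

   /*
    * As long as an existing victim's mm has not been flagged
    * MMF_OOM_SKIP -- either by the oom reaper or by __mmput() --
    * the selection loop aborts instead of picking a second victim.
    */
   static int oom_evaluate_task_sketch(struct task_struct *task, void *arg)
   {
           struct oom_control *oc = arg;

           if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                   if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                           return 0;       /* teardown finished, keep scanning */
                   return 1;               /* abort, wait for the victim to exit */
           }
           /* ... regular oom_badness() scoring continues here ... */
           return 0;
   }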

I think that the main open questions for the OOM reaper nowadays are
"how can we allow the OOM reaper to take mmap_sem for read (because
khugepaged might take the OOM victim's mmap_sem for write)"

----------
[  493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice child
[  493.791708] Killed process 3163 (a.out) total-vm:4268108kB, anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
[  494.838382] oom_reaper: unable to reap pid:3163 (a.out)
[  494.847768] 
[  494.847768] Showing all locks held in the system:
[  494.861357] 1 lock held by oom_reaper/59:
[  494.865903]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9f0c202d>] debug_show_all_locks+0x3d/0x1a0
[  494.872934] 1 lock held by khugepaged/63:
[  494.877426]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f1d5a4d>] khugepaged+0x99d/0x1af0
[  494.884165] 3 locks held by kswapd0/75:
[  494.888628]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
[  494.894125]  #1:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.898328]  #2:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03aeafd>] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
[  494.902703] 3 locks held by kworker/u128:31/387:
[  494.905404]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
[  494.909237]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
[  494.913205]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.916954] 1 lock held by xfsaild/sda1/422:
[  494.919288]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8828>] xfs_ilock_nowait+0x148/0x240 [xfs]
[  494.923470] 1 lock held by systemd-journal/491:
[  494.926102]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.929942] 1 lock held by gmain/745:
[  494.932368]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.936505] 1 lock held by tuned/1009:
[  494.938856]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.942824] 2 locks held by agetty/982:
[  494.944900]  #0:  (&tty->ldisc_sem){++++.+}, at: [<ffffffff9f78503f>] ldsem_down_read+0x1f/0x30
[  494.948244]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff9f4108bf>] n_tty_read+0xbf/0x8e0
[  494.952118] 1 lock held by sendmail/984:
[  494.954408]  #0:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.958370] 5 locks held by a.out/3163:
[  494.960544]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f05ca34>] __do_page_fault+0x154/0x4c0
[  494.964191]  #1:  (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
[  494.967922]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
[  494.971548]  #3:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03ae7fe>] xfs_reclaim_inodes_ag+0xae/0x4d0 [xfs]
[  494.975644]  #4:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8580>] xfs_ilock+0xc0/0x1b0 [xfs]
[  494.979194] 1 lock held by a.out/3164:
[  494.981220]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.984448] 1 lock held by a.out/3165:
[  494.986554]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.989841] 1 lock held by a.out/3166:
[  494.992089]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
[  494.995388] 1 lock held by a.out/3167:
[  494.997420]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
----------

  collapse_huge_page at mm/khugepaged.c:1001
   (inlined by) khugepaged_scan_pmd at mm/khugepaged.c:1209
   (inlined by) khugepaged_scan_mm_slot at mm/khugepaged.c:1728
   (inlined by) khugepaged_do_scan at mm/khugepaged.c:1809
   (inlined by) khugepaged at mm/khugepaged.c:1854

and "how can we close race between checking MMF_OOM_SKIP and doing last alloc_page_from_freelist()
attempt (because that race allows needlessly selecting the second task for kill)" in addition to
"how can we close race between unmap_page_range() and the page faults with retry fallback".

> 
> Certainly it's preferable to get two tasks killed than corrupted core
> dumps or corrupted memory, so if the oom reaper is to stay we need to
> document how we guarantee it's mutually exclusive with core dumping,
> and it had better not slow down the page fault fast paths, considering
> that's possible by arming page-less migration entries that can wait
> for SIGKILL to be delivered in do_swap_page.
> 
> It's a big-hammer feature that is nice to have, but doing it safely,
> and without adding branches to the fast paths, is somewhat more
> complex than the current code.

The subject of this thread is "how can we close the race between unmap_page_range()
and the page faults with retry fallback". Are you suggesting that we should remove
the OOM reaper so that we don't need to change the page fault and/or __mmput() paths?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-07 11:38   ` Michal Hocko
@ 2017-08-08 17:48     ` Andrea Arcangeli
  -1 siblings, 0 replies; 55+ messages in thread
From: Andrea Arcangeli @ 2017-08-08 17:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Tetsuo Handa, Oleg Nesterov,
	Wenwei Tao, linux-mm, LKML, Michal Hocko

Hello,

On Mon, Aug 07, 2017 at 01:38:39PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Wenwei Tao has noticed that our current assumption that the oom victim
> is dying and never doing any visible changes after it dies, and so the
> oom_reaper can tear it down, is not entirely true.
> 
> __task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT
> is set but do_group_exit sends SIGKILL to all threads _after_ the
> flag is set. So there is a race window when some threads won't have
> fatal_signal_pending while the oom_reaper could start unmapping the
> address space. Moreover some paths might not check for fatal signals
> before each PF/g-u-p/copy_from_user.
> 
> We already have a protection for oom_reaper vs. PF races by checking
> MMF_UNSTABLE. This has been, however, checked only for kernel threads
> (use_mm users) which can outlive the oom victim. A simple fix would be
> to extend the current check in handle_mm_fault for all tasks but that
> wouldn't be sufficient because the current check assumes that a kernel
> thread would bail out after EFAULT from get_user*/copy_from_user and
> never re-read the same address which would succeed because the PF path
> has established page tables already. This seems to be the case for the
> only existing use_mm user currently (virtio driver) but it is rather
> fragile in general.
> 
> This is even more fragile in general for more complex paths such as
> generic_perform_write which can re-read the same address multiple times
> (e.g. iov_iter_copy_from_user_atomic failing and then
> iov_iter_fault_in_readable on retry). Therefore we have to implement
> MMF_UNSTABLE protection in a robust way and never make potentially
> corrupted content visible. That requires hooking deeper into the PF
> path and checking for the flag _every time_ before a pte for anonymous
> memory is established (that means all !VM_SHARED mappings).
> 
> The corruption can be triggered artificially [1] but there doesn't seem
> to be any real life bug report. The race window should be quite tight
> to trigger most of the time.

The bug corrected by patch 1/2 I pointed out last week while reviewing
other oom reaper fixes, so that looks fine.

However I'd prefer to dump MMF_UNSTABLE for good instead of adding more
of it. The unmap_page_range in __oom_reap_task_mm can be replaced with a
function that arms a special migration entry, so that no branches are
added to the fast paths and it's all hidden inside the
is_migration_entry slow paths. Instead of triggering a
wait_on_page_bit(TASK_UNINTERRUPTIBLE) when is_migration_entry(entry)
is true, it will do:

   __set_current_state(TASK_KILLABLE);
   schedule();
   return VM_FAULT_SIGBUS;

Because the SIGKILL is already posted by the time the task gets woken,
the sigbus handler cannot run, because the process will exit before
returning to userland, and the error should prevent GUP from retrying
in a loop (which would happen with a regular migration entry).

It will be a page-less migration entry, so a fake, fixed,
non-page-struct-backed page pointer could be used to create the
migration entry. migration_entry_to_page will not return a page, but
such an entry can be cleared fine during exit_mmap like a regular
migration entry. No page table will be established either during those
migration entry blocking events in do_swap_page.
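
A rough sketch of how the do_swap_page() slow path could recognize such
an entry (is_oom_reaped_entry() and handle_reaped_entry() are
hypothetical names standing in for the special entry type described
above; nothing like them exists in mainline):

   static int handle_reaped_entry(struct vm_fault *vmf, swp_entry_t entry)
   {
           if (!is_oom_reaped_entry(entry))        /* hypothetical helper */
                   return 0;                       /* not ours, fall through */

           /*
            * SIGKILL is already pending for the oom victim, so let the
            * scheduler deliver it and fail the fault: the task exits
            * before the SIGBUS could reach userland, and GUP callers
            * see an error instead of looping on the entry.
            */
           __set_current_state(TASK_KILLABLE);
           schedule();
           return VM_FAULT_SIGBUS;
   }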

The above, however, looks simple compared to the core dumping. That is
additional trouble, and not just because it can call handle_mm_fault
without mmap_sem. Regardless of mmap_sem, I wonder if
SIGNAL_GROUP_COREDUMP can get set while __oom_reap_task_mm is already
running, and then what happens? It can't be OK if core dumping can run
into those page-less migration entries; if it does, there's no chance
of getting a coherent coredump after that, because the page contents
have already been freed and reused by then. There should be an
explanation of how this race against coredumping is controlled, to be
sure the oom reaper can't start during coredumping (of course the check
is already there, but I'm just wondering whether such a check leaves a
window for the race, given that there was already a race in the main
page faults).

Overall, OOM killing was reliable to me even before the oom reaper was
introduced.

I just did a search in bz for RHEL7 and there's a single bug report
related to OOM issues, but it's hanging in a non-ext4 filesystem,
not nested in alloc_pages (but in wait_for_completion), and it's not
reproducible with ext4. And it's happening only in an artificial,
specific "eatmemory" stress test from QA; there seem to be zero
customer-related bug reports about OOM hangs.

A couple of years ago I could trivially trigger OOM deadlocks on
various ext4 paths that loop or use GFP_NOFAIL, but that was just a
matter of letting GFP_NOIO/NOFS/NOFAIL kinds of allocations go through
memory reserves below the low watermark.

It is also fine to kill a few more processes, in fact. It's not the end
of the world if two tasks are killed because the first one couldn't
reach exit_mmap without oom reaper assistance. The fs kind of OOM
hangs in kernel threads are major issues if the whole filesystem, in
the journal or something, tends to prevent a multitude of tasks from
handling SIGKILL, so that has to be handled with reserves, and it
looked like it was working fine already.

The main point of the oom reaper nowadays is to free memory fast
enough so a second task isn't killed as a false positive, but it's not
like anybody will notice much of a difference if a second task is
killed; it wasn't commonly happening either.

Certainly it's preferable to get two tasks killed than corrupted core
dumps or corrupted memory, so if the oom reaper is to stay we need to
document how we guarantee it's mutually exclusive with core dumping,
and it had better not slow down the page fault fast paths, considering
that's possible by arming page-less migration entries that can wait
for SIGKILL to be delivered in do_swap_page.

It's a big-hammer feature that is nice to have, but doing it safely,
and without adding branches to the fast paths, is somewhat more
complex than the current code.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-07 11:38 [PATCH 0/2] mm, oom: fix oom_reaper fallouts Michal Hocko
@ 2017-08-07 11:38   ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-07 11:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Argangeli, Kirill A. Shutemov, Tetsuo Handa,
	Oleg Nesterov, Wenwei Tao, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies, and so the
oom_reaper can tear it down, is not entirely true.

__task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT
is set but do_group_exit sends SIGKILL to all threads _after_ the
flag is set. So there is a race window when some threads won't have
fatal_signal_pending while the oom_reaper could start unmapping the
address space. Moreover some paths might not check for fatal signals
before each PF/g-u-p/copy_from_user.

We already have a protection for oom_reaper vs. PF races by checking
MMF_UNSTABLE. This has been, however, checked only for kernel threads
(use_mm users) which can outlive the oom victim. A simple fix would be
to extend the current check in handle_mm_fault for all tasks but that
wouldn't be sufficient because the current check assumes that a kernel
thread would bail out after EFAULT from get_user*/copy_from_user and
never re-read the same address which would succeed because the PF path
has established page tables already. This seems to be the case for the
only existing use_mm user currently (virtio driver) but it is rather
fragile in general.

This is even more fragile in general for more complex paths such as
generic_perform_write which can re-read the same address multiple times
(e.g. iov_iter_copy_from_user_atomic failing and then
iov_iter_fault_in_readable on retry). Therefore we have to implement
MMF_UNSTABLE protection in a robust way and never make potentially
corrupted content visible. That requires hooking deeper into the PF
path and checking for the flag _every time_ before a pte for anonymous
memory is established (that means all !VM_SHARED mappings).

The corruption can be triggered artificially [1] but there doesn't seem
to be any real life bug report. The race window should be quite tight
to trigger most of the time.

Fixes: aac453635549 ("mm, oom: introduce oom reaper")
Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

[1] http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp
---
 include/linux/oom.h | 22 ++++++++++++++++++++++
 mm/huge_memory.c    | 30 ++++++++++++++++++++++--------
 mm/memory.c         | 46 ++++++++++++++++++++--------------------------
 3 files changed, 64 insertions(+), 34 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2be5a6..76aac4ce39bc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -6,6 +6,8 @@
 #include <linux/types.h>
 #include <linux/nodemask.h>
 #include <uapi/linux/oom.h>
+#include <linux/sched/coredump.h> /* MMF_* */
+#include <linux/mm.h> /* VM_FAULT* */
 
 struct zonelist;
 struct notifier_block;
@@ -63,6 +65,26 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
+/*
+ * Checks whether a page fault on the given mm is still reliable.
+ * This is no longer true if the oom reaper started to reap the
+ * address space which is reflected by MMF_UNSTABLE flag set in
+ * the mm. At that moment any !shared mapping would lose the content
+ * and could cause a memory corruption (zero pages instead of the
+ * original content).
+ *
+ * User should call this before establishing a page table entry for
+ * a !shared mapping and under the proper page table lock.
+ *
+ * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
+ */
+static inline int check_stable_address_space(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
+		return VM_FAULT_SIGBUS;
+	return 0;
+}
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..b03cfc0d3141 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -32,6 +32,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
+#include <linux/oom.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -550,6 +551,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	int ret = 0;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +563,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto release;
 	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -576,13 +577,14 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_none(*vmf->pmd))) {
-		spin_unlock(vmf->ptl);
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		pte_free(vma->vm_mm, pgtable);
+		goto unlock_release;
 	} else {
 		pmd_t entry;
 
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock_release;
+
 		/* Deliver the page fault to userland */
 		if (userfaultfd_missing(vma)) {
 			int ret;
@@ -610,6 +612,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	}
 
 	return 0;
+unlock_release:
+	spin_unlock(vmf->ptl);
+release:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
 }
 
 /*
@@ -688,7 +699,10 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			if (userfaultfd_missing(vma)) {
+			ret = check_stable_address_space(vma->vm_mm);
+			if (ret) {
+				spin_unlock(vmf->ptl);
+			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
diff --git a/mm/memory.c b/mm/memory.c
index 4fe5b6254688..1b4504441bd2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -68,6 +68,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/oom.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2864,6 +2865,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mem_cgroup *memcg;
 	struct page *page;
+	int ret = 0;
 	pte_t entry;
 
 	/* File mapping without ->vm_ops ? */
@@ -2896,6 +2898,9 @@ static int do_anonymous_page(struct vm_fault *vmf)
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte))
 			goto unlock;
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2930,6 +2935,10 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (!pte_none(*vmf->pte))
 		goto release;
 
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret)
+		goto release;
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2949,7 +2958,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	return ret;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
@@ -3223,7 +3232,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 int finish_fault(struct vm_fault *vmf)
 {
 	struct page *page;
-	int ret;
+	int ret = 0;
 
 	/* Did we COW the page? */
 	if ((vmf->flags & FAULT_FLAG_WRITE) &&
@@ -3231,7 +3240,15 @@ int finish_fault(struct vm_fault *vmf)
 		page = vmf->cow_page;
 	else
 		page = vmf->page;
-	ret = alloc_set_pte(vmf, vmf->memcg, page);
+
+	/*
+	 * check even for read faults because we might have lost our CoWed
+	 * page
+	 */
+	if (!(vmf->vma->vm_flags & VM_SHARED))
+		ret = check_stable_address_space(vmf->vma->vm_mm);
+	if (!ret)
+		ret = alloc_set_pte(vmf, vmf->memcg, page);
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
@@ -3871,29 +3888,6 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
-	/*
-	 * This mm has been already reaped by the oom reaper and so the
-	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
-	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-
-		/*
-		 * We are going to enforce SIGBUS but the PF path might have
-		 * dropped the mmap_sem already so take it again so that
-		 * we do not break expectations of all arch specific PF paths
-		 * and g-u-p
-		 */
-		if (ret & VM_FAULT_RETRY)
-			down_read(&vma->vm_mm->mmap_sem);
-		ret = VM_FAULT_SIGBUS;
-	}
-
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
-- 
2.13.2

^ permalink raw reply related	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2017-08-15  5:30 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-03 13:59 [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-03 13:59 ` Michal Hocko
2017-08-04  6:46 ` Tetsuo Handa
2017-08-04  7:42   ` Michal Hocko
2017-08-04  7:42     ` Michal Hocko
2017-08-04  8:25     ` Tetsuo Handa
2017-08-04  8:32       ` Michal Hocko
2017-08-04  8:32         ` Michal Hocko
2017-08-04  8:33         ` [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS Michal Hocko
2017-08-04  8:33           ` Michal Hocko
2017-08-04  8:33           ` [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-04  8:33             ` Michal Hocko
2017-08-04  9:16       ` Re: [PATCH] " Michal Hocko
2017-08-04  9:16         ` Michal Hocko
2017-08-04 10:41         ` Tetsuo Handa
2017-08-04 10:41           ` Tetsuo Handa
2017-08-04 11:00           ` Michal Hocko
2017-08-04 11:00             ` Michal Hocko
2017-08-04 14:56             ` Michal Hocko
2017-08-04 14:56               ` Michal Hocko
2017-08-04 16:49               ` Tetsuo Handa
2017-08-04 16:49                 ` Tetsuo Handa
2017-08-05  1:46               ` 陶文苇
2017-08-05  1:46                 ` 陶文苇
2017-08-07 11:38 [PATCH 0/2] mm, oom: fix oom_reaper fallouts Michal Hocko
2017-08-07 11:38 ` [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-07 11:38   ` Michal Hocko
2017-08-08 17:48   ` Andrea Arcangeli
2017-08-08 17:48     ` Andrea Arcangeli
2017-08-08 23:35     ` Tetsuo Handa
2017-08-08 23:35       ` Tetsuo Handa
2017-08-09 18:36       ` Andrea Arcangeli
2017-08-09 18:36         ` Andrea Arcangeli
2017-08-10  8:21     ` Michal Hocko
2017-08-10  8:21       ` Michal Hocko
2017-08-10 13:33       ` Michal Hocko
2017-08-10 13:33         ` Michal Hocko
2017-08-11  2:28   ` Tetsuo Handa
2017-08-11  2:28     ` Tetsuo Handa
2017-08-11  7:09     ` Michal Hocko
2017-08-11  7:09       ` Michal Hocko
2017-08-11  7:54       ` Tetsuo Handa
2017-08-11  7:54         ` Tetsuo Handa
2017-08-11 10:22         ` Andrea Arcangeli
2017-08-11 10:22           ` Andrea Arcangeli
2017-08-11 10:42           ` Andrea Arcangeli
2017-08-11 10:42             ` Andrea Arcangeli
2017-08-11 11:53             ` Tetsuo Handa
2017-08-11 11:53               ` Tetsuo Handa
2017-08-11 12:08         ` Michal Hocko
2017-08-11 12:08           ` Michal Hocko
2017-08-11 15:46           ` Tetsuo Handa
2017-08-11 15:46             ` Tetsuo Handa
2017-08-14 13:59             ` Michal Hocko
2017-08-14 13:59               ` Michal Hocko
2017-08-15  5:30               ` Tetsuo Handa
