linux-kernel.vger.kernel.org archive mirror
* Linux 2.6.38
@ 2011-03-15  1:49 Linus Torvalds
  2011-03-15  3:13 ` David Rientjes
                   ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Linus Torvalds @ 2011-03-15  1:49 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Not a lot of changes since -rc8. Most notably perhaps some late nfs
and btrfs work, and a mips update. Along with some more vfs RCU lookup
fallout (which would only be noticeable with the filesystem exported
with nfsd, which is why nobody ever noticed).

And the usual driver updates, mostly media and GPU, but some
networking too. The appended shortlog is for the changes since -rc8,
and gives some feel for it. Nothing really too exciting, I think.

As to the "big picture", ie all the changes since 2.6.37, my personal
favorite remains the VFS name lookup changes. They did end up causing
some breakage, and Al has made it clear that he wants more cleanups,
but on the whole I think it was surprisingly smooth. I think we had
more problems with random other components (nasty memory corruption in
networking etc) than with the rather fundamental path lookup change.

So I'm hoping this ends up being a fairly calm release despite some
really deep changes like that.

                     Linus

---

Abhilash K V (1):
      ASoC: AM3517: Update codec name after multi-component update

Al Viro (12):
      minimal fix for do_filp_open() race
      unfuck proc_sysctl ->d_compare()
      nd->inode is not set on the second attempt in path_walk()
      /proc/self is never going to be invalidated...
      reiserfs xattr ->d_revalidate() shouldn't care about RCU
      ceph: fix d_revalidate oopsen on NFS exports
      fuse: fix d_revalidate oopsen on NFS exports
      gfs2: fix d_revalidate oopsen on NFS exports
      ocfs2: fix d_revalidate oopsen on NFS exports
      jfs: fix d_revalidate oopsen on NFS exports
      fat: fix d_revalidate oopsen on NFS exports
      compat breakage in preadv() and pwritev()

Andrea Arcangeli (2):
      x86/mm: Fix pgd_lock deadlock
      thp: fix page_referenced to modify mapcount/vm_flags only if page is found

Andrey Vagin (1):
      x86/mm: Handle mm_fault_error() in kernel space

Andy Adamson (2):
      NFSv4: remove duplicate clientid in struct nfs_client
      NFSv4.1 reclaim complete must wait for completion

Andy Walls (2):
      [media] cx23885: Revert "Check for slave nack on all transactions"
      [media] cx23885: Remove unused 'err:' labels to quiet compiler warning

Anoop P A (2):
      MIPS: Select R4K timer lib for all MSP platforms
      MIPS: MSP: Fix MSP71xx bpci interrupt handler return value

Antony Pavlov (2):
      mtd: jedec_probe: Change variable name from cfi_p to cfi
      mtd: jedec_probe: initialise make sector erase command variable

Antti Seppälä (1):
      [media] Fix sysfs rc protocol lookup for rc-5-sz

Arnaldo Carvalho de Melo (1):
      perf symbols: Fix vmlinux path when not using --symfs

Arnaud Patard (1):
      [media] mantis_pci: remove asm/pgtable.h include

Axel Lin (3):
      mtd: add "platform:" prefix for platform modalias
      gpio: add MODULE_DEVICE_TABLE
      watchdog: hpwdt: eliminate section mismatch warning

Balbir Singh (1):
      sched: Fix sched rt group scheduling when hierachy is enabled

Ben Hutchings (1):
      sunrpc: Propagate errors from xs_bind() through xs_create_sock()

Benjamin Herrenschmidt (2):
      powerpc/iseries: Fix early init access to lppaca
      powerpc/pseries: Disable VPNH feature

Benny Halevy (1):
      NFSD: fix decode_cb_sequence4resok

Chris Mason (4):
      Btrfs: fix regressions in copy_from_user handling
      Btrfs: deal with short returns from copy_from_user
      Btrfs: make sure not to return overlapping extents to fiemap
      Btrfs: break out of shrink_delalloc earlier

Chuck Lever (1):
      NFS: NFSROOT should default to "proto=udp"

Cliff Wickman (1):
      x86, UV: Initialize the broadcast assist unit base destination node id properly

Dan Carpenter (1):
      watchdog: sch311x_wdt: fix printk condition

Daniel J Blueman (2):
      x86, build: Make sure mkpiggy fails on read error
      btrfs: fix dip leak

Daniel Turull (1):
      pktgen: fix errata in show results

Dave Airlie (3):
      drm/radeon: add pageflip hooks for fusion
      drm/radeon: fix page flipping hangs on r300/r400
      drm/radeon: fix problem with changing active VRAM size. (v2)

David Daney (5):
      MIPS: Add an unreachable return statement to satisfy buggy GCCs.
      MIPS: Fix GCC-4.6 'set but not used' warning in signal*.c
      MIPS: Remove unused code from arch/mips/kernel/syscall.c
      MIPS: Fix GCC-4.6 'set but not used' warning in ieee754int.h
      MIPS: Fix GCC-4.6 'set but not used' warning in arch/mips/mm/init.c

David Howells (2):
      MN10300: The SMP_ICACHE_INV_FLUSH_RANGE IPI command does not exist
      MN10300: atomic_read() should ensure it emits a load

David S. Miller (2):
      ipv4: Fix erroneous uses of ifa_address.
      ipv6: Don't create clones of host routes.

Deng-Cheng Zhu (5):
      MIPS, Perf-events: Work with irq_work
      MIPS, Perf-events: Work with the new PMU interface
      MIPS, Perf-events: Fix event check in validate_event()
      MIPS, Perf-events: Work with the new callchain interface
      MIPS, Perf-events: Use unsigned delta for right shift in event update

Devin Heitmueller (2):
      [media] au0828: fix VBI handling when in V4L2 streaming mode
      [media] cx18: Add support for Hauppauge HVR-1600 models with s5h1411

Dmitry Kravkov (4):
      bnx2x: fix non-pmf device load flow
      bnx2x: fix link notification
      bnx2x: (NPAR) prevent HW access in D3 state
      bnx2x: fix MaxBW configuration

Doe, YiCheng (1):
      ipmi: Fix IPMI errors due to timing problems

Florian Fainelli (3):
      r6040: bump to version 0.27 and date 23Feb2011
      MIPS: MTX-1: Make au1000_eth probe all PHY addresses
      MIPS: Alchemy: Fix reset for MTX-1 and XXS1500

Frank Filz (1):
      (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid

Grant Likely (1):
      i2c-ocores: Fix pointer type mismatch error

Göran Weinholt (1):
      net/smsc911x.c: Set the VLAN1 register to fix VLAN MTU problem

Hans de Goede (2):
      hwmon/f71882fg: Fix a typo in a comment
      hwmon/f71882fg: Set platform drvdata to NULL later

Huang Weiyi (1):
      nfs4: remove duplicated #include

Hugh Dickins (1):
      thp+memcg-numa: fix BUG at include/linux/mm.h:370!

J. Bruce Fields (2):
      nfsd4: fix bad pointer on failure to find delegation
      fs/dcache: allow d_obtain_alias() to return unhashed dentries

Jarod Wilson (3):
      [media] nuvoton-cir: fix wake from suspend
      [media] mceusb: don't claim multifunction device non-IR parts
      [media] tda829x: fix regression in probe functions

Jeff Layton (1):
      nfs: close NFSv4 COMMIT vs. CLOSE race

Jesper Juhl (1):
      SUNRPC: Remove resource leak in svc_rdma_send_error()

Jiri Slaby (1):
      watchdog: sbc_fitpc2_wdt, fix crash on systems without DMI_BOARD_NAME

Joakim Tjernlund (1):
      mtd: fix race in cfi_cmdset_0001 driver

Jon Mason (1):
      vxge: update MAINTAINERS

Jovi Zhang (1):
      nfs: fix compilation warning

Lin Ming (1):
      perf symbols: Avoid resolving [kernel.kallsyms] to real path for buildid cache

Linus Torvalds (2):
      Revert "oom: oom_kill_process: fix the child_points logic"
      Linux 2.6.38

Lukas Czerner (1):
      block: fix mis-synchronisation in blkdev_issue_zeroout()

Maksim Rayskiy (1):
      MIPS: Move idle task creation to work queue

Malcolm Priestley (1):
      [media] DM04/QQBOX memcpy to const char fix

Marco Stornelli (1):
      Check for immutable/append flag in fallocate path

Mark Brown (4):
      ASoC: Fix broken bitfield definitions in WM8978
      ASoC: Use the correct DAPM context when cleaning up final widget set
      ASoC: Fix typo in late revision WM8994 DAC2R name
      ASoC: Ensure WM8958 gets all WM8994 late revision widgets

Matt Turner (1):
      alpha: fix compile error from IRQ clean up

Mauro Carvalho Chehab (1):
      [media] ir-raw: Properly initialize the IR event (BZ#27202)

Maurus Cuelenaere (1):
      MIPS: Jz4740: Add HAVE_CLK

Maxim Levitsky (1):
      mtd: mtd_blkdevs: fix double free on error path

Miao Xie (1):
      btrfs: fix not enough reserved space

Michael (1):
      [media] ivtv: Fix corrective action taken upon DMA ERR interrupt to avoid hang

Michal Marek (1):
      kbuild: Fix computing srcversion for modules

Naga Chumbalkar (2):
      x86: Don't check for BIOS corruption in first 64K when there's no need to
      [CPUFREQ] pcc-cpufreq: don't load driver if get_freq fails during init.

Neil Horman (1):
      rds: prevent BUG_ON triggering on congestion map updates

Nicholas Bellinger (1):
      [SCSI] target: Fix t_transport_aborted handling in LUN_RESET + active I/O shutdown

Nicolas Kaiser (1):
      drivers/net/macvtap: fix error check

Nils Carlson (2):
      bonding 802.3ad: Fix the state machine locking v2
      bonding 802.3ad: Rename rx_machine_lock to state_machine_lock

Ohad Ben-Cohen (1):
      mmc: fix CONFIG_MMC_UNSAFE_RESUME regression

Oleg Nesterov (1):
      oom: oom_kill_process: fix the child_points logic

Olivier Grenie (1):
      [media] DiB7000M: add pid filtering

Pawel Osciak (1):
      [media] Fix double free of video_device in mem2mem_testdev

Rainer Weikusat (1):
      net: fix multithreaded signal handling in unix recv routines

Rajendra Nayak (1):
      i2c-omap: Program I2C_WE on OMAP4 to enable i2c wakeup

Randy Dunlap (1):
      net: bridge builtin vs. ipv6 modular

Ricardo Labiaga (1):
      NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY

Robert Millan (1):
      MIPS: Loongson: Remove ad-hoc cmdline default

Sebastian Andrzej Siewior (1):
      x86: ce4100: Set pci ops via callback instead of module init

Shawn Lin (1):
      r6040: fix multicast operations

Stanislav Fomichev (1):
      nfs: add kmalloc return value check in decode_and_add_ds

Stanislaw Gruszka (1):
      mtd: amd76xrom: fix oops at boot when resources are not available

Stefan Oberhumer (1):
      MIPS: Clear the correct flag in sysmips(MIPS_FIXADE, ...).

Stefan Weil (1):
      MIPS: Loongson: Fix potentially wrong string handling

Stephen Rothwell (2):
      sysctl: the include of rcupdate.h is only needed in the kernel
      sysctl: the include of rcupdate.h is only needed in the kernel

Sven Barth (1):
      [media] cx25840: fix probing of cx2583x chips

Takashi Iwai (1):
      drm/i915: Revive combination mode for backlight control

Thomas Gleixner (1):
      MIPS: Replace deprecated spinlock initialization

Thomas Graf (1):
      net: Enter net/ipv6/ even if CONFIG_IPV6=n

Timo Warns (1):
      Fix corrupted OSF partition table parsing

Tkhai Kirill (1):
      MN10300: Proper use of macros get_user() in the case of incremented pointers

Trond Myklebust (5):
      SUNRPC: Close a race in __rpc_wait_for_completion_task()
      NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
      NFSv4.1: Fix the handling of the SEQUENCE status bits
      NFSv4: Fix the setlk error handler
      NFSv4: nfs4_state_mark_reclaim_nograce() should be static

Vasiliy Kulikov (1):
      net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules

Wim Van Sebroeck (3):
      watchdog: cpwd: Fix buffer-overflow
      watchdog: sch311x_wdt: Fix LDN active check
      watchdog: w83697ug_wdt: Fix set bit 0 to activate GPIO2

Wolfram Sang (1):
      i2c-eg20t: include slab.h for memory allocations

Wu Zhangjin (5):
      MIPS, Tracing: Speed up function graph tracer
      MIPS, Tracing: Substitute in_kernel_space() for in_module()
      MIPS, Tracing: Clean up prepare_ftrace_return()
      MIPS, Tracing: Clean up ftrace_make_nop()
      MIPS, Tracing: Fix set_graph_function of function graph tracer

Yinghai Lu (1):
      x86, numa: Fix numa_emulation code with memory-less node0

Yoichi Yuasa (1):
      MIPS: Fix always CONFIG_LOONGSON_UART_BASE=y

j223yang@asset.uwaterloo.ca (1):
      ariadne: remove redundant NULL check

roel (1):
      nfsd: wrong index used in inner loop

sensoray-dev (1):
      [media] s2255drv: firmware re-loading changes

stephen hemminger (1):
      ip6ip6: autoload ip6 tunnel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Linux 2.6.38
  2011-03-15  1:49 Linux 2.6.38 Linus Torvalds
@ 2011-03-15  3:13 ` David Rientjes
  2011-03-15  4:06   ` Steven Rostedt
                     ` (2 more replies)
  2011-03-15  3:14 ` Steven Rostedt
  2011-03-16 17:30 ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Melchior FRANZ
  2 siblings, 3 replies; 74+ messages in thread
From: David Rientjes @ 2011-03-15  3:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Rik van Riel, KOSAKI Motohiro, Andrew Morton

This kernel includes a broken commit that was merged a couple of hours 
before release:

commit dc1b83ab08f1954335692cdcd499f78c94f4c42a
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Mon Mar 14 20:05:30 2011 +0100

    oom: oom_kill_process: fix the child_points logic
    
    oom_kill_process() starts with victim_points == 0.  This means that
    (most likely) any child has more points and can be killed erroneously.
    
    Also, "children has a different mm" doesn't match the reality, we should
    check child->mm != t->mm.  This check is not exactly correct if t->mm ==
    NULL but this doesn't really matter, oom_kill_task() will kill them
    anyway.
    
    Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
    too.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

As a result of this change, the oom killer will no longer attempt to 
sacrifice a child of the selected process in favor of the parent unless 
its memory usage exceeds the parent's (and this will be an unreachable state 
once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged 
from -mm).

This means systems running a webserver, for example, will kill the 
webserver itself in oom conditions and not one of its threads serving a 
connection; simply forking too many client connections in this scenario 
would lead to an oom condition that would kill the server instead of one 
of its threads.

Admins who find this behavior to cause disruptions in service should apply 
the following revert.

Signed-off-by: David Rientjes <rientjes@google.com>
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -458,10 +458,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim;
+	struct task_struct *victim = p;
 	struct task_struct *child;
-	struct task_struct *t;
-	unsigned int victim_points;
+	struct task_struct *t = p;
+	unsigned int victim_points = 0;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -487,15 +487,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	victim_points = oom_badness(p, mem, nodemask, totalpages);
-	victim = p;
-	t = p;
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
-			if (child->mm == t->mm)
-				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */


* Re: Linux 2.6.38
  2011-03-15  1:49 Linux 2.6.38 Linus Torvalds
  2011-03-15  3:13 ` David Rientjes
@ 2011-03-15  3:14 ` Steven Rostedt
  2011-03-15  4:15   ` Linus Torvalds
  2011-03-16 17:30 ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Melchior FRANZ
  2 siblings, 1 reply; 74+ messages in thread
From: Steven Rostedt @ 2011-03-15  3:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Mon, Mar 14, 2011 at 06:49:37PM -0700, Linus Torvalds wrote:
> 
> So I'm hoping this ends up being a fairly calm release despite some
> really deep changes like that.

It's so calm, it's like it's not even there.

-- Steve



* Re: Linux 2.6.38
  2011-03-15  3:13 ` David Rientjes
@ 2011-03-15  4:06   ` Steven Rostedt
  2011-03-15  4:14   ` Linus Torvalds
  2011-03-15  4:33   ` Andrew Morton
  2 siblings, 0 replies; 74+ messages in thread
From: Steven Rostedt @ 2011-03-15  4:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, linux-kernel, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton

On Mon, Mar 14, 2011 at 08:13:38PM -0700, David Rientjes wrote:
> This kernel includes a broken commit that was merged a couple of hours 
> before release:
> 
> commit dc1b83ab08f1954335692cdcd499f78c94f4c42a
> Author: Oleg Nesterov <oleg@redhat.com>
> Date:   Mon Mar 14 20:05:30 2011 +0100
> 
>     oom: oom_kill_process: fix the child_points logic
>

Don't worry. If you download the patch for 2.6.38, you'll see that the
revert was in the final release.

Woo hoo, ketchup is useful again!

-- Steve



* Re: Linux 2.6.38
  2011-03-15  3:13 ` David Rientjes
  2011-03-15  4:06   ` Steven Rostedt
@ 2011-03-15  4:14   ` Linus Torvalds
  2011-03-15  4:29     ` David Rientjes
  2011-03-15  4:33   ` Andrew Morton
  2 siblings, 1 reply; 74+ messages in thread
From: Linus Torvalds @ 2011-03-15  4:14 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Rik van Riel, KOSAKI Motohiro, Andrew Morton

On Mon, Mar 14, 2011 at 8:13 PM, David Rientjes <rientjes@google.com> wrote:
> This kernel includes a broken commit that was merged a couple of hours
> before release:

Actually, it doesn't. It got reverted before the release because of
the worries about it.

                     Linus


* Re: Linux 2.6.38
  2011-03-15  3:14 ` Steven Rostedt
@ 2011-03-15  4:15   ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2011-03-15  4:15 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Linux Kernel Mailing List

On Mon, Mar 14, 2011 at 8:14 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, Mar 14, 2011 at 06:49:37PM -0700, Linus Torvalds wrote:
>>
>> So I'm hoping this ends up being a fairly calm release despite some
>> really deep changes like that.
>
> It's so calm, it's like it's not even there.

Yes, it's a very Zen release.

I'd uploaded the patch and tar-ball, but forgot to actually push out.
Usually it's the other way around.

                            Linus


* Re: Linux 2.6.38
  2011-03-15  4:14   ` Linus Torvalds
@ 2011-03-15  4:29     ` David Rientjes
  0 siblings, 0 replies; 74+ messages in thread
From: David Rientjes @ 2011-03-15  4:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Rik van Riel, KOSAKI Motohiro, Andrew Morton

On Mon, 14 Mar 2011, Linus Torvalds wrote:

> > This kernel includes a broken commit that was merged a couple of hours
> > before release:
> 
> Actually, it doesn't. It got reverted before the release because of
> the worries about it.
> 

Looks good, thanks!


* Re: Linux 2.6.38
  2011-03-15  3:13 ` David Rientjes
  2011-03-15  4:06   ` Steven Rostedt
  2011-03-15  4:14   ` Linus Torvalds
@ 2011-03-15  4:33   ` Andrew Morton
  2011-03-15  4:50     ` David Rientjes
  2 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2011-03-15  4:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, linux-kernel, Rik van Riel, KOSAKI Motohiro,
	Oleg Nesterov

On Mon, 14 Mar 2011 20:13:38 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged 
> from -mm

Please (re)send the patches which you believe should be merged into
2.6.38 to address the problems which Oleg found, and any other critical
problems.  Not in a huge rush - let's get this right.


* Re: Linux 2.6.38
  2011-03-15  4:33   ` Andrew Morton
@ 2011-03-15  4:50     ` David Rientjes
  2011-03-15  6:21       ` Andrew Morton
  2011-03-15 21:08       ` Linux 2.6.38 Oleg Nesterov
  0 siblings, 2 replies; 74+ messages in thread
From: David Rientjes @ 2011-03-15  4:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, linux-kernel, Rik van Riel, KOSAKI Motohiro,
	Oleg Nesterov

On Mon, 14 Mar 2011, Andrew Morton wrote:

> > once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged 
> > from -mm
> 
> Please (re)send the patches which you believe should be merged into
> 2.6.38 to address the problems which Oleg found, and any other critical
> problems.  Not in a huge rush - let's get this right.
> 

In my testing, Oleg's three test cases that he sent to the security list 
and cc'd us on get appropriately oom killed once swap is exhausted or 
swapoff -a is used on mmotm-2011-03-10-16-42 because of these two patches:

	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
	oom-skip-zombies-when-iterating-tasklist.patch

He also presented a test case on linux-mm that caused the oom killer to 
avoid acting if a thread is ptracing a thread in the exit path with 
PTRACE_O_TRACEEXIT.  That should be fixed with

	http://marc.info/?l=linux-mm&m=129997893430351

that has yet to see -mm.  There are no other test cases that have been 
presented that cause undesired behavior.

That said, my approach to doing this has been to avoid arbitrary 
heuristics for special cases and address known issues by adding the 
appropriate logic in the oom killer.  For example, the ptrace problem that 
Oleg presented showed that the oom killer logic incorrectly deferred doing 
anything when an eligible thread was PF_EXITING.  It had done that 
believing that nothing would stop the thread from exiting or current 
would be given access to memory reserves itself and that assumption was 
broken for PTRACE_O_TRACEEXIT.  My patch above, in combination with 
Andrey's patch that only considers threads with valid mm's, fixes that 
issue because we'll now only defer if you still have an attached mm, are 
PF_EXITING, and are not being traced.

If, at some point, there is another gap in the exit code where a thread 
may hold PF_EXITING with a valid mm for an indefinite period, we'll need 
to address that in the oom killer as well.  We use PF_EXITING specifically 
in the oom killer to identify tasks that are going to exit soon and need 
handling for any case where that isn't guaranteed.  Anything else results 
in needlessly killing other tasks or, in the worst case, panicking when 
there is nothing left that is eligible.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Linux 2.6.38
  2011-03-15  4:50     ` David Rientjes
@ 2011-03-15  6:21       ` Andrew Morton
  2011-03-16  9:09         ` KOSAKI Motohiro
  2011-03-15 21:08       ` Linux 2.6.38 Oleg Nesterov
  1 sibling, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2011-03-15  6:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, linux-kernel, Rik van Riel, KOSAKI Motohiro,
	Oleg Nesterov

On Mon, 14 Mar 2011 21:50:24 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On Mon, 14 Mar 2011, Andrew Morton wrote:
> 
> > > once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged 
> > > from -mm
> > 
> > Please (re)send the patches which you believe should be merged into
> > 2.6.38 to address the problems which Oleg found, and any other critical
> > problems.  Not in a huge rush - let's get this right.
> > 
> 
> In my testing, Oleg's three test cases that he sent to the security list 
> and cc'd us on get appropriately oom killed once swap is exhausted or 
> swapoff -a is used on mmotm-2011-03-10-16-42 because of these two patches:
> 
> 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> 	oom-skip-zombies-when-iterating-tasklist.patch
> 
> He also presented a test case on linux-mm that caused the oom killer to 
> avoid acting if a thread is ptracing a thread in the exit path with 
> PTRACE_O_TRACEEXIT.  That should be fixed with
> 
> 	http://marc.info/?l=linux-mm&m=129997893430351
> 
> that has yet to see -mm.  There are no other test cases that have been 
> presented that cause undesired behavior.
> 
> That said, my approach to doing this has been to avoid arbitrary 
> heuristics for special cases and address known issues by adding the 
> appropriate logic in the oom killer.  For example, the ptrace problem that 
> Oleg presented showed that the oom killer logic incorrectly deferred doing 
> anything when an eligible thread was PF_EXITING.  It had done that 
> believing that nothing would stop the thread from exiting or current 
> would be given access to memory reserves itself and that assumption was 
> broken for PTRACE_O_TRACEEXIT.  My patch above, in combination with 
> Andrey's patch that only considers threads with valid mm's, fixes that 
> issue because we'll now only defer if you still have an attached mm, are 
> PF_EXITING, and are not being traced.
> 
> If, at some point, there is another gap in the exit code where a thread 
> may hold PF_EXITING with a valid mm for an indefinite period, we'll need 
> to address that in the oom killer as well.  We use PF_EXITING specifically 
> in the oom killer to identify tasks that are going to exit soon and need 
> handling for any case where that isn't guaranteed.  Anything else results 
> in needlessly killing other tasks or, in the worst case, panicking when 
> there is nothing left that is eligible.

So we're talking about three patches:

oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch

all appended below.

About all of which Oleg had serious complaints, some of which haven't
yet been addressed.

And that's OK.  As I said, please let's work through it and get it right.



From: David Rientjes <rientjes@google.com>

This patch prevents unnecessary oom kills or kernel panics by reverting
two commits:

	495789a5 (oom: make oom_score to per-process value)
	cef1d352 (oom: multi threaded process coredump don't make deadlock)

First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.

It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks.  If a process is oom killed and the thread group
leader exits, select_bad_process() cannot detect the other threads that
are PF_EXITING by iterating over only processes.  Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine when
nothing else is eligible.

By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.

Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current is found to be exiting.  We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no longer oom.  If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.

Without this change, it is easy for the oom killer to unnecessarily target
tasks when all threads of a victim don't exit before the thread group
leader or, in the worst case, panic the machine.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff -puN mm/oom_kill.c~oom-prevent-unnecessary-oom-kills-or-kernel-panics mm/oom_kill.c
--- a/mm/oom_kill.c~oom-prevent-unnecessary-oom-kills-or-kernel-panics
+++ a/mm/oom_kill.c
@@ -292,11 +292,11 @@ static struct task_struct *select_bad_pr
 		unsigned long totalpages, struct mem_cgroup *mem,
 		const nodemask_t *nodemask)
 {
-	struct task_struct *p;
+	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	*ppoints = 0;
 
-	for_each_process(p) {
+	do_each_thread(g, p) {
 		unsigned int points;
 
 		if (oom_unkillable_task(p, mem, nodemask))
@@ -324,7 +324,7 @@ static struct task_struct *select_bad_pr
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+		if ((p->flags & PF_EXITING) && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 
@@ -337,7 +337,7 @@ static struct task_struct *select_bad_pr
 			chosen = p;
 			*ppoints = points;
 		}
-	}
+	} while_each_thread(g, p);
 
 	return chosen;
 }
_



From: Andrey Vagin <avagin@openvz.org>

We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set.  Memory needs to be freed, so kill other
threads that pin the same ->mm or find another task to kill.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff -puN mm/oom_kill.c~oom-skip-zombies-when-iterating-tasklist mm/oom_kill.c
--- a/mm/oom_kill.c~oom-skip-zombies-when-iterating-tasklist
+++ a/mm/oom_kill.c
@@ -299,6 +299,8 @@ static struct task_struct *select_bad_pr
 	do_each_thread(g, p) {
 		unsigned int points;
 
+		if (!p->mm)
+			continue;
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
 
@@ -324,7 +326,7 @@ static struct task_struct *select_bad_pr
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if ((p->flags & PF_EXITING) && p->mm) {
+		if (p->flags & PF_EXITING) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 
_



From: David Rientjes <rientjes@google.com>

The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm.  This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed.  This is
detected by PF_EXITING since threads that have already detached their ->mm
are no longer considered at all.

The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.

This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path.  It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |   40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff -puN mm/oom_kill.c~oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced mm/oom_kill.c
--- a/mm/oom_kill.c~oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced
+++ a/mm/oom_kill.c
@@ -31,6 +31,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mempolicy.h>
 #include <linux/security.h>
+#include <linux/ptrace.h>
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
@@ -316,22 +317,29 @@ static struct task_struct *select_bad_pr
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
 
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
 		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = 1000;
+			/*
+			 * If p is the current task and is in the process of
+			 * releasing memory, we allow the "kill" to set
+			 * TIF_MEMDIE, which will allow it to gain access to
+			 * memory reserves.  Otherwise, it may stall forever.
+			 *
+			 * The loop isn't broken here, however, in case other
+			 * threads are found to have already been oom killed.
+			 */
+			if (p == current) {
+				chosen = p;
+				*ppoints = 1000;
+			} else {
+				/*
+				 * If this task is not being ptraced on exit,
+				 * then wait for it to finish before killing
+				 * some other task unnecessarily.
+				 */
+				if (!(task_ptrace(p->group_leader) &
+							PT_TRACE_EXIT))
+					return ERR_PTR(-1UL);
+			}
 		}
 
 		points = oom_badness(p, mem, nodemask, totalpages);
@@ -493,6 +501,8 @@ static int oom_kill_process(struct task_
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
+			if (child->mm == p->mm)
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
_


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Linux 2.6.38
  2011-03-15  4:50     ` David Rientjes
  2011-03-15  6:21       ` Andrew Morton
@ 2011-03-15 21:08       ` Oleg Nesterov
  1 sibling, 0 replies; 74+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, Rik van Riel,
	KOSAKI Motohiro

On 03/14, David Rientjes wrote:
>
> He also presented a test case on linux-mm that caused the oom killer to 
> avoid acting if a thread is ptracing a thread in the exit path with 
> PTRACE_O_TRACEEXIT.  That should be fixed with
>
> 	http://marc.info/?l=linux-mm&m=129997893430351

I don't think it can fix this. I didn't verify this, but the slightly
different test-case below should have the same effect.

But this doesn't matter. We can fix this particular case, and we have
the problems with the coredump anyway.

What I can't understand is what exactly the first patch tries to fix.
When I ask you, you tell me that for_each_process() can miss the group
leader because it can exit before sub-threads. This must not happen,
or we have some serious bug triggered by your workload.

So, once again. Could you please explain the original problem and how
this patch helps?

Oleg.

#include <unistd.h>
#include <signal.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <assert.h>
#include <stdio.h>

void *tfunc(void* arg)
{
	if (arg) {
		ptrace(PTRACE_TRACEME, 0,0,0);
		raise(SIGSTOP);
		pthread_kill(*(pthread_t*)arg, SIGQUIT);
	}
	pause();
	return NULL;	/* not reached; avoids falling off a non-void function */
}

int main(void)
{
	int pid;

	if (!fork()) {
		pthread_t thread1, thread2;
		pthread_create(&thread1, NULL, tfunc, NULL);
		pthread_create(&thread2, NULL, tfunc, &thread1);
		pause();
		return 0;
	}

	assert((pid = waitpid(-1, NULL, __WALL)) > 0);
	assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT) == 0);
	assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
	wait(NULL);

	pause();
	return 0;
}


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Linux 2.6.38
  2011-03-15  6:21       ` Andrew Morton
@ 2011-03-16  9:09         ` KOSAKI Motohiro
  2011-03-22 11:04           ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-16  9:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Linus Torvalds, linux-kernel,
	Rik van Riel, Oleg Nesterov


[Yesterday's earthquake was announced as magnitude 6.5, but an M6
 quake is no longer treated as significant news in this country. We
 are living in a slightly unsettled mood.]


> So we're talking about three patches:
> 
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
> oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch
> 
> all appended below.
> 
> About all of which Oleg had serious complaints, some of which haven't
> yet been addressed.
> 
> And that's OK.  As I said, please let's work through it and get it right.

I haven't understood what is "OK" here or what you want to discuss.
Probably the reason is my language skill, or that I haven't caught up
on Oleg and David's discussion. Instead, I'll post the current state
of my debugging.


 o vmscan.c#all_unreclaimable() might return a false negative and
   mistakenly prevent the oom-killer from running. Why? zone->pages_scanned
   is not protected by any lock; in other words, it's an unstable value.
   On top of that, the x86 ZONE_DMA has only very little memory, so once
   it becomes all_unreclaimable=yes it usually never recovers to
   all_unreclaimable=no. So if the zone state becomes inconsistent
   (e.g. pages_scanned=0 and all_unreclaimable=yes), it can never be
   recovered. I mean, I could reproduce the issue Andrey reported.

 o oom_kill.c#boost_dying_task_prio() makes the kernel hang if the user
   is using cpu cgroups, because the cpu cgroup has an inadequate default
   rt_runtime_us (0 by default; 0 means RT tasks can't run at all).

 o The TIF_MEMDIE check in oom_kill.c makes the kernel hang. I haven't
   caught the exact reason an oom-killed process gets stuck even though
   the zone has enough memory.

What I dislike is that many people on the list are happy to flame, but
almost nobody except a few developers runs the real code or joins in
debugging the real, actually reported issue. In fact, Andrey made a
testcase, described his test environment, and helped us build a
reproduction environment.

I also dislike that some developers say they haven't seen an oom
livelock case yet. It indicates they haven't stress-tested an oom
workload. How can I trust an untested patch, or an untested developer?
Every developer should test until they have seen an oom livelock.

I know oom debugging is very painful and takes a lot of time: many
false positives, many unfixable livelocks, a million resets needed.
But I don't think that is a good reason to accept untested patches.

Right now I only have access to a three-year-old PC. So there is no
reason anyone else can't debug the issue.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38)
  2011-03-15  1:49 Linux 2.6.38 Linus Torvalds
  2011-03-15  3:13 ` David Rientjes
  2011-03-15  3:14 ` Steven Rostedt
@ 2011-03-16 17:30 ` Melchior FRANZ
  2011-03-16 19:22   ` i915/kms regression after 2.6.38-rc8 Jiri Slaby
                     ` (2 more replies)
  2 siblings, 3 replies; 74+ messages in thread
From: Melchior FRANZ @ 2011-03-16 17:30 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
with KMS. On 2.6.38 I get a black screen instead. In case anyone
cares, just tell me what information you need. (bisect result on
request, but I assume the experts know which patch caused it.)

m.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: i915/kms regression after 2.6.38-rc8
  2011-03-16 17:30 ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Melchior FRANZ
@ 2011-03-16 19:22   ` Jiri Slaby
  2011-03-16 19:43   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Chris Wilson
  2011-03-20 18:30   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Maciej Rutecki
  2 siblings, 0 replies; 74+ messages in thread
From: Jiri Slaby @ 2011-03-16 19:22 UTC (permalink / raw)
  To: Melchior FRANZ; +Cc: Linux Kernel Mailing List, Chris Wilson, dri-devel

Ccing some relevant people (you should have done this).

On 03/16/2011 06:30 PM, Melchior FRANZ wrote:
> On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
> with KMS. On 2.6.38 I get a black screen instead. In case anyone
> cares, just tell me what information you need. (bisect result on
> request, but I assume the experts know which patch caused it.)


-- 
js

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38)
  2011-03-16 17:30 ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Melchior FRANZ
  2011-03-16 19:22   ` i915/kms regression after 2.6.38-rc8 Jiri Slaby
@ 2011-03-16 19:43   ` Chris Wilson
  2011-03-16 21:09     ` i915/kms regression after 2.6.38-rc8 Melchior FRANZ
  2011-03-20 18:30   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Maciej Rutecki
  2 siblings, 1 reply; 74+ messages in thread
From: Chris Wilson @ 2011-03-16 19:43 UTC (permalink / raw)
  To: Melchior FRANZ, Linux Kernel Mailing List

On Wed, 16 Mar 2011 18:30:51 +0100, Melchior FRANZ <melchior.franz@gmail.com> wrote:
> On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
> with KMS. On 2.6.38 I get a black screen instead. In case anyone
> cares, just tell me what information you need. (bisect result on
> request, but I assume the experts know which patch caused it.)

There's only one patch directly related to i915, so you could begin there.
Useful information to include is a dmesg (particularly one with
drm.debug=0xe kernel parameters) and lspci (though google says you have a
gm45).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: i915/kms regression after 2.6.38-rc8
  2011-03-16 19:43   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Chris Wilson
@ 2011-03-16 21:09     ` Melchior FRANZ
  0 siblings, 0 replies; 74+ messages in thread
From: Melchior FRANZ @ 2011-03-16 21:09 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Linux Kernel Mailing List, dri-devel

[-- Attachment #1: Type: Text/Plain, Size: 576 bytes --]

* Chris Wilson -- Wednesday 16 March 2011:
> There's only one patch directly related to i915, so you could begin there.

I'll try later. Was just too obvious for now.  :-}



> Useful information to include is a dmesg (particularly one with
> drm.debug=0xe kernel parameters) and lspci

OK, thanks for that info. Attached.



> (though google says you have a gm45).

I didn't look it up before, I just trusted the kernel with its "i915".
But you are probably right, this Acer TravelMate 5735Z-452G32Mnss is
supposed to have an "Intel Graphics Media Accelerator 4500MHD". 

m.

[-- Attachment #2: acer_5735z_i915_blackout.log.gz --]
[-- Type: application/x-gzip, Size: 16849 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38)
  2011-03-16 17:30 ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Melchior FRANZ
  2011-03-16 19:22   ` i915/kms regression after 2.6.38-rc8 Jiri Slaby
  2011-03-16 19:43   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Chris Wilson
@ 2011-03-20 18:30   ` Maciej Rutecki
  2 siblings, 0 replies; 74+ messages in thread
From: Maciej Rutecki @ 2011-03-20 18:30 UTC (permalink / raw)
  To: Melchior FRANZ; +Cc: Linux Kernel Mailing List

On środa, 16 marca 2011 o 18:30:51 Melchior FRANZ wrote:
> On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
> with KMS. On 2.6.38 I get a black screen instead. In case anyone
> cares, just tell me what information you need. (bisect result on
> request, but I assume the experts know which patch caused it.)
> 
> m.

I created a Bugzilla entry at 
https://bugzilla.kernel.org/show_bug.cgi?id=31522
for your bug report, please add your address to the CC list in there, thanks!

-- 
Maciej Rutecki
http://www.maciek.unixy.pl

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [patch 0/5] oom: a few anti fork bomb patches
  2011-03-16  9:09         ` KOSAKI Motohiro
@ 2011-03-22 11:04           ` KOSAKI Motohiro
  2011-03-22 11:05             ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
                               ` (4 more replies)
  0 siblings, 5 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: kosaki.motohiro, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki

[-- Attachment #1: Type: text/plain, Size: 294 bytes --]

Hi

I'm back. Andrey's (attached) fork bomb testcase effectively kills
my machine when swap is disabled. Therefore, I've made a few patches
against Andrey's test.

These patches only avoid the kernel livelock; they don't wipe out the
fork bombs themselves. Kamezawa-san is working on that.

Comments are welcome.

[-- Attachment #2: memeater.py --]
[-- Type: application/octet-stream, Size: 695 bytes --]

import sys, time, mmap, os
from subprocess import Popen, PIPE
import random

global mem_size

def info(msg):
	pid = os.getpid()
	print >> sys.stderr, "%s: %s" % (pid, msg)
	sys.stderr.flush()



def memory_loop(cmd = "a"):
	"""
	cmd may be:
		c: check memory
		else: touch memory
	"""
	c = 0
	for j in xrange(0, mem_size):
		if cmd == "c":
			if f[j<<12] != chr(j % 255):
				info("Data corruption")
				sys.exit(1)
		else:
			f[j<<12] = chr(j % 255)

while True:
	pid = os.fork()
	if (pid != 0):
		mem_size = random.randint(0, 56 * 4096)
		f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
		memory_loop()
		memory_loop("c")
		f.close()

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-22 11:04           ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
@ 2011-03-22 11:05             ` KOSAKI Motohiro
  2011-03-22 14:49               ` Minchan Kim
  2011-03-23  7:41               ` KAMEZAWA Hiroyuki
  2011-03-22 11:08             ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
                               ` (3 subsequent siblings)
  4 siblings, 2 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim, Johannes Weiner

The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.

	2006 Sep 25; commit 408d8544; oom: use unreclaimable info

And it went through a strange history. First, the following commit broke
the logic unintentionally.

	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations

Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.

	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0

But the logic didn't work when a 32-bit highmem system went into
hibernation, and Minchan slightly changed the algorithm and fixed it.

	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path

But recently Andrey Vagin found a new corner case. Look:

	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}

zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can reach a state
of zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
the current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.

Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone, and it becomes zone->all_unreclaimable=1 easily. And
once it becomes all_unreclaimable, it never returns to
all_unreclaimable=0 because it typically has no reclaimable pages.

Eventually, the oom-killer never works on such systems. Let's remove
this problematic logic completely.

Reported-by: Andrey Vagin <avagin@openvz.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   36 +-----------------------------------
 1 files changed, 1 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..254aada 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
 }
 
 /*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
-static bool all_unreclaimable(struct zonelist *zonelist,
-		struct scan_control *sc)
-{
-	struct zoneref *z;
-	struct zone *zone;
-	bool all_unreclaimable = true;
-
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-			gfp_zone(sc->gfp_mask), sc->nodemask) {
-		if (!populated_zone(zone))
-			continue;
-		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-			continue;
-		if (zone_reclaimable(zone)) {
-			all_unreclaimable = false;
-			break;
-		}
-	}
-
-	return all_unreclaimable;
-}
-
-/*
  * This is the main entry point to direct page reclaim.
  *
  * If a full scan of the inactive list fails to free enough memory then we
@@ -2105,14 +2078,7 @@ out:
 	delayacct_freepages_end();
 	put_mems_allowed();
 
-	if (sc->nr_reclaimed)
-		return sc->nr_reclaimed;
-
-	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
-		return 1;
-
-	return 0;
+	return sc->nr_reclaimed;
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 3/5] oom: create oom autogroup
  2011-03-22 11:04           ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
  2011-03-22 11:05             ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
@ 2011-03-22 11:08             ` KOSAKI Motohiro
  2011-03-22 23:21               ` Minchan Kim
  2011-03-22 11:08             ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Mike Galbraith

When plenty of processes (e.g. a fork bomb) are running, the TIF_MEMDIE
task never exits; at least, to a human it feels like never. Therefore
the kernel appears to hang.

"perf sched" tell us a hint.

 ------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Average delay ms | Maximum delay ms |
 ------------------------------------------------------------------------------
  python:1754           |      0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
  python:1843           |      0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
  python:1715           |      0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
  python:1818           |      2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
  ...
  ...

The process flood causes crazy scheduler delays, and then the victim
process can't run enough. Grr. What should we do?

Fortunately, we already have an anti-process-flood framework:
autogroup! This patch reuses that framework to avoid the kernel
livelock.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/oom.h      |    1 +
 include/linux/sched.h    |    4 ++++
 init/main.c              |    2 ++
 kernel/sched_autogroup.c |    4 ++--
 mm/oom_kill.c            |   23 +++++++++++++++++++++++
 5 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..86bcea3 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,6 +67,7 @@ extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
 		      const nodemask_t *nodemask, unsigned long uptime);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern void oom_init(void);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fc7ed..bdaad3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1947,6 +1947,8 @@ int sched_rt_handler(struct ctl_table *table, int write,
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 
+extern struct autogroup *autogroup_create(void);
+extern void autogroup_move_group(struct task_struct *p, struct autogroup *ag);
 extern void sched_autogroup_create_attach(struct task_struct *p);
 extern void sched_autogroup_detach(struct task_struct *p);
 extern void sched_autogroup_fork(struct signal_struct *sig);
@@ -1956,6 +1958,8 @@ extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_fil
 extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
 #endif
 #else
+extern struct autogroup *autogroup_create(void) { return NULL; }
+extern void autogroup_move_group(struct task_struct *p, struct autogroup *ag) {}
 static inline void sched_autogroup_create_attach(struct task_struct *p) { }
 static inline void sched_autogroup_detach(struct task_struct *p) { }
 static inline void sched_autogroup_fork(struct signal_struct *sig) { }
diff --git a/init/main.c b/init/main.c
index 4a9479e..2c6e8da 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/perf_event.h>
+#include <linux/oom.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -549,6 +550,7 @@ asmlinkage void __init start_kernel(void)
 	gfp_allowed_mask = __GFP_BITS_MASK;
 
 	kmem_cache_init_late();
+	oom_init();
 
 	/*
 	 * HACK ALERT! This is early. We're enabling the console before
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 5946ac5..6a1a2c4 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -63,7 +63,7 @@ static inline struct autogroup *autogroup_task_get(struct task_struct *p)
 static void free_rt_sched_group(struct task_group *tg);
 #endif
 
-static inline struct autogroup *autogroup_create(void)
+struct autogroup *autogroup_create(void)
 {
 	struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
 	struct task_group *tg;
@@ -143,7 +143,7 @@ autogroup_task_group(struct task_struct *p, struct task_group *tg)
 	return tg;
 }
 
-static void
+void
 autogroup_move_group(struct task_struct *p, struct autogroup *ag)
 {
 	struct autogroup *prev;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 739dee4..2519e6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,6 +38,28 @@ int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
+#ifdef CONFIG_SCHED_AUTOGROUP
+struct autogroup *oom_ag;
+
+void __init oom_init(void)
+{
+	oom_ag = autogroup_create();
+}
+
+static void oom_move_oom_ag(struct task_struct *p)
+{
+	autogroup_move_group(p, oom_ag);
+}
+#else
+void __init oom_init(void)
+{
+}
+
+static void oom_move_oom_ag(struct task_struct *p)
+{
+}
+#endif
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -432,6 +454,7 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 		}
 
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	oom_move_oom_ag(p);
 	force_sig(SIGKILL, p);
 
 	return 0;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 4/5] mm: introduce wait_on_page_locked_killable
  2011-03-22 11:04           ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
  2011-03-22 11:05             ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
  2011-03-22 11:08             ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
@ 2011-03-22 11:08             ` KOSAKI Motohiro
  2011-03-23  7:44               ` KAMEZAWA Hiroyuki
  2011-03-24 15:04               ` Minchan Kim
  2011-03-22 11:09             ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
       [not found]             ` <20110322200657.B064.A69D9226@jp.fujitsu.com>
  4 siblings, 2 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki

commit 2687a356 (Add lock_page_killable) introduced a killable
lock_page(). Similarly, this patch introduces a killable
wait_on_page_locked().

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/pagemap.h |    9 +++++++++
 mm/filemap.c            |   11 +++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e407601..49f9315 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -369,6 +369,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
 
+extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+
+static inline int wait_on_page_locked_killable(struct page *page)
+{
+	if (PageLocked(page))
+		return wait_on_page_bit_killable(page, PG_locked);
+	return 0;
+}
+
 /* 
  * Wait for a page to be unlocked.
  *
diff --git a/mm/filemap.c b/mm/filemap.c
index a6cfecf..f5f9ac2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -608,6 +608,17 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+int wait_on_page_bit_killable(struct page *page, int bit_nr)
+{
+	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+	if (!test_bit(bit_nr, &page->flags))
+		return 0;
+
+	return __wait_on_bit(page_waitqueue(page), &wait, sync_page_killable,
+							TASK_KILLABLE);
+}
+
 /**
  * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
  * @page: Page defining the wait queue of interest
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-22 11:04           ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
                               ` (2 preceding siblings ...)
  2011-03-22 11:08             ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
@ 2011-03-22 11:09             ` KOSAKI Motohiro
  2011-03-23  7:49               ` KAMEZAWA Hiroyuki
                                 ` (2 more replies)
       [not found]             ` <20110322200657.B064.A69D9226@jp.fujitsu.com>
  4 siblings, 3 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki

When the oom killer triggers, almost all processes get stuck at one of
the following two points.

	1) __alloc_pages_nodemask
	2) __lock_page_or_retry

1) is not very problematic, because TIF_MEMDIE makes the allocation
fail and gets the task out of the page allocator. 2) is more
problematic. In an OOM situation, zones typically have no page cache at
all, and memory starvation can reduce IO performance greatly. When a
fork bomb occurs, the TIF_MEMDIE task doesn't die quickly, which means
the fork bomb may create new processes faster than the oom-killer kills
them. Then the system may become livelocked.

This patch makes pagefault interruptible by SIGKILL.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 arch/x86/mm/fault.c |    9 +++++++++
 include/linux/mm.h  |    1 +
 mm/filemap.c        |   22 +++++++++++++++++-----
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..797c7d0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	if (user_mode_vm(regs)) {
 		local_irq_enable();
 		error_code |= PF_USER;
+		flags |= FAULT_FLAG_KILLABLE;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
@@ -1138,6 +1139,14 @@ good_area:
 	}
 
 	/*
+	 * Pagefault was interrupted by SIGKILL. We have no reason to
+	 * continue pagefault.
+	 */
+	if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
+	    fatal_signal_pending(current))
+		return;
+
+	/*
 	 * Major/minor page fault accounting is only done on the
 	 * initial attempt. If we go through a retry, it is extremely
 	 * likely that the page will be found in page cache at that point.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0716517..9e7c567 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -152,6 +152,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
 #define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
+#define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/mm/filemap.c b/mm/filemap.c
index f5f9ac2..affba94 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -719,15 +719,27 @@ void __lock_page_nosync(struct page *page)
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
-		__lock_page(page);
-		return 1;
-	} else {
+	int ret;
+
+	if (flags & FAULT_FLAG_ALLOW_RETRY) {
 		if (!(flags & FAULT_FLAG_RETRY_NOWAIT)) {
 			up_read(&mm->mmap_sem);
-			wait_on_page_locked(page);
+			if (flags & FAULT_FLAG_KILLABLE)
+				wait_on_page_locked_killable(page);
+			else
+				wait_on_page_locked(page);
 		}
 		return 0;
+	} else {
+		if (flags & FAULT_FLAG_KILLABLE) {
+			ret = __lock_page_killable(page);
+			if (ret) {
+				up_read(&mm->mmap_sem);
+				return 0;
+			}
+		} else
+			__lock_page(page);
+		return 1;
 	}
 }
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-22 11:05             ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
@ 2011-03-22 14:49               ` Minchan Kim
  2011-03-23  5:21                 ` KOSAKI Motohiro
  2011-03-23  7:41               ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-22 14:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

Hi Kosaki,

On Tue, Mar 22, 2011 at 08:05:55PM +0900, KOSAKI Motohiro wrote:
> The all_unreclaimable check in direct reclaim was introduced in 2.6.19
> by the following commit.
> 
> 	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
> 
> And it went through a strange history. First, the following commit
> broke the logic unintentionally.
> 
> 	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> 				      costly-order allocations
> 
> Two years later, I found an obviously meaningless code fragment and
> restored the original intention with the following commit.
> 
> 	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> 				      return value when priority==0
> 
> But the logic didn't work when a 32bit highmem system went into
> hibernation, so Minchan slightly changed the algorithm and fixed it.
> 
> 	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> 				      in direct reclaim path
> 
> But, recently, Andrey Vagin found the new corner case. Look,
> 
> 	struct zone {
> 	  ..
> 	        int                     all_unreclaimable;
> 	  ..
> 	        unsigned long           pages_scanned;
> 	  ..
> 	}
> 
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock. Therefore a zone can reach the state
> zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,

Possible although it's very rare.

> the current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.

The case is very rare since we reset zone->all_unreclaimable to zero
right before resetting zone->pages_scanned to zero.
But I admit it's possible.

        CPU 0                                           CPU 1
free_pcppages_bulk                              balance_pgdat
        zone->all_unreclaimable = 0
                                                        zone->all_unreclaimable = 1
        zone->pages_scanned = 0
> 
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it becase all_unreclaimable, it never return all_unreclaimable=0
        ^^^^^ it's very important verb.    ^^^^^ return? reset?

        I can't understand your point due to the typo. Please correct the typo.

> beucase it typicall don't have reclaimable pages.

If the DMA zone has very few reclaimable pages, or none at all,
zone_reclaimable() can easily return false, so all_unreclaimable() could
return true. Eventually the oom-killer might work.

In my test I saw the livelock too, so apparently we have a problem.
I couldn't dig into it recently because of other urgent work.
I think you know the root cause, but the description in this patch isn't
enough to persuade me.

Could you explain the root cause in detail?

> 
> Eventually, oom-killer never works on such systems.  Let's remove
> this problematic logic completely.
> 
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmscan.c |   36 +-----------------------------------
>  1 files changed, 1 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..254aada 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
>  }
>  
>  /*
> - * As hibernation is going on, kswapd is freezed so that it can't mark
> - * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> - * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> - */
> -static bool all_unreclaimable(struct zonelist *zonelist,
> -		struct scan_control *sc)
> -{
> -	struct zoneref *z;
> -	struct zone *zone;
> -	bool all_unreclaimable = true;
> -
> -	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> -			gfp_zone(sc->gfp_mask), sc->nodemask) {
> -		if (!populated_zone(zone))
> -			continue;
> -		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> -			continue;
> -		if (zone_reclaimable(zone)) {
> -			all_unreclaimable = false;
> -			break;
> -		}
> -	}
> -
> -	return all_unreclaimable;
> -}
> -
> -/*
>   * This is the main entry point to direct page reclaim.
>   *
>   * If a full scan of the inactive list fails to free enough memory then we
> @@ -2105,14 +2078,7 @@ out:
>  	delayacct_freepages_end();
>  	put_mems_allowed();
>  
> -	if (sc->nr_reclaimed)
> -		return sc->nr_reclaimed;
> -
> -	/* top priority shrink_zones still had more to do? don't OOM, then */
> -	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> -		return 1;
> -
> -	return 0;
> +	return sc->nr_reclaimed;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> -- 
> 1.6.5.2
> 
> 
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] oom: create oom autogroup
  2011-03-22 11:08             ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
@ 2011-03-22 23:21               ` Minchan Kim
  2011-03-23  1:27                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-22 23:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Mike Galbraith

On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> When plenty of processes (e.g. a fork bomb) are running, the TIF_MEMDIE
> task never exits; at least, to a human it feels that way. Therefore the
> kernel appears to hang.
>
> "perf sched" gives us a hint.
>
>  ------------------------------------------------------------------------------
>  Task                  |   Runtime ms  | Average delay ms | Maximum delay ms |
>  ------------------------------------------------------------------------------
>  python:1754           |      0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
>  python:1843           |      0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
>  python:1715           |      0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
>  python:1818           |      2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
>  ...
>  ...
>
> The process flood causes crazy scheduler delays, and then the victim
> process can't run enough. Grr. What should we do?
>
> Fortunately, we already have an anti-process-flood framework: autogroup!
> This patch reuses that framework to avoid the kernel livelock.

That's a cool idea, but I have a concern.

You remove the priority boosting in [2/5] and move victim tasks into an
autogroup. If I understand autogroup correctly, the victim process and
the threads in it get less chance to be scheduled than now.

Could it cause unnecessary killing of other tasks?
I am not sure. Just out of curiosity.

Thanks for the nice work, Kosaki.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] oom: create oom autogroup
  2011-03-22 23:21               ` Minchan Kim
@ 2011-03-23  1:27                 ` KOSAKI Motohiro
  2011-03-23  2:41                   ` Mike Galbraith
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23  1:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Mike Galbraith,
	Luis Claudio R. Goncalves

> On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > When plenty of processes (e.g. a fork bomb) are running, the TIF_MEMDIE
> > task never exits; at least, to a human it feels that way. Therefore the
> > kernel appears to hang.
> >
> > "perf sched" gives us a hint.
> >
> >  ------------------------------------------------------------------------------
> >  Task                  |   Runtime ms  | Average delay ms | Maximum delay ms |
> >  ------------------------------------------------------------------------------
> >  python:1754           |      0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> >  python:1843           |      0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> >  python:1715           |      0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> >  python:1818           |      2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> >  ...
> >  ...
> >
> > The process flood causes crazy scheduler delays, and then the victim
> > process can't run enough. Grr. What should we do?
> >
> > Fortunately, we already have an anti-process-flood framework: autogroup!
> > This patch reuses that framework to avoid the kernel livelock.
> 
> That's a cool idea, but I have a concern.
> 
> You remove the priority boosting in [2/5] and move victim tasks into an
> autogroup. If I understand autogroup correctly, the victim process and
> the threads in it get less chance to be scheduled than now.

Right. The icky cpu-cgroup rt-runtime default forces me to seek another
solution. Today I got a private mail from Luis, and he seems to have
another improvement idea, so I might drop this patch if his works fine.

> Could it cause unnecessary killing of other tasks?
> I am not sure. Just out of curiosity.

If you are talking about OOM serialization, it doesn't change. I don't
touch the OOM serialization stuff, at least for now.
If you are talking about scheduler fairness, both the current code and
this patch are unfair. But that's OK: 1) in an OOM situation, scheduling
fairness has already been broken by the heavy memory reclaim effort, and
2) autogroup is meant to change scheduling grouping *automatically*, so
the patch just changes it again to get out of memory starvation.

> 
> Thanks for the nice work, Kosaki.







^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] oom: create oom autogroup
  2011-03-23  1:27                 ` KOSAKI Motohiro
@ 2011-03-23  2:41                   ` Mike Galbraith
  0 siblings, 0 replies; 74+ messages in thread
From: Mike Galbraith @ 2011-03-23  2:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
	Luis Claudio R. Goncalves

On Wed, 2011-03-23 at 10:27 +0900, KOSAKI Motohiro wrote:
> > On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > When plenty of processes (e.g. a fork bomb) are running, the TIF_MEMDIE
> > > task never exits; at least, to a human it feels that way. Therefore the
> > > kernel appears to hang.
> > >
> > > "perf sched" gives us a hint.
> > >
> > >  ------------------------------------------------------------------------------
> > >  Task                  |   Runtime ms  | Average delay ms | Maximum delay ms |
> > >  ------------------------------------------------------------------------------
> > >  python:1754           |      0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> > >  python:1843           |      0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> > >  python:1715           |      0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> > >  python:1818           |      2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> > >  ...
> > >  ...
> > >
> > > The process flood causes crazy scheduler delays, and then the victim
> > > process can't run enough. Grr. What should we do?
> > >
> > > Fortunately, we already have an anti-process-flood framework: autogroup!
> > > This patch reuses that framework to avoid the kernel livelock.
> > 
> > That's a cool idea, but I have a concern.
> > 
> > You remove the priority boosting in [2/5] and move victim tasks into an
> > autogroup. If I understand autogroup correctly, the victim process and
> > the threads in it get less chance to be scheduled than now.
> 
> Right. The icky cpu-cgroup rt-runtime default forces me to seek another solution.

I was going to mention rt, and there's s/fork/clone as well.

> Today I got a private mail from Luis, and he seems to have another
> improvement idea, so I might drop this patch if his works fine.

Perhaps if TIF_MEMDIE threads need special treatment, preemption tests
could take that into account?  (though I don't like touching the fast
path for oddball cases)

	-Mike






^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-22 14:49               ` Minchan Kim
@ 2011-03-23  5:21                 ` KOSAKI Motohiro
  2011-03-23  6:59                   ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23  5:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

Hi Minchan,

> > zone->all_unreclaimable and zone->pages_scanned are neither atomic
> > variables nor protected by a lock. Therefore a zone can reach the state
> > zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
> 
> Possible although it's very rare.

Can you test Andrey's case yourself on an x86 box? It seems
reproducible.

> > the current all_unreclaimable() returns false even though
> > zone->all_unreclaimable=1.
> 
> The case is very rare since we reset zone->all_unreclaimable to zero
> right before resetting zone->pages_scanned to zero.
> But I admit it's possible.

Please apply this patch and run the oom-killer. You may see the following
pages_scanned:0 and all_unreclaimable:yes combination, like below
(but you may need >30min).

	Node 0 DMA free:4024kB min:40kB low:48kB high:60kB active_anon:11804kB 
	inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB 
	isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB 
	dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
	slab_unreclaimable:0kB kernel_stack:0kB pagetables:68kB unstable:0kB 
	bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes


> 
>         CPU 0                                           CPU 1
> free_pcppages_bulk                              balance_pgdat
>         zone->all_unreclaimable = 0
>                                                         zone->all_unreclaimable = 1
>         zone->pages_scanned = 0
> > 
> > Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> > small dma zone and it become zone->all_unreclamble=1 easily. and
> > if it becase all_unreclaimable, it never return all_unreclaimable=0
>         ^^^^^ it's very important verb.    ^^^^^ return? reset?
> 
>         I can't understand your point due to the typo. Please correct the typo.
> 
> > beucase it typicall don't have reclaimable pages.
> 
> If the DMA zone has very few reclaimable pages, or none at all,
> zone_reclaimable() can easily return false, so all_unreclaimable() could
> return true. Eventually the oom-killer might work.

The point is, vmscan has the following all_unreclaimable check in several places.

                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
                                continue;

But if the zone has only a few LRU pages, get_scan_count(DEF_PRIORITY)
returns a {0, 0, 0, 0} array. That means the zone will never scan LRU
pages again; therefore the falsely small pages_scanned can't be corrected.

Then the false-negative all_unreclaimable() can't be corrected either.


Btw, why does get_scan_count() return 0 instead of 1? Why don't we round up?
Git log says it is intentional.

	commit e0f79b8f1f3394bb344b7b83d6f121ac2af327de
	Author: Johannes Weiner <hannes@saeurebad.de>
	Date:   Sat Oct 18 20:26:55 2008 -0700

	    vmscan: don't accumulate scan pressure on unrelated lists

> 
> In my test I saw the livelock too, so apparently we have a problem.
> I couldn't dig into it recently because of other urgent work.
> I think you know the root cause, but the description in this patch isn't
> enough to persuade me.
> 
> Could you explain the root cause in detail?

If you have another idea for a fix, please let me know. :)




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  5:21                 ` KOSAKI Motohiro
@ 2011-03-23  6:59                   ` Minchan Kim
  2011-03-23  7:13                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-23  6:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Wed, Mar 23, 2011 at 2:21 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Minchan,
>
>> > zone->all_unreclaimable and zone->pages_scanned are neither atomic
>> > variables nor protected by a lock. Therefore a zone can reach the state
>> > zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
>>
>> Possible although it's very rare.
>
> Can you test Andrey's case yourself on an x86 box? It seems
> reproducible.
>
>> > the current all_unreclaimable() returns false even though
>> > zone->all_unreclaimable=1.
>>
>> The case is very rare since we reset zone->all_unreclaimable to zero
>> right before resetting zone->pages_scanned to zero.
>> But I admit it's possible.
>
> Please apply this patch and run the oom-killer. You may see the following
> pages_scanned:0 and all_unreclaimable:yes combination, like below
> (but you may need >30min).
>
>        Node 0 DMA free:4024kB min:40kB low:48kB high:60kB active_anon:11804kB
>        inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB
>        isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB
>        dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>        slab_unreclaimable:0kB kernel_stack:0kB pagetables:68kB unstable:0kB
>        bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>
>
>>
>>         CPU 0                                           CPU 1
>> free_pcppages_bulk                              balance_pgdat
>>         zone->all_unreclaimable = 0
>>                                                         zone->all_unreclaimable = 1
>>         zone->pages_scanned = 0
>> >
>> > Is this ignorable minor issue? No. Unfortunatelly, x86 has very
>> > small dma zone and it become zone->all_unreclamble=1 easily. and
>> > if it becase all_unreclaimable, it never return all_unreclaimable=0
>>         ^^^^^ it's very important verb.    ^^^^^ return? reset?
>>
>>         I can't understand your point due to the typo. Please correct the typo.
>>
>> > beucase it typicall don't have reclaimable pages.
>>
>> If the DMA zone has very few reclaimable pages, or none at all,
>> zone_reclaimable() can easily return false, so all_unreclaimable() could
>> return true. Eventually the oom-killer might work.
>
> The point is, vmscan has the following all_unreclaimable check in several places.
>
>                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>                                continue;
>
> But if the zone has only a few LRU pages, get_scan_count(DEF_PRIORITY)
> returns a {0, 0, 0, 0} array. That means the zone will never scan LRU
> pages again; therefore the falsely small pages_scanned can't be corrected.
>
> Then the false-negative all_unreclaimable() can't be corrected either.
>
>
> Btw, why does get_scan_count() return 0 instead of 1? Why don't we round up?
> Git log says it is intentional.
>
>        commit e0f79b8f1f3394bb344b7b83d6f121ac2af327de
>        Author: Johannes Weiner <hannes@saeurebad.de>
>        Date:   Sat Oct 18 20:26:55 2008 -0700
>
>            vmscan: don't accumulate scan pressure on unrelated lists
>
>>
>> In my test I saw the livelock too, so apparently we have a problem.
>> I couldn't dig into it recently because of other urgent work.
>> I think you know the root cause, but the description in this patch isn't
>> enough to persuade me.
>>
>> Could you explain the root cause in detail?
>
> If you have another idea for a fix, please let me know. :)
>
>
>
>

Okay. I got it.

The problem is as follows.
Due to the race between free_pcppages_bulk and balance_pgdat, it is
possible to end up with zone->all_unreclaimable = 1 and
zone->pages_scanned = 0. The DMA zone has few LRU pages, and in the case
of no swap and big memory pressure there could be just one page in the
inactive file list, like your example (anon LRU pages don't matter on a
non-swap system). In that case shrink_zones doesn't scan the page at all
until priority becomes 0, since get_scan_count does scan >>= priority
(which is mostly zero). And even when priority becomes 0,
nr_scan_try_batch returns zero until the saved pages reach 32. So to
scan the page we need at least 32 iterations of priority 12..0. If the
system has a fork bomb, it is almost a livelock.
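The arithmetic here can be reproduced with a tiny userspace model (a sketch: nr_scan_try_batch() follows the 2.6.3x mm/vmscan.c logic, while the priority loop and a one-page LRU stand in for the rest of the reclaim path):

```c
#include <assert.h>
#include <stdio.h>

#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32UL

/* Accumulate tiny scan targets until a full batch is worth scanning,
 * following mm/vmscan.c's nr_scan_try_batch() of that era. */
static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
				       unsigned long *nr_saved_scan)
{
	unsigned long nr = *nr_saved_scan + nr_to_scan;

	*nr_saved_scan = nr;
	if (nr >= SWAP_CLUSTER_MAX)
		*nr_saved_scan = 0;
	else
		nr = 0;
	return nr;
}

int main(void)
{
	unsigned long lru_pages = 1;	/* a single inactive-file page */
	unsigned long nr_saved_scan = 0;
	int sweeps = 0;

	for (;;) {
		unsigned long scanned_this_sweep = 0;

		sweeps++;
		for (int priority = DEF_PRIORITY; priority >= 0; priority--) {
			/* get_scan_count(): scan >>= priority */
			unsigned long scan = lru_pages >> priority;

			scanned_this_sweep +=
				nr_scan_try_batch(scan, &nr_saved_scan);
		}
		if (scanned_this_sweep)
			break;
	}
	printf("sweeps before any page is scanned: %d\n", sweeps);
	assert(sweeps == 32);
	return 0;
}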

If this is right, how about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..34983e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1973,6 +1973,9 @@ static void shrink_zones(int priority, struct
zonelist *zonelist,

 static bool zone_reclaimable(struct zone *zone)
 {
+       if (zone->all_unreclaimable)
+               return false;
+
        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
 }


-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  6:59                   ` Minchan Kim
@ 2011-03-23  7:13                     ` KOSAKI Motohiro
  2011-03-23  8:24                       ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23  7:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> Okay. I got it.
> 
> The problem is as follows.
> Due to the race between free_pcppages_bulk and balance_pgdat, it is
> possible to end up with zone->all_unreclaimable = 1 and
> zone->pages_scanned = 0. The DMA zone has few LRU pages, and in the case
> of no swap and big memory pressure there could be just one page in the
> inactive file list, like your example (anon LRU pages don't matter on a
> non-swap system). In that case shrink_zones doesn't scan the page at all
> until priority becomes 0, since get_scan_count does scan >>= priority
> (which is mostly zero).

Nope.

                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
                                continue;

These two lines mean that all_unreclaimable prevents priority-0 reclaim.


> And even when priority becomes 0, nr_scan_try_batch returns zero until
> the saved pages reach 32. So to scan the page we need at least 32
> iterations of priority 12..0. If the system has a fork bomb, it is
> almost a livelock.

Therefore, 1000 calls to get_scan_count(DEF_PRIORITY) are 1000 no-ops.

> 
> If this is right, how about this?

Boo.
You seem to have forgotten why you introduced the current
all_unreclaimable() function. During hibernation, we can't trust
all_unreclaimable.

That's why I proposed the following patch when you introduced
all_unreclaimable().


---
 mm/vmscan.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..1919d8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/oom.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1931,7 +1932,7 @@ out:
 		return sc->nr_reclaimed;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (scanning_global_lru(sc) && !all_unreclaimable)
+	if (scanning_global_lru(sc) && !all_unreclaimable && !oom_killer_disabled)
 		return 1;
 
 	return 0;
-- 
1.6.5.2






^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-22 11:05             ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
  2011-03-22 14:49               ` Minchan Kim
@ 2011-03-23  7:41               ` KAMEZAWA Hiroyuki
  2011-03-23  7:55                 ` KOSAKI Motohiro
  1 sibling, 1 reply; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23  7:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, Nick Piggin, Minchan Kim, Johannes Weiner

On Tue, 22 Mar 2011 20:05:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The all_unreclaimable check in direct reclaim was introduced in 2.6.19
> by the following commit.
> 
> 	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
> 
> And it went through a strange history. First, the following commit
> broke the logic unintentionally.
> 
> 	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> 				      costly-order allocations
> 
> Two years later, I found an obviously meaningless code fragment and
> restored the original intention with the following commit.
> 
> 	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> 				      return value when priority==0
> 
> But the logic didn't work when a 32bit highmem system went into
> hibernation, so Minchan slightly changed the algorithm and fixed it.
> 
> 	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> 				      in direct reclaim path
> 
> But, recently, Andrey Vagin found the new corner case. Look,
> 
> 	struct zone {
> 	  ..
> 	        int                     all_unreclaimable;
> 	  ..
> 	        unsigned long           pages_scanned;
> 	  ..
> 	}
> 
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock. Therefore a zone can reach the state
> zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
> the current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
> 
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it becase all_unreclaimable, it never return all_unreclaimable=0
> beucase it typicall don't have reclaimable pages.
> 
> Eventually, oom-killer never works on such systems.  Let's remove
> this problematic logic completely.
> 
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

IIUC, I saw the phenomenon which you pointed out, as
 - all zone->all_unreclaimable = yes
 - zone_reclaimable() returns true
 - no pgscan proceeds.

on a swapless system. So, I'd like to vote for this patch.

But hmm... what happens when all of the pages are isolated, or locked and
being freed? I think we should have alternative safeguard logic to avoid
calling the oom-killer. Hmm.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/5] mm: introduce wait_on_page_locked_killable
  2011-03-22 11:08             ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
@ 2011-03-23  7:44               ` KAMEZAWA Hiroyuki
  2011-03-24 15:04               ` Minchan Kim
  1 sibling, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23  7:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins

On Tue, 22 Mar 2011 20:08:41 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> commit 2687a356 (Add lock_page_killable) introduced killable
> lock_page(). Similarly, this patch introduces a killable
> wait_on_page_locked().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-22 11:09             ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
@ 2011-03-23  7:49               ` KAMEZAWA Hiroyuki
  2011-03-23  8:09                 ` KOSAKI Motohiro
  2011-03-24 15:10               ` Minchan Kim
  2011-03-24 17:13               ` Oleg Nesterov
  2 siblings, 1 reply; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23  7:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins

On Tue, 22 Mar 2011 20:09:29 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> When the oom killer is invoked, almost all processes get stuck at one of
> the following two points.
> 
> 	1) __alloc_pages_nodemask
> 	2) __lock_page_or_retry
> 
> 1) is not very problematic because TIF_MEMDIE leads to allocation
> failure and gets the task out of the page allocator. 2) is more
> problematic. In an OOM situation, zones typically have no page cache at
> all, and memory starvation can reduce IO performance significantly. When
> a fork bomb occurs, the TIF_MEMDIE task not dying quickly means the fork
> bomb may create new processes faster than the oom-killer can kill them.
> Then the system may livelock.
> 
> This patch makes the page fault interruptible by SIGKILL.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  arch/x86/mm/fault.c |    9 +++++++++
>  include/linux/mm.h  |    1 +
>  mm/filemap.c        |   22 +++++++++++++++++-----
>  3 files changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 20e3f87..797c7d0 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	if (user_mode_vm(regs)) {
>  		local_irq_enable();
>  		error_code |= PF_USER;
> +		flags |= FAULT_FLAG_KILLABLE;
>  	} else {
>  		if (regs->flags & X86_EFLAGS_IF)
>  			local_irq_enable();
> @@ -1138,6 +1139,14 @@ good_area:
>  	}
>  
>  	/*
> > +	 * The page fault was interrupted by SIGKILL. There is no reason
> > +	 * to continue the fault.
> +	 */
> +	if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
> +	    fatal_signal_pending(current))
> +		return;
> +

Hmm? up_read(&mm->mmap_sem) ?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  7:41               ` KAMEZAWA Hiroyuki
@ 2011-03-23  7:55                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23  7:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, Nick Piggin, Minchan Kim,
	Johannes Weiner

> > Reported-by: Andrey Vagin <avagin@openvz.org>
> > Cc: Nick Piggin <npiggin@kernel.dk>
> > Cc: Minchan Kim <minchan.kim@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> IIUC, I saw the phenomenon you pointed out, i.e.
>  - all zone->all_unreclaimable = yes
>  - zone_reclaimable() returns true
>  - no pgscan proceeds.
> 
> on a swapless system. So, I'd like to vote for this patch.
> 
> But hmm... what happens if all pages are isolated or locked and now being freed?
> I think we should have alternative safe-guard logic to avoid calling the
> oom-killer. Hmm.

Yes, this patch has a small risk. But 1) this logic hasn't worked for about
two years (see changelog), and 2) memcg has never used this logic and I
haven't received any bug report from the memcg developers. Therefore I
decided to take the simplest way.

Of course, I'll build another protection if I get any regression report.




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-23  7:49               ` KAMEZAWA Hiroyuki
@ 2011-03-23  8:09                 ` KOSAKI Motohiro
  2011-03-23 14:34                   ` Linus Torvalds
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23  8:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins

> On Tue, 22 Mar 2011 20:09:29 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > When the oom killer triggers, almost all processes get stuck at one
> > of the following two points.
> > 
> > 	1) __alloc_pages_nodemask
> > 	2) __lock_page_or_retry
> > 
> > 1) is not much of a problem because TIF_MEMDIE makes the allocation
> > fail and gets the task out of the page allocator. 2) is more
> > problematic: in an OOM situation, zones typically have no page cache
> > at all, and memory starvation can degrade IO performance badly. When
> > a fork bomb occurs, a TIF_MEMDIE task that doesn't die quickly means
> > the fork bomb may create new processes faster than the oom-killer
> > can kill them. The system may then livelock.
> > 
> > This patch makes pagefault interruptible by SIGKILL.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  arch/x86/mm/fault.c |    9 +++++++++
> >  include/linux/mm.h  |    1 +
> >  mm/filemap.c        |   22 +++++++++++++++++-----
> >  3 files changed, 27 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 20e3f87..797c7d0 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> >  	if (user_mode_vm(regs)) {
> >  		local_irq_enable();
> >  		error_code |= PF_USER;
> > +		flags |= FAULT_FLAG_KILLABLE;
> >  	} else {
> >  		if (regs->flags & X86_EFLAGS_IF)
> >  			local_irq_enable();
> > @@ -1138,6 +1139,14 @@ good_area:
> >  	}
> >  
> >  	/*
> > +	 * Pagefault was interrupted by SIGKILL. We have no reason to
> > +	 * continue pagefault.
> > +	 */
> > +	if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
> > +	    fatal_signal_pending(current))
> > +		return;
> > +
> 
> Hmm? up_read(&mm->mmap_sem) ?

When __lock_page_or_retry() returns 0, it has already called
up_read(mmap_sem) itself inside that function.

I agree this is strange (or ugly), but I don't want to change this
behavior at this time.




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  7:13                     ` KOSAKI Motohiro
@ 2011-03-23  8:24                       ` Minchan Kim
  2011-03-23  8:44                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-23  8:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Wed, Mar 23, 2011 at 04:13:21PM +0900, KOSAKI Motohiro wrote:
> > Okay. I got it.
> > 
> > The problem is as follows.
> > Due to the race between free_pcppages_bulk and balance_pgdat, it is
> > possible to have zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
> > The DMA zone has few LRU pages, and with no swap and big memory
> > pressure there could be just one page on the inactive file list, like
> > in your example. (anon lru pages aren't important on a non-swap system)
> > In such a case, shrink_zones doesn't scan the page at all until priority
> > becomes 0, since get_scan_count does scan >>= priority (which is mostly zero).
> 
> Nope.
> 
>                         if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>                                 continue;
> 
> These two lines mean that all_unreclaimable prevents priority-0 reclaim.
> 

Yes. I missed it. Thanks. 

> 
> > And even once priority becomes 0, nr_scan_try_batch returns zero until
> > the saved pages reach 32. So to scan a page we need, at minimum, 32
> > iterations of priority 12..0. If the system has a fork bomb, that is
> > almost a livelock.
> 
> Therefore, 1000 calls to get_scan_count(DEF_PRIORITY) amount to 1000 no-ops.
> 
> > 
> > If is is right, how about this?
> 
> Boo.
> You seem to have forgotten why you introduced the current all_unreclaimable() function.
> During hibernation, we can't trust all_unreclaimable.

Hmm. AFAIR, the reason we added all_unreclaimable is that while hibernation
is going on, kswapd is frozen, so it can't mark zone->all_unreclaimable.
So I think hibernation can't be a problem.
Am I missing something?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  8:24                       ` Minchan Kim
@ 2011-03-23  8:44                         ` KOSAKI Motohiro
  2011-03-23  9:02                           ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23  8:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> > Boo.
> > You seems forgot why you introduced current all_unreclaimable() function.
> > While hibernation, we can't trust all_unreclaimable.
> 
> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
> kswapd is freezed so it can't mark the zone->all_unreclaimable.
> So I think hibernation can't be a problem.
> Am I miss something?

Ahh, I missed that. Thanks for correcting me. Okay, I now understand both my patch and yours.
Can you please explain why you prefer yours to mine?

Btw, yours is very similar to Andrey's initial patch. If yours is
better, I'd like to ack Andrey's instead.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  8:44                         ` KOSAKI Motohiro
@ 2011-03-23  9:02                           ` Minchan Kim
  2011-03-24  2:11                             ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-23  9:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Boo.
>> > You seems forgot why you introduced current all_unreclaimable() function.
>> > While hibernation, we can't trust all_unreclaimable.
>>
>> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
>> kswapd is freezed so it can't mark the zone->all_unreclaimable.
>> So I think hibernation can't be a problem.
>> Am I miss something?
>
> Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
> Can you please explain why do you like your one than mine?

Just _simple_ :)
I don't want to change many lines when we can do it simply and very clearly.

>
> btw, Your one is very similar andrey's initial patch. If your one is
> better, I'd like to ack with andrey instead.

When Andrey sent his patch, I thought zone_reclaimable() was the right
place for the check rather than outside zone_reclaimable. The reason I
didn't ack it is that Andrey couldn't explain the root cause, but you
did, so you persuaded me.

I don't mind if Andrey moves the check into zone_reclaimable and resends,
or I can resend with a concrete description.

Anyway, the most important thing is a good description showing the root cause.
That applies to your patch, too.
You should have written down the root cause in the description.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
       [not found]               ` <20110323164229.6b647004.kamezawa.hiroyu@jp.fujitsu.com>
@ 2011-03-23 13:40                 ` Luis Claudio R. Goncalves
  2011-03-24  0:06                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Luis Claudio R. Goncalves @ 2011-03-23 13:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins

On Wed, Mar 23, 2011 at 04:42:29PM +0900, KAMEZAWA Hiroyuki wrote:
| On Tue, 22 Mar 2011 20:06:48 +0900 (JST)
| KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
| 
| > This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
| > 
| > The commit dramatically improved the oom killer logic when a fork
| > bomb occurs. But I've found it has a nasty corner case: the cpu
| > cgroup now has a strange default RT runtime. It's 0! That means if
| > a process under a cpu cgroup is promoted to the RT scheduling
| > class, the process never runs at all.
| > 
| > Eventually, the kernel may hang when an oom kill occurs.
| > 
| > The author needs to resubmit this with a knob, disabled by default,
| > if he really needs this feature.
| > 
| > Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
| > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| 
| Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The original patch was written to fix an issue observed in 2.6.24.7-rt.
As the logic sounded useful, I ported it to upstream. Anyway, I am trying
a few ideas to rework that patch. In the meantime, I'm pretty fine with
reverting the commit.

Acked-by: Luis Claudio R. Gonçalves <lgoncalv@uudg.org>

-- 
[ Luis Claudio R. Goncalves                    Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9  2696 7203 D980 A448 C8F8 ]


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-23  8:09                 ` KOSAKI Motohiro
@ 2011-03-23 14:34                   ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2011-03-23 14:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, David Rientjes,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins

On Wed, Mar 23, 2011 at 1:09 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> When __lock_page_or_retry() returns 0, it has already called
> up_read(mmap_sem) itself inside that function.

Indeed.

> I agree this is strange (or ugly). but I don't want change this spec in
> this time.

I agree that it is strange, and I don't like functions that touch
locks that they didn't take themselves, but since the original point
of the whole thing was to wait for the page without holding the
mmap_sem lock, that function has to do the up_read() early.

                   Linus

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-23 13:40                 ` [PATCH 2/5] Revert "oom: give the dying task a higher priority" Luis Claudio R. Goncalves
@ 2011-03-24  0:06                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  0:06 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
	linux-mm, Andrey Vagin, Hugh Dickins

> On Wed, Mar 23, 2011 at 04:42:29PM +0900, KAMEZAWA Hiroyuki wrote:
> | On Tue, 22 Mar 2011 20:06:48 +0900 (JST)
> | KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> | 
> | > This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
> | > 
> | > The commit dramatically improve oom killer logic when fork-bomb
> | > occur. But, I've found it has nasty corner case. Now cpu cgroup
> | > has strange default RT runtime. It's 0! That said, if a process
> | > under cpu cgroup promote RT scheduling class, the process never
> | > run at all.
> | > 
> | > Eventually, kernel may hang up when oom kill occur.
> | > 
> | > The author need to resubmit it as adding knob and disabled
> | > by default if he really need this feature.
> | > 
> | > Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
> | > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> | 
> | Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> The original patch was written to fix an issue observed in 2.6.24.7-rt.
> As the logic sounded useful, I ported it to upstream. Anyway,I am trying
> a few ideas to rework that patch. In the meantime, I'm pretty fine with
> reverting the commit.
> 
> Acked-by: Luis Claudio R. Gonçalves <lgoncalv@uudg.org>

Ok, then I'll drop [patch 3/5] too. I'd like to focus on discussing your idea.





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-23  9:02                           ` Minchan Kim
@ 2011-03-24  2:11                             ` KOSAKI Motohiro
  2011-03-24  2:21                               ` Andrew Morton
  2011-03-24  4:19                               ` Minchan Kim
  0 siblings, 2 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  2:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> > Boo.
> >> > You seems forgot why you introduced current all_unreclaimable() function.
> >> > While hibernation, we can't trust all_unreclaimable.
> >>
> >> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
> >> kswapd is freezed so it can't mark the zone->all_unreclaimable.
> >> So I think hibernation can't be a problem.
> >> Am I miss something?
> >
> > Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
> > Can you please explain why do you like your one than mine?
> 
> Just _simple_ :)
> I don't want to change many lines although we can do it simple and very clear.
>
> >
> > btw, Your one is very similar andrey's initial patch. If your one is
> > better, I'd like to ack with andrey instead.
> 
> When Andrey sent a patch, I though this as zone_reclaimable() is right
> place to check it than out of zone_reclaimable. Why I didn't ack is
> that Andrey can't explain root cause but you did so you persuade me.
> 
> I don't mind if Andrey move the check in zone_reclaimable and resend
> or I resend with concrete description.
> 
> Anyway, most important thing is good description to show the root cause.
> It is applied to your patch, too.
> You should have written down root cause in description.

Honestly, I really dislike mixing zone->pages_scanned and
zone->all_unreclaimable, because it isn't simple. I don't
think it's in good taste, nor easy to review. The fact that even you,
a VM expert, didn't understand this issue at once is a smell of overly
messy code.

Therefore, I prefer to either 1) just remove the function or
2) check only zone->all_unreclaimable and oom_killer_disabled
instead of zone->pages_scanned.

And I agree I need to rewrite the description.
How's this?

==================================================
>From 216bcf3fb0476b453080debf8999c74c635ed72f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sun, 8 May 2011 17:39:44 +0900
Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely

The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.

	2006 Sep 25; commit 408d8544; oom: use unreclaimable info

It has since gone through a strange history. First, the following
commit broke the logic unintentionally.

	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations

Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.

	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0

But the logic didn't work when a 32-bit highmem system went into
hibernation, so Minchan slightly changed the algorithm and fixed it.

	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path

But recently, Andrey Vagin found a new corner case. Look,

	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}

zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can reach the state
zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case, the
current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.

Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone and it becomes zone->all_unreclaimable=1 easily. And
once it becomes all_unreclaimable=1, it never returns to
all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
DEF_PRIORITY reclaim, and a-few-lru-pages >> DEF_PRIORITY is always 0.
That means no page scan at all!

Eventually, the oom-killer never works on such systems.  Let's remove
this problematic logic completely.

Reported-by: Andrey Vagin <avagin@openvz.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   36 +-----------------------------------
 1 files changed, 1 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..254aada 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
 }
 
 /*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
-static bool all_unreclaimable(struct zonelist *zonelist,
-		struct scan_control *sc)
-{
-	struct zoneref *z;
-	struct zone *zone;
-	bool all_unreclaimable = true;
-
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-			gfp_zone(sc->gfp_mask), sc->nodemask) {
-		if (!populated_zone(zone))
-			continue;
-		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-			continue;
-		if (zone_reclaimable(zone)) {
-			all_unreclaimable = false;
-			break;
-		}
-	}
-
-	return all_unreclaimable;
-}
-
-/*
  * This is the main entry point to direct page reclaim.
  *
  * If a full scan of the inactive list fails to free enough memory then we
@@ -2105,14 +2078,7 @@ out:
 	delayacct_freepages_end();
 	put_mems_allowed();
 
-	if (sc->nr_reclaimed)
-		return sc->nr_reclaimed;
-
-	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
-		return 1;
-
-	return 0;
+	return sc->nr_reclaimed;
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-- 
1.6.5.2






^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  2:11                             ` KOSAKI Motohiro
@ 2011-03-24  2:21                               ` Andrew Morton
  2011-03-24  2:48                                 ` KOSAKI Motohiro
  2011-03-24  4:19                               ` Minchan Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2011-03-24  2:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, linux-kernel, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely

zone.all_unreclaimable is there to prevent reclaim from wasting CPU
cycles scanning a zone which has no reclaimable pages.  When originally
implemented it did this very well.

That you guys keep breaking it, or don't feel like improving it is not a
reason to remove it!

If the code is unneeded and the kernel now reliably solves this problem
by other means then this should have been fully explained in the
changelog, but it was not even mentioned.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  2:21                               ` Andrew Morton
@ 2011-03-24  2:48                                 ` KOSAKI Motohiro
  2011-03-24  3:04                                   ` Andrew Morton
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  2:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Minchan Kim, linux-kernel, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> 
> zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> cycles scanning a zone which has no reclaimable pages.  When originally
> implemented it did this very well.
>
> That you guys keep breaking it, or don't feel like improving it is not a
> reason to remove it!
> 
> If the code is unneeded and the kernel now reliably solves this problem
> by other means then this should have been fully explained in the
> changelog, but it was not even mentioned.

The changelog says the logic was removed in 2008, three years ago,
even though unintentionally. Minchan and I tried to resurrect the
broken logic and resurrected a bug in the logic too. Then we
discussed whether it should live or die.

Which part is hard to understand for you?




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  2:48                                 ` KOSAKI Motohiro
@ 2011-03-24  3:04                                   ` Andrew Morton
  2011-03-24  5:35                                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2011-03-24  3:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, linux-kernel, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, 24 Mar 2011 11:48:19 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> > 
> > zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> > cycles scanning a zone which has no reclaimable pages.  When originally
> > implemented it did this very well.
> >
> > That you guys keep breaking it, or don't feel like improving it is not a
> > reason to remove it!
> > 
> > If the code is unneeded and the kernel now reliably solves this problem
> > by other means then this should have been fully explained in the
> > changelog, but it was not even mentioned.
> 
> The changelog says, the logic was removed at 2008. three years ago.
> even though it's unintentionally. and I and minchan tried to resurrect
> the broken logic and resurrected a bug in the logic too. then, we
> are discussed it should die or alive.
> 
> Which part is hard to understand for you?
> 

The part which isn't there: how does the kernel now address the problem
which that code fixed?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  2:11                             ` KOSAKI Motohiro
  2011-03-24  2:21                               ` Andrew Morton
@ 2011-03-24  4:19                               ` Minchan Kim
  2011-03-24  5:35                                 ` KOSAKI Motohiro
  1 sibling, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  4:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, Mar 24, 2011 at 11:11 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> > Boo.
>> >> > You seems forgot why you introduced current all_unreclaimable() function.
>> >> > While hibernation, we can't trust all_unreclaimable.
>> >>
>> >> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
>> >> kswapd is freezed so it can't mark the zone->all_unreclaimable.
>> >> So I think hibernation can't be a problem.
>> >> Am I miss something?
>> >
>> > Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
>> > Can you please explain why do you like your one than mine?
>>
>> Just _simple_ :)
>> I don't want to change many lines although we can do it simple and very clear.
>>
>> >
>> > btw, Your one is very similar andrey's initial patch. If your one is
>> > better, I'd like to ack with andrey instead.
>>
>> When Andrey sent a patch, I though this as zone_reclaimable() is right
>> place to check it than out of zone_reclaimable. Why I didn't ack is
>> that Andrey can't explain root cause but you did so you persuade me.
>>
>> I don't mind if Andrey move the check in zone_reclaimable and resend
>> or I resend with concrete description.
>>
>> Anyway, most important thing is good description to show the root cause.
>> It is applied to your patch, too.
>> You should have written down root cause in description.
>
> honestly, I really dislike to use mixing zone->pages_scanned and
> zone->all_unreclaimable. because I think it's no simple. I don't
> think it's good taste nor easy to review. Even though you who VM
> expert didn't understand this issue at once, it's smell of too
> mess code.
>
> therefore, I prefore to take either 1) just remove the function or
> 2) just only check zone->all_unreclaimable and oom_killer_disabled
> instead zone->pages_scanned.

Nick's original goal was to prevent OOM killing until all the zones we're
interested in are unreclaimable, and whether a zone is reclaimable or not
depends on kswapd. Nick's original solution just peeked at
zone->all_unreclaimable, but I made it dirty when we were considering the
kswapd freeze in hibernation. So I think we still need it to handle the
kswapd-freeze problem, and we should add back the original behavior we
missed at that time, like below.

static bool zone_reclaimable(struct zone *zone)
{
        if (zone->all_unreclaimable)
                return false;

        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}

If you remove the logic, the problem Nick addressed would show up
again. How about addressing that problem in your patch? If you
remove the logic, __alloc_pages_direct_reclaim loses the chance to call
drain_all_pages. Of course, it was a side effect, but we should handle
it.

And my last concern is: are we going the right way?
I think the fundamental cause of this problem is that page_scanned and
all_unreclaimable race, so isn't fixing the race the right approach?
If that is hard or very costly, your approach and mine can be the fallback.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  3:04                                   ` Andrew Morton
@ 2011-03-24  5:35                                     ` KOSAKI Motohiro
  0 siblings, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Minchan Kim, linux-kernel, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> On Thu, 24 Mar 2011 11:48:19 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> > > 
> > > zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> > > cycles scanning a zone which has no reclaimable pages.  When originally
> > > implemented it did this very well.
> > >
> > > That you guys keep breaking it, or don't feel like improving it is not a
> > > reason to remove it!
> > > 
> > > If the code is unneeded and the kernel now reliably solves this problem
> > > by other means then this should have been fully explained in the
> > > changelog, but it was not even mentioned.
> > 
> > The changelog says, the logic was removed at 2008. three years ago.
> > even though it's unintentionally. and I and minchan tried to resurrect
> > the broken logic and resurrected a bug in the logic too. then, we
> > are discussed it should die or alive.
> > 
> > Which part is hard to understand for you?
> 
> The part which isn't there: how does the kernel now address the problem
> which that code fixed?

Ah, got it.
The history shows the problem hasn't occurred for three years, thus I
meant

past: the code exists, but is broken and hasn't worked for three years.
new:  the code is removed.

What's the difference? But Minchan's last mail pointed out that the recent
drain_all_pages() logic depends on the return value of try_to_free_pages.

Thus I've made a new patch and sent it. Please take a look.





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  4:19                               ` Minchan Kim
@ 2011-03-24  5:35                                 ` KOSAKI Motohiro
  2011-03-24  5:53                                   ` Minchan Kim
  2011-03-24  7:43                                   ` Minchan Kim
  0 siblings, 2 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  5:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

Hi Minchan,

> Nick's original goal was to prevent OOM killing until all the zones we're
> interested in are unreclaimable, and whether a zone is reclaimable or not
> depends on kswapd. Nick's original solution just peeked at
> zone->all_unreclaimable, but I made it dirty when we were considering
> the kswapd freeze in hibernation. So I think we still need it to handle
> the kswapd-freeze problem, and we should restore the original behavior
> we missed at that time, like below.
> 
> static bool zone_reclaimable(struct zone *zone)
> {
>         if (zone->all_unreclaimable)
>                 return false;
> 
>         return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
> 
> If you remove the logic, the problem Nick addressed would show up
> again. How about addressing the problem in your patch? If you
> remove the logic, __alloc_pages_direct_reclaim loses the chance to call
> drain_all_pages(). Of course, it was a side effect, but we should
> handle it.

OK, you have successfully persuaded me. Losing the drain_all_pages()
chance is a risk.

> And my last concern is: are we going the right way?


> I think the fundamental cause of this problem is that page_scanned and
> all_unreclaimable race with each other, so isn't fixing the race the
> right approach?

Hmm...
If we can avoid a lock, we should, I think, for performance reasons.
Therefore I'd like to contain the issue in do_try_to_free_pages(); it's
a slow path.

Is the following patch acceptable to you? It
 o rewrites the description
 o avoids mixing the use of zone->all_unreclaimable and zone->pages_scanned
 o avoids reintroducing the hibernation issue
 o doesn't touch the fast path


> If it is hard or very costly, your approach and mine will be the fallback.

-----------------------------------------------------------------
>From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sat, 14 May 2011 05:07:48 +0900
Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name

The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.

	2006 Sep 25; commit 408d8544; oom: use unreclaimable info

It then went through a strange history. First, the following commit
broke the logic unintentionally.

	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations

Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.

	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0

But the logic didn't work when a 32-bit highmem system went into
hibernation, and Minchan slightly changed the algorithm and fixed it.

	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path

Recently, however, Andrey Vagin found a new corner case. Look:

	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}

zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can reach a state
with zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
the current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.

Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone, and it becomes zone->all_unreclaimable=1 easily. And
once it becomes all_unreclaimable=1, it never goes back to
all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
DEF_PRIORITY reclaim, and a-few-lru-pages>>DEF_PRIORITY always makes 0.
That means no pages are scanned at all!

Eventually, the oom-killer never works on such systems. That said, we
can't use zone->pages_scanned for this purpose. This patch restores
all_unreclaimable() to using zone->all_unreclaimable, as before, and in
addition adds an oom_killer_disabled check to avoid reintroducing the
issue of commit d1908362.

Reported-by: Andrey Vagin <avagin@openvz.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..54ac548 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -41,6 +41,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/oom.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1988,17 +1989,12 @@ static bool zone_reclaimable(struct zone *zone)
 	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
 }
 
-/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
+/* All zones in zonelist are unreclaimable? */
 static bool all_unreclaimable(struct zonelist *zonelist,
 		struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
-	bool all_unreclaimable = true;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (zone_reclaimable(zone)) {
-			all_unreclaimable = false;
-			break;
-		}
+		if (!zone->all_unreclaimable)
+			return false;
 	}
 
-	return all_unreclaimable;
+	return true;
 }
 
 /*
@@ -2108,6 +2102,14 @@ out:
 	if (sc->nr_reclaimed)
 		return sc->nr_reclaimed;
 
+	/*
+	 * As hibernation is going on, kswapd is freezed so that it can't mark
+	 * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
+	 * check.
+	 */
+	if (oom_killer_disabled)
+		return 0;
+
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  5:35                                 ` KOSAKI Motohiro
@ 2011-03-24  5:53                                   ` Minchan Kim
  2011-03-24  6:16                                     ` KOSAKI Motohiro
  2011-03-24  7:43                                   ` Minchan Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  5:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

Hi Kosaki,

On Thu, Mar 24, 2011 at 2:35 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> [snip: full patch quoted]
Thanks for your effort, Kosaki.
But I still doubt this patch is good.

This patch allows early OOM killing during hibernation, since it skips
the all_unreclaimable check.
Normally, hibernation needs a lot of memory, so page-reclaim pressure
would be high on a small-memory system. So I don't like giving up early.

Do you think my patch has a problem? Personally, I think it's very
simple and clear. :)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  5:53                                   ` Minchan Kim
@ 2011-03-24  6:16                                     ` KOSAKI Motohiro
  2011-03-24  6:32                                       ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  6:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

Hi

> Thanks for your effort, Kosaki.
> But I still doubt this patch is good.
> 
> This patch allows early OOM killing during hibernation, since it skips
> the all_unreclaimable check.
> Normally, hibernation needs a lot of memory, so page-reclaim pressure
> would be high on a small-memory system. So I don't like giving up early.

Wait. When does big pressure occur? The hibernation reclaim pressure
(sc->nr_to_reclaim) depends on the physical memory size; therefore the
pressure doesn't really depend on how small the system is.


> Do you think my patch has a problem? Personally, I think it's very
> simple and clear. :)

To be honest, I dislike the following part. It's madness on top of madness.

	static bool zone_reclaimable(struct zone *zone)
	{
	        if (zone->all_unreclaimable)
	                return false;
	
	        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}


The function requires a reviewer to know that

 o pages_scanned and all_unreclaimable are racy
 o during hibernation, zone->all_unreclaimable can be a false negative,
   but can't be a false positive.

And the function comment of all_unreclaimable() says

	 /*
	  * As hibernation is going on, kswapd is freezed so that it can't mark
	  * the zone into all_unreclaimable. It can't handle OOM during hibernation.
	  * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
	  */

But now it is no longer a copy of the kswapd algorithm.

If you still prefer this idea even after hearing the above explanation,
please consider adding many more comments. I can't say your current
patch is readable/reviewable enough.

Thanks.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  6:16                                     ` KOSAKI Motohiro
@ 2011-03-24  6:32                                       ` Minchan Kim
  2011-03-24  7:03                                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  6:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
>> Thanks for your effort, Kosaki.
>> But I still doubt this patch is good.
>>
>> This patch allows early OOM killing during hibernation, since it skips
>> the all_unreclaimable check.
>> Normally, hibernation needs a lot of memory, so page-reclaim pressure
>> would be high on a small-memory system. So I don't like giving up early.
>
> Wait. When does big pressure occur? The hibernation reclaim pressure
> (sc->nr_to_reclaim) depends on the physical memory size; therefore the
> pressure doesn't really depend on how small the system is.

It depends on the physical memory size and /sys/power/image_size.
If you tune the image size to be bigger, the reclaim pressure will be higher.

>
>
>> Do you think my patch has a problem? Personally, I think it's very
>> simple and clear. :)
>
> To be honest, I dislike the following part. It's madness on top of madness.
>
>        static bool zone_reclaimable(struct zone *zone)
>        {
>                if (zone->all_unreclaimable)
>                        return false;
>
>                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>        }
>
>
> The function requires a reviewer to know that
>
>  o pages_scanned and all_unreclaimable are racy

Yes. That part should be written down in a comment.

>  o during hibernation, zone->all_unreclaimable can be a false negative,
>    but can't be a false positive.

The comment on all_unreclaimable() already explains it well, I think.

>
> And, a function comment of all_unreclaimable() says
>
>         /*
>          * As hibernation is going on, kswapd is freezed so that it can't mark
>          * the zone into all_unreclaimable. It can't handle OOM during hibernation.
>          * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>          */
>
> But now it is no longer a copy of the kswapd algorithm.

The comment doesn't say it should be a copy of kswapd.

>
> If you still prefer this idea even after hearing the above explanation,
> please consider adding many more comments. I can't say your current
> patch is readable/reviewable enough.

My patch isn't a formal patch for merging but just a concept to show.
If you agree with the idea, of course I will add more concrete comments
when I send the formal patch.

Before that, I would like to get your agreement. :)
If you solve my concern (early give-up in hibernation) in your patch, I
won't insist on my patch, either.

Thanks for the comment, Kosaki.

>
> Thanks.
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  6:32                                       ` Minchan Kim
@ 2011-03-24  7:03                                         ` KOSAKI Motohiro
  2011-03-24  7:25                                           ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  7:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Hi
> >
> >> Thanks for your effort, Kosaki.
> >> But I still doubt this patch is good.
> >>
> >> This patch allows early OOM killing during hibernation, since it skips
> >> the all_unreclaimable check.
> >> Normally, hibernation needs a lot of memory, so page-reclaim pressure
> >> would be high on a small-memory system. So I don't like giving up early.
> >
> > Wait. When does big pressure occur? The hibernation reclaim pressure
> > (sc->nr_to_reclaim) depends on the physical memory size; therefore the
> > pressure doesn't really depend on how small the system is.
> 
> It depends on the physical memory size and /sys/power/image_size.
> If you tune the image size to be bigger, the reclaim pressure will be higher.

OK, _if_ I want.
However, I haven't seen desktop people customize it.


> >> Do you think my patch has a problem? Personally, I think it's very
> >> simple and clear. :)
> >
> > To be honest, I dislike following parts. It's madness on madness.
> >
> >        static bool zone_reclaimable(struct zone *zone)
> >        {
> >                if (zone->all_unreclaimable)
> >                        return false;
> >
> >                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> >        }
> >
> >
> > The function requires a reviewer to know that
> 
> >  o pages_scanned and all_unreclaimable are racy
> 
> Yes. That part should be written down in a comment.
> 
> >  o during hibernation, zone->all_unreclaimable can be a false negative,
> >    but can't be a false positive.
> 
> The comment on all_unreclaimable() already explains it well, I think.

Where is that?


> > And, a function comment of all_unreclaimable() says
> >
> >         /*
> >          * As hibernation is going on, kswapd is freezed so that it can't mark
> >          * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> >          * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> >          */
> >
> > But now it is no longer a copy of the kswapd algorithm.
> 
> The comment doesn't say it should be a copy of kswapd.

I meant that the comment says

          * So let's check zone's unreclaimable in direct reclaim as well as kswapd.

but now it isn't "as well as kswapd".

I think this is critically important. If people can't understand why the
algorithm was chosen, someone will break the code again sooner or later.


> > If you still prefer this idea even after hearing the above explanation,
> > please consider adding many more comments. I can't say your current
> > patch is readable/reviewable enough.
> 
> My patch isn't a formal patch for merging but just a concept to show.
> If you agree with the idea, of course I will add more concrete comments
> when I send the formal patch.
> 
> Before that, I would like to get your agreement. :)
> If you solve my concern (early give-up in hibernation) in your patch, I
> won't insist on my patch, either.

OK. Let's try.

Please consider why priority=0 is not enough. zone_reclaimable_pages(zone) * 6
is a conservative value that allows for multi-thread races. While one task
is reclaiming, others can allocate/free memory concurrently; therefore,
even after priority=0, there is still a chance of finding reclaimable pages
on the LRU. But in the hibernation case, almost all tasks are frozen before
hibernation calls shrink_all_memory(); therefore, there is no race, and
priority=0 reclaim can cover all LRU pages.

Is this enough explanation for you?


> 
> Thanks for the comment, Kosaki.





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  7:03                                         ` KOSAKI Motohiro
@ 2011-03-24  7:25                                           ` Minchan Kim
  2011-03-24  7:28                                             ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  7:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, Mar 24, 2011 at 4:03 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Hi
>> >
>> >> Thanks for your effort, Kosaki.
>> >> But I still doubt this patch is good.
>> >>
>> >> This patch makes early oom killing in hibernation as it skip
>> >> all_unreclaimable check.
>> >> Normally,  hibernation needs many memory so page_reclaim pressure
>> >> would be big in small memory system. So I don't like early give up.
>> >
>> > Wait. When occur big pressure? hibernation reclaim pressure
>> > (sc->nr_to_recliam) depend on physical memory size. therefore
>> > a pressure seems to don't depend on the size.
>>
>> It depends on physical memory size and /sys/power/image_size.
>> If you want to tune image size bigger, reclaim pressure would be big.
>
> Ok, _If_ I want.
> However, I haven't seen desktop people customize it.
>
>
>> >> Do you think my patch has a problem? Personally, I think it's very
>> >> simple and clear. :)
>> >
>> > To be honest, I dislike following parts. It's madness on madness.
>> >
>> >        static bool zone_reclaimable(struct zone *zone)
>> >        {
>> >                if (zone->all_unreclaimable)
>> >                        return false;
>> >
>> >                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> >        }
>> >
>> >
>> > The function require a reviewer know
>> >
>> >  o pages_scanned and all_unreclaimable are racy
>>
>> Yes. That part should be written down of comment.
>>
>> >  o at hibernation, zone->all_unreclaimable can be false negative,
>> >   but can't be false positive.
>>
>> The comment of all_unreclaimable already does explain it well, I think.
>
> Where is?
>
>
>> > And, a function comment of all_unreclaimable() says
>> >
>> >         /*
>> >          * As hibernation is going on, kswapd is freezed so that it can't mark
>> >          * the zone into all_unreclaimable. It can't handle OOM during hibernation.
>> >          * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>> >          */
>> >
>> > But, now it is no longer copy of kswapd algorithm.
>>
>> The comment don't say it should be a copy of kswapd.
>
> I meant the comments says
>
>           * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>
> but now it isn't aswell as kswapd.
>
> I think it's critical important. If people can't understand why the
> algorithm was choosed, anyone will break the code again sooner or later.
>
>
>> > If you strongly prefer this idea even if you hear above explanation,
>> > please consider to add much and much comments. I can't say
>> > current your patch is enough readable/reviewable.
>>
>> My patch isn't a formal patch for merge but just a concept to show.
>> If you agree the idea, of course, I will add more concrete comment
>> when I send formal patch.
>>
>> Before, I would like to get a your agreement. :)
>> If you solve my concern(early give up in hibernation) in your patch, I
>> don't insist on my patch, either.
>
> OK. Let's try.
>
> Please consider why priority=0 is not enough. zone_reclaimable_pages(zone) * 6
> is a conservative value that allows for multi-thread races. While one task
> is reclaiming, others can allocate/free memory concurrently; therefore,
> even after priority=0, there is still a chance of finding reclaimable pages
> on the LRU. But in the hibernation case, almost all tasks are frozen before
> hibernation calls shrink_all_memory(); therefore, there is no race, and
> priority=0 reclaim can cover all LRU pages.
>
> Is this enough explanation for you?

For example, in a 4G desktop system:
32M of full scanning fails to reclaim a page.
That's under 1% coverage.

Is it enough to give up?
I am not sure.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  7:25                                           ` Minchan Kim
@ 2011-03-24  7:28                                             ` KOSAKI Motohiro
  2011-03-24  7:34                                               ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  7:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

>> For example, in a 4G desktop system:
>> 32M of full scanning fails to reclaim a page.
>> That's under 1% coverage.

?? I'm sorry, I haven't caught it.
Where does 32M come from?


> Is it enough to give up?
> I am not sure.




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  7:28                                             ` KOSAKI Motohiro
@ 2011-03-24  7:34                                               ` Minchan Kim
  2011-03-24  7:41                                                 ` Minchan Kim
  2011-03-24  7:43                                                 ` KOSAKI Motohiro
  0 siblings, 2 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  7:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> For example, in a 4G desktop system:
>> 32M of full scanning fails to reclaim a page.
>> That's under 1% coverage.
>
> ?? I'm sorry, I haven't caught it.
> Where does 32M come from?

(1<<12) * 4K + (1<<11) + 4K + .. (1 << 0) + 4K in shrink_zones.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  7:34                                               ` Minchan Kim
@ 2011-03-24  7:41                                                 ` Minchan Kim
  2011-03-24  7:43                                                 ` KOSAKI Motohiro
  1 sibling, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  7:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, Mar 24, 2011 at 4:34 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
>>> For example, in a 4G desktop system:
>>> 32M of full scanning fails to reclaim a page.
>>> That's under 1% coverage.
>>
>> ?? I'm sorry, I haven't caught it.
>> Where does 32M come from?
>
> (1<<12) * 4K + (1<<11) + 4K + .. (1 << 0) + 4K in shrink_zones.

Stupid me.
Sorry, my calculation was totally wrong.

Your explanation was perfect.
Okay, I don't have any objection to your solution.



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  7:34                                               ` Minchan Kim
  2011-03-24  7:41                                                 ` Minchan Kim
@ 2011-03-24  7:43                                                 ` KOSAKI Motohiro
  1 sibling, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24  7:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
	Johannes Weiner

> On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> For example, in a 4G desktop system,
> >> 32M of full scanning fails to reclaim a page.
> >> That's under 1% coverage.
> >
> > ?? I'm sorry, I haven't caught it.
> > Where does 32M come from?
> 
> (1<<12) * 4K + (1<<11) * 4K + .. + (1 << 0) * 4K in shrink_zones.

(lru-pages>>12) + (lru-pages>>11) + (lru-pages>>10) + ... =~ 2 * lru-pages

?




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
  2011-03-24  5:35                                 ` KOSAKI Motohiro
  2011-03-24  5:53                                   ` Minchan Kim
@ 2011-03-24  7:43                                   ` Minchan Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-24  7:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner

On Thu, Mar 24, 2011 at 2:35 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Minchan,
>
>> Nick's original goal was to prevent OOM killing until all the zones
>> we're interested in are unreclaimable, and whether a zone is
>> reclaimable or not depends on kswapd. Nick's original solution just
>> peeked at zone->all_unreclaimable, but I made it dirty when we were
>> considering the kswapd freeze in hibernation. So I think we still need
>> it to handle the kswapd freeze problem, and we should add back the
>> original behavior we missed at that time, like below.
>>
>> static bool zone_reclaimable(struct zone *zone)
>> {
>>         if (zone->all_unreclaimable)
>>                 return false;
>>
>>         return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> }
>>
>> If you remove the logic, the problem Nick addressed would show up
>> again. How about addressing the problem in your patch? If you
>> remove the logic, __alloc_pages_direct_reclaim loses the chance to
>> call drain_all_pages. Of course, that was a side effect, but we
>> should handle it.
>
> Ok, you have successfully persuaded me. Losing the drain_all_pages()
> chance is a risk.
>
>> And my last concern: are we going the right way?
>
>
>> I think the fundamental cause of this problem is that pages_scanned
>> and all_unreclaimable race with each other, so isn't fixing the race
>> the right way?
>
> Hmm..
> If we can avoid a lock, we should, I think, for performance reasons.
> Therefore I'd like to contain the issue in do_try_to_free_pages(); it's
> a slow path.
>
> Is the following patch acceptable to you? It
>  o rewrites the description
>  o avoids mixing zone->all_unreclaimable and zone->pages_scanned
>  o avoids reintroducing the hibernation issue
>  o doesn't touch the fast path
>
>
>> If it is hard or very costly, your approach and mine will be the fallback.
>
> -----------------------------------------------------------------
> From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Sat, 14 May 2011 05:07:48 +0900
> Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
>
> The all_unreclaimable check in direct reclaim was introduced in 2.6.19
> by the following commit.
>
>        2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> It has had a strange history since then. First, the following commit
> broke the logic unintentionally.
>
>        2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
>                                      costly-order allocations
>
> Two years later, I found an obviously meaningless code fragment and
> restored the original intention with the following commit.
>
>        2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
>                                      return value when priority==0
>
> But the logic didn't work when a 32bit highmem system went into
> hibernation, and Minchan slightly changed the algorithm and fixed it.
>
>        2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
>                                      in direct reclaim path
>
> But recently, Andrey Vagin found a new corner case. Look:
>
>        struct zone {
>          ..
>                int                     all_unreclaimable;
>          ..
>                unsigned long           pages_scanned;
>          ..
>        }
>
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock, so a zone can end up in a state
> where zone->pages_scanned=0 and zone->all_unreclaimable=1. In this
> case, the current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
>
> Is this an ignorable minor issue? No. Unfortunately, x86 has a very
> small DMA zone, and it reaches zone->all_unreclaimable=1 easily. And
> once it becomes all_unreclaimable=1, it never returns to
> all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
> DEF_PRIORITY reclaim, and a-few-lru-pages>>DEF_PRIORITY is always 0.
> That means no page scan at all!
>
> Eventually, the oom-killer never works on such systems. That said, we
> can't use zone->pages_scanned for this purpose. This patch restores the
> old behavior of having all_unreclaimable() use zone->all_unreclaimable,
> and in addition adds an oom_killer_disabled check to avoid
> reintroducing the issue of commit d1908362.
>
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Thanks for the good discussion, Kosaki.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/5] mm: introduce wait_on_page_locked_killable
  2011-03-22 11:08             ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
  2011-03-23  7:44               ` KAMEZAWA Hiroyuki
@ 2011-03-24 15:04               ` Minchan Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-24 15:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki

On Tue, Mar 22, 2011 at 08:08:41PM +0900, KOSAKI Motohiro wrote:
> commit 2687a356 (Add lock_page_killable) introduced a killable
> lock_page(). Similarly, this patch introduces a killable
> wait_on_page_locked().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-22 11:09             ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
  2011-03-23  7:49               ` KAMEZAWA Hiroyuki
@ 2011-03-24 15:10               ` Minchan Kim
  2011-03-24 17:13               ` Oleg Nesterov
  2 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-24 15:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki

On Tue, Mar 22, 2011 at 08:09:29PM +0900, KOSAKI Motohiro wrote:
> When the oom killer fires, most processes get stuck at one of the
> following two points.
> 
> 	1) __alloc_pages_nodemask
> 	2) __lock_page_or_retry
> 
> 1) is not much of a problem because TIF_MEMDIE makes the allocation
> fail and the task gets out of the page allocator. 2) is more
> problematic. In an OOM situation, zones typically have no page cache at
> all, and memory starvation can reduce IO performance heavily. When a
> fork bomb occurs, the TIF_MEMDIE task not dying quickly means the fork
> bomb may create new processes faster than the oom-killer can kill them.
> Then the system may livelock.
> 
> This patch makes pagefault interruptible by SIGKILL.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Looks like a cool idea.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
       [not found]             ` <20110322200657.B064.A69D9226@jp.fujitsu.com>
       [not found]               ` <20110323164229.6b647004.kamezawa.hiroyu@jp.fujitsu.com>
@ 2011-03-24 15:27               ` Minchan Kim
  2011-03-28  9:48                 ` KOSAKI Motohiro
  2011-03-28  9:51                 ` Peter Zijlstra
  1 sibling, 2 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-24 15:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Luis Claudio R. Goncalves,
	Peter Zijlstra

On Tue, Mar 22, 2011 at 08:06:48PM +0900, KOSAKI Motohiro wrote:
> This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
> 
> The commit dramatically improved oom killer logic when a fork bomb
> occurs. But I've found it has a nasty corner case: the cpu cgroup now
> has a strange default RT runtime. It's 0! That said, if a process
> under a cpu cgroup is promoted to an RT scheduling class, the process
> never runs at all.
> 
> Eventually, the kernel may hang when an oom kill occurs.
> 
> The author needs to resubmit it with a knob, disabled by default, if
> he really needs this feature.
> 
> Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Just a comment below.

> ---
>  mm/oom_kill.c |   27 ---------------------------
>  1 files changed, 0 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3100bc5..739dee4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -84,24 +84,6 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
>  #endif /* CONFIG_NUMA */
>  
>  /*
> - * If this is a system OOM (not a memcg OOM) and the task selected to be
> - * killed is not already running at high (RT) priorities, speed up the
> - * recovery by boosting the dying task to the lowest FIFO priority.
> - * That helps with the recovery and avoids interfering with RT tasks.
> - */
> -static void boost_dying_task_prio(struct task_struct *p,
> -				  struct mem_cgroup *mem)
> -{
> -	struct sched_param param = { .sched_priority = 1 };
> -
> -	if (mem)
> -		return;
> -
> -	if (!rt_task(p))
> -		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> -}
> -
> -/*
>   * The process p may have detached its own ->mm while exiting or through
>   * use_mm(), but one or more of its subthreads may still have a valid
>   * pointer.  Return p, or any of its subthreads with a valid ->mm, with
> @@ -452,13 +434,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
>  	force_sig(SIGKILL, p);
>  
> -	/*
> -	 * We give our sacrificial lamb high priority and access to
> -	 * all the memory it needs. That way it should be able to
> -	 * exit() and clear out its resources quickly...
> -	 */
> -	boost_dying_task_prio(p, mem);
> -

Before merging 93b43fa5508, we had the following routine.

+static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 {
        p = find_lock_task_mm(p);
        if (!p) {
@@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
                K(get_mm_counter(p->mm, MM_FILEPAGES)));
        task_unlock(p);
 
-       p->rt.time_slice = HZ; <<---- THIS
+
        set_tsk_thread_flag(p, TIF_MEMDIE);
        force_sig(SIGKILL, p);
+
+       /*
+        * We give our sacrificial lamb high priority and access to
+        * all the memory it needs. That way it should be able to
+        * exit() and clear out its resources quickly...
+        */
+       boost_dying_task_prio(p, mem);
+
        return 0;
 }

At that time, I thought that routine was meaningless under a non-RT scheduler.
So I Cced Peter but didn't get an answer.
I just want to confirm it.

Do you still think it's meaningless?
Is that why you removed it when you reverted 93b43fa5508?
Then this isn't just a revert patch but a revert plus a removal of
meaningless code.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-22 11:09             ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
  2011-03-23  7:49               ` KAMEZAWA Hiroyuki
  2011-03-24 15:10               ` Minchan Kim
@ 2011-03-24 17:13               ` Oleg Nesterov
  2011-03-24 17:34                 ` Linus Torvalds
  2 siblings, 1 reply; 74+ messages in thread
From: Oleg Nesterov @ 2011-03-24 17:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, linux-mm, Andrey Vagin, Hugh Dickins,
	KAMEZAWA Hiroyuki

On 03/22, KOSAKI Motohiro wrote:
>
> This patch makes pagefault interruptible by SIGKILL.

Not a comment, but the question...

> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	if (user_mode_vm(regs)) {
>  		local_irq_enable();
>  		error_code |= PF_USER;
> +		flags |= FAULT_FLAG_KILLABLE;

OK, this is clear.

I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
but check PF_USER when we get VM_FAULT_RETRY? I mean,

	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
		if (!(error_code & PF_USER))
			no_context(...);
		return;
	}


Probably not... but I can't find any example of in-kernel fault which
can be broken by -EFAULT if current was killed.

mm_release()->put_user(clear_child_tid) should be fine...

Just curious, I feel I missed something obvious.

Oleg.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-24 17:13               ` Oleg Nesterov
@ 2011-03-24 17:34                 ` Linus Torvalds
  2011-03-28  7:00                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Linus Torvalds @ 2011-03-24 17:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Rik van Riel, linux-mm, Andrey Vagin, Hugh Dickins,
	KAMEZAWA Hiroyuki

On Thu, Mar 24, 2011 at 10:13 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
> but check PF_USER when we get VM_FAULT_RETRY? I mean,
>
>        if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
>                if (!(error_code & PF_USER))
>                        no_context(...);
>                return;
>        }

I agree, we should do this.

> Probably not... but I can't find any example of in-kernel fault which
> can be broken by -EFAULT if current was killed.

There's no way that can validly break anything, since any such
codepath has to be able to handle -EFAULT for other reasons anyway.

The only issue is whether we're ok with a regular write() system call
(for example) not being atomic in the presence of a fatal signal. So
it does change semantics, but I think it changes it in a good way
(technically POSIX requires atomicity, but on the other hand,
technically POSIX also doesn't talk about the process being killed,
and writes would still be atomic for the case where they actually
return. Not to mention NFS etc where writes have never been atomic
anyway, so a program that relies on strict "all or nothing" write
behavior is fundamentally broken to begin with).

                         Linus

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] x86,mm: make pagefault killable
  2011-03-24 17:34                 ` Linus Torvalds
@ 2011-03-28  7:00                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-28  7:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Oleg Nesterov, linux-kernel, Andrew Morton,
	David Rientjes, Rik van Riel, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki

> On Thu, Mar 24, 2011 at 10:13 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
> > but check PF_USER when we get VM_FAULT_RETRY? I mean,
> >
> >        if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> >                if (!(error_code & PF_USER))
> >                        no_context(...);
> >                return;
> >        }
> 
> I agree, we should do this.
> 
> > Probably not... but I can't find any example of in-kernel fault which
> > can be broken by -EFAULT if current was killed.
> 
> There's no way that can validly break anything, since any such
> codepath has to be able to handle -EFAULT for other reasons anyway.
> 
> The only issue is whether we're ok with a regular write() system call
> (for example) not being atomic in the presence of a fatal signal. So
> it does change semantics, but I think it changes it in a good way
> (technically POSIX requires atomicity, but on the other hand,
> technically POSIX also doesn't talk about the process being killed,
> and writes would still be atomic for the case where they actually
> return. Not to mention NFS etc where writes have never been atomic
> anyway, so a program that relies on strict "all or nothing" write
> behavior is fundamentally broken to begin with).

Ok, I wasn't brave enough. Will do.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-24 15:27               ` Minchan Kim
@ 2011-03-28  9:48                 ` KOSAKI Motohiro
  2011-03-28 12:28                   ` Minchan Kim
  2011-03-28  9:51                 ` Peter Zijlstra
  1 sibling, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-28  9:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
	Luis Claudio R. Goncalves, Peter Zijlstra

Hi

> @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
>                 K(get_mm_counter(p->mm, MM_FILEPAGES)));
>         task_unlock(p);
>  
> -       p->rt.time_slice = HZ; <<---- THIS
> +
>         set_tsk_thread_flag(p, TIF_MEMDIE);
>         force_sig(SIGKILL, p);
> +
> +       /*
> +        * We give our sacrificial lamb high priority and access to
> +        * all the memory it needs. That way it should be able to
> +        * exit() and clear out its resources quickly...
> +        */
> +       boost_dying_task_prio(p, mem);
> +
>         return 0;
>  }
> 
> At that time, I thought that routine was meaningless under a non-RT scheduler.
> So I Cced Peter but didn't get an answer.
> I just want to confirm it.
> 
> Do you still think it's meaningless? 

In short, yes.


> Is that why you removed it when you reverted 93b43fa5508?
> Then this isn't just a revert patch but a revert plus a removal of
> meaningless code.

If you want, I'd be happy to rename the patch title. That said, we
can't revert 93b43fa5508 cleanly; several patches depend on it, so I
reverted it manually. And at that time, I didn't want to resurrect
meaningless logic. Anyway, it doesn't matter much. Luis is preparing
new patches, so we will get the same end result. :)




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-24 15:27               ` Minchan Kim
  2011-03-28  9:48                 ` KOSAKI Motohiro
@ 2011-03-28  9:51                 ` Peter Zijlstra
  2011-03-28 12:21                   ` Minchan Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2011-03-28  9:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
	Luis Claudio R. Goncalves

On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> 
> At that time, I thought that routine was meaningless under a non-RT scheduler.
> So I Cced Peter but didn't get an answer.
> I just want to confirm it.

Probably lost somewhere in the mess that is my inbox :/, what is the
full question?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28  9:51                 ` Peter Zijlstra
@ 2011-03-28 12:21                   ` Minchan Kim
  2011-03-28 12:28                     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-28 12:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
	Luis Claudio R. Goncalves

Hi Peter,

On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > 
> > At that time, I thought that routine was meaningless under a non-RT scheduler.
> > So I Cced Peter but didn't get an answer.
> > I just want to confirm it.
> 
> Probably lost somewhere in the mess that is my inbox :/, what is the
> full question?

The question is: we had a routine which set rt.time_slice to HZ to
accelerate task exit. But when we applied 93b43fa5508, we found it was
no longer effective for normal tasks, so we removed it. Is that right?

And Kosaki is about to revert 93b43fa5508 to work around the problem in
this thread, and Luis said he has another solution to replace
93b43fa5508. If the rt.time_slice handling is effective, we should
restore it until Luis's patch is merged.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 12:21                   ` Minchan Kim
@ 2011-03-28 12:28                     ` Peter Zijlstra
  2011-03-28 12:40                       ` Minchan Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2011-03-28 12:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
	Luis Claudio R. Goncalves

On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> Hi Peter,
> 
> On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > > 
> > > At that time, I thought that routine was meaningless under a non-RT scheduler.
> > > So I Cced Peter but didn't get an answer.
> > > I just want to confirm it.
> > 
> > Probably lost somewhere in the mess that is my inbox :/, what is the
> > full question?
> 
> > The question is: we had a routine which set rt.time_slice to HZ to
> > accelerate task exit. But when we applied 93b43fa5508, we found it was
> > no longer effective for normal tasks, so we removed it. Is that right?

rt.time_slice is only relevant to SCHED_RR, since you seem to use
SCHED_FIFO (which runs for as long as the task is runnable), it's
completely irrelevant.

> And Kosaki is about to revert 93b43fa5508 to work around the problem in
> this thread, and Luis said he has another solution to replace
> 93b43fa5508. If the rt.time_slice handling is effective, we should
> restore it until Luis's patch is merged.

Right, so only SCHED_RR is affected by time_slice, it will be
decremented on tick (so anything that avoids ticks will also avoid the
decrement) and once it reaches 0 the task will be queued at the tail of
its static priority and reset the slice. If there is no other task on
that same priority we'll again schedule that task.

In short, don't use SCHED_RR and don't worry about time_slice.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28  9:48                 ` KOSAKI Motohiro
@ 2011-03-28 12:28                   ` Minchan Kim
  0 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-28 12:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki, Luis Claudio R. Goncalves,
	Peter Zijlstra

On Mon, Mar 28, 2011 at 06:48:13PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> >                 K(get_mm_counter(p->mm, MM_FILEPAGES)));
> >         task_unlock(p);
> >  
> > -       p->rt.time_slice = HZ; <<---- THIS
> > +
> >         set_tsk_thread_flag(p, TIF_MEMDIE);
> >         force_sig(SIGKILL, p);
> > +
> > +       /*
> > +        * We give our sacrificial lamb high priority and access to
> > +        * all the memory it needs. That way it should be able to
> > +        * exit() and clear out its resources quickly...
> > +        */
> > +       boost_dying_task_prio(p, mem);
> > +
> >         return 0;
> >  }
> > 
> > At that time, I thought that routine was meaningless under a non-RT scheduler.
> > So I Cced Peter but didn't get an answer.
> > I just want to confirm it.
> > 
> > Do you still think it's meaningless? 
> 
> In short, yes.
> 
> 
> > Is that why you removed it when you reverted 93b43fa5508?
> > Then this isn't just a revert patch but a revert plus a removal of
> > meaningless code.
> 
> If you want, I'd be happy to rename the patch title. That said, we
> can't revert 93b43fa5508 cleanly; several patches depend on it, so I
> reverted it manually. And at that time, I didn't want to resurrect
> meaningless logic. Anyway, it doesn't matter much. Luis is preparing
> new patches, so we will get the same end result. :)

I don't mind it, either. :)
I just wanted to make sure the logic was meaningless.
Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 12:28                     ` Peter Zijlstra
@ 2011-03-28 12:40                       ` Minchan Kim
  2011-03-28 13:10                         ` Luis Claudio R. Goncalves
  0 siblings, 1 reply; 74+ messages in thread
From: Minchan Kim @ 2011-03-28 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
	Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
	Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
	Luis Claudio R. Goncalves

On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> > Hi Peter,
> > 
> > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > > > 
> > > > At that time, I thought that routine was meaningless under a non-RT scheduler.
> > > > So I Cced Peter but didn't get an answer.
> > > > I just want to confirm it.
> > > 
> > > Probably lost somewhere in the mess that is my inbox :/, what is the
> > > full question?
> > 
> > The question is: we had a routine which set rt.time_slice to HZ to
> > accelerate task exit. But when we applied 93b43fa5508, we found it was
> > no longer effective for normal tasks, so we removed it. Is that right?
> 
> rt.time_slice is only relevant to SCHED_RR, since you seem to use
> SCHED_FIFO (which runs for as long as the task is runnable), it's
> completely irrelevant.
> 
> > And Kosaki is about to revert 93b43fa5508 to work around the problem in
> > this thread, and Luis said he has another solution to replace
> > 93b43fa5508. If the rt.time_slice handling is effective, we should
> > restore it until Luis's patch is merged.
> 
> Right, so only SCHED_RR is affected by time_slice, it will be
> decremented on tick (so anything that avoids ticks will also avoid the
> decrement) and once it reaches 0 the task will be queued at the tail of
> its static priority and reset the slice. If there is no other task on
> that same priority we'll again schedule that task.
> 
> In short, don't use SCHED_RR and don't worry about time_slice.

There was meaningless code in there. I guess it was in there from CFS.
Thanks for the explanation, Peter.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 12:40                       ` Minchan Kim
@ 2011-03-28 13:10                         ` Luis Claudio R. Goncalves
  2011-03-28 13:18                           ` Peter Zijlstra
  2011-03-28 13:48                           ` Minchan Kim
  0 siblings, 2 replies; 74+ messages in thread
From: Luis Claudio R. Goncalves @ 2011-03-28 13:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Peter Zijlstra, KOSAKI Motohiro, linux-kernel, Andrew Morton,
	David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
	linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki

On Mon, Mar 28, 2011 at 09:40:25PM +0900, Minchan Kim wrote:
| On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
| > On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
| > > Hi Peter,
| > > 
| > > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
| > > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
| > > > > 
| > > > > At that time, I thought that routine was meaningless under a non-RT scheduler.
| > > > > So I Cced Peter but didn't get an answer.
| > > > > I just want to confirm it.
| > > > 
| > > > Probably lost somewhere in the mess that is my inbox :/, what is the
| > > > full question?
| > > 
| > > The question is: we had a routine which set rt.time_slice to HZ to
| > > accelerate task exit. But when we applied 93b43fa5508, we found it was
| > > no longer effective for normal tasks, so we removed it. Is that right?
| > 
| > rt.time_slice is only relevant to SCHED_RR, since you seem to use
| > SCHED_FIFO (which runs for as long as the task is runnable), it's
| > completely irrelevant.
| > 
| > > And Kosaki is about to revert 93b43fa5508 to work around the problem in
| > > this thread, and Luis said he has another solution to replace
| > > 93b43fa5508. If the rt.time_slice handling is effective, we should
| > > restore it until Luis's patch is merged.
| > 
| > Right, so only SCHED_RR is affected by time_slice, it will be
| > decremented on tick (so anything that avoids ticks will also avoid the
| > decrement) and once it reaches 0 the task will be queued at the tail of
| > its static priority and reset the slice. If there is no other task on
| > that same priority we'll again schedule that task.
| > 
| > In short, don't use SCHED_RR and don't worry about time_slice.
| 
| There was meaningless code in there. I guess it was in there from CFS.
| Thanks for the explanation, Peter.

Yes, it was CFS related:

	p = find_lock_task_mm(p);
	...
	p->rt.time_slice = HZ; <<---- THIS

Peter, would that be effective in boosting the priority of the dying task?
I mean, in the context of SCHED_OTHER tasks, would it really help the dying
task get scheduled sooner to release its resources? If so, since we removed
the code in commit 93b43fa5508, we should re-add that old code.

Luis
-- 
[ Luis Claudio R. Goncalves             Red Hat  -  Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9  2696 7203 D980 A448 C8F8 ]


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 13:10                         ` Luis Claudio R. Goncalves
@ 2011-03-28 13:18                           ` Peter Zijlstra
  2011-03-28 13:56                             ` Luis Claudio R. Goncalves
  2011-03-29  2:46                             ` KOSAKI Motohiro
  2011-03-28 13:48                           ` Minchan Kim
  1 sibling, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2011-03-28 13:18 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: Minchan Kim, KOSAKI Motohiro, linux-kernel, Andrew Morton,
	David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
	linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki

On Mon, 2011-03-28 at 10:10 -0300, Luis Claudio R. Goncalves wrote:
> | There was meaningless code in there. I guess it was in there from CFS.
> | Thanks for the explanation, Peter.
> 
> Yes, it was CFS related:
> 
>         p = find_lock_task_mm(p);
>         ...
>         p->rt.time_slice = HZ; <<---- THIS

CFS has never used rt.time_slice, that's always been a pure SCHED_RR
thing.

> Peter, would that be effective in boosting the priority of the dying task?

The thing you're currently doing, making it SCHED_FIFO?

> I mean, in the context of SCHED_OTHER tasks, would it really help the dying
> task get scheduled sooner to release its resources?

That very much depends on how all this stuff works. I guess if everybody
serializes on OOM, only the first will actually kill a task, all
the waiting tasks will try to allocate a page again before also doing
the OOM thing, and the pending tasks are woken after the OOM target task
has completed dying... then I don't see much point in boosting things,
since everybody interested in memory will block and eventually only the
dying task will be left running.

It's been a very long while since I stared at the OOM code...

> If so, since we removed
> the code in commit 93b43fa5508, we should re-add that old code.

It doesn't make any sense to fiddle with rt.time_slice afaict.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 13:10                         ` Luis Claudio R. Goncalves
  2011-03-28 13:18                           ` Peter Zijlstra
@ 2011-03-28 13:48                           ` Minchan Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2011-03-28 13:48 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: Peter Zijlstra, KOSAKI Motohiro, linux-kernel, Andrew Morton,
	David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
	linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki

On Mon, Mar 28, 2011 at 10:10:29AM -0300, Luis Claudio R. Goncalves wrote:
> On Mon, Mar 28, 2011 at 09:40:25PM +0900, Minchan Kim wrote:
> | On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
> | > On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> | > > Hi Peter,
> | > > 
> | > > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> | > > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> | > > > > 
> | > > > > At that time, I thought that routine was meaningless under a non-RT
> | > > > > scheduler, so I Cc'ed Peter but didn't get an answer.
> | > > > > I just want to confirm it.
> | > > > 
> | > > > Probably lost somewhere in the mess that is my inbox :/, what is the
> | > > > full question?
> | > > 
> | > > The question is: we had a routine that set rt.time_slice to HZ to
> | > > accelerate task exit. But when we applied 93b43fa5508, we found it was no
> | > > longer effective for normal tasks, so we removed it. Is that right?
> | > 
> | > rt.time_slice is only relevant to SCHED_RR; since you seem to use
> | > SCHED_FIFO (which runs for as long as the task is runnable), it's
> | > completely irrelevant.
> | > 
> | > > And Kosaki is about to revert 93b43fa5508 to track down the problem in this
> | > > thread, and Luis said he has another solution to replace 93b43fa5508.
> | > > If the rt.time_slice handling is effective, we should restore it until
> | > > Luis's patch is merged.
> | > 
> | > Right, so only SCHED_RR is affected by time_slice; it will be
> | > decremented on tick (so anything that avoids ticks will also avoid the
> | > decrement), and once it reaches 0 the task will be requeued at the tail of
> | > its static priority and the slice reset. If there is no other task at
> | > that same priority we'll again schedule that task.
> | > 
> | > In short, don't use SCHED_RR and don't worry about time_slice.
> | 
> | There was meaningless code in there. I guess it was in there from CFS.
> | Thanks for the explanation, Peter.
> 
> Yes, it was CFS related:

I think it wasn't related to CFS but to the old O(1) scheduler.
I guess when we replaced O(1) with CFS, that leftover remained.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 13:18                           ` Peter Zijlstra
@ 2011-03-28 13:56                             ` Luis Claudio R. Goncalves
  2011-03-29  2:46                             ` KOSAKI Motohiro
  1 sibling, 0 replies; 74+ messages in thread
From: Luis Claudio R. Goncalves @ 2011-03-28 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Minchan Kim, KOSAKI Motohiro, linux-kernel, Andrew Morton,
	David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
	linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki

On Mon, Mar 28, 2011 at 03:18:13PM +0200, Peter Zijlstra wrote:
| On Mon, 2011-03-28 at 10:10 -0300, Luis Claudio R. Goncalves wrote:
| > | There was meaningless code in there. I guess it was in there from CFS.
| > | Thanks for the explanation, Peter.
| > 
| > Yes, it was CFS related:
| > 
| >         p = find_lock_task_mm(p);
| >         ...
| >         p->rt.time_slice = HZ; <<---- THIS
| 
| CFS has never used rt.time_slice, that's always been a pure SCHED_RR
| thing.
| 
| > Peter, would that be effective in boosting the priority of the dying task?
| 
| The thing you're currently doing, making it SCHED_FIFO?

I meant the p->rt.time_slice line, but you already answered my question.
Thanks :)
 
| > I mean, in the context of SCHED_OTHER tasks, would it really help the dying
| > task get scheduled sooner to release its resources?
| 
| That very much depends on how all this stuff works. I guess if everybody
| serializes on OOM, only the first will actually kill a task, all
| the waiting tasks will try to allocate a page again before also doing
| the OOM thing, and the pending tasks are woken after the OOM target task
| has completed dying... then I don't see much point in boosting things,
| since everybody interested in memory will block and eventually only the
| dying task will be left running.
| 
| It's been a very long while since I stared at the OOM code...
| 
| > If so, since we removed
| > the code in commit 93b43fa5508, we should re-add that old code.
| 
| It doesn't make any sense to fiddle with rt.time_slice afaict.
---end quoted text---

-- 
[ Luis Claudio R. Goncalves             Red Hat  -  Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9  2696 7203 D980 A448 C8F8 ]


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
  2011-03-28 13:18                           ` Peter Zijlstra
  2011-03-28 13:56                             ` Luis Claudio R. Goncalves
@ 2011-03-29  2:46                             ` KOSAKI Motohiro
  1 sibling, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2011-03-29  2:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Luis Claudio R. Goncalves, Minchan Kim,
	linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
	Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin,
	Hugh Dickins, KAMEZAWA Hiroyuki

Hi

> > I mean, in the context of SCHED_OTHER tasks, would it really help the dying
> > task get scheduled sooner to release its resources?
> 
> That very much depends on how all this stuff works. I guess if everybody
> serializes on OOM, only the first will actually kill a task, all
> the waiting tasks will try to allocate a page again before also doing
> the OOM thing, and the pending tasks are woken after the OOM target task
> has completed dying... then I don't see much point in boosting things,
> since everybody interested in memory will block and eventually only the
> dying task will be left running.

Probably I can answer this question. When OOM occurs, the kernel has very
few pages left (typically 10 - 100), but not 0. Therefore a bloody page-in
vs page-out battle (aka allocation vs free battle) is running.

IOW, while we have multiple CPUs or per-CPU page queues, we don't see the
page cache become completely 0.

Therefore, the tasks that weren't killed don't sleep completely; after
page-out they may have a very small chance of a successful allocation
(but it almost always fails, as the pages are stolen by another task).

Before Luis's patch, a kernel livelock on OOM could take up to 30 minutes
to resolve; after his patch, it's resolved within 1 second. That's a big
difference in human response time. That's the test result.

Thanks.



^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2011-03-29  2:46 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-15  1:49 Linux 2.6.38 Linus Torvalds
2011-03-15  3:13 ` David Rientjes
2011-03-15  4:06   ` Steven Rostedt
2011-03-15  4:14   ` Linus Torvalds
2011-03-15  4:29     ` David Rientjes
2011-03-15  4:33   ` Andrew Morton
2011-03-15  4:50     ` David Rientjes
2011-03-15  6:21       ` Andrew Morton
2011-03-16  9:09         ` KOSAKI Motohiro
2011-03-22 11:04           ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
2011-03-22 11:05             ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
2011-03-22 14:49               ` Minchan Kim
2011-03-23  5:21                 ` KOSAKI Motohiro
2011-03-23  6:59                   ` Minchan Kim
2011-03-23  7:13                     ` KOSAKI Motohiro
2011-03-23  8:24                       ` Minchan Kim
2011-03-23  8:44                         ` KOSAKI Motohiro
2011-03-23  9:02                           ` Minchan Kim
2011-03-24  2:11                             ` KOSAKI Motohiro
2011-03-24  2:21                               ` Andrew Morton
2011-03-24  2:48                                 ` KOSAKI Motohiro
2011-03-24  3:04                                   ` Andrew Morton
2011-03-24  5:35                                     ` KOSAKI Motohiro
2011-03-24  4:19                               ` Minchan Kim
2011-03-24  5:35                                 ` KOSAKI Motohiro
2011-03-24  5:53                                   ` Minchan Kim
2011-03-24  6:16                                     ` KOSAKI Motohiro
2011-03-24  6:32                                       ` Minchan Kim
2011-03-24  7:03                                         ` KOSAKI Motohiro
2011-03-24  7:25                                           ` Minchan Kim
2011-03-24  7:28                                             ` KOSAKI Motohiro
2011-03-24  7:34                                               ` Minchan Kim
2011-03-24  7:41                                                 ` Minchan Kim
2011-03-24  7:43                                                 ` KOSAKI Motohiro
2011-03-24  7:43                                   ` Minchan Kim
2011-03-23  7:41               ` KAMEZAWA Hiroyuki
2011-03-23  7:55                 ` KOSAKI Motohiro
2011-03-22 11:08             ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
2011-03-22 23:21               ` Minchan Kim
2011-03-23  1:27                 ` KOSAKI Motohiro
2011-03-23  2:41                   ` Mike Galbraith
2011-03-22 11:08             ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
2011-03-23  7:44               ` KAMEZAWA Hiroyuki
2011-03-24 15:04               ` Minchan Kim
2011-03-22 11:09             ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
2011-03-23  7:49               ` KAMEZAWA Hiroyuki
2011-03-23  8:09                 ` KOSAKI Motohiro
2011-03-23 14:34                   ` Linus Torvalds
2011-03-24 15:10               ` Minchan Kim
2011-03-24 17:13               ` Oleg Nesterov
2011-03-24 17:34                 ` Linus Torvalds
2011-03-28  7:00                   ` KOSAKI Motohiro
     [not found]             ` <20110322200657.B064.A69D9226@jp.fujitsu.com>
     [not found]               ` <20110323164229.6b647004.kamezawa.hiroyu@jp.fujitsu.com>
2011-03-23 13:40                 ` [PATCH 2/5] Revert "oom: give the dying task a higher priority" Luis Claudio R. Goncalves
2011-03-24  0:06                   ` KOSAKI Motohiro
2011-03-24 15:27               ` Minchan Kim
2011-03-28  9:48                 ` KOSAKI Motohiro
2011-03-28 12:28                   ` Minchan Kim
2011-03-28  9:51                 ` Peter Zijlstra
2011-03-28 12:21                   ` Minchan Kim
2011-03-28 12:28                     ` Peter Zijlstra
2011-03-28 12:40                       ` Minchan Kim
2011-03-28 13:10                         ` Luis Claudio R. Goncalves
2011-03-28 13:18                           ` Peter Zijlstra
2011-03-28 13:56                             ` Luis Claudio R. Goncalves
2011-03-29  2:46                             ` KOSAKI Motohiro
2011-03-28 13:48                           ` Minchan Kim
2011-03-15 21:08       ` Linux 2.6.38 Oleg Nesterov
2011-03-15  3:14 ` Steven Rostedt
2011-03-15  4:15   ` Linus Torvalds
2011-03-16 17:30 ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Melchior FRANZ
2011-03-16 19:22   ` i915/kms regression after 2.6.38-rc8 Jiri Slaby
2011-03-16 19:43   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Chris Wilson
2011-03-16 21:09     ` i915/kms regression after 2.6.38-rc8 Melchior FRANZ
2011-03-20 18:30   ` i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38) Maciej Rutecki
