* [PATCH 00/10] OOM Debug print selection and additional information
@ 2019-08-26 19:36 Edward Chron
  2019-08-26 19:36 ` [PATCH 01/10] mm/oom_debug: Add Debug base code Edward Chron
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

This patch series adds a debug option, controlled through debugfs, that
limits how much information is printed when an OOM event occurs and/or
optionally prints additional information about slab usage, vmalloc
allocations, user process memory usage, the number of processes / tasks
with some summary information about these tasks (number runnable,
I/O wait), system information (number of CPUs, kernel version and other
useful system state), and ARP and ND cache entry information.

Linux OOM can optionally provide a lot of information, what's missing?
----------------------------------------------------------------------
Linux provides a variety of detailed information when an OOM event occurs
but has limited options to control how much output is produced. The
system related information is produced unconditionally and limited per
user process information is produced as a default enabled option. The
per user process information may be disabled.

Slab usage information was recently added and is output only if slab
usage exceeds user memory usage.

Many OOM events are due to user application memory usage, sometimes in
combination with kernel resource usage, that exceeds the expected memory
usage. Detailed information about how memory was being used when the
event occurred may be required to identify the root cause of the OOM
event.

However, some environments are very large and printing all of the
information about processes, slabs and/or vmalloc allocations may
not be feasible. For other environments, printing as much information
about these as possible may be needed to root cause OOM events.

Extensibility using OOM debug options
-------------------------------------
What is needed is an extensible system to optionally configure
debug options as needed and to then dynamically enable and disable
them. In addition, for options that produce multiple lines of
entry-based output, it should be possible to configure which entries
to print based on how much memory they use (or optionally to print
all the entries).

Limiting print entry output based on object size
------------------------------------------------
To limit output, a fixed object size could be used, such as:
vmallocs that use more than 1MB, slabs that use more than
512KB, processes that use 16MB or more of memory. Such an approach
is quite reasonable.

Using OOM's memory metrics to limit printing based on entry size
----------------------------------------------------------------
However, the current OOM implementation, which has been in use for
almost a decade, scores in units of 1/10 % of memory. This methodology
scales well as memory sizes increase. If you limit the objects you examine
to those using at least 0.1% of memory you may still get a large number of
objects but avoid printing those using a relatively small amount of memory.

Further, options that allow limiting output based on object size
can have the minimum size set to zero. In this case objects
that use even a small amount of memory will be printed.
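
The scaling argument can be illustrated with a small userspace sketch.
The helper name oom_min_pages() and its parameters are hypothetical and
are not taken from the patches; the sketch only assumes that the minimum
size is the total page count scaled by the setting in tenths of a percent.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch series): convert a threshold
 * expressed in tenths of a percent (0..1000) into a minimum size in pages.
 * Multiplying before dividing keeps precision for small settings; the
 * product cannot overflow 64 bits for any realistic page count.
 */
static uint64_t oom_min_pages(uint64_t totalpages, unsigned int tenthpercent)
{
	return totalpages * tenthpercent / 1000;
}
```

Assuming 4 KiB pages, a 16 GiB system has 4194304 pages, so a setting of
1 selects objects of at least 4194 pages (about 16 MiB). On a 1 TiB system
the same setting selects objects of about 1 GiB, which is why a relative
threshold scales better than a fixed size.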

Use of debugfs to allow dynamic controls
----------------------------------------
By providing a debugfs interface that allows options to be configured
and enabled and, where appropriate, a minimum size to be set for
selecting entries to print, the output produced when an OOM event occurs
can be dynamically adjusted to produce as little or as much detail as
needed for a given system.

OOM debug options can be added to the base code as needed.

Currently we have the following OOM debug options defined:

* System State Summary
  --------------------
  One line of output that includes:
  - Uptime (days, hours, minutes, seconds)
  - Number of CPUs
  - Machine Type
  - Node name
  - Domain name
  - Kernel Release
  - Kernel Version

  Example output when configured and enabled:

Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019

* Tasks Summary
  -------------
  One line of output that includes:
  - Number of Threads
  - Number of processes
  - Forks since boot
  - Processes that are runnable
  - Processes that are in iowait

  Example output when configured and enabled:

Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0

* ARP Table and/or Neighbour Discovery Table Summary
  --------------------------------------------------
  One line of output each for ARP and ND that includes:
  - Table name
  - Table size (max # entries)
  - Key Length
  - Entry Size
  - Number of Entries
  - Last Flush (in seconds)
  - hash grows
  - entry allocations
  - entry destroys
  - Number of lookups
  - Number of lookup hits
  - Resolution failures
  - Garbage Collection Forced Runs
  - Table Full
  - Proxy Queue Length

  Example output when configured and enabled (for both):

... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0

... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0

* Add Select Slabs Print
  ----------------------
  Allow select slab entries (based on a minimum size) to be printed.
  Minimum size is specified as a percentage of total RAM in tenths of a
  percent, consistent with existing OOM process scoring. Valid values
  range from 0 to 1000: 0 prints all slab entries (all slabs that have at
  least one slab object in use), while 1000 would require a slab to use
  100% of memory, which can't happen, so in that case only summary
  information is printed.

  The first line of output is the standard Linux output header for
  OOM printed Slab entries. This header looks like this:

Aug  6 09:37:21 egc103 yourserver: Unreclaimable slab info:

  The output is the existing slab entry memory usage, limited so that only
  entries equal to or larger than the minimum size are printed.
  Empty slabs (no slab objects in use) are never printed.

  Additional output consists of summary information that is printed
  at the end of the output. This summary information includes:
  - # entries examined
  - # entries selected and printed
  - minimum entry size for selection
  - Slabs total size (kB)
  - Slabs reclaimable size (kB)
  - Slabs unreclaimable size (kB)

  Example Summary output when configured and enabled:

Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB

Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB
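
  The selection and summary behaviour described above can be sketched in
  userspace C. Everything here is illustrative: struct slab_entry,
  struct select_summary and print_selected_slabs() are hypothetical names,
  not kernel types or functions from the patches. The sketch assumes that
  empty slabs count as examined but are never printed, which is consistent
  with the sample summary (123 examined, 83 printed at minsize 0kB).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record standing in for one slab cache; not a kernel type. */
struct slab_entry {
	const char *name;
	uint64_t size_kb;	/* memory used by the cache, in kB */
	uint64_t active_objs;	/* objects currently in use */
};

/* Counters reported on the summary line. */
struct select_summary {
	unsigned int examined;
	unsigned int printed;
};

/* Print entries at or above minsize_kb, then a summary line. */
static struct select_summary
print_selected_slabs(const struct slab_entry *e, size_t n, uint64_t minsize_kb)
{
	struct select_summary s = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		s.examined++;
		if (e[i].active_objs == 0)
			continue;	/* empty slabs are never printed */
		if (e[i].size_kb < minsize_kb)
			continue;	/* below the configured minimum */
		printf("%-24s %10llukB\n", e[i].name,
		       (unsigned long long)e[i].size_kb);
		s.printed++;
	}
	printf("Summary: Slab entries examined: %u printed: %u minsize: %llukB\n",
	       s.examined, s.printed, (unsigned long long)minsize_kb);
	return s;
}
```

  With a minimum size of 0 every non-empty entry is printed; raising the
  minimum reduces the printed count while the examined count stays the same.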

* Add Select Vmalloc allocations Print
  ------------------------------------
  Allow select vmalloc entries (based on a minimum size) to be printed.
  Minimum size is specified as a percentage of total RAM in tenths of a
  percent, consistent with existing OOM process scoring. Valid values
  range from 0 to 1000: 0 prints all vmalloc entries (all vmalloc
  allocations that have at least one page in use), while 1000 would
  require a vmalloc to use 100% of memory, which can't happen, so in that
  case only summary information is printed.

  The first line of output is a new Vmalloc output header for
  OOM printed Vmalloc entries. This header looks like this:

Aug 19 19:27:01 yourserver kernel: Vmalloc Info:

  The output is vmalloc entry information, limited so that only
  entries equal to or larger than the minimum size are printed.
  Unused vmallocs (no pages assigned to the vmalloc) are never printed.
  The vmalloc entry information includes:
  - Size (in bytes)
  - pages (Number of pages in use)
  - Caller Information to identify the request

  A sample vmalloc entry output looks like this:

Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113

  Additional output consists of summary information that is printed
  at the end of the output. This summary information includes:
  - Number of Vmalloc entries examined
  - Number of Vmalloc entries printed
  - minimum entry size for selection

  A sample Vmalloc Summary output looks like this:

Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB

* Add Select Process Entries Print
  --------------------------------
  Allow select process entries (based on a minimum size) to be printed.
  Minimum size is specified as a percentage of totalpages (RAM + swap)
  in tenths of a percent, consistent with existing OOM process scoring.
  Note: user process memory can be swapped out when swap space is present,
  which is why swap space and RAM together comprise the totalpages
  used to calculate the percentage of memory a process is using.
  Valid values range from 0 to 1000: 0 prints all user processes (that
  have valid mm sections and aren't exiting), while 1000 would require a
  user process to use 100% of memory, which can't happen, so in that case
  only summary information is printed.
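
  The choice of basis can be sketched as follows. The enum and the helper
  oom_basis_pages() are hypothetical names used only for illustration; the
  sketch assumes the rule stated in this cover letter: processes are
  measured against RAM plus swap, slabs and vmallocs against RAM only.

```c
#include <stdint.h>

/* Hypothetical entry kinds; these names are not from the patch series. */
enum oom_entry_kind {
	OOM_ENTRY_PROCESS,
	OOM_ENTRY_SLAB,
	OOM_ENTRY_VMALLOC,
};

/*
 * Return the page count a tenths-of-a-percent threshold is applied to.
 * Only process memory can be swapped out, so only processes include swap.
 */
static uint64_t oom_basis_pages(enum oom_entry_kind kind,
				uint64_t ram_pages, uint64_t swap_pages)
{
	if (kind == OOM_ENTRY_PROCESS)
		return ram_pages + swap_pages;
	return ram_pages;
}
```

  As a worked example using the summary line shown below for this option,
  a totalpages value of 32791608kB with the default setting of 10 (1%)
  gives a minimum size of 32791608 * 10 / 1000 = 327916kB.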

  The first lines of output are the standard Linux output headers for
  OOM printed User Processes. These headers look like this:

Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages):
Aug 19 19:27:01 yourserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name

  The output is the existing per user process data, limited so that only
  entries equal to or larger than the minimum size are printed.

Jul 21 20:07:48 yourserver kernel: [    579]     0   579     7942     1010          90112        0         -1000 systemd-udevd

  Additional output consists of summary information that is printed
  at the end of the output. This summary information includes:

Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB

* Add Slab Select Always Print Enable
  -----------------------------------
  This option will enable slab entries to be printed even when slab
  memory usage does not exceed the standard Linux user memory usage
  print trigger. The Standard OOM event Slab entry print trigger is
  that slab memory usage exceeds user memory usage. This covers cases
  where the Kernel or Kernel drivers are driving slab memory usage up
  causing it to be excessive. However, OOM events are often caused by
  user processes using too much memory. In some cases where user memory
  usage is higher, the amount of slab memory consumed can still be an
  important factor in determining what caused the OOM event. In such
  cases it would be useful to have slab memory usage printed for any
  slab entries using a significant amount of memory.

  The output format does not change. Enabling the option simply
  causes whatever slabs are print eligible (from the Select Slabs
  option, which this option depends on) to be printed on any OOM
  event, regardless of whether the memory usage by slabs exceeds
  user memory usage or not.

* Add Enhanced Slab Print Information
  -----------------------------------
  For any slab entries that are print eligible (from the Select Slabs
  option, which this option depends on) print some additional
  details about the slab that can be useful for root causing
  OOM events.

  Output information for each enhanced slab entry includes:
  - Used space (KiB)
  - Total space (KiB)
  - Active objects
  - Total Objects
  - Object size
  - Aligned object size
  - Object per Slab
  - Pages per Slab
  - Active Slabs
  - Total Slabs
  - Slab name

  The header for enhanced slab entries is revised and looks like this:

Aug 19 19:27:01 coronado kernel:   UsedKiB   TotalKiB  ActiveObj   TotalObj   ObjSize AlignSize Objs/Slab Pgs/Slab ActiveSlab  TotalSlab Slab_Name

  Each enhanced slab entry is similar to the following output format:

Aug 19 19:27:01 coronado kernel:      9016       9016     384710     384710        24        24       170        1       2263       2263 avtab_node


* Add Enhanced Process Print Information
  --------------------------------------
  Add OOM Debug code that prints additional detailed information about
  user processes that were considered for OOM killing, for any print
  selected processes. The information is displayed for each user process
  that OOM prints in the output.

  This supplemental per user process information is very helpful for
  determining how process memory is used, allowing OOM event root cause
  identification that might not otherwise be possible.

  Output information for enhanced user process entries printed includes:
  - pid
  - parent pid
  - ruid
  - euid
  - tgid
  - Process State (S)
  - utime in seconds
  - stime in seconds
  - oom_score_adjust
  - task comm value (name of process)
  - Vmem KiB
  - MaxRss KiB
  - CurRss KiB
  - Pte KiB
  - Swap KiB
  - Sock KiB
  - Lib KiB
  - Text KiB
  - Heap KiB
  - Stack KiB
  - File KiB
  - Shmem KiB
  - Read Pages
  - Fault Pages
  - Lock KiB
  - Pinned KiB

  The headers for processes change to match the data being printed:

Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in KiB):

...: [  pid  ]    ppid    ruid    euid    tgid S  utimeSec  stimeSec   VmemKiB MaxRssKiB CurRssKiB    PteKiB   SwapKiB   SockKiB     LibKiB   TextKiB   HeapKiB  StackKiB   FileKiB  ShmemKiB     ReadPgs    FaultPgs   LockKiB PinnedKiB Adjust name

  A few entries that print formatted to match the second header:

...: [    570]       1       0       0     570 S     0.530     0.105     31632     12064      3864        88         0       416       9500       208      3608       132        36         0          60       41615         0         0  -1000 systemd-udevd
...: [    759]       1       0       0     759 S     1.264     0.545     17196      6072       788        72         0       624       8912        32       596       132         0         0           0           0         0         0      0 rngd
...: [   1626]    1553   10383   10383    1626 S     9.417     2.355   3347904    336316    231672       924         0       416      56452        16    170656       276      2116    150756           4        2309         0         0      0 gnome-shell
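
  The standard header reports memory values in pages while the enhanced
  header reports KiB. A minimal sketch of that conversion follows; it uses
  the same shift as the K() macro in mm/oom_kill_debug.c from patch 01.
  SKETCH_PAGE_SHIFT and pages_to_kib() are hypothetical names, and the
  value 12 (4 KiB pages) is an assumption that holds on x86_64 but not on
  every architecture or configuration.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

/* Convert a page count to KiB, as the K() macro in the patch does. */
static uint64_t pages_to_kib(uint64_t pages)
{
	return pages << (SKETCH_PAGE_SHIFT - 10);
}
```

  For example, the rss of 1010 pages shown in the standard sample entry
  above corresponds to 4040 KiB under this assumption.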

Configuring Patches:
-------------------
OOM Debug and any options you want to use must first be configured so
the code is included in your kernel. This requires selecting kernel
config file options. You will find config options to select under:

Kernel hacking ---> Memory Debugging --->

[*] Debug OOM
    [*] Debug OOM System State
    [*] Debug OOM System Tasks Summary
    [*] Debug OOM ARP Table
    [*] Debug OOM ND Table
    [*] Debug OOM Select Slabs Print
       [*] Debug OOM Slabs Select Always Print Enable
       [*] Debug OOM Enhanced Slab Print
    [*] Debug OOM Select Vmallocs Print
    [*] Debug OOM Select Process Print
       [*] Debug OOM Enhanced Process Print

The hierarchy shown also displays the dependencies between these OOM
Debug options. Everything depends on Debug OOM, as that is where the base
code that all options require is located. Process has an Enhanced output
but requires Select Process to be enabled so you can limit the output,
since you're asking for more details. The same is true for Slabs: the
Enhanced output requires Select Slabs, and so does Slabs Select Always
Print, to ensure you can limit your output if you need to.

Dynamic enable/disable and setting entry minsize for Options
------------------------------------------------------------
As mentioned, all options can be dynamically disabled and re-enabled.
The Select Options also allow setting a minimum entry size to limit
entry printing based on the amount of memory entries use, on the OOM
scale of 0% to 100% in 1/10 % increments (0-1000). This is implemented
in debugfs. Entries for OOM Debug are defined in the
/sys/kernel/debug/oom directory.

Arbitrary default values have been selected. The default is to enable
configured options and to set minimum entry size to 10 which is 1% of
the memory (or memory plus swap for processes). The choice was to
make sure by default you don't get a lot of data just for enabling an
option. Here is what the current defaults are set to for all the
OOM Debug options we currently have defined:

[root@yourserver ~]# grep "" /sys/kernel/debug/oom/*
/sys/kernel/debug/oom/arp_table_summary_enabled:Y
/sys/kernel/debug/oom/nd_table_summary_enabled:Y
/sys/kernel/debug/oom/process_enhanced_print_enabled:Y
/sys/kernel/debug/oom/process_select_print_enabled:Y
/sys/kernel/debug/oom/process_select_print_tenthpercent:10
/sys/kernel/debug/oom/slab_enhanced_print_enabled:Y
/sys/kernel/debug/oom/slab_select_always_print_enabled:Y
/sys/kernel/debug/oom/slab_select_print_enabled:Y
/sys/kernel/debug/oom/slab_select_print_tenthpercent:10
/sys/kernel/debug/oom/system_state_summary_enabled:Y
/sys/kernel/debug/oom/tasks_summary_enabled:Y
/sys/kernel/debug/oom/vmalloc_select_print_enabled:Y
/sys/kernel/debug/oom/vmalloc_select_print_tenthpercent:10

You can disable or re-enable options in the appropriate enable file
or adjust the minimum size value in the appropriate tenthpercent file
as needed.
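
As an illustration, the control files can be written from a small
userspace program as well as from a shell. The helper oom_debug_write()
below is a hypothetical name, not part of the patches; it uses only
standard C I/O and assumes debugfs is mounted at /sys/kernel/debug and
that the caller has permission to write there (normally root).

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a short value such as "0", "1" or "25" to
 * an OOM debug control file. Returns 0 on success and -1 on failure.
 */
static int oom_debug_write(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a write error can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

A call such as
oom_debug_write("/sys/kernel/debug/oom/slab_select_print_tenthpercent", "25")
would raise the slab minimum size to 2.5% of memory, and writing "0" to
one of the ..._enabled files would disable that option.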

---------------------------------------------------------------------

Edward Chron (10):
  mm/oom_debug: Add Debug base code
  mm/oom_debug: Add System State Summary
  mm/oom_debug: Add Tasks Summary
  mm/oom_debug: Add ARP and ND Table Summary usage
  mm/oom_debug: Add Select Slabs Print
  mm/oom_debug: Add Select Vmalloc Entries Print
  mm/oom_debug: Add Select Process Entries Print
  mm/oom_debug: Add Slab Select Always Print Enable
  mm/oom_debug: Add Enhanced Slab Print Information
  mm/oom_debug: Add Enhanced Process Print Information

 include/linux/oom.h     |   1 +
 include/linux/vmalloc.h |  12 +
 include/net/neighbour.h |  12 +
 mm/Kconfig.debug        | 228 +++++++++++++
 mm/Makefile             |   1 +
 mm/oom_kill.c           |  83 ++++-
 mm/oom_kill_debug.c     | 736 ++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill_debug.h     |  58 ++++
 mm/slab.h               |   4 +
 mm/slab_common.c        |  94 +++++
 mm/vmalloc.c            |  43 +++
 net/core/neighbour.c    |  78 +++++
 12 files changed, 1339 insertions(+), 11 deletions(-)
 create mode 100644 mm/oom_kill_debug.c
 create mode 100644 mm/oom_kill_debug.h

-- 
2.20.1



* [PATCH 01/10] mm/oom_debug: Add Debug base code
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
@ 2019-08-26 19:36 ` Edward Chron
  2019-08-27 13:28   ` kbuild test robot
  2019-08-26 19:36 ` [PATCH 02/10] mm/oom_debug: Add System State Summary Edward Chron
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

OOM Debug code to control/limit information and to provide additional
information that is printed when an OOM event occurs.

The code provides some additional information and can selectively
limit the amount of information produced by an OOM event.
Additional information printed at the time of an OOM event can prove
invaluable in determining the root cause of an OOM event, which is the
purpose of OOM event reports.

Additional OOM information is provided as configurable options that, once
configured, can be dynamically enabled and disabled. For OOM debug
options that can potentially output a number of records, a mechanism is
provided to dynamically adjust which records are output based on the
amount of memory an object is using. The minimum size of objects to print
is specified in units of 1/10% of the memory size.

By providing an extensible debugfs interface that allows options to be
configured and enabled and, where appropriate, a minimum size to be set
for selecting entries to print, the output produced when an OOM event
occurs can be dynamically adjusted to produce as little or as much detail
as needed for a given system. This is useful both in production and in
test and development to debug and root cause OOM events.

-----------------------------------------------------------------------

Overview of configurable OOM Event Debug Options
------------------------------------------------
This patch provides common code needed for the various OOM debug options
to allow them to be selected for configuration. For configured options it
provides the debugfs code needed to allow them to be dynamically enabled
or disabled and, if enabled, to specify the print rate limiting
adjustment value if the option supports rate limiting.

New OOM Debug options should use and extend the base code provided here.
When possible add any new OOM debug code to mm/oom_kill_debug.c.
Configured options are compiled and included with the kernel.

To configure an option go to: Kernel hacking ---> Memory Debugging --->
Select: [*] Debug OOM to enable this OOM Debug base code and select
any OOM Debug Options as needed.

Implementation of dynamic controls using debugfs
------------------------------------------------
Each configured OOM debug option also includes code that allows the option
to be dynamically enabled or disabled. For options that can produce
many lines of output a print rate limiting adjustment is also available.
The print rate limiting adjustment allows the amount of output for the
option for an OOM event to be adjusted.

Options may be dynamically enabled through the debugfs OOM debug interface
which can be found in entries under: /sys/kernel/debug/oom
Each configured OOM debug option adds one or two files in this directory.
All configured options add an enable file and options that can output a
number of entries add a second tenthpercent file to specify a minimum
size that entries must be to be printed, to help limit print output.

Dynamic enabled / disabled options
----------------------------------
In the oom debugfs directory there will always be an enabled file for
each option that is configured. The ..._enabled file for each configured
option can be used to enable or disable that option. A value of 1 is
enabled (which is the default setting) and a value of 0 is disabled.

Dynamic control of entry printing based on memory size
------------------------------------------------------
For each Select Print type of OOM debug option a second file
tenthpercent is present. The value specified in this file can range
from 0 to 1000. This value is used to specify the minimum memory
or memory and swap space (depending on the option) size the entry must
occupy to be selected for printing.

The value is specified in tenths of a percent of memory, just as
oom_score and oom_score_adj are specified. Specifying a value of zero
permits all entries for this option to be printed. A value of 1
specifies that entries must be using 0.1% of the total memory, or of
total memory and total swap space, to be selected for print. A value of
10 specifies that entries must consume 1% or more, and this can be
increased up to 1000, which specifies that the entry must be using 100%
of memory. Entries can't possibly use 100% of memory, so if the
..._tenthpercent file has a value approaching 1000 no entries will be
printed, but summary information will still be printed if the option is
configured and enabled. By default each configured Select Print OOM debug
option has a print limiting minimum entry size of 10, or 1% of memory.
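
The valid range can be illustrated with a userspace parsing sketch.
parse_tenthpercent() is a hypothetical helper, not code from this patch,
and how the kernel side treats out-of-range writes is not shown here; the
sketch simply rejects anything outside the documented 0 to 1000 range.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a tenthpercent setting from text.
 * Accepts decimal values 0..1000 with an optional trailing newline.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
static int parse_tenthpercent(const char *s, unsigned int *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno || end == s)
		return -1;		/* overflow or no digits at all */
	if (*end == '\n')
		end++;			/* tolerate "10\n" as read from a file */
	if (*end != '\0' || v > 1000)
		return -1;		/* trailing junk or out of range */
	*out = (unsigned int)v;
	return 0;
}
```

A parsed value of 0 corresponds to printing every entry, 10 to the 1%
default, and 1000 to printing summary information only.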

---------------------------------------------------------------------


Signed-off-by: Edward Chron <echron@arista.com>
---
 mm/Kconfig.debug    |  17 +++
 mm/Makefile         |   1 +
 mm/oom_kill.c       |   4 +
 mm/oom_kill_debug.c | 267 ++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill_debug.h |  20 ++++
 5 files changed, 309 insertions(+)
 create mode 100644 mm/oom_kill_debug.c
 create mode 100644 mm/oom_kill_debug.h

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 82b6a20898bd..5610da5fa614 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -115,3 +115,20 @@ config DEBUG_RODATA_TEST
     depends on STRICT_KERNEL_RWX
     ---help---
       This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_OOM
+	bool "Debug OOM"
+	depends on DEBUG_KERNEL
+	depends on DEBUG_FS
+	help
+	  This feature enables OOM Debug common code needed to enable one
+	  or more OOM debug options that when enabled provide additional
+	  details about an OOM event. This debug option provides the common
+	  code needed to help configure the OOM options in the kernel config
+	  file and also the common code used to dynamically disable or
+	  re-enable any configured options. Some options also provide print
+	  rate limiting based on memory usage to reduce print output. The
+	  common code for print rate limiting is also provided here. This
+	  option is a prerequisite for selecting any OOM debugging options.
+
+	  If unsure, say N
diff --git a/mm/Makefile b/mm/Makefile
index d0b295c3b764..4bd7c137871c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,3 +105,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_DEBUG_OOM) += oom_kill_debug.o
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a0bdc6..c10d61fe944f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@
 #include <linux/mmu_notifier.h>
 
 #include <asm/tlb.h>
+#include "oom_kill_debug.h"
 #include "internal.h"
 #include "slab.h"
 
@@ -465,6 +466,9 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 		if (is_dump_unreclaim_slabs())
 			dump_unreclaimable_slab();
 	}
+#ifdef CONFIG_DEBUG_OOM
+	oom_kill_debug_oom_event_is();
+#endif
 	if (sysctl_oom_dump_tasks)
 		dump_tasks(oc);
 	if (p)
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
new file mode 100644
index 000000000000..af07e662c808
--- /dev/null
+++ b/mm/oom_kill_debug.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  linux/mm/oom_kill_debug.c
+ *
+ *  Copyright (C) 2019 Arista Networks Inc.
+ *  Author: Edward G. Chron (echron@arista.com)
+ *
+ *  OOM Debugfs Extensions to the Linux Out Of Memory Code found in
+ *  linux/mm/oom_kill.c
+ *
+ *  Debug OOM code, if enabled, allows supplemental output to be produced at
+ *  the time of an OOM event. It uses the Debugfs file system to allow
+ *  the various options available to be enabled and to control the amount
+ *  of output they produce for options that can produce more than a few lines
+ *  of output.
+ *
+ *  CONFIG_DEBUG_OOM Enables generic OOM Debug Common Code Options
+ *  All other options require this option to be specified and it enables
+ *  the compilation of this module.
+ *
+ *  Debugfs OOM code for enabling and disabling OOM debug options and also for
+ *  setting rate limiting values for any OOM debug options that support rate
+ *  limiting of what they print is provided.
+ *
+ *  Debug OOM options when configured are found under /sys/kernel/debug/oom
+ *  Each option has either one or two files in this directory, depending
+ *  on the number of settings the option supports:
+ *
+ *  - All options have an enabled file that is set to true or false which
+ *    signifies: true - the option is enabled, false - the option is disabled.
+ *  - Select options also have a tenthpercent file to hold the percentage
+ *    of totalpages (memory and swap space totals) that is the minimum size
+ *    of totalpages the entry needs to be using to be printed.
+ *
+ *  The totalpages used depends on the option because some options are
+ *  examining kernel objects that can have pages swapped out to swap space,
+ *  while others only occupy ram memory pages.
+ *
+ *  Note: The totalpages used value: total ram memory pages + swap pages
+ *        for process and memory file system space that is swappable.
+ *        For slabs and vmalloc the total ram memory pages should be used.
+ *
+ *  Options are found as file options under the base oom debugfs directory:
+ *  /sys/kernel/debug/oom
+ *
+ *  The following option setting files are found in the oom debugfs directory
+ *  as specified above, one for each entry in the Options / Directory
+ *  Supported Option Settings files specified above:
+ *
+ *  Option Setting / Filename:
+ *  -------------------------
+ *  Enabled / ..._enabled
+ *  ---------------------
+ *  Enable / Disable is stored as either a value of one or zero respectively.
+ *  The default for configured options is Enabled (set to 1)
+ *
+ *  So for option Tasks Summary you'll find an entry in the OOM debugfs at:
+ *  /sys/kernel/debug/oom/tasks_summary_enabled
+ *
+ *  Tenths of a % totalpages Usage Print Limit / ..._select_print_tenthpercent
+ *  ---------------------------------------------------------------------------
+ *  Rate limiting is supplied as a value of zero to 1000 representing units of
+ *  one tenth of a percent of totalpages. A value of zero prints all entries,
+ *  a value of 1000 prints no entries, just summary information and values
+ *  of between 1-999 print entries using from 0.1% to 99.9% of totalpages.
+ *
+ *  For processes, the totalpages is total ram pages + total swap pages.
+ *  For slabs, vmallocs and in memory filesystems the totalpages consists
+ *  of total ram, since none of those are held in swappable memory pages.
+ *
+ *  For option Process Select Print you'll find an entry in the OOM debugfs at:
+ *  /sys/kernel/debug/oom/process_select_print_tenthpercent
+ *
+ *  Adding a new OOM Debug Option:
+ *  -----------------------------
+ *  - A Kernel config option needs to be added to mm/Kconfig.debug and it
+ *    should depend on DEBUG_OOM. This will make your code configurable so
+ *    for systems that don't need your option it won't be compiled.
+ *    Your option should be named as config DEBUG_OOM_<YOUR_OPTION>
+ *  - Add an entry for your configuration with CONFIG_DEBUG_OOM_<YOUR_OPTION>
+ *    to the oom_debug_options_table[] as the last entry in the table.
+ *    You just need to define two fields in your entry, format like this:
+ *
+ *      #ifdef CONFIG_DEBUG_OOM_<YOUR_OPTION>
+ *	{
+ *		.option_name	= "oom_kill_debug_<YOUR_OPTION>"
+ *		.support_tpercent = true or false,
+ *	},
+ *      #endif
+ *
+ *    where .support_tpercent should be set to true if your option supports
+ *    controlling output with the tenth of a percent option. Only options
+ *    that can produce more than a few lines of output, one for each object
+ *    of some type (like user processes, slabs, vmalloc entries) will need
+ *    this control set to true. So most likely you want to set this to false.
+ *  - Add an entry to the enum oom_debug_options_index list just above the
+ *    last entry which is the OUT_OF_BOUNDS entry. The format should be:
+ *
+ *      #ifdef CONFIG_DEBUG_OOM_<YOUR_OPTION>
+ *		YOUR_OPTION_STATE,
+ *      #endif
+ *
+ *  - You need to add your code to produce your output.
+ *    Unless your option must live in another module to access data there you
+ *    should add your code to mm/oom_kill_debug.c to keep as much of the
+ *    OOM Debug code in one place as possible. You should add your code with
+ *    the config conditional so you only get compiled into the kernel if
+ *    configured. Your code in mm/oom_kill_debug.c should look like this:
+ *
+ *      #ifdef CONFIG_DEBUG_OOM_<YOUR_OPTION>
+ *      static void oom_kill_debug_<your_option>(void)
+ *      {
+ *		<your code>
+ *      }
+ *      #endif
+ *
+ *  - Invoke your code. Ideally, if your code is located in mm/oom_kill_debug.c
+ *    then you can just invoke it from oom_kill_debug_oom_event_is(). Wrap
+ *    your invocation code in the same config conditional ifdef and endif,
+ *    and in your invocation code check to see that your option is enabled
+ *    before calling it:
+ *
+ *      #ifdef CONFIG_DEBUG_OOM_<YOUR_OPTION>
+ *		if (oom_kill_debug_enabled(YOUR_OPTION_STATE))
+ *			oom_kill_debug_<your_option>();
+ *      #endif
+ *
+ *    If your code cannot be invoked from mm/oom_kill_debug.c you will need
+ *    to add an external accessor reference in mm/oom_kill_debug.h and then
+ *    your code in mm/oom_kill_debug.c cannot be static. See code in
+ *    mm/oom_kill_debug.h for examples on how this is done.
+ *
+ */
+#include <linux/types.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+#include <linux/oom.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include "oom_kill_debug.h"
+
+#define OOMD_MAX_FNAME 48
+#define OOMD_MAX_OPTNAME 32
+
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
+static const char oom_debug_path[] = "/sys/kernel/debug/oom";
+
+static const char od_root_name[] = "oom";
+static struct dentry *od_root_dir;
+static u32 oom_kill_debug_oom_events;
+
+/* One oom_debug_option entry per debug option */
+struct oom_debug_option {
+	const char *option_name;
+	umode_t mode;
+	struct dentry *dir_dentry;
+	struct dentry *enabled_dentry;
+	struct dentry *tenthpercent_dentry;
+	bool enabled;
+	u16 tenthpercent;
+	bool support_tpercent;
+};
+
+/* Table of oom debug options, new options need to be added here */
+static struct oom_debug_option oom_debug_options_table[] = {
+	{}
+};
+
+/* Option index by name for order one-lookup, add new options entry here */
+enum oom_debug_options_index {
+	OUT_OF_BOUNDS
+};
+
+bool oom_kill_debug_enabled(u16 index)
+{
+	return oom_debug_options_table[index].enabled;
+}
+
+u16 oom_kill_debug_tenthpercent(u16 index)
+{
+	return oom_debug_options_table[index].tenthpercent;
+}
+
+static void filename_gen(char *pdest, const char *optname, const char *fname)
+{
+	size_t len;
+	char *pmsg;
+
+	sprintf(pdest, "%s", optname);
+	len = strnlen(pdest, OOMD_MAX_OPTNAME);
+	pmsg = pdest + len;
+	sprintf(pmsg, "%s", fname);
+}
+
+static void enabled_file_gen(struct oom_debug_option *entry)
+{
+	char filename[OOMD_MAX_FNAME];
+
+	filename_gen(filename, entry->option_name, "enabled");
+	debugfs_create_bool(filename, 0644, entry->dir_dentry,
+			    &entry->enabled);
+	entry->enabled = OOM_KILL_DEBUG_DEFAULT_ENABLED;
+}
+
+static void tpercent_file_gen(struct oom_debug_option *entry)
+{
+	char filename[OOMD_MAX_FNAME];
+
+	filename_gen(filename, entry->option_name, "tenthpercent");
+	debugfs_create_u16(filename, 0644, entry->dir_dentry,
+			   &entry->tenthpercent);
+	entry->tenthpercent = OOM_KILL_DEBUG_DEFAULT_TENTHPERCENT;
+}
+
+static void oom_debugfs_init(void)
+{
+	struct oom_debug_option *table, *entry;
+
+	od_root_dir = debugfs_create_dir(od_root_name, NULL);
+
+	table = oom_debug_options_table;
+	for (entry = table; entry->option_name; entry++) {
+		entry->dir_dentry = od_root_dir;
+		enabled_file_gen(entry);
+		if (entry->support_tpercent)
+			tpercent_file_gen(entry);
+	}
+}
+
+static void oom_debug_common_cleanup(void)
+{
+	/* Cleanup for oom root directory */
+	debugfs_remove(od_root_dir);
+}
+
+u32 oom_kill_debug_oom_event(void)
+{
+	return oom_kill_debug_oom_events;
+}
+
+u32 oom_kill_debug_oom_event_is(void)
+{
+	++oom_kill_debug_oom_events;
+
+	return oom_kill_debug_oom_events;
+}
+
+static void __init oom_debug_init(void)
+{
+	/* Ensure we have a debugfs oom root directory */
+	od_root_dir = debugfs_lookup(od_root_name, NULL);
+	if (!od_root_dir)
+		oom_debugfs_init();
+}
+subsys_initcall(oom_debug_init)
+
+static void __exit oom_debug_exit(void)
+{
+	/* Cleanup for debugfs oom files and directories */
+	oom_debug_common_cleanup();
+}
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
new file mode 100644
index 000000000000..7288969db9ce
--- /dev/null
+++ b/mm/oom_kill_debug.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  mm / oom_kill_debug.h  Internal oom kill debug definitions.
+ *
+ *  Copyright (C) 2019 Arista Networks Inc. All Rights Reserved.
+ *  Written by Edward G. Chron (echron@arista.com)
+ */
+
+#ifndef __MM_OOM_KILL_DEBUG_H__
+#define __MM_OOM_KILL_DEBUG_H__
+
+extern u32 oom_kill_debug_oom_event_is(void);
+extern u32 oom_kill_debug_oom_event(void);
+extern bool oom_kill_debug_enabled(u16 index);
+extern u16 oom_kill_debug_tenthpercent(u16 index);
+
+#define OOM_KILL_DEBUG_DEFAULT_ENABLED true
+#define OOM_KILL_DEBUG_DEFAULT_TENTHPERCENT 10
+
+#endif /* __MM_OOM_KILL_DEBUG_H__ */
-- 
2.20.1
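
The table, enum and file naming conventions described in the header comment
of mm/oom_kill_debug.c can be modelled in userspace. The following is a
minimal sketch, not code from the patch: struct option_model, the
"example_select_print_" entry, EXAMPLE_SELECT, option_filename() and
option_count() are all hypothetical names used only for illustration. It
shows that the table entries and the enum must stay in the same order, and
that a debugfs file name is the option prefix followed by "enabled" or
"tenthpercent".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define OOMD_MAX_FNAME 48

/* Userspace stand-in for struct oom_debug_option (debugfs fields omitted). */
struct option_model {
	const char *option_name;	/* file name prefix, ends with '_' */
	bool support_tpercent;
	bool enabled;
};

/* Table order must match the enum order below; the last entry is a sentinel. */
static struct option_model option_table[] = {
	{ .option_name = "system_state_summary_", .support_tpercent = false },
	{ .option_name = "example_select_print_", .support_tpercent = true },
	{ 0 }
};

enum option_index {
	SYSTEM_STATE,
	EXAMPLE_SELECT,		/* hypothetical option */
	OUT_OF_BOUNDS
};

/* Bounded variant of filename_gen(): the prefix followed by the suffix. */
static int option_filename(char *dest, size_t size, const char *prefix,
			   const char *suffix)
{
	int n;

	assert(dest && prefix && suffix);
	n = snprintf(dest, size, "%s%s", prefix, suffix);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Count entries by walking to the sentinel, like the debugfs init loop. */
static size_t option_count(void)
{
	size_t n = 0;

	while (option_table[n].option_name)
		n++;
	return n;
}
```

If option_count() ever differs from OUT_OF_BOUNDS, an enum value no longer
indexes the entry it was meant for, which is why both lists need the same
config conditionals in the same order.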


* [PATCH 02/10] mm/oom_debug: Add System State Summary
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

When selected, prints the number of CPUs online at the time of the OOM
event. Also prints nodename, domainname, machine type, kernel release
and version, and system uptime. Produces a single line of output holding
this information.

This information helps determine the state the system was in when the
event was triggered, which is useful for debugging, performance
measurement and security analysis.

Configuring this Debug Option (DEBUG_OOM_SYSTEM_STATE)
------------------------------------------------------
To enable the option, configure it in the OOM Debugging configuration
menu. The kernel configuration entry is DEBUG_OOM_SYSTEM_STATE, found
under: Kernel hacking, Memory Debugging, OOM Debugging.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The debugfs file name prefix for this option is: system_state_summary_
and the only file for this option is the enabled file.

This option may be disabled or re-enabled using the debugfs enabled file
for this OOM debug option, found at:
/sys/kernel/debug/oom/system_state_summary_enabled
The file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled and a value of 0 is disabled.
When configured, the option is enabled by default.
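
Toggling the option from userspace amounts to writing 1 or 0 to the enabled
file. The following is a minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug and the caller has permission to write there. The helper
names enabled_path(), oom_option_set_enabled() and oom_option_get_enabled()
are hypothetical. Files created with debugfs_create_bool() read back as Y or
N, so the reader accepts both spellings.

```c
#include <assert.h>
#include <stdio.h>

/* Build "<dir>/<prefix>enabled"; returns 0 on success, -1 if it won't fit. */
static int enabled_path(char *path, size_t size, const char *dir,
			const char *prefix)
{
	int n;

	assert(path && dir && prefix);
	n = snprintf(path, size, "%s/%senabled", dir, prefix);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write "1" or "0" to the option's enabled file; returns 0 on success. */
static int oom_option_set_enabled(const char *dir, const char *prefix, int on)
{
	char path[256];
	FILE *f;

	if (enabled_path(path, sizeof(path), dir, prefix))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(on ? "1\n" : "0\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Read the current setting; returns 1, 0, or -1 on error. */
static int oom_option_get_enabled(const char *dir, const char *prefix)
{
	char path[256];
	FILE *f;
	int c;

	if (enabled_path(path, sizeof(path), dir, prefix))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);
	if (c == '1' || c == 'Y' || c == 'y')
		return 1;
	if (c == '0' || c == 'N' || c == 'n')
		return 0;
	return -1;
}
```

For this option the call would be
oom_option_set_enabled("/sys/kernel/debug/oom", "system_state_summary_", 0)
to disable the summary line and the same call with 1 to re-enable it.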

Content and format of System State Summary Output
-------------------------------------------------
  One line of output that includes:
  - Uptime (days, hours, minutes, seconds)
  - Number of CPUs
  - Machine Type
  - Node name
  - Domain name
  - Kernel Release
  - Kernel Version

Sample Output:
-------------
Sample System State Summary message:

Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27
 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain
 Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
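
The uptime and node/domain fields in the sample can be reproduced with the
same arithmetic the patch uses. The following is a userspace sketch only:
split_uptime(), format_uptime() and split_nodename() are hypothetical helper
names, and the input is assumed to be whole seconds (the patch rounds a
partial second up before splitting).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct uptime_parts {
	unsigned long days;
	unsigned int hours, mins, secs;
};

/* Break an uptime in seconds into days, hours, minutes and seconds. */
static struct uptime_parts split_uptime(unsigned long uptime_secs)
{
	struct uptime_parts p;
	unsigned long rem = uptime_secs % 86400;

	p.days = uptime_secs / 86400;
	p.hours = rem / 3600;
	rem %= 3600;
	p.mins = rem / 60;
	p.secs = rem % 60;
	return p;
}

/* Format as in the sample output, e.g. "0 days 00:17:27". */
static void format_uptime(char *buf, size_t size, unsigned long uptime_secs)
{
	struct uptime_parts p = split_uptime(uptime_secs);

	assert(buf && size > 0);
	snprintf(buf, size, "%lu days %02u:%02u:%02u",
		 p.days, p.hours, p.mins, p.secs);
}

/* Split "node.domain" at the first '.'; domain is "(none)" without one. */
static void split_nodename(const char *full, char *node, size_t node_sz,
			   char *domain, size_t domain_sz)
{
	const char *dot = strchr(full, '.');

	if (dot) {
		snprintf(node, node_sz, "%.*s", (int)(dot - full), full);
		snprintf(domain, domain_sz, "%s", dot + 1);
	} else {
		snprintf(node, node_sz, "%s", full);
		snprintf(domain, domain_sz, "%s", "(none)");
	}
}
```

An uptime of 1047 seconds formats as "0 days 00:17:27", matching the sample
line above, and "yoursystem.localdomain" splits into Node:yoursystem and
Domain:localdomain.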


Signed-off-by: Edward Chron <echron@arista.com>
---
 mm/Kconfig.debug    | 15 +++++++++
 mm/oom_kill_debug.c | 81 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 5610da5fa614..dbe599b67a3b 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -132,3 +132,18 @@ config DEBUG_OOM
 	  option is a prerequisite for selecting any OOM debugging options.
 
 	  If unsure, say N
+
+config DEBUG_OOM_SYSTEM_STATE
+	bool "Debug OOM System State"
+	depends on DEBUG_OOM
+	help
+	  When enabled, provides one line of output on an oom event to
+	  document the state of the system when the oom event occurred.
+	  Prints: uptime, # threads, # processes, system memory size in KiB
+	  and swap space size in KiB, nodename, domainname, machine type,
+	  kernel release and version. If configured it is enabled/disabled
+	  by setting the enabled file entry in the debugfs OOM interface
+	  at: /sys/kernel/debug/oom/system_state_summary_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index af07e662c808..6eeaad86fca8 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -144,6 +144,14 @@
 #include <linux/sysfs.h>
 #include "oom_kill_debug.h"
 
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+#include <linux/cpumask.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/utsname.h>
+#include <linux/sched/stat.h>
+#endif
+
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
 
@@ -169,11 +177,20 @@ struct oom_debug_option {
 
 /* Table of oom debug options, new options need to be added here */
 static struct oom_debug_option oom_debug_options_table[] = {
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+	{
+		.option_name	= "system_state_summary_",
+		.support_tpercent = false,
+	},
+#endif
 	{}
 };
 
 /* Option index by name for order one-lookup, add new options entry here */
 enum oom_debug_options_index {
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+	SYSTEM_STATE,
+#endif
 	OUT_OF_BOUNDS
 };
 
@@ -244,10 +261,74 @@ u32 oom_kill_debug_oom_event(void)
 	return oom_kill_debug_oom_events;
 }
 
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+/*
+ * oom_kill_debug_system_summary_prt - provides one line of output to document
+ *                      some of the system state at the time of an oom event.
+ *                      Output line includes: uptime, # threads, # processes,
+ *                      system memory size in KiB and swap space size in KiB,
+ *                      nodename, domainname, machine type, kernel release
+ *                      and version.
+ */
+static void oom_kill_debug_system_summary_prt(void)
+{
+	struct new_utsname *p_uts;
+	char domainname[256];
+	unsigned long upsecs;
+	unsigned short hours;
+	struct timespec64 tp;
+	unsigned short days;
+	unsigned short mins;
+	unsigned short secs;
+	char nodename[256];
+	size_t nodesize;
+	char *p_wend;
+	long uptime;
+	int procs;
+
+	p_uts = utsname();
+
+	memset(nodename, 0, sizeof(nodename));
+	memset(domainname, 0, sizeof(domainname));
+
+	p_wend = strchr(p_uts->nodename, '.');
+	if (p_wend != NULL) {
+		nodesize = p_wend - p_uts->nodename;
+		++p_wend;
+		strncpy(nodename, p_uts->nodename, nodesize);
+		strcpy(domainname, p_wend);
+	} else {
+		strcpy(nodename, p_uts->nodename);
+		strcpy(domainname, "(none)");
+	}
+
+	procs = nr_processes();
+
+	ktime_get_boottime_ts64(&tp);
+	uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
+
+	days = uptime / 86400;
+	upsecs = uptime - (days * 86400);
+	hours = upsecs / 3600;
+	upsecs = upsecs - (hours * 3600);
+	mins = upsecs / 60;
+	secs = upsecs - (mins * 60);
+
+	pr_info("System Uptime:%hu days %02hu:%02hu:%02hu CPUs:%u Machine:%s Node:%s Domain:%s Kernel Release:%s Version:%s\n",
+		days, hours, mins, secs, num_online_cpus(), p_uts->machine,
+		nodename, domainname, p_uts->release, p_uts->version);
+}
+#endif /* CONFIG_DEBUG_OOM_SYSTEM_STATE */
+
 u32 oom_kill_debug_oom_event_is(void)
 {
 	++oom_kill_debug_oom_events;
 
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+	if (oom_kill_debug_enabled(SYSTEM_STATE))
+		oom_kill_debug_system_summary_prt();
+#endif
+
 	return oom_kill_debug_oom_events;
 }
 
-- 
2.20.1


* [PATCH 03/10] mm/oom_debug: Add Tasks Summary
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

Adds a config option and code to support printing a Process / Thread
Summary of process / thread activity when an OOM event occurs. The
information provided includes the number of processes and threads active,
the total number of forks that have happened since the system booted and
the number of runnable and I/O blocked processes. All values are at the
time of the OOM event.

Configuring this Debug Option (DEBUG_OOM_TASKS_SUMMARY)
-------------------------------------------------------
To get the tasks information summary this option must be configured.
The Tasks Summary option uses the CONFIG_DEBUG_OOM_TASKS_SUMMARY kernel
config option, which is found in the kernel config under:
Kernel hacking, Memory Debugging, OOM Debugging. The config option
to select is: DEBUG_OOM_TASKS_SUMMARY.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The debugfs file name prefix for this option is: tasks_summary_
and there is just one file for this option, the enabled file.

The option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this option is found at:
/sys/kernel/debug/oom/tasks_summary_enabled
The value of the option's enabled file determines whether the facility is
enabled or disabled. A value of 1 is enabled and a value of 0 is disabled.
When configured, the option is enabled by default.

Content and format of Tasks Summary Output
------------------------------------------
One line of output that includes:
  - Number of Threads
  - Number of processes
  - Forks since boot
  - Processes that are runnable
  - Processes that are in iowait

Sample Output:
-------------
Sample Tasks Summary message output:

Aug 13 18:52:48 yoursystem kernel: Threads: 492 Processes: 248
 forks_since_boot: 7786 procs_runable: 4 procs_iowait: 0
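
For comparison on a running system, the fork and run-queue counters are also
visible from userspace in /proc/stat as the "processes", "procs_running" and
"procs_blocked" lines. The following is a minimal sketch of reading one of
them. proc_stat_value() is a hypothetical helper name, and the path is a
parameter so the function can be pointed at a saved copy of the file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the line "<key> <number>" in a /proc/stat style file and store the
 * number in *out. Returns 0 on success, -1 if the file or key is missing.
 */
static int proc_stat_value(const char *path, const char *key,
			   unsigned long *out)
{
	char line[256];
	size_t klen = strlen(key);
	int found = -1;
	FILE *f;

	assert(out);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen, "%lu", out) == 1) {
			found = 0;
			break;
		}
		/* Skip the rest of a line longer than the buffer. */
		if (!strchr(line, '\n')) {
			int c;

			while ((c = fgetc(f)) != EOF && c != '\n')
				;
		}
	}
	fclose(f);
	return found;
}
```

Calling proc_stat_value("/proc/stat", "processes", &v) returns the number of
forks since boot, which should be close to the forks_since_boot value an OOM
report printed at about the same time.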


Signed-off-by: Edward Chron <echron@arista.com>
---
 mm/Kconfig.debug    | 16 ++++++++++++++++
 mm/oom_kill_debug.c | 27 +++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index dbe599b67a3b..fcbc5f9aa146 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -147,3 +147,19 @@ config DEBUG_OOM_SYSTEM_STATE
 	  A value of 1 is enabled (default) and a value of 0 is disabled.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_TASKS_SUMMARY
+	bool "Debug OOM System Tasks Summary"
+	depends on DEBUG_OOM
+	help
+	  When enabled, provides a kernel process/thread summary recording
+	  the system's process/thread activity at the time of an OOM event.
+	  The number of processes and of threads, the number of runnable
+	  and I/O blocked threads, the number of forks since boot and the
+	  number of oom eligible and oom ineligible tasks are provided in
+	  the output. If configured it is enabled/disabled by setting the
+	  enabled file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/tasks_summary_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 6eeaad86fca8..395b3307f822 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -152,6 +152,10 @@
 #include <linux/sched/stat.h>
 #endif
 
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+#include <linux/sched/stat.h>
+#endif
+
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
 
@@ -182,6 +186,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "system_state_summary_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+	{
+		.option_name	= "tasks_summary_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -190,6 +200,9 @@ static struct oom_debug_option oom_debug_options_table[] = {
 enum oom_debug_options_index {
 #ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
 	SYSTEM_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+	TASKS_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -320,6 +333,15 @@ static void oom_kill_debug_system_summary_prt(void)
 }
 #endif /* CONFIG_DEBUG_OOM_SYSTEM_STATE */
 
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+static void oom_kill_debug_tasks_summary_print(void)
+{
+	pr_info("Threads:%d Processes:%d forks_since_boot:%lu procs_runable:%lu procs_iowait:%lu\n",
+		nr_threads, nr_processes(),
+		total_forks, nr_running(), nr_iowait());
+}
+#endif /* CONFIG_DEBUG_OOM_TASKS_SUMMARY */
+
 u32 oom_kill_debug_oom_event_is(void)
 {
 	++oom_kill_debug_oom_events;
@@ -329,6 +351,11 @@ u32 oom_kill_debug_oom_event_is(void)
 		oom_kill_debug_system_summary_prt();
 #endif
 
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+	if (oom_kill_debug_enabled(TASKS_STATE))
+		oom_kill_debug_tasks_summary_print();
+#endif
+
 	return oom_kill_debug_oom_events;
 }
 
-- 
2.20.1


* [PATCH 04/10] mm/oom_debug: Add ARP and ND Table Summary usage
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron, David S. Miller, netdev

Adds config options and code to support printing ARP Table usage and/or
Neighbour Discovery Table usage when an OOM event occurs. When configured,
this summarized information provides the memory usage for each table.

Configuring these two OOM Debug Options
---------------------------------------
There are two OOM debug options: CONFIG_DEBUG_OOM_ARP_TBL and
CONFIG_DEBUG_OOM_ND_TBL. To get the output for both tables, both options
must be configured. The ARP Table uses the CONFIG_DEBUG_OOM_ARP_TBL kernel
config option and the ND Table uses the CONFIG_DEBUG_OOM_ND_TBL kernel
config option. Both are found in the kernel config under:
Kernel hacking, Memory Debugging, OOM Debugging. The ARP Table and
ND Table are configured there with the options DEBUG_OOM_ARP_TBL and
DEBUG_OOM_ND_TBL respectively.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The debugfs file name prefixes for these options are: arp_table_summary_
and nd_table_summary_ and there is just one enabled file for each.

Either option may be disabled or re-enabled using the debugfs entry for
the OOM debug option. The debugfs file to enable the ARP Table option
is found at: /sys/kernel/debug/oom/arp_table_summary_enabled
Similarly, the debugfs file to enable the ND Table option is found at:
/sys/kernel/debug/oom/nd_table_summary_enabled
For either option, the value of its enabled file determines whether the
facility is enabled or disabled for that option. A value of 1 is enabled
and a value of 0 is disabled. When configured, each option is enabled by
default. Each option produces one line of output.

Content and format of ARP and Neighbour Discovery Tables Summary Output
-----------------------------------------------------------------------
  One line of output each for ARP and ND that includes:
  - Table name
  - Table size (max # entries)
  - Key Length
  - Entry Size
  - Number of Entries
  - Last Flush (in seconds)
  - hash grows
  - entry allocations
  - entry destroys
  - Number lookups
  - Number of lookup hits
  - Resolution failures
  - Garbage Collection Forced Runs
  - Table Full
  - Proxy Queue Length

Sample Output:
-------------
Here is sample output for both the ARP table and ND table:

Jul 23 23:26:34 yoursystem kernel: neighbour: Table: arp_tbl size:   256
 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s
 hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199
 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0

Jul 23 23:26:34 yoursystem kernel: neighbour: Table:  nd_tbl size:   128
 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s
 hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0
 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0
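
The counters in these lines are kept per CPU and summed when the line is
printed. The following userspace sketch models that summation and derives a
lookup hit rate from the totals. It is an illustration only: struct
cpu_neigh_stats, struct neigh_totals, sum_neigh_stats() and
hit_rate_tenthpercent() are hypothetical names, and only four of the
counters are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the per-CPU neighbour table counters. */
struct cpu_neigh_stats {
	unsigned long allocs;
	unsigned long destroys;
	unsigned long lookups;
	unsigned long hits;
};

/* Totals across all CPUs, wide enough to hold the sums. */
struct neigh_totals {
	unsigned long long allocs;
	unsigned long long destroys;
	unsigned long long lookups;
	unsigned long long hits;
};

/* Add up each counter over ncpus per-CPU records. */
static struct neigh_totals sum_neigh_stats(const struct cpu_neigh_stats *percpu,
					   size_t ncpus)
{
	struct neigh_totals t = { 0, 0, 0, 0 };
	size_t cpu;

	assert(percpu || ncpus == 0);
	for (cpu = 0; cpu < ncpus; cpu++) {
		t.allocs += percpu[cpu].allocs;
		t.destroys += percpu[cpu].destroys;
		t.lookups += percpu[cpu].lookups;
		t.hits += percpu[cpu].hits;
	}
	return t;
}

/* Lookup hit rate in tenths of a percent (0 when there were no lookups). */
static unsigned int hit_rate_tenthpercent(const struct neigh_totals *t)
{
	if (t->lookups == 0)
		return 0;
	return (unsigned int)(t->hits * 1000 / t->lookups);
}
```

With the arp_tbl sample above (lookups 204, hits 199) the hit rate works out
to 975 tenths of a percent, or 97.5%. In both sample lines the entries value
equals allocs minus destroys, which is a quick consistency check when reading
a report.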


Signed-off-by: Edward Chron <echron@arista.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
 include/net/neighbour.h | 12 +++++++
 mm/Kconfig.debug        | 26 ++++++++++++++
 mm/oom_kill_debug.c     | 38 ++++++++++++++++++++
 net/core/neighbour.c    | 78 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 154 insertions(+)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 50a67bd6a434..35fdecff2724 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -569,4 +569,16 @@ static inline void neigh_update_is_router(struct neighbour *neigh, u32 flags,
 		*notify = 1;
 	}
 }
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+/**
+ * Routine used to print arp table and neighbour table statistics.
+ * Output goes to dmesg along with all the other OOM related messages
+ * when the config options DEBUG_OOM_ARP_TBL and DEBUG_OOM_ND_TBL are set to
+ * yes, for the ARP table and Neighbour discovery table respectively.
+ */
+extern void neightbl_print_stats(const char * const tblname,
+				 struct neigh_table * const neightable);
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
+
 #endif
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fcbc5f9aa146..fe4bb5ce0a6d 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -163,3 +163,29 @@ config DEBUG_OOM_TASKS_SUMMARY
 	  A value of 1 is enabled (default) and a value of 0 is disabled.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_ARP_TBL
+	bool "Debug OOM ARP Table"
+	depends on DEBUG_OOM
+	help
+	  When enabled, documents kernel memory usage by the ARP Table
+	  entries at the time of an OOM event. Output is one line of
+	  summarized ARP Table usage. If configured it is enabled/disabled
+	  by setting the enabled file entry in the debugfs OOM interface
+	  at: /sys/kernel/debug/oom/arp_table_summary_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
+
+config DEBUG_OOM_ND_TBL
+	bool "Debug OOM ND Table"
+	depends on DEBUG_OOM
+	help
+	  When enabled, documents kernel memory usage by the ND Table
+	  entries at the time of an OOM event. Output is one line of
+	  summarized ND Table usage. If configured it is enabled/disabled
+	  by setting the enabled file entry in the debugfs OOM interface
+	  at: /sys/kernel/debug/oom/nd_table_summary_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 395b3307f822..c4a9117633fd 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -156,6 +156,16 @@
 #include <linux/sched/stat.h>
 #endif
 
+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+#include <net/arp.h>
+#endif
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/ndisc.h>
+#endif
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/neighbour.h>
+#endif
+
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
 
@@ -192,6 +202,18 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "tasks_summary_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+	{
+		.option_name	= "arp_table_summary_",
+		.support_tpercent = false,
+	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+	{
+		.option_name	= "nd_table_summary_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -203,6 +225,12 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 	TASKS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+	ARP_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+	ND_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -351,6 +379,16 @@ u32 oom_kill_debug_oom_event_is(void)
 		oom_kill_debug_system_summary_prt();
 #endif
 
+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+	if (oom_kill_debug_enabled(ARP_STATE))
+		neightbl_print_stats("arp_tbl", &arp_tbl);
+#endif
+
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+	if (oom_kill_debug_enabled(ND_STATE))
+		neightbl_print_stats("nd_tbl", &nd_tbl);
+#endif
+
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 	if (oom_kill_debug_enabled(TASKS_STATE))
 		oom_kill_debug_tasks_summary_print();
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f79e61c570ea..9f5a579542a9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3735,3 +3735,81 @@ static int __init neigh_init(void)
 }
 
 subsys_initcall(neigh_init);
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+void neightbl_print_stats(const char * const tblname,
+			  struct neigh_table * const tbl)
+{
+	struct neigh_hash_table *nht;
+	struct ndt_stats ndst;
+	u32 now;
+	u32 flush_delta;
+	u32 tblsize;
+	u16 key_len;
+	u16 entry_size;
+	u32 entries;
+	u32 last_flush;    /* delta to now in secs */
+	u32 hash_shift;
+	u32 proxy_qlen;
+	int cpu;
+
+	read_lock_bh(&tbl->lock);
+	now = jiffies;
+	flush_delta = now - tbl->last_flush;
+
+	key_len = tbl->key_len;
+	if (tbl->entry_size)
+		entry_size = tbl->entry_size;
+	else
+		entry_size = ALIGN(offsetof(struct neighbour, primary_key) +
+				   key_len, NEIGH_PRIV_ALIGN);
+
+	entries = atomic_read(&tbl->entries);
+	if (entries == 0)
+		goto out_tbl_unlock;
+
+	/* last flush was last_flush seconds ago */
+	last_flush = jiffies_to_msecs(flush_delta) / 1000;
+	proxy_qlen = tbl->proxy_queue.qlen;
+
+	rcu_read_lock_bh();
+	nht = rcu_dereference_bh(tbl->nht);
+	if (nht)
+		hash_shift = nht->hash_shift + 1;
+	rcu_read_unlock_bh();
+	if (!nht)
+		goto out_tbl_unlock;
+
+	memset(&ndst, 0, sizeof(ndst));
+	for_each_possible_cpu(cpu) {
+		struct neigh_statistics *st;
+
+		st = per_cpu_ptr(tbl->stats, cpu);
+		ndst.ndts_allocs		+= st->allocs;
+		ndst.ndts_destroys		+= st->destroys;
+		ndst.ndts_hash_grows		+= st->hash_grows;
+		ndst.ndts_res_failed		+= st->res_failed;
+		ndst.ndts_lookups		+= st->lookups;
+		ndst.ndts_hits			+= st->hits;
+		ndst.ndts_periodic_gc_runs	+= st->periodic_gc_runs;
+		ndst.ndts_forced_gc_runs	+= st->forced_gc_runs;
+		ndst.ndts_table_fulls		+= st->table_fulls;
+	}
+
+	read_unlock_bh(&tbl->lock);
+	tblsize = (1 << hash_shift) * sizeof(struct neighbour *);
+	if (tblsize > PAGE_SIZE)
+		tblsize = get_order(tblsize);
+
+	pr_info("Table:%7s size:%5u keyLen:%2hu entrySize:%3hu entries:%5u lastFlush:%5us hGrows:%5llu allocs:%5llu destroys:%5llu lookups:%5llu hits:%5llu resFailed:%5llu gcRuns/Forced:%3llu / %2llu tblFull:%2llu proxyQlen:%2u\n",
+		tblname, tblsize, key_len, entry_size, entries, last_flush,
+		ndst.ndts_hash_grows, ndst.ndts_allocs, ndst.ndts_destroys,
+		ndst.ndts_lookups, ndst.ndts_hits, ndst.ndts_res_failed,
+		ndst.ndts_periodic_gc_runs, ndst.ndts_forced_gc_runs,
+		ndst.ndts_table_fulls, proxy_qlen);
+	return;
+
+out_tbl_unlock:
+	read_unlock_bh(&tbl->lock);
+}
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
-- 
2.20.1


* [PATCH 05/10] mm/oom_debug: Add Select Slabs Print
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

Add OOM Debug code to allow selected slab entries to be printed at the
time of an OOM event. Linux already prints slab entries on an OOM event
if the amount of memory used by slabs exceeds the amount of memory used
by user processes. This OOM Debug option allows only slab entries of a
specified minimum entry size to be printed, limiting the amount of print
output an OOM event generates for slab entries.

Configuring this OOM Debug Option (DEBUG_OOM_SLAB_SELECT_PRINT)
---------------------------------------------------------------
To enable this OOM debug option, configure it in the OOM Debugging
configuration menu. The kernel configuration entry is
DEBUG_OOM_SLAB_SELECT_PRINT, found under: Kernel hacking,
Memory Debugging, OOM Debugging.

Two dynamic OOM debug settings for this option: enable, tenthpercent
--------------------------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The debugfs file name prefix for this option is: slab_select_print_
and, as with the other select options, there are two debugfs files:
the enabled file and the tenthpercent file.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found
at: /sys/kernel/debug/oom/slab_select_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled and a value of 0 is disabled. When configured,
the option is enabled by default.

Specifying the minimum entry size (0-1000) in the tenthpercent file
-------------------------------------------------------------------
For DEBUG_OOM_SLAB_SELECT_PRINT the number of slab entries printed is
also adjustable. By default, if the DEBUG_OOM_SLAB_SELECT_PRINT config
option is enabled, entries that use 1% or more of memory are printed. This
can be adjusted to select entries as small as 0% of memory or as large as
100% of memory, in which case only a summary line is printed, as no slab
entry could possibly use 100% of memory. Adjustments are made in the
debugfs file found at: /sys/kernel/debug/oom/slab_select_print_tenthpercent
Valid entry values are 0 through 1000, which represent memory usage of 0%
to 100% of memory. A value of 0 prints all slabs that have at least one
slab in use; unused slabs are not printed.
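
The tenthpercent setting translates into a minimum size as a fraction of
total pages. The following sketch shows that arithmetic under stated
assumptions: min_pages_for() and entry_selected() are hypothetical names, the
comparison is assumed to be "greater than or equal to", and the code in this
series that makes the selection may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimum size in pages for a setting of 0..1000 tenths of a percent. */
static unsigned long min_pages_for(unsigned long totalpages,
				   unsigned int tenthpercent)
{
	if (tenthpercent > 1000)
		tenthpercent = 1000;
	/* Widen before multiplying so large page counts cannot overflow. */
	return (unsigned long)((unsigned long long)totalpages *
			       tenthpercent / 1000);
}

/* Decide whether an entry using entry_pages would be printed. */
static bool entry_selected(unsigned long entry_pages,
			   unsigned long totalpages,
			   unsigned int tenthpercent)
{
	assert(totalpages > 0);
	if (entry_pages == 0)		/* unused entries are never printed */
		return false;
	if (tenthpercent >= 1000)	/* 100%: summary only */
		return false;
	return entry_pages >= min_pages_for(totalpages, tenthpercent);
}
```

With 1,000,000 total pages and the default setting of 10 (1%), the minimum
is 10,000 pages, so an entry using 10,000 pages is printed and one using
9,999 is not. A setting of 0 selects every entry with at least one page in
use.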

Content of Slab Summary Records Output
--------------------------------------
Additional output consists of summary information that is printed
at the end of the output. This summary information includes:
  - # entries examined
  - # entries selected and printed
  - minimum entry size for selection
  - Slabs total size (kB)
  - Slabs reclaimable size (kB)
  - Slabs unreclaimable size (kB)

Sample Output
-------------
Output produced consists of the standard output currently produced
by Linux for slab entries plus two lines of summary information.
(The standard output provides a section header and entry per slab)

Summary output (minsize = 0kB, all entries with > 0 slabs in use printed):

Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123
 printed: 83 minsize: 0kB

Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB
 Unreclaim: 100580kB
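
The relationship between the examined and printed counts can be
illustrated with a user-space sketch of the selection loop. The struct
and function names are hypothetical stand-ins for the kernel's slabinfo
and kmem_cache data; the check follows oom_debug_slab_check() in this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields the patch reads per slab cache. */
struct slab_entry {
	const char *name;
	unsigned long active_objs;
	unsigned long num_objs;
	unsigned long obj_size;		/* bytes per object */
};

struct slab_summary {
	unsigned int examined;
	unsigned int printed;
};

/* Print entries whose in-use size is at least min_kb, then a summary. */
static struct slab_summary slab_select_print(const struct slab_entry *e,
					     size_t n, unsigned long min_kb)
{
	struct slab_summary s = { 0, 0 };
	size_t i;

	assert(e != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long used_kb = (e[i].active_objs * e[i].obj_size) / 1024;

		s.examined++;
		/* Caches with no objects are never printed, even at min_kb 0. */
		if (e[i].num_objs > 0 && used_kb >= min_kb) {
			printf("%-17s %10luKB %10luKB\n", e[i].name, used_kb,
			       (e[i].num_objs * e[i].obj_size) / 1024);
			s.printed++;
		}
	}
	printf("Summary: Slab entries examined:%u printed:%u minsize:%lukB\n",
	       s.examined, s.printed, min_kb);
	return s;
}
```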


Signed-off-by: Edward Chron <echron@arista.com>
---
 mm/Kconfig.debug    | 30 +++++++++++++++++++++
 mm/oom_kill.c       | 11 +++++++-
 mm/oom_kill_debug.c | 42 +++++++++++++++++++++++++++++
 mm/oom_kill_debug.h |  4 +++
 mm/slab.h           |  4 +++
 mm/slab_common.c    | 65 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fe4bb5ce0a6d..c7d53ca95d32 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -189,3 +189,33 @@ config DEBUG_OOM_ND_TBL
 	  A value of 1 is enabled (default) and a value of 0 is disabled.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_SLAB_SELECT_PRINT
+	bool "Debug OOM Select Slabs Print"
+	depends on DEBUG_OOM
+	help
+	  When enabled, allows the number of unreclaimable slab entries
+	  to be print rate limited based on the amount of memory the
+	  slab entry is consuming. By default all slab entries with at
+	  least one object in use are printed if the trigger condition is
+	  met to dump slab entries.
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/slab_select_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  When enabled entries are print limited by the amount of memory
+	  they consume. The setting value defines the minimum memory
+	  size consumed and is represented in tenths of a percent.
+	  Values supported are 0 to 1000 where 0 allows all entries to be
+	  printed, 1 would allow entries using 0.1% or more to be printed,
+	  10 would allow entries using 1% or more of memory to be printed.
+
+	  If configured and enabled the rate limiting memory percentage
+	  is specified by setting a value in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/slab_select_print_tenthpercent
+	  If configured the default settings are set to enabled and
+	  print limit value of 10 or 1% of memory.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c10d61fe944f..9022297fa2ba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -438,6 +438,15 @@ static void dump_tasks(struct oom_control *oc)
 	}
 }
 
+static void oom_kill_unreclaimable_slabs_print(void)
+{
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+	if (oom_kill_debug_unreclaimable_slabs_print())
+		return;
+#endif
+	dump_unreclaimable_slab();
+}
+
 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
 {
 	/* one line summary of the oom killer context. */
@@ -464,7 +473,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 	else {
 		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
 		if (is_dump_unreclaim_slabs())
-			dump_unreclaimable_slab();
+			oom_kill_unreclaimable_slabs_print();
 	}
 #ifdef CONFIG_DEBUG_OOM
 	oom_kill_debug_oom_event_is();
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index c4a9117633fd..2b5245e1134d 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -165,6 +165,9 @@
 #if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
 #include <net/neighbour.h>
 #endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+#include "slab.h"
+#endif
 
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
@@ -214,6 +217,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "nd_table_summary_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+	{
+		.option_name	= "slab_select_print_",
+		.support_tpercent = true,
+	},
 #endif
 	{}
 };
@@ -231,6 +240,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_ND_TBL
 	ND_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+	SELECT_SLABS_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -361,6 +373,36 @@ static void oom_kill_debug_system_summary_prt(void)
 }
 #endif /* CONFIG_DEBUG_OOM_SYSTEM_STATE */
 
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+static inline u16 oom_kill_debug_slabs_tenthpercent(void)
+{
+	return oom_kill_debug_tenthpercent(SELECT_SLABS_STATE);
+}
+
+static void oom_kill_debug_slabs_and_summary_print(void)
+{
+	u16 pcttenth = oom_kill_debug_slabs_tenthpercent();
+	unsigned long minkb = (K(totalram_pages()) * pcttenth) / 1000;
+
+	slab_common_oom_debug_dump_unreclaimable(minkb);
+
+	pr_info("Slabs Total: %lukB Reclaim: %lukB Unreclaim: %lukB\n",
+		K((global_node_page_state(NR_SLAB_RECLAIMABLE) +
+		   global_node_page_state(NR_SLAB_UNRECLAIMABLE))),
+		K(global_node_page_state(NR_SLAB_RECLAIMABLE)),
+		K(global_node_page_state(NR_SLAB_UNRECLAIMABLE)));
+}
+
+bool oom_kill_debug_unreclaimable_slabs_print(void)
+{
+	if (oom_kill_debug_enabled(SELECT_SLABS_STATE)) {
+		oom_kill_debug_slabs_and_summary_print();
+		return true;
+	}
+	return false;
+}
+#endif /* CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT */
+
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 static void oom_kill_debug_tasks_summary_print(void)
 {
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index 7288969db9ce..549b8da179d0 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,10 @@
 #ifndef __MM_OOM_KILL_DEBUG_H__
 #define __MM_OOM_KILL_DEBUG_H__
 
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+extern bool oom_kill_debug_unreclaimable_slabs_print(void);
+#endif
+
 extern u32 oom_kill_debug_oom_event_is(void);
 extern u32 oom_kill_debug_event(void);
 extern bool oom_kill_debug_enabled(u16 index);
diff --git a/mm/slab.h b/mm/slab.h
index 9057b8056b07..703e914efedc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -586,10 +586,14 @@ int memcg_slab_show(struct seq_file *m, void *p);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
 void dump_unreclaimable_slab(void);
+void slab_common_oom_debug_dump_unreclaimable(unsigned long minkb);
 #else
 static inline void dump_unreclaimable_slab(void)
 {
 }
+static inline void slab_common_oom_debug_dump_unreclaimable(unsigned long minkb)
+{
+}
 #endif
 
 void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..9ddc95040b60 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1450,6 +1450,71 @@ void dump_unreclaimable_slab(void)
 	mutex_unlock(&slab_mutex);
 }
 
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+static void oom_debug_slab_header_print(void)
+{
+	pr_info("Unreclaimable slab info:\n");
+	pr_info("Name                      Used          Total\n");
+}
+
+static void oom_debug_slab_print(struct slabinfo *psi, struct kmem_cache *pkc)
+{
+	pr_info("%-17s %10luKB %10luKB\n", cache_name(pkc),
+		(psi->active_objs * pkc->size) / 1024,
+		(psi->num_objs * pkc->size) / 1024);
+}
+
+static bool oom_debug_slab_check(struct slabinfo *psi, struct kmem_cache *pkc,
+				 unsigned long min_kb)
+{
+	if (psi->num_objs > 0) {
+		if (((psi->active_objs * pkc->size) / 1024) >= min_kb) {
+			oom_debug_slab_print(psi, pkc);
+			return true;
+		}
+	}
+	return false;
+}
+
+void slab_common_oom_debug_dump_unreclaimable(unsigned long minkb)
+{
+	struct kmem_cache *s, *s2;
+	struct slabinfo sinfo;
+	u32 slabs_examined = 0;
+	u32 slabs_printed = 0;
+
+	/*
+	 * Here acquiring slab_mutex is risky since we don't prefer to get
+	 * sleep in oom path. But, without mutex hold, it may introduce a
+	 * risk of crash.
+	 * Use mutex_trylock to protect the list traverse, dump nothing
+	 * without acquiring the mutex.
+	 */
+	if (!mutex_trylock(&slab_mutex)) {
+		pr_warn("excessive unreclaimable slab but cannot dump stats\n");
+		return;
+	}
+
+	oom_debug_slab_header_print();
+
+	list_for_each_entry_safe(s, s2, &slab_caches, list) {
+		if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
+			continue;
+
+		get_slabinfo(s, &sinfo);
+
+		++slabs_examined;
+
+		if (oom_debug_slab_check(&sinfo, s, minkb))
+			++slabs_printed;
+	}
+	mutex_unlock(&slab_mutex);
+
+	pr_info("Summary: Slab entries examined:%u printed:%u minsize:%lukB\n",
+		slabs_examined, slabs_printed, minkb);
+}
+#endif  /* CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT */
+
 #if defined(CONFIG_MEMCG)
 void *memcg_slab_start(struct seq_file *m, loff_t *pos)
 {
-- 
2.20.1



* [PATCH 06/10] mm/oom_debug: Add Select Vmalloc Entries Print
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (4 preceding siblings ...)
  2019-08-26 19:36 ` [PATCH 05/10] mm/oom_debug: Add Select Slabs Print Edward Chron
@ 2019-08-26 19:36 ` Edward Chron
  2019-08-26 19:36 ` [PATCH 07/10] mm/oom_debug: Add Select Process " Edward Chron
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

Add OOM Debug code to allow select vmalloc entries to be printed at
the time of an OOM event. Listing some portion of the larger vmalloc
entries has proven useful in tracking memory usage during an OOM event
so the root cause of the event can be determined.

Configuring this OOM Debug Option (DEBUG_OOM_VMALLOC_SELECT_PRINT)
------------------------------------------------------------------
To configure this option it needs to be selected in the OOM Debugging
configure menu. The kernel configuration entry can be found in the
config at: Kernel hacking, Memory Debugging, OOM Debugging with the
DEBUG_OOM_VMALLOC_SELECT_PRINT config entry that configures this option.

Two dynamic OOM debug settings for this option: enable, tenthpercent
--------------------------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs file name prefix for this option is: vmalloc_select_print_
and, as for all select options, there are two debugfs files: the enabled
file and the tenthpercent file.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found
at: /sys/kernel/debug/oom/vmalloc_select_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Specifying the minimum entry size (0-1000) in the tenthpercent file
-------------------------------------------------------------------
Also for DEBUG_OOM_VMALLOC_SELECT_PRINT the number of vmalloc entries
printed can be adjusted. By default if the DEBUG_OOM_VMALLOC_SELECT_PRINT
config option is enabled only entries that use 1% or more of memory are
printed. This can be adjusted to be entries as small as 0% of memory
or as large as 100% of memory in which case only a summary line is
printed, as no vmalloc entry could possibly use 100% of memory.
Adjustments are made through the debugfs file found at:
/sys/kernel/debug/oom/vmalloc_select_print_tenthpercent
Valid values are 0 through 1000, in tenths of a percent, which represent
memory usage of 0% of memory to 100% of memory. Only entries that are
using at least one page of memory are printed even if the minimum entry
size is specified as 0, because zero page entries have no memory assigned.
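
The selection rule for a vmalloc entry can be sketched in user space as
follows. The names are hypothetical, and 4 KiB pages are an assumption
made for the example; the test itself follows the
(nr_pages > 0) && (K(nr_pages) >= min_kb) check in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define PAGES_TO_KB(x) ((unsigned long)(x) << (EXAMPLE_PAGE_SHIFT - 10))

/* An entry is selected only if it has pages and meets the minimum size. */
static bool vmalloc_entry_selected(unsigned int nr_pages, unsigned long min_kb)
{
	return nr_pages > 0 && PAGES_TO_KB(nr_pages) >= min_kb;
}

/* Count how many entries in an array of page counts would be printed. */
static size_t vmalloc_count_selected(const unsigned int *nr_pages, size_t n,
				     unsigned long min_kb)
{
	size_t i, selected = 0;

	assert(nr_pages != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (vmalloc_entry_selected(nr_pages[i], min_kb))
			selected++;
	return selected;
}
```

Under the 4 KiB page assumption a 640 page entry is 2560 kB, so it is
selected for any minimum up to 2560 kB, while a zero page entry is
skipped even when the minimum is 0.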

Content of Vmalloc entry records and Vmalloc summary record
-----------------------------------------------------------
The output is vmalloc entry information output limited such that only
entries equal to or larger than the minimum size are printed.
Unused vmallocs (no pages assigned to the vmalloc) are never printed.
The vmalloc entry information includes:
  - Size (in bytes)
  - pages (Number pages in use)
  - Caller Information to identify the request

Additional output consists of summary information that is printed
at the end of the output. This summary information includes:
  - Number of Vmalloc entries examined
  - Number of Vmalloc entries printed
  - minimum entry size for selection

Sample Output
-------------
Output produced consists of a section header line, one line of output
for each vmalloc entry that is equal to or larger than the minimum entry
size specified by the tenthpercent setting (0% to 100.0%), and one line
of summary output.

Sample Vmalloc entries section header:

Aug 19 19:27:01 coronado kernel: Vmalloc Info:

Sample per entry selected print line output:

Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640
 caller=__do_sys_swapon+0x78e/0x1130

Sample summary print line output:

Jul 22 19:03:26 yoursystem kernel: Summary: Vmalloc entries examined:1070
 printed:989 minsize:0kB


Signed-off-by: Edward Chron <echron@arista.com>
---
 include/linux/vmalloc.h | 12 ++++++++++++
 mm/Kconfig.debug        | 28 +++++++++++++++++++++++++++
 mm/oom_kill_debug.c     | 21 ++++++++++++++++++++
 mm/vmalloc.c            | 43 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 9b21d0047710..09e3257fc382 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -227,4 +227,16 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 int register_vmap_purge_notifier(struct notifier_block *nb);
 int unregister_vmap_purge_notifier(struct notifier_block *nb);
 
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+/**
+ * Routine used to print select vmalloc entries on an OOM event so we
+ * can identify sizeable entries that may have a significant effect on
+ * kernel memory utilization. Output goes to dmesg along with all the OOM
+ * related messages when the config option DEBUG_OOM_VMALLOC_SELECT_PRINT
+ * is set to yes. The option may be dynamically enabled or disabled and
+ * the selection size is also dynamically configurable.
+ */
+extern void vmallocinfo_oom_print(unsigned long min_kb);
+#endif /* CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT */
+
 #endif /* _LINUX_VMALLOC_H */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index c7d53ca95d32..ea3465343286 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -219,3 +219,31 @@ config DEBUG_OOM_SLAB_SELECT_PRINT
 	  print limit value of 10 or 1% of memory.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_VMALLOC_SELECT_PRINT
+	bool "Debug OOM Select Vmallocs Print"
+	depends on DEBUG_OOM
+	help
+	  When enabled, allows the number of vmalloc entries printed
+	  to be print rate limited based on the amount of memory the
+	  vmalloc entry is consuming.
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/vmalloc_select_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  When enabled entries are print limited by the amount of memory
+	  they consume. The setting value defines the minimum memory
+	  size consumed and is represented in tenths of a percent.
+	  Values supported are 0 to 1000 where 0 allows all entries to be
+	  printed, 1 would allow entries using 0.1% or more to be printed,
+	  10 would allow entries using 1% or more of memory to be printed.
+
+	  If configured and enabled the rate limiting memory percentage
+	  is specified by setting a value in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/vmalloc_select_print_tenthpercent
+	  If configured the default settings are set to enabled and
+	  print limit value of 10 or 1% of memory.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 2b5245e1134d..d5e37f8508e6 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -168,6 +168,9 @@
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
 #include "slab.h"
 #endif
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+#include <linux/vmalloc.h>
+#endif
 
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
@@ -223,6 +226,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "slab_select_print_",
 		.support_tpercent = true,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+	{
+		.option_name	= "vmalloc_select_print_",
+		.support_tpercent = true,
+	},
 #endif
 	{}
 };
@@ -243,6 +252,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
 	SELECT_SLABS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+	SELECT_VMALLOC_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -431,6 +443,15 @@ u32 oom_kill_debug_oom_event_is(void)
 		neightbl_print_stats("nd_tbl", &nd_tbl);
 #endif
 
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+	if (oom_kill_debug_enabled(SELECT_VMALLOC_STATE)) {
+		u16 ptenth = oom_kill_debug_tenthpercent(SELECT_VMALLOC_STATE);
+		unsigned long minkb = (K(totalram_pages()) * ptenth) / 1000;
+
+		vmallocinfo_oom_print(minkb);
+	}
+#endif
+
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 	if (oom_kill_debug_enabled(TASKS_STATE))
 		oom_kill_debug_tasks_summary_print();
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7ba11e12a11f..2cdc0f0cd0af 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3523,4 +3523,47 @@ static int __init proc_vmalloc_init(void)
 }
 module_init(proc_vmalloc_init);
 
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+#define K(x) ((x) << (PAGE_SHIFT-10))
+/*
+ * Routine used to print select vmalloc entries on an OOM condition so
+ * we can identify sizeable entries that may have a significant effect on
+ * kernel memory utilization. Output goes to dmesg along with all the OOM
+ * related messages when the config option DEBUG_OOM_VMALLOC_SELECT_PRINT
+ * is set to yes. Both enable / disable and size selection value are
+ * dynamically configurable.
+ */
+void vmallocinfo_oom_print(unsigned long min_kb)
+{
+	struct vmap_area *vap;
+	struct vm_struct *vsp;
+	u_int32_t entries = 0;
+	u_int32_t printed = 0;
+
+	if (!spin_trylock(&vmap_area_lock)) {
+		pr_info("Vmalloc Info: Skipped, vmap_area_lock not available\n");
+		return;
+	}
+
+	pr_info("Vmalloc Info:\n");
+	list_for_each_entry(vap, &vmap_area_list, list) {
+		if (!(vap->flags & VM_VM_AREA))
+			continue;
+		++entries;
+		vsp = vap->vm;
+		if ((vsp->nr_pages > 0) && (K(vsp->nr_pages) >= min_kb)) {
+			pr_info("vmalloc size=%lu pages=%u caller=%pS\n",
+				vsp->size, vsp->nr_pages, vsp->caller);
+			++printed;
+		}
+	}
+
+	spin_unlock(&vmap_area_lock);
+
+	pr_info("Summary: Vmalloc entries examined:%u printed:%u minsize:%lukB\n",
+		entries, printed, min_kb);
+}
+EXPORT_SYMBOL(vmallocinfo_oom_print);
+#endif /* CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT */
+
 #endif
-- 
2.20.1



* [PATCH 07/10] mm/oom_debug: Add Select Process Entries Print
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (5 preceding siblings ...)
  2019-08-26 19:36 ` [PATCH 06/10] mm/oom_debug: Add Select Vmalloc Entries Print Edward Chron
@ 2019-08-26 19:36 ` Edward Chron
  2019-08-26 19:36 ` [PATCH 08/10] mm/oom_debug: Add Slab Select Always Print Enable Edward Chron
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

Add OOM Debug code to selectively print an entry for each user
process that was considered for OOM selection at the time of an
OOM event. Limiting the processes to print is done by specifying a
minimum amount of memory that must be used to be eligible to be
printed.

Note: memory usage for oom processes includes RAM memory as well
as swap space. The value totalpages is actually used as the memory
size for determining percentage of "memory" used.

Configuring this OOM Debug Option (DEBUG_OOM_PROCESS_SELECT_PRINT)
------------------------------------------------------------------
To configure this option it needs to be selected in the OOM
Debugging configure menu. The kernel configuration entry for this
option can be found in the config file at: Kernel hacking, Memory
Debugging, OOM Debugging, as the DEBUG_OOM_PROCESS_SELECT_PRINT config
entry.

Two dynamic OOM debug settings for this option: enable, tenthpercent
--------------------------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs file name prefix for this option is: process_select_print_
and, as for all select options, there are two debugfs files: the enabled
file and the tenthpercent file.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found at:
/sys/kernel/debug/oom/process_select_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Specifying the minimum entry size (0-1000) in the tenthpercent file
-------------------------------------------------------------------
For DEBUG_OOM_PROCESS_SELECT_PRINT the processes printed can be limited
by specifying the minimum percentage of memory usage to be eligible to
be printed. By default if the DEBUG_OOM_PROCESS_SELECT_PRINT config option
is enabled only OOM considered processes that use 1% or more of memory are
printed. This can be adjusted to be entries as small as 0.1% of memory
or as large as 100% of memory in which case only a summary line is
printed, as no process could possibly consume 100% of the memory.
Adjustments are made through the debugfs file found at:
/sys/kernel/debug/oom/process_select_print_tenthpercent
Valid values are 1 through 1000, which represent memory usage of 0.1%
to 100% of totalpages. A value of zero is also valid; when specified,
an entry is printed for all OOM considered processes even if they use
essentially no memory.
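
The per-process eligibility check can be sketched in user space as
below. The struct and function names are hypothetical and 4 KiB pages
are an assumption; the sum of rss pages, swap entries and page table
bytes converted to pages follows the test added to dump_task() in this
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the per-task counters read in dump_task(). */
struct task_mem_sample {
	unsigned long rss_pages;
	unsigned long swap_pages;
	unsigned long pgtables_bytes;
};

/* A minimum of 0 pages selects every task, whatever its usage. */
static bool oom_task_selected(const struct task_mem_sample *t,
			      unsigned long min_pages)
{
	unsigned long used_pages;

	assert(t != NULL);
	if (min_pages == 0)
		return true;
	used_pages = t->rss_pages + t->swap_pages +
		     t->pgtables_bytes / EXAMPLE_PAGE_SIZE;
	return used_pages >= min_pages;
}
```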

Sample Output
-------------
Output produced consists of one line of standard Linux OOM process
entry output for each process that is equal to or larger than the
minimum entry size specified by the tenthpercent setting
(0% to 100.0%) followed by one line of summary output.

Summary print line output (minsize = 0.1% of totalpages):

Aug 13 20:16:30 yourserver kernel: Summary: OOM Tasks considered:245
 printed:33 minimum size:32576kB total-pages:32579084kB
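
The minimum size in the sample summary line can be reproduced with a
short user-space sketch. The function name is hypothetical and 4 KiB
pages are an assumption; the arithmetic follows
oom_kill_debug_min_task_pages() and the K() macro in this patch.

```c
#include <assert.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define PAGES_TO_KB(x) ((unsigned long)(x) << (EXAMPLE_PAGE_SHIFT - 10))

/* Minimum task size in pages for a tenthpercent setting of 0..1000. */
static unsigned long oom_min_task_pages(unsigned long totalpages,
					unsigned int tenthpercent)
{
	assert(tenthpercent <= 1000);
	return (totalpages * tenthpercent) / 1000;
}
```

Under that assumption, 32579084 kB of totalpages is 8144771 pages. A
setting of 1 (0.1%) gives 8144 pages after integer division, which is
32576 kB, the minimum size shown in the sample line.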


Signed-off-by: Edward Chron <echron@arista.com>
---
 include/linux/oom.h |  1 +
 mm/Kconfig.debug    | 34 ++++++++++++++++++++++++++++++++++
 mm/oom_kill.c       | 39 +++++++++++++++++++++++++++++++--------
 mm/oom_kill_debug.c | 22 ++++++++++++++++++++++
 mm/oom_kill_debug.h |  3 +++
 5 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c265f019..f37af4772452 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,6 +49,7 @@ struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	unsigned long minpgs;
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index ea3465343286..0c5feb0e15a9 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -247,3 +247,37 @@ config DEBUG_OOM_VMALLOC_SELECT_PRINT
 	  print limit value of 10 or 1% of memory.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_PROCESS_SELECT_PRINT
+	bool "Debug OOM Select Process Print"
+	depends on DEBUG_OOM
+	help
+	  When enabled, allows OOM considered process OOM information
+	  to be print rate limited based on the amount of memory the
+	  considered process is consuming. The number of processes that
+	  were considered for OOM selection, the number of processes
+	  that were actually printed and the minimum memory usage
+	  percentage that was used to select to which processes are
+	  printed is printed in a summary line after printing the
+	  selected tasks.
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/process_select_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  When enabled entries are print limited by the amount of memory
+	  they consume. The setting value defines the minimum memory
+	  size consumed and is represented in tenths of a percent.
+	  Values supported are 0 to 1000 where 0 allows all OOM considered
+	  processes to be printed, 1 would allow entries using 0.1% or
+	  more to be printed, 10 would allow entries using 1% or more of
+	  memory to be printed.
+
+	  If configured and enabled the rate limiting OOM process selection
+	  is specified by setting a value in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/process_select_print_tenthpercent
+	  If configured the default settings are set to enabled and
+	  print limit value of 10 or 1% of memory.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9022297fa2ba..4b37318dce4f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -380,6 +380,7 @@ static void select_bad_process(struct oom_control *oc)
 
 static int dump_task(struct task_struct *p, void *arg)
 {
+	unsigned long rsspgs, swappgs, pgtbl;
 	struct oom_control *oc = arg;
 	struct task_struct *task;
 
@@ -400,17 +401,29 @@ static int dump_task(struct task_struct *p, void *arg)
 		return 0;
 	}
 
+	rsspgs = get_mm_rss(task->mm);
+	swappgs = get_mm_counter(task->mm, MM_SWAPENTS);
+	pgtbl = mm_pgtables_bytes(task->mm);
+
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+	if ((oc->minpgs > 0) &&
+	    ((rsspgs + swappgs + pgtbl / PAGE_SIZE) < oc->minpgs)) {
+		task_unlock(task);
+		return 0;
+	}
+#endif
+
 	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
 		task->pid, from_kuid(&init_user_ns, task_uid(task)),
-		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
-		mm_pgtables_bytes(task->mm),
-		get_mm_counter(task->mm, MM_SWAPENTS),
+		task->tgid, task->mm->total_vm, rsspgs, pgtbl, swappgs,
 		task->signal->oom_score_adj, task->comm);
 	task_unlock(task);
 
-	return 0;
+	return 1;
 }
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
 /**
  * dump_tasks - dump current memory state of all system tasks
  * @oc: pointer to struct oom_control
@@ -423,19 +436,31 @@ static int dump_task(struct task_struct *p, void *arg)
  */
 static void dump_tasks(struct oom_control *oc)
 {
+	u32 total = 0;
+	u32 prted = 0;
+
 	pr_info("Tasks state (memory values in pages):\n");
 	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
 
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+	oc->minpgs = oom_kill_debug_min_task_pages(oc->totalpages);
+#endif
+
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
 	else {
 		struct task_struct *p;
 
 		rcu_read_lock();
-		for_each_process(p)
-			dump_task(p, oc);
+		for_each_process(p) {
+			++total;
+			prted += dump_task(p, oc);
+		}
 		rcu_read_unlock();
 	}
+
+	pr_info("Summary: OOM Tasks considered:%u printed:%u minimum size:%lukB totalpages:%lukB\n",
+		total, prted, K(oc->minpgs), K(oc->totalpages));
 }
 
 static void oom_kill_unreclaimable_slabs_print(void)
@@ -492,8 +517,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 static bool oom_killer_disabled __read_mostly;
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
 /*
  * task->mm can be NULL if the task is the exited group leader.  So to
  * determine whether the task is using a particular mm, we examine all the
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index d5e37f8508e6..66b745039771 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -232,6 +232,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "vmalloc_select_print_",
 		.support_tpercent = true,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+	{
+		.option_name	= "process_select_print_",
+		.support_tpercent = true,
+	},
 #endif
 	{}
 };
@@ -255,6 +261,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
 	SELECT_VMALLOC_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+	SELECT_PROCESS_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -415,6 +424,19 @@ bool oom_kill_debug_unreclaimable_slabs_print(void)
 }
 #endif /* CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT */
 
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages)
+{
+	u16 pct;
+
+	if (!oom_kill_debug_enabled(SELECT_PROCESS_STATE))
+		return 0;
+
+	pct = oom_kill_debug_tenthpercent(SELECT_PROCESS_STATE);
+	return (totalpages * pct) / 1000;
+}
+#endif /* CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT */
+
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 static void oom_kill_debug_tasks_summary_print(void)
 {
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index 549b8da179d0..7eec861a0009 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,9 @@
 #ifndef __MM_OOM_KILL_DEBUG_H__
 #define __MM_OOM_KILL_DEBUG_H__
 
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+extern unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages);
+#endif
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
 extern bool oom_kill_debug_unreclaimable_slabs_print(void);
 #endif
-- 
2.20.1



* [PATCH 08/10] mm/oom_debug: Add Slab Select Always Print Enable
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (6 preceding siblings ...)
  2019-08-26 19:36 ` [PATCH 07/10] mm/oom_debug: Add Select Process " Edward Chron
@ 2019-08-26 19:36 ` Edward Chron
  2019-08-26 19:36 ` [PATCH 09/10] mm/oom_debug: Add Enhanced Slab Print Information Edward Chron
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

Config option to always enable slab printing. This option will enable
slab entries to be printed even when slab memory usage does not
exceed the standard Linux user memory usage print trigger. The Standard
OOM event Slab entry print trigger is that slab memory usage exceeds user
memory usage. This covers cases where the Kernel or Kernel drivers are
driving slab memory usage up causing it to be excessive. However, OOM
Events are often caused by user processes causing too much memory usage.
In some cases where the user memory usage is quite high the amount of
slab memory consumed can still be an important factor in determining what
caused the OOM event. In such cases it would be useful to have slab
memory usage for any slab entries using a significant amount of memory.

Configuring Slab Select Always Print Enable
-------------------------------------------
This option is configured with: DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
OOM Debug options include slab entry print limiting with the
DEBUG_OOM_SLAB_SELECT_PRINT option. This allows only entries of a
minimum size to be printed, to prevent a large number of entries from
being printed. However, the standard OOM event slab entry print trigger
prevents any entries from being printed if slab memory usage does not
exceed user memory usage. Enabling the DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
option allows the trigger to be overridden.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs name for this option is: slab_select_always_print_
and the file for this option is the enabled file.

The option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this option is found at:
/sys/kernel/debug/oom/slab_select_always_print_enabled
The value of the enabled file determines whether the facility is enabled
or disabled. A value of 1 is enabled (default) and a value of 0 is
disabled.
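
As a rough illustration, a user-space tool could toggle the option by
writing to the enabled file. The helper name oom_debug_set_enabled() is
hypothetical and not part of this series; the sketch assumes debugfs is
mounted at /sys/kernel/debug and that the caller has permission to write
the file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "1" or "0" to an OOM debug enabled file.
 * Returns 0 on success and -1 on failure.
 */
int oom_debug_set_enabled(const char *path, int enabled)
{
	FILE *fp;
	int ok;

	assert(path != NULL);

	fp = fopen(path, "w");
	if (fp == NULL)
		return -1;

	/* fputs() returns a non-negative value on success. */
	ok = fputs(enabled ? "1\n" : "0\n", fp) >= 0;

	/* A failed close can hide a failed write, so check it too. */
	if (fclose(fp) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, calling oom_debug_set_enabled() with the path
/sys/kernel/debug/oom/slab_select_always_print_enabled and a value of 0
would be expected to disable the option until it is re-enabled.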

Sample Output
-------------
There is no change to the standard OOM output with this option other than
that the standard Linux OOM report "Unreclaimable slab info" is output for
every OOM event, not just OOM events where slab usage exceeds user process
memory usage.


Signed-off-by: Edward Chron <echron@arista.com>
---
 mm/Kconfig.debug    | 24 ++++++++++++++++++++++++
 mm/oom_kill.c       |  4 ++++
 mm/oom_kill_debug.c | 16 ++++++++++++++++
 mm/oom_kill_debug.h |  3 +++
 4 files changed, 47 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 0c5feb0e15a9..68873e26afe1 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -220,6 +220,30 @@ config DEBUG_OOM_SLAB_SELECT_PRINT
 
 	  If unsure, say N.
 
+config DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
+	bool "Debug OOM Slabs Select Always Print Enable"
+	depends on DEBUG_OOM_SLAB_SELECT_PRINT
+	help
+	  When enabled the option allows Slab entries using the minimum
+	  memory size specified by the DEBUG_OOM_SLAB_SELECT_PRINT option
+	  to be printed even if the amount of Slab Memory in use does not
+	  exceed the amount of user memory in use. This essentially
+	  overrides the standard OOM Slab entry print trigger. This is
+	  useful when trying to determine all of the factors that
+	  contributed to an OOM event even when user memory usage was
+	  most likely the most significant contributor. If Slab usage was
+	  higher than normal this could contribute to the OOM event. The
+	  DEBUG_OOM_SLAB_SELECT_PRINT allows entry sizes of 0% to 100%
+	  where 0% prints all the entries that the standard trigger prints
+	  (any slabs using even 1 slab entry).
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/slab_select_always_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
+
 config DEBUG_OOM_VMALLOC_SELECT_PRINT
 	bool "Debug OOM Select Vmallocs Print"
 	depends on DEBUG_OOM
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b37318dce4f..cbea289c6345 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,6 +184,10 @@ static bool is_dump_unreclaim_slabs(void)
 		 global_node_page_state(NR_ISOLATED_FILE) +
 		 global_node_page_state(NR_UNEVICTABLE);
 
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
+	if (oom_kill_debug_select_slabs_always_print_enabled())
+		return true;
+#endif
 	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
 }
 
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 66b745039771..13f1d1c25a67 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -238,6 +238,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "process_select_print_",
 		.support_tpercent = true,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
+	{
+		.option_name	= "slab_select_always_print_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -264,6 +270,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
 	SELECT_PROCESS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
+	SLAB_ALWAYS_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -335,6 +344,13 @@ u32 oom_kill_debug_oom_event(void)
 	return oom_kill_debug_oom_events;
 }
 
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
+bool oom_kill_debug_select_slabs_always_print_enabled(void)
+{
+	return oom_kill_debug_enabled(SLAB_ALWAYS_STATE);
+}
+#endif
+
 #ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
 /*
  * oom_kill_debug_system_summary_prt - provides one line of output to document
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index 7eec861a0009..bce740573063 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -15,6 +15,9 @@ extern unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages);
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
 extern bool oom_kill_debug_unreclaimable_slabs_print(void);
 #endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
+extern bool oom_kill_debug_select_slabs_always_print_enabled(void);
+#endif
 
 extern u32 oom_kill_debug_oom_event_is(void);
 extern u32 oom_kill_debug_event(void);
-- 
2.20.1



* [PATCH 09/10] mm/oom_debug: Add Enhanced Slab Print Information
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (7 preceding siblings ...)
  2019-08-26 19:36 ` [PATCH 08/10] mm/oom_debug: Add Slab Select Always Print Enable Edward Chron
@ 2019-08-26 19:36 ` Edward Chron
  2019-08-26 19:36 ` [PATCH 10/10] mm/oom_debug: Add Enhanced Process " Edward Chron
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron

Add OOM Debug code that prints additional detailed information about each
slab entry that has been selected for printing. The extra information
is helpful for root cause identification and problem analysis.

Configuring Enhanced Slab Print Information
-------------------------------------------
The Kernel configuration option that defines this option is
DEBUG_OOM_ENHANCED_SLAB_PRINT. This additional code is dependent on the
OOM Debug option DEBUG_OOM_SLAB_SELECT_PRINT which adds code to allow
slab entries to be selectively printed, only printing slabs that use a
specified minimum amount of memory.

The kernel configuration entry for this option can be found in the
config file at: Kernel hacking, Memory Debugging, Debug OOM,
Debug OOM Enhanced Slab Print.
Both the DEBUG_OOM_SLAB_SELECT_PRINT and DEBUG_OOM_ENHANCED_SLAB_PRINT
entries must be selected.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found at:
/sys/kernel/debug/oom/slab_enhanced_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Content and format of slab entry messages
-----------------------------------------
In addition to the Used and Total space (in KiB) fields that are
displayed by the standard Linux OOM slab reporting code the enhanced
entries include: active objects, total objects, object and align size
(both in bytes), objects per slab, pages per slab, active slabs,
total slabs and the slab name (located at the end, easier to read).
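
The relationship between these fields can be sketched in user space as
follows. The struct slab_sample and the helper names are hypothetical
stand-ins for the values the patch reads from struct slabinfo and
struct kmem_cache; the arithmetic and the format string follow
oom_debug_slab_enhanced_print() in this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical subset of the per-cache values used by the report. */
struct slab_sample {
	unsigned long active_objs;	/* ActiveObj */
	unsigned long num_objs;		/* TotalObj */
	unsigned int object_size;	/* ObjSize, in bytes */
	unsigned int size;		/* AlignSize, in bytes */
	unsigned int objects_per_slab;	/* Objs/Slab */
	unsigned int cache_order;	/* Pgs/Slab is 1 << cache_order */
	unsigned long active_slabs;	/* ActiveSlab */
	unsigned long num_slabs;	/* TotalSlab */
	const char *name;		/* Slab_Name */
};

/* UsedKiB: space taken by active objects at their aligned size. */
unsigned long slab_used_kib(const struct slab_sample *s)
{
	assert(s != NULL);
	return (s->active_objs * s->size) / 1024;
}

/* TotalKiB: space taken by all objects at their aligned size. */
unsigned long slab_total_kib(const struct slab_sample *s)
{
	assert(s != NULL);
	return (s->num_objs * s->size) / 1024;
}

/* Format one enhanced entry; returns the result of snprintf(). */
int slab_format_line(const struct slab_sample *s, char *buf, size_t len)
{
	assert(s != NULL && buf != NULL);
	return snprintf(buf, len,
		"%10lu %10lu %10lu %10lu %9u %9u %9u %8u %10lu %10lu %s",
		slab_used_kib(s), slab_total_kib(s), s->active_objs,
		s->num_objs, s->object_size, s->size, s->objects_per_slab,
		1u << s->cache_order, s->active_slabs, s->num_slabs, s->name);
}
```

With 1613 active objects, 1648 total objects and an aligned size of 256
bytes, the two helpers yield 403 and 412 KiB, matching the sample entry
in this message.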

Sample Output
-------------
Sample oom report message header and output slab entry message:

Aug 13 18:52:47 mysrvr kernel:   UsedKiB   TotalKiB  ActiveObj   TotalObj
 ObjSize AlignSize Objs/Slab Pgs/Slab ActiveSlab  TotalSlab Slab_Name

Aug 13 18:52:47 mysrvr kernel:       403        412       1613       1648
     224       256        16        1        103        103 skbuff_head..

Signed-off-by: Edward Chron <echron@arista.com>
---
 mm/Kconfig.debug    | 15 +++++++++++++++
 mm/oom_kill_debug.c | 15 +++++++++++++++
 mm/oom_kill_debug.h |  3 +++
 mm/slab_common.c    | 29 +++++++++++++++++++++++++++++
 4 files changed, 62 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 68873e26afe1..4414e46f72c6 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -244,6 +244,21 @@ config DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
 
 	  If unsure, say N.
 
+config DEBUG_OOM_ENHANCED_SLAB_PRINT
+	bool "Debug OOM Enhanced Slab Print"
+	depends on DEBUG_OOM_SLAB_SELECT_PRINT
+	help
+	  Each OOM slab entry printed includes additional information
+	  about its memory usage, such as object counts and sizes, in
+	  addition to the used and total space in KiB.
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/slab_enhanced_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
+
 config DEBUG_OOM_VMALLOC_SELECT_PRINT
 	bool "Debug OOM Select Vmallocs Print"
 	depends on DEBUG_OOM
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 13f1d1c25a67..ad937b3d59f3 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -244,6 +244,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "slab_select_always_print_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+	{
+		.option_name	= "slab_enhanced_print_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -273,6 +279,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
 	SLAB_ALWAYS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+	ENHANCED_SLAB_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -350,6 +359,12 @@ bool oom_kill_debug_select_slabs_always_print_enabled(void)
 	return oom_kill_debug_enabled(SLAB_ALWAYS_STATE);
 }
 #endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+bool oom_kill_debug_enhanced_slab_print_information_enabled(void)
+{
+	return oom_kill_debug_enabled(ENHANCED_SLAB_STATE);
+}
+#endif
 
 #ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
 /*
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index bce740573063..a39bc275980e 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -18,6 +18,9 @@ extern bool oom_kill_debug_unreclaimable_slabs_print(void);
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
 extern bool oom_kill_debug_select_slabs_always_print_enabled(void);
 #endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+extern bool oom_kill_debug_enhanced_slab_print_information_enabled(void);
+#endif
 
 extern u32 oom_kill_debug_oom_event_is(void);
 extern u32 oom_kill_debug_event(void);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9ddc95040b60..c6e17e5c6c9d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -28,6 +28,10 @@
 
 #include "slab.h"
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+#include "oom_kill_debug.h"
+#endif
+
 enum slab_state slab_state;
 LIST_HEAD(slab_caches);
 DEFINE_MUTEX(slab_mutex);
@@ -1450,15 +1454,40 @@ void dump_unreclaimable_slab(void)
 	mutex_unlock(&slab_mutex);
 }
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+static void oom_debug_slab_enhanced_print(struct slabinfo *psi,
+					  struct kmem_cache *pkc)
+{
+	pr_info("%10lu %10lu %10lu %10lu %9u %9u %9u %8u %10lu %10lu %s\n",
+		(psi->active_objs * pkc->size) / 1024,
+		(psi->num_objs * pkc->size) / 1024, psi->active_objs,
+		psi->num_objs, pkc->object_size, pkc->size,
+		psi->objects_per_slab, (1 << psi->cache_order),
+		psi->active_slabs, psi->num_slabs, cache_name(pkc));
+}
+#endif
+
 #ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
 static void oom_debug_slab_header_print(void)
 {
 	pr_info("Unreclaimable slab info:\n");
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+	if (oom_kill_debug_enhanced_slab_print_information_enabled()) {
+		pr_info("   UsedKiB   TotalKiB  ActiveObj   TotalObj   ObjSize AlignSize Objs/Slab Pgs/Slab ActiveSlab  TotalSlab Slab_Name\n");
+		return;
+	}
+#endif
 	pr_info("Name                      Used          Total\n");
 }
 
 static void oom_debug_slab_print(struct slabinfo *psi, struct kmem_cache *pkc)
 {
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+	if (oom_kill_debug_enhanced_slab_print_information_enabled()) {
+		oom_debug_slab_enhanced_print(psi, pkc);
+		return;
+	}
+#endif
 	pr_info("%-17s %10luKB %10luKB\n", cache_name(pkc),
 		(psi->active_objs * pkc->size) / 1024,
 		(psi->num_objs * pkc->size) / 1024);
-- 
2.20.1



* [PATCH 10/10] mm/oom_debug: Add Enhanced Process Print Information
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (8 preceding siblings ...)
  2019-08-26 19:36 ` [PATCH 09/10] mm/oom_debug: Add Enhanced Slab Print Information Edward Chron
@ 2019-08-26 19:36 ` Edward Chron
  2019-08-28  0:21   ` kbuild test robot
  2019-08-27  7:15 ` [PATCH 00/10] OOM Debug print selection and additional information Michal Hocko
  2019-08-27 12:40 ` Qian Cai
  11 siblings, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron, David S. Miller, netdev

Add OOM Debug code that prints additional detailed information about
user processes that were considered for OOM killing, for any processes
selected for printing. The information is displayed for each user
process that OOM prints in the output.

This supplemental per user process information is very helpful for
determining how process memory is used, allowing OOM event root cause
identification that might not otherwise be possible.

Configuring Enhanced Process Print Information
----------------------------------------------
DEBUG_OOM_ENHANCED_PROCESS_PRINT is the config entry defined for
this OOM Debug option.  This option is dependent on the OOM Debug
option DEBUG_OOM_PROCESS_SELECT_PRINT which adds code to allow processes
that are considered for OOM kill to be selectively printed, only
printing processes that use a specified minimum amount of memory.

The kernel configuration entry for this option can be found in the
config file at: Kernel hacking, Memory Debugging, Debug OOM,
Debug OOM Process Selection, Debug OOM Enhanced Process Print.
Both Debug OOM Process Selection and Debug OOM Enhanced Process Print
entries must be selected.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found at:
/sys/kernel/debug/oom/process_enhanced_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Content and format of process record and Task state headers
-----------------------------------------------------------
Each OOM process entry printed includes memory information about the
process. Memory usage is specified in KiB for memory values instead of
pages. Each entry includes the following fields:
pid, ppid, ruid, euid, tgid, State (S), utime and stime in seconds
(utimeSec, stimeSec), the oom_score_adjust (Adjust), task comm value
(name), and also memory values (all in KiB): VmemKiB, MaxRssKiB,
CurRssKiB, PteKiB, SwapKiB, socket buffer memory (SockKiB), LibKiB,
TextKiB, HeapKiB, StackKiB, FileKiB, shared memory (ShmemKiB), locked
memory (LockKiB) and pinned memory (PinnedKiB).
Counts of page reads (ReadPgs) and page faults (FaultPgs) are included.
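
The page to KiB conversion and the text/library split behind the TextKiB
and LibKiB columns can be sketched in user space as follows. This is an
illustration only: it assumes a 4 KiB page size (the real PAGE_SHIFT is
architecture specific), and the names EXAMPLE_PAGE_SHIFT, pages_to_kib()
and split_exec_kib() are hypothetical. The arithmetic mirrors the K()
macro and the calculation in dump_task_prt().

```c
#include <assert.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT depends on the arch. */
#define EXAMPLE_PAGE_SHIFT	12
#define EXAMPLE_PAGE_SIZE	(1UL << EXAMPLE_PAGE_SHIFT)
#define EXAMPLE_PAGE_MASK	(~(EXAMPLE_PAGE_SIZE - 1))

/* Same conversion as the kernel's K(x) macro: pages to KiB. */
unsigned long pages_to_kib(unsigned long pages)
{
	return pages << (EXAMPLE_PAGE_SHIFT - 10);
}

struct exec_kib {
	unsigned long text_kib;	/* TextKiB column */
	unsigned long lib_kib;	/* LibKiB column */
};

/*
 * Split executable mappings into program text and library text.
 * exec_vm_pages plays the role of mm->exec_vm.
 */
struct exec_kib split_exec_kib(unsigned long start_code,
			       unsigned long end_code,
			       unsigned long exec_vm_pages)
{
	unsigned long exec_bytes = exec_vm_pages << EXAMPLE_PAGE_SHIFT;
	unsigned long text;
	struct exec_kib out;

	assert(end_code >= start_code);

	/* Page-align the end upward and the start downward. */
	text = ((end_code + EXAMPLE_PAGE_SIZE - 1) & EXAMPLE_PAGE_MASK) -
	       (start_code & EXAMPLE_PAGE_MASK);
	/* Text can never exceed the total executable mapping size. */
	if (text > exec_bytes)
		text = exec_bytes;

	out.text_kib = text >> 10;
	out.lib_kib = (exec_bytes - text) >> 10;
	return out;
}
```

Under the 4 KiB page assumption, an RSS of 263199 pages converts to
1052796 KiB, the CurRssKiB value shown in the sample output of this
message.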

Sample Output
-------------
OOM Process select print headers and line of process enhanced output:

Aug  6 09:37:21 egc103 kernel: Tasks state (memory values in KiB):
Aug  6 09:37:21 egc103 kernel: [  pid  ]    ppid    ruid    euid
    tgid S  utimeSec  stimeSec   VmemKiB MaxRssKiB CurRssKiB
    PteKiB   SwapKiB   SockKiB     LibKiB   TextKiB   HeapKiB
  StackKiB   FileKiB  ShmemKiB   ReadPgs  FaultPgs   LockKiB
 PinnedKiB Adjust name

Aug  6 09:37:21 egc103 kernel: [   7707]    7553   10383   10383
    7707 S     0.132     0.350   1056804   1054040   1052796
      2092         0         0       1944       684   1052860
       136         4         0         0         0         0
         0   1000 oomprocs


Signed-off-by: Edward Chron <echron@arista.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
 mm/Kconfig.debug    |  23 +++++
 mm/oom_kill.c       |  23 ++++-
 mm/oom_kill_debug.c | 236 ++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill_debug.h |   5 +
 4 files changed, 285 insertions(+), 2 deletions(-)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 4414e46f72c6..2bc843727968 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -320,3 +320,26 @@ config DEBUG_OOM_PROCESS_SELECT_PRINT
 	  print limit value of 10 or 1% of memory.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	bool "Debug OOM Enhanced Process Print"
+	depends on DEBUG_OOM_PROCESS_SELECT_PRINT
+	help
+	  Each OOM process entry printed includes memory information about
+	  the process. Memory usage is specified in KiB (KB) for memory
+	  values, not pages. Each entry includes the following fields:
+	  pid, ppid, ruid, euid, tgid, State (S), utime in seconds,
+	  stime in seconds, the number of read pages (ReadPgs), number of
+	  page faults (FaultPgs), the number of lock pages (LockPgs), the
+	  oom_score_adjust value (Adjust), memory percentage used (MemPct),
+	  oom_score (Score), task comm value (name), and also memory values
+	  (all in KB): VmemKiB, MaxRssKiB, CurRssKiB, PteKiB, SwapKiB,
+	  socket pages (SockKiB), LibKiB, TextPgKiB, HeapPgKiB, StackKiB,
+	  FileKiB and shared memory pages (ShmemKiB).
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/process_enhanced_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cbea289c6345..cf37caea9c5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,6 +417,13 @@ static int dump_task(struct task_struct *p, void *arg)
 	}
 #endif
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	if (oom_kill_debug_enhanced_process_print_enabled()) {
+		dump_task_prt(task, rsspgs, swappgs, pgtbl);
+		task_unlock(task);
+		return 1;
+	}
+#endif
 	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
 		task->pid, from_kuid(&init_user_ns, task_uid(task)),
 		task->tgid, task->mm->total_vm, rsspgs, pgtbl, swappgs,
@@ -426,6 +433,19 @@ static int dump_task(struct task_struct *p, void *arg)
 	return 1;
 }
 
+static void dump_tasks_headers(void)
+{
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	if (oom_kill_debug_enhanced_process_print_enabled()) {
+		pr_info("Tasks state (memory values in KiB):\n");
+		pr_info("[  pid  ]    ppid    ruid    euid    tgid S  utimeSec  stimeSec   VmemKiB MaxRssKiB CurRssKiB    PteKiB   SwapKiB   SockKiB     LibKiB   TextKiB   HeapKiB  StackKiB   FileKiB  ShmemKiB     ReadPgs    FaultPgs   LockKiB PinnedKiB Adjust name\n");
+		return;
+	}
+#endif
+	pr_info("Tasks state (memory values in pages):\n");
+	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
 /**
@@ -443,8 +463,7 @@ static void dump_tasks(struct oom_control *oc)
 	u32 total = 0;
 	u32 prted = 0;
 
-	pr_info("Tasks state (memory values in pages):\n");
-	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
+	dump_tasks_headers();
 
 #ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
 	oc->minpgs = oom_kill_debug_min_task_pages(oc->totalpages);
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index ad937b3d59f3..467f7add4397 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -171,6 +171,12 @@
 #ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
 #include <linux/vmalloc.h>
 #endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+#include <linux/fdtable.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <linux/sched/cputime.h>
+#endif
 
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
@@ -250,6 +256,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "slab_enhanced_print_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	{
+		.option_name	= "process_enhanced_print_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -282,6 +294,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
 	ENHANCED_SLAB_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	ENHANCED_PROCESS_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -365,6 +380,12 @@ bool oom_kill_debug_enhanced_slab_print_information_enabled(void)
 	return oom_kill_debug_enabled(ENHANCED_SLAB_STATE);
 }
 #endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+bool oom_kill_debug_enhanced_process_print_enabled(void)
+{
+	return oom_kill_debug_enabled(ENHANCED_PROCESS_STATE);
+}
+#endif
 
 #ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
 /*
@@ -513,6 +534,221 @@ u32 oom_kill_debug_oom_event_is(void)
 	return oom_kill_debug_oom_events;
 }
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+/*
+ *  Account for socket(s) buffer memory in use by a task.
+ *  A task may have one or more sockets consuming socket buffer space.
+ *  Account for how much socket space each task has in use.
+ */
+static unsigned long account_for_socket_buffers(struct task_struct *task,
+						char *incomplete)
+{
+	unsigned long sockpgs = 0;
+	struct files_struct *files = task->files;
+	struct fdtable *fdt;
+	struct file **fds;
+	int openfilecount;
+	struct inode *inode;
+	struct socket *sock;
+	struct sock *sk;
+	unsigned long bytes;
+	int fdtsize;
+	int i;
+
+	/* Just to make sure the fds don't get closed */
+	atomic_inc(&files->count);
+	/* Make a best effort, but no reason to get hung up here */
+	if (!spin_trylock(&files->file_lock)) {
+		*incomplete = '*';
+		atomic_dec(&files->count);
+		return 0;
+	}
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	fdtsize = fdt->max_fds;
+	/* Determine how many words we need to check for open files */
+	for (i = fdtsize / BITS_PER_LONG; i > 0; ) {
+		if (fdt->open_fds[--i])
+			break;
+	}
+	openfilecount = (i + 1) * BITS_PER_LONG;  // Check each fd in the word
+	fds = fdt->fd;
+	for (i = openfilecount; i != 0; i--) {
+		struct file *fp = *fds++;
+
+		if (fp) {
+			/* Any continue case doesn't need to be counted */
+			if (fp->f_path.dentry == NULL)
+				continue;
+			inode = fp->f_path.dentry->d_inode;
+			if (inode == NULL || !S_ISSOCK(inode->i_mode))
+				continue;
+			sock = fp->private_data;
+			if (sock == NULL)
+				continue;
+			sk = sock->sk;
+			if (sk == NULL)
+				continue;
+			bytes = roundup(sk->sk_rcvbuf, PAGE_SIZE);
+			sockpgs += bytes / PAGE_SIZE;
+			bytes = roundup(sk->sk_sndbuf, PAGE_SIZE);
+			sockpgs += bytes / PAGE_SIZE;
+		}
+	}
+	rcu_read_unlock();
+
+	spin_unlock(&files->file_lock);
+	/* We're done looking at the fds */
+	atomic_dec(&files->count);
+
+	return sockpgs;
+}
+
+static u64 power10(u32 index)
+{
+	static u64 pwr10[11] = {1, 10, 100, 1000, 10000, 100000, 1000000,
+				10000000, 100000000, 1000000000,
+				10000000000};
+
+	return pwr10[index];
+}
+
+static u32 num_digits(u64 num)
+{
+	u32 i;
+
+	for (i = 1; i < 11; ++i) {
+		if (power10(i) > num)
+			return i;
+	}
+	return i;
+}
+
+static void digits_and_fraction(u64 num, u32 *p_digits, u32 *p_frac, u32 chars)
+{
+	*p_digits = num_digits(num);
+	// Allow for decimal place for fractional output
+	if (chars - 1 > *p_digits)
+		*p_frac = chars - 1 - *p_digits;
+	else
+		*p_frac = 0;
+}
+
+#define MAX_NUM_FIELD_SIZE	10
+/*
+ * Format timespec into seconds and possibly fraction, must fit in 9 bytes.
+ * Linux kernel doesn't support floating point so format as best we can.
+ * With 9 digits, seconds cover 31.7 years, and where we can we provide
+ * fractions of a second up to milliseconds.
+ */
+static void timespec_format(u64 nsecs_time, char *p_time, size_t time_size)
+{
+	struct timespec64 tspec = ns_to_timespec64(nsecs_time);
+	u32 digits, fracs, bytes, min;
+	u64 fraction;
+
+	digits_and_fraction(tspec.tv_sec, &digits, &fracs, time_size);
+
+	bytes = sprintf(p_time, "%llu", tspec.tv_sec);
+
+	if (fracs > 0) {
+		u32 frsize = num_digits(tspec.tv_nsec);
+
+		p_time += bytes;
+		if (frsize >= 3) {
+			if (fracs >= 3)
+				min = frsize - 3;
+			else if (fracs >= 2)
+				min = frsize - 2;
+			else
+				min = frsize - 1;
+		} else if (frsize >= 2) {
+			if (fracs >= 2)
+				min = frsize - 2;
+			else
+				min = frsize - 1;
+		} else {
+			min = frsize - 1;
+		}
+		fraction = tspec.tv_nsec / power10(min);
+		sprintf(p_time, ".%llu", fraction);
+	}
+}
+
+/*
+ * Format utime, stime in seconds and possibly fractions, must fit in 9 bytes.
+ */
+static void time_format(struct task_struct *task, char *p_utime, char *p_stime)
+{
+	size_t num_size = MAX_NUM_FIELD_SIZE;
+	u64 utime, stime;
+
+	task_cputime_adjusted(task, &utime, &stime);
+	memset(p_utime, 0, num_size);
+	timespec_format(utime, p_utime, num_size - 1);
+	memset(p_stime, 0, num_size);
+	timespec_format(stime, p_stime, num_size - 1);
+}
+
+/* task_index_to_char kernel function is missing options so use this */
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+static const char task_to_char[] = TASK_STATE_TO_CHAR_STR;
+static char get_task_state(struct task_struct *p_task, ulong state)
+{
+	int bit = state ? __ffs(state) + 1 : 0;
+
+	if (p_task->tgid == 0)
+		return 'I';
+	return bit < sizeof(task_to_char) - 1 ? task_to_char[bit] : '?';
+}
+
+/*
+ * Code that prints the information about the specified task.
+ * Assumes task lock is held at entry.
+ */
+void dump_task_prt(struct task_struct *task,
+		   unsigned long rsspg, unsigned long swappg,
+		   unsigned long pgtbl)
+{
+	char c_utime[MAX_NUM_FIELD_SIZE], c_stime[MAX_NUM_FIELD_SIZE];
+	unsigned long vmkb, sockkb, text, maxrsspg, pgtblpg;
+	unsigned long libkb, textkb, pgtblkb;
+	struct mm_struct *mm;
+	char incomp = ' ';
+	kuid_t ruid, euid;
+	char tstate;
+
+	mm = task->mm;
+	maxrsspg = rsspg;
+	pgtblpg = pgtbl >> PAGE_SHIFT;
+	ruid = __task_cred(task)->uid;
+	euid = __task_cred(task)->euid;
+	vmkb = K(mm->total_vm);
+	if (maxrsspg < mm->hiwater_rss)
+		maxrsspg = mm->hiwater_rss;
+	sockkb = K(account_for_socket_buffers(task, &incomp));
+	text = (PAGE_ALIGN(mm->end_code) -
+		 (mm->start_code & PAGE_MASK));
+	text = min(text, mm->exec_vm << PAGE_SHIFT);
+	textkb = text >> 10;
+	libkb = ((mm->exec_vm << PAGE_SHIFT) - text) >> 10;
+	pgtblkb = pgtbl >> 10;
+	tstate = get_task_state(task, task->state);
+	time_format(task, c_utime, c_stime);
+
+	pr_info("[%7d] %7d %7d %7d %7d %c %9s %9s %9lu %9lu %9lu %9lu %9ld %9lu%c %9lu %9lu %9lu %9lu %9lu %9lu %11lu %11lu %9lu %9llu  %5hd %s\n",
+		task->pid, task_ppid_nr(task), ruid.val, euid.val, task->tgid,
+		tstate, c_utime, c_stime, vmkb, K(maxrsspg), K(rsspg), pgtblkb,
+		K(swappg), sockkb, incomp, libkb, textkb, K(mm->data_vm),
+		K(mm->stack_vm), K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)), task->signal->cmaj_flt,
+		task->signal->cmin_flt,
+		K(mm->locked_vm), K((u64)atomic64_read(&mm->pinned_vm)),
+		task->signal->oom_score_adj, task->comm);
+}
+#endif /* CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT */
+
 static void __init oom_debug_init(void)
 {
 	/* Ensure we have a debugfs oom root directory */
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index a39bc275980e..faebb4c6097c 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,11 @@
 #ifndef __MM_OOM_KILL_DEBUG_H__
 #define __MM_OOM_KILL_DEBUG_H__
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+extern bool oom_kill_debug_enhanced_process_print_enabled(void);
+extern void dump_task_prt(struct task_struct *task, unsigned long rsspg,
+			  unsigned long swappg, unsigned long pgtbl);
+#endif
 #ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
 extern unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages);
 #endif
-- 
2.20.1



* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (9 preceding siblings ...)
  2019-08-26 19:36 ` [PATCH 10/10] mm/oom_debug: Add Enhanced Process " Edward Chron
@ 2019-08-27  7:15 ` Michal Hocko
  2019-08-27 10:10   ` Tetsuo Handa
  2019-08-28  1:07   ` Edward Chron
  2019-08-27 12:40 ` Qian Cai
  11 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2019-08-27  7:15 UTC (permalink / raw)
  To: Edward Chron
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona

On Mon 26-08-19 12:36:28, Edward Chron wrote:
[...]
> Extensibility using OOM debug options
> -------------------------------------
> What is needed is an extensible system to optionally configure
> debug options as needed and to then dynamically enable and disable
> them. Also for options that produce multiple lines of entry based
> output, to configure which entries to print based on how much
> memory they use (or optionally all the entries).

With a patch series this large, adding a lot of new functionality, I
believe we need more detailed use cases described.

[...]

> Use of debugfs to allow dynamic controls
> ----------------------------------------
> By providing a debugfs interface that allows options to be configured,
> enabled and where appropriate to set a minimum size for selecting
> entries to print, the output produced when an OOM event occurs can be
> dynamically adjusted to produce as little or as much detail as needed
> for a given system.

Who is going to consume this information and why would that consumer be
unreasonable to demand further maintenance of that information in future
releases? In other words, debugfs is not considered a stable API, which is
OK here, but any change to these files results in user visible behavior,
and we consider that more or less stable as long as there are consumers.

> OOM debug options can be added to the base code as needed.
> 
> Currently we have the following OOM debug options defined:
> 
> * System State Summary
>   --------------------
>   One line of output that includes:
>   - Uptime (days, hour, minutes, seconds)

We do have timestamps in the log so why is this needed?

>   - Number CPUs
>   - Machine Type
>   - Node name
>   - Domain name

Why are these needed? That is static information that doesn't really
influence the OOM situation.

>   - Kernel Release
>   - Kernel Version

part of the oom report
> 
>   Example output when configured and enabled:
> 
> Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
> 
> * Tasks Summary
>   -------------
>   One line of output that includes:
>   - Number of Threads
>   - Number of processes
>   - Forks since boot
>   - Processes that are runnable
>   - Processes that are in iowait

We do have sysrq+t for this kind of information. Why do we need to
duplicate it?

>   Example output when configured and enabled:
> 
> Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> 
> * ARP Table and/or Neighbour Discovery Table Summary
>   --------------------------------------------------
>   One line of output each for ARP and ND that includes:
>   - Table name
>   - Table size (max # entries)
>   - Key Length
>   - Entry Size
>   - Number of Entries
>   - Last Flush (in seconds)
>   - hash grows
>   - entry allocations
>   - entry destroys
>   - Number lookups
>   - Number of lookup hits
>   - Resolution failures
>   - Garbage Collection Forced Runs
>   - Table Full
>   - Proxy Queue Length
> 
>   Example output when configured and enabled (for both):
> 
> ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0
> 
> ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0

Again, why is this needed particularly for the OOM event? I do
understand this might be useful system health diagnostic information but
how does this contribute to the OOM?

> * Add Select Slabs Print
>   ----------------------
>   Allow select slab entries (based on a minimum size) to be printed.
>   Minimum size is specified as a percentage of the total RAM memory
>   in tenths of a percent, consistent with existing OOM process scoring.
>   Valid values are specified from 0 to 1000 where 0 prints all slab
>   entries (all slabs that have at least one slab object in use) up
>   to 1000 which would require a slab to use 100% of memory which can't
>   happen so in that case only summary information is printed.
> 
>   The first line of output is the standard Linux output header for
>   OOM printed Slab entries. This header looks like this:
> 
> Aug  6 09:37:21 egc103 yourserver: Unreclaimable slab info:
> 
>   The output is existing slab entry memory usage limited such that only
>   entries equal to or larger than the minimum size are printed.
>   Empty slabs (no slab entries in slabs in use) are never printed.
> 
>   Additional output consists of summary information that is printed
>   at the end of the output. This summary information includes:
>   - # entries examined
>   - # entries selected and printed
>   - minimum entry size for selection
>   - Slabs total size (kB)
>   - Slabs reclaimable size (kB)
>   - Slabs unreclaimable size (kB)
> 
>   Example Summary output when configured and enabled:
> 
> Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB
> 
> Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB

I am all for practical improvements for slab reporting. It is not really
trivial to find a good balance though. Printing all the caches simply
doesn't scale. So I would start by improving the current state rather
than adding more configurability.
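
A minimal sketch of the tenths-of-a-percent selection described in the
quoted text, in userspace C. The helper names oomd_min_size_kb() and
oomd_should_print() are hypothetical and not taken from the patch; the
sketch only assumes what the cover letter states: valid values are 0 to
1000, 0 selects every in-use entry, and empty entries are never printed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): convert a threshold given in
 * tenths of a percent (0..1000) of total memory into a minimum size in kB.
 * The multiplication is split so it cannot overflow for large totals; the
 * result equals floor(total_kb * tenthpercent / 1000).
 */
static uint64_t oomd_min_size_kb(uint64_t total_kb, unsigned int tenthpercent)
{
	assert(tenthpercent <= 1000);
	return (total_kb / 1000) * tenthpercent +
	       (total_kb % 1000) * tenthpercent / 1000;
}

/*
 * Hypothetical helper: an entry is selected when it is in use (non-zero
 * size) and at least as large as the minimum size.
 */
static bool oomd_should_print(uint64_t entry_kb, uint64_t min_kb)
{
	return entry_kb > 0 && entry_kb >= min_kb;
}
```

With the totalpages value from the example summary line (32791608kB), a
setting of 10 (1%) gives a minimum size of 327916kB, and a setting of 0
gives 0kB, which matches the "minsize: 0kB" shown in the sample output.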

> 
> * Add Select Vmalloc allocations Print
>   ------------------------------------
>   Allow select vmalloc entries (based on a minimum size) to be printed.
>   Minimum size is specified as a percentage of the total RAM memory
>   in tenths of a percent, consistent with existing OOM process scoring.
>   Valid values are specified from 0 to 1000 where 0 prints all vmalloc
>   entries (all vmalloc allocations that have at least one page in use) up
>   to 1000 which would require a vmalloc to use 100% of memory which can't
>   happen so in that case only summary information is printed.
> 
>   The first line of output is a new Vmalloc output header for
>   OOM printed Vmalloc entries. This header looks like this:
> 
> Aug 19 19:27:01 yourserver kernel: Vmalloc Info:
> 
>   The output is vmalloc entry information output limited such that only
>   entries equal to or larger than the minimum size are printed.
>   Unused vmallocs (no pages assigned to the vmalloc) are never printed.
>   The vmalloc entry information includes:
>   - Size (in bytes)
>   - pages (Number pages in use)
>   - Caller Information to identify the request
> 
>   A sample vmalloc entry output looks like this:
> 
> Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113
> 
>   Additional output consists of summary information that is printed
>   at the end of the output. This summary information includes:
>   - Number of Vmalloc entries examined
>   - Number of Vmalloc entries printed
>   - minimum entry size for selection
> 
>   A sample Vmalloc Summary output looks like this:
> 
> Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB

This is a lot of information. I wouldn't be surprised if this alone
could easily overflow the ringbuffer. Besides that, it is rarely useful
for the OOM situation debugging. The overall size of the vmalloc area
is certainly interesting but I am not sure we have a handy counter to
cope with constrained OOM contexts.

> * Add Select Process Entries Print
>   --------------------------------
>   Allow select process entries (based on a minimum size) to be printed.
>   Minimum size is specified as a percentage totalpages (RAM + swap)
>   in tenths of a percent, consistent with existing OOM process scoring.
>   Note: user process memory can be swapped out when swap space present
>   so that is why swap space and ram memory comprise the totalpages
>   used to calculate the percentage of memory a process is using.
>   Valid values are specified from 0 to 1000 where 0 prints all user
>   processes (that have valid mm sections and aren't exiting) up to
>   1000 which would require a user process to use 100% of memory which
>   can't happen so in that case only summary information is printed.
> 
>   The first line of output is the standard Linux output headers for
>   OOM printed User Processes. This header looks like this:
> 
> Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages):
> Aug 19 19:27:01 yourserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> 
>   The output is existing per user process data limited such that only
>   entries equal to or larger than the minimum size are printed.
> 
> Jul 21 20:07:48 yourserver kernel: [    579]     0   579     7942     1010          90112        0         -1000 systemd-udevd
> 
>   Additional output consists of summary information that is printed
>   at the end of the output. This summary information includes:
> 
> Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB

This sounds like a good idea to limit the eligible process list size but
I am concerned that it might get misleading easily when there are many
small processes contributing to the OOM in the end.

> * Add Enhanced Process Print Information
>   --------------------------------------
>   Add OOM Debug code that prints additional detailed information about
>   users processes that were considered for OOM killing for any print
>   selected processes. The information is displayed for each user process
>   that OOM prints in the output.
> 
>   This supplemental per user process information is very helpful for
>   determining how process memory is used to allow OOM event root cause
>   identification that might not otherwise be possible.
> 
>   Output information for enhanced user process entries printed includes:
>   - pid
>   - parent pid
>   - ruid
>   - euid
>   - tgid
>   - Process State (S)
>   - utime in seconds
>   - stime in seconds
>   - oom_score_adjust
>   - task comm value (name of process)
>   - Vmem KiB
>   - MaxRss KiB
>   - CurRss KiB
>   - Pte KiB
>   - Swap KiB
>   - Sock KiB
>   - Lib KiB
>   - Text KiB
>   - Heap KiB
>   - Stack KiB
>   - File KiB
>   - Shmem KiB
>   - Read Pages
>   - Fault Pages
>   - Lock KiB
>   - Pinned KiB

I can see some of these being interesting, but I would rather pick those
and add them to the regular OOM output than go through configuring
them.

> Configuring Patches:
> -------------------
> OOM Debug and any options you want to use must first be configured so
> the code is included in your kernel. This requires selecting kernel
> config file options. You will find config options to select under:
> 
> Kernel hacking ---> Memory Debugging --->
> 
> [*] Debug OOM
>     [*] Debug OOM System State
>     [*] Debug OOM System Tasks Summary
>     [*] Debug OOM ARP Table
>     [*] Debug OOM ND Table
>     [*] Debug OOM Select Slabs Print
>        [*] Debug OOM Slabs Select Always Print Enable
>        [*] Debug OOM Enhanced Slab Print
>     [*] Debug OOM Select Vmallocs Print
>     [*] Debug OOM Select Process Print
>        [*] Debug OOM Enhanced Process Print

I really dislike these though. We already have zillions of debugging
options and the config space is enormous. Different combinations of them
make any compile testing a challenge and eat a lot of CPU cycles.
Besides that, who is going to configure those in without using them
directly? Distributions, for example, are not going to enable them unless
all options are disabled by default.

>  12 files changed, 1339 insertions(+), 11 deletions(-)

This must have been a lot of work and I really appreciate that.

On the other hand it is a lot of code to maintain (note that you are
usually introspecting deep internals of subsystems so changes would
have to be carefully considered here as well) without a very strong
demand.

Sure, it is a nice-to-have thing in some cases. I can imagine that some
of that information would have helped me when debugging some weird OOM
reports, but I strongly suspect I would not have had all the necessary
pieces enabled because those were not reproducible. Having everything
on is just not usable due to the amount of data. printk is not free and
we have seen cases where a lot of output just turned the machine into
an unusable state. If you have reproducible OOMs then you can trigger
a panic and have the full state of the system to examine. So I am not
really convinced all this is going to be used enough to justify the
maintenance overhead.

All that being said, I do not think this is something we want to merge
without a really _strong_ usecase to back it.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-27  7:15 ` [PATCH 00/10] OOM Debug print selection and additional information Michal Hocko
@ 2019-08-27 10:10   ` Tetsuo Handa
  2019-08-27 10:38     ` Michal Hocko
  2019-08-28  1:07   ` Edward Chron
  1 sibling, 1 reply; 46+ messages in thread
From: Tetsuo Handa @ 2019-08-27 10:10 UTC (permalink / raw)
  To: Michal Hocko, Edward Chron
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Shakeel Butt, linux-mm, linux-kernel, colona

On 2019/08/27 16:15, Michal Hocko wrote:
> All that being said, I do not think this is something we want to merge
> without a really _strong_ usecase to back it.

As the sender's domain "arista.com" suggests, some of the information is
geared towards networking devices, and the ability to report OOM information
in a way suitable for automatic recording/analyzing (e.g. without using a
shell prompt, let alone manually typing SysRq commands) would be convenient
for unattended devices. We have only one OOM killer implementation, and its
format/data are hard-coded. If we can make the OOM killer modular, Edward
would be able to use it.
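
To illustrate the "automatic recording/analyzing" point, here is a minimal
userspace sketch of how a log collector might consume the one-line Tasks
Summary. The field names are copied from the example output quoted earlier
in the thread; the struct and the function parse_tasks_summary() are
hypothetical, and the line format itself is only what the patch series
proposes, not an existing kernel interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of the proposed summary line. */
struct oom_tasks_summary {
	unsigned long threads;
	unsigned long processes;
	unsigned long forks_since_boot;
	unsigned long procs_runnable;
	unsigned long procs_iowait;
};

/*
 * Parse a syslog line containing the proposed Tasks Summary.
 * Returns 0 on success, -1 if the line does not carry all five fields.
 */
static int parse_tasks_summary(const char *line, struct oom_tasks_summary *out)
{
	const char *p;
	int n;

	assert(line != NULL && out != NULL);

	/* Skip the syslog prefix (timestamp, host, "kernel:"). */
	p = strstr(line, "Threads:");
	if (!p)
		return -1;

	/* "procs_runable" is spelled as in the example output. */
	n = sscanf(p,
		   "Threads:%lu Processes:%lu forks_since_boot:%lu "
		   "procs_runable:%lu procs_iowait:%lu",
		   &out->threads, &out->processes, &out->forks_since_boot,
		   &out->procs_runnable, &out->procs_iowait);

	return n == 5 ? 0 : -1;
}
```

A consumer like this is exactly what would turn the output format into a
de facto stable interface, which is the maintenance concern raised above.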



* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-27 10:10   ` Tetsuo Handa
@ 2019-08-27 10:38     ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2019-08-27 10:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Edward Chron, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel, colona

On Tue 27-08-19 19:10:18, Tetsuo Handa wrote:
> On 2019/08/27 16:15, Michal Hocko wrote:
> > All that being said, I do not think this is something we want to merge
> > without a really _strong_ usecase to back it.
> 
> Like the sender's domain "arista.com" suggests, some of information is
> geared towards networking devices, and ability to report OOM information
> in a way suitable for automatic recording/analyzing (e.g. without using
> shell prompt, let alone manually typing SysRq commands) would be convenient
> for unattended devices.

Why can't the remote end of the logging identify the host? It has to
connect somewhere anyway, right? I also assume that a log collector
already stores each log with a host id of some form.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
                   ` (10 preceding siblings ...)
  2019-08-27  7:15 ` [PATCH 00/10] OOM Debug print selection and additional information Michal Hocko
@ 2019-08-27 12:40 ` Qian Cai
       [not found]   ` <CAM3twVQEMGWMQEC0dduri0JWt3gH6F2YsSqOmk55VQz+CZDVKg@mail.gmail.com>
  11 siblings, 1 reply; 46+ messages in thread
From: Qian Cai @ 2019-08-27 12:40 UTC (permalink / raw)
  To: Edward Chron, Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona

On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> This patch series provides code that works as a debug option through
> debugfs to provide additional controls to limit how much information
> gets printed when an OOM event occurs and or optionally print additional
> information about slab usage, vmalloc allocations, user process memory
> usage, the number of processes / tasks and some summary information
> about these tasks (number runable, i/o wait), system information
> (#CPUs, Kernel Version and other useful state of the system),
> ARP and ND Cache entry information.
> 
> Linux OOM can optionally provide a lot of information, what's missing?
> ----------------------------------------------------------------------
> Linux provides a variety of detailed information when an OOM event occurs
> but has limited options to control how much output is produced. The
> system related information is produced unconditionally and limited per
> user process information is produced as a default enabled option. The
> per user process information may be disabled.
> 
> Slab usage information was recently added and is output only if slab
> usage exceeds user memory usage.
> 
> Many OOM events are due to user application memory usage sometimes in
> combination with the use of kernel resource usage that exceeds what is
> expected memory usage. Detailed information about how memory was being
> used when the event occurred may be required to identify the root cause
> of the OOM event.
> 
> However, some environments are very large and printing all of the
> information about processes, slabs and or vmalloc allocations may
> not be feasible. For other environments printing as much information
> about these as possible may be needed to root cause OOM events.
> 

For more in-depth analysis of OOM events, people could use kdump to save a
vmcore by setting "panic_on_oom", and then use the crash utility to analyze the
vmcore, which contains pretty much all the information you need.

The downside of that approach is that it is probably only suitable for
enterprise use cases: kdump/crash may be tested properly on enterprise-level
distros, while the combo is more often broken for developers on consumer
distros, because kdump/crash can be affected by many kernel subsystems and
tends to break fairly quickly where community testing is light.


* Re: [PATCH 01/10] mm/oom_debug: Add Debug base code
  2019-08-26 19:36 ` [PATCH 01/10] mm/oom_debug: Add Debug base code Edward Chron
@ 2019-08-27 13:28   ` kbuild test robot
  0 siblings, 0 replies; 46+ messages in thread
From: kbuild test robot @ 2019-08-27 13:28 UTC (permalink / raw)
  To: Edward Chron
  Cc: kbuild-all, Andrew Morton, Michal Hocko, Roman Gushchin,
	Johannes Weiner, David Rientjes, Tetsuo Handa, Shakeel Butt,
	linux-mm, linux-kernel, colona, Edward Chron

[-- Attachment #1: Type: text/plain, Size: 6046 bytes --]

Hi Edward,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc6 next-20190827]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Edward-Chron/mm-oom_debug-Add-Debug-base-code/20190827-183210
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/printk.h:6:0,
                    from include/linux/kernel.h:15,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from include/linux/debugfs.h:15,
                    from mm/oom_kill_debug.c:135:
>> mm/oom_kill_debug.c:261:17: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
    subsys_initcall(oom_debug_init)
                    ^
   include/linux/init.h:197:50: note: in definition of macro '___define_initcall'
      __attribute__((__section__(#__sec ".init"))) = fn;
                                                     ^~
   include/linux/init.h:224:30: note: in expansion of macro '__define_initcall'
    #define subsys_initcall(fn)  __define_initcall(fn, 4)
                                 ^~~~~~~~~~~~~~~~~
>> mm/oom_kill_debug.c:261:1: note: in expansion of macro 'subsys_initcall'
    subsys_initcall(oom_debug_init)
    ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +261 mm/oom_kill_debug.c

 > 135	#include <linux/debugfs.h>
   136	#include <linux/fs.h>
   137	#include <linux/init.h>
   138	#include <linux/kernel.h>
   139	#include <linux/kobject.h>
   140	#include <linux/oom.h>
   141	#include <linux/printk.h>
   142	#include <linux/slab.h>
   143	#include <linux/string.h>
   144	#include <linux/sysfs.h>
   145	#include "oom_kill_debug.h"
   146	
   147	#define OOMD_MAX_FNAME 48
   148	#define OOMD_MAX_OPTNAME 32
   149	
   150	#define K(x) ((x) << (PAGE_SHIFT-10))
   151	
   152	static const char oom_debug_path[] = "/sys/kernel/debug/oom";
   153	
   154	static const char od_root_name[] = "oom";
   155	static struct dentry *od_root_dir;
   156	static u32 oom_kill_debug_oom_events;
   157	
   158	/* One oom_debug_option entry per debug option */
   159	struct oom_debug_option {
   160		const char *option_name;
   161		umode_t mode;
   162		struct dentry *dir_dentry;
   163		struct dentry *enabled_dentry;
   164		struct dentry *tenthpercent_dentry;
   165		bool enabled;
   166		u16 tenthpercent;
   167		bool support_tpercent;
   168	};
   169	
   170	/* Table of oom debug options, new options need to be added here */
   171	static struct oom_debug_option oom_debug_options_table[] = {
   172		{}
   173	};
   174	
   175	/* Option index by name for order one-lookup, add new options entry here */
   176	enum oom_debug_options_index {
   177		OUT_OF_BOUNDS
   178	};
   179	
   180	bool oom_kill_debug_enabled(u16 index)
   181	{
   182		return oom_debug_options_table[index].enabled;
   183	}
   184	
   185	u16 oom_kill_debug_tenthpercent(u16 index)
   186	{
   187		return oom_debug_options_table[index].tenthpercent;
   188	}
   189	
   190	static void filename_gen(char *pdest, const char *optname, const char *fname)
   191	{
   192		size_t len;
   193		char *pmsg;
   194	
   195		sprintf(pdest, "%s", optname);
   196		len = strnlen(pdest, OOMD_MAX_OPTNAME);
   197		pmsg = pdest + len;
   198		sprintf(pmsg, "%s", fname);
   199	}
   200	
   201	static void enabled_file_gen(struct oom_debug_option *entry)
   202	{
   203		char filename[OOMD_MAX_FNAME];
   204	
   205		filename_gen(filename, entry->option_name, "enabled");
   206		debugfs_create_bool(filename, 0644, entry->dir_dentry,
   207				    &entry->enabled);
   208		entry->enabled = OOM_KILL_DEBUG_DEFAULT_ENABLED;
   209	}
   210	
   211	static void tpercent_file_gen(struct oom_debug_option *entry)
   212	{
   213		char filename[OOMD_MAX_FNAME];
   214	
   215		filename_gen(filename, entry->option_name, "tenthpercent");
   216		debugfs_create_u16(filename, 0644, entry->dir_dentry,
   217				   &entry->tenthpercent);
   218		entry->tenthpercent = OOM_KILL_DEBUG_DEFAULT_TENTHPERCENT;
   219	}
   220	
   221	static void oom_debugfs_init(void)
   222	{
   223		struct oom_debug_option *table, *entry;
   224	
   225		od_root_dir = debugfs_create_dir(od_root_name, NULL);
   226	
   227		table = oom_debug_options_table;
   228		for (entry = table; entry->option_name; entry++) {
   229			entry->dir_dentry = od_root_dir;
   230			enabled_file_gen(entry);
   231			if (entry->support_tpercent)
   232				tpercent_file_gen(entry);
   233		}
   234	}
   235	
   236	static void oom_debug_common_cleanup(void)
   237	{
   238		/* Cleanup for oom root directory */
   239		debugfs_remove(od_root_dir);
   240	}
   241	
   242	u32 oom_kill_debug_oom_event(void)
   243	{
   244		return oom_kill_debug_oom_events;
   245	}
   246	
   247	u32 oom_kill_debug_oom_event_is(void)
   248	{
   249		++oom_kill_debug_oom_events;
   250	
   251		return oom_kill_debug_oom_events;
   252	}
   253	
   254	static void __init oom_debug_init(void)
   255	{
   256		/* Ensure we have a debugfs oom root directory */
   257		od_root_dir = debugfs_lookup(od_root_name, NULL);
   258		if (!od_root_dir)
   259			oom_debugfs_init();
   260	}
 > 261	subsys_initcall(oom_debug_init)
   262	
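
The error at line 261 comes from the function's return type: the kernel's
initcall_t is declared as a pointer to a function returning int, while
oom_debug_init() above returns void. The following is a userspace sketch of
that type relationship, not the kernel macro itself; run_initcall() and the
oom_debug_initialized flag are hypothetical stand-ins added for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in matching the kernel's initcall_t typedef. */
typedef int (*initcall_t)(void);

/* Hypothetical flag so the effect of the init function is observable. */
static int oom_debug_initialized;

/*
 * Returning int (0 on success) makes the function compatible with
 * initcall_t; the void-returning version in the listing is not.
 */
static int oom_debug_init(void)
{
	oom_debug_initialized = 1;
	return 0;
}

/*
 * Roughly what subsys_initcall(oom_debug_init) boils down to: a typed
 * function pointer initialised with the init function.
 */
static initcall_t initcall_oom_debug_init = oom_debug_init;

/* Hypothetical runner standing in for the kernel's initcall dispatch. */
static int run_initcall(initcall_t fn)
{
	assert(fn != NULL);
	return fn();
}
```

In the patch itself the equivalent change would presumably be to declare
the function as "static int __init oom_debug_init(void)" and return 0.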

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 51791 bytes --]


* Re: [PATCH 10/10] mm/oom_debug: Add Enhanced Process Print Information
  2019-08-26 19:36 ` [PATCH 10/10] mm/oom_debug: Add Enhanced Process " Edward Chron
@ 2019-08-28  0:21   ` kbuild test robot
  0 siblings, 0 replies; 46+ messages in thread
From: kbuild test robot @ 2019-08-28  0:21 UTC (permalink / raw)
  To: Edward Chron
  Cc: kbuild-all, Andrew Morton, Michal Hocko, Roman Gushchin,
	Johannes Weiner, David Rientjes, Tetsuo Handa, Shakeel Butt,
	linux-mm, linux-kernel, colona, Edward Chron, David S. Miller,
	netdev

[-- Attachment #1: Type: text/plain, Size: 972 bytes --]

Hi Edward,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc6 next-20190827]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Edward-Chron/mm-oom_debug-Add-Debug-base-code/20190827-183210
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-10) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: mm/oom_kill_debug.o: in function `timespec_format.constprop.2':
>> oom_kill_debug.c:(.text+0x156): undefined reference to `__udivdi3'
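
An undefined reference to `__udivdi3' on i386 usually means a 64-bit value
was divided with the plain C "/" operator; the kernel does not link against
libgcc, so 32-bit builds are expected to use helpers such as div_u64_rem()
from <linux/math64.h> instead. The sketch below is userspace C under that
assumption: div_u64_rem_sketch() only mimics the shape of the kernel
helper, and format_uptime() is a hypothetical stand-in for the
timespec_format() function named in the error, whose source is not shown
here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace stand-in with the same shape as the kernel's div_u64_rem():
 * 64-bit dividend, 32-bit divisor, 32-bit remainder. In userspace the
 * plain operators are fine because libgcc supplies __udivdi3.
 */
static uint64_t div_u64_rem_sketch(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	assert(divisor != 0);
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/* Hypothetical formatter producing "D days HH:MM:SS" from seconds. */
static int format_uptime(char *buf, size_t len, uint64_t uptime_secs)
{
	uint32_t rem_secs;
	unsigned int hours, mins, secs;
	uint64_t days = div_u64_rem_sketch(uptime_secs, 86400, &rem_secs);

	/* The remainder fits in 32 bits, so ordinary division is safe here. */
	hours = rem_secs / 3600;
	mins = (rem_secs % 3600) / 60;
	secs = rem_secs % 60;

	return snprintf(buf, len, "%llu days %02u:%02u:%02u",
			(unsigned long long)days, hours, mins, secs);
}
```

Only the first division involves a 64-bit dividend, so it is the only one
that would need the kernel helper on a 32-bit architecture.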

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 69549 bytes --]


* Re: [PATCH 00/10] OOM Debug print selection and additional information
       [not found]   ` <CAM3twVQEMGWMQEC0dduri0JWt3gH6F2YsSqOmk55VQz+CZDVKg@mail.gmail.com>
@ 2019-08-28  0:50     ` Qian Cai
  2019-08-28  1:13       ` Edward Chron
  0 siblings, 1 reply; 46+ messages in thread
From: Qian Cai @ 2019-08-28  0:50 UTC (permalink / raw)
  To: Edward Chron
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande



> On Aug 27, 2019, at 8:23 PM, Edward Chron <echron@arista.com> wrote:
> 
> 
> 
> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <cai@lca.pw> wrote:
> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > This patch series provides code that works as a debug option through
> > debugfs to provide additional controls to limit how much information
> > gets printed when an OOM event occurs and or optionally print additional
> > information about slab usage, vmalloc allocations, user process memory
> > usage, the number of processes / tasks and some summary information
> > about these tasks (number runable, i/o wait), system information
> > (#CPUs, Kernel Version and other useful state of the system),
> > ARP and ND Cache entry information.
> > 
> > Linux OOM can optionally provide a lot of information, what's missing?
> > ----------------------------------------------------------------------
> > Linux provides a variety of detailed information when an OOM event occurs
> > but has limited options to control how much output is produced. The
> > system related information is produced unconditionally and limited per
> > user process information is produced as a default enabled option. The
> > per user process information may be disabled.
> > 
> > Slab usage information was recently added and is output only if slab
> > usage exceeds user memory usage.
> > 
> > Many OOM events are due to user application memory usage sometimes in
> > combination with the use of kernel resource usage that exceeds what is
> > expected memory usage. Detailed information about how memory was being
> > used when the event occurred may be required to identify the root cause
> > of the OOM event.
> > 
> > However, some environments are very large and printing all of the
> > information about processes, slabs and or vmalloc allocations may
> > not be feasible. For other environments printing as much information
> > about these as possible may be needed to root cause OOM events.
> > 
> 
> For more in-depth analysis of OOM events, people could use kdump to save a
> vmcore by setting "panic_on_oom", and then use the crash utility to analysis the
>  vmcore which contains pretty much all the information you need.
> 
> Certainly, this is the ideal. A full system dump would give you the maximum amount of
> information. 
> 
> Unfortunately some environments may lack space to store the dump,

Kdump usually also supports dumping to a remote target via NFS, SSH, etc.

> let alone the time to dump the storage contents and restart the system. Some

There is also “makedumpfile”, which can compress the dump and filter out unwanted
memory to reduce the vmcore size, and speed up the dumping process by using multiple threads.

> systems can take many minutes to fully boot up, to reset and reinitialize all the
> devices. So unfortunately this is not always an option, and we need an OOM Report.

I am not sure how a system needing some minutes to reboot is relevant to the
discussion here. The idea is to save a vmcore, which can be analyzed offline, even on
another system, as long as it has a matching “vmlinux”.




* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-27  7:15 ` [PATCH 00/10] OOM Debug print selection and additional information Michal Hocko
  2019-08-27 10:10   ` Tetsuo Handa
@ 2019-08-28  1:07   ` Edward Chron
  2019-08-28  6:59     ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-28  1:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 26-08-19 12:36:28, Edward Chron wrote:
> [...]
> > Extensibility using OOM debug options
> > -------------------------------------
> > What is needed is an extensible system to optionally configure
> > debug options as needed and to then dynamically enable and disable
> > them. Also for options that produce multiple lines of entry based
> > output, to configure which entries to print based on how much
> > memory they use (or optionally all the entries).
>
> With a patch this large and adding a lot of new stuff we need a more
> detailed usecases described I believe.

I guess it would make sense to explain the motivation for each OOM Debug
option I've sent separately.
I see there are comments on the patches; I will try to add more information there.

As an overview, we've been collecting information on OOMs
over the last 12 years or so.
These are from switches, other embedded devices, servers both large and small.
We ask for feedback on what information was helpful or could be helpful.
We try and add it to make root causing issues easier.

These OOM debug options are some of the options we've created.
I didn't port all of them to 5.3 but these are representative.
Our latest kernel is a bit behind 5.3.

>
>
> [...]
>
> > Use of debugfs to allow dynamic controls
> > ----------------------------------------
> > By providing a debugfs interface that allows options to be configured,
> > enabled and where appropriate to set a minimum size for selecting
> > entries to print, the output produced when an OOM event occurs can be
> > dynamically adjusted to produce as little or as much detail as needed
> > for a given system.
>
> Who is going to consume this information and why would that consumer be
> unreasonable to demand further maintenance of that information in future
> releases? In other words debugfs is not considered a stableAPI which is
> OK here but the side effect of any change to these files results in user
> visible behavior and we consider that more or less a stable as long as
> there are consumers.
>
> > OOM debug options can be added to the base code as needed.
> >
> > Currently we have the following OOM debug options defined:
> >
> > * System State Summary
> >   --------------------
> >   One line of output that includes:
> >   - Uptime (days, hour, minutes, seconds)
>
> We do have timestamps in the log so why is this needed?


Here is how an OOM report looks when we get it to look at:

Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
oom_score_adj=1000
Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
tainted 5.3.0-rc6+ #33
Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018

This shows the date and time, not the time since the last boot. The
/var/log/messages output is what we often have to look at, not raw
dmesg output.

>
>
> >   - Number CPUs
> >   - Machine Type
> >   - Node name
> >   - Domain name
>
> why are these needed? That is a static information that doesn't really
> influence the OOM situation.


Sorry if a few of the items overlap what OOM prints.
We've been printing a lot of this information since 2.6.38 and OOM
reporting has been updated.

We're updating our 4.19 system to have the latest OOM report format.
That is the 5.0 patch that reorganized the OOM report in the dump header.
We are also backporting Shakeel's 5.3 patch to refactor dump tasks for memcg OOMs.
We're testing those backports right now, in fact.

We can probably get rid of some of the information we have but I
haven't had a chance yet.
Hopefully we can do it as part of sending some code upstream.

>
>
> >   - Kernel Release
> >   - Kernel Version
>
> part of the oom report
>
> >
> >   Example output when configured and enabled:
> >
> > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
> >
> > * Tasks Summary
> >   -------------
> >   One line of output that includes:
> >   - Number of Threads
> >   - Number of processes
> >   - Forks since boot
> >   - Processes that are runnable
> >   - Processes that are in iowait
>
> We do have sysrq+t for this kind of information. Why do we need to
> duplicate it?

Unfortunately, we can't log in to every customer system, or even every
system of our own, and do a sysrq+t after each OOM.
You could scan for OOMs and have a script do it, but if you do a sysrq+t
after an OOM event, you'll get different results.
I'd rather have the runnable and iowait counts during the OOM event, not after.
Computers are so darn fast: free up some memory and things can look a
lot different.

We've seen crond fork and hang and gradually create thousands of
processes, and all sorts of other unintended fork bombs.
On some systems we can't print all of the process information, as we've
discussed.
So we print a summary of how many there are in total, and if you use the
select process print option you can print all the processes
that use more than 1% of memory, for example. That may be a dozen or two
versus hundreds or thousands, which can make printing the largest
memory-using processes feasible.
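
As a rough sketch of that selection (not the patch code itself), the
threshold is given in tenths of a percent of totalpages, and only entries
at or above it are printed. The function name and the (name, pages) tuples
below are hypothetical; how a task's size is measured is left out here.

```python
def select_tasks(tasks, totalpages, min_tenths_pct):
    """Pick the tasks whose size meets the minimum threshold.

    tasks is a list of (name, pages) tuples (a hypothetical layout).
    min_tenths_pct is in tenths of a percent: 10 means 1% of totalpages,
    0 selects every task. Returns (selected, considered_count).
    """
    min_pages = totalpages * min_tenths_pct // 1000
    selected = [(name, pages) for name, pages in tasks if pages >= min_pages]
    # Largest consumers first, since those are what we want to look at.
    selected.sort(key=lambda t: t[1], reverse=True)
    return selected, len(tasks)


tasks = [("crond", 5), ("bigapp", 20), ("agent", 10)]
selected, considered = select_tasks(tasks, totalpages=1000, min_tenths_pct=10)
print("considered:%d printed:%d" % (considered, len(selected)))
for name, pages in selected:
    print(name, pages)
```

The summary counts are still reported for every task considered, so you
can see how many processes were left out of the detailed output.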

>
>
> >   Example output when configured and enabled:
> >
> > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> >
> > * ARP Table and/or Neighbour Discovery Table Summary
> >   --------------------------------------------------
> >   One line of output each for ARP and ND that includes:
> >   - Table name
> >   - Table size (max # entries)
> >   - Key Length
> >   - Entry Size
> >   - Number of Entries
> >   - Last Flush (in seconds)
> >   - hash grows
> >   - entry allocations
> >   - entry destroys
> >   - Number lookups
> >   - Number of lookup hits
> >   - Resolution failures
> >   - Garbage Collection Forced Runs
> >   - Table Full
> >   - Proxy Queue Length
> >
> >   Example output when configured and enabled (for both):
> >
> > ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0
> >
> > ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0
>
> Again, why is this needed particularly for the OOM event? I do
> understand this might be useful system health diagnostic information but
> how does this contribute to the OOM?
>

It is an example of some system table information we print.
Other adjustable table information may be useful as well.
These table sizes are often adjustable and collecting stats on usage
helps determine if the settings are appropriate.
The values captured during OOM events are very useful, as usage varies.
We also collect the same stats from user code periodically
and can compare the two.
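
For what it's worth, here is a minimal sketch of how a script might turn
one of the neighbour table lines into a dictionary so that the OOM-time
values can be compared with periodically collected ones. The parsing
approach and the names are my own illustration, and it assumes the
"key: value" layout of the example output in the series.

```python
import re

FIELD_RE = re.compile(r"([A-Za-z]+):\s*(\d+)")
GC_RE = re.compile(r"gcRuns/Forced:\s*(\d+)\s*/\s*(\d+)")


def parse_neigh_line(line):
    """Parse a 'neighbour: Table: ...' line into a dict, or return None."""
    m = re.search(r"Table:\s*(\w+)", line)
    if m is None:
        return None
    stats = {"table": m.group(1)}
    # Only look at the text after the table name, so a syslog timestamp
    # or host name in front of it cannot be mistaken for a field.
    rest = line[m.end():]
    gc = GC_RE.search(rest)
    if gc:
        # "gcRuns/Forced: 111 /  0" holds two values, so split it by hand.
        stats["gcRuns"] = int(gc.group(1))
        stats["gcForced"] = int(gc.group(2))
        rest = rest[:gc.start()] + rest[gc.end():]
    for key, value in FIELD_RE.findall(rest):
        stats[key] = int(value)
    return stats


sample = ("... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 "
          "entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 "
          "allocs:     9 destroys:     0 lookups:   204 hits:   199 "
          "resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0")
stats = parse_neigh_line(sample)
print(stats["table"], stats["entries"], stats["size"])
```

Comparing entries against size at OOM time with the same ratio from the
periodic samples is one way to judge whether a table setting is appropriate.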

>
> > * Add Select Slabs Print
> >   ----------------------
> >   Allow select slab entries (based on a minimum size) to be printed.
> >   Minimum size is specified as a percentage of the total RAM memory
> >   in tenths of a percent, consistent with existing OOM process scoring.
> >   Valid values are specified from 0 to 1000 where 0 prints all slab
> >   entries (all slabs that have at least one slab object in use) up
> >   to 1000 which would require a slab to use 100% of memory which can't
> >   happen so in that case only summary information is printed.
> >
> >   The first line of output is the standard Linux output header for
> >   OOM printed Slab entries. This header looks like this:
> >
> > Aug  6 09:37:21 egc103 yourserver: Unreclaimable slab info:
> >
> >   The output is existing slab entry memory usage limited such that only
> >   entries equal to or larger than the minimum size are printed.
> >   Empty slabs (no slab entries in slabs in use) are never printed.
> >
> >   Additional output consists of summary information that is printed
> >   at the end of the output. This summary information includes:
> >   - # entries examined
> >   - # entries selected and printed
> >   - minimum entry size for selection
> >   - Slabs total size (kB)
> >   - Slabs reclaimable size (kB)
> >   - Slabs unreclaimable size (kB)
> >
> >   Example Summary output when configured and enabled:
> >
> > Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB
> >
> > Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB
>
> I am all for practical improvements for slab reporting. It is not really
> trivial to find a good balance though. Printing all the caches simply
> doesn't scale. So I would start by improving the current state rather
> than adding more configurability.


Yes, there is a challenge here and with the information you choose to
report when an OOM event occurs.
Paraphrasing, one size may not fit all.
To address this we tried to make it easy to add options and to allow
them to be enabled / disabled.
We'd rather limit output based on memory usage than have the kernel's
print rate limiting cut it off arbitrarily.
We had to make some choices on how to do this.

That said we view the OOM report as debugging information.
So if you change the format as long as we get the information we feel
is relevant, we're happy.
Since we print release and version information we can adjust our
scripts to handle format changes.
It's work but not really that big a deal.
If you remove information that was useful that is a bit more painful,
but not the end of the world.

>
>
> >
> > * Add Select Vmalloc allocations Print
> >   ------------------------------------
> >   Allow select vmalloc entries (based on a minimum size) to be printed.
> >   Minimum size is specified as a percentage of the total RAM memory
> >   in tenths of a percent, consistent with existing OOM process scoring.
> >   Valid values are specified from 0 to 1000 where 0 prints all vmalloc
> >   entries (all vmalloc allocations that have at least one page in use) up
> >   to 1000 which would require a vmalloc to use 100% of memory which can't
> >   happen so in that case only summary information is printed.
> >
> >   The first line of output is a new Vmalloc output header for
> >   OOM printed Vmalloc entries. This header looks like this:
> >
> > Aug 19 19:27:01 yourserver kernel: Vmalloc Info:
> >
> >   The output is vmalloc entry information output limited such that only
> >   entries equal to or larger than the minimum size are printed.
> >   Unused vmallocs (no pages assigned to the vmalloc) are never printed.
> >   The vmalloc entry information includes:
> >   - Size (in bytes)
> >   - pages (Number pages in use)
> >   - Caller Information to identify the request
> >
> >   A sample vmalloc entry output looks like this:
> >
> > Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113
> >
> >   Additional output consists of summary information that is printed
> >   at the end of the output. This summary information includes:
> >   - Number of Vmalloc entries examined
> >   - Number of Vmalloc entries printed
> >   - minimum entry size for selection
> >
> >   A sample Vmalloc Summary output looks like this:
> >
> > Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB
>
> This is a lot of information. I wouldn't be surprised if this alone
> could easily overflow the ringbuffer. Besides that, it is rarely useful
> for the OOM situation debugging. The overall size of the vmalloc area
> is certainly interesting but I am not sure we have a handy counter to
> cope with constrained OOM contexts.
>

We've had cases where just displaying very large allocations explained
why an OOM event occurred.
We size this so we rarely get much output here, an entry or two at most.
Again it is optional so if you don't care don't enable it.

>
> > * Add Select Process Entries Print
> >   --------------------------------
> >   Allow select process entries (based on a minimum size) to be printed.
> >   Minimum size is specified as a percentage totalpages (RAM + swap)
> >   in tenths of a percent, consistent with existing OOM process scoring.
> >   Note: user process memory can be swapped out when swap space present
> >   so that is why swap space and ram memory comprise the totalpages
> >   used to calculate the percentage of memory a process is using.
> >   Valid values are specified from 0 to 1000 where 0 prints all user
> >   processes (that have valid mm sections and aren't exiting) up to
> >   1000 which would require a user process to use 100% of memory which
> >   can't happen so in that case only summary information is printed.
> >
> >   The first line of output is the standard Linux output headers for
> >   OOM printed User Processes. This header looks like this:
> >
> > Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages):
> > Aug 19 19:27:01 yourserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> >
> >   The output is existing per user process data limited such that only
> >   entries equal to or larger than the minimum size are printed.
> >
> > Jul 21 20:07:48 yourserver kernel: [    579]     0   579     7942     1010          90112        0         -1000 systemd-udevd
> >
> >   Additional output consists of summary information that is printed
> >   at the end of the output. This summary information includes:
> >
> > Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB
>
> This sounds like a good idea to limit the eligible process list size but
> I am concerned that it might get misleading easily when there are many
> small processes contributing to the OOM in the end.
>
> > * Add Enhanced Process Print Information
> >   --------------------------------------
> >   Add OOM Debug code that prints additional detailed information about
> >   users processes that were considered for OOM killing for any print
> >   selected processes. The information is displayed for each user process
> >   that OOM prints in the output.
> >
> >   This supplemental per user process information is very helpful for
> >   determining how process memory is used to allow OOM event root cause
> >   identification that might not otherwise be possible.
> >
> >   Output information for enhanced user process entries printed includes:
> >   - pid
> >   - parent pid
> >   - ruid
> >   - euid
> >   - tgid
> >   - Process State (S)
> >   - utime in seconds
> >   - stime in seconds
> >   - oom_score_adjust
> >   - task comm value (name of process)
> >   - Vmem KiB
> >   - MaxRss KiB
> >   - CurRss KiB
> >   - Pte KiB
> >   - Swap KiB
> >   - Sock KiB
> >   - Lib KiB
> >   - Text KiB
> >   - Heap KiB
> >   - Stack KiB
> >   - File KiB
> >   - Shmem KiB
> >   - Read Pages
> >   - Fault Pages
> >   - Lock KiB
> >   - Pinned KiB
>
> I can see some of these being interesting but I would rather pick up
> those and add to the regular oom output rather than go over configuring
> them.
>

Would be glad to add these to standard OOM output.
One issue is there are extra bytes of output with more detail.
So when constrained, to justify this we said we'd rather have lots of
detail on the top 50 or so memory-consuming processes versus coarse
information on all user processes.
The task information will provide us counts of processes and measures
of process creation that are very useful.

>
> > Configuring Patches:
> > -------------------
> > OOM Debug and any options you want to use must first be configured so
> > the code is included in your kernel. This requires selecting kernel
> > config file options. You will find config options to select under:
> >
> > Kernel hacking ---> Memory Debugging --->
> >
> > [*] Debug OOM
> >     [*] Debug OOM System State
> >     [*] Debug OOM System Tasks Summary
> >     [*] Debug OOM ARP Table
> >     [*] Debug OOM ND Table
> >     [*] Debug OOM Select Slabs Print
> >        [*] Debug OOM Slabs Select Always Print Enable
> >        [*] Debug OOM Enhanced Slab Print
> >     [*] Debug OOM Select Vmallocs Print
> >     [*] Debug OOM Select Process Print
> >        [*] Debug OOM Enhanced Process Print
>
> I really dislike these though. We already have zillions of debugging
> options and the config space is enormous. Different combinations of them
> make any compile testing a challenge and a lot of cpu cycles eaten.
> Besides that, who is going to configure those in without using them
> directly? Distributions are not going to enable without having all
> options being disabled by default for example.
>

Oh I agree, I dislike configuration options. There are so many, and
when you upgrade you're like, what now?
That said, I understand their value when systems range from small
embedded devices up to supercomputers and zillions of devices.

I would be pleased to have just one configuration option or, better yet,
to have the code be part of the standard system. So getting rid of any
or all of that would be a pleasure.
Quite honestly, we may argue for certain items but in general we're
quite flexible.

> >  12 files changed, 1339 insertions(+), 11 deletions(-)
>
> This must have been a lot of work and I really appreciate that.
>
> On the other hand it is a lot of code to maintain (note that you are
> usually introspecting deep internals of subsystems so changes would
> have to be carefully considered here as well) without a very strong
> demand.
>
> Sure it is a nice to have thing in some cases. I can imagine that some
> of that information would have helped me when debugging some weird OOM
> reports but I strongly suspect I would likely not have all necessary
> pieces enabled because those were not reproducible. Having everything
> on is just not usable due to amount of data. printk is not free and
> we have seen cases where a lot of output just turned the machine into
> unusable state. If you have reproducible OOMs then you can trigger
> a panic and have the full state of the system to examine. So I am not
> really convinced all this is going to be used to justify the maintenance
> overhead.


I can speak to many OOM events we have had to triage and root cause
over the past 7+ years that I've been involved with. It is quite true
that there is no single OOM report format that will allow every problem
to be completely root caused. The OOM report cannot provide all the
information a full dump provides. That said, the OOM report can give
you an excellent start on where to look when you otherwise aren't sure
where to look.
With luck everything you need is in the OOM report and you root cause
right there.

I can give you all sorts of examples of this.
They're all anecdotal, but I would expect that admin and support people
in data centers see much the same sorts of issues. I would welcome input
from others too.
Different environments certainly can vary.

On the issue of reproducible OOMs versus non-reproducible ones, that is
important to consider:

First, many OOMs we look at happen in the data center and they are not
easily reproducible. The analogy I use is that we spend a lot of time
having to drive by our tail lights.
That is, we do a postmortem with limited information after the fact. Why?
We don't have the time or luxury to turn on panic on OOM and let the
system reboot.
In fact we very often have neither the time it takes to dump the
system nor the storage space to hold a full system dump, a shame as it
is the best scenario for sure.
If a switch locks up for a few seconds the routing protocols can time
out, and that can start a reconfiguration chain reaction in your data
center that will not be well received.

If we could take a full system dump every time we need to capture the
state of the system
you wouldn't need an OOM report. In fact where else in the Kernel does
the Kernel produce
a report? OOM events are an odd beast that for some systems are just
an annoyance and
on other systems can be quite painful.

If you're lucky you can ignore the fact that OOM killed one of your
tabs in the Chrome browser.
You're not so lucky if a key process gets OOM killed, causing a cascade
of issues. The more pain you feel, the more motivated you become to try
and avoid future events.

We're not touching on situations where OOM events occur in clusters or
periodically due to a persisting issue, or lots of other OOM dramas that
occur from time to time. For people who are unlucky and have to care
about OOM events, you often can't reproduce these, and you want to
capture as much information as is reasonable so you can work out what
the cause was, with the hope that you can prevent future events.

How much information is reasonable and what information you want to
record may vary.

> All that being said, I do not think this is something we want to merge
> without a really _strong_ usecase to back it.
>

I will supply any information that I can. Let me know specifics on
what you need.
I guess I can try to explain the justification for each option I sent
and we can have a dialog as needed.
That is at least a starting point.

I was hoping that posting this code and starting a discussion might
draw in both experts and
others with an interest in the information that is produced for an OOM event.

Our experience is that some additional information and the ability to
adjust what is produced is valuable.
We don't add new options all the time but making it easy to do so is helpful.

It would be nice if everything was standard output but even optional
configurable information is better than none.
We can continue to mod our kernel but if others would benefit, we're
happy to contribute to the best
of our abilities. We're flexible enough to make any recommended
improvements as well.

Also, our implementation, though we've been using it for some years
and it continues to evolve, is a reference implementation. Since the
output is debugging information and we identify what system release and
version produces the output with each event, we can adjust our scripts
to deal with output changes as the system evolves. This is expected as
systems and Linux continue to evolve and improve.

We'd be happy to work with you and your colleagues to contribute any
improvements that you can accept to help improve the OOM Report output.

Thank-you again for your time and consideration!

Edward Chron
Arista Networks

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28  0:50     ` Qian Cai
@ 2019-08-28  1:13       ` Edward Chron
  2019-08-28  1:32         ` Qian Cai
  0 siblings, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-28  1:13 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <cai@lca.pw> wrote:
>
>
>
> > On Aug 27, 2019, at 8:23 PM, Edward Chron <echron@arista.com> wrote:
> >
> >
> >
> > On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <cai@lca.pw> wrote:
> > On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > > This patch series provides code that works as a debug option through
> > > debugfs to provide additional controls to limit how much information
> > > gets printed when an OOM event occurs and or optionally print additional
> > > information about slab usage, vmalloc allocations, user process memory
> > > usage, the number of processes / tasks and some summary information
> > > about these tasks (number runable, i/o wait), system information
> > > (#CPUs, Kernel Version and other useful state of the system),
> > > ARP and ND Cache entry information.
> > >
> > > Linux OOM can optionally provide a lot of information, what's missing?
> > > ----------------------------------------------------------------------
> > > Linux provides a variety of detailed information when an OOM event occurs
> > > but has limited options to control how much output is produced. The
> > > system related information is produced unconditionally and limited per
> > > user process information is produced as a default enabled option. The
> > > per user process information may be disabled.
> > >
> > > Slab usage information was recently added and is output only if slab
> > > usage exceeds user memory usage.
> > >
> > > Many OOM events are due to user application memory usage sometimes in
> > > combination with the use of kernel resource usage that exceeds what is
> > > expected memory usage. Detailed information about how memory was being
> > > used when the event occurred may be required to identify the root cause
> > > of the OOM event.
> > >
> > > However, some environments are very large and printing all of the
> > > information about processes, slabs and or vmalloc allocations may
> > > not be feasible. For other environments printing as much information
> > > about these as possible may be needed to root cause OOM events.
> > >
> >
> > For more in-depth analysis of OOM events, people could use kdump to save a
> > vmcore by setting "panic_on_oom", and then use the crash utility to analysis the
> >  vmcore which contains pretty much all the information you need.
> >
> > Certainly, this is the ideal. A full system dump would give you the maximum amount of
> > information.
> >
> > Unfortunately some environments may lack space to store the dump,
>
> Kdump usually also support dumping to a remote target via NFS, SSH etc
>
> > let alone the time to dump the storage contents and restart the system. Some
>
> There is also “makedumpfile” that could compress and filter unwanted memory to reduce
> the vmcore size and speed up the dumping process by utilizing multi-threads.
>
> > systems can take many minutes to fully boot up, to reset and reinitialize all the
> > devices. So unfortunately this is not always an option, and we need an OOM Report.
>
> I am not sure how the system needs some minutes to reboot would be relevant  for the
> discussion here. The idea is to save a vmcore and it can be analyzed offline even on
> another system as long as it having a matching “vmlinux.".
>
>

If taking a dump on an OOM event doesn't reboot the system, and if it
runs fast enough that it doesn't appreciably affect the system's
responsiveness, then it would be an ideal solution. For some it would be
overkill, but since it is an option it is a choice to consider or not.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28  1:13       ` Edward Chron
@ 2019-08-28  1:32         ` Qian Cai
  2019-08-28  2:47           ` Edward Chron
  0 siblings, 1 reply; 46+ messages in thread
From: Qian Cai @ 2019-08-28  1:32 UTC (permalink / raw)
  To: Edward Chron
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande



> On Aug 27, 2019, at 9:13 PM, Edward Chron <echron@arista.com> wrote:
> 
> On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <cai@lca.pw> wrote:
>> 
>> 
>> 
>>> On Aug 27, 2019, at 8:23 PM, Edward Chron <echron@arista.com> wrote:
>>> 
>>> 
>>> 
>>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <cai@lca.pw> wrote:
>>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
>>>> This patch series provides code that works as a debug option through
>>>> debugfs to provide additional controls to limit how much information
>>>> gets printed when an OOM event occurs and or optionally print additional
>>>> information about slab usage, vmalloc allocations, user process memory
>>>> usage, the number of processes / tasks and some summary information
>>>> about these tasks (number runable, i/o wait), system information
>>>> (#CPUs, Kernel Version and other useful state of the system),
>>>> ARP and ND Cache entry information.
>>>> 
>>>> Linux OOM can optionally provide a lot of information, what's missing?
>>>> ----------------------------------------------------------------------
>>>> Linux provides a variety of detailed information when an OOM event occurs
>>>> but has limited options to control how much output is produced. The
>>>> system related information is produced unconditionally and limited per
>>>> user process information is produced as a default enabled option. The
>>>> per user process information may be disabled.
>>>> 
>>>> Slab usage information was recently added and is output only if slab
>>>> usage exceeds user memory usage.
>>>> 
>>>> Many OOM events are due to user application memory usage sometimes in
>>>> combination with the use of kernel resource usage that exceeds what is
>>>> expected memory usage. Detailed information about how memory was being
>>>> used when the event occurred may be required to identify the root cause
>>>> of the OOM event.
>>>> 
>>>> However, some environments are very large and printing all of the
>>>> information about processes, slabs and or vmalloc allocations may
>>>> not be feasible. For other environments printing as much information
>>>> about these as possible may be needed to root cause OOM events.
>>>> 
>>> 
>>> For more in-depth analysis of OOM events, people could use kdump to save a
>>> vmcore by setting "panic_on_oom", and then use the crash utility to analysis the
>>> vmcore which contains pretty much all the information you need.
>>> 
>>> Certainly, this is the ideal. A full system dump would give you the maximum amount of
>>> information.
>>> 
>>> Unfortunately some environments may lack space to store the dump,
>> 
>> Kdump usually also support dumping to a remote target via NFS, SSH etc
>> 
>>> let alone the time to dump the storage contents and restart the system. Some
>> 
>> There is also “makedumpfile” that could compress and filter unwanted memory to reduce
>> the vmcore size and speed up the dumping process by utilizing multi-threads.
>> 
>>> systems can take many minutes to fully boot up, to reset and reinitialize all the
>>> devices. So unfortunately this is not always an option, and we need an OOM Report.
>> 
>> I am not sure how the system needs some minutes to reboot would be relevant  for the
>> discussion here. The idea is to save a vmcore and it can be analyzed offline even on
>> another system as long as it having a matching “vmlinux.".
>> 
>> 
> 
> If selecting a dump on an OOM event doesn't reboot the system and if
> it runs fast enough such
> that it doesn't slow processing enough to appreciably effect the
> system's responsiveness then
> then it would be ideal solution. For some it would be over kill but
> since it is an option it is a
> choice to consider or not.

It sounds like you are looking for more of this,

https://github.com/iovisor/bcc/blob/master/tools/oomkill.py



* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28  1:32         ` Qian Cai
@ 2019-08-28  2:47           ` Edward Chron
  2019-08-28  7:08             ` Michal Hocko
  0 siblings, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-28  2:47 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Tue, Aug 27, 2019 at 6:32 PM Qian Cai <cai@lca.pw> wrote:
>
>
>
> > On Aug 27, 2019, at 9:13 PM, Edward Chron <echron@arista.com> wrote:
> >
> > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <cai@lca.pw> wrote:
> >>
> >>
> >>
> >>> On Aug 27, 2019, at 8:23 PM, Edward Chron <echron@arista.com> wrote:
> >>>
> >>>
> >>>
> >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <cai@lca.pw> wrote:
> >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> >>>> This patch series provides code that works as a debug option through
> >>>> debugfs to provide additional controls to limit how much information
> >>>> gets printed when an OOM event occurs and or optionally print additional
> >>>> information about slab usage, vmalloc allocations, user process memory
> >>>> usage, the number of processes / tasks and some summary information
> >>>> about these tasks (number runable, i/o wait), system information
> >>>> (#CPUs, Kernel Version and other useful state of the system),
> >>>> ARP and ND Cache entry information.
> >>>>
> >>>> Linux OOM can optionally provide a lot of information, what's missing?
> >>>> ----------------------------------------------------------------------
> >>>> Linux provides a variety of detailed information when an OOM event occurs
> >>>> but has limited options to control how much output is produced. The
> >>>> system related information is produced unconditionally and limited per
> >>>> user process information is produced as a default enabled option. The
> >>>> per user process information may be disabled.
> >>>>
> >>>> Slab usage information was recently added and is output only if slab
> >>>> usage exceeds user memory usage.
> >>>>
> >>>> Many OOM events are due to user application memory usage sometimes in
> >>>> combination with the use of kernel resource usage that exceeds what is
> >>>> expected memory usage. Detailed information about how memory was being
> >>>> used when the event occurred may be required to identify the root cause
> >>>> of the OOM event.
> >>>>
> >>>> However, some environments are very large and printing all of the
> >>>> information about processes, slabs and or vmalloc allocations may
> >>>> not be feasible. For other environments printing as much information
> >>>> about these as possible may be needed to root cause OOM events.
> >>>>
> >>>
> >>> For more in-depth analysis of OOM events, people could use kdump to save a
> >>> vmcore by setting "panic_on_oom", and then use the crash utility to analysis the
> >>> vmcore which contains pretty much all the information you need.
> >>>
> >>> Certainly, this is the ideal. A full system dump would give you the maximum amount of
> >>> information.
> >>>
> >>> Unfortunately some environments may lack space to store the dump,
> >>
> >> Kdump usually also support dumping to a remote target via NFS, SSH etc
> >>
> >>> let alone the time to dump the storage contents and restart the system. Some
> >>
> >> There is also “makedumpfile” that could compress and filter unwanted memory to reduce
> >> the vmcore size and speed up the dumping process by utilizing multi-threads.
> >>
> >>> systems can take many minutes to fully boot up, to reset and reinitialize all the
> >>> devices. So unfortunately this is not always an option, and we need an OOM Report.
> >>
> >> I am not sure how the system needs some minutes to reboot would be relevant  for the
> >> discussion here. The idea is to save a vmcore and it can be analyzed offline even on
> >> another system as long as it having a matching “vmlinux.".
> >>
> >>
> >
> > If selecting a dump on an OOM event doesn't reboot the system and if
> > it runs fast enough such
> > that it doesn't slow processing enough to appreciably effect the
> > system's responsiveness then
> > then it would be ideal solution. For some it would be over kill but
> > since it is an option it is a
> > choice to consider or not.
>
> It sounds like you are looking for more of this,

If you want to supplement the OOM Report and keep the information together, then
you could use eBPF to do that. If that really is the preference, it might make
sense to put the entire report in an eBPF script; then you can modify the script
however you choose. That would be very flexible. You can change your
configuration on the fly. As long as it has access to everything you need, it
should work.

Michal would know what direction OOM is headed and if he thinks that fits with
where things are headed.

I'm flexible in the sense that I could change our submission to make specific
updates to the existing OOM code. We kept it as separate as possible for ease
of porting. But if we can build an acceptable case for making updates to the
existing OOM Report code, that works.

Our current implementation has some knobs that allow limited scaling, which has
advantages over print rate limiting. It may allow environments that did not
want to enable printing of processes, slab entries, or vmalloc allocations to
do so without generating a lot of output.

But the existing code could be modified to do the same thing, possibly without
a configuration interface if that is not desirable. It could look at the number
of entries to potentially print, for example, and if the number is small it
could print them all, or scale selection based on a default memory usage. Do you
really care about slab or vmalloc entries using 1 MB or less of memory on a
256 GB system, for example? Probably not. Our approach lets you size this
and has a default that may be reasonable for many environments. But it allows
you to configure things, which adds some complexity.
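
A minimal sketch of the size-based selection described above, for
illustration only. The function name, the (name, bytes_used) entry layout,
the 1 MiB default and the "print everything if the table is small" cutoff
are all assumptions made for this example; none of them come from the patch
series or from the existing OOM code.

```python
MIB = 1024 * 1024

def select_entries(entries, min_bytes=MIB, print_all_upto=8):
    """Choose which (name, bytes_used) entries are worth printing.

    Hypothetical policy: a small table is printed in full, a large
    table is filtered to entries using more than min_bytes.
    """
    if len(entries) <= print_all_upto:
        chosen = list(entries)
    else:
        chosen = [e for e in entries if e[1] > min_bytes]
    # Largest consumers first, so a truncated report keeps the important lines.
    return sorted(chosen, key=lambda e: e[1], reverse=True)
```

With a cutoff like this, a 512-byte slab cache on a large system is simply
skipped once the table grows past the small-table limit, while the biggest
consumers are still reported first.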

Now you could in theory produce the entire OOM Report plus anything we've
proposed with an eBPF script. We haven't done it but assume it works with 5.3.
The problem with any type of plugin and/or configurable option is testing, as
Michal mentions, and the fact that it may or may not be present.

For production systems, installing and updating eBPF scripts may someday
be very common, but I wonder how data center managers feel about it now?
Developers are very excited about it and it is a very powerful tool, but can I
get permission to add or replace an existing eBPF script on production systems?
If there is reluctance for security, reliability, or any other reason, then I
would rather have the code in the kernel so I know it is there and is tested.
Likewise, I would prefer not to have the config options, for the reasons Michal
cites, but I'll take that if that is the best I can get.

Will be interested to hear what Michal advises.

>
> https://github.com/iovisor/bcc/blob/master/tools/oomkill.py
>


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28  1:07   ` Edward Chron
@ 2019-08-28  6:59     ` Michal Hocko
       [not found]       ` <CAM3twVR_OLffQ1U-SgQOdHxuByLNL5sicfnObimpGpPQ1tJ0FQ@mail.gmail.com>
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2019-08-28  6:59 UTC (permalink / raw)
  To: Edward Chron
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Tue 27-08-19 18:07:54, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 26-08-19 12:36:28, Edward Chron wrote:
> > [...]
> > > Extensibility using OOM debug options
> > > -------------------------------------
> > > What is needed is an extensible system to optionally configure
> > > debug options as needed and to then dynamically enable and disable
> > > them. Also for options that produce multiple lines of entry based
> > > output, to configure which entries to print based on how much
> > > memory they use (or optionally all the entries).
> >
> > With a patch this large and adding a lot of new stuff we need a more
> > detailed usecases described I believe.
> 
> I guess it would make sense to explain motivation for each OOM Debug
> option I've sent separately.
> I see there comments on the patches I will try and add more information there.
> 
> An overview would be that we've been collecting information on OOM's
> over the last 12 years or so.
> These are from switches, other embedded devices, servers both large and small.
> We ask for feedback on what information was helpful or could be helpful.
> We try and add it to make root causing issues easier.
> 
> These OOM debug options are some of the options we've created.
> I didn't port all of them to 5.3 but these are representative.
> Our latest kernel is a bit behind 5.3.
> 
> >
> >
> > [...]
> >
> > > Use of debugfs to allow dynamic controls
> > > ----------------------------------------
> > > By providing a debugfs interface that allows options to be configured,
> > > enabled and where appropriate to set a minimum size for selecting
> > > entries to print, the output produced when an OOM event occurs can be
> > > dynamically adjusted to produce as little or as much detail as needed
> > > for a given system.
> >
> > Who is going to consume this information and why would that consumer be
> > unreasonable to demand further maintenance of that information in future
> > releases? In other words debugfs is not considered a stable API which is
> > OK here but the side effect of any change to these files results in user
> > visible behavior and we consider that more or less a stable as long as
> > there are consumers.
> >
> > > OOM debug options can be added to the base code as needed.
> > >
> > > Currently we have the following OOM debug options defined:
> > >
> > > * System State Summary
> > >   --------------------
> > >   One line of output that includes:
> > >   - Uptime (days, hour, minutes, seconds)
> >
> > We do have timestamps in the log so why is this needed?
> 
> 
> Here is how an OOM report looks when we get it to look at:
> 
> Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
> gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
> oom_score_adj=1000
> Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
> tainted 5.3.0-rc6+ #33
> Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
> IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018
> 
> This shows the date and time, not time of the last boot. The
> /var/log/messages output is what we often have to look at not raw
> dmesgs.

This looks more like a logging configuration issue than a kernel
problem. The kernel does provide timestamps for logs. E.g.
$ tail -n1 /var/log/kern.log
Aug 28 08:27:46 tiehlicka kernel: <1054>[336340.954345] systemd-udevd[7971]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
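
As a rough illustration of that point, the bracketed timestamp in such a
line can be turned into an approximate uptime. This is a sketch only: the
helper name is hypothetical, and it assumes the bracketed value is seconds
since boot (it may not account for time spent suspended).

```python
import re

def uptime_from_log_line(line):
    """Return (days, hours, minutes, seconds) from a "[secs.usecs]" stamp.

    Returns None if the line carries no such timestamp.
    """
    # Require a decimal point so "[7971]" style PID tags are not matched.
    m = re.search(r"\[\s*(\d+)\.\d+\]", line)
    if m is None:
        return None
    total = int(m.group(1))
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds
```

For the line above, 336340 seconds works out to 3 days, 21 hours,
25 minutes and 40 seconds.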

[...]
> > >   Example output when configured and enabled:
> > >
> > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> > >
> > > * ARP Table and/or Neighbour Discovery Table Summary
> > >   --------------------------------------------------
> > >   One line of output each for ARP and ND that includes:
> > >   - Table name
> > >   - Table size (max # entries)
> > >   - Key Length
> > >   - Entry Size
> > >   - Number of Entries
> > >   - Last Flush (in seconds)
> > >   - hash grows
> > >   - entry allocations
> > >   - entry destroys
> > >   - Number lookups
> > >   - Number of lookup hits
> > >   - Resolution failures
> > >   - Garbage Collection Forced Runs
> > >   - Table Full
> > >   - Proxy Queue Length
> > >
> > >   Example output when configured and enabled (for both):
> > >
> > > ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0
> > >
> > > ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0
> >
> > Again, why is this needed particularly for the OOM event? I do
> > understand this might be useful system health diagnostic information but
> > how does this contribute to the OOM?
> >
> 
> It is example of some system table information we print.
> Other adjustable table information may be useful as well.
> These table sizes are often adjustable and collecting stats on usage
> helps determine if settings are appropriate.
> The value during OOM events is very useful as usage varies.
> We also collect the same stats like this from user code periodically
> and can compare these.

I suspect that this is a very narrow usecase and there are more like
that, and I can imagine somebody with a different workload could come up
with yet another set of useful information to print. The more I think about
these additional modules, the more I am convinced that this "plugin"
architecture is the wrong approach. Why? Mostly because the code maintenance
burden is unlikely to be worth it for all the niche usecases. This all has to
be more dynamic and ideally scriptable, so that the code in the kernel just
provides the basic information and everybody can hook in there and
dump whatever additional information is needed. Sounds like something
that eBPF could fit, no? Have you considered that?

[...]

Skipping over a lot of useful stuff. I can reassure you that my experience
with OOM debugging has been a real pain at times as well (e.g. when there is
simply no way to find out who has eaten all the memory because it is not
accounted anywhere), and I completely understand where you are
coming from. There is definitely room for improvement; we just have to
find a way to get there.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28  2:47           ` Edward Chron
@ 2019-08-28  7:08             ` Michal Hocko
  2019-08-28 10:12               ` Tetsuo Handa
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2019-08-28  7:08 UTC (permalink / raw)
  To: Edward Chron
  Cc: Qian Cai, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Tue 27-08-19 19:47:22, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 6:32 PM Qian Cai <cai@lca.pw> wrote:
> >
> >
> >
> > > On Aug 27, 2019, at 9:13 PM, Edward Chron <echron@arista.com> wrote:
> > >
> > > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <cai@lca.pw> wrote:
> > >>
> > >>
> > >>
> > >>> On Aug 27, 2019, at 8:23 PM, Edward Chron <echron@arista.com> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <cai@lca.pw> wrote:
> > >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > >>>> This patch series provides code that works as a debug option through
> > >>>> debugfs to provide additional controls to limit how much information
> > >>>> gets printed when an OOM event occurs and or optionally print additional
> > >>>> information about slab usage, vmalloc allocations, user process memory
> > >>>> usage, the number of processes / tasks and some summary information
> > >>>> about these tasks (number runable, i/o wait), system information
> > >>>> (#CPUs, Kernel Version and other useful state of the system),
> > >>>> ARP and ND Cache entry information.
> > >>>>
> > >>>> Linux OOM can optionally provide a lot of information, what's missing?
> > >>>> ----------------------------------------------------------------------
> > >>>> Linux provides a variety of detailed information when an OOM event occurs
> > >>>> but has limited options to control how much output is produced. The
> > >>>> system related information is produced unconditionally and limited per
> > >>>> user process information is produced as a default enabled option. The
> > >>>> per user process information may be disabled.
> > >>>>
> > >>>> Slab usage information was recently added and is output only if slab
> > >>>> usage exceeds user memory usage.
> > >>>>
> > >>>> Many OOM events are due to user application memory usage sometimes in
> > >>>> combination with the use of kernel resource usage that exceeds what is
> > >>>> expected memory usage. Detailed information about how memory was being
> > >>>> used when the event occurred may be required to identify the root cause
> > >>>> of the OOM event.
> > >>>>
> > >>>> However, some environments are very large and printing all of the
> > >>>> information about processes, slabs and or vmalloc allocations may
> > >>>> not be feasible. For other environments printing as much information
> > >>>> about these as possible may be needed to root cause OOM events.
> > >>>>
> > >>>
> > >>> For more in-depth analysis of OOM events, people could use kdump to save a
> > >>> vmcore by setting "panic_on_oom", and then use the crash utility to analysis the
> > >>> vmcore which contains pretty much all the information you need.
> > >>>
> > >>> Certainly, this is the ideal. A full system dump would give you the maximum amount of
> > >>> information.
> > >>>
> > >>> Unfortunately some environments may lack space to store the dump,
> > >>
> > >> Kdump usually also support dumping to a remote target via NFS, SSH etc
> > >>
> > >>> let alone the time to dump the storage contents and restart the system. Some
> > >>
> > >> There is also “makedumpfile” that could compress and filter unwanted memory to reduce
> > >> the vmcore size and speed up the dumping process by utilizing multi-threads.
> > >>
> > >>> systems can take many minutes to fully boot up, to reset and reinitialize all the
> > >>> devices. So unfortunately this is not always an option, and we need an OOM Report.
> > >>
> > >> I am not sure how the system needs some minutes to reboot would be relevant  for the
> > >> discussion here. The idea is to save a vmcore and it can be analyzed offline even on
> > >> another system as long as it having a matching “vmlinux.".
> > >>
> > >>
> > >
> > > If selecting a dump on an OOM event doesn't reboot the system and if
> > > it runs fast enough such
> > > that it doesn't slow processing enough to appreciably effect the
> > > system's responsiveness then
> > > then it would be ideal solution. For some it would be over kill but
> > > since it is an option it is a
> > > choice to consider or not.
> >
> > It sounds like you are looking for more of this,
> 
> If you want to supplement the OOM Report and keep the information
> together than you could use EBPF to do that. If that really is the
> preference it might make sense to put the entire report as an EBPF
> script than you can modify the script however you choose. That would
> be very flexible. You can change your configuration on the fly. As
> long as it has access to everything you need it should work.
> 
> Michal would know what direction OOM is headed and if he thinks that fits with
> where things are headed.

It seems we have landed on similar thinking here. As mentioned in my
earlier email in this thread, I can see the extensibility being achieved
by eBPF. Essentially we would have a base form of the oom report like
now, and scripts would then hook in there to provide whatever a specific
usecase needs. My practical experience with eBPF is close to zero, so I
have no idea how that would actually work out though.

[...]
> For production systems installing and updating EBPF scripts may someday
> be very common, but I wonder how data center managers feel about it now?
> Developers are very excited about it and it is a very powerful tool but can I
> get permission to add or replace an existing EBPF on production systems?

I am not sure I understand. There must be somebody trusted to take care
of systems, right?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28  7:08             ` Michal Hocko
@ 2019-08-28 10:12               ` Tetsuo Handa
  2019-08-28 10:32                 ` Michal Hocko
  2019-08-28 20:04                 ` Edward Chron
  0 siblings, 2 replies; 46+ messages in thread
From: Tetsuo Handa @ 2019-08-28 10:12 UTC (permalink / raw)
  To: Michal Hocko, Edward Chron
  Cc: Qian Cai, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On 2019/08/28 16:08, Michal Hocko wrote:
> On Tue 27-08-19 19:47:22, Edward Chron wrote:
>> For production systems installing and updating EBPF scripts may someday
>> be very common, but I wonder how data center managers feel about it now?
>> Developers are very excited about it and it is a very powerful tool but can I
>> get permission to add or replace an existing EBPF on production systems?
> 
> I am not sure I understand. There must be somebody trusted to take care
> of systems, right?
> 

Speaking of my cases, those who take care of their systems are not developers.
And they are afraid of changing code that runs in kernel mode. They are unlikely
to give permission to install SystemTap/eBPF scripts. As a result, in many cases,
the root cause cannot be identified.

Moreover, we are talking about OOM situations, where we can't expect userspace
processes to work properly. We need to dump information we want, without
counting on userspace processes, before sending SIGKILL.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 10:12               ` Tetsuo Handa
@ 2019-08-28 10:32                 ` Michal Hocko
  2019-08-28 10:56                   ` Tetsuo Handa
  2019-08-28 20:04                 ` Edward Chron
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2019-08-28 10:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Edward Chron, Qian Cai, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Wed 28-08-19 19:12:41, Tetsuo Handa wrote:
> On 2019/08/28 16:08, Michal Hocko wrote:
> > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> >> For production systems installing and updating EBPF scripts may someday
> >> be very common, but I wonder how data center managers feel about it now?
> >> Developers are very excited about it and it is a very powerful tool but can I
> >> get permission to add or replace an existing EBPF on production systems?
> > 
> > I am not sure I understand. There must be somebody trusted to take care
> > of systems, right?
> > 
> 
> Speak of my cases, those who take care of their systems are not developers.
> And they afraid changing code that runs in kernel mode. They unlikely give
> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> the root cause cannot be identified.

Which is something I would call a process problem more than a kernel
one. Really if you need to debug a problem you really have to trust
those who can debug that for you. We are not going to take tons of code
to the kernel just because somebody is afraid to run a diagnostic.

> Moreover, we are talking about OOM situations, where we can't expect userspace
> processes to work properly. We need to dump information we want, without
> counting on userspace processes, before sending SIGKILL.

Yes, this is an inherent assumption I was making, and that means that
whatever dynamic hooks are used would have to be registered in advance.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 10:32                 ` Michal Hocko
@ 2019-08-28 10:56                   ` Tetsuo Handa
  2019-08-28 11:12                     ` Michal Hocko
  0 siblings, 1 reply; 46+ messages in thread
From: Tetsuo Handa @ 2019-08-28 10:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Edward Chron, Qian Cai, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On 2019/08/28 19:32, Michal Hocko wrote:
>> Speak of my cases, those who take care of their systems are not developers.
>> And they afraid changing code that runs in kernel mode. They unlikely give
>> permission to install SystemTap/eBPF scripts. As a result, in many cases,
>> the root cause cannot be identified.
> 
> Which is something I would call a process problem more than a kernel
> one. Really if you need to debug a problem you really have to trust
> those who can debug that for you. We are not going to take tons of code
> to the kernel just because somebody is afraid to run a diagnostic.
> 

This is a problem of the kernel development process.

>> Moreover, we are talking about OOM situations, where we can't expect userspace
>> processes to work properly. We need to dump information we want, without
>> counting on userspace processes, before sending SIGKILL.
> 
> Yes, this is an inherent assumption I was making and that means that
> whatever dynamic hooks would have to be registered in advance.
> 

No. I'm saying that neither static hooks nor dynamic hooks can work as
expected if they count on userspace processes. Registering in advance is
irrelevant. Whether it can work without userspace processes is relevant.

Also, out-of-tree code tends to become non-functional. We are trying to debug
problems caused by in-tree code. Breaking out-of-tree debugging code just
because in-tree code developers don't want to bear the burden of maintaining
code for debugging problems caused by in-tree code is a very bad idea.



* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 10:56                   ` Tetsuo Handa
@ 2019-08-28 11:12                     ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2019-08-28 11:12 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Edward Chron, Qian Cai, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Wed 28-08-19 19:56:58, Tetsuo Handa wrote:
> On 2019/08/28 19:32, Michal Hocko wrote:
> >> Speak of my cases, those who take care of their systems are not developers.
> >> And they afraid changing code that runs in kernel mode. They unlikely give
> >> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> >> the root cause cannot be identified.
> > 
> > Which is something I would call a process problem more than a kernel
> > one. Really if you need to debug a problem you really have to trust
> > those who can debug that for you. We are not going to take tons of code
> > to the kernel just because somebody is afraid to run a diagnostic.
> > 
> 
> This is a problem of kernel development process.

I disagree. Expecting that any larger project can be filled with
(close to) _full_ and ready to use built-in introspection is just
insane. We are trying to help with generally useful information, but
you simply cannot cover most existing failure paths.

> >> Moreover, we are talking about OOM situations, where we can't expect userspace
> >> processes to work properly. We need to dump information we want, without
> >> counting on userspace processes, before sending SIGKILL.
> > 
> > Yes, this is an inherent assumption I was making and that means that
> > whatever dynamic hooks would have to be registered in advance.
> > 
> 
> No. I'm saying that neither static hooks nor dynamic hooks can work as
> expected if they count on userspace processes. Registering in advance is
> irrelevant. Whether it can work without userspace processes is relevant.

I am not saying otherwise. I do not expect any userspace process to dump
any information or read it from anywhere other than the kernel log.

> Also, out-of-tree codes tend to become defunctional. We are trying to debug
> problems caused by in-tree code. Breaking out-of-tree debugging code just
> because in-tree code developers don't want to pay the burden of maintaining
> code for debugging problems caused by in-tree code is a very bad idea.

This is simple cost/benefit math. The maintenance cost is not free,
and paying it for odd cases most people do not care about is simply not
sustainable; we simply do not have that much manpower.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 10:12               ` Tetsuo Handa
  2019-08-28 10:32                 ` Michal Hocko
@ 2019-08-28 20:04                 ` Edward Chron
  2019-08-29  3:31                   ` Edward Chron
  1 sibling, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-28 20:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Qian Cai, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2019/08/28 16:08, Michal Hocko wrote:
> > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> >> For production systems installing and updating EBPF scripts may someday
> >> be very common, but I wonder how data center managers feel about it now?
> >> Developers are very excited about it and it is a very powerful tool but can I
> >> get permission to add or replace an existing EBPF on production systems?
> >
> > I am not sure I understand. There must be somebody trusted to take care
> > of systems, right?
> >
>
> Speak of my cases, those who take care of their systems are not developers.
> And they afraid changing code that runs in kernel mode. They unlikely give
> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> the root cause cannot be identified.

+1. Exactly. The only thing we could think of, Tetsuo, is that if Linux OOM
reporting uses an eBPF script, then systems have to load it to get any kind of
meaningful report. Frankly, if using eBPF is the route to go, then essentially
the whole of OOM reporting should go there. We can adjust as we need and
have a precedent for wanting to load the script. That's the best we could come
up with.

>
> Moreover, we are talking about OOM situations, where we can't expect userspace
> processes to work properly. We need to dump information we want, without
> counting on userspace processes, before sending SIGKILL.

+1. We've tried, and as you point out, for best results the kernel has to
provide the state.

Again a full system dump would be wonderful, but taking a full dump for
every OOM event on production systems? I am not nearly a good enough salesman
to sell that one. So we need an alternate mechanism.

If we can't agree on some sort of extensible, configurable approach then put
the standard OOM Report in eBPF and make it mandatory to load it so we can
justify having to do that. Linux should load it automatically.
We'll just make a few changes and additions as needed.

Sounds like a plan that we could live with.
Would be interested if this works for others as well.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
       [not found]       ` <CAM3twVR_OLffQ1U-SgQOdHxuByLNL5sicfnObimpGpPQ1tJ0FQ@mail.gmail.com>
@ 2019-08-28 20:18         ` Qian Cai
  2019-08-28 21:17           ` Edward Chron
  2019-08-29  7:11         ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Qian Cai @ 2019-08-28 20:18 UTC (permalink / raw)
  To: Edward Chron, Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> But with the caveat that running a eBPF script that it isn't standard Linux
> operating procedure, at this point in time any way will not be well
> received in the data center.

Can't you get your eBPF scripts into the BCC project? As far as I can tell, BCC
has been included in several distros already, and then it will become a part of
the standard Linux toolkit.

> 
> Our belief is if you really think eBPF is the preferred mechanism
> then move OOM reporting to an eBPF. 
> I mentioned this before but I will reiterate this here.

On the other hand, it seems many people are happy with the simple kernel OOM
report we have here. I am not saying the current situation is perfect. On top of
that, some people are using kdump, and some people have resource monitoring to
warn about potential memory overcommits before OOM kicks in, etc.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 20:18         ` Qian Cai
@ 2019-08-28 21:17           ` Edward Chron
  2019-08-28 21:34             ` Qian Cai
  0 siblings, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-28 21:17 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Wed, Aug 28, 2019 at 1:18 PM Qian Cai <cai@lca.pw> wrote:
>
> On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> > But with the caveat that running a eBPF script that it isn't standard Linux
> > operating procedure, at this point in time any way will not be well
> > received in the data center.
>
> Can't you get your eBPF scripts into the BCC project? As far I can tell, the BCC
> has been included in several distros already, and then it will become a part of
> standard linux toolkits.
>
> >
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
> > I mentioned this before but I will reiterate this here.
>
> On the other hand, it seems many people are happy with the simple kernel OOM
> report we have here. Not saying the current situation is perfect. On the top of
> that, some people are using kdump, and some people have resource monitoring to
> warn about potential memory overcommits before OOM kicks in etc.

Assuming you can implement your existing report in eBPF then those who like the
current output would still get the current output. Same with the patches we sent
upstream, nothing in the report changes by default. So no problems for those who
are happy, they'll still be happy.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 21:17           ` Edward Chron
@ 2019-08-28 21:34             ` Qian Cai
  0 siblings, 0 replies; 46+ messages in thread
From: Qian Cai @ 2019-08-28 21:34 UTC (permalink / raw)
  To: Edward Chron
  Cc: Michal Hocko, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Tetsuo Handa, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Wed, 2019-08-28 at 14:17 -0700, Edward Chron wrote:
> On Wed, Aug 28, 2019 at 1:18 PM Qian Cai <cai@lca.pw> wrote:
> > 
> > On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> > > But with the caveat that running a eBPF script that it isn't standard
> > > Linux
> > > operating procedure, at this point in time any way will not be well
> > > received in the data center.
> > 
> > Can't you get your eBPF scripts into the BCC project? As far I can tell, the
> > BCC
> > has been included in several distros already, and then it will become a part
> > of
> > standard linux toolkits.
> > 
> > > 
> > > Our belief is if you really think eBPF is the preferred mechanism
> > > then move OOM reporting to an eBPF.
> > > I mentioned this before but I will reiterate this here.
> > 
> > On the other hand, it seems many people are happy with the simple kernel OOM
> > report we have here. Not saying the current situation is perfect. On the top
> > of
> > that, some people are using kdump, and some people have resource monitoring
> > to
> > warn about potential memory overcommits before OOM kicks in etc.
> 
> Assuming you can implement your existing report in eBPF then those who like
> the
> current output would still get the current output. Same with the patches we
> sent
> upstream, nothing in the report changes by default. So no problems for those
> who
> are happy, they'll still be happy.

I don't think it makes any sense to rewrite the existing code to depend on eBPF
though.



* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-28 20:04                 ` Edward Chron
@ 2019-08-29  3:31                   ` Edward Chron
  0 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-29  3:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Qian Cai, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Wed, Aug 28, 2019 at 1:04 PM Edward Chron <echron@arista.com> wrote:
>
> On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > On 2019/08/28 16:08, Michal Hocko wrote:
> > > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> > >> For production systems installing and updating EBPF scripts may someday
> > >> be very common, but I wonder how data center managers feel about it now?
> > >> Developers are very excited about it and it is a very powerful tool but can I
> > >> get permission to add or replace an existing EBPF on production systems?
> > >
> > > I am not sure I understand. There must be somebody trusted to take care
> > > of systems, right?
> > >
> >
> > Speak of my cases, those who take care of their systems are not developers.
> > And they afraid changing code that runs in kernel mode. They unlikely give
> > permission to install SystemTap/eBPF scripts. As a result, in many cases,
> > the root cause cannot be identified.
>
> +1. Exactly. The only thing we could think of Tetsuo is if Linux OOM Reporting
> uses a an eBPF script then systems have to load them to get any kind of
> meaningful report. Frankly, if using eBPF is the route to go than essentially
> the whole OOM reporting should go there. We can adjust as we need and
> have precedent for wanting to load the script. That's the best we could come
> up with.
>
> >
> > Moreover, we are talking about OOM situations, where we can't expect userspace
> > processes to work properly. We need to dump information we want, without
> > counting on userspace processes, before sending SIGKILL.
>
> +1. We've tried and as you point out and for best results the kernel
> has to provide
>  the state.
>
> Again a full system dump would be wonderful, but taking a full dump for
> every OOM event on production systems? I am not nearly a good enough salesman
> to sell that one. So we need an alternate mechanism.
>
> If we can't agree on some sort of extensible, configurable approach then put
> the standard OOM Report in eBPF and make it mandatory to load it so we can
> justify having to do that. Linux should load it automatically.
> We'll just make a few changes and additions as needed.
>
> Sounds like a plan that we could live with.
> Would be interested if this works for others as well.

One further comment. In talking with my colleagues here who know eBPF much
better than I do, it may not be possible to implement something this
complicated with eBPF.

If that is in fact the case, then we'd have to try to hook the OOM reporting
code with tracepoints, similar to kprobes, only we want to do more than add
counters: we want to change the flow to skip small output entries that aren't
worth printing. If this isn't feasible with eBPF, then some derivative of our
approach, or enhancing the OOM output code directly, seems like the best
option. Will have to investigate this further.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
       [not found]       ` <CAM3twVR_OLffQ1U-SgQOdHxuByLNL5sicfnObimpGpPQ1tJ0FQ@mail.gmail.com>
  2019-08-28 20:18         ` Qian Cai
@ 2019-08-29  7:11         ` Michal Hocko
  2019-08-29 10:14           ` Tetsuo Handa
  2019-08-29 15:20           ` Edward Chron
  1 sibling, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2019-08-29  7:11 UTC (permalink / raw)
  To: Edward Chron
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Wed 28-08-19 12:46:20, Edward Chron wrote:
[...]
> Our belief is if you really think eBPF is the preferred mechanism
> then move OOM reporting to an eBPF.

I've said that all this additional information has to be dynamically
extensible rather than a part of the core kernel. Whether eBPF is the
suitable tool, I do not know. I haven't explored that. There are other
ways to inject code into the kernel: systemtap/kprobes, kernel modules, and
probably others.

> I mentioned this before but I will reiterate this here.
> 
> So how do we get there? Let's look at the existing report which we know
> has issues.
> 
> Other than a few essential OOM messages the OOM code should produce,
> such as the Killed process message message sequence being included,
> you could have the entire OOM report moved to an eBPF script and
> therefore make it customizable, configurable or if you prefer programmable.

I believe we should keep the current reporting in place and allow
additional information via a dynamic mechanism, be it a registration
mechanism that modules can hook into or some other, more dynamic way.
The current reporting has proven to be useful in many typical oom
situations in my past years of experience. It gives the rough state of
the failing allocation, the MM subsystem, the tasks that are eligible and
the task that is killed, so that you can understand why the event happened.

I would argue that the eligible tasks should be printed on an opt-in
basis because this is more of a relic from the past, when the victim
selection was less deterministic. But that is another story.

All the rest of dump_header should stay IMHO as a reasonable default and
bare minimum.

> Why? Because as we all agree, you'll never have a perfect OOM Report.
> So if you believe this, than if you will, put your money where your mouth
> is (so to speak) and make the entire OOM Report and eBPF script.
> We'd be willing to help with this.
> 
> I'll give specific reasons why you want to do this.
> 
>    - Don't want to maintain a lot of code in the kernel (eBPF code doesn't
>    count).
>    - Can't produce an ideal OOM report.
>    - Don't like configuring things but favor programmatic solutions.
>    - Agree the existing OOM report doesn't work for all environments.
>    - Want to allow flexibility but can't support everything people might
>    want.
>    - Then installing an eBPF for OOM Reporting isn't an option, it's
>    required.

This is going to an extreme. We cannot serve all cases, but that is
true for any other heuristics/reporting in the kernel. We do care about
most of them.

> The last reason is huge for people who live in a world with large data
> centers. Data center managers are very conservative. They don't want to
> deviate from standard operating procedure unless absolutely necessary.
> If loading an OOM Report eBPF is standard to get OOM Reporting output,
> then they'll accept that.

I have already responded to this kind of argumentation elsewhere. This
is not a relevant argument for any kernel implementation. This is a
matter of data center management process.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29  7:11         ` Michal Hocko
@ 2019-08-29 10:14           ` Tetsuo Handa
  2019-08-29 11:56             ` Michal Hocko
  2019-08-29 15:20           ` Edward Chron
  1 sibling, 1 reply; 46+ messages in thread
From: Tetsuo Handa @ 2019-08-29 10:14 UTC (permalink / raw)
  To: Michal Hocko, Edward Chron
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Shakeel Butt, linux-mm, linux-kernel, Ivan Delalande

On 2019/08/29 16:11, Michal Hocko wrote:
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
>> Our belief is if you really think eBPF is the preferred mechanism
>> then move OOM reporting to an eBPF.
> 
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> probably others.

As for SystemTap, guru mode (an expert mode which disables the protections
provided by SystemTap, allowing the kernel to crash when something goes wrong)
could be used for holding a spinlock. However, as far as I know, holding a mutex
(or doing any operation that might sleep) from such dynamic hooks is not allowed.
Also, we would need to export various symbols in order to allow access from such
dynamic hooks.

I'm not familiar with eBPF, but I guess that eBPF is similar.

But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
SystemTap will be suitable for dumping OOM information. An OOM situation means
that even a single page fault event cannot complete, and temporary memory
allocation for reading from the kernel or writing to files cannot complete.

Therefore, we will need to hold all information in kernel memory (without
allocating any memory when the OOM event happens). Dynamic hooks could hold
a few lines of output, but not all the lines we want. The only possible buffer
which is preallocated and large enough would be printk()'s buffer. Thus,
I believe that we will have to use printk() in order to dump OOM information.
At that point,

  static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;

  bool out_of_memory(struct oom_control *oc)
  {
          return oom_handler(oc);
  }

and letting in-tree kernel modules override the current OOM killer would be
the only practical choice (if we refuse to add many knobs).
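
As a rough illustration of the override idea above, here is a userspace
sketch, not kernel code. The mock struct oom_control fields, the
set_oom_handler() helper and verbose_oom_killer() are assumptions made up
for this example; a real in-kernel version would also need proper
synchronization around the pointer update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Mock stand-in for the kernel's struct oom_control (fields are illustrative). */
struct oom_control {
    unsigned long totalpages;
    int order;
};

typedef bool (*oom_handler_fn)(struct oom_control *oc);

/* Stand-in for the existing OOM killer path. */
static bool default_oom_killer(struct oom_control *oc)
{
    printf("default: order=%d totalpages=%lu\n", oc->order, oc->totalpages);
    return true;
}

static oom_handler_fn oom_handler = default_oom_killer;

/* Hypothetical helper: install an override, or restore the default with NULL. */
static void set_oom_handler(oom_handler_fn fn)
{
    oom_handler = fn ? fn : default_oom_killer;
}

static bool out_of_memory(struct oom_control *oc)
{
    assert(oc != NULL);
    return oom_handler(oc);
}

/* Example override: emit extra output, then defer to the default path. */
static int custom_calls;

static bool verbose_oom_killer(struct oom_control *oc)
{
    custom_calls++;
    printf("custom: extra report before the default handler\n");
    return default_oom_killer(oc);
}
```

A caller would use set_oom_handler(verbose_oom_killer) to install the
override and set_oom_handler(NULL) to go back to the default.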



* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 10:14           ` Tetsuo Handa
@ 2019-08-29 11:56             ` Michal Hocko
  2019-08-29 14:09               ` Tetsuo Handa
  2019-08-29 15:03               ` Edward Chron
  0 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2019-08-29 11:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Edward Chron, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> On 2019/08/29 16:11, Michal Hocko wrote:
> > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> >> Our belief is if you really think eBPF is the preferred mechanism
> >> then move OOM reporting to an eBPF.
> > 
> > I've said that all this additional information has to be dynamically
> > extensible rather than a part of the core kernel. Whether eBPF is the
> > suitable tool, I do not know. I haven't explored that. There are other
> > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > probably others.
> 
> As for SystemTap, guru mode (an expert mode which disables protection provided
> by SystemTap; allowing kernel to crash when something went wrong) could be used
> for holding spinlock. However, as far as I know, holding mutex (or doing any
> operation that might sleep) from such dynamic hooks is not allowed. Also we will
> need to export various symbols in order to allow access from such dynamic hooks.

This is the oom path and it had better not use any sleeping locks in
the first place.

> I'm not familiar with eBPF, but I guess that eBPF is similar.
> 
> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> SystemTap will be suitable for dumping OOM information. OOM situation means
> that even single page fault event cannot complete, and temporary memory
> allocation for reading from kernel or writing to files cannot complete.

And I repeat that no such reporting is going to write to files. This is
an OOM path after all.

> Therefore, we will need to hold all information in kernel memory (without
> allocating any memory when OOM event happened). Dynamic hooks could hold
> a few lines of output, but not all lines we want. The only possible buffer
> which is preallocated and large enough would be printk()'s buffer. Thus,
> I believe that we will have to use printk() in order to dump OOM information.
> At that point,

Yes, this is what I've had in mind.

> 
>   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> 
>   bool out_of_memory(struct oom_control *oc)
>   {
>           return oom_handler(oc);
>   }
> 
> and let in-tree kernel modules override current OOM killer would be
> the only practical choice (if we refuse adding many knobs).

Or simply provide a hook, passed the oom_control, that is called to report
without replacing the whole oom killer behavior. Replacing it is not necessary.
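
A minimal userspace sketch of such a report-only hook, under stated
assumptions: register_oom_report_hook(), the fixed-size hook table and the
mock struct oom_control are hypothetical names for illustration, and
dump_header() here is only a stand-in for the real function. The table is
statically sized so that nothing is allocated at report time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Mock stand-in for the kernel's struct oom_control (fields are illustrative). */
struct oom_control {
    unsigned long totalpages;
    int order;
};

typedef void (*oom_report_fn)(const struct oom_control *oc);

#define MAX_OOM_REPORT_HOOKS 4

/* Statically sized table: no allocation is needed while reporting. */
static oom_report_fn report_hooks[MAX_OOM_REPORT_HOOKS];
static size_t nr_report_hooks;

/* Hypothetical registration call; returns 0 on success, -1 if the table is full. */
static int register_oom_report_hook(oom_report_fn fn)
{
    assert(fn != NULL);
    if (nr_report_hooks == MAX_OOM_REPORT_HOOKS)
        return -1;
    report_hooks[nr_report_hooks++] = fn;
    return 0;
}

/* Stand-in for dump_header(): built-in report first, then the extra hooks. */
static void dump_header(const struct oom_control *oc)
{
    size_t i;

    printf("oom: order=%d totalpages=%lu\n", oc->order, oc->totalpages);
    for (i = 0; i < nr_report_hooks; i++)
        report_hooks[i](oc);
}

/* Example hook that adds one line of output and counts its invocations. */
static int hook_calls;

static void count_hook(const struct oom_control *oc)
{
    hook_calls++;
    printf("extra: order=%d\n", oc->order);
}
```

The built-in report is always printed; hooks only append to it, so the
victim selection and kill logic stay untouched.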
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 11:56             ` Michal Hocko
@ 2019-08-29 14:09               ` Tetsuo Handa
  2019-08-29 15:48                 ` Edward Chron
  2019-08-29 15:03               ` Edward Chron
  1 sibling, 1 reply; 46+ messages in thread
From: Tetsuo Handa @ 2019-08-29 14:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Edward Chron, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On 2019/08/29 20:56, Michal Hocko wrote:
>> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
>> SystemTap will be suitable for dumping OOM information. OOM situation means
>> that even single page fault event cannot complete, and temporary memory
>> allocation for reading from kernel or writing to files cannot complete.
> 
> And I repeat that no such reporting is going to write to files. This is
> an OOM path afterall.

The process that fetches data from e.g. an eBPF event must not incur a page
fault. The front-end for iovisor/bcc is a Python userspace process, and I think
that such a process can't run in an OOM situation.

> 
>> Therefore, we will need to hold all information in kernel memory (without
>> allocating any memory when OOM event happened). Dynamic hooks could hold
>> a few lines of output, but not all lines we want. The only possible buffer
>> which is preallocated and large enough would be printk()'s buffer. Thus,
>> I believe that we will have to use printk() in order to dump OOM information.
>> At that point,
> 
> Yes, this is what I've had in mind.

I probably abbreviated too much.

Dynamic hooks could hold a few lines of output, but dynamic hooks cannot hold
all lines when dump_tasks() reports 32000+ processes. We have to buffer all output
in kernel memory because we can't complete even a page fault event triggered by
the Python process monitoring the eBPF event (and writing the result to some log
file or something) while out_of_memory() is in flight.

And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What I'm
saying is "we won't be able to hold the output from dump_tasks() if that output
goes to a buffer preallocated for dynamic hooks". We have to find
a way that can handle the worst case.
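
To make the worst case concrete, here is a small userspace sketch of a
preallocated line buffer overflowing. The buffer dimensions, the
hook_emit_task() helper and the fake per-task numbers are all assumptions
for illustration; nothing here reflects the size of any real kernel buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical preallocated buffer for a dynamic hook: 8 lines of 64 bytes. */
#define HOOK_BUF_LINES 8
#define HOOK_LINE_LEN 64

static char hook_buf[HOOK_BUF_LINES][HOOK_LINE_LEN];
static size_t hook_used;    /* lines stored so far */
static size_t hook_dropped; /* lines lost because the buffer was full */

/* Store one task line; allocating more space is not an option under OOM. */
static void hook_emit_task(int pid, unsigned long rss_pages)
{
    assert(hook_used <= HOOK_BUF_LINES);
    if (hook_used == HOOK_BUF_LINES) {
        hook_dropped++;
        return;
    }
    snprintf(hook_buf[hook_used++], HOOK_LINE_LEN,
             "[%7d] rss:%lu", pid, rss_pages);
}

/* Stand-in for dump_tasks() feeding the hook buffer with fake task data. */
static void dump_tasks_to_hook_buf(size_t nr_tasks)
{
    size_t i;

    for (i = 0; i < nr_tasks; i++)
        hook_emit_task((int)(i + 1), 100UL * (unsigned long)(i + 1));
}
```

With 32000 tasks and room for only 8 lines, all but the first 8 lines are
dropped, which is the situation described above.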


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 11:56             ` Michal Hocko
  2019-08-29 14:09               ` Tetsuo Handa
@ 2019-08-29 15:03               ` Edward Chron
  2019-08-29 15:42                 ` Qian Cai
  2019-08-29 16:17                 ` Michal Hocko
  1 sibling, 2 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-29 15:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > On 2019/08/29 16:11, Michal Hocko wrote:
> > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > >> Our belief is if you really think eBPF is the preferred mechanism
> > >> then move OOM reporting to an eBPF.
> > >
> > > I've said that all this additional information has to be dynamically
> > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > suitable tool, I do not know. I haven't explored that. There are other
> > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > probably others.
> >
> > As for SystemTap, guru mode (an expert mode which disables protection provided
> > by SystemTap; allowing kernel to crash when something went wrong) could be used
> > for holding spinlock. However, as far as I know, holding mutex (or doing any
> > operation that might sleep) from such dynamic hooks is not allowed. Also we will
> > need to export various symbols in order to allow access from such dynamic hooks.
>
> This is the oom path and it should better not use any sleeping locks in
> the first place.
>
> > I'm not familiar with eBPF, but I guess that eBPF is similar.
> >
> > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> > SystemTap will be suitable for dumping OOM information. OOM situation means
> > that even single page fault event cannot complete, and temporary memory
> > allocation for reading from kernel or writing to files cannot complete.
>
> And I repeat that no such reporting is going to write to files. This is
> an OOM path afterall.
>
> > Therefore, we will need to hold all information in kernel memory (without
> > allocating any memory when OOM event happened). Dynamic hooks could hold
> > a few lines of output, but not all lines we want. The only possible buffer
> > which is preallocated and large enough would be printk()'s buffer. Thus,
> > I believe that we will have to use printk() in order to dump OOM information.
> > At that point,
>
> Yes, this is what I've had in mind.
>

+1: It makes sense to keep sending the report to dmesg so that it persists.
That is where it has always gone and there is no reason to change.
You can have several OOMs back to back and you'd like to retain the output.
All the information should be kept together in the OOM report.

> >
> >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> >
> >   bool out_of_memory(struct oom_control *oc)
> >   {
> >           return oom_handler(oc);
> >   }
> >
> > and let in-tree kernel modules override current OOM killer would be
> > the only practical choice (if we refuse adding many knobs).
>
> Or simply provide a hook with the oom_control to be called to report
> without replacing the whole oom killer behavior. That is not necessary.

For a very simple addition, such as adding a line of output, this works.
It would still be nice to address the fact that the existing OOM report prints
either all of the user processes or none. It would be nice to add some control
for that. That's what we did.

> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29  7:11         ` Michal Hocko
  2019-08-29 10:14           ` Tetsuo Handa
@ 2019-08-29 15:20           ` Edward Chron
  1 sibling, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-29 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu, Aug 29, 2019 at 12:11 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
> [...]
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
>
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> probably others.

For simple code injections eBPF or kprobe works and a tracepoint would
help with that. For example we could add our one line of task information
that we find very useful this way.

For adding controls to limit output for processes, slabs and vmalloc entries,
it would be harder to inject code. Our solution was to use debugfs.
An alternative could be to add a simple sysctl if using debugfs is not
appropriate. As our code illustrated, this can be added without changing the
existing report in any substantive way. I think there is value in this, and it
is core to what the OOM report should provide. Additional items may be add-ons
that are environment specific, but these are OOM reporting essentials IMHO.

>
> > I mentioned this before but I will reiterate this here.
> >
> > So how do we get there? Let's look at the existing report which we know
> > has issues.
> >
> > Other than a few essential OOM messages the OOM code should produce,
> > such as the Killed process message message sequence being included,
> > you could have the entire OOM report moved to an eBPF script and
> > therefore make it customizable, configurable or if you prefer programmable.
>
> I believe we should keep the current reporting in place and allow
> additional information via dynamic mechanism. Be it a registration
> mechanism that modules can hook into or other more dynamic way.
> The current reporting has proven to be useful in many typical oom
> situations in my past years of experience. It gives the rough state of
> the failing allocation, MM subsystem, tasks that are eligible and task
> that is killed so that you can understand why the event happened.
>
> I would argue that the eligible tasks should be printed on the opt-in
> bases because this is more of relict from the past when the victim
> selection was less deterministic. But that is another story.
>
> All the rest of dump_header should stay IMHO as a reasonable default and
> bare minimum.
>
> > Why? Because as we all agree, you'll never have a perfect OOM Report.
> > So if you believe this, than if you will, put your money where your mouth
> > is (so to speak) and make the entire OOM Report and eBPF script.
> > We'd be willing to help with this.
> >
> > I'll give specific reasons why you want to do this.
> >
> >    - Don't want to maintain a lot of code in the kernel (eBPF code doesn't
> >    count).
> >    - Can't produce an ideal OOM report.
> >    - Don't like configuring things but favor programmatic solutions.
> >    - Agree the existing OOM report doesn't work for all environments.
> >    - Want to allow flexibility but can't support everything people might
> >    want.
> >    - Then installing an eBPF for OOM Reporting isn't an option, it's
> >    required.
>
> This is going into an extreme. We cannot serve all cases but that is
> true for any other heuristics/reporting in the kernel. We do care about
> most.

Unfortunately my argument for this is moot: this can't be done with
eBPF, at least not now.

>
> > The last reason is huge for people who live in a world with large data
> > centers. Data center managers are very conservative. They don't want to
> > deviate from standard operating procedure unless absolutely necessary.
> > If loading an OOM Report eBPF is standard to get OOM Reporting output,
> > then they'll accept that.
>
> I have already responded to this kind of argumentation elsewhere. This
> is not a relevant argument for any kernel implementation. This is a data
> process management process.
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 15:03               ` Edward Chron
@ 2019-08-29 15:42                 ` Qian Cai
  2019-08-29 16:09                   ` Edward Chron
  2019-08-29 16:17                 ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Qian Cai @ 2019-08-29 15:42 UTC (permalink / raw)
  To: Edward Chron, Michal Hocko
  Cc: Tetsuo Handa, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > then move OOM reporting to an eBPF.
> > > > 
> > > > I've said that all this additional information has to be dynamically
> > > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > > suitable tool, I do not know. I haven't explored that. There are other
> > > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > > probably others.
> > > 
> > > As for SystemTap, guru mode (an expert mode which disables protection
> > > provided
> > > by SystemTap; allowing kernel to crash when something went wrong) could be
> > > used
> > > for holding spinlock. However, as far as I know, holding mutex (or doing
> > > any
> > > operation that might sleep) from such dynamic hooks is not allowed. Also
> > > we will
> > > need to export various symbols in order to allow access from such dynamic
> > > hooks.
> > 
> > This is the oom path and it should better not use any sleeping locks in
> > the first place.
> > 
> > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > > 
> > > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> > > SystemTap will be suitable for dumping OOM information. OOM situation
> > > means
> > > that even single page fault event cannot complete, and temporary memory
> > > allocation for reading from kernel or writing to files cannot complete.
> > 
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path afterall.
> > 
> > > Therefore, we will need to hold all information in kernel memory (without
> > > allocating any memory when OOM event happened). Dynamic hooks could hold
> > > a few lines of output, but not all lines we want. The only possible buffer
> > > which is preallocated and large enough would be printk()'s buffer. Thus,
> > > I believe that we will have to use printk() in order to dump OOM
> > > information.
> > > At that point,
> > 
> > Yes, this is what I've had in mind.
> > 
> 
> +1: It makes sense to keep the report going to the dmesg to persist.
> That is where it has always gone and there is no reason to change.
> You can have several OOMs back to back and you'd like to retain the output.
> All the information should be kept together in the OOM report.
> 
> > > 
> > >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> > > 
> > >   bool out_of_memory(struct oom_control *oc)
> > >   {
> > >           return oom_handler(oc);
> > >   }
> > > 
> > > and let in-tree kernel modules override current OOM killer would be
> > > the only practical choice (if we refuse adding many knobs).
> > 
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
> 
> For very simple addition, to add a line of output this works.
> It would still be nice to address the fact the existing OOM Report prints
> all of the user processes or none. It would be nice to add some control
> for that. That's what we did.

It feels like you are going in circles to "sell" this without any new
information. If you need to deal with OOM that often, it might also be worth
working with FB on oomd.

https://github.com/facebookincubator/oomd

It is well known that kernel OOM can be slow and painful to deal with, so I
don't buy the argument that kernel OOM recovery is better/faster than a kdump
reboot.

It is not unusual that when the system triggers a kernel OOM, it is almost
trashed/dead. Although developers are working hard to improve recovery after
OOM, there are still many error paths that will not survive, which would
leak memory, introduce undefined behavior, corrupt memory, etc.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 14:09               ` Tetsuo Handa
@ 2019-08-29 15:48                 ` Edward Chron
  0 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-29 15:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu, Aug 29, 2019 at 7:09 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2019/08/29 20:56, Michal Hocko wrote:
> >> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> >> SystemTap will be suitable for dumping OOM information. OOM situation means
> >> that even single page fault event cannot complete, and temporary memory
> >> allocation for reading from kernel or writing to files cannot complete.
> >
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path afterall.
>
> The process who fetches from e.g. eBPF event cannot involve page fault.
> The front-end for iovisor/bcc is a python userspace process. But I think
> that such process can't run under OOM situation.
>
> >
> >> Therefore, we will need to hold all information in kernel memory (without
> >> allocating any memory when OOM event happened). Dynamic hooks could hold
> >> a few lines of output, but not all lines we want. The only possible buffer
> >> which is preallocated and large enough would be printk()'s buffer. Thus,
> >> I believe that we will have to use printk() in order to dump OOM information.
> >> At that point,
> >
> > Yes, this is what I've had in mind.
>
> Probably I incorrectly shortcut.
>
> Dynamic hooks could hold a few lines of output, but dynamic hooks can not hold
> all lines when dump_tasks() reports 32000+ processes. We have to buffer all output
> in kernel memory because we can't complete even a page fault event triggered by
> the python process monitoring eBPF event (and writing the result to some log file
> or something) while out_of_memory() is in flight.
>
> And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What I'm
> saying is "we won't be able to hold output from dump_tasks() if output from
> dump_tasks() goes to buffer preallocated for dynamic hooks". We have to find
> a way that can handle the worst case.

With the patch series we sent, printing vmalloc entries required us to add a
small piece of code to vmalloc.c, but we thought this should be a core OOM
reporting function. You do want to limit which vmalloc entries you print,
though, probably to only the very large memory users. For us this generates
just a few entries and has proven useful.

The changes to limit how many processes get printed, so you don't have an
all-or-nothing choice, would be nice to have. It would be easiest if there
were a standard mechanism to specify which entries to print, probably by a
minimum size, which is what we did. We used debugfs to set the controls, but
sysctl or some other mechanism could be used.
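
As a rough illustration of the minimum-size selection described above, here is
a minimal userspace sketch in C. It is not the code from the patch series: the
struct mem_entry type, the print_selected() helper and the page-count threshold
are all hypothetical names chosen for the example, and the real controls were
set through debugfs rather than passed as an argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for one reportable memory user (not a kernel type). */
struct mem_entry {
	const char *name;
	unsigned long pages;
};

/*
 * Print only the entries that use at least min_pages, then a single summary
 * line covering everything that was skipped. Returns the number of entries
 * printed individually.
 */
static size_t print_selected(const struct mem_entry *e, size_t n,
			     unsigned long min_pages)
{
	size_t printed = 0, skipped = 0;
	unsigned long skipped_pages = 0;
	size_t i;

	assert(e != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (e[i].pages >= min_pages) {
			printf("%-16s %lu pages\n", e[i].name, e[i].pages);
			printed++;
		} else {
			skipped++;
			skipped_pages += e[i].pages;
		}
	}

	/* One summary line replaces the entries below the threshold. */
	printf("%zu smaller entries not shown, %lu pages total\n",
	       skipped, skipped_pages);
	return printed;
}
```

With a threshold of 0 every entry is printed, which matches today's
all-or-nothing behavior; raising the threshold trades per-entry detail for a
single summary line.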

The rest of what we did might be implemented with hooks as they only output
a line or two and I've already got rid of information we had that was
redundant.

* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 15:42                 ` Qian Cai
@ 2019-08-29 16:09                   ` Edward Chron
  2019-08-29 18:44                     ` Qian Cai
  0 siblings, 1 reply; 46+ messages in thread
From: Edward Chron @ 2019-08-29 16:09 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, Tetsuo Handa, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Thu, Aug 29, 2019 at 8:42 AM Qian Cai <cai@lca.pw> wrote:
>
> On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > > then move OOM reporting to an eBPF.
> > > > >
> > > > > I've said that all this additional information has to be dynamically
> > > > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > > > suitable tool, I do not know. I haven't explored that. There are other
> > > > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > > > probably others.
> > > >
> > > > > As for SystemTap, guru mode (an expert mode which disables protection
> > > > > provided by SystemTap; allowing kernel to crash when something went
> > > > > wrong) could be used for holding spinlock. However, as far as I know,
> > > > > holding mutex (or doing any operation that might sleep) from such
> > > > > dynamic hooks is not allowed. Also we will need to export various
> > > > > symbols in order to allow access from such dynamic hooks.
> > >
> > > This is the oom path and it should better not use any sleeping locks in
> > > the first place.
> > >
> > > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > > >
> > > > > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF
> > > > > nor SystemTap will be suitable for dumping OOM information. OOM
> > > > > situation means that even single page fault event cannot complete, and
> > > > > temporary memory allocation for reading from kernel or writing to files
> > > > > cannot complete.
> > >
> > > > And I repeat that no such reporting is going to write to files. This is
> > > > an OOM path after all.
> > >
> > > > Therefore, we will need to hold all information in kernel memory (without
> > > > allocating any memory when OOM event happened). Dynamic hooks could hold
> > > > a few lines of output, but not all lines we want. The only possible buffer
> > > > which is preallocated and large enough would be printk()'s buffer. Thus,
> > > > I believe that we will have to use printk() in order to dump OOM
> > > > information.
> > > > At that point,
> > >
> > > Yes, this is what I've had in mind.
> > >
> >
> > +1: It makes sense to keep the report going to the dmesg to persist.
> > That is where it has always gone and there is no reason to change.
> > You can have several OOMs back to back and you'd like to retain the output.
> > All the information should be kept together in the OOM report.
> >
> > > >
> > > >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> > > >
> > > >   bool out_of_memory(struct oom_control *oc)
> > > >   {
> > > >           return oom_handler(oc);
> > > >   }
> > > >
> > > > and let in-tree kernel modules override current OOM killer would be
> > > > the only practical choice (if we refuse adding many knobs).
> > >
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not necessary.
> >
> > For very simple addition, to add a line of output this works.
> > It would still be nice to address the fact the existing OOM Report prints
> > all of the user processes or none. It would be nice to add some control
> > for that. That's what we did.
>
> Feel like you are going in circles to "sell" without any new information. If you
> need to deal with OOM that often, it might also worth working with FB on oomd.
>
> https://github.com/facebookincubator/oomd
>
> It is well-known that kernel OOM could be slow and painful to deal with, so I
> don't buy-in the argument that kernel OOM recover is better/faster than a kdump
> reboot.
>
> It is not unusual that when the system is triggering a kernel OOM, it is almost
> trashed/dead. Although developers are working hard to improve the recovery after
> OOM, there are still many error-paths that are not going to survive which would
> leak memories, introduce undefined behaviors, corrupt memory etc.

But as you have pointed out, many people are happy with current OOM processing,
which is the report and recovery, so for those people a kdump reboot is
overkill. Making the OOM report at least optionally a bit more informative has
value. Also, making sure it doesn't produce excessive output is desirable.

I do agree that developers benefit from having all the system state a kdump
provides, and as long as you can reproduce the OOM event that works well. But
that is not the common case, as has already been discussed.

Also, OOM events that are due to kernel bugs could leak memory over time and
cause a crash, true. But that is not what we typically see. In fact, we've had
customers come back and report issues on systems that have been in continuous
operation for years. No point in crashing their system. Linux, if properly
maintained, is thankfully quite stable. But OOMs do happen, and root-causing
them to prevent future occurrences is desired.

* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 15:03               ` Edward Chron
  2019-08-29 15:42                 ` Qian Cai
@ 2019-08-29 16:17                 ` Michal Hocko
  2019-08-29 16:35                   ` Edward Chron
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2019-08-29 16:17 UTC (permalink / raw)
  To: Edward Chron
  Cc: Tetsuo Handa, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu 29-08-19 08:03:19, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
> 
> For very simple addition, to add a line of output this works.

Why would a hook be limited to small stuff?

> It would still be nice to address the fact the existing OOM Report prints
> all of the user processes or none. It would be nice to add some control
> for that. That's what we did.

TBH, I am not really convinced a partial task list is desirable or easy
to configure. What is the criterion? oom_score (with potentially unstable
metric)? Rss? Something else?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 16:17                 ` Michal Hocko
@ 2019-08-29 16:35                   ` Edward Chron
  0 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-29 16:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Andrew Morton, Roman Gushchin, Johannes Weiner,
	David Rientjes, Shakeel Butt, linux-mm, linux-kernel,
	Ivan Delalande

On Thu, Aug 29, 2019 at 9:18 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 29-08-19 08:03:19, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not necessary.
> >
> > For very simple addition, to add a line of output this works.
>
> Why would a hook be limited to small stuff?

It could be larger but the few items we added were just a line or
two of output.
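
For illustration only, the kind of report hook Michal describes (called with
the OOM context to add output, without replacing the OOM killer) could be
sketched in userspace C as below. Nothing here is an existing kernel interface:
struct oom_report_ctx is a hypothetical stand-in for struct oom_control, and
register_oom_report_hook() and run_oom_report_hooks() are made-up names. The
hook table is a fixed-size static array so that nothing needs to be allocated
while the report is being produced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct oom_control; fields are illustrative. */
struct oom_report_ctx {
	unsigned long totalpages;
	int order;
};

typedef void (*oom_report_hook_t)(const struct oom_report_ctx *ctx);

#define MAX_REPORT_HOOKS 8

/* Preallocated table: no memory is allocated on the report path. */
static oom_report_hook_t report_hooks[MAX_REPORT_HOOKS];
static size_t nr_report_hooks;

/* Register a hook ahead of time. Returns 0 on success, -1 on failure. */
static int register_oom_report_hook(oom_report_hook_t fn)
{
	if (fn == NULL || nr_report_hooks == MAX_REPORT_HOOKS)
		return -1;
	report_hooks[nr_report_hooks++] = fn;
	return 0;
}

/* Called while reporting; runs every hook and returns how many ran. */
static size_t run_oom_report_hooks(const struct oom_report_ctx *ctx)
{
	size_t i;

	assert(ctx != NULL);
	for (i = 0; i < nr_report_hooks; i++)
		report_hooks[i](ctx);
	return nr_report_hooks;
}

/* Example hook that adds a single line of output. */
static void print_order_hook(const struct oom_report_ctx *ctx)
{
	printf("oom: order=%d totalpages=%lu\n", ctx->order, ctx->totalpages);
}
```

A hook registered this way only adds output; victim selection and killing stay
with the existing code, which is the distinction being drawn against
overriding out_of_memory() wholesale.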

The vmalloc, slabs and processes can print many entries so we
added a control for those.

>
> > It would still be nice to address the fact the existing OOM Report prints
> > all of the user processes or none. It would be nice to add some control
> > for that. That's what we did.
>
> TBH, I am not really convinced a partial task list is desirable or easy
> to configure. What is the criterion? oom_score (with potentially unstable
> metric)? Rss? Something else?

We used an estimate of the memory footprint of the process:
rss, swap pages and page table pages.

> --
> Michal Hocko
> SUSE Labs

* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 16:09                   ` Edward Chron
@ 2019-08-29 18:44                     ` Qian Cai
  2019-08-29 22:41                       ` Edward Chron
  0 siblings, 1 reply; 46+ messages in thread
From: Qian Cai @ 2019-08-29 18:44 UTC (permalink / raw)
  To: Edward Chron
  Cc: Michal Hocko, Tetsuo Handa, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:

> > Feel like you are going in circles to "sell" without any new information.
> > If you need to deal with OOM that often, it might also worth working with
> > FB on oomd.
> >
> > https://github.com/facebookincubator/oomd
> >
> > It is well-known that kernel OOM could be slow and painful to deal with,
> > so I don't buy-in the argument that kernel OOM recover is better/faster
> > than a kdump reboot.
> >
> > It is not unusual that when the system is triggering a kernel OOM, it is
> > almost trashed/dead. Although developers are working hard to improve the
> > recovery after OOM, there are still many error-paths that are not going to
> > survive which would leak memories, introduce undefined behaviors, corrupt
> > memory etc.
>
> But as you have pointed out many people are happy with current OOM
> processing which is the report and recovery so for those people a kdump
> reboot is overkill. Making the OOM report at least optionally a bit more
> informative has value. Also making sure it doesn't produce excessive output
> is desirable.
>
> I do agree for developers having to have all the system state a kdump
> provides that and as long as you can reproduce the OOM event that works
> well. But that is not the common case as has already been discussed.
>
> Also, OOM events that are due to kernel bugs could leak memory and over
> time and cause a crash, true. But that is not what we typically see. In
> fact we've had customers come back and report issues on systems that have
> been in continuous operation for years. No point in crashing their system.
> Linux if properly maintained is thankfully quite stable. But OOMs do happen
> and root causing them to prevent future occurrences is desired.

This is not what I meant. After an OOM event happens, many kernel memory
allocations could fail. Since very few people test the error paths taken on
allocation failure, they are considered among the buggiest areas in the
kernel. Developers have mostly been focused on making sure the kernel OOM does
not happen in the first place.

I still think the time is better spent on improving things like eBPF, oomd
and kdump etc. to solve your problem, while leaving the kernel OOM report code
alone.


* Re: [PATCH 00/10] OOM Debug print selection and additional information
  2019-08-29 18:44                     ` Qian Cai
@ 2019-08-29 22:41                       ` Edward Chron
  0 siblings, 0 replies; 46+ messages in thread
From: Edward Chron @ 2019-08-29 22:41 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, Tetsuo Handa, Andrew Morton, Roman Gushchin,
	Johannes Weiner, David Rientjes, Shakeel Butt, linux-mm,
	linux-kernel, Ivan Delalande

On Thu, Aug 29, 2019 at 11:44 AM Qian Cai <cai@lca.pw> wrote:
>
> On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:
>
> > > Feel like you are going in circles to "sell" without any new
> > > information. If you need to deal with OOM that often, it might also
> > > worth working with FB on oomd.
> > >
> > > https://github.com/facebookincubator/oomd
> > >
> > > It is well-known that kernel OOM could be slow and painful to deal with,
> > > so I don't buy-in the argument that kernel OOM recover is better/faster
> > > than a kdump reboot.
> > >
> > > It is not unusual that when the system is triggering a kernel OOM, it is
> > > almost trashed/dead. Although developers are working hard to improve the
> > > recovery after OOM, there are still many error-paths that are not going
> > > to survive which would leak memories, introduce undefined behaviors,
> > > corrupt memory etc.
> >
> > But as you have pointed out many people are happy with current OOM
> > processing which is the report and recovery so for those people a kdump
> > reboot is overkill. Making the OOM report at least optionally a bit more
> > informative has value. Also making sure it doesn't produce excessive
> > output is desirable.
> >
> > I do agree for developers having to have all the system state a kdump
> > provides that and as long as you can reproduce the OOM event that works
> > well. But that is not the common case as has already been discussed.
> >
> > Also, OOM events that are due to kernel bugs could leak memory and over
> > time and cause a crash, true. But that is not what we typically see. In
> > fact we've had customers come back and report issues on systems that have
> > been in continuous operation for years. No point in crashing their
> > system. Linux if properly maintained is thankfully quite stable. But OOMs
> > do happen and root causing them to prevent future occurrences is desired.
>
> This is not what I meant. After an OOM event happens, many kernel memory
> allocations could fail. Since very few people test the error paths taken on
> allocation failure, they are considered among the buggiest areas in the
> kernel. Developers have mostly been focused on making sure the kernel OOM
> does not happen in the first place.
>
> I still think the time is better spent on improving things like eBPF, oomd
> and kdump etc. to solve your problem, while leaving the kernel OOM report
> code alone.
>

Sure, I would rather spend my time doing other things.
No argument about that. No one likes OOMs.
If I never see another OOM I'd be quite happy.

But OOM events still happen and an OOM report gets generated.
When it happens it is useful to get information that can help
find the cause of the OOM so it can be fixed and won't happen again.
We get tasked to root cause OOMs even though we'd rather do
other things.

We've added a bit of output to the OOM Report and it has been helpful.
We also reduce our total output by only printing larger entries
with helpful summaries.
We've been using and supporting this code for quite a few releases.
We haven't had problems and we have a lot of systems in use.

Contributing to an open source project like Linux is good.
If the code is not accepted, it's not the end of the world.
I was told to offer our code upstream and to try to be helpful.

I understand that processing an OOM event can be flaky.
We add a few lines of OOM output, but in fact we reduce our total
output because we skip printing smaller entries and print
summaries instead.

So if the volume of output increases the likelihood of system
failure during an OOM event, then by printing less we've actually
increased our reliability. Maybe that is why we haven't had any problems.

As far as switching from generating an OOM report to taking
a dump and restarting the system, the choice is not mine to
make. Way above my pay grade. When asked, I am
happy to look at a dump, but dumps plus restarts for
the systems we work on take too long, so I typically don't get
a dump to look at. I have to make do with OOM output and
logs.

Also, and depending on what you work on, you may take
satisfaction that OOM events are far less traumatic with
newer versions of Linux, with our systems. The folks upstream
do really good work, give credit where credit is due.
Maybe tools like KASAN really help, which we also use.

Sure, people fix bugs all the time. Linux is huge and super
complicated, but many of the bugs are not very common,
and we spend an amazing (to me anyway) amount of time
testing, so when we take OOM events, even multiple
OOM events back to back, the system almost always
recovers and we don't seem to bleed memory. That is
why we have systems up for months and even years.

Occasionally we see a watchdog timeout failure, and that
can be due to a low memory situation, but just FYI a fair
number of those do not involve OOM events, so it's not
because of issues with OOM code, reporting or otherwise.

Regardless, thank you for your time and for your comments.
Constructive feedback is useful and certainly appreciated.

By the way we use oomd on some systems. It is helpful and
in my experience it helps to reduce OOM events but sadly
they still occur. For systems where it is not used, again that
is not my choice to make.

Edward Chron
Arista Networks


Thread overview: 46+ messages
2019-08-26 19:36 [PATCH 00/10] OOM Debug print selection and additional information Edward Chron
2019-08-26 19:36 ` [PATCH 01/10] mm/oom_debug: Add Debug base code Edward Chron
2019-08-27 13:28   ` kbuild test robot
2019-08-26 19:36 ` [PATCH 02/10] mm/oom_debug: Add System State Summary Edward Chron
2019-08-26 19:36 ` [PATCH 03/10] mm/oom_debug: Add Tasks Summary Edward Chron
2019-08-26 19:36 ` [PATCH 04/10] mm/oom_debug: Add ARP and ND Table Summary usage Edward Chron
2019-08-26 19:36 ` [PATCH 05/10] mm/oom_debug: Add Select Slabs Print Edward Chron
2019-08-26 19:36 ` [PATCH 06/10] mm/oom_debug: Add Select Vmalloc Entries Print Edward Chron
2019-08-26 19:36 ` [PATCH 07/10] mm/oom_debug: Add Select Process " Edward Chron
2019-08-26 19:36 ` [PATCH 08/10] mm/oom_debug: Add Slab Select Always Print Enable Edward Chron
2019-08-26 19:36 ` [PATCH 09/10] mm/oom_debug: Add Enhanced Slab Print Information Edward Chron
2019-08-26 19:36 ` [PATCH 10/10] mm/oom_debug: Add Enhanced Process " Edward Chron
2019-08-28  0:21   ` kbuild test robot
2019-08-27  7:15 ` [PATCH 00/10] OOM Debug print selection and additional information Michal Hocko
2019-08-27 10:10   ` Tetsuo Handa
2019-08-27 10:38     ` Michal Hocko
2019-08-28  1:07   ` Edward Chron
2019-08-28  6:59     ` Michal Hocko
     [not found]       ` <CAM3twVR_OLffQ1U-SgQOdHxuByLNL5sicfnObimpGpPQ1tJ0FQ@mail.gmail.com>
2019-08-28 20:18         ` Qian Cai
2019-08-28 21:17           ` Edward Chron
2019-08-28 21:34             ` Qian Cai
2019-08-29  7:11         ` Michal Hocko
2019-08-29 10:14           ` Tetsuo Handa
2019-08-29 11:56             ` Michal Hocko
2019-08-29 14:09               ` Tetsuo Handa
2019-08-29 15:48                 ` Edward Chron
2019-08-29 15:03               ` Edward Chron
2019-08-29 15:42                 ` Qian Cai
2019-08-29 16:09                   ` Edward Chron
2019-08-29 18:44                     ` Qian Cai
2019-08-29 22:41                       ` Edward Chron
2019-08-29 16:17                 ` Michal Hocko
2019-08-29 16:35                   ` Edward Chron
2019-08-29 15:20           ` Edward Chron
2019-08-27 12:40 ` Qian Cai
     [not found]   ` <CAM3twVQEMGWMQEC0dduri0JWt3gH6F2YsSqOmk55VQz+CZDVKg@mail.gmail.com>
2019-08-28  0:50     ` Qian Cai
2019-08-28  1:13       ` Edward Chron
2019-08-28  1:32         ` Qian Cai
2019-08-28  2:47           ` Edward Chron
2019-08-28  7:08             ` Michal Hocko
2019-08-28 10:12               ` Tetsuo Handa
2019-08-28 10:32                 ` Michal Hocko
2019-08-28 10:56                   ` Tetsuo Handa
2019-08-28 11:12                     ` Michal Hocko
2019-08-28 20:04                 ` Edward Chron
2019-08-29  3:31                   ` Edward Chron
