linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] Printbufs & shrinker OOM reporting
@ 2022-04-19 20:31 Kent Overstreet
  2022-04-19 20:31 ` [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings Kent Overstreet
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-19 20:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Kent Overstreet, roman.gushchin, hannes

Debugging OOMs has been one of my sources of frustration, so this patch series
is an attempt to do something about it.

The first patch in the series is something I've been slowly evolving in bcachefs
for years: simple heap allocated strings meant for appending to and building up
structured log/error messages. They make it easy and straightforward to write
pretty-printers for everything, which in turn makes good logging and error
messages something that just happens naturally.

We want it here because that means the reporting I'm adding to shrinkers can be
used by both OOM reporting, and for the sysfs (or is it debugfs now) interface
that Roman is adding.

This patch series also:
 - adds OOM reporting on shrinkers, reporting on top 10 shrinkers (in sorted
   order!)
 - changes slab reporting to be always-on, also reporting top 10 slabs in sorted
   order
 - starts centralizing OOM reporting in mm/show_mem.c
 
The last patch in the series is only a demonstration of how to implement the
shrinker .to_text() method, since bcachefs isn't upstream yet.

Kent Overstreet (4):
  lib/printbuf: New data structure for heap-allocated strings
  mm: Add a .to_text() method for shrinkers
  mm: Centralize & improve oom reporting in show_mem.c
  bcachefs: shrinker.to_text() methods

 fs/bcachefs/btree_cache.c     |  18 ++-
 fs/bcachefs/btree_key_cache.c |  18 ++-
 include/linux/printbuf.h      | 140 ++++++++++++++++++
 include/linux/shrinker.h      |   5 +
 lib/Makefile                  |   4 +-
 lib/printbuf.c                | 271 ++++++++++++++++++++++++++++++++++
 mm/Makefile                   |   2 +-
 mm/oom_kill.c                 |  23 ---
 {lib => mm}/show_mem.c        |  14 ++
 mm/slab.h                     |   6 +-
 mm/slab_common.c              |  53 ++++++-
 mm/vmscan.c                   |  75 ++++++++++
 12 files changed, 587 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/printbuf.h
 create mode 100644 lib/printbuf.c
 rename {lib => mm}/show_mem.c (78%)

-- 
2.35.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings
  2022-04-19 20:31 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
@ 2022-04-19 20:31 ` Kent Overstreet
  2022-04-19 21:11   ` Matthew Wilcox
  2022-04-20  5:02   ` Christoph Hellwig
  2022-04-19 20:32 ` [PATCH 2/4] mm: Add a .to_text() method for shrinkers Kent Overstreet
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-19 20:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Kent Overstreet, roman.gushchin, hannes

This adds printbufs: simple heap-allocated strings meant for building up
structured messages, for logging/procfs/sysfs and elsewhere. They've
been heavily used in bcachefs for writing .to_text() functions/methods -
pretty printers, which has in turn greatly improved the overall quality
of error messages.

Basic usage is documented in include/linux/printbuf.h.

The next patches in the series are going to be using printbufs to
implement a .to_text() method for shrinkers, and improving OOM
reporting.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/printbuf.h | 140 ++++++++++++++++++++
 lib/Makefile             |   2 +-
 lib/printbuf.c           | 271 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 412 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/printbuf.h
 create mode 100644 lib/printbuf.c

diff --git a/include/linux/printbuf.h b/include/linux/printbuf.h
new file mode 100644
index 0000000000..84a271446d
--- /dev/null
+++ b/include/linux/printbuf.h
@@ -0,0 +1,140 @@
+/* SPDX-License-Identifier: LGPL-2.1+ */
+/* Copyright (C) 2022 Kent Overstreet */
+
+#ifndef _LINUX_PRINTBUF_H
+#define _LINUX_PRINTBUF_H
+
+/*
+ * Printbufs: Simple heap allocated strings, with some features for structered
+ * formatting.
+ *
+ * This code has provisions for use in userspace, to aid in making other code
+ * portable between kernelspace and userspace.
+ *
+ * Basic example:
+ *
+ *   struct printbuf buf = PRINTBUF;
+ *
+ *   pr_buf(&buf, "foo=");
+ *   foo_to_text(&buf, foo);
+ *   printk("%s", buf.buf);
+ *   printbuf_exit(&buf);
+ *
+ * We can now write pretty printers instead of writing code that dumps
+ * everything to the kernel log buffer, and then those pretty-printers can be
+ * used by other code that outputs to kernel log, sysfs, debugfs, etc.
+ *
+ * Memory allocation: Outputing to a printbuf may allocate memory. This
+ * allocation is done with GFP_KERNEL, by default: use the newer
+ * memalloc_*_(save|restore) functions as needed.
+ *
+ * Since no equivalent yet exists for GFP_ATOMIC/GFP_NOWAIT, memory allocations
+ * will be done with GFP_ATOMIC if printbuf->atomic is nonzero.
+ *
+ * Memory allocation failures: We don't return errors directly, because on
+ * memory allocation failure we usually don't want to bail out and unwind - we
+ * want to print what we've got, on a best-effort basis. But code that does want
+ * to return -ENOMEM may check printbuf.allocation_failure.
+ *
+ * Indenting, tabstops:
+ *
+ * To aid is writing multi-line pretty printers spread across multiple
+ * functions, printbufs track the current indent level.
+ *
+ * pr_indent_push() and pr_indent_pop() increase and decrease the current indent
+ * level, respectively.
+ *
+ * To use tabstops, set printbuf->tabstops[]; they are in units of spaces, from
+ * start of line. Once set, pr_tab() will output spaces up to the next tabstop.
+ * pr_tab_rjust() will also advance the current line of text up to the next
+ * tabstop, but it does so by shifting text since the previous tabstop up to the
+ * next tabstop - right justifying it.
+ *
+ * Make sure you use pr_newline() instead of \n in the format string for indent
+ * level and tabstops to work corretly.
+ *
+ * Output units: printbuf->units exists to tell pretty-printers how to output
+ * numbers: a raw value (e.g. directly from a superblock field), as bytes, or as
+ * human readable bytes. pr_units() and pr_sectors() obey it.
+ *
+ * Other helpful functions:
+ *
+ * pr_human_readable_u64, pr_human_readable_s64: Print an integer with human
+ * readable units.
+ *
+ * pr_time(): for printing a time_t with strftime in userspace, prints as an
+ * integer number of seconds in the kernel.
+ *
+ * pr_string_option: Given an enumerated value and a string array with names for
+ * each option, prints out the enum names with the selected one indicated with
+ * square brackets.
+ *
+ * pr_bitflags: Given a bitflag and a string array with names for each bit,
+ * prints out the names of the selected bits.
+ */
+
+#include <linux/slab.h>
+
+enum printbuf_units {
+	PRINTBUF_UNITS_RAW,
+	PRINTBUF_UNITS_BYTES,
+	PRINTBUF_UNITS_HUMAN_READABLE,
+};
+
+struct printbuf {
+	char			*buf;
+	unsigned		size;
+	unsigned		pos;
+	unsigned		last_newline;
+	unsigned		last_field;
+	unsigned		indent;
+	enum printbuf_units	units:8;
+	u8			atomic;
+	bool			allocation_failure:1;
+	u8			tabstop;
+	u8			tabstops[4];
+};
+
+#define PRINTBUF ((struct printbuf) { NULL })
+
+/**
+ * printbuf_exit - exit a printbuf, freeing memory it owns and poisoning it
+ * against accidental use.
+ */
+static inline void printbuf_exit(struct printbuf *buf)
+{
+	kfree(buf->buf);
+	buf->buf = ERR_PTR(-EINTR); /* poison value */
+}
+
+/**
+ * printbuf_reset - re-use a printbuf without freeing and re-initializing it:
+ */
+static inline void printbuf_reset(struct printbuf *buf)
+{
+	buf->pos		= 0;
+	buf->last_newline	= 0;
+	buf->last_field		= 0;
+	buf->indent		= 0;
+	buf->tabstop		= 0;
+}
+
+void pr_buf(struct printbuf *out, const char *fmt, ...)
+	__attribute__ ((format (printf, 2, 3)));
+
+void pr_char(struct printbuf *buf, char c);
+void pr_newline(struct printbuf *);
+void pr_indent_push(struct printbuf *, unsigned);
+void pr_indent_pop(struct printbuf *, unsigned);
+void pr_tab(struct printbuf *);
+void pr_tab_rjust(struct printbuf *);
+void pr_human_readable_u64(struct printbuf *, u64);
+void pr_human_readable_s64(struct printbuf *, s64);
+void pr_units(struct printbuf *, s64, s64);
+void pr_sectors(struct printbuf *, u64);
+void pr_time(struct printbuf *, u64);
+void pr_uuid(struct printbuf *, u8 *);
+void pr_string_option(struct printbuf *, const char * const list[], size_t);
+void pr_bitflags(struct printbuf *, const char * const list[], u64);
+
+#endif /* _LINUX_PRINTBUF_H */
diff --git a/lib/Makefile b/lib/Makefile
index c588a126a3..31a3904eda 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
 	 nmi_backtrace.o nodemask.o win_minmax.o memcat_p.o \
-	 buildid.o
+	 buildid.o printbuf.o
 
 lib-$(CONFIG_PRINTK) += dump_stack.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/printbuf.c b/lib/printbuf.c
new file mode 100644
index 0000000000..1d87de787f
--- /dev/null
+++ b/lib/printbuf.c
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: LGPL-2.1+
+/* Copyright (C) 2022 Kent Overstreet */
+
+#ifdef __KERNEL__
+#include <linux/export.h>
+#include <linux/kernel.h>
+#else
+#define EXPORT_SYMBOL(x)
+#endif
+
+#include <linux/log2.h>
+#include <linux/printbuf.h>
+
+static inline size_t printbuf_remaining(struct printbuf *buf)
+{
+	return buf->size - buf->pos;
+}
+
+static inline size_t printbuf_linelen(struct printbuf *buf)
+{
+	return buf->pos - buf->last_newline;
+}
+
+static int printbuf_realloc(struct printbuf *out, unsigned extra)
+{
+	unsigned new_size;
+	char *buf;
+
+	if (out->pos + extra + 1 < out->size)
+		return 0;
+
+	new_size = roundup_pow_of_two(out->size + extra);
+	buf = krealloc(out->buf, new_size, !out->atomic ? GFP_KERNEL : GFP_ATOMIC);
+
+	if (!buf) {
+		out->allocation_failure = true;
+		return -ENOMEM;
+	}
+
+	out->buf	= buf;
+	out->size	= new_size;
+	return 0;
+}
+
+void pr_buf(struct printbuf *out, const char *fmt, ...)
+{
+	va_list args;
+	int len;
+
+	do {
+		va_start(args, fmt);
+		len = vsnprintf(out->buf + out->pos, printbuf_remaining(out), fmt, args);
+		va_end(args);
+	} while (len + 1 >= printbuf_remaining(out) &&
+		 !printbuf_realloc(out, len + 1));
+
+	len = min_t(size_t, len,
+		  printbuf_remaining(out) ? printbuf_remaining(out) - 1 : 0);
+	out->pos += len;
+}
+EXPORT_SYMBOL(pr_buf);
+
+void pr_char(struct printbuf *buf, char c)
+{
+	if (!printbuf_realloc(buf, 1)) {
+		buf->buf[buf->pos++] = c;
+		buf->buf[buf->pos] = 0;
+	}
+}
+EXPORT_SYMBOL(pr_char);
+
+void pr_newline(struct printbuf *buf)
+{
+	unsigned i;
+
+	pr_char(buf, '\n');
+
+	buf->last_newline	= buf->pos;
+
+	for (i = 0; i < buf->indent; i++)
+		pr_char(buf, ' ');
+
+	buf->last_field		= buf->pos;
+	buf->tabstop = 0;
+}
+EXPORT_SYMBOL(pr_newline);
+
+void pr_indent_push(struct printbuf *buf, unsigned spaces)
+{
+	buf->indent += spaces;
+	while (spaces--)
+		pr_char(buf, ' ');
+}
+EXPORT_SYMBOL(pr_indent_push);
+
+void pr_indent_pop(struct printbuf *buf, unsigned spaces)
+{
+	if (buf->last_newline + buf->indent == buf->pos) {
+		buf->pos -= spaces;
+		buf->buf[buf->pos] = 0;
+	}
+	buf->indent -= spaces;
+}
+EXPORT_SYMBOL(pr_indent_pop);
+
+void pr_tab(struct printbuf *buf)
+{
+	BUG_ON(buf->tabstop > ARRAY_SIZE(buf->tabstops));
+
+	while (printbuf_remaining(buf) > 1 &&
+	       printbuf_linelen(buf) < buf->tabstops[buf->tabstop])
+		pr_char(buf, ' ');
+
+	buf->last_field = buf->pos;
+	buf->tabstop++;
+}
+EXPORT_SYMBOL(pr_tab);
+
+void pr_tab_rjust(struct printbuf *buf)
+{
+	BUG_ON(buf->tabstop > ARRAY_SIZE(buf->tabstops));
+
+	if (printbuf_linelen(buf) < buf->tabstops[buf->tabstop]) {
+		unsigned move = buf->pos - buf->last_field;
+		unsigned shift = buf->tabstops[buf->tabstop] -
+			printbuf_linelen(buf);
+
+		printbuf_realloc(buf, shift);
+
+		if (buf->last_field + shift + 1 < buf->size) {
+			move = min(move, buf->size - 1 - buf->last_field - shift);
+
+			memmove(buf->buf + buf->last_field + shift,
+				buf->buf + buf->last_field,
+				move);
+			memset(buf->buf + buf->last_field, ' ', shift);
+			buf->pos += shift;
+			buf->buf[buf->pos] = 0;
+		}
+	}
+
+	buf->last_field = buf->pos;
+	buf->tabstop++;
+}
+EXPORT_SYMBOL(pr_tab_rjust);
+
+static const char si_units[] = "?kMGTPEZY";
+
+void pr_human_readable_u64(struct printbuf *buf, u64 v)
+{
+	int u, t = 0;
+
+	for (u = 0; v >= 1024; u++) {
+		t = v & ~(~0U << 10);
+		v >>= 10;
+	}
+
+	pr_buf(buf, "%llu", v);
+
+	/*
+	 * 103 is magic: t is in the range [-1023, 1023] and we want
+	 * to turn it into [-9, 9]
+	 */
+	if (u && t && v < 100 && v > -100)
+		pr_buf(buf, ".%i", t / 103);
+	if (u)
+		pr_char(buf, si_units[u]);
+}
+EXPORT_SYMBOL(pr_human_readable_u64);
+
+void pr_human_readable_s64(struct printbuf *buf, s64 v)
+{
+	if (v < 0)
+		pr_char(buf, '-');
+	pr_human_readable_u64(buf, abs(v));
+}
+EXPORT_SYMBOL(pr_human_readable_s64);
+
+void pr_units(struct printbuf *out, s64 raw, s64 bytes)
+{
+	switch (out->units) {
+	case PRINTBUF_UNITS_RAW:
+		pr_buf(out, "%llu", raw);
+		break;
+	case PRINTBUF_UNITS_BYTES:
+		pr_buf(out, "%llu", bytes);
+		break;
+	case PRINTBUF_UNITS_HUMAN_READABLE:
+		pr_human_readable_s64(out, bytes);
+		break;
+	}
+}
+EXPORT_SYMBOL(pr_units);
+
+void pr_sectors(struct printbuf *out, u64 v)
+{
+	pr_units(out, v, v << 9);
+}
+EXPORT_SYMBOL(pr_sectors);
+
+#ifdef __KERNEL__
+
+void pr_time(struct printbuf *out, u64 time)
+{
+	pr_buf(out, "%llu", time);
+}
+EXPORT_SYMBOL(pr_time);
+
+void pr_uuid(struct printbuf *out, u8 *uuid)
+{
+	pr_buf(out, "%pUb", uuid);
+}
+EXPORT_SYMBOL(pr_uuid);
+
+#else
+
+#include <time.h>
+#include <uuid.h>
+
+void pr_time(struct printbuf *out, u64 _time)
+{
+	char time_str[64];
+	time_t time = _time;
+	struct tm *tm = localtime(&time);
+	size_t err = strftime(time_str, sizeof(time_str), "%c", tm);
+
+	if (!err)
+		pr_buf(out, "(formatting error)");
+	else
+		pr_buf(out, "%s", time_str);
+}
+
+void pr_uuid(struct printbuf *out, u8 *uuid)
+{
+	char uuid_str[40];
+
+	uuid_unparse_lower(uuid, uuid_str);
+	pr_buf(out, uuid_str);
+}
+
+#endif
+
+void pr_string_option(struct printbuf *out,
+		      const char * const list[],
+		      size_t selected)
+{
+	size_t i;
+
+	for (i = 0; list[i]; i++)
+		pr_buf(out, i == selected ? "[%s] " : "%s ", list[i]);
+}
+EXPORT_SYMBOL(pr_string_option);
+
+void pr_bitflags(struct printbuf *out,
+		 const char * const list[], u64 flags)
+{
+	unsigned bit, nr = 0;
+	bool first = true;
+
+	while (list[nr])
+		nr++;
+
+	while (flags && (bit = __ffs(flags)) < nr) {
+		if (!first)
+			pr_buf(out, ",");
+		first = false;
+		pr_buf(out, "%s", list[bit]);
+		flags ^= 1 << bit;
+	}
+}
+EXPORT_SYMBOL(pr_bitflags);
-- 
2.35.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/4] mm: Add a .to_text() method for shrinkers
  2022-04-19 20:31 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
  2022-04-19 20:31 ` [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings Kent Overstreet
@ 2022-04-19 20:32 ` Kent Overstreet
  2022-04-19 20:32 ` [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c Kent Overstreet
  2022-04-19 20:32 ` [PATCH 4/4] bcachefs: shrinker.to_text() methods Kent Overstreet
  3 siblings, 0 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-19 20:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Kent Overstreet, roman.gushchin, hannes

This adds a new callback method to shrinkers which they can use to
describe anything relevant to memory reclaim about their internal state,
for example object dirtyness.

This uses the new printbufs to output to heap allocated strings, so that
the .to_text() methods can be used both for messages logged to the
console, and also sysfs/debugfs.

This patch also adds shrinkers_to_text(), which reports on the top 10
shrinkers - by object count - in sorted order, to be used in OOM
reporting.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/shrinker.h |  5 +++
 mm/vmscan.c              | 75 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 76fbf92b04..b5f411768b 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_SHRINKER_H
 #define _LINUX_SHRINKER_H
 
+struct printbuf;
+
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
  * We consolidate the values for easier extension later.
@@ -58,10 +60,12 @@ struct shrink_control {
  * @flags determine the shrinker abilities, like numa awareness
  */
 struct shrinker {
+	char name[32];
 	unsigned long (*count_objects)(struct shrinker *,
 				       struct shrink_control *sc);
 	unsigned long (*scan_objects)(struct shrinker *,
 				      struct shrink_control *sc);
+	void (*to_text)(struct printbuf *, struct shrinker *);
 
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
@@ -94,4 +98,5 @@ extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
 extern void synchronize_shrinkers(void);
+void shrinkers_to_text(struct printbuf *);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59b14e0d69..09c483dfd3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/printbuf.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -702,6 +703,80 @@ void synchronize_shrinkers(void)
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
 
+/**
+ * shrinkers_to_text - Report on shrinkers with highest usage
+ *
+ * This reports on the top 10 shrinkers, by object counts, in sorted order:
+ * intended to be used for OOM reporting.
+ */
+void shrinkers_to_text(struct printbuf *out)
+{
+	struct shrinker *shrinker;
+	struct shrinker_by_mem {
+		struct shrinker	*shrinker;
+		unsigned long	mem;
+	} shrinkers_by_mem[10];
+	int i, nr = 0;
+
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		pr_buf(out, "(couldn't take shrinker lock)");
+		return;
+	}
+
+	list_for_each_entry(shrinker, &shrinker_list, list) {
+		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
+		unsigned long mem = shrinker->count_objects(shrinker, &sc);
+
+		if (!mem || mem == SHRINK_STOP || mem == SHRINK_EMPTY)
+			continue;
+
+		for (i = 0; i < nr; i++)
+			if (mem < shrinkers_by_mem[i].mem)
+				break;
+
+		if (nr < ARRAY_SIZE(shrinkers_by_mem)) {
+			memmove(&shrinkers_by_mem[i + 1],
+				&shrinkers_by_mem[i],
+				sizeof(shrinkers_by_mem[0]) * (nr - i));
+			nr++;
+		} else if (i) {
+			i--;
+			memmove(&shrinkers_by_mem[0],
+				&shrinkers_by_mem[1],
+				sizeof(shrinkers_by_mem[0]) * i);
+		} else {
+			continue;
+		}
+
+		shrinkers_by_mem[i] = (struct shrinker_by_mem) {
+			.shrinker = shrinker,
+			.mem = mem,
+		};
+	}
+
+	for (i = nr - 1; i >= 0; --i) {
+		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
+		shrinker = shrinkers_by_mem[i].shrinker;
+
+		if (shrinker->name[0])
+			pr_buf(out, "%s", shrinker->name);
+		else
+			pr_buf(out, "%ps:", shrinker->scan_objects);
+
+		pr_buf(out, " objects: %lu", shrinker->count_objects(shrinker, &sc));
+		pr_newline(out);
+
+		if (shrinker->to_text) {
+			pr_indent_push(out, 2);
+			shrinker->to_text(out, shrinker);
+			pr_indent_pop(out, 2);
+			pr_newline(out);
+		}
+	}
+
+	up_read(&shrinker_rwsem);
+}
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-- 
2.35.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-19 20:31 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
  2022-04-19 20:31 ` [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings Kent Overstreet
  2022-04-19 20:32 ` [PATCH 2/4] mm: Add a .to_text() method for shrinkers Kent Overstreet
@ 2022-04-19 20:32 ` Kent Overstreet
  2022-04-20  6:58   ` Michal Hocko
  2022-04-19 20:32 ` [PATCH 4/4] bcachefs: shrinker.to_text() methods Kent Overstreet
  3 siblings, 1 reply; 21+ messages in thread
From: Kent Overstreet @ 2022-04-19 20:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Kent Overstreet, roman.gushchin, hannes

This patch:
 - Moves lib/show_mem.c to mm/show_mem.c
 - Changes show_mem() to always report on slab usage
 - Instead of reporting on all slabs, we only report on top 10 slabs,
   and in sorted order
 - Also reports on shrinkers, with the new shrinkers_to_text().

More OOM reporting can be moved to show_mem.c and improved, this patch
is only a small start.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 lib/Makefile           |  2 +-
 mm/Makefile            |  2 +-
 mm/oom_kill.c          | 23 ------------------
 {lib => mm}/show_mem.c | 14 +++++++++++
 mm/slab.h              |  6 +++--
 mm/slab_common.c       | 53 +++++++++++++++++++++++++++++++++++-------
 6 files changed, 65 insertions(+), 35 deletions(-)
 rename {lib => mm}/show_mem.c (78%)

diff --git a/lib/Makefile b/lib/Makefile
index 31a3904eda..c5041d33d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -30,7 +30,7 @@ endif
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o timerqueue.o xarray.o \
 	 idr.o extable.o sha1.o irq_regs.o argv_split.o \
-	 flex_proportions.o ratelimit.o show_mem.o \
+	 flex_proportions.o ratelimit.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
 	 nmi_backtrace.o nodemask.o win_minmax.o memcat_p.o \
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9c..97c0be12f3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o show_mem.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 832fb33037..659c7d6376 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -171,27 +171,6 @@ static bool oom_unkillable_task(struct task_struct *p)
 	return false;
 }
 
-/*
- * Check whether unreclaimable slab amount is greater than
- * all user memory(LRU pages).
- * dump_unreclaimable_slab() could help in the case that
- * oom due to too much unreclaimable slab used by kernel.
-*/
-static bool should_dump_unreclaim_slab(void)
-{
-	unsigned long nr_lru;
-
-	nr_lru = global_node_page_state(NR_ACTIVE_ANON) +
-		 global_node_page_state(NR_INACTIVE_ANON) +
-		 global_node_page_state(NR_ACTIVE_FILE) +
-		 global_node_page_state(NR_INACTIVE_FILE) +
-		 global_node_page_state(NR_ISOLATED_ANON) +
-		 global_node_page_state(NR_ISOLATED_FILE) +
-		 global_node_page_state(NR_UNEVICTABLE);
-
-	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
-}
-
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -465,8 +444,6 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 		mem_cgroup_print_oom_meminfo(oc->memcg);
 	else {
 		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
-		if (should_dump_unreclaim_slab())
-			dump_unreclaimable_slab();
 	}
 	if (sysctl_oom_dump_tasks)
 		dump_tasks(oc);
diff --git a/lib/show_mem.c b/mm/show_mem.c
similarity index 78%
rename from lib/show_mem.c
rename to mm/show_mem.c
index 1c26c14ffb..c9f37f13d6 100644
--- a/lib/show_mem.c
+++ b/mm/show_mem.c
@@ -7,11 +7,15 @@
 
 #include <linux/mm.h>
 #include <linux/cma.h>
+#include <linux/printbuf.h>
+
+#include "slab.h"
 
 void show_mem(unsigned int filter, nodemask_t *nodemask)
 {
 	pg_data_t *pgdat;
 	unsigned long total = 0, reserved = 0, highmem = 0;
+	struct printbuf buf = PRINTBUF;
 
 	printk("Mem-Info:\n");
 	show_free_areas(filter, nodemask);
@@ -41,4 +45,14 @@ void show_mem(unsigned int filter, nodemask_t *nodemask)
 #ifdef CONFIG_MEMORY_FAILURE
 	printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
 #endif
+
+	pr_info("Unreclaimable slab info:\n");
+	dump_unreclaimable_slab(&buf);
+	printk("%s", buf.buf);
+	printbuf_reset(&buf);
+
+	printk("Shrinkers:\n");
+	shrinkers_to_text(&buf);
+	printk("%s", buf.buf);
+	printbuf_exit(&buf);
 }
diff --git a/mm/slab.h b/mm/slab.h
index c7f2abc2b1..abefbf7674 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -788,10 +788,12 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 
 #endif
 
+struct printbuf;
+
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
-void dump_unreclaimable_slab(void);
+void dump_unreclaimable_slab(struct printbuf *);
 #else
-static inline void dump_unreclaimable_slab(void)
+static inline void dump_unreclaimable_slab(struct printbuf *out)
 {
 }
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23f2ab0713..cb1c548c73 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -24,6 +24,7 @@
 #include <asm/tlbflush.h>
 #include <asm/page.h>
 #include <linux/memcontrol.h>
+#include <linux/printbuf.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kmem.h>
@@ -1084,10 +1085,15 @@ static int slab_show(struct seq_file *m, void *p)
 	return 0;
 }
 
-void dump_unreclaimable_slab(void)
+void dump_unreclaimable_slab(struct printbuf *out)
 {
 	struct kmem_cache *s;
 	struct slabinfo sinfo;
+	struct slab_by_mem {
+		struct kmem_cache *s;
+		size_t total, active;
+	} slabs_by_mem[10], n;
+	int i, nr = 0;
 
 	/*
 	 * Here acquiring slab_mutex is risky since we don't prefer to get
@@ -1097,12 +1103,11 @@ void dump_unreclaimable_slab(void)
 	 * without acquiring the mutex.
 	 */
 	if (!mutex_trylock(&slab_mutex)) {
-		pr_warn("excessive unreclaimable slab but cannot dump stats\n");
+		pr_buf(out, "excessive unreclaimable slab but cannot dump stats\n");
 		return;
 	}
 
-	pr_info("Unreclaimable slab info:\n");
-	pr_info("Name                      Used          Total\n");
+	buf->atomic++;
 
 	list_for_each_entry(s, &slab_caches, list) {
 		if (s->flags & SLAB_RECLAIM_ACCOUNT)
@@ -1110,11 +1115,43 @@ void dump_unreclaimable_slab(void)
 
 		get_slabinfo(s, &sinfo);
 
-		if (sinfo.num_objs > 0)
-			pr_info("%-17s %10luKB %10luKB\n", s->name,
-				(sinfo.active_objs * s->size) / 1024,
-				(sinfo.num_objs * s->size) / 1024);
+		if (!sinfo.num_objs)
+			continue;
+
+		n.s = s;
+		n.total = sinfo.num_objs * s->size;
+		n.active = sinfo.active_objs * s->size;
+
+		for (i = 0; i < nr; i++)
+			if (n.total < slabs_by_mem[i].total)
+				break;
+
+		if (nr < ARRAY_SIZE(slabs_by_mem)) {
+			memmove(&slabs_by_mem[i + 1],
+				&slabs_by_mem[i],
+				sizeof(slabs_by_mem[0]) * (nr - i));
+			nr++;
+		} else if (i) {
+			i--;
+			memmove(&slabs_by_mem[0],
+				&slabs_by_mem[1],
+				sizeof(slabs_by_mem[0]) * i);
+		} else {
+			continue;
+		}
+
+		slabs_by_mem[i] = n;
+	}
+
+	for (i = nr - 1; i >= 0; --i) {
+		pr_buf(out, "%-17s total: ", slabs_by_mem[i].s->name);
+		pr_human_readable_u64(out, slabs_by_mem[i].total);
+		pr_buf(out, " active: ");
+		pr_human_readable_u64(out, slabs_by_mem[i].active);
+		pr_newline(out);
 	}
+
+	--buf->atomic;
 	mutex_unlock(&slab_mutex);
 }
 
-- 
2.35.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/4] bcachefs: shrinker.to_text() methods
  2022-04-19 20:31 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
                   ` (2 preceding siblings ...)
  2022-04-19 20:32 ` [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c Kent Overstreet
@ 2022-04-19 20:32 ` Kent Overstreet
  3 siblings, 0 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-19 20:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Kent Overstreet, roman.gushchin

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/bcachefs/btree_cache.c     | 18 +++++++++++++++---
 fs/bcachefs/btree_key_cache.c | 18 +++++++++++++++---
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/bcachefs/btree_cache.c b/fs/bcachefs/btree_cache.c
index 72f0587e4d..75ef3b5462 100644
--- a/fs/bcachefs/btree_cache.c
+++ b/fs/bcachefs/btree_cache.c
@@ -394,6 +394,14 @@ static unsigned long bch2_btree_cache_count(struct shrinker *shrink,
 	return btree_cache_can_free(bc);
 }
 
+static void bch2_btree_cache_shrinker_to_text(struct printbuf *out, struct shrinker *shrink)
+{
+	struct bch_fs *c = container_of(shrink, struct bch_fs,
+					btree_cache.shrink);
+
+	bch2_btree_cache_to_text(out, c);
+}
+
 void bch2_fs_btree_cache_exit(struct bch_fs *c)
 {
 	struct btree_cache *bc = &c->btree_cache;
@@ -477,6 +485,7 @@ int bch2_fs_btree_cache_init(struct bch_fs *c)
 
 	bc->shrink.count_objects	= bch2_btree_cache_count;
 	bc->shrink.scan_objects		= bch2_btree_cache_scan;
+	bc->shrink.to_text		= bch2_btree_cache_shrinker_to_text;
 	bc->shrink.seeks		= 4;
 	ret = register_shrinker(&bc->shrink);
 out:
@@ -1147,7 +1156,10 @@ void bch2_btree_node_to_text(struct printbuf *out, struct bch_fs *c,
 
 void bch2_btree_cache_to_text(struct printbuf *out, struct bch_fs *c)
 {
-	pr_buf(out, "nr nodes:\t\t%u\n", c->btree_cache.used);
-	pr_buf(out, "nr dirty:\t\t%u\n", atomic_read(&c->btree_cache.dirty));
-	pr_buf(out, "cannibalize lock:\t%p\n", c->btree_cache.alloc_lock);
+	pr_buf(out, "nr nodes:          %u", c->btree_cache.used);
+	pr_newline(out);
+	pr_buf(out, "nr dirty:          %u", atomic_read(&c->btree_cache.dirty));
+	pr_newline(out);
+	pr_buf(out, "cannibalize lock:  %p", c->btree_cache.alloc_lock);
+	pr_newline(out);
 }
diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
index a575189f35..32b5cb6042 100644
--- a/fs/bcachefs/btree_key_cache.c
+++ b/fs/bcachefs/btree_key_cache.c
@@ -711,6 +711,14 @@ void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
 	INIT_LIST_HEAD(&c->freed);
 }
 
+static void bch2_btree_key_cache_shrinker_to_text(struct printbuf *out, struct shrinker *shrink)
+{
+	struct btree_key_cache *bc =
+		container_of(shrink, struct btree_key_cache, shrink);
+
+	bch2_btree_key_cache_to_text(out, bc);
+}
+
 int bch2_fs_btree_key_cache_init(struct btree_key_cache *c)
 {
 	int ret;
@@ -724,14 +732,18 @@ int bch2_fs_btree_key_cache_init(struct btree_key_cache *c)
 	c->shrink.seeks			= 1;
 	c->shrink.count_objects		= bch2_btree_key_cache_count;
 	c->shrink.scan_objects		= bch2_btree_key_cache_scan;
+	c->shrink.to_text		= bch2_btree_key_cache_shrinker_to_text;
 	return register_shrinker(&c->shrink);
 }
 
 void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *c)
 {
-	pr_buf(out, "nr_freed:\t%zu\n",	c->nr_freed);
-	pr_buf(out, "nr_keys:\t%lu\n",	atomic_long_read(&c->nr_keys));
-	pr_buf(out, "nr_dirty:\t%lu\n",	atomic_long_read(&c->nr_dirty));
+	pr_buf(out, "nr_freed:  %zu",	c->nr_freed);
+	pr_newline(out);
+	pr_buf(out, "nr_keys:   %zu",	atomic_long_read(&c->nr_keys));
+	pr_newline(out);
+	pr_buf(out, "nr_dirty:  %zu",	atomic_long_read(&c->nr_dirty));
+	pr_newline(out);
 }
 
 void bch2_btree_key_cache_exit(void)
-- 
2.35.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings
  2022-04-19 20:31 ` [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings Kent Overstreet
@ 2022-04-19 21:11   ` Matthew Wilcox
  2022-04-20  0:12     ` Kent Overstreet
  2022-04-20  5:02   ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2022-04-19 21:11 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Tue, Apr 19, 2022 at 04:31:59PM -0400, Kent Overstreet wrote:
> +static const char si_units[] = "?kMGTPEZY";
> +
> +void pr_human_readable_u64(struct printbuf *buf, u64 v)

The person who wrote string_get_size() spent a lot more time thinking
about corner-cases than you did ;-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings
  2022-04-19 21:11   ` Matthew Wilcox
@ 2022-04-20  0:12     ` Kent Overstreet
  0 siblings, 0 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-20  0:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Tue, Apr 19, 2022 at 10:11:51PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 19, 2022 at 04:31:59PM -0400, Kent Overstreet wrote:
> > +static const char si_units[] = "?kMGTPEZY";
> > +
> > +void pr_human_readable_u64(struct printbuf *buf, u64 v)
> 
> The person who wrote string_get_size() spent a lot more time thinking
> about corner-cases than you did ;-)

Didn't know about that until today :)

Just commited this:

commit 3cf1cb86ee4c4a7e75bcd52082ba0df1c436d692
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date:   Tue Apr 19 17:29:43 2022 -0400

    lib/printbuf: Switch to string_get_size()
    
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

diff --git a/include/linux/printbuf.h b/include/linux/printbuf.h
index 84a271446d..2d23506114 100644
--- a/include/linux/printbuf.h
+++ b/include/linux/printbuf.h
@@ -91,11 +91,13 @@ struct printbuf {
 	enum printbuf_units	units:8;
 	u8			atomic;
 	bool			allocation_failure:1;
+	/* SI units (10^3), or 2^10: */
+	enum string_size_units	human_readable_units:1;
 	u8			tabstop;
 	u8			tabstops[4];
 };
 
-#define PRINTBUF ((struct printbuf) { NULL })
+#define PRINTBUF ((struct printbuf) { .human_readable_units = STRING_UNITS_2 })
 
 /**
  * printbuf_exit - exit a printbuf, freeing memory it owns and poisoning it
diff --git a/lib/printbuf.c b/lib/printbuf.c
index a9d5dff81e..ecbce6670f 100644
--- a/lib/printbuf.c
+++ b/lib/printbuf.c
@@ -10,6 +10,7 @@
 
 #include <linux/log2.h>
 #include <linux/printbuf.h>
+#include <linux/string_helpers.h>
 
 static inline size_t printbuf_remaining(struct printbuf *buf)
 {
@@ -144,27 +145,11 @@ void pr_tab_rjust(struct printbuf *buf)
 }
 EXPORT_SYMBOL(pr_tab_rjust);
 
-static const char si_units[] = "?kMGTPEZY";
-
 void pr_human_readable_u64(struct printbuf *buf, u64 v)
 {
-	int u, t = 0;
-
-	for (u = 0; v >= 1024; u++) {
-		t = v & ~(~0U << 10);
-		v >>= 10;
-	}
-
-	pr_buf(buf, "%llu", v);
-
-	/*
-	 * 103 is magic: t is in the range [-1023, 1023] and we want
-	 * to turn it into [-9, 9]
-	 */
-	if (u && t && v < 100 && v > -100)
-		pr_buf(buf, ".%i", t / 103);
-	if (u)
-		pr_char(buf, si_units[u]);
+	printbuf_realloc(buf, 10);
+	string_get_size(v, 1, buf->human_readable_units, buf->buf + buf->pos,
+			printbuf_remaining(buf));
 }
 EXPORT_SYMBOL(pr_human_readable_u64);
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings
  2022-04-19 20:31 ` [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings Kent Overstreet
  2022-04-19 21:11   ` Matthew Wilcox
@ 2022-04-20  5:02   ` Christoph Hellwig
  2022-04-20  5:18     ` Kent Overstreet
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2022-04-20  5:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Tue, Apr 19, 2022 at 04:31:59PM -0400, Kent Overstreet wrote:
> This adds printbufs: simple heap-allocated strings meant for building up
> structured messages, for logging/procfs/sysfs and elsewhere. They've
> been heavily used in bcachefs for writing .to_text() functions/methods -
> pretty printers, which has in turn greatly improved the overall quality
> of error messages.

How does this use case differ from that of lib/seq_buf.c?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings
  2022-04-20  5:02   ` Christoph Hellwig
@ 2022-04-20  5:18     ` Kent Overstreet
  0 siblings, 0 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-20  5:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Tue, Apr 19, 2022 at 10:02:20PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 19, 2022 at 04:31:59PM -0400, Kent Overstreet wrote:
> > This adds printbufs: simple heap-allocated strings meant for building up
> > structured messages, for logging/procfs/sysfs and elsewhere. They've
> > been heavily used in bcachefs for writing .to_text() functions/methods -
> > pretty printers, which has in turn greatly improved the overall quality
> > of error messages.
> 
> How does this use case differ from that of lib/seq_buf.c?

I hadn't come across that code before, thanks for pointing it out :)

seq_buf.c looks exactly like an older version of printbufs, from before I added
the auto heap allocating functionality. seq_buf.c could be dropped, I could go
ahead and convert existing users to printbufs.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-19 20:32 ` [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c Kent Overstreet
@ 2022-04-20  6:58   ` Michal Hocko
  2022-04-20 16:58     ` Kent Overstreet
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2022-04-20  6:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Tue 19-04-22 16:32:01, Kent Overstreet wrote:
> This patch:
>  - Moves lib/show_mem.c to mm/show_mem.c

Sure, why not. Should be a separate patch.

>  - Changes show_mem() to always report on slab usage
>  - Instead of reporting on all slabs, we only report on top 10 slabs,
>    and in sorted order
>  - Also reports on shrinkers, with the new shrinkers_to_text().

Why do we need/want this? It would be also great to provide an example
of why the new output is better (in which cases) than the existing one.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-20  6:58   ` Michal Hocko
@ 2022-04-20 16:58     ` Kent Overstreet
  2022-04-21  9:18       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Kent Overstreet @ 2022-04-20 16:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Wed, Apr 20, 2022 at 08:58:36AM +0200, Michal Hocko wrote:
> On Tue 19-04-22 16:32:01, Kent Overstreet wrote:
> > This patch:
> >  - Moves lib/show_mem.c to mm/show_mem.c
> 
> Sure, why not. Should be a separate patch.
> 
> >  - Changes show_mem() to always report on slab usage
> >  - Instead of reporting on all slabs, we only report on top 10 slabs,
> >    and in sorted order
> >  - Also reports on shrinkers, with the new shrinkers_to_text().
> 
> Why do we need/want this? It would be also great to provide an example
> of why the new output is better (in which cases) than the existing one.

Did you read the cover letter to the patch series?

But sure, I can give you an example of the new output:

00177 018: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
00177 CPU: 0 PID: 32171 Comm: 018 Not tainted 5.17.0-01346-g09b56740d418-dirty #154
00177 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
00177 Call Trace:
00177  <TASK>
00177  dump_stack_lvl+0x38/0x49
00177  dump_stack+0x10/0x12
00177  warn_alloc+0x128/0x150
00177  ? __alloc_pages_direct_compact+0x171/0x1f0
00177  __alloc_pages_slowpath.constprop.0+0xac6/0xc30
00177  ? make_kgid+0x17/0x20
00177  ? p9pdu_readf+0x28c/0xb00
00177  __alloc_pages+0x215/0x230
00177  kmalloc_order+0x30/0x80
00177  kmalloc_order_trace+0x1d/0x80
00177  __kmalloc+0x1a2/0x1d0
00177  v9fs_alloc_rdir_buf.isra.0+0x28/0x40
00177  v9fs_dir_readdir_dotl+0x55/0x160
00177  ? __alloc_pages+0x151/0x230
00177  ? lru_cache_add+0x1c/0x20
00177  ? lru_cache_add_inactive_or_unevictable+0x27/0x80
00177  ? __handle_mm_fault+0x666/0xae0
00177  iterate_dir+0x151/0x1b0
00177  __x64_sys_getdents64+0x80/0x120
00177  ? compat_fillonedir+0x160/0x160
00177  do_syscall_64+0x35/0x80
00177  entry_SYSCALL_64_after_hwframe+0x44/0xae
00177 RIP: 0033:0x7f0b15e2f9c7
00177 Code: 00 00 0f 05 eb b3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 81 fa ff ff ff 7f b8 ff ff ff 7f 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 79 b4 10 00 f7 d8 64 89 02 48
00177 RSP: 002b:00007ffcf9445b88 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
00177 RAX: ffffffffffffffda RBX: 000056137dd23ba0 RCX: 00007f0b15e2f9c7
00177 RDX: 000000000001f000 RSI: 000056137dd23bd0 RDI: 0000000000000003
00177 RBP: 000056137dd23bd0 R08: 0000000000000030 R09: 00007f0b15f3bc00
00177 R10: fffffffffffff776 R11: 0000000000000293 R12: ffffffffffffff88
00177 R13: 000056137dd23ba4 R14: 0000000000000000 R15: 000056137dd23ba0
00177  </TASK>
00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177  active_file:1653 inactive_file:1822 isolated_file:0
00177  unevictable:0 dirty:0 writeback:0
00177  slab_reclaimable:6242 slab_unreclaimable:11168
00177  mapped:3824 shmem:3 pagetables:1266 bounce:0
00177  kernel_misc_reclaimable:0
00177  free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap  = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
00177 task_struct       total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
00177 perf_event        total: 1.02 MiB active: 1.02 MiB
00177 biovec-max        total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177   nr nodes:          24
00177   nr dirty:          0
00177   cannibalize lock:  0000000000000000
00177 
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-20 16:58     ` Kent Overstreet
@ 2022-04-21  9:18       ` Michal Hocko
  2022-04-21 18:42         ` Kent Overstreet
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2022-04-21  9:18 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Wed 20-04-22 12:58:05, Kent Overstreet wrote:
> On Wed, Apr 20, 2022 at 08:58:36AM +0200, Michal Hocko wrote:
> > On Tue 19-04-22 16:32:01, Kent Overstreet wrote:
> > > This patch:
> > >  - Moves lib/show_mem.c to mm/show_mem.c
> > 
> > Sure, why not. Should be a separate patch.
> > 
> > >  - Changes show_mem() to always report on slab usage
> > >  - Instead of reporting on all slabs, we only report on top 10 slabs,
> > >    and in sorted order
> > >  - Also reports on shrinkers, with the new shrinkers_to_text().
> > 
> > Why do we need/want this? It would be also great to provide an example
> > of why the new output is better (in which cases) than the existing one.
> 
> Did you read the cover letter to the patch series?

Nope, only this one made it into my inbox based on my filters. I usually
try to fish out other parts of the thread but I didn't this time.
Besides it is always better to have a full patch description explain not
only what has been changed but why as well.

> But sure, I can give you an example of the new output:

Calling out the changes would be really helpful, but I guess the crux 
is here.

> 00177 16644 pages reserved
> 00177 Unreclaimable slab info:
> 00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
> 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
> 00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
> 00177 task_struct       total: 1.95 MiB active: 1.95 MiB
> 00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
> 00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
> 00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
> 00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
> 00177 perf_event        total: 1.02 MiB active: 1.02 MiB
> 00177 biovec-max        total: 992 KiB active: 960 KiB
> 00177 Shrinkers:
> 00177 super_cache_scan: objects: 127
> 00177 super_cache_scan: objects: 106
> 00177 jbd2_journal_shrink_scan: objects: 32
> 00177 ext4_es_scan: objects: 32
> 00177 bch2_btree_cache_scan: objects: 8
> 00177   nr nodes:          24
> 00177   nr dirty:          0
> 00177   cannibalize lock:  0000000000000000
> 00177 
> 00177 super_cache_scan: objects: 8
> 00177 super_cache_scan: objects: 1

How does this help to analyze this allocation failure?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-21  9:18       ` Michal Hocko
@ 2022-04-21 18:42         ` Kent Overstreet
  2022-04-22  8:03           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Kent Overstreet @ 2022-04-21 18:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Thu, Apr 21, 2022 at 11:18:20AM +0200, Michal Hocko wrote:
> On Wed 20-04-22 12:58:05, Kent Overstreet wrote:
> > On Wed, Apr 20, 2022 at 08:58:36AM +0200, Michal Hocko wrote:
> > > On Tue 19-04-22 16:32:01, Kent Overstreet wrote:
> > > > This patch:
> > > >  - Moves lib/show_mem.c to mm/show_mem.c
> > > 
> > > Sure, why not. Should be a separate patch.
> > > 
> > > >  - Changes show_mem() to always report on slab usage
> > > >  - Instead of reporting on all slabs, we only report on top 10 slabs,
> > > >    and in sorted order
> > > >  - Also reports on shrinkers, with the new shrinkers_to_text().
> > > 
> > > Why do we need/want this? It would be also great to provide an example
> > > of why the new output is better (in which cases) than the existing one.
> > 
> > Did you read the cover letter to the patch series?
> 
> Nope, only this one made it into my inbox based on my filters. I usually
> try to fish out other parts of the thread but I didn't this time.
> Besides it is always better to have a full patch description explain not
> only what has been changed but why as well.
> 
> > But sure, I can give you an example of the new output:
> 
> Calling out the changes would be really helpful, but I guess the crux 
> is here.
> 
> > 00177 16644 pages reserved
> > 00177 Unreclaimable slab info:
> > 00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
> > 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
> > 00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
> > 00177 task_struct       total: 1.95 MiB active: 1.95 MiB
> > 00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
> > 00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
> > 00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
> > 00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
> > 00177 perf_event        total: 1.02 MiB active: 1.02 MiB
> > 00177 biovec-max        total: 992 KiB active: 960 KiB
> > 00177 Shrinkers:
> > 00177 super_cache_scan: objects: 127
> > 00177 super_cache_scan: objects: 106
> > 00177 jbd2_journal_shrink_scan: objects: 32
> > 00177 ext4_es_scan: objects: 32
> > 00177 bch2_btree_cache_scan: objects: 8
> > 00177   nr nodes:          24
> > 00177   nr dirty:          0
> > 00177   cannibalize lock:  0000000000000000
> > 00177 
> > 00177 super_cache_scan: objects: 8
> > 00177 super_cache_scan: objects: 1
> 
> How does this help to analyze this allocation failure?

You asked for an example of the output, which was an entirely reasonable
request. Shrinkers weren't responsible for this OOM, so it doesn't help here -
are you asking me to explain why shrinkers are relevant to OOMs and memory
reclaim...?

Since shrinkers own and, critically, _are responsible for freeing memory_, a
shrinker not giving up memory when asked (or perhaps not being asked to give up
memory) can cause an OOM. A starting point - not an end - if we want to improve
OOM debugging is at least being able to see how much memory each shrinker owns.
Since we don't even have that, number of objects will have to do.

The reason for adding the .to_text() callback is that shrinkers have internal
state that affects whether they are able to give up objects when asked - the
primary example being object dirtyness.

If your system is using a ton of memory caching inodes, and something's wedged
writeback, and they're nearly all dirty - you're going to have a bad day.

The bcachefs btree node node shrinker included as an example of what we can do
with this: internally we may have to allocate new btree nodes by reclaiming from
our own cache, and we have a lock to prevent multiple threads from doing this at
the same time, and this lock also blocks the shrinker from freeing more memory
until we're done.

In filesystem land, debugging memory reclaim issues is a rather painful topic
that often comes up, this is a starting point...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-21 18:42         ` Kent Overstreet
@ 2022-04-22  8:03           ` Michal Hocko
  2022-04-22  8:30             ` Kent Overstreet
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2022-04-22  8:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Thu 21-04-22 14:42:13, Kent Overstreet wrote:
> On Thu, Apr 21, 2022 at 11:18:20AM +0200, Michal Hocko wrote:
[...]
> > > 00177 16644 pages reserved
> > > 00177 Unreclaimable slab info:
> > > 00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
> > > 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
> > > 00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
> > > 00177 task_struct       total: 1.95 MiB active: 1.95 MiB
> > > 00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
> > > 00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
> > > 00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
> > > 00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
> > > 00177 perf_event        total: 1.02 MiB active: 1.02 MiB
> > > 00177 biovec-max        total: 992 KiB active: 960 KiB
> > > 00177 Shrinkers:
> > > 00177 super_cache_scan: objects: 127
> > > 00177 super_cache_scan: objects: 106
> > > 00177 jbd2_journal_shrink_scan: objects: 32
> > > 00177 ext4_es_scan: objects: 32
> > > 00177 bch2_btree_cache_scan: objects: 8
> > > 00177   nr nodes:          24
> > > 00177   nr dirty:          0
> > > 00177   cannibalize lock:  0000000000000000
> > > 00177 
> > > 00177 super_cache_scan: objects: 8
> > > 00177 super_cache_scan: objects: 1
> > 
> > How does this help to analyze this allocation failure?
> 
> You asked for an example of the output, which was an entirely reasonable
> request. Shrinkers weren't responsible for this OOM, so it doesn't help here -

OK, do you have an example where it clearly helps?

> are you asking me to explain why shrinkers are relevant to OOMs and memory
> reclaim...?

No, not really, I guess that is quite clear. The thing is that the oom
report is quite bloated already and we should be rather picky on what to
dump there. Your above example is a good one here. You have an order-5
allocation failure and that can be caused by almost anything. Compaction
not making progress for many reasons - e.g. internal framentation caused
by pinned pages but also kmalloc allocations. The above output doesn't
help with any of that. Could shrinkers operation be related? Of course
it could but how can I tell?

We already dump slab data when the slab usage is excessive for the oom
killer report and that was a very useful addition in many cases and it
is bound to cases where slab consumption could be the primary source of
the OOM condition.

That being said the additional output should be at least conditional and
reported when there is a chance that it could help with analysis.

> Since shrinkers own and, critically, _are responsible for freeing memory_, a
> shrinker not giving up memory when asked (or perhaps not being asked to give up
> memory) can cause an OOM. A starting point - not an end - if we want to improve
> OOM debugging is at least being able to see how much memory each shrinker owns.
> Since we don't even have that, number of objects will have to do.
> 
> The reason for adding the .to_text() callback is that shrinkers have internal
> state that affects whether they are able to give up objects when asked - the
> primary example being object dirtyness.
> 
> If your system is using a ton of memory caching inodes, and something's wedged
> writeback, and they're nearly all dirty - you're going to have a bad day.
> 
> The bcachefs btree node node shrinker included as an example of what we can do
> with this: internally we may have to allocate new btree nodes by reclaiming from
> our own cache, and we have a lock to prevent multiple threads from doing this at
> the same time, and this lock also blocks the shrinker from freeing more memory
> until we're done.
> 
> In filesystem land, debugging memory reclaim issues is a rather painful topic
> that often comes up, this is a starting point...

I completely understand the frustration. I've been analyzing oom reports
for years and I can tell that the existing report is quite good but
in many cases the information we provide is still insufficient. My
experience also tells me that those cases are usually very special and
a specific data dumped for them wouldn't be all that useful in the
majority of cases.

If we are lucky enough the oom is reproducible and additional
tracepoints (or whatever your prefer to use) tell us more. Far from
optimal, no question about that but I do not have a good answer on
where the trashhold should really be. Maybe we can come up with some
trigger based mechanism (e.g. some shrinkers are failing so they
register their debugging data which will get dumped on the OOM) which
would enable certain debugging information or something like that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-22  8:03           ` Michal Hocko
@ 2022-04-22  8:30             ` Kent Overstreet
  2022-04-22  9:27               ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Kent Overstreet @ 2022-04-22  8:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Fri, Apr 22, 2022 at 10:03:36AM +0200, Michal Hocko wrote:
> On Thu 21-04-22 14:42:13, Kent Overstreet wrote:
> > On Thu, Apr 21, 2022 at 11:18:20AM +0200, Michal Hocko wrote:
> [...]
> > > > 00177 16644 pages reserved
> > > > 00177 Unreclaimable slab info:
> > > > 00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
> > > > 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
> > > > 00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
> > > > 00177 task_struct       total: 1.95 MiB active: 1.95 MiB
> > > > 00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
> > > > 00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
> > > > 00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
> > > > 00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
> > > > 00177 perf_event        total: 1.02 MiB active: 1.02 MiB
> > > > 00177 biovec-max        total: 992 KiB active: 960 KiB
> > > > 00177 Shrinkers:
> > > > 00177 super_cache_scan: objects: 127
> > > > 00177 super_cache_scan: objects: 106
> > > > 00177 jbd2_journal_shrink_scan: objects: 32
> > > > 00177 ext4_es_scan: objects: 32
> > > > 00177 bch2_btree_cache_scan: objects: 8
> > > > 00177   nr nodes:          24
> > > > 00177   nr dirty:          0
> > > > 00177   cannibalize lock:  0000000000000000
> > > > 00177 
> > > > 00177 super_cache_scan: objects: 8
> > > > 00177 super_cache_scan: objects: 1
> > > 
> > > How does this help to analyze this allocation failure?
> > 
> > You asked for an example of the output, which was an entirely reasonable
> > request. Shrinkers weren't responsible for this OOM, so it doesn't help here -
> 
> OK, do you have an example where it clearly helps?

I've debugged quite a few issues with shrinkers over the years where this would
have helped a lot (especially if it was also in sysfs), although nothing
currently. I was just talking with Dave earlier tonight about more things that
could be added for shrinkers, but I'm going to have to go over that conversation
again and take notes.

Also, I feel I have to point out that OOM & memory reclaim debugging is an area
where many filesystem developers feel that the MM people have been dropping the
ball, and your initial response to this patch series...  well, it feels like
more of the same.

Still does to be honest, you're coming across like I haven't been working in
this area for a decade+ and don't know what I'm touching. Really, I'm not new to
this stuff.

> > are you asking me to explain why shrinkers are relevant to OOMs and memory
> > reclaim...?
> 
> No, not really, I guess that is quite clear. The thing is that the oom
> report is quite bloated already and we should be rather picky on what to
> dump there. Your above example is a good one here. You have an order-5
> allocation failure and that can be caused by almost anything. Compaction
> not making progress for many reasons - e.g. internal framentation caused
> by pinned pages but also kmalloc allocations. The above output doesn't
> help with any of that. Could shrinkers operation be related? Of course
> it could but how can I tell?

Yeah sure and internal fragmentation would actually be an _excellent_ thing to
add to the show_mem report.

> We already dump slab data when the slab usage is excessive for the oom
> killer report and that was a very useful addition in many cases and it
> is bound to cases where slab consumption could be the primary source of
> the OOM condition.
> 
> That being said the additional output should be at least conditional and
> reported when there is a chance that it could help with analysis.

These things don't need to be conditional if we're more carefully selective
about how we report, instead of just dumping everything like we currently do
with slab info.

We don't need to report on all the slabs that are barely used - if you'll read
my patch and example output, which changes it to the top 10 slabs by memory
usage.

I feel like I keep repeating myself here. It would help if you would make more
of an effort to follow the full patch series and descriptions before commenting
generically.

> > Since shrinkers own and, critically, _are responsible for freeing memory_, a
> > shrinker not giving up memory when asked (or perhaps not being asked to give up
> > memory) can cause an OOM. A starting point - not an end - if we want to improve
> > OOM debugging is at least being able to see how much memory each shrinker owns.
> > Since we don't even have that, number of objects will have to do.
> > 
> > The reason for adding the .to_text() callback is that shrinkers have internal
> > state that affects whether they are able to give up objects when asked - the
> > primary example being object dirtyness.
> > 
> > If your system is using a ton of memory caching inodes, and something's wedged
> > writeback, and they're nearly all dirty - you're going to have a bad day.
> > 
> > The bcachefs btree node node shrinker included as an example of what we can do
> > with this: internally we may have to allocate new btree nodes by reclaiming from
> > our own cache, and we have a lock to prevent multiple threads from doing this at
> > the same time, and this lock also blocks the shrinker from freeing more memory
> > until we're done.
> > 
> > In filesystem land, debugging memory reclaim issues is a rather painful topic
> > that often comes up, this is a starting point...
> 
> I completely understand the frustration. I've been analyzing oom reports
> for years and I can tell that the existing report is quite good but
> in many cases the information we provide is still insufficient. My
> experience also tells me that those cases are usually very special and
> a specific data dumped for them wouldn't be all that useful in the
> majority of cases.
> 
> If we are lucky enough the oom is reproducible and additional
> tracepoints (or whatever your prefer to use) tell us more. Far from
> optimal, no question about that but I do not have a good answer on
> where the trashhold should really be. Maybe we can come up with some
> trigger based mechanism (e.g. some shrinkers are failing so they
> register their debugging data which will get dumped on the OOM) which
> would enable certain debugging information or something like that.

Why would we need a trigger mechanism?

Could you explain your objection to simply unconditionally dumping the top 10
slabs and the top 10 shrinkers?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-22  8:30             ` Kent Overstreet
@ 2022-04-22  9:27               ` Michal Hocko
  2022-04-22  9:44                 ` Kent Overstreet
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2022-04-22  9:27 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Fri 22-04-22 04:30:37, Kent Overstreet wrote:
> On Fri, Apr 22, 2022 at 10:03:36AM +0200, Michal Hocko wrote:
> > On Thu 21-04-22 14:42:13, Kent Overstreet wrote:
> > > On Thu, Apr 21, 2022 at 11:18:20AM +0200, Michal Hocko wrote:
> > [...]
> > > > > 00177 16644 pages reserved
> > > > > 00177 Unreclaimable slab info:
> > > > > 00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
> > > > > 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
> > > > > 00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
> > > > > 00177 task_struct       total: 1.95 MiB active: 1.95 MiB
> > > > > 00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
> > > > > 00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
> > > > > 00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
> > > > > 00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
> > > > > 00177 perf_event        total: 1.02 MiB active: 1.02 MiB
> > > > > 00177 biovec-max        total: 992 KiB active: 960 KiB
> > > > > 00177 Shrinkers:
> > > > > 00177 super_cache_scan: objects: 127
> > > > > 00177 super_cache_scan: objects: 106
> > > > > 00177 jbd2_journal_shrink_scan: objects: 32
> > > > > 00177 ext4_es_scan: objects: 32
> > > > > 00177 bch2_btree_cache_scan: objects: 8
> > > > > 00177   nr nodes:          24
> > > > > 00177   nr dirty:          0
> > > > > 00177   cannibalize lock:  0000000000000000
> > > > > 00177 
> > > > > 00177 super_cache_scan: objects: 8
> > > > > 00177 super_cache_scan: objects: 1
> > > > 
> > > > How does this help to analyze this allocation failure?
> > > 
> > > You asked for an example of the output, which was an entirely reasonable
> > > request. Shrinkers weren't responsible for this OOM, so it doesn't help here -
> > 
> > OK, do you have an example where it clearly helps?
> 
> I've debugged quite a few issues with shrinkers over the years where this would
> have helped a lot (especially if it was also in sysfs), although nothing
> currently. I was just talking with Dave earlier tonight about more things that
> could be added for shrinkers, but I'm going to have to go over that conversation
> again and take notes.
> 
> Also, I feel I have to point out that OOM & memory reclaim debugging is an area
> where many filesystem developers feel that the MM people have been dropping the
> ball, and your initial response to this patch series...  well, it feels like
> more of the same.

Not sure where you get that feeling. Debugging memory reclaim is a PITA
because many problems can be indirect and tools we have available are
not really great. I do not remember MM people would be blocking useful
debugging tools addition.
 
> Still does to be honest, you're coming across like I haven't been working in
> this area for a decade+ and don't know what I'm touching. Really, I'm not new to
> this stuff.

I am sorry to hear that but there certainly is no intention like that
and TBH I do not even see where you get that feeling. You have posted a
changelog which doesn't explain really much. I am aware that you are far
from a kernel newbie and therefore I would really expect much more in
that regards.

> > > are you asking me to explain why shrinkers are relevant to OOMs and memory
> > > reclaim...?
> > 
> > No, not really, I guess that is quite clear. The thing is that the oom
> > report is quite bloated already and we should be rather picky on what to
> > dump there. Your above example is a good one here. You have an order-5
> > allocation failure and that can be caused by almost anything. Compaction
> > not making progress for many reasons - e.g. internal framentation caused
> > by pinned pages but also kmalloc allocations. The above output doesn't
> > help with any of that. Could shrinkers operation be related? Of course
> > it could but how can I tell?
> 
> Yeah sure and internal fragmentation would actually be an _excellent_ thing to
> add to the show_mem report.

Completely agreed. The only information we currently have is the
buddyinfo part which reports movability status but I do not think this
is remotely sufficient.

[...]

> > If we are lucky enough the oom is reproducible and additional
> > tracepoints (or whatever your prefer to use) tell us more. Far from
> > optimal, no question about that but I do not have a good answer on
> > where the trashhold should really be. Maybe we can come up with some
> > trigger based mechanism (e.g. some shrinkers are failing so they
> > register their debugging data which will get dumped on the OOM) which
> > would enable certain debugging information or something like that.
> 
> Why would we need a trigger mechanism?

Mostly because reasons for reclaim failures can vary a lot and the oom
report part doesn't have an idea what has happened during the
reclaim/compaction.

> Could you explain your objection to simply unconditionally dumping the top 10
> slabs and the top 10 shrinkers?

We already do that in some form. We dump unreclaimable slabs if they
consume more memory than user pages on LRUs. We also dump all slab
caches with some objects. Why is this approach not good? Should we tweak
the condition to dump or should we limit the dump? These are reasonable 
questions to ask. Your patch has dropped those without explaining any
of the motivation.

I am perfectly OK to modify should_dump_unreclaim_slab to dump even if
the slab memory consumption is lower. Also dumping small caches with
handful of objects can be excessive.

Wrt to shrinkers I really do not know what kind of shrinkers data would
be useful to dump and when. Therefore I am asking about examples.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-22  9:27               ` Michal Hocko
@ 2022-04-22  9:44                 ` Kent Overstreet
  2022-04-22 10:51                   ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Kent Overstreet @ 2022-04-22  9:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Fri, Apr 22, 2022 at 11:27:05AM +0200, Michal Hocko wrote:
> We already do that in some form. We dump unreclaimable slabs if they
> consume more memory than user pages on LRUs. We also dump all slab
> caches with some objects. Why is this approach not good? Should we tweak
> the condition to dump or should we limit the dump? These are reasonable 
> questions to ask. Your patch has dropped those without explaining any
> of the motivation.
> 
> I am perfectly OK to modify should_dump_unreclaim_slab to dump even if
> the slab memory consumption is lower. Also dumping small caches with
> handful of objects can be excessive.
> 
> Wrt to shrinkers I really do not know what kind of shrinkers data would
> be useful to dump and when. Therefore I am asking about examples.

Look, I've given you the sample output you asked for and explained repeatedly my
rationale and you haven't directly responded; if you have a reason you're
against the patches please say so, but please give your reasoning.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-22  9:44                 ` Kent Overstreet
@ 2022-04-22 10:51                   ` Michal Hocko
  2022-04-22 10:58                     ` Kent Overstreet
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2022-04-22 10:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Fri 22-04-22 05:44:13, Kent Overstreet wrote:
> On Fri, Apr 22, 2022 at 11:27:05AM +0200, Michal Hocko wrote:
> > We already do that in some form. We dump unreclaimable slabs if they
> > consume more memory than user pages on LRUs. We also dump all slab
> > caches with some objects. Why is this approach not good? Should we tweak
> > the condition to dump or should we limit the dump? These are reasonable 
> > questions to ask. Your patch has dropped those without explaining any
> > of the motivation.
> > 
> > I am perfectly OK to modify should_dump_unreclaim_slab to dump even if
> > the slab memory consumption is lower. Also dumping small caches with
> > handful of objects can be excessive.
> > 
> > Wrt to shrinkers I really do not know what kind of shrinkers data would
> > be useful to dump and when. Therefore I am asking about examples.
> 
> Look, I've given you the sample

That sample is of no use as it doesn't really show how the additional
information is useful to analyze the allocation failure. I thought we
have agreed on that. You still haven't given any example where the
information is useful. So I do not really see any reason to change the
existing output.

> output you asked for and explained repeatedly my
> rationale and you haven't directly responded;

Your rationale is that we need more data and I do agree but it is not
clear which data and under which conditions.

> if you have a reason you're
> against the patches please say so, but please give your reasoning.

I have expressed that already, I believe, but let me repeat. I do not
like altering the oom report without a justification on how this new
output is useful. You have failed to explained that so far.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c
  2022-04-22 10:51                   ` Michal Hocko
@ 2022-04-22 10:58                     ` Kent Overstreet
  0 siblings, 0 replies; 21+ messages in thread
From: Kent Overstreet @ 2022-04-22 10:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-fsdevel, roman.gushchin, hannes

On Fri, Apr 22, 2022 at 12:51:09PM +0200, Michal Hocko wrote:
> On Fri 22-04-22 05:44:13, Kent Overstreet wrote:
> > On Fri, Apr 22, 2022 at 11:27:05AM +0200, Michal Hocko wrote:
> > > We already do that in some form. We dump unreclaimable slabs if they
> > > consume more memory than user pages on LRUs. We also dump all slab
> > > caches with some objects. Why is this approach not good? Should we tweak
> > > the condition to dump or should we limit the dump? These are reasonable 
> > > questions to ask. Your patch has dropped those without explaining any
> > > of the motivation.
> > > 
> > > I am perfectly OK to modify should_dump_unreclaim_slab to dump even if
> > > the slab memory consumption is lower. Also dumping small caches with
> > > handful of objects can be excessive.
> > > 
> > > Wrt to shrinkers I really do not know what kind of shrinkers data would
> > > be useful to dump and when. Therefore I am asking about examples.
> > 
> > Look, I've given you the sample
> 
> That sample is of no use as it doesn't really show how the additional
> information is useful to analyze the allocation failure. I thought we
> have agreed on that. You still haven't given any example where the
> information is useful. So I do not really see any reason to change the
> existing output.
> 
> > output you asked for and explained repeatedly my
> > rationale and you haven't directly responded;
> 
> Your rationale is that we need more data and I do agree but it is not
> clear which data and under which conditions.

You're completely mischaractarizing and making this _way_ more complicated than
it has to be, but I'll repeat:

- For the slab changes, top 10 slabs in sorted order, with human readable units
  are _vastly_ easier on human eyes than pages of slab output, in the previous
  format

- Shrinkers weren't reported on before at all, and as shrinkers are part of
  memory reclaim they're pretty integral to OOM debugging.

> > if you have a reason you're
> > against the patches please say so, but please give your reasoning.
> 
> I have expressed that already, I believe, but let me repeat. I do not
> like altering the oom report without a justification on how this new
> output is useful. You have failed to explained that so far.

Uh huh.

Sounds like someone has some scripts he doesn't want to have to update.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Printbufs & shrinker OOM reporting
  2022-04-21 23:48 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
@ 2022-04-30  4:00 ` Dave Young
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Young @ 2022-04-30  4:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	linux-mm, linux-fsdevel, roman.gushchin, hannes, kexec

Hi Kent,
On Fri, 22 Apr 2022 at 07:56, Kent Overstreet <kent.overstreet@gmail.com> wrote:
>
> Debugging OOMs has been one of my sources of frustration, so this patch series
> is an attempt to do something about it.
>
> The first patch in the series is something I've been slowly evolving in bcachefs
> for years: simple heap allocated strings meant for appending to and building up
> structured log/error messages. They make it easy and straightforward to write
> pretty-printers for everything, which in turn makes good logging and error
> messages something that just happens naturally.
>
> We want it here because that means the reporting I'm adding to shrinkers can be
> used by both OOM reporting, and for the sysfs (or is it debugfs now) interface
> that Roman is adding.
>

I added the kexec list in cc.  It seems like a nice enhancement to oom
reporting.
I suspect kdump tooling need changes to retrieve the kmsg log from
vmcore, could you confirm it?  For example makedumpfile, crash, and
kexec-tools (its vmcore-dmesg tool).


> This patch series also:
>  - adds OOM reporting on shrinkers, reporting on top 10 shrinkers (in sorted
>    order!)
>  - changes slab reporting to be always-on, also reporting top 10 slabs in sorted
>    order
>  - starts centralizing OOM reporting in mm/show_mem.c
>
> The last patch in the series is only a demonstration of how to implement the
> shrinker .to_text() method, since bcachefs isn't upstream yet.
>
> Kent Overstreet (4):
>   lib/printbuf: New data structure for heap-allocated strings
>   mm: Add a .to_text() method for shrinkers
>   mm: Centralize & improve oom reporting in show_mem.c
>   bcachefs: shrinker.to_text() methods
>
>  fs/bcachefs/btree_cache.c     |  18 ++-
>  fs/bcachefs/btree_key_cache.c |  18 ++-
>  include/linux/printbuf.h      | 140 ++++++++++++++++++
>  include/linux/shrinker.h      |   5 +
>  lib/Makefile                  |   4 +-
>  lib/printbuf.c                | 271 ++++++++++++++++++++++++++++++++++
>  mm/Makefile                   |   2 +-
>  mm/oom_kill.c                 |  23 ---
>  {lib => mm}/show_mem.c        |  14 ++
>  mm/slab.h                     |   6 +-
>  mm/slab_common.c              |  53 ++++++-
>  mm/vmscan.c                   |  75 ++++++++++
>  12 files changed, 587 insertions(+), 42 deletions(-)
>  create mode 100644 include/linux/printbuf.h
>  create mode 100644 lib/printbuf.c
>  rename {lib => mm}/show_mem.c (78%)
>
> --
> 2.35.2
>

Thanks
Dave


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 0/4] Printbufs & shrinker OOM reporting
@ 2022-04-21 23:48 Kent Overstreet
  2022-04-30  4:00 ` Dave Young
  0 siblings, 1 reply; 21+ messages in thread
From: Kent Overstreet @ 2022-04-21 23:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Kent Overstreet, roman.gushchin, hannes

Debugging OOMs has been one of my sources of frustration, so this patch series
is an attempt to do something about it.

The first patch in the series is something I've been slowly evolving in bcachefs
for years: simple heap allocated strings meant for appending to and building up
structured log/error messages. They make it easy and straightforward to write
pretty-printers for everything, which in turn makes good logging and error
messages something that just happens naturally.

We want it here because that means the reporting I'm adding to shrinkers can be
used by both OOM reporting, and for the sysfs (or is it debugfs now) interface
that Roman is adding.

This patch series also:
 - adds OOM reporting on shrinkers, reporting on top 10 shrinkers (in sorted
   order!)
 - changes slab reporting to be always-on, also reporting top 10 slabs in sorted
   order
 - starts centralizing OOM reporting in mm/show_mem.c
 
The last patch in the series is only a demonstration of how to implement the
shrinker .to_text() method, since bcachefs isn't upstream yet.

Kent Overstreet (4):
  lib/printbuf: New data structure for heap-allocated strings
  mm: Add a .to_text() method for shrinkers
  mm: Centralize & improve oom reporting in show_mem.c
  bcachefs: shrinker.to_text() methods

 fs/bcachefs/btree_cache.c     |  18 ++-
 fs/bcachefs/btree_key_cache.c |  18 ++-
 include/linux/printbuf.h      | 140 ++++++++++++++++++
 include/linux/shrinker.h      |   5 +
 lib/Makefile                  |   4 +-
 lib/printbuf.c                | 271 ++++++++++++++++++++++++++++++++++
 mm/Makefile                   |   2 +-
 mm/oom_kill.c                 |  23 ---
 {lib => mm}/show_mem.c        |  14 ++
 mm/slab.h                     |   6 +-
 mm/slab_common.c              |  53 ++++++-
 mm/vmscan.c                   |  75 ++++++++++
 12 files changed, 587 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/printbuf.h
 create mode 100644 lib/printbuf.c
 rename {lib => mm}/show_mem.c (78%)

-- 
2.35.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-04-30  4:01 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-19 20:31 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
2022-04-19 20:31 ` [PATCH 1/4] lib/printbuf: New data structure for heap-allocated strings Kent Overstreet
2022-04-19 21:11   ` Matthew Wilcox
2022-04-20  0:12     ` Kent Overstreet
2022-04-20  5:02   ` Christoph Hellwig
2022-04-20  5:18     ` Kent Overstreet
2022-04-19 20:32 ` [PATCH 2/4] mm: Add a .to_text() method for shrinkers Kent Overstreet
2022-04-19 20:32 ` [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c Kent Overstreet
2022-04-20  6:58   ` Michal Hocko
2022-04-20 16:58     ` Kent Overstreet
2022-04-21  9:18       ` Michal Hocko
2022-04-21 18:42         ` Kent Overstreet
2022-04-22  8:03           ` Michal Hocko
2022-04-22  8:30             ` Kent Overstreet
2022-04-22  9:27               ` Michal Hocko
2022-04-22  9:44                 ` Kent Overstreet
2022-04-22 10:51                   ` Michal Hocko
2022-04-22 10:58                     ` Kent Overstreet
2022-04-19 20:32 ` [PATCH 4/4] bcachefs: shrinker.to_text() methods Kent Overstreet
2022-04-21 23:48 [PATCH 0/4] Printbufs & shrinker OOM reporting Kent Overstreet
2022-04-30  4:00 ` Dave Young

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).